Detailed Action

Notice of Pre-AIA or AIA Status
The present application, filed on or after April 25, 2019, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
The information disclosure statement (IDS) submitted on 04/25/2019 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Regarding claim 1, the claim recites the limitation ‘determining, over time, where bias is occurring in the semi-supervised training based on the merging of the bias weights with the non-bias weights in the artificial neural network’ in lines 9-11. The claim is indefinite as it is unclear how the
For purposes of examination, the limitation is being interpreted as: performing a calculation (i.e. Mean Square Error) using both biased training data and de-biased training data.
The limitation of ‘generating a deep reinforcement learning model that decreases reliance on the bias weights based on determined bias to increase fairness’ in lines 12-13 is indefinite, as it is unclear what constitutes generating a model that decreases the reliance on the bias. What constitutes the model? How does the model function to decrease reliance on the bias weights?
For purposes of examination, the limitation is being interpreted as: generating a de-biased training set.
Claims 11 and 13 have similar limitations to method claim 1 above. Therefore, they are rejected under the same rationale as claim 1 above.
Claims 2-10 depend on claim 1 and inherit the same deficiency. Therefore, they are rejected by the same reasoning as claim 1.
Claims 12 and 14-20 depend on claim 11 and claim 13, respectively, and inherit the same deficiency. Therefore, they are rejected by the same reasoning as claims 11 and 13.
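For illustration only, the interpretation applied to the first limitation of claim 1 (a Mean Square Error calculation over biased and de-biased training data) can be sketched as follows. This is a minimal sketch under stated assumptions: the function name, the variable names, and the sample values are hypothetical and are not taken from the application or from the cited references.

```python
def mean_squared_error(original, debiased):
    """Average squared difference between paired values (hypothetical helper)."""
    if len(original) != len(debiased):
        raise ValueError("both datasets must have the same number of values")
    if not original:
        raise ValueError("datasets must not be empty")
    return sum((x - y) ** 2 for x, y in zip(original, debiased)) / len(original)


# Assumed toy values standing in for the biased and de-biased training data.
training_data = [1.0, 2.0, 3.0, 4.0]
debiased_data = [1.0, 2.5, 2.5, 4.0]

mse = mean_squared_error(training_data, debiased_data)
print(mse)
```

A smaller value indicates that the de-biased data stays closer to the original training data.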

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.

Regarding claim 1,
2A Prong 1: The limitation of observing a microstate of an environment and reaction of items in a plurality of microstates within the environment after an agent performs an action in the environment is a process that, under its broadest reasonable interpretation, covers observing changes in an environment caused by an action of an agent. It is a mental process, because the limitation encompasses a user observing a change in the environment caused by a human or a robot, which can be performed in the human mind. The limitation of determining bias weights corresponding to the action and the reaction of the items in the plurality of microstates is a mental process, as it encompasses a user calculating bias weights manually, based on the reactions of the items, using pen and paper or a mathematical function. The limitation of merging the bias weights with non-bias weights using an artificial neural network is a mental process, because it encompasses a user using a dataset produced by an algorithm and another dataset to find out where the bias (i.e. error) occurs. The limitation of determining where bias is occurring based on the merging of the bias weights with the non-bias weights is a mental process, because determining where the error (i.e. bias) is can be done using pen and paper or a mathematical function. The limitation of generating a model that decreases reliance on the bias weights based on determined bias is a mathematical concept, as the model itself is a group of mathematical functions and generating it is equivalent to building a mathematical relationship between input and output.
2A Prong 2: This judicial exception is not integrated into a practical application. The claim does not recite any additional element.
2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above, the claim does not recite any additional element. The limitation of utilizing semi-supervised training and a deep reinforcement learning model merely says which particular technological field or environment the abstract idea is performed in (MPEP 2106.05(h)). The claim is not patent eligible.
Regarding claim 11, the limitation of a computer system for providing fair deep reinforcement learning comprising a bus system, a storage device, and a processor recites generic computer components. Claim 11 is a system claim having similar limitations to claim 1 above. Therefore, it is an abstract idea under the same rationale as claim 1 above.
Regarding claim 13, the limitation of a computer program product for providing fair deep reinforcement learning comprising a computer readable storage medium recites a generic computer component that stores generic computer instructions. Claim 13 is a program product claim having similar limitations to claim 1 above. Therefore, it is an abstract idea under the same rationale as claim 1 above.

Regarding claim 2,
2A Prong 1: The limitation of training the agent to perform the action in the environment of the set of two or more environments based on the learning model is a mental process, as it encompasses a user being trained to perform an action in a specific environment. The limitation of mapping the action to be performed by the agent in the environment to a reward using a Q-table is a mental process, because the limitation encompasses a user mapping the result of a specific action to a table of reward scores using pen and paper.
2A Prong 2: This judicial exception is not integrated into a practical application. The claim does not recite any additional element. The limitation of receiving a learning model corresponding to a set of two or more environments, is a form of insignificant extra-solution activity.
2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above, the claim does not recite any additional element. The claim recites the limitation of receiving a learning model corresponding to a set of two or more environments, which was considered to be insignificant extra-solution activity in Step 2A Prong 2, and thus it is re-evaluated in Step 2B to determine if it is more than what is well-understood routine and conventional activity in the field. The limitation is mere data gathering (MPEP 2106.05(g)). The limitation of utilizing semi-supervised training and deep-reinforcement model merely says which particular technological field or environment the abstract idea is performed in (MPEP 2106.05(h)). The claim is not patent eligible. 
Claim 5 has similar limitations to claim 2 above. Therefore, it is an abstract idea under the same rationale as claim 2 above.
Claim 12 is a system claim having similar limitations to claim 2 above. Therefore, it is an abstract idea under the same rationale as claim 2 above.
Claims 14 and 17 are program product claims having similar limitations to claim 2 above. Therefore, they are abstract ideas under the same rationale as claim 2 above.
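For illustration only, the Q-table recited in claim 2 (a table mapping an action in a given state to a reward estimate) can be sketched as follows. This is a minimal sketch under stated assumptions: it uses the standard tabular Q-learning update, and the state names, action names, learning rate, and discount factor are hypothetical and are not taken from the application.

```python
from collections import defaultdict


def update_q(q_table, state, action, reward, next_state, actions,
             alpha=0.5, gamma=0.9):
    """Standard tabular Q-learning update for one (state, action) entry."""
    best_next = max(q_table[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    q_table[(state, action)] += alpha * (target - q_table[(state, action)])
    return q_table[(state, action)]


def best_action(q_table, state, actions):
    """Look up the action with the highest table entry for this state."""
    return max(actions, key=lambda a: q_table[(state, a)])


# Assumed toy setup: two actions, every entry starts at 0.0.
actions = ["left", "right"]
q = defaultdict(float)
update_q(q, "s0", "right", reward=1.0, next_state="s1", actions=actions)
print(q[("s0", "right")], best_action(q, "s0", actions))
```

Each table entry is a single number per state and action pair, which is why the table can be written out and updated by hand.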

Regarding claim 3,
2A Prong 1: The limitation of analyzing the multimedia data of the agent performing the action to accomplish the task in the environment is a mental process, because analyzing the data can be done in the human mind or with pen and paper. The limitation of determining change in state of the environment based on analysis of the agent performing the action to accomplish the task in the environment is a mental process, as ‘determining change in state’ can be done in the human mind.
2A Prong 2: This judicial exception is not integrated into a practical application. The claim does not recite any additional element. The limitation of capturing multimedia data of the agent performing the action to accomplish a task in the environment based on the reward, is an insignificant extra-solution activity.
2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above, the claim does not recite any additional element. The limitation of using a first set of sensors and using the artificial neural network merely says which particular technological field or environment the abstract idea is performed in (MPEP 2106.05(h)). The limitation of capturing multimedia data of the agent performing the action to accomplish a task in the environment based on the reward was considered to be insignificant extra-solution activity in Step 2A Prong 2, and thus it is re-evaluated in Step 2B to determine if it is more than what is well-understood, routine, and conventional activity in the field. The limitation is mere data gathering (MPEP 2106.05(g)). The claim is not patent eligible.
Claim 6 has similar limitations to claim 3 above. Therefore, it is an abstract idea under the same rationale as claim 3 above.
Claims 15 and 18 are program product claims having similar limitations to claim 3 above. Therefore, they are abstract ideas under the same rationale as claim 3 above.

Regarding claim 4, 
2A Prong 1: The limitation of identifying equal opportunity and disparate impact on protected attributes by the agent to weight degree of bias based on a determined change in state of the environment, is a mental process, as ‘identifying based on change in state of the environment’ can be done in human mind. The limitation of recalculating a reward corresponding to the action by the agent, is a mental process, as calculating reward can be done using pen and paper. The limitation of updating a Q-table with the recalculated reward corresponding to the action, is a mental process, because it encompasses the user updating the Q-table with updated values.
2A Prong 2: This judicial exception is not integrated into a practical application. The claim does not recite any additional element.
2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above, the claim does not recite any additional element. The claim is not patent eligible. 
Claim 16 is a program product claim having similar limitations to claim 4 above. Therefore, it is an abstract idea under the same rationale as claim 4.

Regarding claim 7,
2A Prong 1: The limitation of identifying equal opportunity and disparate impact on protected attributes by the swarm of agents to weight degree of bias based on determined change in state of the one or more other environments is a mental process, as identifying equal opportunity and impact based on a change in the environment can be done in the human mind. The limitation of post processing a weighted degree of bias to decrease bias by merging biased nodes and non-biased nodes and limiting bias weights is a mathematical concept, as processing a weighted degree of bias (i.e. numbers or matrices added to or multiplied with the result data or input data) is merely a mathematical calculation.
2A Prong 2: This judicial exception is not integrated into a practical application. The claim does not recite any additional element.
2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above, the claim does not recite any additional element. The claim is not patent eligible. 
Claim 19 is a program product claim having similar limitations to claim 7 above. Therefore, it is an abstract idea under the same rationale as claim 7.

Regarding claim 8,
2A Prong 1: The limitation of relabeling data of the semi-supervised learning model is a mental process, because the limitation encompasses a user relabeling data based on a specific score or criterion using pen and paper. The limitation of retraining the agent to modify performance of the action using the relabeled training data is also an abstract idea, because the limitation encompasses a user being trained to change an action based on the data.
2A Prong 2: This judicial exception is not integrated into a practical application. The claim does not recite any additional element.
2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above, the claim does not recite any additional element. The claim is not patent eligible. 
Claim 20 is a program product claim having similar limitations to claim 8 above. Therefore, it is an abstract idea under the same rationale as claim 8.

Regarding claim 9, 
the limitation of wherein the agent is selected from a group consisting of a robot, a chatbot, an artificial intelligence entity, and a human, merely says which particular technological field or environment the idea is performed in (MPEP 2106.05(h)), i.e. specifying which agent performs the action doesn’t add a practical application nor an inventive concept to the “fair deep learning system”. The claim is not patent eligible.

Regarding claim 10,
the limitation that the artificial neural network is a convolutional neural network merely says which particular technological field or environment the idea is performed in (MPEP 2106.05(h)), i.e. specifying

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1, 5-7, 9-11, 13, and 17-19 are rejected under 35 U.S.C. 103 as being unpatentable over Hasselt (Hasselt et al, 2015, “Deep Reinforcement Learning with Double Q-learning”) in view of Li (US 20200226489 A1).
	
	Regarding claim 1, Hasselt teaches 
A method for providing fair deep reinforcement learning, the method comprising: 
observing a microstate of an environment and reaction of items in a plurality of microstates within the environment after an agent performs an action in the environment ([Hasselt, page 1, 1st paragraph of Background; page 5, Figure 3] “To solve sequential decision problems we can learn estimates for the optimal value of each action, defined as the expected sum of future rewards when taking that action and following the optimal policy thereafter. Under a given policy π, the true value of an action a in a state s is … Estimates for the optimal action values can be learned using Q-learning, a form of temporal difference learning”); 
[Hasselt, page 6, Robustness to Human starts, 2nd paragraph] “We obtained 100 starting points sampled for each game from a human expert’s trajectory, as proposed by Nair et al (2015). We start an evaluation episode from each of these starting points and run the emulator for up to 108,000 frames (30 mins at 60Hz including the trajectory before the starting point).”, corresponds to labeled data, and [Hasselt, page 2, Double Q-learning, 2nd paragraph] “In the original Double Q-learning algorithm, two value functions are learned by assigning each experience randomly to update one of the two value functions, such that there are two sets of weights, Θ and Θ’. For each update, one set of weights is used to determine the greedy policy and the other to determine its value” corresponds to learning using unlabeled data. [Hasselt, Figure 1] “The orange bars show the bias in a single Q-learning update when the action values are $Q(s, a) = V_*(s) + \epsilon_a$ and the errors $\{\epsilon_a\}_{a=1}^{m}$ are independent standard normal random variables. The second set of action values $Q'$, used for the blue bars, was generated identically and independently”);
Hasselt does not specifically teach merging the bias weights from the semi-supervised training with non-bias weights using an artificial neural network; determining, over time, where bias is occurring in the semi-supervised training; and generating a deep reinforcement learning model that decreases reliance on the bias weights based on determined bias to increase fairness.
Li teaches merging the bias weights from the semi-supervised training with non-bias weights using an artificial neural network ([Li, 0037] “The training inputs of the training data 120 may contain various user attributes including at least one bias attribute and non-bias attributes. As discussed above, the impact of the bias attributes might have been propagated to the training data 120 including the non-bias attributes and the training outputs”, discloses non-bias and bias data go into neural network as training inputs, [Li, 0033] “In order for the access-facilitation server 104 to determine the access flag for a user using the access predictive model 106, the access predictive model 106 need to be trained using training data 120”, discloses that the training input goes into the access predictive model which is a neural network);
determining, over time, where bias is occurring in the semi-supervised training based on the merging of the bias weights with the non-bias weights in the artificial neural network ([Li, 0047] “The group discriminative model 410, on the other hand, aims to achieve group fairness by reducing the group bias of the de-biased training data 122. One way to achieve group fairness is to obfuscate the bias attribute S from the de-biased training data 122 (X′,Y′) thereby removing the dependency or association between S and (X′,Y′). The group discriminative model 410, denoted as D2, can thus be configured to distinguish between samples from groups where S has different values, i.e. P [G(z)|S=1] and P [G(z)|S=0] for a binary S, and the generator G(.) can be configured to generate samples from each group with probability as similar as possible”, discloses finding bias by finding where S has different values, and [Li, 0051] “The de-biasing model 114 also aims to reduce the MSE between the training data 120 and the generated de-biased training data 122 to control data distortion. It should be understood that the overall loss function L(G, D1, D2) of the de-biasing model 114 can be defined in various other ways to achieve different goals”, also discloses calculating Mean Square Error between non-bias training data and generated de-biased training data that includes biases); 
 and generating a deep reinforcement learning model that decreases reliance on the bias weights based on determined bias to increase fairness ([Li, 0045; Generative Model 404 of the Figure 4] “The generative model 404 can use the extracted latent features z as an input and generate de-biased training data 122, i.e. G(z)=(X′,Y′), where G is the generative model 404, (X′,Y′) is the de-biased training data 122, X′ is the transformed training inputs and Y′ is the transformed training outputs. Note that the transformed training inputs X′ correspond to the non-bias attributes X and thus does not include the bias attribute S”, the generative model uses de-biased training data, [Li, 0052] “The iterative adjustments can include adjusting the parameters of the de-biasing model 114, including the generative model 404, the statistical discriminative model 408, and the group discriminative model 410, so that a value of the overall loss function in a current iteration is smaller than the value of the overall loss function in another iteration”, disclosing the adjustment of the parameter of the generative model, which corresponds to the generation of a learning model, [Li, 0049] “The individual bias model 406 can be configured to reduce or remove pointwise any large deviations between the training data 120 and the de-biased training data 122. This pointwise constraint helps to maintain individual fairness, because for every individual user, the de-biased training data 122 are maintained to be as close as possible to the training data 120”, discloses the model is generated to increase fairness).
Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teachings of both Hasselt and Li, to apply Li's process of determining where bias is occurring in the learning model and generating a learning model with decreased bias to the reinforcement learning model of Hasselt. The suggestion and/or motivation for doing so is that determining where the bias occurs (for example, at the dataset level or at the model level) can help fix the bias of the model and increase the correctness of the reinforcement learning model.
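For illustration only, the Double Q-learning passage of Hasselt quoted above (one set of weights determines the greedy policy and the other determines its value) can be sketched as follows. This is a minimal sketch under stated assumptions: it uses lookup tables in place of the neural networks described in Hasselt, and the function names, state names, and numeric values are hypothetical.

```python
def single_q_target(reward, next_state, actions, q, gamma=0.5):
    """Single-estimator target: the same values select and evaluate the action."""
    return reward + gamma * max(q[(next_state, a)] for a in actions)


def double_q_target(reward, next_state, actions, q_select, q_eval, gamma=0.5):
    """Double-estimator target: q_select picks the greedy action, q_eval scores it."""
    greedy = max(actions, key=lambda a: q_select[(next_state, a)])
    return reward + gamma * q_eval[(next_state, greedy)]


# Assumed toy values: two independent, noisy estimates of the same action values.
actions = [0, 1]
q_a = {("s1", 0): 1.0, ("s1", 1): 2.0}
q_b = {("s1", 0): 1.5, ("s1", 1): 0.5}

single = single_q_target(0.0, "s1", actions, q_a)
double = double_q_target(0.0, "s1", actions, q_a, q_b)
print(single, double)
```

In this toy case the single-estimator target takes the maximum of one noisy estimate, while the double-estimator target scores the selected action with the second estimate and comes out lower.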

	Regarding claim 11, Hasselt in view of Li teaches a computer system for providing fair deep reinforcement learning, the computer system comprising: a bus system; a storage device connected to the bus system, wherein the storage device stores program instructions; and a processor connected to the bus system, wherein the processor executes the program instructions ([Hasselt, 4th page, Empirical results, 3rd paragraph] “On each game, the network is trained on a single GPU for 200M frames, or approximately 1 week”, every type of GPU has memory (i.e. storage device) and buses embedded in it). Claim 11 is a computer system claim having similar limitations to method claim 1 above. Therefore, it is rejected under the same rationale as claim 1 above.
	Regarding claim 13, Hasselt in view of Li teaches a computer program product for providing fair deep reinforcement learning, the computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a computer to cause the computer to perform a method comprising ([Hasselt, 4th page, Empirical results, 3rd paragraph] “On each game, the network is trained on a single GPU for 200M frames, or approximately 1 week”, every type of GPU has memory (i.e. storage medium) and buses embedded in it, which can store program instructions). Claim 13 is a computer program product claim having similar limitations to method claim 1 above. Therefore, it is rejected under the same rationale as claim 1 above.

	Regarding claim 5, Hasselt in view of Li teaches the method of claim 1 further comprising: receiving a semi-supervised learning model corresponding to a set of environments ([Hasselt, page 4, Empirical results, 2nd paragraph] “Our testbed consists of Atari 2600 games, using the Arcade Learning Environment (Bellemare et al., 2013). The goal is for a single algorithm, with a fixed set of hyperparameters, to learn to play each of the games separately from interaction given only the screen pixels as input. This is a demanding testbed: not only are the inputs high-dimensional, the game visuals and game mechanics vary substantially between games”, the Atari 2600 games is the environment, and the algorithm with a fixed set of hyperparameters were received); and 
training a swarm of agents to perform the action in one or more other environments of the set of environments based on the semi-supervised learning model ([Hasselt, Figure 3; page 6, Robustness to Human starts, 1st paragraph] “The straight horizontal orange (for DQN) and blue (for Double DQN) lines in the top row are computed by running the corresponding agents after learning concluded”, and “By testing the agents from various starting points, we can test whether the found solutions generalize well, and as such provide a challenging testbed for the learned polices (Nair et al., 2015)”, discloses training a group of agents (i.e. an artificial entity, a human, a robot, a chatbot, etc., anything); in this case, the agents are neural networks (DQN and double DQN). [Hasselt, page 4, right column, Results on overoptimism; page 5, Figure 3] “Figure 3 shows examples of DQN’s overestimations in six Atari games. DQN and Double DQN were both trained under the exact conditions described by Mnih et al. (2015)”, DQN is a Deep Q-Learning Neural Network algorithm, and there is a plurality of environments (e.g. games) in Hasselt).
	Claim 17 is a computer program product claim having similar limitations to method claim 5 above. Therefore, it is rejected under the same rationale as claim 5 above.

	Regarding claim 6, Hasselt in view of Li teaches the method of claim 5 further comprising: capturing multimedia data of the swarm of agents performing the action to accomplish a task in the one or more other environments using a second set of sensors ([Hasselt, Figure 3; page 6, Robustness to Human starts, 1st paragraph] “The straight horizontal orange (for DQN) and blue (for Double DQN) lines in the top row are computed by running the corresponding agents after learning concluded”, and “By testing the agents from various starting points, we can test whether the found solutions generalize well, and as such provide a challenging testbed for the learned polices (Nair et al., 2015)”, discloses training a group of agents (i.e. an artificial entity, a human, a robot, a chatbot, etc., anything that can learn and react); in this case, the agents are neural networks (DQN and double DQN). [Hasselt, page 5, Quality of the learned policies, 2nd paragraph, line 13 - page 6, 1st paragraph, line 4] “For Double DQN we used the exact same hyper-parameters as for DQN, to allow for a controlled experiment focused just on reducing overestimations. The learned policies are evaluated for 5 mins of emulator time (18,000 frames) with an ε-greedy policy where ε = 0.05”, policy is the way agents react to the environment, and the experiment captures 5 mins of emulator (a software tool to run or capture the video game on a personal computer) to evaluate it. In this case, the emulator is the sensor);
analyzing the multimedia data of the swarm of agents performing the action to accomplish the task in the one or more other environments using the artificial neural network ([Hasselt, Figure 3; page 6, Robustness to Human starts, 1st paragraph] “The straight horizontal orange (for DQN) and blue (for Double DQN) lines in the top row are computed by running the corresponding agents after learning concluded”, and “By testing the agents from various starting points, we can test whether the found solutions generalize well, and as such provide a challenging testbed for the learned polices (Nair et al., 2015)”, discloses training a group of agents (i.e. an artificial entity, a human, a robot, a chatbot, etc., anything that can learn and react); in this case, the agents are neural networks (DQN and double DQN). [Hasselt, page 5, Quality of the learned policies, 2nd paragraph, line 13 - page 6, 1st paragraph, line 4] “For Double DQN we used the exact same hyper-parameters as for DQN, to allow for a controlled experiment focused just on reducing overestimations. The learned policies are evaluated for 5 mins of emulator time (18,000 frames) with an ε-greedy policy where ε = 0.05”, policy is the way agents react to the environment, and the experiment captures 5 mins of emulator (a software tool to run or capture the video game on a personal computer) to evaluate it, [Hasselt, page 1, 4th paragraph] “DQN combines Q-learning with a flexible deep neural network and was tested on a varied and large set of deterministic Atari 2600 games, reaching human-level performance on many games”, discloses many environments, [Hasselt, Figure 2] discloses the action-state relationship and analysis); and
determining change in state of the one or more other environments based on analysis of the swarm of agents performing the action to accomplish the task in the one or more other environments ([Hasselt, page 2, Deep Q Networks, 1st paragraph] “A deep Q network (DQN) is a multi-layered neural network that for a given state s outputs a vector of action values Q(s; . ; Θ), where Θ are the parameters of the network”, [Hasselt, Figure 2] discloses the action-state relationship and analysis).


	Regarding claim 7, Hasselt in view of Li teaches the method of claim 5 further comprising: swarm of agents to weight degree of bias based on determined change in state of the one or more other environments ([Hasselt, Figure 3; page 6, Robustness to Human starts, 1st paragraph] “The straight horizontal orange (for DQN) and blue (for Double DQN) lines in the top row are computed by running the corresponding agents after learning concluded”, “By testing the agents from various starting points, we can test whether the found solutions generalize well, and as such provide a challenging testbed for the learned polices (Nair et al., 2015)”, discloses training group of agents (i.e. an artificial entity, a human, a robot, a chatbot, etc.. anything), in this case, the agents are neural networks (DQN and double DQN). [Hasselt, page 5, Figure 3] “The top and middle rows show value estimates by DQN (orange) and Double DQN (blue) on six Atari games. The results are obtained by running DQN and Double DQN with 6 different random seeds with the hyper-parameters employed by Mnih et al. (2015). The darker line shows the median over seeds and we average the two extreme values to obtain the shaded area (i.e., 10% and 90% quantiles with linear interpolation). The straight horizontal orange (for DQN) and blue (for Double DQN) lines in the top row are computed by running the corresponding agents after learning concluded, and averaging the actual discounted return obtained from each visited state. These straight lines would match the learning curves at the right side of the plots if there is no bias”, the figure 3 of Hasselt discloses difference between biased output and non-biased output, which corresponds to the ‘weight degree of bias’); 
	Hasselt does not specifically teach identifying equal opportunity and disparate impact on protected attributes (i.e. fairness), and post processing a weighted degree of bias to decrease bias by merging biased nodes and non-biased nodes of the artificial neural network and limiting bias weights.
[Li, 0059] “Complete group fairness is achieved when P (Y=y) is equal for all values of S, that is, P (Y=y)=P (Y=y|S=s). As such, a group bias score ϕ.sub.G close to 0 means that the training data is less biased in terms of group bias”, the term fairness and process of finding out if the analysis is fair, corresponds to identifying equal opportunity, [Li, 0061] According to Table 1, the de-biased training data 122 generated by the de-biasing model 114 presented herein reduces both group bias and individual bias compared with the training data 120. Because of the manner in which the baseline data is generated, the baseline data can achieve a fairly low group bias score, but cannot reduce the individual bias of the training data 120. [Li, 0003] “For example, bias might have been introduced into the past decisions by including a bias attribute, such as whether a user is a loyalty member of the resource provider”, the protected attributes means age, sex, race … may include anything, such as information about if the user is a ‘loyalty member’); and post processing a weighted degree of bias to decrease bias by merging biased nodes and non-biased nodes of the artificial neural network and limiting bias weights ([Li, 0047] “The group discriminative model 410, on the other hand, aims to achieve group fairness by reducing the group bias of the de-biased training data 122. One way to achieve group fairness is to obfuscate the bias attribute S from the de-biased training data 122 (X′,Y′) thereby removing the dependency or association between S and (X′,Y′). The group discriminative model 410, denoted as D2, can thus be configured to distinguish between samples from groups where S has different values, i.e. P [G(z)|S=1] and P [G(z)|S=0] for a binary S, and the generator G(.) can be configured to generate samples from each group with probability as similar as possible”, discloses finding bias by finding where S has different values. 
[Li, 0051] “The de-biasing model 114 also aims to reduce the MSE between the training data 120 and the generated de-biased training data 122 to control data distortion. It should be understood that the overall loss function L(G, D1, D2) of the de-biasing model 114 can be defined in various other ways to achieve different goals”, also discloses calculating Mean Square Error between non-bias training data and generated de-biased training data that includes biases. The ‘node’ is equivalent to mathematical function (i.e. Mean Square)).
	Claim 19 is a computer program product claim having similar limitations to method claim 7 above. Therefore, it is rejected under the same rationale as claim 7 above.
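For illustration only, the group-fairness condition quoted from Li, P (Y=y)=P (Y=y|S=s), and the Mean Square Error between original and de-biased data can be sketched as below. This is a minimal sketch under stated assumptions: the function names `group_bias_gap` and `mean_squared_error` are hypothetical, and the gap computed here is a simple stand-in for a group bias measure, not the group bias score ϕ.sub.G as defined in Li.

```python
from collections import defaultdict


def group_bias_gap(labels, groups, positive=1):
    """Largest |P(Y=positive | S=s) - P(Y=positive)| over all groups s.

    A value of 0 means the quoted condition P(Y=y) = P(Y=y | S=s) holds
    for the chosen outcome. Hypothetical stand-in, not Li's exact score.
    """
    overall = sum(1 for y in labels if y == positive) / len(labels)
    counts = defaultdict(lambda: [0, 0])  # group value -> [positives, total]
    for y, s in zip(labels, groups):
        counts[s][0] += (y == positive)
        counts[s][1] += 1
    return max(abs(pos / total - overall) for pos, total in counts.values())


def mean_squared_error(original, debiased):
    """Mean of squared differences between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(original, debiased)) / len(original)
```

A gap of 0 indicates that the outcome rate is the same in every group, while a larger gap indicates a stronger dependency between the outcome and the attribute S.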
	
	Regarding claim 9, Hasselt in view of Li teaches the method of claim 1, wherein the agent is selected from a group consisting of a robot, a chatbot, an artificial intelligence entity, and a human ([Hasselt, Abstract] “We propose a specific adaptation to the DQN algorithm and show that the resulting algorithm not only reduces the observed overestimations, as hypothesized, but that this also leads to much better performance on several games”, discloses the agent is Deep Q-Network algorithms, [Hasselt, Appendix, Page 10, Table 3] shows the agent can be DQN, Random, or Human).

	Regarding claim 10, Hasselt in view of Li teaches the method of claim 1, wherein the artificial neural network is a convolutional neural network ([Hasselt, 4th page, Empirical results, 3rd paragraph] “We closely follow the experimental setting and network architecture outlined by Mnih et al. (2015). Briefly, the network architecture is a convolutional neural network (Fukushima, 1988; LeCun et al., 1998) with 3 convolution layers and a fully-connected hidden layer (approximately 1.5M parameters in total)”).

Claims 2-4, 12, and 14-16 are rejected under 35 U.S.C. 103 as being unpatentable over Hasselt (Hasselt et al, 2015, “Deep Reinforcement Learning with Double Q-learning”) in view of Li (US 20200226489 A1) and further in view of Fan (WO 2020172825 A1).

Regarding claim 2, Hasselt in view of Li teaches the method of claim 1 further comprising: receiving a semi-supervised learning model corresponding to a set of two or more environments, wherein the environment is one environment in the set of two or more environments ([Hasselt, Abstract] “We propose a specific adaptation to the DQN algorithm and show that the resulting algorithm not only reduces the observed overestimations, as hypothesized, but that this also leads to much better performance on several games” shows the video game is the environment, and [Hasselt, Appendix, page 10, Table 3] shows the list of environments (e.g. video games)); training the agent to perform the action in the environment of the set of two or more environments based on the semi-supervised learning model ([Hasselt, page 4, Results on overoptimism; page 5, Figure 3] “Figure 3 shows examples of DQN’s overestimations in six Atari games. DQN and Double DQN were both trained under the exact conditions described by Mnih et al. (2015)”, DQN is a Deep Q-Learning Neural Network algorithm, and there are a plurality of environments (e.g. games) in Hasselt); and mapping the action to be performed by the agent in the environment to a reward ([Hasselt, 2nd page, Background, 2nd paragraph] “The standard Q-learning update for the parameters after taking action At in state St and observing the immediate reward Rt+1 and resulting state St+1”). Hasselt in view of Li does not specifically teach mapping the action to reward using a Q-table.
	Fan teaches mapping the action to reward using a Q-table ([Fan, 3rd page, line 38-39] “The Q table in Q-learning is updated by using the communication tailing duration corresponding to the i-th transmission strategy as a reward for reinforcement learning”).
Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teachings of Fan, Hasselt and Li, to apply the Q-table of Fan to the reinforcement learning model of Hasselt and Li. The suggestion and/or motivation for doing so is that determining where the bias occurs, for example, whether it is at the dataset level or the model level, can help fix bias in the model and increase the correctness of the reinforcement learning model.

	Claim 14 is a computer program product claim having similar limitations to method claim 2 above. Therefore, it is rejected under the same rationale as claim 2 above.
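For illustration only, mapping an action to a reward through a Q-table, as discussed for claim 2, can be sketched with the textbook tabular Q-learning update. This is a minimal sketch under stated assumptions: the function name `q_update`, the dictionary-based table, and the values of `alpha` and `gamma` are hypothetical choices for the example and are not taken from Hasselt, Li, or Fan.

```python
from collections import defaultdict


def q_update(q_table, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning update and return the new Q(state, action).

    q_table maps (state, action) pairs to estimated values; entries that
    have not been visited yet default to 0.0.
    """
    # Value of the best action available in the resulting state.
    best_next = max(q_table[(next_state, a)] for a in actions)
    # Target: observed immediate reward plus discounted future value.
    target = reward + gamma * best_next
    # Move the stored estimate a fraction alpha toward the target.
    q_table[(state, action)] += alpha * (target - q_table[(state, action)])
    return q_table[(state, action)]


# Usage: an empty Q-table whose entries start at 0.0.
q_table = defaultdict(float)
q_update(q_table, "s0", "left", reward=1.0, next_state="s1",
         actions=["left", "right"])
```

Each call records the observed reward against the (state, action) entry, which is the sense in which the table maps actions to rewards.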

Regarding claim 3, Hasselt in view of Li, and further in view of Fan teaches the method of claim 2 further comprising: capturing multimedia data of the agent performing the action to accomplish a task in the environment based on the reward using a first set of sensors ([Hasselt, page 1, 4th paragraph; Appendix, page 10, Table 3] “DQN combines Q-learning with a flexible deep neural network and was tested on a varied and large set of deterministic Atari 2600 games, reaching human-level performance on many games” discloses the list of games that the reinforcement learning observes, and games are multimedia data, [Hasselt, page 2, Background, 2nd paragraph] “The standard Q-learning update for the parameters after taking action At in state St and observing the immediate reward Rt+1 and resulting state St+1 is then” discloses the action is performed based on reward. [Hasselt, page 5, Quality of the learned policies, 2nd paragraph, line 13 - page 6, 1st paragraph, line 4] “For Double DQN we used the exact same hyper-parameters as for DQN, to allow for a controlled experiment focused just on reducing overestimations. The learned policies are evaluated for 5 mins of emulator time (18,000 frames) with an ε-greedy policy where ε = 0.05”, policy is the way agents react to the environment, and the experiment captures 5 mins of emulator (a software tool to run or capture the video game on a personal computer) to evaluate it. In this case, the emulator corresponds to the sensor); analyzing the multimedia data of the agent performing the action to accomplish the task in the environment using the artificial neural network ([Hasselt, page 1, 4th paragraph] “DQN combines Q-learning with a flexible deep neural network and was tested on a varied and large set of deterministic Atari 2600 games, reaching human-level performance on many games”, discloses many environments, [Hasselt, Figure 2] discloses the action-state relationship and analysis, and [Hasselt, page 5, Quality of the learned policies, 2nd paragraph, line 13 - page 6, 1st paragraph, line 4] “For Double DQN we used the exact same hyper-parameters as for DQN, to allow for a controlled experiment focused just on reducing overestimations. The learned policies are evaluated for 5 mins of emulator time (18,000 frames) with an ε-greedy policy where ε = 0.05”, DQN (Deep Q-Learning Network) captures 18,000 frames of multimedia data (i.e. Atari game)); and determining change in state of the environment based on analysis of the agent performing the action to accomplish the task in the environment ([Hasselt, page 2, Deep Q Networks, 1st paragraph] “A deep Q network (DQN) is a multi-layered neural network that for a given state s outputs a vector of action values Q(s; . ; Θ), where Θ are the parameters of the network”, [Hasselt, Figure 2] discloses the action-state relationship and analysis).
	Claim 15 is a computer program product claim having similar limitations to method claim 3 above. Therefore, it is rejected under the same rationale as claim 3 above.
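For illustration only, the ε-greedy policy with ε = 0.05 mentioned in the Hasselt passage cited for claim 3 can be sketched as below. This is a minimal sketch under stated assumptions: the function name `epsilon_greedy` and its parameters are hypothetical, and the action values are passed in as a plain list rather than produced by a neural network.

```python
import random


def epsilon_greedy(action_values, epsilon=0.05, rng=random):
    """Return an action index chosen by an epsilon-greedy rule.

    With probability epsilon a uniformly random action is returned;
    otherwise the action with the highest estimated value is returned.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(action_values))
    # Greedy choice: index of the largest action value.
    return max(range(len(action_values)), key=action_values.__getitem__)
```

With ε = 0.05 the agent follows its learned action values in roughly 95% of steps and acts randomly in the remainder.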

Regarding claim 4, Hasselt in view of Li, and further in view of Fan teaches the method of claim 1 further comprising: identifying equal opportunity and disparate impact on protected attributes by the agent to weight degree of bias based on a determined change in state of the environment ([Hasselt, Figure 3; page 6, Robustness to Human starts, 1st paragraph] “The straight horizontal orange (for DQN) and blue (for Double DQN) lines in the top row are computed by running the corresponding agents after learning concluded”, “By testing the agents from various starting points, we can test whether the found solutions generalize well, and as such provide a challenging testbed for the learned polices (Nair et al., 2015)”, discloses training group of agents (i.e. an artificial entity, a human, a robot, a chatbot, etc.. anything), in this case, the agents are neural networks (DQN and double DQN). [Hasselt, page 5, Figure 3] “The top and middle rows show value estimates by DQN (orange) and Double DQN (blue) on six Atari games. The results are obtained by running DQN and Double DQN with 6 different random seeds with the hyper-parameters employed by Mnih et al. (2015). The darker line shows the median over seeds and we average the two extreme values to obtain the shaded area (i.e., 10% and 90% quantiles with linear interpolation). The straight horizontal orange (for DQN) and blue (for Double DQN) lines in the top row are computed by running the corresponding agents after learning concluded, and averaging the actual discounted return obtained from each visited state. These straight lines would match the learning curves at the right side of the plots if there is no bias”, the figure 3 of Hasselt discloses difference between biased output and non-biased output, which corresponds to the ‘weight degree of bias’). 
Hasselt does not specifically teach identifying equal opportunity and disparate impact on protected attributes and recalculating a reward corresponding to the action based on the equal opportunity and disparate impact on the protected attributes by the agent.
Li teaches identifying equal opportunity and disparate impact on protected attributes ([Li, 0059] “Complete group fairness is achieved when P (Y=y) is equal for all values of S, that is, P (Y=y)=P (Y=y|S=s). As such, a group bias score ϕ.sub.G close to 0 means that the training data is less biased in terms of group bias”, the term fairness and process of finding out if the analysis is fair, corresponds to identifying equal opportunity, [Li, 0061] According to Table 1, the de-biased training data 122 generated by the de-biasing model 114 presented herein reduces both group bias and individual bias compared with the training data 120. Because of the manner in which the baseline data is generated, the baseline data can achieve a fairly low group bias score, but cannot reduce the individual bias of the training data 120. [Li, 0003] “For example, bias might have been introduced into the past decisions by including a bias attribute, such as whether a user is a loyalty member of the resource provider”, the protected attributes means age, sex, race … may include anything, such as information about if the user is a ‘loyalty member’); 

Fan teaches recalculating a reward corresponding to the action ([Fan, page 7, line 29-30] “Reinforcement learning is mainly learning through trial and error, that is, to determine the best answer by performing actions a limited number of times to get the maximum reward”, discloses that the reward value is recalculated on each trial, [Fan, page 3, line 48-51] “generating the i-th transmission strategy through the Q table includes: according to the Q table, obtaining the reward value for executing Q actions under the state corresponding to the i-1th data volume threshold, and according to the Q actions The reward value determines the i-th target action and generates the i-th transmission strategy”, and [Fan, page 4, line 8-9] “the i-th transmission strategy is used to transmit the gradient of each layer parameter obtained in the i-th iteration of the first neural network mode”, discloses that the neural network, which encompasses reward calculation, iterates i times (i.e. recalculates the reward i times). The protected attributes are simply age, gender, race, religion, etc.); and updating a Q-table with the recalculated reward corresponding to the action ([Fan, page 15, line 28-31] “The computing node can generate the i-th transmission strategy through the Q-table (Q-Table) used to record the state-action in the Q-learning algorithm; and, according to the communication tail time corresponding to the i-th transmission strategy, compare the Q-table Update, and generate the i+1th transmission strategy through the updated Q table”).
	Claim 16 is a computer program product claim having similar limitations to method claim 4 above. Therefore, it is rejected under the same rationale as claim 4 above.
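For illustration only, the overestimation bias that the Hasselt Figure 3 passage cited for claim 4 refers to can be related to how the learning target is formed. The sketch below contrasts the standard Q-learning target with the Double Q-learning target described in Hasselt, in which one set of values selects the action and a second set evaluates it. This is a minimal sketch under stated assumptions: the function names and the list-based action values are hypothetical, and `gamma` is an arbitrary example value.

```python
def q_learning_target(reward, next_values, gamma=0.99):
    """Standard target: the same values both select and evaluate the action."""
    return reward + gamma * max(next_values)


def double_q_target(reward, next_values_online, next_values_target, gamma=0.99):
    """Double Q-learning target.

    The online values select the action; the target values evaluate it,
    which decouples selection from evaluation.
    """
    best = max(range(len(next_values_online)),
               key=next_values_online.__getitem__)
    return reward + gamma * next_values_target[best]
```

When the two sets of values disagree, the double target is no larger than the maximum over the evaluating values, which is the mechanism by which the overestimation is reduced.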

Claims 8 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Hasselt (Hasselt et al, 2015, “Deep Reinforcement Learning with Double Q-learning”) in view of Li (US 20200226489 A1), and further in view of Lee (US 20200202257 A1).

Lee teaches further comprising: relabeling training data of the semi-supervised learning model based on the post processing of the weighted degree of bias ([Lee, 0080] “Next, the training dataset 401 is provided to the trained machine learning model 420 to relabel the training dataset 401. An updated training dataset 425 is generated based on the trained machine learning model 420”, discloses relabeling the training dataset, [Lee, 0038] “As discussed above, because the classification algorithms require explicit class labeling, classification is a form of supervised learning. The bias-variance tradeoff is a central problem in supervised learning”, discloses the relabeling is based on bias); and retraining the agent to modify performance of the action using the relabeled training data ([Lee, 0085-0086; Figure 4] “However, if it is determined that Δ.sub.1 is not less than or equal to the first threshold value and Δ.sub.2 is not less than or equal to the second threshold value, the machine learning model 410 is trained based on the dataset labeling of the test dataset 402 (e.g., second dataset labeling). If the second difference Δ.sub.2 is determined to be greater than the second threshold value, the machine learning model 410 is updated or trained (or adjusted) based on the dataset labeling of the updated training dataset 425 (e.g., updated first dataset labeling)”, discloses a retraining process based on the updated labeled dataset, [Lee, 0050] “The updated new dataset 206 may have a new standard 208 or updated dataset labeling (e.g., updated second dataset labeling) which is different from the dataset labeling of the new dataset 202 (e.g., second dataset labeling)” discloses the second dataset labeling is the updated (i.e. relabeled) data).
Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teachings of Lee, Hasselt and Li, to apply the relabeling and retraining process of Lee to the reinforcement learning model of Hasselt and Li. The suggestion 
	Claim 20 is a computer program product claim having similar limitations to method claim 8 above. Therefore, it is rejected under the same rationale as claim 8 above.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure. 
Regarding fairness of Artificial Intelligence.
Bellamy et al, 2018, “AI FAIRNESS 360: AN EXTENSIBLE TOOLKIT FOR DETECTING, UNDERSTANDING, AND MITIGATING UNWANTED ALGORITHMIC BIAS”

Any inquiry concerning this communication or earlier communications from the examiner should be directed to JUN KWON whose telephone number is (571)272-2072. The examiner can normally be reached on 7:30 AM - 5:30 PM. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Abdullah Kawsar can be reached on (571)270-3169. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).


Examiner, Art Unit 2127

/ABDULLAH AL KAWSAR/Supervisory Patent Examiner, Art Unit 2127