Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claims 1-22 are presented.
The drawings of record are accepted.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claim 21 is rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter. The claim does not fall within at least one of the four categories of patent eligible subject matter because it is directed to “One or more computer-readable storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations for training a policy neural network.”

The claim language does not explicitly limit the media to non-transitory embodiments.
While the original Specification gives examples of such media (page 20, lines 17 through 21), it remains open-ended on whether transitory forms such as signals and carrier waves are excluded.
As such, under the broadest reasonable interpretation (BRI), the claim encompasses non-statutory embodiments and is therefore rejected under this Section.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1-10, 18, and 20-22 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Jaderberg et al. (“Human-level performance in first person multiplayer games with population-based deep reinforcement learning”) – ARXIV.org – 07-2018 XP081356275 (IDS entry).


As to claims 1, 21 and 22:
Jaderberg discloses a method, a system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers, and one or more computer-readable storage media storing instructions that when executed by one or more computers cause the one or more computers (See Abstract and the last paragraph of page 1; the operations are realized by a computing system) to perform the method comprising operations of training a policy neural network having a plurality of policy parameters and used to select actions to be performed by an agent to control the agent to perform a particular task while interacting with one or more other agents in an environment. (See Abstract: AI agents are trained to perform game playing (the task) in randomly generated environments. Each agent selectively performs actions based on knowledge and reward signals to compete against the others. See page 3: each agent and its teammates have their own respective policies, composed of π1 itself and its teammates’ policies π2 to πN, for a total of N players in the game, each policy π having parameters that govern the actions of the respective agent: “the agent’s policy π uses the same interface available to human players. It receives raw red-green-blue (RGB) pixel input xt from the agent’s first-person perspective at time step t, produces control actions”)

the method comprising: 
maintaining data specifying a pool of candidate action selection policies, (See page 3: a population of different agents (teammates) is selected for a game, each having its own policy π. See page 4: a population of different agents and their respective policies π is specified.)
the pool of candidate action selection policies comprising: 
(i) a plurality of learner policies for controlling the agent, each learner policy defined by a respective set of values for the policy parameters of the policy neural network, (See page 3: “the agent’s policy π uses the same interface available to human players. It receives raw red-green-blue (RGB) pixel input xt from the agent’s first-person perspective at time step t, produces control actions at ~ π (·|x1, ..., xt)”. See Fig. 4: the learner policies include learning the basics of the game, such as a policy to pick up a flag, a policy to tag an opponent, etc.)
and (ii) one or more fixed policies for controlling the agent; (See Fig. 1: at least one fixed policy is, for example, that each agent in the game sees only its first-person view.)
maintaining, for each of the learner policies, data specifying a respective matchmaking policy for the learner policy that defines a distribution over the pool of candidate action selection policies; (See page 2, left column: agents are indexed for a training game using a matching scheme that biases players of similar skill level together. Namely, for each agent, a matchmaking policy is established to group similar-skill agents/players together, and this matchmaking scheme for training applies to each of the learner policies.)

at each of a plurality of training iterations (See Fig. 4: a training period with a plurality of games, i.e. each game is a training iteration):
for each of one or more of the learner policies: selecting one or more policies from the pool of candidate action selection policies using the matchmaking policy for the learner policy; generating training data for the learner policy by causing a first agent controlled using the learner policy to perform the particular task while interacting with one or more second agents, (See page 4 in its entirety: each agent is governed by a policy, is matched with agents of similar skill, and learns from the experience/knowledge generated by playing with or against other agents sampled from the population, i.e. learning data is generated and learned by the agent after each match. For example, a given agent π in the team is trained to navigate its own base and the enemy’s base, and to pick up the flag.)

each second agent controlled by a respective one of the selected policies (See page 1, right column: each agent and its teammates have their own respective policies, composed of π1 itself and its teammates’ policies π2 to πN); and updating the respective set of policy parameters that define the learner policy by training the learner policy on the training data through reinforcement learning to optimize a reinforcement learning loss function for the learner policy. (See Fig. 4, progression of the agent’s behavior and skill: “Early in training, the agent puts large reward weight on picking up the opponent’s flag, whereas later, this weight is reduced, and reward for tagging an opponent and penalty when opponents capture a flag are increased by a factor of two. “Behavior probability” indicates the frequencies of occurrence for 3 of the 32 automatically discovered behavior clusters through training. Opponent base camping (red) is discovered early on, whereas teammate following (blue) becomes very prominent midway through training before mostly disappearing. The “home base defense” behavior (green) resurges in occurrence toward the end of training, which is in line with the agent’s increased internal penalty for more opponent flag captures. “Memory usage” comprises heat maps of visitation frequencies for (left) locations in a particular map and (right) locations of the agent at which the top-10 most frequently read memories were written to memory, normalized by random reads from memory, indicating which locations the agent learned to recall”; namely, the action policy parameters (weights) for each type of action are updated for a given agent with a corresponding action policy π for larger rewards and/or more efficient game play.)
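
For orientation only, the sequence of steps recited in claim 1 (maintain a pool, select from it with a per-learner matchmaking distribution, generate training data, update the learner) can be sketched as below. This is a minimal illustration under stated assumptions and is taken neither from Jaderberg nor from the application: the names Policy, select_opponents, play_episode, rl_update and train are hypothetical, the environment interaction is a random stand-in, and the parameter update is a placeholder rather than an actual reinforcement learning loss.

```python
import random
from dataclasses import dataclass


@dataclass
class Policy:
    # Hypothetical container: a policy is identified by a name and a parameter vector.
    name: str
    params: list
    is_learner: bool


def select_opponents(pool, matchmaking, k, rng):
    """Sample k policies from the pool according to a matchmaking distribution."""
    weights = [matchmaking.get(p.name, 0.0) for p in pool]
    return rng.choices(pool, weights=weights, k=k)


def play_episode(learner, opponents, rng):
    """Stand-in for environment interaction (assumption): returns
    (observation, action, reward) tuples drawn at random."""
    return [(rng.random(), rng.randrange(4), rng.random()) for _ in range(8)]


def rl_update(learner, trajectory, lr=0.01):
    """Placeholder update (assumption): nudges parameters by the mean reward.
    A real system would take a gradient step on a reinforcement learning loss."""
    mean_reward = sum(r for _, _, r in trajectory) / len(trajectory)
    learner.params = [w + lr * mean_reward for w in learner.params]


def train(pool, matchmaking_by_learner, iterations, seed=0):
    rng = random.Random(seed)
    for _ in range(iterations):
        for learner in [p for p in pool if p.is_learner]:
            opponents = select_opponents(
                pool, matchmaking_by_learner[learner.name], k=1, rng=rng
            )
            trajectory = play_episode(learner, opponents, rng)
            rl_update(learner, trajectory)  # only learner policies are updated
    return pool


pool = [
    Policy("learner_1", [0.0, 0.0], True),
    Policy("learner_2", [0.0, 0.0], True),
    Policy("fixed_1", [0.0, 0.0], False),
]
uniform = {p.name: 1.0 / len(pool) for p in pool}
trained = train(pool, {"learner_1": uniform, "learner_2": uniform}, iterations=5)
```

In this sketch the fixed policy can be sampled as a second agent but its parameters are never modified, which is the only distinction drawn here between learner and fixed policies.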

As to claim 2:
Jaderberg discloses all limitations of claim 1, wherein the matchmaking policies for two or more of the learner policies are different. (See page 4, last paragraph: agents are indexed for a training game using a matching scheme that biases players of similar skill level together. Therefore, the matchmaking policy for a low-skill agent with action policy π will look for other low-skill π’s, whereas the matchmaking policy for a higher-skill agent will instead look to match with other high-skill agents.)

As to claim 3:
Jaderberg discloses all limitations of claim 2, wherein the learner policies are each assigned a respective type from a plurality of types, wherein each type is associated with a different matchmaking policy from each other type, and wherein each learner policy has the matchmaking policy that is associated with the type to which the learner policy is assigned. (See Fig. 4: learner policies are categorized in terms of strength, i.e. skill (beating weak bots, average humans, strong humans, etc.). Recall from page 4 that the matchmaking policy for a low-skill agent with action policy π will look for other low-skill π’s, whereas the matchmaking policy for a higher-skill agent will instead look to match with other high-skill agents.)

As to claim 4:
Jaderberg discloses all limitations of claim 1, wherein the matchmaking policy for at least one learner policy is uniform across one or more learner policies that are assigned a particular type and zero for all of the learner policies that are assigned different types and all of the fixed policies. (See page 4, last paragraph: agents are indexed for a training game using a matching scheme that biases players of similar skill level together. Therefore, the matchmaking policy for a low-skill agent with action policy π will look for other low-skill π’s, whereas the matchmaking policy for a higher-skill agent will instead look to match with other high-skill agents. This pattern is uniform for similarly skilled agents. The matchmaking policy does not match agents assigned different skill levels, thus “zero”.)

As to claim 5:
Jaderberg discloses all limitations of claim 1, wherein the matchmaking policy for at least one learner policy is uniform across all of the learner policies and zero for all of the fixed policies. (See page 4, last paragraph: agents are indexed for a training game using a matching scheme that biases players of similar skill level together. This matchmaking neither affects nor is affected by a fixed policy such as the first-person view constraint.)

 
As to claim 6:
Jaderberg discloses all limitations of claim 1, wherein the matchmaking policy for at least one learner policy is uniform across all policies in the pool. (See page 4, last paragraph: agents are indexed for a training game using a matching scheme that biases players of similar skill level together. Therefore, the matchmaking policy is uniform across all policies; namely, the matchmaking policy for a low-skill agent with action policy π will look for other low-skill π’s, whereas the matchmaking policy for a higher-skill agent will instead look to match with other high-skill agents.)
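
As a reading aid, the three distributions recited in claims 4 through 6 differ only in which pool entries receive non-zero probability. The sketch below illustrates that difference under stated assumptions: the pool contents, the type labels "A" and "B", and the helper uniform_over are hypothetical and are not drawn from Jaderberg or the application.

```python
def uniform_over(pool, predicate):
    """Return a distribution over the pool that is uniform across entries
    satisfying the predicate and zero for every other entry."""
    chosen = [name for name, info in pool.items() if predicate(info)]
    if not chosen:
        raise ValueError("no pool entry satisfies the predicate")
    p = 1.0 / len(chosen)
    return {name: (p if name in chosen else 0.0) for name in pool}


# Hypothetical pool: three learner policies of two types and one fixed policy.
pool = {
    "learner_a1": {"kind": "learner", "type": "A"},
    "learner_a2": {"kind": "learner", "type": "A"},
    "learner_b1": {"kind": "learner", "type": "B"},
    "fixed_1": {"kind": "fixed", "type": None},
}

# Claim 4 pattern: uniform across learners of one type, zero elsewhere.
same_type = uniform_over(
    pool, lambda info: info["kind"] == "learner" and info["type"] == "A"
)
# Claim 5 pattern: uniform across all learners, zero for fixed policies.
all_learners = uniform_over(pool, lambda info: info["kind"] == "learner")
# Claim 6 pattern: uniform across every policy in the pool.
whole_pool = uniform_over(pool, lambda info: True)
```

Each result is a mapping from policy name to selection probability, so it can be passed directly as the weights of a sampling step.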

As to claim 7:
Jaderberg discloses all limitations of claim 1, wherein the reinforcement learning loss function depends on a plurality of hyperparameters, and wherein values for the plurality of hyperparameters are different for two or more of the learner policies. (See section 2.2, hyperparameters for each policy, and Fig. 4: each agent has a plurality of internal reward weights for each of the policies of tagging, flag captures, etc., and the internal reward weights differ from policy to policy because they are constantly adjusted.)

As to claim 8:
Jaderberg discloses all limitations of claim 7, wherein the hyperparameters include one or more hyperparameters of a reinforcement learning algorithm used in the training. (See section 2.2 and Fig. 4: each agent has a plurality of internal reward weights of an algorithm that reinforces a learned behavior that maximizes rewards, i.e. by increasing the weight for said behavior.)

As to claim 9:
Jaderberg discloses all limitations of claim 7, wherein the hyperparameters include one or more internal reward hyperparameters that define whether the reinforcement learning loss function depends on an internal reward and, if so, how the internal reward is computed based on observations received by the agent during performance of the task. (See section 2.2, pages 19-20, optimizing hyperparameters through the training process. See Fig. 4: “ “Relative internal reward magnitude” indicates the relative magnitude of the agent’s internal reward weights of 3 of the 13 events corresponding to game points ρ. Early in training, the agent puts large reward weight on picking up the opponent’s flag, whereas later, this weight is reduced, and reward for tagging an opponent and penalty when opponents capture a flag are increased by a factor of two. “Behavior probability” indicates the frequencies of occurrence for 3 of the 32 automatically discovered behavior clusters through training. Opponent base camping (red) is discovered early on, whereas teammate following (blue) becomes very prominent midway through training before mostly disappearing. The “home base defense” behavior (green) resurges in occurrence toward the end of training, which is in line with the agent’s increased internal penalty for more opponent flag captures”.)

As to claim 10:
Jaderberg discloses all limitations of claim 1, wherein the one or more fixed policies include a first fixed policy that is defined by values of the policy parameters that have been determined through supervised learning on labeled task instances. (See page 29, section 5.7: “this dataset of activations hi and corresponding labels yi we fit a Decision Tree of depth 1 using Gini impurity criterion. The decision tree learner selects the most discriminative dimension of h and hence the neuron most selective for y.” See also section 5.8.)

As to claim 18:
Jaderberg discloses all limitations of claim 1, further comprising, for at least one of the selected policies: updating the respective set of policy parameters that define the selected policy by training the selected policy on the training data through reinforcement learning to optimize a reinforcement learning loss function for the selected policy. (See Fig. 2: “The network parameters are updated using reinforcement learning based on the agent’s own internal reward signal rt , which is obtained from a learnt transformation w of game points ρt . w is optimised for winning probability through population-based training, another level of training performed at yet a slower time scale than RL.” See page 20, section 2.3: “After every 100 agent steps, the trajectory of experience from each player’s point of view (observations, actions, rewards) is sent to the learner responsible for the policy carried out by that player. The learner corresponding to an agent composes batches of the 32 trajectories most recently received from arenas, and computes a weight update to the agent’s neural network parameters”)

As to claim 20:
Jaderberg discloses all limitations of claim 1, wherein the matchmaking policy for at least one learner policy specifies that the learner policies controlling respective agents to have attained higher levels of performance on the particular task are more likely to be selected. (See page 5, first paragraph: an under-performing agent is replaced with a better agent.)
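
To make the limitation of claim 20 concrete, one way a matchmaking distribution could favor better-performing learner policies is to weight each policy by a softmax of a performance score. The softmax form, the temperature parameter, the function name performance_weighted and the example scores are all assumptions chosen for illustration; neither the claim nor Jaderberg is stated here to use this particular weighting.

```python
import math


def performance_weighted(scores, temperature=1.0):
    """Map performance scores to selection probabilities with a softmax,
    so that a higher score yields a higher probability of selection."""
    top = max(scores.values())
    # Subtract the maximum before exponentiating for numerical stability.
    exps = {name: math.exp((s - top) / temperature) for name, s in scores.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


# Hypothetical win rates for three learner policies.
dist = performance_weighted({"learner_1": 0.2, "learner_2": 0.5, "learner_3": 0.9})
```

A lower temperature concentrates the distribution on the strongest policy, while a higher temperature moves it toward uniform selection.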

Allowable Subject Matter
Claims 11 through 17 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
The reference(s) of record disclose all limitations of claims 1 and 20 through 22 in detail as set forth above, but do not disclose: “the supervised learning comprises a first supervised learning using first training data and a second supervised learning using only a selected portion of the first training data that includes only labeled task instances performed by agents that have attained at least a threshold level of performance on the particular task” or “at a particular training iteration of the plurality of training iterations: determining that criteria for converting a particular one of the plurality of learner policies into a fixed policy have been satisfied; and in response, generating a new fixed policy that is represented by the same parameter values as the particular learner policy”.

Conclusion


Reference(s) considered pertinent to the invention include:
Gendron-Bellemare et al. (WO 2018/224695) - Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training a policy neural network. The policy neural network is used to select actions to be performed by an agent that interacts with an environment by receiving an observation characterizing a state of the environment and performing an action from a set of actions in response to the received observation. A trajectory is obtained from a replay memory, and a final update to current values of the policy network parameters is determined for each training observation in the trajectory. The final updates to the current values of the policy network parameters are determined from selected action updates and leave-one-out updates.
Nazari et al. (US 2019/0102676) – a system may employ an offline training process and an online training process. In the offline training process, an initial policy is learned to provide a warm start to the online training process. In the online training process, the system applies concurrent reinforcement learning across multiple environments, with the goal of learning efficient policies in real time from in-flight user data in one environment, and applying the learned policies to other environments. With the combination of offline training and online training, the system is able to improve initial performance through the warm start, while adapting to a changing context through concurrent reinforcement learning.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to QUAN M HUA whose telephone number is (571)270-7232. The examiner can normally be reached 10:30-6:30.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Anthony Addy, can be reached at 571-272-7795. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/QUAN M HUA/Primary Examiner, Art Unit 2645