DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Election/Restrictions 
Applicant’s arguments and clarification in the “Remarks - 06/29/2022 - Applicant Arguments/Remarks Made in an Amendment” with respect to the restriction are persuasive. Specifically, the amendments to claim 1 remove the basis for the restriction requirement. Accordingly, the restriction requirement set forth in the “Requirement for Restriction/Election - 03/03/2022” is hereby withdrawn.
In view of the above noted withdrawal of the restriction requirement, applicant is advised that if any claim presented in a continuation or divisional application is anticipated by, or includes all the limitations of, a claim that is allowable in the present application, such claim may be subject to provisional statutory and/or nonstatutory double patenting rejections over the claims of the instant application.
Once a restriction requirement is withdrawn, the provisions of 35 U.S.C. 121 are no longer applicable. See In re Ziegler, 443 F.2d 1211, 1215, 170 USPQ 129, 131-32 (CCPA 1971). See also MPEP § 804.01.
In view of the above, this office action considers claims 1-20 pending for prosecution.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Notes: When present, semicolon-separated fields within parentheses ( ; ; ) are read as in the following example: (112; Fig 1A; [0076]) = (element 112; Figure No. 1A; Paragraph No. [0076]). For brevity, the labels “Element”, “Figure No.” and “Paragraph No.” are omitted, though additional clarification notes may be added within each field. The number of fields may be fewer or more than the three indicated above. These conventions are used throughout this document.

Claims 1-20 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Gendron-Bellemare, Marc et al. (US 20210110271 A1; using the priority date of U.S. provisional application US 62517826, 2017-06-09; hereinafter Gendron).
1. Gendron teaches a method ([0032-0034+]) performed by a system (100; Fig 1A; [0069]) of one or more computers and for training a controller (labelled as policy) neural network (112; Fig 1A; [0076]) having a plurality of controller parameters to generate output sequences (return data, estimates or observation 132; [0060, 0083]) by determining trained values (126) of the controller parameters (policy network parameters) from initial values of the controller parameters ([0034, 0083]), the method comprising (see the entire document, Figs 1A-5; specifically [0033], claims, and as cited below):

Gendron Figure 1A
maintaining data (in the replay memory 124; Fig 1A; [0057]) identifying a set of K output sequences (return estimates [0060], construed from [0079]: the system 100 can store the values of the policy network parameters as of a given point during the training for later use in instantiating a policy network 112; and [0081]: at each training iteration, the training engine 116 obtains a trajectory 122 stored in a replay memory 124) that were previously generated by the controller neural network during the training and, for each output sequence in the set, a respective reward that measures a quality of the output sequence, wherein K is an integer greater than one;
selecting at least one of the output sequences ([0076]: the system 100 selects the action 102 to be performed by the agent 104 at the time step based on the score distribution 114) from the set of output sequences;
for each selected output sequence, determining a respective score ([0077-0078]) assigned to the selected output sequence by the controller neural network in accordance with current values of the controller parameters;
determining, for each selected sequence, a respective first update ([0084]: to determine the final update 126 corresponding to a training observation 132, the policy network 112 processes the training observation 132 in accordance with the current values of the policy network parameters to generate a score distribution 114 including a respective score value for each action in the set of possible actions) to the current values of the controller parameters that increases the score assigned to the selected output sequence by the controller neural network;
generating a batch of new output sequences using the controller neural network in accordance with the current values of the controller parameters ([0088]: updates 126 from multiple trajectories, wherein [0096] trajectory includes a sequence of multiple training observations);
obtaining a respective reward ([0075, 0096, 0105]) for each of the new output sequences;
determining, from the new output sequences and the output sequences in the maintained data, the K output sequences that have the highest rewards ([0105]: more accurate estimate of the rewards); and
modifying the maintained data ([0079]: the training engine 116 trains the policy network 112 continuously (i.e., so that the policy network parameters are constantly being updated as the agent 104 interacts with the environment 10)) to identify the determined K output sequences and the respective reward for each of the K output sequences.
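The last two steps recited in claim 1 (determining the K highest-reward output sequences from the new and maintained sequences, then modifying the maintained data) can be illustrated with a minimal sketch. The function name update_top_k and the (sequence, reward) pair representation are assumptions made for illustration only; they do not appear in the application or in Gendron.

```python
import heapq

def update_top_k(maintained, new_batch, k):
    """Return the k (sequence, reward) pairs with the highest rewards.

    maintained: list of (sequence, reward) pairs kept from earlier iterations.
    new_batch:  list of (sequence, reward) pairs generated in this iteration.
    Both representations are hypothetical, chosen for illustration.
    """
    candidates = list(maintained) + list(new_batch)
    # nlargest returns the k highest-reward pairs, ordered best first.
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])

maintained = [(("a", "b"), 0.2), (("c",), 0.9)]
new_batch = [(("d",), 0.5), (("e",), 0.1)]
maintained = update_top_k(maintained, new_batch, k=2)
```

In this sketch the maintained data is simply overwritten with the result, so its size never exceeds K.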
2. Gendron as applied to the method of claim 1, further teaches, wherein determining, for each selected sequence, a respective first update to the current values of the controller parameters that increases the score assigned to the selected output sequence by the controller neural network comprises:
determining a gradient ([0009]) of a priority queue objective function (β; [0014, 0109]) that depends on a logarithm of the score assigned to the selected sequence (Eqn 4; [0110]) by the neural network.
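The limitation of claim 2, a gradient of an objective that depends on the logarithm of the score assigned to a selected sequence, can be illustrated with a minimal sketch. It assumes a toy, context-free softmax policy over token indices; this simplification, and the names log_score, log_score_gradient and ascent_step, are hypothetical and are taken neither from the application nor from Gendron's Eqn 4.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def log_score(logits, sequence):
    """Logarithm of the sequence score: sum of per-token log-probabilities."""
    probs = softmax(logits)
    return sum(math.log(probs[token]) for token in sequence)

def log_score_gradient(logits, sequence):
    """Gradient of log_score with respect to the logits.

    For a softmax policy each token contributes one_hot(token) - probs.
    """
    probs = softmax(logits)
    grad = [-len(sequence) * p for p in probs]
    for token in sequence:
        grad[token] += 1.0
    return grad

def ascent_step(logits, sequence, learning_rate=0.1):
    """One gradient-ascent update that increases the score of the sequence."""
    grad = log_score_gradient(logits, sequence)
    return [w + learning_rate * g for w, g in zip(logits, grad)]

logits = [0.0, 0.0, 0.0]
selected = [0, 0, 1]  # token indices of a selected output sequence
updated = ascent_step(logits, selected)
```

After the step, log_score(updated, selected) is larger than log_score(logits, selected), which is the sense in which the update increases the score assigned to the selected sequence.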
3. Gendron as applied to the method of claim 1, further teaches, wherein selecting at least one of the output sequences from the set of output sequences comprises: selecting all of the output sequences in the set (construed from [0007+] performing an action from a set of actions in response to the received observation).

4. Gendron as applied to the method of claim 1, further teaches, wherein selecting at least one of the output sequences from the set of output sequences comprises: selecting a random output sequence ([0096] random sampling) from the set.
5. Gendron as applied to the method of claim 1, further teaches, (the method) further comprising:
determining a second ([0034]; [0078]: as a second iteration of multiple training iterations) update to the current values of the controller parameters that increases the rewards ([0078]: increase a cumulative measure of reward (e.g., a time-discounted sum of future rewards) received by the agent 104) received for output sequences generated by the controller neural networks using a reinforcement learning technique ([0058]; [0078]: By increasing a cumulative measure of reward received by the agent 104, the training engine 116 may cause the agent 104 to perform given tasks more effectively).
6. Gendron as applied to the method of claim 5, further teaches, wherein the reinforcement learning technique ([0058, 0078]) is a policy gradient technique ([0034]: a gradient of the current score for the action with respect to the policy network parameters is determined; [0101]: The system can determine the gradient by any appropriate method, such as backpropagation, backpropagation-through-time, or truncated backpropagation-through-time. For a given action, the gradient of the current score for the action with respect to the policy network parameters can be represented in any appropriate numerical format, for example, as a vector).
7. Gendron as applied to the method of claim 1, further teaches, (the method) further comprising:
determining a third ([0034]; [0078]: as a third iteration of multiple training iterations) update to the current values of the controller parameters by determining a gradient of an entropy regularization term that encourages exploration of a space of possible output sequences by the controller neural network.
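For context on the entropy regularization term recited in claim 7, a minimal sketch of the entropy of a score distribution follows. The function name entropy is hypothetical, and the sketch shows only the quantity being regularized, not any particular update rule from the application or from Gendron.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

peaked = entropy([0.97, 0.01, 0.01, 0.01])
uniform = entropy([0.25, 0.25, 0.25, 0.25])
```

Here uniform exceeds peaked, so a term that rewards higher entropy pushes the distribution away from committing to a single output, which is the usual sense in which such a term encourages exploration.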
8. Gendron as applied to the method of claim 1, further teaches, wherein the controller neural network is a recurrent neural network ([0035,0090]) that is configured to, at each of a plurality of time steps:
receive as input a preceding output in the output sequence ([0007, 0076]); and
process the preceding output in accordance with the controller parameters to generate a score ([0032]) distribution over possible outputs at the time step.
9. Gendron as applied to the method of claim 8, further teaches, wherein ([0007, 0032,0076]) determining a respective score assigned to the selected output sequence by the controller neural network in accordance with current values of the controller parameters comprises:
for each of the plurality of time steps ([0006]):
providing the preceding output in the selected output sequence as input to the controller neural network (112) to generate a score distribution (114) over possible outputs (108); and
identifying the score assigned to the output that follows the preceding output in the selected output sequence by the score distribution for the time step; and
combining the identified scores for the plurality of time steps.
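The per-time-step scoring recited in claim 9 can be illustrated with a minimal sketch. It assumes that the identified scores are probabilities combined by multiplication (computed here as a sum of logarithms); that combination rule, the function sequence_log_score, and the toy distribution toy_step_distribution are assumptions made for illustration and are not taken from the application or from Gendron.

```python
import math

def sequence_log_score(step_distribution, sequence, start=None):
    """Combine per-step scores of a sequence into a single log-score.

    step_distribution(preceding) returns a dict mapping each possible
    output to its score at the time step, given the preceding output.
    """
    total = 0.0
    preceding = start
    for output in sequence:
        distribution = step_distribution(preceding)
        # Identify the score assigned to the output that actually follows.
        total += math.log(distribution[output])
        preceding = output
    return total

def toy_step_distribution(preceding):
    """Hypothetical stand-in for the controller neural network."""
    if preceding is None:
        return {"a": 0.5, "b": 0.5}
    return {"a": 0.25, "b": 0.75}

score = math.exp(sequence_log_score(toy_step_distribution, ["a", "b", "b"]))
```

For the toy distribution, the combined score of the sequence a, b, b is 0.5 × 0.75 × 0.75.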
10. Gendron as applied to the method of claim 1, further teaches, wherein the output sequences are sequences of computer program tokens and the reward measures how well a computer program defined by an output sequence performs a computer programming task.
11. Gendron as applied to the method of claim 1, further teaches, wherein ([0058]) the output sequences are sequences of values of neural network architecture hyperparameters, and wherein the reward measures how well a neural network having an architecture defined by the output sequence performs on a neural network task.
12. Gendron as applied to the method of claim 1, further teaches, wherein ([0058]) the output sequences are sequences of values of hyperparameters of a machine learning training process, and wherein the reward measures how well a neural network performs after being trained using the machine learning training process using the hyperparameters defined by the training process.
13. Gendron teaches a system ([0056-0060+]; labelled as 100; Fig 1A; [0069]) comprising one or more computers and one or more storage devices ([0053]) storing instructions that when executed by the one or more computers cause the one or more computers to perform operations ([0054]) for training a controller (labelled as policy) neural network (112; Fig 1A; [0076]) having a plurality of controller parameters to generate output sequences (return data, estimates or observation 132; [0060, 0083]) by determining trained values of the controller parameters (policy network parameters) from initial values of the controller parameters, the operations comprising (see the entire document, Figs 1A-5; specifically [0056], claims, and as cited below):
maintaining data (in the replay memory 124; Fig 1A; [0057]) identifying a set of K output sequences (return estimates [0060], construed from [0079]: the system 100 can store the values of the policy network parameters as of a given point during the training for later use in instantiating a policy network 112; and [0081]: at each training iteration, the training engine 116 obtains a trajectory 122 stored in a replay memory 124) that were previously generated by the controller neural network during the training and, for each output sequence in the set, a respective reward that measures a quality of the output sequence, wherein K is an integer greater than one;
selecting at least one of the output sequences ([0076]: the system 100 selects the action 102 to be performed by the agent 104 at the time step based on the score distribution 114) from the set of output sequences;
for each selected output sequence, determining a respective score ([0077-0078]) assigned to the selected output sequence by the controller neural network in accordance with current values of the controller parameters;
determining, for each selected sequence, a respective first update ([0084]: to determine the final update 126 corresponding to a training observation 132, the policy network 112 processes the training observation 132 in accordance with the current values of the policy network parameters to generate a score distribution 114 including a respective score value for each action in the set of possible actions) to the current values of the controller parameters that increases the score assigned to the selected output sequence by the controller neural network;
generating a batch of new output sequences using the controller neural network in accordance with the current values of the controller parameters ([0088]: updates 126 from multiple trajectories, wherein [0096] trajectory includes a sequence of multiple training observations);
obtaining a respective reward ([0075, 0096, 0105]) for each of the new output sequences;
determining, from the new output sequences and the output sequences in the maintained data, the K output sequences that have the highest rewards ([0105]: more accurate estimate of the rewards); and
modifying the maintained data ([0079]: the training engine 116 trains the policy network 112 continuously (i.e., so that the policy network parameters are constantly being updated as the agent 104 interacts with the environment 10)) to identify the determined K output sequences and the respective reward for each of the K output sequences.
14. Gendron as applied to the system of claim 13, further teaches, wherein determining, for each selected sequence, a respective first update to the current values of the controller parameters that increases the score assigned to the selected output sequence by the controller neural network comprises:
determining a gradient ([0009]) of a priority queue objective function (β; [0014, 0109]) that depends on a logarithm of the score assigned to the selected sequence (Eqn 4; [0110]) by the neural network.
15. Gendron as applied to the system of claim 13, further teaches, wherein selecting at least one of the output sequences from the set of output sequences comprises: selecting all of the output sequences in the set (construed from [0007+] performing an action from a set of actions in response to the received observation).

16. Gendron as applied to the system of claim 13, further teaches, wherein selecting at least one of the output sequences from the set of output sequences comprises: selecting a random output sequence ([0096] random sampling) from the set.
17. Gendron as applied to the system of claim 13, further teaches, the operations further comprising:
determining a second ([0034]; [0078]: as a second iteration of multiple training iterations) update to the current values of the controller parameters that increases the rewards ([0078]: increase a cumulative measure of reward (e.g., a time-discounted sum of future rewards) received by the agent 104) received for output sequences generated by the controller neural networks using a reinforcement learning technique ([0058]; [0078]: By increasing a cumulative measure of reward received by the agent 104, the training engine 116 may cause the agent 104 to perform given tasks more effectively).
18. Gendron teaches one or more non-transitory computer-readable storage media ([0137+]) storing instructions that when executed by one or more computers cause the one or more computers to perform operations ([0054]) for training a controller (labelled as policy) neural network (112; Fig 1A; [0076]) having a plurality of controller parameters to generate output sequences (return data, estimates or observation 132; [0060, 0083]) by determining trained values of the controller parameters from initial values of the controller parameters, the operations comprising (see the entire document, Figs 1A-5; specifically [0137], claims, and as cited below):
maintaining data (in the replay memory 124; Fig 1A; [0057]) identifying a set of K output sequences (return estimates [0060], construed from [0079]: the system 100 can store the values of the policy network parameters as of a given point during the training for later use in instantiating a policy network 112; and [0081]: at each training iteration, the training engine 116 obtains a trajectory 122 stored in a replay memory 124) that were previously generated by the controller neural network during the training and, for each output sequence in the set, a respective reward that measures a quality of the output sequence, wherein K is an integer greater than one;
selecting at least one of the output sequences ([0076]: the system 100 selects the action 102 to be performed by the agent 104 at the time step based on the score distribution 114) from the set of output sequences;
for each selected output sequence, determining a respective score ([0077-0078]) assigned to the selected output sequence by the controller neural network in accordance with current values of the controller parameters;
determining, for each selected sequence, a respective first update ([0084]: to determine the final update 126 corresponding to a training observation 132, the policy network 112 processes the training observation 132 in accordance with the current values of the policy network parameters to generate a score distribution 114 including a respective score value for each action in the set of possible actions) to the current values of the controller parameters that increases the score assigned to the selected output sequence by the controller neural network;
generating a batch of new output sequences using the controller neural network in accordance with the current values of the controller parameters ([0088]: updates 126 from multiple trajectories, wherein [0096] trajectory includes a sequence of multiple training observations);
obtaining a respective reward ([0075, 0096, 0105]) for each of the new output sequences;
determining, from the new output sequences and the output sequences in the maintained data, the K output sequences that have the highest rewards ([0105]: more accurate estimate of the rewards); and
modifying the maintained data ([0079]: the training engine 116 trains the policy network 112 continuously (i.e., so that the policy network parameters are constantly being updated as the agent 104 interacts with the environment 10)) to identify the determined K output sequences and the respective reward for each of the K output sequences.
19. Gendron as applied to the computer-readable storage media of claim 18, further teaches, wherein determining, for each selected sequence, a respective first update to the current values of the controller parameters that increases the score assigned to the selected output sequence by the controller neural network comprises:
determining a gradient ([0009]) of a priority queue objective function (β; [0014, 0109]) that depends on a logarithm of the score assigned to the selected sequence (Eqn 4; [0110]) by the neural network.
20. Gendron as applied to the computer-readable storage media of claim 18, further teaches, the operations further comprising:
determining a second ([0034]; [0078]: as a second iteration of multiple training iterations) update to the current values of the controller parameters that increases the rewards ([0078]: increase a cumulative measure of reward (e.g., a time-discounted sum of future rewards) received by the agent 104) received for output sequences generated by the controller neural networks using a reinforcement learning technique ([0058]; [0078]: By increasing a cumulative measure of reward received by the agent 104, the training engine 116 may cause the agent 104 to perform given tasks more effectively).
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MOAZZAM HOSSAIN whose telephone number is (571) 270-7960. The examiner can normally be reached M-F, 8:30 AM - 6:00 PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.  
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, William Kraig, can be reached at 571-272-8660. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/MOAZZAM HOSSAIN/Primary Examiner, Art Unit 2896 
July 23, 2022