DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.


Response to Amendment
This Office Action is in response to applicant’s communication filed 23 September 2021, in response to the Office Action mailed 21 July 2021.  The applicant’s remarks and any amendments to the claims or specification have been considered, with the results that follow.


Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
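A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.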

Claims 1, 3, 4, 8, 10, 11, 15, and 17 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Ho et al. (Generative Adversarial Imitation Learning, June 2016, pgs. 1-14 – cited in an IDS).

As per claim 1, Ho teaches a computer-implemented method for active, imitation learning, comprising:
providing training data to a processor, the training data comprising an expert trajectory, wherein the expert trajectory comprises a plurality of state/action pairs [an algorithm for generative adversarial learning, harnessing generative adversarial training to fit distributions of states and actions defining expert behavior (section 1)];
querying the expert trajectory during an interactive, active learning process to determine an optimal action to be taken in response to a given state [an algorithm for generative adversarial learning, harnessing generative adversarial training to fit distributions of states and actions defining expert behavior; where the learner is given only samples of trajectories from the expert (section 1) and learns/optimizes policies in steps (sections 4 and 5, etc.)], wherein the expert trajectory is queried for only a subset of iterations of the iterative, active learning process [the learner is given samples of trajectories from the expert(s) (sections 1 and 2, etc.)];
generating a decision policy based at least in part on the expert trajectory and a result of querying the expert trajectory during the iterative, active learning process [a decision policy is generated based on the samples of expert trajectory using a generative model G (sections 2, 5, etc.)];
attempting to distinguish the decision policy from the expert trajectory [one or more discriminator network(s) D attempt to distinguish the policy from expert policy (section 2, etc.)];
in response to distinguishing the decision policy from the expert trajectory, outputting a policy update and generating a new decision policy based at least in part on the policy update [using the discriminator outputs as a learning signal the generator attempts to find a policy minimizing divergence from the expert’s, updating the policy by steps (sections 4 and 5, etc.)]; and
in response to not distinguishing the decision policy from the expert trajectory, outputting the decision policy [using the discriminator outputs as a learning signal the generator attempts to find a policy minimizing divergence from the expert’s, updating the policy by steps (sections 4 and 5, etc.)].
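
For illustration only, the control flow of the last three limitations of claim 1 (attempt to distinguish; if distinguished, output a policy update and generate a new decision policy; if not distinguished, output the decision policy) may be sketched as follows. This is a toy sketch under stated assumptions, not the algorithm of Ho and not applicant's implementation: the function names, the dictionary-based policy, and the mismatch-based stand-in for a discriminator are all hypothetical.

```python
def distinguish(policy, expert_trajectory):
    """Toy stand-in for a discriminator (an assumption, not a neural network):
    return the states at which the policy's action differs from the expert's."""
    return [state for state, action in expert_trajectory
            if policy.get(state) != action]


def imitation_loop(expert_trajectory, initial_policy, max_iters=100):
    """Repeat until the policy can no longer be told apart from the expert."""
    expert = dict(expert_trajectory)
    policy = dict(initial_policy)
    for _ in range(max_iters):
        mismatches = distinguish(policy, expert_trajectory)
        if not mismatches:
            # Not distinguished: output the decision policy.
            return policy
        # Distinguished: output a policy update, then build a new policy from it.
        state = mismatches[0]
        policy_update = {state: expert[state]}
        policy = {**policy, **policy_update}
    return policy


# Hypothetical expert trajectory of state/action pairs.
expert_trajectory = [(0, "left"), (1, "right"), (2, "left")]
learned = imitation_loop(expert_trajectory, {0: "right", 1: "right", 2: "right"})
```

In this toy setting the loop ends once the stand-in discriminator finds no state at which the two differ, which corresponds to the "not distinguishing" branch of the claim.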

As per claim 3, Ho teaches wherein attempting to distinguish between the expert trajectory and the decision policy is performed using a binary classifier [the discriminator(s) D perform binary classification (section 5, etc.)].

As per claim 4, Ho teaches wherein generating the decision policy includes a stochastic process, and wherein generating the decision policy and attempting to distinguish between the expert trajectory and the decision policy are each independently performed using different deep learning neural networks [the generator includes stochastic policies (section 2, etc.) and the generator(s) and discriminator(s) are separate neural networks (section 6, etc.)].

As per claim 8, see the rejection of claim 1, above, wherein Ho also teaches a computer program product for active imitation learning, the computer program product [the system is evaluated using simulation with MuJoCo (section 6, etc.), which inherently requires storing program instructions executable by a processor].

As per claim 10, see the rejection of claim 3, above.

As per claim 11, see the rejection of claim 4, above.

As per claim 15, see the rejection of claim 1, above, wherein Ho also teaches a system comprising a processor and logic integrated with the processor, executable by the processor, or integrated with and executable by the processor, the logic configured to perform the method [the system is evaluated using simulation with MuJoCo (section 6, etc.), which inherently requires storing program instructions executable by a processor, in order to run the simulation].

As per claim 17, see the rejection of claim 4, above.


Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

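The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.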


The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 2, 9, and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Ho et al. (Generative Adversarial Imitation Learning, June 2016, pgs. 1-14 – cited in an IDS) in view of Kalashnikov (US Provisional Application No. 62/685,838 – P.G. Pub. US 2021/0237266).

As per claim 2, Ho teaches the computer-implemented method of claim 1, as described above.
While Ho teaches sampling from the expert trajectory (see above, and e.g., Ho: section 4) it does not explicitly teach wherein the expert trajectory is queried every q iterations, wherein q is a predefined query interval having an integer value greater than 1.
Kalashnikov teaches wherein the expert trajectory is queried every q iterations, wherein q is a predefined query interval having an integer value greater than 1 [the system may sample the trajectory generated and stored in an offline or online buffer, where the sampling rate of each is dynamic and increases or decreases as the duration of the training of the neural network model increases (and thus q will be greater than 1 at least once) (paras. 0022-23, 0062, etc.) including the trajectory of state action pairs (paras. 0049-50, etc.)].
Ho and Kalashnikov are analogous art, as they are within the same field of endeavor, namely training/updating a machine learning model.
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to utilize a dynamic sampling rate for sampling an (off) policy path (and/or the on policy path) in use for training the model, as taught by Kalashnikov, for the sampling of the expert (off policy) path in the training of the model in the system taught by Ho.
Because Ho and Kalashnikov both teach sampling an off policy path, applying the dynamic sampling rate of Kalashnikov to the sampling of the expert (off policy) path in the system taught by Ho would have achieved the predictable result of avoiding overfitting to the possibly initially scarce on policy data and accommodating a possibly lower rate of production of the on policy data. Kalashnikov provides further motivation, as [using the off policy path and the dynamic sampling rates allows the system to effectively ingest and train on large and diverse datasets (para. 0060, etc.)].
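
For illustration only, the limitation of claim 2 at issue (the expert trajectory is queried every q iterations, where q is a predefined integer greater than 1) may be sketched as follows. This is a minimal sketch under stated assumptions: the function name is hypothetical, and the choice to count iterations from zero and to query at iteration 0 is an assumption not taken from the claims or the cited art.

```python
def query_iterations(num_iterations, q):
    """Return the iteration indices at which the expert trajectory is queried,
    assuming a fixed, predefined query interval q."""
    if not isinstance(q, int) or q <= 1:
        raise ValueError("q must be an integer greater than 1")
    # Assumption: iterations are numbered from 0 and iteration 0 is a query.
    return [i for i in range(num_iterations) if i % q == 0]


# With 10 iterations and q = 3, only some of the iterations query the expert.
schedule = query_iterations(10, 3)
```

Under these assumptions, any q greater than 1 yields a schedule that covers fewer iterations than the full training run.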

As per claim 9, see the rejection of claim 2, above.

As per claim 16, see the rejection of claim 2, above.


Claims 5, 6, 12, 13, 18, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Ho et al. (Generative Adversarial Imitation Learning, June 2016, pgs. 1-14 – cited in an IDS) in view of Judah et al. (Active Imitation Learning via Reduction to I.I.D. Active Learning, Oct 2012, pgs. 19-27 – cited in an IDS).

As per claim 5, Ho teaches the computer implemented method of claim 1, as described above.
Ho does not explicitly teach wherein the result of querying the expert trajectory is a most uncertain state/action pair from the expert trajectory, and wherein the active learning process comprises determining the most uncertain state/action pair from the expert trajectory using one or more disagreement functions.
Judah teaches wherein the result of querying the expert trajectory is a most uncertain state/action pair from the expert trajectory, and wherein the active learning process comprises determining the most uncertain state/action pair from the expert trajectory using one or more disagreement functions [the system queries an expert about desired action in individual states (abstract, etc.) including selecting the most uncertain pair (sections 1 and 2, etc.) where the selected query is the state that maximizes the product of state density and committee disagreement (section 5)].
Ho and Judah are analogous art, as they are within the same field of endeavor, namely imitation learning.
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to query the expert for the most uncertain state/action from the trajectory, as taught by Judah, for the state/action samples obtained from the expert in the system taught by Ho.
Judah provides motivation as [the most uncertain pair is the most informative (sections 1 and 2, etc.)].

As per claim 6, Ho/Judah teaches wherein the one or more disagreement functions are selected from the group consisting of: a Density Weighted Query By Committee disagreement function, a Vote Entropy disagreement function, an Average Coefficient of Variation disagreement function, and combinations thereof [the selected query is the state that maximizes the product of state density and committee disagreement (Judah: section 5, etc.)].
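
For illustration only, the selection rule described above (the query is the state that maximizes the product of state density and committee disagreement) may be sketched as follows, using vote entropy as the disagreement measure. This is a minimal sketch under stated assumptions: the function names, the example committee votes, and the density values are hypothetical and are not taken from Judah or from the application, and other normalizations of vote entropy exist.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Entropy (natural log) of the committee's action votes for one state.
    It is 0 when all members agree and grows with disagreement."""
    total = len(votes)
    counts = Counter(votes)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def most_uncertain_state(committee_votes, density):
    """Pick the state maximizing density(state) * disagreement(state)."""
    return max(committee_votes,
               key=lambda s: density[s] * vote_entropy(committee_votes[s]))


# Hypothetical committee of three policies voting on an action per state.
committee_votes = {
    "s0": ["left", "left", "left"],    # full agreement
    "s1": ["left", "right", "left"],   # partial disagreement
    "s2": ["left", "right", "up"],     # full disagreement, but a rare state
}
# Hypothetical state densities (how often each state is expected to occur).
density = {"s0": 0.5, "s1": 0.4, "s2": 0.1}

selected = most_uncertain_state(committee_votes, density)
```

With these example numbers the state "s1" is selected: "s2" has the highest disagreement, but its low density reduces its weighted score below that of "s1".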

As per claim 12, see the rejection of claim 5, above.

As per claim 13, see the rejection of claim 6, above.

As per claim 18, see the rejection of claim 5, above.

As per claim 19, see the rejection of claim 6, above.


Claims 7, 14, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Ho et al. (Generative Adversarial Imitation Learning, June 2016, pgs. 1-14 – cited in an IDS) in view of Piot et al. (Bridging the Gap Between Imitation Learning and Inverse Reinforcement Learning, May 2016, pgs. 1814-1826).

As per claim 7, Ho teaches wherein the expert trajectory and the decision policy each comprise a plurality of state/action pairs corresponding to a decision space, wherein at least one of the state/action pairs comprises a continuous state [the policy exists in a state-action space (sections 4 and 5, etc.) in a high-dimensional continuous environment (section 2, etc.)].
Ho does not explicitly teach wherein at least one of the state/action pairs comprises a non-deterministic action.
Piot teaches wherein at least one of the state/action pairs comprises a non-deterministic action [the system uses set-policies which are generalizations of the policy for non-deterministic policies (including a number of state/action pairs) (section III, etc.)].
Ho and Piot are analogous art, as they are within the same field of endeavor, namely imitation learning.
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to include non-deterministic policies, as taught by Piot, in the decision policies of the system taught by Ho.
Piot provides motivation as [experiments demonstrate that the set-policy-based algorithms outperform both the standard IRL and IL ones and result in more robust solutions (abstract, etc.)].

As per claim 14, see the rejection of claim 7, above.

As per claim 20, see the rejection of claim 7, above.


Response to Arguments
Applicant’s arguments, see the remarks, filed 23 September 2021, with respect to the rejection of claim 15 under 35 U.S.C. 112 have been fully considered and are persuasive.  The rejection of claims 15-20 under 35 U.S.C. 112, second paragraph, has been withdrawn. 

Applicant's arguments filed 23 September 2021 regarding the rejections under 35 U.S.C. 102 have been fully considered but they are not persuasive.

Applicant argues that the cited art does not teach “querying the expert trajectory during an iterative, active learning process to determine an optimal action to be taken in response to a given state, wherein the expert trajectory is queried for only a subset of iterations of the interactive, active learning process”.
However, Ho teaches an algorithm for generative adversarial learning, harnessing generative adversarial training to fit distributions of states and actions defining expert behavior, where the learner is given only samples of trajectories from the expert(s) (sections 1 and 2, etc.) and learns/optimizes policies in steps (sections 4 and 5, etc.).
Applicant argues that this does not teach that only a subset of iterations include querying expert trajectories, and rather queries the expert trajectories at every iteration.  
However, the examiner notes that the entire “set” of iterations is a “subset” of itself (as every set is a subset of itself), and therefore, even if Ho does sample the 

Applicant also argues that the cited art does not teach “generating a decision policy based at least in part on the expert trajectory and a result of querying the expert trajectory during the iterative, active learning process”.
However, Ho teaches that a decision policy is generated based on the samples of expert trajectory using a generative model G (sections 2, 5, etc.), where the decision policy is based on the sampled (queried) expert trajectory during the iterative learning.

Applicant further argues that the cited art does not teach “in response to distinguishing the decision policy from the expert trajectory, outputting a policy update and generating a new decision policy based at least in part on the policy update”.
However, Ho teaches that, using the discriminator outputs as a learning signal, the generator attempts to find a policy minimizing divergence from the expert’s, updating the policy by steps (sections 4 and 5, etc.), where updating a policy generates a new decision policy.


Conclusion
The following is a summary of the treatment and status of all claims in the application as recommended by M.P.E.P. 707.07(i): claims 1-20 are rejected.

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Abbeel (Apprenticeship Learning via Inverse Reinforcement Learning, 2004, pgs. 1-8) and Li et al. (Inferring The Latent Structure of Human Decision-Making from Raw Visual Inputs, Mar 2017, pgs. 1-11) – disclose systems for estimating a reward function based on expert demonstrations.
Nachum (US 2019/0332922 – Provisional Application No. 62/463562) – discloses sampling an off-policy path including an expert path and previous path from the model being trained, including randomly sampling from the multiple off-policy paths.

The examiner requests, in response to this Office action, that support be shown for language added to any original claims on amendment and any new claims. That is, indicate support for newly added claim language by specifically pointing to page(s) and line number(s) in the specification and/or drawing figure(s). This will assist the examiner in prosecuting the application.

When responding to this office action, Applicant is advised to clearly point out the patentable novelty which he or she thinks the claims present, in view of the state of the art disclosed by the references cited or the objections made. He or she must also show how the amendments avoid such references or objections.  See 37 CFR 1.111(c).

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to GEORGE GIROUX whose telephone number is (571)272-9769. The examiner can normally be reached M-F 10am-6pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Omar Fernandez Rivas can be reached on 571-272-2589. The fax phone 
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/GEORGE GIROUX/
Primary Examiner, Art Unit 2128