DETAILED ACTION

Notice of AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 04/19/2021 has been entered.
 

Applicant(s) Response to Official Action 
The response filed on 04/19/2021 has been entered and made of record.

Response to Arguments/Amendments
Presented arguments have been fully considered, but are rendered moot in view of the new ground(s) of rejection necessitated by amendment(s) initiated by the applicant(s).

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 5-8, 12-15, 19, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Osband et al., hereinafter referred to as Osband (US 2017/0032245 A1 – already of record), in view of Li et al., hereinafter referred to as Li (US 10,402,733 B1), and in further view of Oruklu et al., hereinafter referred to as Oruklu (US 2014/0358077 A1).
	
As per claim 1, Osband discloses a system for implementing artificial intelligence agents to perform machine learning tasks using predictive analytics to leverage ensemble policies for maximizing long-term returns (Osband: [0027].), comprising: 
an artificial intelligence agent (Osband: [0028]: Agent.); 
a memory device (210, 215) for storing program code (Osband: [0051]: Memory device (210, 215).); and 
at least one processor device (205) operatively coupled to the memory device (210, 215) and configured to execute program code stored on the memory device to: 
obtain a set of inputs including a[n] (Osband: Para. [0100] discloses using an ensemble policy with regard to a bootstrapped DQN. Further, Fig. 6 is an example of a bootstrapped DQN. Para. [0051] discloses a processor executing instructions by obtaining, as input from memory, the DQN of Fig. 6. Furthermore, Paras. [0063]-[0065], in alignment with Fig. 6, disclose utilizing policy π [i.e., an ensemble policy] and an approximator 610 [i.e., meta-policy parameter].); 
select an action for execution within the system environment using a meta-policy function determined based in part on the (Osband: Fig. 6; [0064], [0065], [0100]: A state-value function selects 630 an action for execution within the system environment based on an ensemble policy and approximator 610.); 
cause the artificial intelligence agent to execute the selected action within the system environment (Osband: Fig. 6; [0030], [0064], [0065]: Causes the agent to execute the selected action within the system environment.); and 
update (650) the meta-policy parameter (610) based on the execution of the selected action (Osband: Fig. 6; [0011], [0064], [0065]: Since the approximator 610 is responsible for selecting and determining actions, the approximator 610 is updated 650 based on the execution of the selected action, as the observed data and results 635 are updated.).
However, Osband does not explicitly disclose “… a set of ensemble policies …”.
Further, Li is in the same field of endeavor and teaches a set of ensemble policies (Li: Col. 6, ll. 51-57 disclose a set of ensemble policies.).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, and having the teachings of Osband and Li before him or her, to modify the reinforcement learning model of Osband to include the set of ensemble policies feature as described in Li. The motivation for doing so would have been to improve prediction accuracy by providing weighted models (Li: Col. 1, ll. 21-64).
However, Osband-Li do not explicitly disclose “… update … an initial baseline …”.
Further, Oruklu is in the same field of endeavor and teaches updating an initial baseline (Oruklu: Para. [0061] discloses updating the baseline at each cycle.).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, and having the teachings of Osband-Li and Oruklu before him or her, to modify the machine learning model of Osband-Li to include the initial baseline update feature as described in Oruklu. The motivation for doing so would have been to improve machine-learning algorithms by providing additional parameters that improve learning.  

	As per claim 5, Osband-Li-Oruklu disclose the system of claim 1, wherein the at least one processor device is further configured to execute program code stored on the memory device to:
obtain an observation and reward in a system environment (Osband: Fig. 6; [0061], [0064]: Obtains an observation and reward 635 in a system environment.); and 
update a history based on the observation and the reward (Osband: Fig. 6; [0061], [0064]: Update 650 a history based on the observation and the reward 635.).
	
As per claim 6, Osband-Li-Oruklu disclose the system of claim 5, wherein the meta-policy function is further based on the updated history (Osband: Fig. 6; [0011], [0061], [0064]: Since the approximator 610 is responsible for selecting and determining actions, the approximator 610 is updated 650 based on the updated history.).

	As per claim 7, Osband-Li-Oruklu disclose the system of claim 1, wherein the at least one processor device is further configured to execute program code stored on the memory device to return the updated meta-policy parameter as output for selecting a future action for execution within the system environment (Osband: Fig. 6; [0011], [0061], [0064]: The approximator 610 is continuously updated 650 and output for selecting a future action.).

As per claim 8, the claim recites limitations analogous to those of claim 1 above and is therefore rejected on the same premise.

As per claim 12, the claim recites limitations analogous to those of claim 5 above and is therefore rejected on the same premise.
 
As per claim 13, the claim recites limitations analogous to those of claim 6 above and is therefore rejected on the same premise.

As per claim 14, the claim recites limitations analogous to those of claim 7 above and is therefore rejected on the same premise.
 
As per claim 15, the claim recites limitations analogous to those of claims 1 and 8 above and is therefore rejected on the same premise.

	As per claim 19, Osband-Li-Oruklu disclose the computer program product of claim 15, wherein the at least one processor device is further configured to execute program code stored on the memory device to:
obtain an observation and reward in a system environment (Osband: Fig. 6; [0061], [0064]: Obtains an observation and reward 635 in a system environment.); and 
update a history based on the observation and the reward (Osband: Fig. 6; [0061], [0064]: Update 650 a history based on the observation and the reward 635.); 
wherein the meta-policy parameter is further based on the updated history (Osband: Fig. 6; [0011], [0061], [0064]: Since the approximator 610 is responsible for selecting and determining actions, the approximator 610 is updated 650 based on the updated history.).

As per claim 20, the claim recites limitations analogous to those of claim 7 above and is therefore rejected on the same premise.

Claims 2, 9, and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Osband in view of Li and in further view of Zhang et al., hereinafter referred to as Zhang (US 2019/0318648 A1).

	As per claim 2, Osband-Li disclose the system of claim 1, wherein each policy in the set of (Osband: Figs. 4-5 disclose state vector representation and Li: Col. 6, ll. 51-57 disclose the set of ensemble policies.).
	However, Osband-Li do not explicitly disclose “… a state vector having a fixed length.”
	Further, Zhang is in the same field of endeavor and teaches a state vector having a fixed length (Zhang: Para. [0057] discloses a state vector having a fixed length.).
	Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, and having the teachings of Osband-Li and Zhang before him or her, to modify the RL model of Osband-Li to include the state vector fixed length feature as described in Zhang. The motivation for doing so would have been to improve reinforcement learning by providing knowledge learned generated on predictive distribution.
  
As per claim 9, the claim recites limitations analogous to those of claim 2 above and is therefore rejected on the same premise.
 
As per claim 16, the claim recites limitations analogous to those of claims 2 and 9 above and is therefore rejected on the same premise.

Allowable Subject Matter
Claims 3, 4, 10, 11, 17, 18 would be allowable if rewritten to overcome the rejection(s) under 35 U.S.C. 112(a), set forth in this Office action and to include all of the limitations of the base claim and any intervening claims.



Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure and can be seen in the list of cited references.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to PEET DHILLON whose telephone number is (571)270-5647.  The examiner can normally be reached on M-F: 5am-1:30pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Sath V. Perungavoor, can be reached at 571-272-7455. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/PEET DHILLON/
Primary Examiner, Art Unit 2488
Date: 05-13-2021