DETAILED ACTION
Claims 1-6 are considered in this office action. Claims 1-6 are pending examination.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1-2 and 5-6 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Siddalingaprabhu et al. (US2011/0106737), hereinafter referred to as Siddalingaprabhu.
Regarding Claim 1, Siddalingaprabhu teaches a reinforcement learning method performed by a computer (Para [0060]: “A controller, such as a processor, may implement or execute the reinforcement learning algorithm unit 106 and the evolutionary algorithm unit 108 to perform one or more of the steps described in the method 500 in evolving one or more rules.”), the reinforcement learning method comprising:
performing, based on an action obtained by a basic controller that defines an action on a state of an environment, first reinforcement learning to obtain a first reinforcement learner by using a state action value function expressed in a polynomial in an action range smaller than an action range limit for the environment (Para [0038]: “In FIG. 2, at step 202, starting at time x, five (5) rules having zero credits for all five rules (R252) are initially considered within the system 100. Each rule has its credit value of zero (0) since no rule has been evaluated at time x. At time x, resource usage data and policy schedule data favor the conditions of Rule3. Thus, applying Rule3's action, which starts execution of policies with schedule time x on the system 100, and moves the system 100 to the next step, step 204 (at time x+5 s). Since the state at step 202 is a state with no peak loads on the system 100, Rule3 gets a positive credit of 10 points.”);
performing, based on an action obtained by a first controller that includes the first reinforcement learner, second reinforcement learning to obtain a second reinforcement learner by using a state action value function expressed in a polynomial in an action range smaller than the action range limit (Para [0039]: “At step 204, starting at time x+5 s, there are still five (5) rules (R254) within the system 100. Each rule has its credit value of zero (0) except Rule3, which has its credit value of ten (10) since only Rule3 has been evaluated. At time x+5 s, the resource usage data and the policy schedule data favor the conditions of Rule4. Thus, applying Rule4's action, which moves the system to the next step, step 206 (at time x+10 s). Since the state at step 204 is a state with no peak loads on the system 100, Rule4 gets a positive credit of 10 points.”); and
performing, based on an action obtained by a second controller that includes a second merged reinforcement learner obtained by merging the first reinforcement learner and the second reinforcement learner, third reinforcement learning to obtain a third reinforcement learner by using a state action value function expressed in a polynomial in an action range smaller than the action range limit (Para [0042]: “At step 210, starting at time x+30 s, there are seven (7) rules (R260) within the system 100. At time x+30 s, application of Rule1 deteriorates the system state by causing CPU/Memory spikes as the process continues. This causes the policy scheduler 102 to impose a negative credit of −10 points on Rule1. This way the system 100 moves from one state to the next state as the rules are tested against the resource usage data and the policy schedule data. Periodically, the evolutionary algorithm unit 108 introduces one or more new rules into the system 100 by running an evolutionary algorithm while the reinforcement learning algorithm unit 106 repeats the reinforcement learning algorithm, such as step 202 through step 210.”).
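
For illustration only, the credit bookkeeping described in the quoted paragraphs [0038], [0039] and [0042] of Siddalingaprabhu can be sketched as follows. This is a minimal sketch and not code from the reference: the names `RuleCredits`, `evaluate` and `replay_quoted_steps` are hypothetical, and only the steps actually quoted above (202, 204 and 210) are replayed. Steps 206 and 208 are not quoted in this action and are therefore omitted, so the resulting credit for Rule1 reflects only the single quoted penalty.

```python
from dataclasses import dataclass, field

# Credit magnitude taken from the quoted passages (+10 / -10 points).
CREDIT = 10


@dataclass
class RuleCredits:
    """Hypothetical credit table: one integer credit per rule name."""
    credits: dict = field(default_factory=dict)

    def add_rule(self, name: str) -> None:
        # A rule that has not been evaluated yet starts with zero credit.
        self.credits.setdefault(name, 0)

    def evaluate(self, name: str, caused_peak_load: bool) -> int:
        # Positive credit when no peak load is observed, negative credit
        # when applying the rule deteriorates the system state.
        self.credits[name] += -CREDIT if caused_peak_load else CREDIT
        return self.credits[name]


def replay_quoted_steps() -> dict:
    """Replay only the steps quoted in the rejection (assumed ordering)."""
    table = RuleCredits()
    for i in range(1, 6):                                # five rules at step 202
        table.add_rule(f"Rule{i}")
    table.evaluate("Rule3", caused_peak_load=False)      # step 202
    table.evaluate("Rule4", caused_peak_load=False)      # step 204
    table.evaluate("Rule1", caused_peak_load=True)       # step 210
    return table.credits


if __name__ == "__main__":
    print(replay_quoted_steps())
```

The sketch keeps one scalar credit per rule, which is all the quoted passages describe.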

Claims 5 and 6 are similarly rejected.

Regarding Claim 2, Siddalingaprabhu teaches the reinforcement learning method of claim 1. Siddalingaprabhu also teaches the method further comprising: repeatedly performing a reinforcement learning process for integer j starting from 4 while incrementing j by 1, the reinforcement learning process including performing, based on an action obtained by a j-th controller that includes a j-th merged reinforcement learner obtained by merging the (j-1)-th merged reinforcement learner obtained immediately before and a (j-1)-th reinforcement learner obtained by the (j-1)-th reinforcement learning performed immediately before, j-th reinforcement learning to obtain a j-th reinforcement learner by using a state action value function expressed in a polynomial in an action range smaller than the action range limit (Para [0042]: “At step 210, starting at time x+30 s, there are seven (7) rules (R260) within the system 100. At time x+30 s, application of Rule1 deteriorates the system state by causing CPU/Memory spikes as the process continues. This causes the policy scheduler 102 to impose a negative credit of −10 points on Rule1. This way the system 100 moves from one state to the next state as the rules are tested against the resource usage data and the policy schedule data. Periodically, the evolutionary algorithm unit 108 introduces one or more new rules into the system 100 by running an evolutionary algorithm while the reinforcement learning algorithm unit 106 repeats the reinforcement learning algorithm, such as step 202 through step 210.”).
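
For orientation, the repeated process recited in claims 1 and 2 can be pictured with the toy sketch below. It is one possible reading under stated assumptions, not the applicant's implementation and not taken from any cited reference. The assumptions are: a one-dimensional state and action, a one-step reward in place of a full return, a degree-2 polynomial value model fitted by least squares, an action offset limited to [-delta, delta] around the current controller's action, and "merging" modeled simply as adding the new learner's greedy offset to the previous controller. All function names, the reward, and the target action `s + 2` are hypothetical.

```python
import numpy as np


def features(s, d):
    """Degree-2 polynomial features in state s and action offset d."""
    return np.stack([np.ones_like(s), s, d, s * d, s**2, d**2], axis=-1)


def learn_stage(controller, reward, delta, rng, n=200):
    """Fit a quadratic value model using offsets d in [-delta, delta]
    around the current controller's action (one-step rewards only)."""
    s = rng.uniform(0.0, 1.0, n)
    d = rng.uniform(-delta, delta, n)
    r = reward(s, controller(s) + d)
    w, *_ = np.linalg.lstsq(features(s, d), r, rcond=None)
    return w


def greedy_offset(w, s, delta):
    """Maximize the fitted quadratic over d within the restricted range."""
    lin = w[2] + w[3] * s          # coefficient of d
    quad = w[5]                    # coefficient of d**2
    if quad < 0:                   # concave in d: clip the vertex
        return np.clip(-lin / (2 * quad), -delta, delta)
    # Convex or flat in d: the maximum lies at an end point of the range.
    return np.where(lin >= 0, delta, -delta)


def merge(controller, w, delta):
    """Assumed 'merge': previous controller plus the new greedy offset."""
    return lambda s: controller(s) + greedy_offset(w, s, delta)


def run(stages=5, delta=0.5, seed=0):
    rng = np.random.default_rng(seed)
    reward = lambda s, a: -(a - (s + 2.0)) ** 2   # toy reward, best action s + 2
    controller = lambda s: s                      # toy basic controller
    history = []
    for _ in range(stages):
        w = learn_stage(controller, reward, delta, rng)
        controller = merge(controller, w, delta)
        history.append(float(controller(np.float64(1.0))))
    return history


if __name__ == "__main__":
    print(run())   # action chosen at state s = 1.0 after each stage
```

In this toy setting each stage can move the action by at most `delta`, so the merged controller approaches the best action over several stages rather than in one step.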

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains.  Patentability shall not be negated by the manner in which the invention was made.


The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Siddalingaprabhu in view of Tange (US20160063142), hereinafter referred to as Tange.

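Regarding Claim 4, Siddalingaprabhu teaches the reinforcement learning method of claim 1. However, Siddalingaprabhu does not explicitly teach that the merging is performed by using a quantifier elimination with respect to a logical expression using a polynomial.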

Tange teaches that the merging is performed by using a quantifier elimination with respect to a logical expression using a polynomial (Para [0079]: “In the quantifier elimination process, the variables to which the existential quantifier (∃) has been attached in the first-order predicate logical formula are eliminated to generate a logically equivalent control logical formula. In the present embodiment, as shown in FIG. 8, a control logical formula (p(ed), Δu0) indicating the relationship between the current value e() of the target deviation and the operation change amount Δu0=Δu(S0) at the operation timing S0 is generated. A commonly known algorithm, such as the QE (Quantifier Elimination) algorithm in Non-Patent Document 5, for example, can be used for the quantifier elimination process. Here, a (portion of) an example of the control logical formula (p(ed) Δu0) based on the example of the first-order predicate logical formula up shown in FIG. 7 is illustrated inside the region of the dotted lines in FIG. 8.”).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Siddalingaprabhu to incorporate the teachings of Tange such that the merging is performed by using a quantifier elimination with respect to a logical expression using a polynomial. Doing so would optimize the merging process of the control, as discussed at least in the abstract of Tange.
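
As general background on the term, quantifier elimination over the real numbers replaces a formula containing a quantified variable with an equivalent quantifier-free condition on the remaining variables. The standard textbook example below is offered only to illustrate the concept; it is not taken from Tange or from the application, and the symbols b, c and x are chosen here for illustration.

```latex
% Quantifier elimination over the reals: the bound variable x is removed,
% leaving a polynomial condition on the free parameters b and c.
\[
  \exists x \in \mathbb{R} :\; x^{2} + b\,x + c = 0
  \quad\Longleftrightarrow\quad
  b^{2} - 4c \ge 0 .
\]
% Reason: completing the square gives
% x^2 + b x + c = (x + b/2)^2 - (b^2 - 4c)/4,
% which has a real root exactly when (b^2 - 4c)/4 >= 0.
```

The right-hand side no longer mentions x, which is the sense in which the quantified variable is eliminated.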
	

Allowable Subject Matter
Claim 3 is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Conclusion
The prior art made of record and not relied upon, but considered pertinent to applicant's disclosure, is as follows:
Ikai et al. (US2018/0373223) discloses performing reinforcement learning so as to avoid complicated adjustment of the coefficients of backlash compensation and backlash acceleration compensation. A machine learning apparatus includes a state information acquiring part for acquiring, from a servo control apparatus, state information including at least position deviation and a set of coefficients to be used by a backlash acceleration compensating part, by making the servo control apparatus execute a predetermined machining program, an action information output part for outputting action information including adjustment information on the set of coefficients included in the state information to the servo control apparatus, a reward output part for outputting a reward value in the reinforcement learning on the basis of the position deviation included in the state information, and a value function updating part for updating an action-value function on the basis of the reward value output by the reward output part, the state information and the action information.
Huang et al. (WO2018/206504) discloses a pre-training apparatus and method for reinforcement learning based on a Generative Adversarial Network (GAN). The GAN includes a generator and a discriminator. The method comprises receiving training data from a real environment where the training data includes a data slice corresponding to a first state-reward pair and a first state-action pair, training the GAN using the training data, training a relations network to extract a latent relationship of the first state-action pair with the first state-reward pair in a reinforcement learning context, causing the generator trained with training data to generate first synthetic data, processing a portion of the first synthetic data in the relations network to generate a resulting data slice, and merging the second state-action pair portion of the first synthetic data with the second state-reward pair from the relations network to generate second synthetic data to update a policy for interaction with the real environment.
Mnih et al. (WO 2015054264) discloses a method of reinforcement learning for a subject system having multiple states and actions to move from one state to the next. Training data is generated by operating on the system with a succession of actions and used to train a second neural network. Target values for training the second neural network are derived from a first neural network which is generated by copying weights of the second neural network at intervals.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ABDHESH K JHA whose telephone number is (571)272-6218. The examiner can normally be reached M-F:0800-1700.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, James J Lee, can be reached at 571-270-5965. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/ABDHESH K JHA/
Primary Examiner, Art Unit 3668