DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.


Response to Amendment
This Office Action is in response to applicant’s communication filed 26 October 2021, in response to the Office Action mailed 14 September 2021.  The applicant’s remarks and any amendments to the claims or specification have been considered, with the results that follow.


Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Tesauro et al. (A Hybrid Reinforcement Learning Approach to Autonomic Resource Allocation, June 2006, pgs. 65-73) in view of Abe (US 2004/0015386).

As per claim 1, Tesauro teaches obtaining (i) a specification of an iterative workload comprising a plurality of states of the iterative workload and a set of available actions [the workload includes a space of state-action pairs (pg. 65, abstract describes the space generally; pg. 68, section IV the pairs and reward; etc.)], and (ii) a domain model of the iterative workload that relates an amount of resources allocated in training data with one or more service metrics [a model of the workload relates various metrics such as response time and throughput (pg. 67, section II) which metrics are used in training the RL model (pgs. 65-66, section I for training in general to maximize specific metric; pgs. 66-67, section II for modeling the metrics; pgs. 68-69, section IV for the training of the RL)], wherein a duration of one simulated iteration of a plurality of simulated iterations of the iterative workload using said domain model of the iterative workload satisfies one or more predefined duration criteria [allocation decisions are made in a fixed time interval where the metrics are measured over the intervals using a set time series model of traffic (pgs. 66-67, section II; also see pgs. 68-69, section IV; pgs. 71-72, section VI.B; etc.)];
adjusting weights of at least one reinforcement learning agent by performing iteration steps for each simulated iteration of the iterative workload and then using variables observed during a given simulated iteration of the iterative workload to refine the at least one reinforcement agent [the reinforcement learning (RL) model is trained over a number of batches/iterations using back-propagation to adjust the weights of the neural network (pgs. 68-69, section IV; etc.)]; and
determining, by the at least one reinforcement learning agent implemented using at least one processing device, a dynamic resource allocation policy for the iterative workload [the trained hybrid RL model is used to dynamically allocate resources for a data center workload (pgs. 65-66, section I; etc.)], wherein [the training process for the RL model includes learning a value function for an application over a number of batches, taking as input a recorded sequence of observer state/action/reward produced by a policy and computing a value function using Algorithm 1 (pgs. 68-69, section IV including Algorithm 1; etc.)];
(b) updating, by the at least one reinforcement learning agent, a function that evaluates a quality of a plurality of state-action combinations [the training process for the RL model includes learning a value function for an application over a number of batches, taking as input a recorded sequence of observer state/action/reward produced by a policy and computing a value function using Algorithm 1 (pgs. 68-69, section IV including Algorithm 1; etc.)]; and
(c) repeating the employing and updating steps with a new allocation of resources for a respective simulated iteration of the iterative workload [the training process for the RL model includes learning a value function for an application over a number of batches, taking as input a recorded sequence of observer state/action/reward produced by a policy and computing a value function using Algorithm 1 (pgs. 68-69, section IV including Algorithm 1; etc.)].
While Tesauro teaches updating the RL agent function that evaluates a quality of the state-action combinations (see above), it does not explicitly teach updating, by the at least one reinforcement learning agent, using a weighted average of the current state and the next state, a function that evaluates a quality of the plurality of state-action combinations.
Abe teaches updating, by the at least one reinforcement learning agent, using a weighted average of the current state and the next state, a function that evaluates a quality of the plurality of state-action combinations [the method includes updating the value function estimate using a weighted average of the Q-value estimate from the last state and the discounted partial sums of rewards obtained over the next several states (para. 0112, etc.) for the batch reinforcement learning (para. 0102, etc.)].
Tesauro and Abe are analogous art, as they are within the same field of endeavor, namely training and using a reinforcement learning model.
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to utilize the weighted average of Q values for the states to update the value function, as taught by Abe, for the update of the value function, including Q values, in the system taught by Tesauro.
Abe provides motivation as [it is possible to simulate online reinforcement learning on very large data sets with a particular policy by electing to use specific data by Q sampling, which includes the weighted averaging (paras. 0110-112, etc.)].
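By way of illustration only, a weighted-average update of a state-action quality function of the general kind discussed above may be sketched as follows. The sketch assumes a tabular Q function and the textbook Q-learning form; the function name q_update, the parameters alpha and gamma, and the example states and actions are hypothetical and are not taken from Tesauro, Abe, or the present application.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    """Blend the current estimate with a target built from the next state.

    alpha is the weight given to the next-state target (assumed value);
    (1 - alpha) is the weight kept on the current estimate.
    """
    # Best quality estimate available from the next state.
    best_next = max(q[(next_state, a)] for a in actions)
    # Target: immediate reward plus discounted next-state estimate.
    target = reward + gamma * best_next
    # Weighted average of the current estimate and the target.
    q[(state, action)] = (1.0 - alpha) * q[(state, action)] + alpha * target
    return q[(state, action)]


# Hypothetical example: two allocation actions, one known next-state value.
q = defaultdict(float)
actions = ["add_server", "remove_server"]
q[("s1", "add_server")] = 2.0
new_value = q_update(q, "s0", "add_server", reward=1.0,
                     next_state="s1", actions=actions)
```

With these assumed values the target is 1.0 + 0.9 * 2.0, and the stored estimate for the current state-action pair moves halfway from 0 toward that target.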

As per claim 2, Tesauro/Abe teaches wherein the domain model is obtained from sample training executions used to learn the relationship between the amount of resources allocated and the one or more service metrics [a model of the workload relates various metrics such as response time and throughput (Tesauro: pg. 67, section II) which metrics are used in training the RL model based on a prior policy (Tesauro: pgs. 65-66, section I for training in general to maximize specific metric; pgs. 66-67, section II for modeling the metrics; pgs. 68-69, section IV for the training of the RL)].

As per claim 3, Tesauro/Abe teaches wherein the step of adjusting weights of the at least one reinforcement learning agent employs a reward metric based on a difference between a desired service metric and a measured service metric [a model of the workload relates various metrics such as response time and throughput (Tesauro: pg. 67, section II) which metrics are used in training the RL model (Tesauro: pgs. 65-66, section I for training in general to maximize specific metric; pgs. 66-67, section II for modeling the metrics; pgs. 68-69, section IV for the training of the RL) and the training process for the RL model includes learning a value function for an application over a number of batches, taking as input a recorded sequence of observer state/action/reward produced by a policy and computing a value function using Algorithm 1, which includes the error calculation (Tesauro: pgs. 68-69, section IV including Algorithm 1; etc.)].
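By way of illustration only, a reward metric based on a difference between a desired and a measured service metric may be sketched as follows. The claim language requires only that the reward be based on the difference; the negative absolute value used here, the function name reward_from_metric, and the example response times are assumptions and are not drawn from the cited references.

```python
def reward_from_metric(desired, measured):
    """Return 0 when the measured metric hits the target, and an
    increasingly negative value as the gap grows in either direction."""
    return -abs(desired - measured)


# Hypothetical example: desired response time of 100 ms.
on_target = reward_from_metric(100.0, 100.0)
too_slow = reward_from_metric(100.0, 130.0)
too_fast = reward_from_metric(100.0, 70.0)
```

Under this assumed form, overshooting and undershooting the target by the same amount are penalized equally.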

As per claim 4, Tesauro/Abe teaches wherein the step of adjusting weights of the at least one reinforcement learning agent comprises a neural network selecting an action from the set of available actions based on a current state and an expected reward of the selected action and comparing the expected reward of the selected action [the training process for the RL model includes learning a value function for an application over a number of batches, taking as input a recorded sequence of observer state/action/reward produced by a policy and computing a value function using Algorithm 1, which includes the error calculation (Tesauro: pgs. 68-69, section IV including Algorithm 1, etc.)].

As per claim 5, Tesauro/Abe teaches wherein the iterative workload comprises a training of a Deep Neural Network [We have chosen to use neural networks (multi-layer perceptrons) as they have the most successful track record in RL applications (Tesauro: pg. 66, section I)].

As per claim 6, Tesauro/Abe teaches wherein possible actions for resource allocation are discretized using a control action parameter [the basic interaction consists of observing the environment's current state, selecting an allowable action (Tesauro: pg. 67, section III; etc.); where “allowable” actions are the discrete actions chosen by the allowance (parameter)].
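By way of illustration only, discretizing the possible resource allocation actions using a single parameter may be sketched as follows. The interpretation of the control action parameter as a step size, the function name discretize_actions, and the bound max_change are assumptions made for this sketch and are not taken from the claims or the cited references.

```python
def discretize_actions(step, max_change):
    """Return allocation changes in multiples of `step`, within +/- max_change."""
    if step <= 0:
        raise ValueError("step must be positive")
    # Number of whole steps that fit inside the allowed change.
    n = int(max_change // step)
    return [k * step for k in range(-n, n + 1)]


# Hypothetical example: change the allocation by at most 4 units, 2 at a time.
actions = discretize_actions(step=2, max_change=4)
```

Under these assumptions the agent chooses among a small finite set of allocation changes rather than a continuous range.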

As per claim 7, Tesauro/Abe teaches wherein the simulated iteration executes in a simulated environment that generates observations from the domain model [the workload generator runs in a simulated environment (Tesauro: pg. 67, section II; etc.)].

As per claim 8, see the rejection of claim 1, above, wherein Tesauro/Abe also teaches a computer program product, comprising a non-transitory machine-readable storage medium having encoded therein executable code of one or more software programs, wherein the one or more software programs when executed by at least one processing device perform the method [the system runs on several applications on multiple servers and includes a resource arbiter (Tesauro: pg. 67, fig. 1; etc.)].

As per claim 9, see the rejection of claim 2, above.

As per claim 10, see the rejection of claim 3, above.

As per claim 11, see the rejection of claim 4, above.

As per claim 12, see the rejection of claim 5, above.

As per claim 13, see the rejection of claim 6, above.

As per claim 14, see the rejection of claim 1, above, wherein Tesauro/Abe also teaches a memory; and at least one processing device, coupled to the memory, operative to implement the method [the system runs on several applications on multiple servers and includes a resource arbiter; which inherently requires at least some memory and a coupled processing device to run software (Tesauro: pg. 67, fig. 1; etc.)].

As per claim 15, see the rejection of claim 2, above.

As per claim 16, see the rejection of claim 3, above.

As per claim 17, see the rejection of claim 4, above.

As per claim 18, see the rejection of claim 5, above.

As per claim 19, see the rejection of claim 6, above.

As per claim 20, see the rejection of claim 7, above.


Response to Arguments
Applicant’s arguments, see the remarks, filed 26 October 2021, with respect to the double patenting rejections, the rejections under 35 U.S.C. 112, and the rejections under 35 U.S.C. 101 have been fully considered and are persuasive in view of the amendments filed (both in this and the related case). Accordingly, those rejections of claims 1-20 have been withdrawn.

Applicant’s further remarks are drawn to the amendments made to the claims, which are addressed by the newly cited Abe reference, as described above.


Conclusion
The following is a summary of the treatment and status of all claims in the application as recommended by M.P.E.P. 707.07(i): claims 1-20 are rejected.

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Mao et al. (Resource Management with Deep Reinforcement Learning, Nov 2016, pgs. 50-56) – discloses using an RL model for resource allocation.
Tumbde (US 2013/0097321) – discloses using RL for workload balancing.

The examiner requests, in response to this Office action, that support be shown for language added to any original claims on amendment and any new claims. That is, indicate support for newly added claim language by specifically pointing to page(s) and line number(s) in the specification and/or drawing figure(s). This will assist the examiner in prosecuting the application.

When responding to this Office action, Applicant is advised to clearly point out the patentable novelty which he or she thinks the claims present, in view of the state of the art disclosed by the references cited or the objections made.

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to GEORGE GIROUX whose telephone number is (571)272-9769. The examiner can normally be reached M-F 10am-6pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/GEORGE GIROUX/Primary Examiner, Art Unit 2128