DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Remarks
This action is in response to the amendment filed on 07/27/2022.
Claims 1-13, 15-16, 18-20 have been amended via Applicant’s amendment.
Claims 14 and 17 have been canceled via Applicant’s amendment.
Claims 21-22 are newly added via Applicant’s amendment.
Claim 1 is currently amended via Examiner’s amendment.
Claim 21 is canceled via Examiner’s amendment.
Claims 1, 11 and 16 are independent claims.
Claims 1-13, 15-16, 18-20 and 22 are pending.
Claims 1-13, 15-16, 18-20 and 22 are allowed.

Drawings
The applicant’s drawings submitted are acceptable for examination purposes.

EXAMINER’S AMENDMENT
An examiner’s amendment to the record appears below. Should the changes and/or additions be unacceptable to applicant, an amendment may be filed as provided by 37 CFR 1.312. To ensure consideration of such an amendment, it MUST be submitted no later than the payment of the issue fee.
Authorization for this examiner’s amendment was given in a telephone interview with Brady A. Garcea (Reg. No. 79,075) on 08/17/2022 to place the application in condition for allowance. 

The claims have been amended as follows: 
In the Claims:
	Claim 1 is currently amended via Examiner’s amendment.
	Claim 21 is canceled via Examiner’s amendment.

This list of claims will replace all prior versions, and listings, of claims in the application:
 
List of the Claims: 
1. (Currently Amended) In a resource management digital medium environment, a method implemented by at least one computing device across multiple iterations, and in each iteration the method comprising: 
identifying, by an application, a previous action performed in a previous iteration of the multiple iterations to manage computing device resource usage by the application;
determining a current state of the application indicating a current health of the application, the current state being one of multiple states for the application; 
determining a reward value to apply based at least in part on the current state of the application; 
updating a reinforcement learning model which associates each of multiple actions with each of the multiple states, the reinforcement learning model being updated by distributing a first portion of the reward value to an action value associated with the previous action and a previous state in the previous iteration, and an additional portion of the reward value to an additional action value associated with an additional action and the previous state; 
wherein the first portion of the reward value and the additional portion of the reward value are distributed in a same iteration of the multiple iterations;
selecting, based on the reinforcement learning model, an action of the multiple actions associated with the current state; and 
performing, by the application, the selected action to modify usage of at least one computing device resource.  

2. (Previously Presented) The method of claim 1, wherein the determining the current state of the application comprises determining the current state of the application based on at least one of a nature of a workflow being performed by the application, a health of the application, and user interface activity for the application.  

3. (Previously Presented) The method of claim 1, wherein determining the reward value comprises determining the reward value based on the current state being different than the previous state in the previous iteration, the reward value being greater if the current state is an improved state over the previous state.  

4. (Previously Presented) The method of claim 1, wherein determining the reward value comprises determining the reward value based on a change in resources consumed by the application if the current state is a same state as the previous state in the previous iteration.  

5. (Previously Presented) The method of claim 1, wherein the reinforcement learning model comprises a table including multiple columns and multiple rows corresponding to the multiple states and the multiple actions, updating the reinforcement learning model comprises distributing the reward value across a first cell of the table corresponding to the previous action and the previous state in the previous iteration, as well as one or more cells of the table corresponding to the previous state that are adjacent to the first cell.  

6. (Previously Presented) The method of claim 5, wherein distributing the reward value comprises applying the first portion of the reward value to the first cell, a second portion of the reward value to a second cell that is adjacent to the first cell and corresponds to an action of the previous state, and a third portion of the reward value to a third cell that is adjacent to the first cell and corresponds to the additional action of the previous state.  

7. (Previously Presented) The method of claim 6, wherein the first portion of the reward value comprises one-half of the reward value, the second portion of the reward value comprises one-quarter of the reward value and the third portion of the reward value comprises one-quarter of the reward value.  

8. (Previously Presented) The method of claim 1, wherein selecting the action comprises selecting an action using a first policy and a second policy, the first policy comprising selecting the action based on which action in the reinforcement learning model corresponding to the current state has a largest action value, the second policy comprising selecting an action from the reinforcement learning model randomly. 
 
9. (Previously Presented) The method of claim 8, wherein selecting the action comprises selecting one of the first policy and the second policy based on a distribution giving a probability of the first policy being selected at least seven times a probability of the second policy being selected.  

10. (Previously Presented) The method of claim 8, wherein the reinforcement learning model comprises a table including multiple columns and multiple rows corresponding to the multiple states and the multiple actions, the first policy further comprises selecting the action from a set including a first action, a second action, and a third action, the first action corresponding to a first cell of the table corresponding to the current state and having the largest action value, the second action corresponding to a cell of the table corresponding to the current state and being adjacent to the first cell, and the third action corresponding to an additional cell of the table corresponding to the current state and being adjacent to the first cell.

11. (Previously Presented) In a content creation digital medium environment, a computing device comprising: 
a processor; and 
computer-readable storage media having stored there on multiple instructions of an application that, responsive to execution by the processor, cause the processor to perform operations across multiple iterations, each iteration including: 
identifying, by the application, a previous action performed in a previous iteration of the multiple iterations to manage computing device resource usage by the application;
determining a current state of the application indicating a current health of the application; 
updating a reinforcement learning model by distributing a reward value across action values associated with at least one action, the reinforcement learning model associating each of multiple actions with each of multiple states of the application;
selecting between a first policy and a second policy to implement for selecting an action of the multiple actions associated with the current state, the first policy comprising selecting the action based on which action in the reinforcement learning model corresponding to the current state has a largest action value, the second policy comprising selecting the action from the reinforcement learning model randomly, the first policy having a higher probability of being selected than the second policy; 
selecting, using the selected policy and based on the reinforcement learning model, the action of the multiple actions associated with the current state; and
performing, by the application, the selected action to modify usage of at least one computing device resource.  

12. (Previously Presented) The computing device of claim 11, wherein the reinforcement learning model comprises a table including multiple columns and multiple rows corresponding to the multiple states and the multiple actions, updating the reinforcement learning model comprises distributing the reward value across a first cell of the table corresponding to the previous action and a previous state in the previous iteration, as well as one or more cells of the table corresponding to the previous state that are adjacent to the first cell.  

13. (Previously Presented) The computing device of claim 12, wherein distributing the reward value comprises applying a first portion of the reward value to the first cell, a second portion of the reward value to a second cell that is adjacent to the first cell and corresponds to an action of the previous state, and a third portion of the reward value to a third cell that is adjacent to the first cell and corresponds to an additional action of the previous state.  

14. (Canceled)  

15. (Previously Presented) The computing device of claim 11, wherein a probability of the first policy being selected is at least seven times greater than a probability of the second policy being selected.  

16. (Previously Presented) A system comprising: 
an environment monitoring module, implemented at least in part in hardware, of an application to identify a previous action performed in a previous iteration of multiple iterations to manage computing device resource usage by the application; 
a state generation module, implemented at least in part in hardware, to determine a current state of the application indicating a current health of the application, the current state being one of multiple states for the application; 
means for selecting, based at least in part on the current state of the application and a reinforcement learning model, one of multiple actions to reduce resource usage by the application using a policy for selecting from the multiple actions, the reinforcement learning model comprising a table including multiple columns and multiple rows corresponding to the multiple states and the multiple actions, the policy comprising selecting the one action from a set including a first action, a second action, and a third action, the first action corresponding to a first cell of the table corresponding to the current state and having a largest action value, the second action and the third action corresponding to cells of the table corresponding to the current state and being adjacent to the first cell; and 
an action performance module, implemented at least in part in hardware, to perform the selected action to modify usage of at least one computing device resource.  

17. (Canceled)  

18. (Previously Presented) The system of claim 16, wherein the means for selecting includes means for determining a reward value to be distributed among one or more cells of the reinforcement learning model, the reward value being based on the current state being different than a previous state in the previous iteration, and the reward value being greater if the current state is an improved state over the previous state.  

19. (Previously Presented) The system of claim 16, wherein the means for selecting includes means for determining a reward value to be distributed among one or more cells of the reinforcement learning model, the reward value being based on a change in resources consumed by the application if the current state is a same state as a previous state in the previous iteration.
  
20. (Previously Presented) The system of claim 16, wherein the means for selecting includes means for updating the reinforcement learning model by distributing a reward value across a cell of the table corresponding to the previous action and a previous state in the previous iteration, as well as one or more cells of the table corresponding to the previous state that are adjacent to the cell.  

21. (Canceled)  

22. (Previously Presented) The system of claim 16, wherein adjacent cells in the table which are adjacent to a respective cell correspond to actions that are more similar to an action associated with the respective cell than other actions associated with other cells.
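
By way of illustration only, the reward determination recited in claims 3 and 4 (and in claims 18 and 19) can be sketched as follows. The claims do not specify how states are ordered or how resource consumption is measured. The integer state encoding (a higher number meaning a healthier application), the usage figures, and the function name `reward_value` are assumptions made for this sketch.

```python
def reward_value(prev_state, curr_state, prev_usage, curr_usage):
    """Return a reward for the transition from the previous iteration.

    Assumption: states are integers and a higher number is a healthier state.
    Assumption: usage is a single number, and lower usage is better.
    """
    if curr_state != prev_state:
        # State changed: the reward is greater when the state improved.
        return float(curr_state - prev_state)
    # Same state: base the reward on the change in resources consumed.
    return float(prev_usage - curr_usage)


improved = reward_value(prev_state=1, curr_state=2, prev_usage=100.0, curr_usage=100.0)
worsened = reward_value(prev_state=2, curr_state=1, prev_usage=100.0, curr_usage=100.0)
same_state = reward_value(prev_state=1, curr_state=1, prev_usage=120.0, curr_usage=100.0)
```

Under these assumptions an improved state yields a larger reward than a worsened one, and an unchanged state is rewarded only when resource consumption falls.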

REASONS FOR ALLOWANCE
The cited prior arts:

Kasaragod et al. (US 2020/0167686 A1) discloses performing multiple iterations to manage resource usage by the application; determining a current state of the application and applying a reward value based on the current state of the application; and updating a reinforcement learning model (Kasaragod: “[0040-0042] discloses a training application container that performs training of the reinforcement learning model based on actions performed within the simulation environment. The training of the reinforcement learning model may take into account the reward value, as determined via the reinforcement function, corresponding to the action performed, the initial state, and the state attained via execution of the action. The training container may provide the updated reinforcement learning model to a simulation application container to utilize in the simulation of the application and to obtain new state-action-reward data that may be used to continue updating the reinforcement learning model. Determining an average reward value for the simulation through execution of actions in the simulation environment over a minimum number of iterations of the simulation.” “[0021] discloses that, based on the simulation environment state achieved through execution of the action, the application may determine, based on the reinforcement function, a reward value. [0040] [0059] The training of the reinforcement learning model may further take into account the reward value, as determined via the custom-designed reinforcement function, corresponding to the action performed, the initial state, and the state attained via execution of the action. The training application container may provide the updated reinforcement learning model to a simulation application container to utilize in the simulation of the application and to obtain new state-action-reward data that may be used to continue updating the reinforcement learning model.
[0069] [0077-0078] discloses that, using the reinforcement function, the simulation application container may determine the corresponding reward value for the tuple comprising the initial state, action performed, and resulting state of the simulation environment.”); selecting an action associated with the current state (e.g. Kasaragod: [0069] [0108] the simulation application container may initiate the simulation using a randomized reinforcement learning model, whereby the simulation application container uses the model to select, based on an initial state of the simulation environment, a random action to be performed. The simulation application container may execute the action and determine the resulting state of the simulation environment. Using the reinforcement function, the simulation application container may determine the corresponding reward value for the tuple comprising the initial state, action performed, and resulting state of the simulation environment. The simulation application container may store this data point in the memory buffer to provide the performance data to the training application and execute another action based on the current state of the simulation environment.).
Padala et al. (US 2015/0058265 A1) discloses selecting an action associated with the current state; and performing the selected action to modify usage of a computing resource (e.g. [Abstract] [0003-0005] discloses recommending and selecting a scaling action from a plurality of possible actions for the multi-tier application in the current state. [0034] [0038-0041] discloses that the automatic scaling module operates to automatically scale the multi-tier application as needed. [0071-0072] discloses selecting a scaling action and applying scale-up or scale-down policies based on the selected action.).
Genc et al. (US 2020/0167687 A1) discloses a simulation application container that executes a simulation of a system in a simulation environment, through which an agent representing the system uses a reinforcement learning model to operate within the simulation environment. The simulation application container obtains data indicating how the agent performed in the simulation environment and transmits this data to a robot application container. The robot application container uses the data to update the reinforcement learning model and provides the updated reinforcement learning model to perform another iteration of the simulation and continue training the reinforcement learning model. The training of the reinforcement learning model may further take into account the reward value, as determined via the custom-designed reinforcement function, corresponding to the action performed, the initial state, and the state attained via execution of the action. The training application container may provide the updated reinforcement learning model to a simulation application container to utilize in the simulation of the application and to obtain new state-action-reward data that may be used to continue updating the reinforcement learning model.
LaBute et al. (US 2019/0243691 A1) discloses a method and system for automatically scaling provisioned resources using a machine learning model. The determination of a provisioning action is based on a number of required threads and on model parameters that are adjusted using a performance reward produced by the reward calculator. The model parameters might be adjusted using reinforcement learning applied to the performance rewards produced by the performance calculator, rewarding the model when it succeeds in keeping the resource utilization score within pre-defined bounds.
Bedi et al. (US 2020/0409829 A1) discloses “method and a system for automated testing of applications includes crawling an application by an application crawler to identify application states. Rewards associated with the application states are calculated using a reinforcement learning engine, based on a reward matrix. Critical paths are identified by the reinforcement learning engine and are passed to a test scenario generator. Test scripts are generated by the test scenario generator based on the identified critical paths. The applications are tested by a test scenario execution engine based on the generated one or more test scripts, and test data generated by the test data generator. Test results are captured by a behavior analyzer. One or more insights are generated by the behavior analyzer, from the captured test results to update the reward matrix and to improve the efficiency of continuous autonomous testing system.”  “one or more rewards associated with the one or more identified application states may be calculated using a reinforcement learning engine as in step 206. The one or more rewards may be calculated based on a reward matrix, wherein the reward matrix may be domain specific. Table 2 illustrates a sample reward matrix for web applications in e-commerce domain. A reward matrix may be a table containing pre-defined values(rewards) assigned for transition from one page to another page of a web application.”

However, the cited prior arts taken alone or in combination fail to teach, in combination with other claimed limitations, “updating a reinforcement learning model which associates each of multiple actions with each of the multiple states, the reinforcement learning model being updated by distributing a first portion of the reward value to an action value associated with the previous action and a previous state in the previous iteration, and an additional portion of the reward value to an additional action value associated with an additional action and the previous state; wherein the first portion of the reward value and the additional portion of the reward value are distributed in a same iteration of the multiple iterations; selecting, based on the reinforcement learning model, an action of the multiple actions associated with the current state; and performing, by the application, the selected action to modify usage of at least one computing device resource” as recited in independent claim 1.
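
By way of illustration only, the reward distribution recited in claim 1 can be sketched as below, using the table form of claim 5 and the one-half and one-quarter portions of claim 7. The claims do not specify how a portion is combined with an existing action value or what happens at the edge of the table. The plain additive update, the skipping of missing neighbours, and the name `distribute_reward` are assumptions made for this sketch.

```python
def distribute_reward(q_table, prev_state, prev_action, reward):
    """Spread one reward over the previous action's cell and adjacent cells.

    q_table is a list of rows, one row per state and one column per action.
    All portions are applied within this single call, i.e. in the same iteration.
    """
    row = q_table[prev_state]
    # First portion: one-half to the cell for (previous state, previous action).
    row[prev_action] += 0.5 * reward
    # Additional portions: one-quarter to each adjacent action of the same state.
    # Assumption: a neighbour that falls outside the table is simply skipped.
    for neighbour in (prev_action - 1, prev_action + 1):
        if 0 <= neighbour < len(row):
            row[neighbour] += 0.25 * reward
    return q_table


# Three states and four actions, all action values starting at zero.
q = [[0.0] * 4 for _ in range(3)]
distribute_reward(q, prev_state=1, prev_action=2, reward=8.0)
```

After the call, only the row for the previous state has changed: the previous action's cell holds 4.0 and each adjacent cell holds 2.0.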

The cited prior arts taken alone or in combination also fail to teach, in combination with other claimed limitations, “updating a reinforcement learning model by distributing a reward value across action values associated with at least one action, the reinforcement learning model associating each of multiple actions with each of multiple states of the application; selecting between a first policy and a second policy to implement for selecting an action of the multiple actions associated with the current state, the first policy comprising selecting the action based on which action in the reinforcement learning model corresponding to the current state has a largest action value, the second policy comprising selecting the action from the reinforcement learning model randomly, the first policy having a higher probability of being selected than the second policy; selecting, using the selected policy and based on the reinforcement learning model, the action of the multiple actions associated with the current state; and performing, by the application, the selected action to modify usage of at least one computing device resource” as recited in independent claim 11.
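
By way of illustration only, the selection between the first policy and the second policy recited in claim 11 can be sketched as below. The probability of 0.875 gives the first policy exactly seven times the probability of the second policy, which is the lower bound recited in claim 15. That particular value, the `rng` parameter, and the name `select_action` are assumptions made for this sketch.

```python
import random


def select_action(q_row, greedy_probability=0.875, rng=random):
    """Pick an action index from the current state's row of action values."""
    if rng.random() < greedy_probability:
        # First policy: the action with the largest action value.
        return max(range(len(q_row)), key=q_row.__getitem__)
    # Second policy: an action chosen uniformly at random.
    return rng.randrange(len(q_row))


current_row = [0.1, 0.9, 0.3]
always_greedy = select_action(current_row, greedy_probability=1.0)
always_random = select_action(current_row, greedy_probability=0.0, rng=random.Random(0))
```

Passing a seeded `random.Random` instance as `rng` makes the choice reproducible, which is convenient when checking the behaviour of each policy separately.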

Furthermore, the cited prior arts taken alone or in combination also fail to teach, in combination with other claimed limitations, “means for selecting, based at least in part on the current state of the application and a reinforcement learning model, one of multiple actions to reduce resource usage by the application using a policy for selecting from the multiple actions, the reinforcement learning model comprising a table including multiple columns and multiple rows corresponding to the multiple states and the multiple actions, the policy comprising selecting the one action from a set including a first action, a second action, and a third action, the first action corresponding to a first cell of the table corresponding to the current state and having a largest action value, the second action and the third action corresponding to cells of the table corresponding to the current state and being adjacent to the first cell; and an action performance module, implemented at least in part in hardware, to perform the selected action to modify usage of at least one computing device resource” as recited in independent claim 16.
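
By way of illustration only, the policy recited in claim 16 can be sketched as below: the candidate set is the action with the largest action value for the current state plus the two actions in adjacent cells. The claim does not specify how one action is chosen from that set or how the edge of the table is treated. The uniform random choice, the skipping of missing neighbours, and the names `candidate_actions` and `select_from_neighbourhood` are assumptions made for this sketch.

```python
import random


def candidate_actions(q_row):
    """Return the best action's index together with its adjacent indices."""
    best = max(range(len(q_row)), key=q_row.__getitem__)
    # Assumption: a neighbour that falls outside the table is left out of the set.
    return [a for a in (best - 1, best, best + 1) if 0 <= a < len(q_row)]


def select_from_neighbourhood(q_row, rng=random):
    """Choose one action from the candidate set (assumed uniform choice)."""
    return rng.choice(candidate_actions(q_row))


current_row = [0.2, 0.1, 0.9, 0.4]
candidates = candidate_actions(current_row)
chosen = select_from_neighbourhood(current_row, rng=random.Random(0))
```

Because adjacent cells correspond to similar actions (claim 22), drawing from this small set keeps the selected action close to the best-valued one.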

	These claimed limitations are not present in the prior art of record and would not have been obvious; thus, all pending claims 1-13, 15-16, 18-20 and 22 are allowed.

Any comments considered necessary by applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee.  Such submissions should be clearly labeled “Comments on Statement of Reasons for Allowance.”
  
Conclusion 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Hiren Patel whose telephone number is (571) 270-3366. The examiner can normally be reached Monday to Friday, 9:30 AM to 6:00 PM.
If attempts to reach the above noted Examiner by telephone are unsuccessful, the Examiner’s supervisor, Emerson Puente, can be reached at the following telephone number: (571) 272-3652. 
The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov.  Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).

August 22, 2022


/HIREN P PATEL/Primary Examiner, Art Unit 2196