DETAILED ACTION
This action is written in response to the application filed 2/28/19. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale or otherwise available to the public before the effective filing date of the claimed invention.

Claims 1-3, 9-11, 13-14 and 19-20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Calmon (US 2020/0241921 A1).
Regarding claim 1, Calmon discloses a computing system for simulating allocation of resources to a plurality of entities, the computing system comprising:
one or more processors;
[0121] processor.
a reinforcement learning agent model configured to receive an entity profile that describes at least one of a preference or a demand of a simulated entity, and in response to receiving the entity profile, output an allocation output that describes a resource allocation for the simulated entity of the plurality of entities;

Calmon fig. 9, annotated by Examiner.
[0072] “As noted above, some exemplary embodiments address the problem of resource allocation for the task of training deep neural networks.” (Emphasis added.)
[0088] “At the beginning of each episode, an initial allocation of resources ro and the SLA metric set-point m* are defined. For each epoch, the allocation is changed with an action and a new state and rewards are measured.” (Emphasis added.)
an entity model configured to receive data descriptive of at least one resource, and in response to receiving the data descriptive of the at least one resource, simulate a simulated response output that describes a response of the simulated entity to the data descriptive of the at least one resource;
See fig. 9, reproduced above.
[0082] “In the example of FIG. 8, the first three stages (e.g., an “observe current state” stage 830, an “action decision, actuation” stage 840 and an “observe reward” stage 850) are all linked to a definition of the simulated environment task 820.” (Emphasis added.)
one or more non-transitory computer-readable media that collectively store instructions that, when executed by the one or more processors, cause the computing system to perform operations, the operations comprising:
inputting the entity profile into the reinforcement learning agent model;
See fig. 9, reproduced above.
[0031] “As shown in FIG. 1, the exemplary reinforcement learning module 100 processes an iterative workload specification 110, comprising a plurality of states of the workload and a set of available actions for one or more of the plurality of states, and an iterative workload domain model 120 that relates an amount of resources allocated in training data with one or more service metrics, such as Service Level Agreement metrics.” (Emphasis added.)
receiving, as an output of the reinforcement learning agent model, the allocation output that describes the resource allocation for the simulated entity;
See fig. 9, reproduced above.
[0083] “The action is defined as an increment or decrement of the amount of resources dedicated to the controlled workload.”
selecting the at least one resource to provide to the entity model based on the resource allocation described by the allocation output;
See fig. 9, reproduced above, specifically action 930.
[0083] “The action is defined as an increment or decrement of the amount of resources dedicated to the controlled workload.”
See generally [0083]-[0088] describing the resource state definition as well as how actions update the state at each epoch (iteration).
providing the at least one resource to the entity model;
See fig. 9, reproduced above, specifically action 930.
receiving, as an output of the entity model, the simulated response output that describes the response of the simulated entity to the at least one resource; and
See fig. 9, reproduced above, specifically state-reward 960. See also description at [0090].
updating at least one of a resource profile that describes the at least one resource or the entity profile based on the simulated response output.
See generally [0083]-[0088] describing the resource state definition as well as how actions update the state at each epoch (iteration).
[0088] “At the beginning of each episode, an initial allocation of resources ro and the SLA metric set-point m* are defined. For each epoch, the allocation is changed with an action and a new state and rewards are measured. ... The training of the QDNN is performed at QDNN training stage 900 in a fixed number of episodes, and it trains a neural network that receives the current state s, and outputs the expected reward q for each of the actions ai, as discussed further below in conjunction with FIG. 9.” (Emphasis added.)
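For orientation only, the operations recited in claim 1 can be read as a single simulation loop. The following Python sketch is illustrative and is not drawn from Calmon or from the application; the names EntityProfile, agent_model, entity_model and simulate_step, and the numeric rules inside them, are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class EntityProfile:
    """Hypothetical profile: the simulated entity's demand for a resource."""
    demand: float


def agent_model(profile: EntityProfile) -> int:
    """Stand-in for the agent model: maps a profile to an allocation output.

    A trained reinforcement learning policy would go here; this placeholder
    allocates one unit per whole unit of demand, with a minimum of one.
    """
    return max(1, int(profile.demand))


def entity_model(allocated_units: int, profile: EntityProfile) -> float:
    """Stand-in for the entity model: simulates the entity's response.

    Returns the fraction of demand satisfied, capped at 1.0.
    """
    if profile.demand <= 0:
        return 1.0
    return min(1.0, allocated_units / profile.demand)


def simulate_step(profile: EntityProfile) -> Tuple[int, float, EntityProfile]:
    # Input the entity profile and receive the allocation output.
    allocation = agent_model(profile)
    # Select and provide the resource, then receive the simulated response.
    response = entity_model(allocation, profile)
    # Update the entity profile based on the simulated response
    # (assumed rule: unmet demand carries over to the next step).
    updated = EntityProfile(demand=profile.demand * (1.0 - response))
    return allocation, response, updated


allocation, response, updated = simulate_step(EntityProfile(demand=3.5))
```

In this sketch the profile update is the only state carried between steps; a resource profile could be updated in the same place.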


Regarding claim 13, Calmon discloses a method for simulating allocation of resources to a plurality of entities, the method comprising:
inputting, by one or more computing devices, an entity profile that describes at least one of a preference or a demand of a simulated entity into a reinforcement learning agent model, the reinforcement learning agent model configured to receive the entity profile, and in response to receiving the entity profile, output an allocation output that describes a resource allocation for the simulated entity;

Calmon fig. 9, annotated by Examiner.
[0072] “As noted above, some exemplary embodiments address the problem of resource allocation for the task of training deep neural networks.” (Emphasis added.)
[0088] “At the beginning of each episode, an initial allocation of resources ro and the SLA metric set-point m* are defined. For each epoch, the allocation is changed with an action and a new state and rewards are measured.” (Emphasis added.)
receiving, by the one or more computing devices, as an output of the reinforcement learning agent model, the allocation output that describes the resource allocation for the simulated entity;
See fig. 9, reproduced above.
[0082] “In the example of FIG. 8, the first three stages (e.g., an “observe current state” stage 830, an “action decision, actuation” stage 840 and an “observe reward” stage 850) are all linked to a definition of the simulated environment task 820.” (Emphasis added.)
[0083] “The action is defined as an increment or decrement of the amount of resources dedicated to the controlled workload.”
selecting, by the one or more computing devices, at least one resource to simulate providing to an entity model based on the resource allocation described by the allocation output, the entity model being configured to receive data descriptive of the at least one resource, and in response to receiving the data descriptive of the at least one resource, simulate a simulated response output that describes a response of the simulated entity to the data descriptive of the at least one resource;
See fig. 9, reproduced above, specifically action 930.
See generally [0083]-[0088] describing the resource state definition as well as how actions update the state at each epoch (iteration).
providing, by the one or more computing devices, data descriptive of the at least one resource to an entity model;
See fig. 9, reproduced above.
[0083] “The action is defined as an increment or decrement of the amount of resources dedicated to the controlled workload.”
receiving, by the one or more computing devices, as an output of the entity model, the simulated response output that describes the response of the simulated entity to the at least one resource; and
See fig. 9, reproduced above, specifically state-reward 960. See also description at [0090].
updating, by the one or more computing devices, at least one of a resource profile that describes the at least one resource or the entity profile based on the simulated response output.
[0088] “At the beginning of each episode, an initial allocation of resources ro and the SLA metric set-point m* are defined. For each epoch, the allocation is changed with an action and a new state and rewards are measured. ... The training of the QDNN is performed at QDNN training stage 900 in a fixed number of episodes, and it trains a neural network that receives the current state s, and outputs the expected reward q for each of the actions ai, as discussed further below in conjunction with FIG. 9.” (Emphasis added.)

wherein the reinforcement learning agent model comprises a reinforcement learning agent that is learned based on a reward that is a function of the simulated response output.
Fig. 3, step 320: “Update, by Reinforcement Learning Agent(s), Function that Evaluates Quality of State-Action Combinations”.
See generally [0036]-[0037] describing the iteration routine for training the reinforcement learning agents and [0083]-[0088] describing how actions update the state of the simulated system at each epoch (iteration).
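Step 320 of Calmon's Fig. 3 refers to updating a function that evaluates the quality of state-action combinations. As a hedged illustration of that general idea, the Python sketch below shows the standard tabular Q-learning update. It is a simplification: Calmon describes a neural network (the QDNN) rather than a lookup table, and the names q_update, alpha and gamma, the action set, and the example states are assumptions introduced here.

```python
from collections import defaultdict

ACTIONS = (-1, 0, +1)  # assumed: decrement, hold, or increment the allocation


def q_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Apply one tabular Q-learning update and return the new Q(state, action).

    Q(s, a) <- Q(s, a) + alpha * (reward + gamma * max_a' Q(s', a') - Q(s, a))
    """
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


# Q-table defaulting to 0.0; states here are simply the number of allocated units.
q_table = defaultdict(float)
# The reward is assumed to be computed from the simulated response output.
new_value = q_update(q_table, state=2, action=+1, reward=1.0, next_state=3)
```

Because the reward argument is the only place the environment's feedback enters, a reward that is a function of the simulated response output would be passed in there.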

Regarding claim 3, Calmon discloses its further limitation wherein:
the simulated entity comprises at least one of a computing task or a source of the computing task; and
[0072] “As noted above, some exemplary embodiments address the problem of resource allocation for the task of training deep neural networks.” (Emphasis added.)
the at least one resource comprises a worker configured to execute the computing task.
[0109] “Some illustrative embodiments of a processing platform that may be used to implement at least a portion of an information processing system comprise cloud infrastructure including virtual machines implemented using a hypervisor that runs on physical infrastructure. The cloud infrastructure further comprises sets of applications running on respective ones of the virtual machines under the control of the hypervisor. It is also possible to use multiple hypervisors each providing a set of virtual machines using at least one underlying physical machine. Different sets of virtual machines provided by one or more hypervisors may be utilized in configuring multiple instances of various components of the system.” (Emphasis added.)

Regarding claim 9, Calmon discloses its further limitations further comprising a resource model configured to receive data descriptive of a plurality of resources including the at least one resource, and in response to receiving the data descriptive of the plurality of resources, output resource observable features, and wherein the reinforcement learning agent model is trained to select the allocation output based, at least in part, on the resource observable features, and wherein the operations further comprise:
inputting the data descriptive of the plurality of resources into the resource model;
See fig. 9, reproduced above, specifically state/action information “Q(s, a)” 930. Also [0079]: “Generally, the first three stages in FIG. 7 (e.g., an “observe current state” stage 730, an “action decision, actuation” stage 740 and an “observe reward” stage 750) are all linked to a definition of the environment task 720.” (Emphasis added.)
receiving, as an output of the resource model, resource observable features; and
See fig. 9, reproduced above, specifically “Simulated Environment 950”. Also [0090]: “a simulated environment 950 is employed in the embodiment of FIG. 9 to perform such actions 930 and to obtain the corresponding state-reward values 960.”
inputting the resource observable features into the reinforcement learning agent model.
See fig. 9, reproduced above, specifically “State, Reward 960”.

Regarding claim 10, Calmon discloses its further limitations wherein:
the at least one resource comprises a plurality of resource items; and
[0102] “FIG. 11 illustrates the relation 1100 between an allocated amount of CPUs (central processing units) 1110 and a time per epoch 1120 in training a DNN to detect handwritten digits, according to some embodiments.” See also [0109], reproduced below.
the simulated response output describes a selection of fewer than all of the plurality of resource items.
[0109] “Some illustrative embodiments of a processing platform that may be used to implement at least a portion of an information processing system comprise cloud infrastructure including virtual machines implemented using a hypervisor that runs on physical infrastructure. The cloud infrastructure further comprises sets of applications running on respective ones of the virtual machines under the control of the hypervisor. It is also possible to use multiple hypervisors each providing a set of virtual machines using at least one underlying physical machine. Different sets of virtual machines provided by one or more hypervisors may be utilized in configuring multiple instances of various components of the system.” (Emphasis added.) This passage makes clear that the system may allocate one or more virtual or physical machines to a particular task.

Regarding claim 11, Calmon discloses its further limitations wherein the entity model comprises a discrete choice model.
[0095] “In the Deep Q-Learning approach, the QDNN outputs one result for each possible action in the domain. This implicitly restricts the approach to domains with a finite and discrete number of actions. The set of actions A was previously defined as a finite and discrete set of values, but the resource allocation problem actually configures a continuous action space.” (Emphasis added.)
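The quoted passage at [0095] turns on the QDNN producing one output per action, which presupposes a finite action set. The Python sketch below is a hedged illustration of that point only: it discretizes a range of allocation changes and picks the action with the highest estimated value. The function names discretize_actions and greedy_action and the example values are hypothetical and do not come from Calmon.

```python
def discretize_actions(max_step, step_size):
    """Build a finite action set of allocation changes from -max_step to +max_step."""
    count = int(max_step / step_size)
    return [i * step_size for i in range(-count, count + 1)]


def greedy_action(q_values, actions):
    """Pick the action whose estimated value is highest (one value per action)."""
    if len(q_values) != len(actions):
        raise ValueError("expected one Q-value per action")
    best_index = max(range(len(actions)), key=lambda i: q_values[i])
    return actions[best_index]


# Assumed example: allocation changes of -2..+2 units and made-up Q-values.
actions = discretize_actions(max_step=2, step_size=1)
chosen = greedy_action([0.1, 0.4, 0.2, 0.9, 0.3], actions)
```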

Regarding claim 19, Calmon discloses its further limitation comprising:
inputting, by the one or more computing devices, the data descriptive of the plurality of resources into a resource model that is configured to receive data descriptive of a plurality of resources including the at least one resource, and in response to receiving the data descriptive of the plurality of resources, output resource observable features;
See fig. 9, reproduced above and annotated by the Examiner, specifically “allocation output” / state-action 930.
[0072] “As noted above, some exemplary embodiments address the problem of resource allocation for the task of training deep neural networks.” (Emphasis added.)
[0088] “At the beginning of each episode, an initial allocation of resources ro and the SLA metric set-point m* are defined. For each epoch, the allocation is changed with an action and a new state and rewards are measured.” (Emphasis added.)
receiving, by the one or more computing devices, as an output of the resource model, resource observable features; and
See fig. 9, reproduced above, specifically “state, reward 960”.
inputting, by the one or more computing devices, the resource observable features into the reinforcement learning agent model, wherein the reinforcement learning agent model is trained to select the allocation output based, at least in part, on the resource observable features.
See fig. 9, reproduced above, specifically Agent 910 (equivalent to the recited reinforcement learning agent model). At each iteration (epoch) the agent generates updates to the resource allocation (Action 930) to send to the simulated environment 950.

Regarding claim 20, Calmon discloses its further limitation further comprising receiving, by one or more computing devices, the reinforcement learning agent model before inputting, by the one or more computing devices, the entity profile into the reinforcement learning agent model.
See fig. 9, reproduced above.
See generally [0083]-[0088] describing the resource state definition as well as how actions update the state at each epoch (iteration).
[0088] “At the beginning of each episode, an initial allocation of resources ro and the SLA metric set-point m* are defined. For each epoch, the allocation is changed with an action and a new state and rewards are measured.”

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The following are the references relied upon in the rejections below:
Calmon (primary reference) (US 2020/0241921 A1).
Paternina-Arboleda (Paternina-Arboleda CD, Montoya-Torres JR, Fabregas-Ariza A. Simulation-optimization using a reinforcement learning approach. In 2008 Winter Simulation Conference 2008 Dec 7 (pp. 1376-1383). IEEE.)
Wang (Wang Z, Zaman T. Learning Preferences and User Engagement Using Choice and Time Data. arXiv preprint arXiv:1608.08168. 2016 Aug 29.)
Wu (Wu Q, Wang H, Hong L, Shi Y. Returning is believing: Optimizing long-term user engagement in recommender systems. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management 2017 Nov 6 (pp. 1927-1936).)

Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Calmon and Paternina-Arboleda.
Regarding claim 4, Paternina-Arboleda discloses the following further limitation which Calmon does not seem to disclose explicitly wherein:
the simulated entity comprises an industrial process; and
P. 1376, second col.: “The aim of this paper is to use the advantages of artificial intelligence-based techniques such as reinforcement learning and artificial neural networks, in order to propose a global optimization approach that can be coupled with discrete-event computer simulation models to efficiently resolve practical industrial problems.” (Emphasis added.)
the at least one resource comprises an input to the industrial process.
P. 1378, first full paragraph: “The agent’s job is to find a policy π, mapping states to actions, that maximizes some long-run measure of reinforcement.”
At the time of filing, it would have been obvious to a person of ordinary skill to apply the systems described by Calmon to industrial problems (as taught by Paternina-Arboleda) because simulated environments can provide additional training for the reinforcement learning agent (leading to improved performance) without additional real-world training (which may be expensive or impractical to obtain). Both disclosures pertain to reinforcement learning.

Claims 5-8 and 15-18 are rejected under 35 U.S.C. 103 as being unpatentable over Calmon and Wu.
Regarding claims 5 and 15, Calmon discloses the system of claim 1 (as well as the method of claim 13). Wu discloses the following further limitation which Calmon does not seem to disclose wherein the simulated entity comprises a simulated human user, and the entity profile comprises a user profile that describes at least one of interests or preferences of the simulated human user.

Excerpt from Wu, p. 1932, emphasis added by Examiner. 
At the time of filing, it would have been obvious to a person of ordinary skill to apply the reinforcement learning techniques disclosed by Calmon to the problem identified by Wu, namely simulating user engagement in a recommender system. This would provide for improved user engagement with the system with minimal real-world training, which could lead to improved revenue for a website, while minimizing the need for expensive real-world training. Both disclosures pertain to reinforcement learning.

Regarding claims 6 and 16, Calmon discloses the system of claim 5 (as well as the method of claim 13). Wu discloses the following additional limitation which Calmon does not seem to disclose wherein the simulated response output describes an engagement metric that describes at least one of an interaction time, a consumption amount, a number of engagements, or a rating of the simulated human user with respect to the at least one resource.
P. 1932, sec. 4: “Various evaluation metrics, such as cumulative clicks over time and average return time, were used to compare the algorithms.”
The obviousness analysis of claim 5 applies equally here.

Regarding claims 7 and 17, Calmon discloses the system of claim 1 (as well as the method of claim 13). Wu discloses the following further limitation which Calmon does not seem to disclose wherein updating at least one of the resource profile or the entity profile based on the simulated response output comprises providing data that describes the simulated response output to a user transition model that generates an updated set of user hidden state features and updating the entity profile based on the user hidden state features.

Excerpt from Wu, p. 1932, emphasis added by Examiner.
The obviousness analysis of claim 5 applies equally here.

Regarding claims 8 and 18, Calmon discloses the system of claim 1 (as well as the method of claim 13). Wu discloses the following further limitation which Calmon does not seem to disclose wherein the at least one resource comprises at least one document that comprises at least one of text, audio, video, or graphical content.
P. 1396, first paragraph: “We used the learnt models to rank the articles by their estimated click and return probabilities accordingly, extracted text content from the top 5000 articles of each type, and then generated word clouds to summarize those documents in Figure 4.” (Emphasis added.)
The obviousness analysis of claim 5 applies equally here.

Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Calmon and Wang.
Regarding claim 12, Calmon discloses the system of claim 11. Wang discloses the following additional limitation which Calmon does not seem to disclose wherein the discrete choice model comprises at least one of a multinomial proportion function, multinomial logit function, or exponential cascade function.
P. 4: “The model modifies the common multinomial-logit (MNL) choice model by including a user specific engagement parameter in addition to the normal item utility parameters.” (Emphasis added.)
At the time of filing, it would have been obvious to a person of ordinary skill to apply the multinomial logit choice model (disclosed by Wang) to the reinforcement learning system of Calmon because this would provide means to model a discrete event space (such as the allocation of certain computing resources, such as a number of CPUs, virtual machines, or memory banks, which are discrete). Both disclosures pertain to reinforcement learning.
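For readers unfamiliar with the multinomial logit (MNL) model named in the Wang excerpt, the following Python sketch shows the standard MNL choice probability, P(i) = exp(u_i) / sum_j exp(u_j). It covers only the common MNL form; the user-specific engagement parameter described by Wang is not modeled, and the function name mnl_probabilities and the utility values are hypothetical.

```python
import math


def mnl_probabilities(utilities):
    """Return multinomial logit choice probabilities for a list of utilities."""
    # Subtract the maximum utility for numerical stability; this does not
    # change the resulting probabilities.
    shift = max(utilities)
    weights = [math.exp(u - shift) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]


# Assumed utilities for three candidate resource items.
probs = mnl_probabilities([1.0, 1.0, 0.0])
```

Items with equal utility receive equal probability, and the probabilities always sum to one, which is what makes the model a choice over a discrete set.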

Additional Relevant Prior Art
The following references were identified by the Examiner as being relevant to the disclosed invention, but are not relied upon in any particular prior art rejection:
Sanyal discloses, inter alia, a reinforcement learning system which considers resource costs and resource utilization. (US 2020/0265302 A1)

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Vincent Gonzales whose telephone number is (571) 270-3837. The examiner can normally be reached on Monday-Friday 7 a.m. to 4 p.m. MT.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/Vincent Gonzales/
Primary Examiner, Art Unit 2124