Detailed Action
This action is in response to Applicant's communications filed 22 September 2017.  
Claims 1-25 are pending in this Application.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claim(s) 1-2, 4, 6-9, 12-15, 17-21, and 24-25 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Liu et al. (A Hierarchical Framework of Cloud Resource Allocation and Power Management Using Deep Reinforcement Learning, hereinafter "Liu").

Regarding Claims 1, 8, 14, and 20,
Liu teaches claims 1, 8, 14, and 20.
Liu teaches an apparatus to determine a physical resource assignment, the apparatus comprising:
a compiler logic circuitry ("computational resources (including CPU, memory, disk, communication bandwidth, etc.)" sec. I, p. 372) to identify one or more states in a code, the one or more states to comprise virtual resources to assign to physical resources ("However, a complete resource allocation framework in the cloud computing systems exhibits high dimensions in state and action spaces. For example, a state in the state space may be the Cartesian product of characteristics and current resource utilization level of each server (for hundreds of servers) as well as current workload level (number and characteristics of VMs for allocation)." sec. I, p. 372);
to generate training data for a neural network logic of the compiler logic circuitry ("Based on the stored state transition profiles and Q(s, a) value estimates, the DNN is constructed with weight set θ trained using standard training algorithms [27]." sec. IV, p. 376; "In this work, for the global tier, we perform offline training, including the experience memory initialization and training of the autoencoder, using the whole Google cluster traces. To obtain DNN model in the global tier of the proposed framework, we use workload traces for five different M-machine clusters. We generate four new state transition profiles using the ε-greedy policy [20] and store the transition profiles in the memory before sampling the minibatch for training the DNN." sec. VII.A, p. 380),
the training data to comprise more than one policy ("An effective dynamic power management (DPM) policy should minimize power consumption while maintaining performance degradation within an acceptable level" Abstract, par. 1, p. 372), each policy comprising the one or more states, each state having a status of the code (Algorithm 2; "RL state" sec. VI.B, p. 379), an action (Algorithm 2; "action set A" sec. VI.B, p. 379), and an expected reward (Algorithm 2; "reward rate" sec. VI.B, p. 379) for assignment of a virtual resource to a physical resource ("a deep Q-learning framework was also proposed to derive the optimal action a at each state s in order to maximize (or minimize) the corresponding Q(s, a) value." sec. I, p. 373; "The deep Q-learning technique is adopted for the online control based on the offline-trained DNN. More specifically, at each decision epoch t_k of an execution sequence, the system under control is in a state s_k. The DRL agent performs inference using the DNN to derive the Q(s_k, a) estimate of each state-action pair (s_k, a), and uses ε-greedy policy to derive the action with the highest Q(s_k, a) with probability 1 − ε and choose the other actions randomly with total probability ε. The chosen action is denoted by a_k. At the next decision epoch t_{k+1}, the DRL agent performs Q-value updating based on the total reward (or cost) r_k(s_k, a_k) observed during this time period [t_k, t_{k+1}). At the end of the execution sequence, the DRL agent updates the DNN using the newly observed Q-value estimates, and the updated DNN will be utilized in the next execution sequence. More detailed procedures are shown in Algorithm 1." sec. IV, p. 376);
("A scheduling scheme must be developed for dynamically assigning the jobs to servers and allocating resources in each server" sec. III, p. 374; "An action in the action space may be the allocation of VMs to the servers (a.k.a. physical machines) and allocating resources in the servers for VM execution." sec. I, p. 372; "Action space: The action of the DRL agent for cloud resource allocation is defined as the index of server for VM (job) allocation. The action space for a cluster with M servers is defined" sec. V.A, p. 377); and
to determine an approximated value function based on the training data ("At the next decision epoch t_{k+1}, the DRL agent performs Q-value updating based on the total reward (or cost) r_k(s_k, a_k) observed during this time period [t_k, t_{k+1}). At the end of the execution sequence, the DRL agent updates the DNN using the newly observed Q-value estimates, and the updated DNN will be utilized in the next execution sequence. More detailed procedures are shown in Algorithm 1." sec. IV, p. 376); and to determine the physical resource assignment based on the approximated value function ("we present a generalized form of DRL technique compared with the prior work, which could be utilized for resource allocation and other problems as well... At the end of the execution sequence, the DRL agent updates the DNN using the newly observed Q-value estimates, and the updated DNN will be utilized in the next execution sequence." sec. IV, p. 376).
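For illustration only, the ε-greedy action selection and Q-value update in the passage of Liu quoted above (sec. IV, p. 376) can be sketched in a few lines of Python. This is a simplified tabular sketch, not Liu's DNN-based implementation: a dictionary stands in for the DNN's Q(s, a) estimates, a standard one-step Q-learning update is assumed, and the learning rate `alpha`, the discount factor `gamma`, and all function names are hypothetical.

```python
import random

def epsilon_greedy(q, state, actions, epsilon, rng):
    """Return the action with the highest Q(state, a) with probability 1 - epsilon;
    otherwise return one of the other actions uniformly at random."""
    best = max(actions, key=lambda a: q.get((state, a), 0.0))
    others = [a for a in actions if a != best]
    if others and rng.random() < epsilon:
        return rng.choice(others)  # explore: total probability epsilon over the other actions
    return best  # exploit: greedy action

def q_update(q, s, a, reward, s_next, actions, alpha=0.1, gamma=0.9):
    """Assumed one-step Q-learning update using the reward observed over [t_k, t_{k+1})."""
    target = reward + gamma * max(q.get((s_next, b), 0.0) for b in actions)
    old = q.get((s, a), 0.0)
    q[(s, a)] = old + alpha * (target - old)
    return q[(s, a)]

# Usage: the actions are hypothetical server indices for VM allocation.
rng = random.Random(0)
q_table = {}
servers = [0, 1, 2]
a_k = epsilon_greedy(q_table, "s0", servers, epsilon=0.1, rng=rng)
q_update(q_table, "s0", a_k, reward=1.0, s_next="s1", actions=servers)
```

With probability 1 − ε the greedy action is returned, and the remaining probability ε is spread uniformly over the other actions, which corresponds to the random exploration described in the quoted passage.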

Regarding Claims 2, 12, 15, and 24,
Liu teaches claims 1, 8, 14, and 20, from which claims 2, 12, 15, and 24 depend, respectively.
Liu further teaches wherein the compiler logic circuitry comprises the neural network logic to determine the approximated value function by iterative determination of a gradient descent of the approximated value function and backpropagation of error to incrementally converge to the approximated value function; and to determine the physical resource assignment based on the approximated value function ("In the training process, first we initialize the weights for the input layer and output layer as a normal distribution with a mean value of 0 and standard deviation of 1. The bias for both layers is set as a constant value 0.1. The initial state of LSTM cell is set as 0 for all cells. In response to the back propagated errors, the network is updated by adopting Adam optimization [27], a method for efficient stochastic optimization that only requires first-order gradients with little memory requirement. The method computes individual adaptive learning rates from estimates of the first and second moments of the gradients [33]. The state of the LSTM cell and weights will be trained for minimizing the propagated errors." sec. VI.A, p. 378).
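For illustration only, the Adam update that Liu applies to back-propagated errors (sec. VI.A, p. 378) can be sketched on a one-parameter least-squares problem. The hyperparameter values are common defaults assumed here rather than values taken from Liu, and the gradient is written by hand where a network would obtain it by backpropagation.

```python
import math

def adam_minimize(grad, w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Minimize a scalar loss with Adam, using only its first-order gradient."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g        # first-moment (mean) estimate
        v = beta2 * v + (1 - beta2) * g * g    # second-moment (uncentered variance) estimate
        m_hat = m / (1 - beta1 ** t)           # bias-corrected moments
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)  # adaptive per-parameter step
    return w

# Usage: fit y = w * x to one hypothetical sample (x = 2, y = 6). The squared error
# (w*x - y)^2 has gradient 2 * (w*x - y) * x, which backpropagation would supply in a network.
x, y = 2.0, 6.0
w_fit = adam_minimize(lambda w: 2.0 * (w * x - y) * x, w=0.0)
```

Each step uses the ratio of the first-moment estimate to the square root of the second-moment estimate, so the effective learning rate adapts to the observed gradient magnitudes.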

Regarding Claim 4,
Liu teaches claim 1, from which claim 4 depends.  
Liu further teaches wherein the neural network logic comprises a mini-batch logic configured to determine a sample set of training data with which to perform a gradient descent ("We generate four new state transition profiles using the ε-greedy policy [20] and store the transition profiles in the memory before sampling the minibatch for training the DNN. In addition, we clip the gradients to make their norm values less than or equal to 10. The gradient clipping method has been introduced to DRL in [37]." sec. VII.A, p. 380).
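For illustration only, the minibatch sampling and gradient clipping described in the passage of Liu quoted above (gradient norm at most 10, sec. VII.A, p. 380) can be sketched as follows. Uniform sampling without replacement, the transition-tuple layout, and the function names are assumptions, not details taken from Liu.

```python
import math
import random

def sample_minibatch(memory, batch_size, rng):
    """Draw a uniform random minibatch (without replacement) from stored transitions."""
    return rng.sample(memory, min(batch_size, len(memory)))

def clip_by_norm(grad, max_norm=10.0):
    """Scale a gradient vector down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    return [g * (max_norm / norm) for g in grad]

# Usage: hypothetical transitions stored as (state, action, reward, next_state).
memory = [(f"s{i}", i % 3, float(i), f"s{i + 1}") for i in range(100)]
batch = sample_minibatch(memory, 32, random.Random(0))
clipped = clip_by_norm([30.0, 40.0])   # a gradient of norm 50 is rescaled to norm 10
```

Clipping rescales the whole vector rather than each component separately, so the direction of the gradient is preserved while its length is bounded.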

Regarding Claim 17,
Liu teaches claim 14, from which claim 17 depends.  
Liu further teaches wherein the compiler logic circuitry comprises a random logic to generate a new sequence of the different sequences by insertion of a random assignment of a physical resource for the code ("The deep Q-learning technique is adopted for the online control based on the offline-trained DNN. More specifically, at each decision epoch t_k of an execution sequence, the system under control is in a state s_k. The DRL agent performs inference using the DNN to derive the Q(s_k, a) estimate of each state-action pair (s_k, a), and uses ε-greedy policy to derive the action with the highest Q(s_k, a) with probability 1 − ε and choose the other actions randomly with total probability ε. The chosen action is denoted by a_k. At the next decision epoch t_{k+1}, the DRL agent performs Q-value updating based on the total reward (or cost) r_k(s_k, a_k) observed during this time period [t_k, t_{k+1}). At the end of the execution sequence, the DRL agent updates the DNN using the newly observed Q-value estimates, and the updated DNN will be utilized in the next execution sequence. More detailed procedures are shown in Algorithm 1." sec. IV, p. 376).

Regarding Claims 7, 13, 19, and 25,
Liu teaches claims 1, 8, 17, and 20, from which claims 7, 13, 19, and 25 depend, respectively.
Liu further teaches wherein the physical resource assignment comprises an assignment of a register class or an assignment of a task to a processor ("We demonstrate how jobs are executed on an active server with only CPU resource usage in Fig. 3 as an example. Job 1 consumes 50% of CPU, while each of job 2 and 3 requires 40%. Job 1, 2 and 3 arrive at t_1, t_2 and t_3, respectively, and complete at t_4, t_5 and t_6, respectively. We assume that the server is in active mode at time 0. When job 1 and 2 arrive, there are enough CPU resources so their requirements are satisfied immediately. When job 3 arrives, it waits until the job 1 is completed, and the waiting time is t_4 − t_3. The job latency is defined as the duration between the job arrival and completion. Therefore the latency of job 3 is t_6 − t_3, which is longer than the job duration. To reduce the job latency, the job broker should avoid overloading servers. A scheduling scheme must be developed for dynamically assigning the jobs to servers and allocating resources in each server." sec. III, p. 374).
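For illustration only, the job-latency example in the passage of Liu quoted above (sec. III, p. 374) can be checked with a small simulation. The arrival times and durations are hypothetical (Liu uses symbolic times t_1 through t_6), and a single server with 100% CPU capacity and first-come-first-served start order is assumed.

```python
def simulate_fcfs(jobs, capacity=100):
    """Simulate jobs on one server, each given as (arrival, cpu_percent, duration) in arrival order.

    A job starts at the earliest time, no earlier than its arrival or the previous job's
    start, at which enough CPU is free. Returns (start, completion, latency) per job,
    where latency = completion - arrival.
    """
    running = []  # (completion_time, cpu_percent) of jobs started so far
    results = []
    prev_start = 0
    for arrival, cpu, duration in jobs:
        if cpu > capacity:
            raise ValueError("job demands more CPU than the server has")
        t = max(arrival, prev_start)
        while True:
            running = [(c, u) for c, u in running if c > t]  # drop jobs finished by time t
            if sum(u for _, u in running) + cpu <= capacity:
                break
            t = min(c for c, _ in running)  # wait for the next completion
        completion = t + duration
        running.append((completion, cpu))
        results.append((t, completion, completion - arrival))
        prev_start = t
    return results

# Jobs 1-3 of Liu's example (50%, 40%, 40% CPU) with hypothetical times.
jobs = [(1, 50, 4), (2, 40, 4), (3, 40, 2)]
for n, (start, done, latency) in enumerate(simulate_fcfs(jobs), start=1):
    print(f"job {n}: start={start} completion={done} latency={latency}")
```

With these assumed times, job 3 arrives at t = 3 and waits until job 1 completes at t = 5 (the wait t_4 − t_3 in Liu), so its latency of 4 exceeds its duration of 2.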

Regarding Claims 9 and 18,
Liu teaches claim(s) 8 and 17, from which claim(s) 9 and 18 depend, respectively.
Liu further teaches wherein generating the training data further comprises executing multiple instances of one or more different codes, wherein each of the one or more different codes is compiled with multiple different sequences of assignments of the ("The deep Q-learning technique is adopted for the online control based on the offline-trained DNN. More specifically, at each decision epoch t_k of an execution sequence, the system under control is in a state s_k. The DRL agent performs inference using the DNN to derive the Q(s_k, a) estimate of each state-action pair (s_k, a), and uses ε-greedy policy to derive the action with the highest Q(s_k, a) with probability 1 − ε and choose the other actions randomly with total probability ε. The chosen action is denoted by a_k. At the next decision epoch t_{k+1}, the DRL agent performs Q-value updating based on the total reward (or cost) r_k(s_k, a_k) observed during this time period [t_k, t_{k+1}). At the end of the execution sequence, the DRL agent updates the DNN using the newly observed Q-value estimates, and the updated DNN will be utilized in the next execution sequence. More detailed procedures are shown in Algorithm 1." sec. IV, p. 376).

Regarding Claim 21,
Liu teaches claim 20, from which claim 21 depends.  
Liu further teaches wherein the operations further comprise determining, for a different code, an optimal assignment ("in each decision epoch the DRL agent needs to enumerate all possible actions at current state and perform inference using DNN to derive the optimal Q(s, a) value estimate" sec. IV, p. 376) of the virtual resource to the physical resource ("Cloud computing with virtualization technology enables computational resources (including CPU, memory, disk, communication bandwidth, etc.) in data centers or server clusters to be shared by allocating virtual machines (VMs) in an on-demand manner." sec. I, p. 372) based on the training data ("In this work, for the global tier, we perform offline training, including the experience memory initialization and training of the autoencoder, using the whole Google cluster traces. To obtain DNN model in the global tier of the proposed framework, we use workload traces for five different M-machine clusters. We generate four new state transition profiles using the ε-greedy policy [20] and store the transition profiles in the memory before sampling the minibatch for training the DNN." sec. VII.A, p. 380) and a current state of the different code ("in each decision epoch the DRL agent needs to enumerate all possible actions at current state and perform inference using DNN to derive the optimal Q(s, a) value estimate" sec. IV, p. 376).

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claim(s) 3, 5, 10-11, 16, and 22-23 is/are rejected under 35 U.S.C. 103 as being unpatentable over Liu et al. (A Hierarchical Framework of Cloud Resource Allocation and Power Management Using Deep Reinforcement Learning, hereinafter "Liu") in view of Palmes et al. (Mutation-Based Genetic Neural Network, hereinafter "Palmes").

Regarding Claims 3 and 16,
Liu teaches claim(s) 1 and 14, from which claim(s) 3 and 16 depend, respectively.
Liu further teaches wherein the neural network logic comprises a weight and a bias ("In the training process, first we initialize the weights for the input layer and output layer as a normal distribution with a mean value of 0 and standard deviation of 1. The bias for both layers is set as a constant value 0.1. The initial state of LSTM cell is set as 0 for all cells. In response to the back propagated errors, the network is updated by adopting Adam optimization [27], a method for efficient stochastic optimization that only requires first-order gradients with little memory requirement. The method computes individual adaptive learning rates from estimates of the first and second moments of the gradients [33]. The state of the LSTM cell and weights will be trained for minimizing the propagated errors." sec. VI.A, p. 378).

Liu does not explicitly teach wherein the neural network logic comprises a transitivity layer configured to apply an activation function to the approximated value function.
Palmes teaches wherein the neural network logic comprises a transitivity layer configured to apply an activation function to the approximated value function ("The output value in the second term of the fitness function follows the typical feed-forward computation that uses sigmoidal activation function and threshold values of ANN" sec. V, p. 591).
Liu and Palmes are analogous art because they are both directed to using neural networks to find optimal solutions. It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the resource allocation neural network of Liu with the mutation-based genetic neural network of Palmes. The modification would have been obvious because one of ordinary skill in the art would be motivated to use a popular approach (Palmes: "The use of evolutionary algorithms (EAs) to aid in artificial neural networks (ANNs) learning has been a popular approach to address the shortcomings of back-propagation (BP)" sec. I, p. 587) for fast operation and more robust search coverage (Palmes: "important EA operations are fast and makes feasible the use of a bigger population size for a more robust search coverage" sec. I, p. 587).
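For illustration only, a feed-forward layer with weights, a constant bias of 0.1 (the bias value stated in Liu, sec. VI.A, p. 378), and a sigmoidal activation of the kind referred to in Palmes (sec. V, p. 591) can be sketched as follows. The weights, input features, and function names are hypothetical, and the sketch is not presented as the network of either reference.

```python
import math

def sigmoid(z):
    """Logistic activation, mapping any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def feed_forward(x, weights, bias=0.1):
    """Affine layer (one weight row per output, shared constant bias) followed by a sigmoid."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bias) for row in weights]

# Usage: two hypothetical value outputs for a three-feature state vector.
state = [0.2, 0.5, 0.1]
weights = [[0.4, -0.3, 0.8], [-0.6, 0.1, 0.2]]
values = feed_forward(state, weights)
```

The activation is applied after the weighted sum and bias, so every output value is squashed into the open interval (0, 1).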

Regarding Claims 5, 11, and 23,
Liu teaches claim(s) 1, 8, and 20, from which claim(s) 5, 11, and 23 depend, respectively.
Liu further teaches wherein the compiler logic circuitry comprises a logic to determine different sequences of physical resource assignments ("a deep Q-learning framework was also proposed to derive the optimal action a at each state s in order to maximize (or minimize) the corresponding Q(s, a) value." sec. I, p. 373; "The deep Q-learning technique is adopted for the online control based on the offline-trained DNN. More specifically, at each decision epoch t_k of an execution sequence, the system under control is in a state s_k. The DRL agent performs inference using the DNN to derive the Q(s_k, a) estimate of each state-action pair (s_k, a), and uses ε-greedy policy to derive the action with the highest Q(s_k, a) with probability 1 − ε and choose the other actions randomly with total probability ε. The chosen action is denoted by a_k. At the next decision epoch t_{k+1}, the DRL agent performs Q-value updating based on the total reward (or cost) r_k(s_k, a_k) observed during this time period [t_k, t_{k+1}). At the end of the execution sequence, the DRL agent updates the DNN using the newly observed Q-value estimates, and the updated DNN will be utilized in the next execution sequence. More detailed procedures are shown in Algorithm 1." sec. IV, p. 376) to generate the training data for the code ("Based on the stored state transition profiles and Q(s, a) value estimates, the DNN is constructed with weight set θ trained using standard training algorithms [27]." sec. IV, p. 376; "In this work, for the global tier, we perform offline training, including the experience memory initialization and training of the autoencoder, using the whole Google cluster traces. To obtain DNN model in the global tier of the proposed framework, we use workload traces for five different M-machine clusters. We generate four new state transition profiles using the ε-greedy policy [20] and store the transition profiles in the memory before sampling the minibatch for training the DNN." sec. VII.A, p. 380), wherein each of the different sequences is identified as a policy ("a deep Q-learning framework was also proposed to derive the optimal action a at each state s in order to maximize (or minimize) the corresponding Q(s, a) value." sec. I, p. 373; "The deep Q-learning technique is adopted for the online control based on the offline-trained DNN. More specifically, at each decision epoch t_k of an execution sequence, the system under control is in a state s_k. The DRL agent performs inference using the DNN to derive the Q(s_k, a) estimate of each state-action pair (s_k, a), and uses ε-greedy policy to derive the action with the highest Q(s_k, a) with probability 1 − ε and choose the other actions randomly with total probability ε. The chosen action is denoted by a_k. At the next decision epoch t_{k+1}, the DRL agent performs Q-value updating based on the total reward (or cost) r_k(s_k, a_k) observed during this time period [t_k, t_{k+1}). At the end of the execution sequence, the DRL agent updates the DNN using the newly observed Q-value estimates, and the updated DNN will be utilized in the next execution sequence. More detailed procedures are shown in Algorithm 1." sec. IV, p. 376).

Liu does not explicitly teach genetic logic.
Palmes teaches genetic logic ("mutation-based genetic neural network (MGNN)" Abstract, p. 587).
Liu and Palmes are analogous art because they are both directed to using neural networks to find optimal solutions. It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the resource allocation neural network of Liu with the mutation-based genetic neural network of Palmes. The modification would have been obvious because one of ordinary skill in the art would be motivated to use a popular approach (Palmes: "The use of evolutionary algorithms (EAs) to aid in artificial neural networks (ANNs) learning has been a popular approach to address the shortcomings of back-propagation (BP)" sec. I, p. 587) for fast operation and more robust search coverage (Palmes: "important EA operations are fast and makes feasible the use of a bigger population size for a more robust search coverage" sec. I, p. 587).


Regarding Claim 6,
The Liu/Palmes combination teaches claim(s) 5, from which claim(s) 6 depends.
Liu further teaches wherein the compiler logic circuitry comprises a random logic to generate a new sequence of the different sequences by insertion of a random assignment of a physical resource for the code ("The deep Q-learning technique is adopted for the online control based on the offline-trained DNN. More specifically, at each decision epoch t_k of an execution sequence, the system under control is in a state s_k. The DRL agent performs inference using the DNN to derive the Q(s_k, a) estimate of each state-action pair (s_k, a), and uses ε-greedy policy to derive the action with the highest Q(s_k, a) with probability 1 − ε and choose the other actions randomly with total probability ε. The chosen action is denoted by a_k. At the next decision epoch t_{k+1}, the DRL agent performs Q-value updating based on the total reward (or cost) r_k(s_k, a_k) observed during this time period [t_k, t_{k+1}). At the end of the execution sequence, the DRL agent updates the DNN using the newly observed Q-value estimates, and the updated DNN will be utilized in the next execution sequence. More detailed procedures are shown in Algorithm 1." sec. IV, p. 376).

Regarding Claims 10 and 22,
Liu teaches claim(s) 8 and 20, from which claim(s) 10 and 22 depend, respectively.
Liu further teaches wherein generating the training data comprises performing a logic function to determine virtual resource assignments to physical resources ("However, a complete resource allocation framework in the cloud computing systems exhibits high dimensions in state and action spaces. For example, a state in the state space may be the Cartesian product of characteristics and current resource utilization level of each server (for hundreds of servers) as well as current workload level (number and characteristics of VMs for allocation)." sec. I, p. 372).

Liu does not explicitly teach the genetic logic function to combine sequences from two or more different instances of the code, to introduce a mutation into one or more sequences to generate additional sequences, or to both combine the sequences and to introduce the mutation.
Palmes teaches the genetic logic function to combine sequences from two or more different instances of the code, to introduce a mutation into one or more sequences to generate additional sequences, or to both combine the sequences and to introduce the mutation ("One attractive feature of GA evolution is its support for the generic implementation of its major operations such as crossover, mutation, selection, and replacement. This is achieved by using a dual representation where evolutionary operations are done at the genotype level while fitness evaluation is carried out at the phenotype level." sec. III, p. 589).
Liu and Palmes are analogous art because they are both directed to using neural networks to find optimal solutions. It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the resource allocation neural network of Liu with the mutation-based genetic neural network of Palmes. The modification would have been obvious because one of ordinary skill in the art would be motivated to use a popular approach (Palmes: "The use of evolutionary algorithms (EAs) to aid in artificial neural networks (ANNs) learning has been a popular approach to address the shortcomings of back-propagation (BP)" sec. I, p. 587) for fast operation and more robust search coverage (Palmes: "important EA operations are fast and makes feasible the use of a bigger population size for a more robust search coverage" sec. I, p. 587).
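For illustration only, crossover and mutation (two of the operations named in the passage of Palmes quoted above, sec. III, p. 589) can be sketched on sequences of physical resource assignments. One-point crossover, per-position random-reset mutation, the register names, and the function names are assumptions; this is not Palmes's own MGNN procedure.

```python
import random

def crossover(parent_a, parent_b, rng):
    """One-point crossover: a prefix of parent_a joined to the matching suffix of parent_b."""
    if len(parent_a) != len(parent_b) or len(parent_a) < 2:
        raise ValueError("parents must have the same length of at least 2")
    point = rng.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:]

def mutate(sequence, resources, rate, rng):
    """Replace each assignment by a randomly chosen resource with probability `rate`."""
    return [rng.choice(resources) if rng.random() < rate else r for r in sequence]

# Usage: two hypothetical register-assignment sequences for the same code.
rng = random.Random(1)
registers = ["r0", "r1", "r2", "r3"]
seq_a = ["r0", "r1", "r2", "r0"]
seq_b = ["r3", "r3", "r1", "r2"]
child = mutate(crossover(seq_a, seq_b, rng), registers, rate=0.25, rng=rng)
```

Crossover combines sequences from two instances, and mutation then introduces random changes to generate additional sequences, which mirrors the two alternatives recited for the genetic logic function.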

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to CHARLES C KUO whose telephone number is (571)270-7477. The examiner can normally be reached on M-F: 9:00 a.m. - 6:00 p.m.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ann Lo can be reached on (571) 272-9767.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer 

/CHARLES C KUO/Examiner, Art Unit 2126
/MICHAEL J HUNTLEY/Primary Examiner, Art Unit 2116