DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This Office Action is a response to communications dated 09/23/2020.  Claims 1-30 are pending in the application.

Information Disclosure Statement
The information disclosure statements filed 09/23/2020 and 10/27/2021 comply with the provisions of 37 CFR 1.97, 1.98 and MPEP § 609.  They have been considered and placed in the application file.

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1-5 and 7-10 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Tan et al. (US 2019/0014488) (hereinafter “Tan”).


Regarding claim 1, Tan, considered in its entirety, teaches a method of determining a sequence of actions (Abstract and thereinafter:  “A neural network is trained using deep reinforcement learning (DRL) techniques for adjusting cell parameters of a wireless network is disclosed”), the method comprising: 
training a first deep Q-network (DQN) (para [0007]: "training a neural network for a plurality of cells in a wireless network using a deep reinforcement learning (DRL) process for adjusting one or more cell parameters of cells associated with base stations in the wireless network, each base station providing communication services to user equipments (UEs) within a coverage area of one or more cells, and the neural network being trained to determine actions that can be performed on the base stations."  And para [0040]: “Deep learning is a technique … a Q-learning network … or any other applicable network”); 
providing a plurality of entries of a first multi-dimensional matrix as input to the DQN, the first matrix representing a first state, each entry of the first matrix representing an action that can be taken in the first state (para [0007]: "wherein the DRL process comprises: generating a plurality of experience tuples for a plurality of cells in the wireless network, comprising: generating a state tensor for each of the plurality of cells, each state tensor indicating a state of a respective cell, wherein a state of a cell comprises a setting of a base station associated with the cell, the base station providing a coverage area of the cell"); 
determining, using the first DQN, a plurality of Q-values for the plurality of entries of the first matrix, respectively (para [0007]: "selecting an action for each of the plurality of cells, the action moving the respective cell from one state to another state, wherein an action comprises information for adjusting a setting of a base station associated with a cell"); 
executing a first action, the first action being the action represented by the entry, from among the plurality of entries, for which the first DQN determined the highest Q-value among the plurality of determined Q-values (para [0007]: "applying respective actions selected for the plurality of cells to the respective cells to adjust one or more cell parameters"); 
accumulating a reward based on executing the first action (para [0007]: "a local reward calculated for applying the action to the respective cell, and a global reward calculated for applying the action to the respective cell, the local reward being calculated based on a local cost function and the global reward being calculated based on a global cost function; and determining whether or not an action of a first experience tuple in the plurality of experience tuples corresponding to a first cell is acceptable based on the local reward and the global reward of the first experience tuple"); and 
transitioning from the first state to a next state in accordance with a first set of rules and the executed first action (para [0007]: "generating an experience tuple for each of the plurality of cells based on the respective action applied, the experience tuple comprising a cell identifier identifying the respective cell, a first state of the respective cell that is indicated by a respective state tensor, a second state of the respective cell, the action applied to the respective cell that moves the respective cell from the first state to the second state").
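For illustration only, and forming no part of Tan's disclosure or of the grounds of rejection, the sequence of steps recited in claim 1 can be sketched as follows. The names `greedy_action`, `step` and `toy_q`, the 2x2 matrix, and the rule that an executed entry is zeroed are all hypothetical assumptions chosen to keep the sketch small; `toy_q` merely stands in for a trained DQN.

```python
from typing import Callable, List, Tuple

Matrix = List[List[float]]
Action = Tuple[int, int]  # (row, column) index of one matrix entry


def greedy_action(state: Matrix, q_network: Callable[[Matrix, Action], float]) -> Action:
    """Score every entry of the state matrix and return the highest-scoring one."""
    actions = [(r, c) for r, row in enumerate(state) for c, _ in enumerate(row)]
    return max(actions, key=lambda a: q_network(state, a))


def step(state: Matrix, action: Action) -> Tuple[Matrix, float]:
    """Hypothetical rule set: the entry's value is the reward, then the entry is zeroed."""
    r, c = action
    reward = state[r][c]
    next_state = [row[:] for row in state]  # copy so the first state is left unchanged
    next_state[r][c] = 0.0
    return next_state, reward


def toy_q(state: Matrix, action: Action) -> float:
    """Stand-in for a trained DQN: scores an entry by its own numerical value."""
    return state[action[0]][action[1]]


first_state = [[0.2, 0.9], [0.5, 0.1]]
first_action = greedy_action(first_state, toy_q)     # entry with the highest Q-value
next_state, reward = step(first_state, first_action)  # accumulate reward, transition
```

Under these assumptions the selected entry is the one at row 0, column 1, and the transition returns a new matrix rather than modifying the first state in place.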

Regarding claim 2, in addition to features recited in base claim 1 (see rationales discussed above), Tan also discloses wherein the training of the first DQN (FIG. 6 and paras [0050] to [0059]: “The DRL method 600 may be used to train a neural network in a deep learning process”) comprises: 
initializing first weights (FIG. 6; step 604 and para [0050]: "The DRL method 600 may start with initializing the neural network with a set of weight values. At step 602, the DRL method 600 determines whether the SDNN is initialized with randomly configured or selected weight values, or with weight values of an expert neural network"); 
randomly selecting a mini-batch of sample states from among a plurality of stored states, each stored state including a plurality of entries corresponding to a plurality of actions (FIG. 6; step 616 and para [0055]: "At step 616, the method 600 determines whether DRL-generated or expert-generated experience tuples are selected for the mini batch. The method 600 may make the determination based on a predefined criterion. The criterion may be threshold based, probability based, similarity based, based on relationships between the experience tuples, or based on importance sampling. For example, a threshold, e.g., a number or percentage of DRL-generated or expert-generated experience tuples, may be predefined, and the method determines to select a DRL-generated or expert-generated experience tuple based on whether the threshold is satisfied. In another example, the method 600 may probabilistically select a DRL-generated or expert-generated experience tuple. For example, a value between 0 and 1 is randomly generated, e.g., using a random uniform generator. If the value is greater than a threshold, an expert-generated experience tuple is retrieved; and if the value is not greater than the threshold, a DRL-generated experience tuple is retrieved. The determination may also be made based on relationships between the experience tuples. For example, instead of selecting only one experience tuple independently at one time, a group of relevant experience tuples may be selected together each time. The determination may also be made based on other criteria. In one example, an experience tuple may be selected based on its importance to the current training process. The importance of an experience tuple may be defined according to various factors associated with the experience tuple."); and 
for each sample state Sj among the randomly selected mini-batch of sample states, determining one or more valid actions of the sample state Sj (FIG. 6; step 616 and para [0055]: "At step 616, the method 600 determines whether DRL-generated or expert-generated experience tuples are selected for the mini batch. The method 600 may make the determination based on a predefined criterion. The criterion may be threshold based, probability based, similarity based, based on relationships between the experience tuples, or based on importance sampling. For example, a threshold, e.g., a number or percentage of DRL-generated or expert-generated experience tuples, may be predefined, and the method determines to select a DRL-generated or expert-generated experience tuple based on whether the threshold is satisfied. In another example, the method 600 may probabilistically select a DRL-generated or expert-generated experience tuple. For example, a value between 0 and 1 is randomly generated, e.g., using a random uniform generator. If the value is greater than a threshold, an expert-generated experience tuple is retrieved; and if the value is not greater than the threshold, a DRL-generated experience tuple is retrieved. The determination may also be made based on relationships between the experience tuples. For example, instead of selecting only one experience tuple independently at one time, a group of relevant experience tuples may be selected together each time. The determination may also be made based on other criteria. In one example, an experience tuple may be selected based on its importance to the current training process. The importance of an experience tuple may be defined according to various factors associated with the experience tuple. For example the more a TD error associated with an experience tuple is, the more importance this experience tuple is. In another example, experience tuples may be selected based on similarity to one or more currently selected experience tuples. For example, a probability that an experience tuple is selected may be set to be proportional to its similarity to the currently selected experience tuples, and experience tuples are selected based on their probabilities and the currently selected experience tuples"); 
based on the first set of rules, generating, using the first DQN having the first weights, one or more first Q-values corresponding, respectively, to the one or more valid actions of the sample state Sj, generating, using a second DQN having second weights, one or more target values corresponding, respectively, to the one or more valid actions of the sample state Sj, and updating the first weights based on the one or more first Q-values and the one or more target values (para [0056]: "At step 622, the method 600 determines whether the mini batch needs more experience tuples based on a criterion. The criterion may be threshold-based, probability-based, or based on relationship between the experience tuples. For example, when the number of experience tuples in the mini batch is less than a predefined threshold, the method 600 may need to select more experience tuples. If the determination is yes, the method 600 goes back to step 616 to continue select more experience tuples for the mini batch. Otherwise, the method 600 goes to step 624. As another example, when the value generated by a random number generator is greater than a predefined or a dynamic changing probability (e.g., based on cooling schedule defined in simulated annealing), the method 600 may select more experience tuples. In yet another example, when an experience tuple is selected, other experience tuples related to this experience tuple may also be selected."  In addition, para [0057]: "At step 624, the method 600 calculates a temporal difference (TD) error corresponding to each action of the experience tuples in the mini batch. The TD error may be calculated using a method that is value-based, policy-based or model-based, and may be calculated using any applicable algorithms existed or unforeseen, such as techniques of deep Q-network, Double Q, Dueling network, A3C, Deep Sarsa, N-step Q, etc. At step 626, the method 600 back-propagates gradients calculated according to the TD errors to update weights of the SDNN. The techniques for calculating TD error, gradients, updating weights of a neural network are well-known in the pertinent art and will not be described in detail herein.").
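For illustration only, and forming no part of Tan's disclosure or of the grounds of rejection, the training steps recited in claim 2 (random mini-batch selection, target values from a second set of weights, and an update driven by the TD error) can be sketched as follows. To stay self-contained, the sketch assumes lookup tables in place of the first and second DQNs, so the update is a tabular step rather than gradient back-propagation; the names `sample_minibatch`, `td_target` and `update`, the tuple layout, and the values of `gamma` and `lr` are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple

# (state, action, reward, next_state, valid actions in next_state)
Transition = Tuple[int, int, float, int, List[int]]
QTable = Dict[Tuple[int, int], float]  # tabular stand-in for a network's weights


def sample_minibatch(memory: Sequence[Transition], size: int,
                     rng: random.Random) -> List[Transition]:
    """Randomly select a mini-batch from the stored transitions."""
    return rng.sample(list(memory), min(size, len(memory)))


def td_target(reward: float, next_state: int, valid_next: List[int],
              target_q: QTable, gamma: float) -> float:
    """Target value computed with the second (target) table over valid actions only."""
    if not valid_next:  # terminal state: no valid action remains
        return reward
    best = max(target_q.get((next_state, a), 0.0) for a in valid_next)
    return reward + gamma * best


def update(online_q: QTable, target_q: QTable, batch: List[Transition],
           gamma: float = 0.9, lr: float = 0.5) -> None:
    """Move each first Q-value toward its target by a fraction of the TD error."""
    for s, a, r, s2, valid in batch:
        y = td_target(r, s2, valid, target_q, gamma)
        td_error = y - online_q.get((s, a), 0.0)
        online_q[(s, a)] = online_q.get((s, a), 0.0) + lr * td_error


online: QTable = {}
target: QTable = {(1, 0): 2.0, (1, 1): 4.0}
memory: List[Transition] = [(0, 1, 1.0, 1, [0, 1])]
update(online, target, sample_minibatch(memory, 4, random.Random(0)), gamma=0.5, lr=0.5)
```

Holding the second table fixed while the first is updated mirrors the separation between the first and second weights recited in the claim; copying the first table into the second at intervals would be one way to refresh it.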
Regarding claim 3, in addition to features recited in base claim 2 (see rationales discussed above), Tan also discloses wherein, each valid action of the sample state Sj is an action that is permitted to be executed in the sample state Sj, in accordance with the first set of rules (para [0057]: "At step 624, the method 600 calculates a temporal difference (TD) error corresponding to each action of the experience tuples in the mini batch. The TD error may be calculated using a method that is value-based, policy-based or model-based, and may be calculated using any applicable algorithms existed or unforeseen, such as techniques of deep Q-network, Double Q, Dueling network, A3C, Deep Sarsa, N-step Q, etc. At step 626, the method 600 back-propagates gradients calculated according to the TD errors to update weights of the SDNN. The techniques for calculating TD error, gradients, updating weights of a neural network are well-known in the pertinent art and will not be described in detail herein.").
Regarding claim 4, in addition to features recited in base claim 2 (see rationales discussed above), Tan also discloses wherein initializing the first weights includes randomly selecting the first weights (FIG. 6; step 604 and para [0050]: "The DRL method 600 may start with initializing the neural network with a set of weight values. At step 602, the DRL method 600 determines whether the SDNN is initialized with randomly configured or selected weight values, or with weight values of an expert neural network").
Regarding claim 5, in addition to features recited in base claim 4 (see rationales discussed above), Tan also discloses initializing the second weights by setting the second weights equal to the first weights (FIG. 6; step 604 and para [0050]: "The DRL method 600 may start with initializing the neural network with a set of weight values. At step 602, the DRL method 600 determines whether the SDNN is initialized with randomly configured or selected weight values, or with weight values of an expert neural network").
Regarding claim 7, in addition to features recited in base claim 1 (see rationales discussed above), Tan also discloses iteratively performing (updating weights) each of the providing, determining, executing, accumulating and transitioning steps for each consecutive state until reaching a terminal state, the terminal state being a state for which no valid actions exist (para [0007]: "updating weights of the neural network based on whether or not the action is acceptable. The method also includes selecting an action for adjusting a cell parameter of a cell in the wireless network based on the trained neural network; and instructing to adjust the cell parameter of the cell in the wireless network according to the selected action.").
Regarding claim 8, in addition to features recited in base claim 1 (see rationales discussed above), Tan also discloses wherein a valid action of a current state is an action that is permitted to be executed in the current state, in accordance with the first set of rules (para [0007]: "determining whether or not an action of a first experience tuple in the plurality of experience tuples corresponding to a first cell is acceptable based on the local reward and the global reward of the first experience tuple").
Regarding claim 9, in addition to features recited in base claim 1 (see rationales discussed above), Tan also discloses wherein executing the first action includes assigning resources (adjusting a cell parameter of a cell) in a wireless communications network (para [0007]: "determining whether or not an action of a first experience tuple in the plurality of experience tuples corresponding to a first cell is acceptable based on the local reward and the global reward of the first experience tuple; and updating weights of the neural network based on whether or not the action is acceptable. The method also includes selecting an action for adjusting a cell parameter of a cell in the wireless network based on the trained neural network; and instructing to adjust the cell parameter of the cell in the wireless network according to the selected action.").
Regarding claim 10, in addition to features recited in base claim 1 (see rationales discussed above), Tan also discloses wherein, for each entry among the plurality of entries of the first matrix, a numerical value of the entry corresponds to a reward associated with executing the action represented by the entry (para [0006]: "a reward value is calculated using a cost function based on measurement reports received from UEs in the wireless network, wherein each experience tuple can be a DRL-generated experience tuple in which a respective action is selected by a DRL agent based on the neural network according to a DRL technique or an expert-generated experience tuple in which the respective action is provided based on expert experience, and wherein whether an action is selected by the DRL agent based on the neural network or provided based on the expert experience is determined based on a first criterion."  In addition, para [0007]: "a local reward calculated for applying the action to the respective cell, and a global reward calculated for applying the action to the respective cell, the local reward being calculated based on a local cost function and the global reward being calculated based on a global cost function.").


Allowable Subject Matter
Claims 11-30 are allowed.
Claim 6 is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
The following is an examiner’s statement of reasons for allowance:  Liu et al. (Deep Reinforcement Learning Based Dynamic Channel Allocation Algorithm in Multibeam Satellite Systems, IEEE, 10 pages, April 18, 2018) is considered the closest reference of record pertaining to the teaching of a deep reinforcement learning-based dynamic channel allocation algorithm, as depicted in FIGURE 2 on page 4 and the corresponding description that follows ("For the DRL-DCA architecture, the state s, action a and reward r are defined in the modeled MDP. Then the state s is reformulated into an image tensor φ(s) to take full advantage of the CNN. The Q-Network Q(φ(s), a; θ) with parameters θ is the action-value function in charge of mapping the input environments to output CA decisions. During each mapping, Q-Network generates an history result consisting of current state φ(sj), current action aj, instant reward rj+1, the next state φ(sj+1) and stores them into the replay memory D. The target network Q' with parameter θ⁻ is copied from the Q-Network every G steps. At each time step, a minibatch randomly sampled from the replay memory D together with the target network Q' is used to calculate the loss and train the Q-Network. A detailed description on the MDP model, the state reformulation and the implementation of the proposed DRL-DCA algorithm will be given in the following subsections.").  Such teaching is related to the claimed invention of at least claims 11-30.  However, Liu et al. 
Specifically, the prior art of record, considered individually or in combination, appears to fail to fairly show or suggest a claimed invention comprising, among other limitations, novel and unobvious limitations of "performing a UE-beam pair selecting operation including, determining, by a deep Q-network (DQN) of the scheduler, based on the plurality of metric values, a plurality of Q-values, the plurality of Q-values corresponding, respectively, to the plurality of UE-beam pairs, and selecting a UE-beam pair from among the plurality of UE-beam pairs based on the plurality of Q-values; and assigning the UE included in the selected UE-beam pair to the beam included in the selected UE-beam pair," structurally and functionally interconnected in a manner as recited in the claims.
The following is a statement of reasons for the indication of allowable subject matter:  The prior art of record, considered individually or in combination, appears to fail to fairly show or suggest the claimed invention of base claim 1 as further limited by the novel and unobvious limitations of "wherein the generating one or more target values comprises: determining, for each valid action a among A ... representing the first weights", structurally and functionally interconnected in a manner as recited in claim 6.
Any comments considered necessary by applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee.

Conclusion
The prior/related art made of record and not relied upon is considered pertinent to applicant's disclosure.
Akoum et al. (US 2018/0279286).
Luo et al. (US 2019/0116605).
Narasimha et al. (US 2019/0253900).
Landis et al. (US 10,666,342).
Liu et al., Deep Reinforcement Learning Based Dynamic Channel Allocation Algorithm in Multibeam Satellite Systems, IEEE, 10 pages, April 18, 2018.
Mnih et al., Human-level control through deep reinforcement learning, Nature, 13 pages, February 26, 2015.
Watkins et al., Q-Learning, Kluwer Academic Publishers, 14 pages, 1992.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to FRANK DUONG whose telephone number is (571)272-3164. The examiner can normally be reached 7:00AM-3:30PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, MICHAEL THIER, can be reached at 571-272-2832. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
Applicant is encouraged to submit a written authorization for Internet communications (PTO/SB/439, http://www.uspto.gov/sites/default/files/documents/sb0439.pdf) in the instant patent application to authorize the examiner to communicate with the applicant via email. The authorization will allow the examiner to better practice compact prosecution. The written authorization can be submitted via one of the following methods only: (1) Central Fax which can be found in the Conclusion section of this Office action; (2) regular postal mail; (3) EFS WEB; or (4) the service window on the Alexandria campus. EFS web is the recommended way to submit the form since this allows the form to be entered into the file wrapper within the same day (system dependent). Written authorization submitted via other methods, such as direct fax to the examiner or email, will not be accepted. See MPEP § 502.03.




/FRANK DUONG/
Primary Examiner, Art Unit 2474
February 23, 2022