DETAILED ACTION
Response to Amendment
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This communication is responsive to the amendment filed 09/19/2022.
Claims 1, 4, 7, 9, 15, 16, and 19 have been amended and no claims have been added and/or canceled.
In light of applicant’s amendment, previous claim rejections based on 35 USC 103 with respect to claims 1-20 have been withdrawn.
Claims 1-20 are pending with claims 1, 9, and 16 as independent claims.


Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 2, 4, 5-11, 14-17, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Nachum et al. (US 2019/0332922, pub. 10/31/2019, hereinafter as Nachum) in view of Halder (US 2019/0310650, pub. 10/10/2019, hereinafter as Halder) in view of Commons (US 9,875,440, pub. 01/23/2018, hereinafter as Commons).

As per claim 1, a method, comprising: 
identifying a current observable state of an interactive video game; (Nachum discloses in [0006 and 0034] “a reinforcement learning agent interacting with an environment by performing actions… the policy neural network having a plurality of policy network parameters and being configured to process an input observation characterizing a current state of the environment in accordance with the policy network parameters… the simulated environment may be a video game and the agent may be a simulated user playing the video game.” EX.: processing input observation characterizing a current state may indicate identifying a current observable state)
computing, by a neural network processing the current observable state, a plurality of user interface actions and their respective action scores; (Nachum discloses in [0006 and 0034] “training a policy neural network used to select actions to be performed by a reinforcement learning agent interacting with an environment by performing actions from a pre-determined set of actions, the policy neural network having a plurality of policy network parameters and being configured to process an input observation characterizing a current state of the environment in accordance with the policy network parameters to generate a score distribution that includes a respective score for each action in the pre-determined set of actions” EX.: a score has been generated for each action)
Nachum does not explicitly disclose
wherein an action score associated with a particular user interface action indicates a likelihood of the particular user interface action triggering an observable state transition that belongs to a shortest path from the current observable state to a target observable state. However, Commons, in an analogous art, discloses in ([col. 5, ln 17 to col. 6, ln 35 and col. 45, ln 57 to col. 46, ln 36] “To select a travel path, the automatic driver would take as input from a user a destination address. The automatic driver would then ascertain the current location through a global positioning system (GPS) mechanism…The travel path could be selected by a mapping algorithm calculating a shortest or approximately shortest path between the starting point and the destination… the automatic driver could access Google Maps, Yahoo Maps, or a similar service over the Internet or over a cellular network to obtain driving directions…The car could be steered in accordance with the selected travel path by driving along the path. GPS devices that tell the driver exactly when and in which direction to turn are known in the art and are provided by TomTom Corporation, Garmin Corporation, Magellan Corporation, and others. Therefore, these can be implemented by the automatic driver.” EX.: the current state may be the starting point and the target state may be the destination. The agent, i.e., the automatic driver, may take actions along the shortest path, which is calculated by the GPS system, to navigate to the destination. The user interface action triggering the observable state transition may be the entry of the destination address, entered either by a human driver or by the agent acting as the automated driver.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Nachum with the teaching of Commons because “The aim is to discover the policy that minimizes the cost; i.e., the MC for which the cost is minimal. ANNs are frequently used in reinforcement learning as part of the overall algorithm. Tasks that fall within the paradigm of reinforcement learning are control problems, games and other sequential decision making tasks.” Commons background.
selecting, based on the action scores, a user interface action of the plurality of user interface actions; (further Nachum discloses in [0037] “The policy output is a score distribution that includes a respective score for each action in a predetermined set of actions. In some cases, the system 100 determines the action 114 to be performed by the agent 116 at the time step to be the action with the highest score.”)
applying the selected user interface action to the interactive video game; (Nachum discloses in [0021 and 0034] ““Reward” in this sense may refer to an indication of whether the agent has accomplished a task (e.g., navigating to a target location in the environment) or of the progress of the agent towards accomplishing a task. On-policy path data refers to path data where the agent interacts with the environment by selecting actions based on the current action selection policy of the reinforcement learning system… the simulated environment may be a motion simulation environment, e.g., a driving simulation or a flight simulation, and the agent is a simulated vehicle navigating through the motion simulation environment. In these implementations, the actions may be control inputs to control the simulated user or simulated vehicle.”) and
iteratively repeating the computing, selecting, and applying operations until the target observable state of the interactive video game is reached; (Nachum discloses in [0034] “the reward 122 may indicate whether the agent 116 has accomplished a task (e.g., navigating to a target location in the environment 118) or the progress of the agent 116 towards accomplishing a task.” EX.: the target location may be a target observable state in navigation to location environment).
Nachum does not explicitly disclose computing…a plurality of user interface actions. However, Halder, in an analogous art, discloses in ([0007] “the autonomous vehicle management system is configured to generate a plan of action for the autonomous vehicle such that the goal is achieved in a safe manner. The plan of action may identify a sequence of one or more planned actions to be performed by the autonomous vehicle in order for the autonomous vehicle to achieve the goal in a safe manner.” EX.: the actions can be generated to be performed by the autonomous vehicle to transition the vehicle from the current state to the destination state) selecting…a user interface action of the plurality of user interface actions ([0077, 0225-0226-0243] “the internal map may be used to provide the current state information regarding autonomous vehicle 120 via a user interface, etc.… Any number of combinations of a planned action plus a reason for taking the planned action maybe be available for output depending on the decision making capabilities of the autonomous vehicle management system. The reason for the action may relate to a rule triggered based on information stored in an internal map. In certain embodiments, the future action indicated by a user interface is an action planned several seconds ahead of time… user interface 1500 comprising graphical elements indicating a planned action and a reason for taking the action… the user interface can indicate the next action to be performed in addition to actions that have been performed to completion or superseded by subsequently determined actions… the user interface is updated to reflect a current state of the autonomous vehicle.” EX.: the autonomous vehicle may comprise a GUI with a map for navigating the vehicle from the starting state to the destination state, while actions are performed in the GUI to indicate the progress of the vehicle's motion on the map)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Nachum with the teaching of Halder because “if the user believes the planned action is inappropriate or would prefer taking a different action…the ability to manually override a planned action may be limited, for example, to initiating an unscheduled stop using an emergency stop feature.”
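For illustration only, the compute, select, apply, and repeat limitations of claim 1 discussed above can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Nachum, Halder, Commons, or the application: the state names, the TRANSITIONS table, and the functions distance_to_target, score_actions, and run_agent are hypothetical, and a breadth-first search over a toy state graph stands in for the claimed neural network so that each score reflects how close an action keeps the agent to a shortest path.

```python
from collections import deque
import math

# Hypothetical toy game: each observable state maps UI actions to the next state.
TRANSITIONS = {
    "main_menu": {"click_play": "level_select", "click_options": "options"},
    "options": {"click_back": "main_menu"},
    "level_select": {"click_level_1": "in_game", "click_back": "main_menu"},
    "in_game": {},
}


def distance_to_target(start, target):
    """Breadth-first search: number of transitions from start to target."""
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        state, dist = queue.popleft()
        if state == target:
            return dist
        for nxt in TRANSITIONS[state].values():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return math.inf  # target unreachable from start


def score_actions(state, target):
    """Assumed stand-in for the neural network: normalized scores that are
    higher for actions whose resulting state is closer to the target."""
    weights = {
        action: math.exp(-distance_to_target(nxt, target))
        for action, nxt in TRANSITIONS[state].items()
    }
    total = sum(weights.values())
    if total == 0:
        return {}  # no action leads toward the target
    return {action: w / total for action, w in weights.items()}


def run_agent(start, target, max_iterations=10):
    """Repeat compute, select, apply until the target state is reached."""
    state, path = start, []
    for _ in range(max_iterations):
        if state == target:
            break
        scores = score_actions(state, target)    # compute
        if not scores:
            break
        action = max(scores, key=scores.get)     # select highest score
        path.append(action)
        state = TRANSITIONS[state][action]       # apply
    return state, path
```

In this sketch, calling run_agent("main_menu", "in_game") walks the two-step shortest path through the hypothetical menu graph; a trained policy network would replace score_actions in any real system.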

As per claims 2, 11, and 17, the rejection of the method of claim 1 is incorporated and further wherein selecting the user interface action further comprises: selecting a user interface action that is associated with an optimal action score among the action scores; (Nachum discloses in [0037] “The policy output is a score distribution that includes a respective score for each action in a predetermined set of actions. In some cases, the system 100 determines the action 114 to be performed by the agent 116 at the time step to be the action with the highest score.”).

As per claims 4 and 19, the rejection of the method of claim 1 is incorporated and further wherein the current observable state of the interactive video game is associated with a reward value, and wherein the neural network is trained to maximize overall reward accumulated by traversing a user interface path to the target observable state of the interactive video game; (Nachum discloses in [0006, 0017, and 0041] “wherein the observation in each tuple is an observation characterizing a state of the environment, the action in each tuple is an action performed by the agent in response to the observation in the tuple, and the reward in each tuple is a numeric value received as a result of the agent performing the action in the tuple… obtain path data that defines one or more paths traversed by an agent through the environment 118 over a predetermined number of time steps.”).

As per claims 6 and 14, the rejection of the method of claim 1 is incorporated and the method further comprising: responsive to detecting an error in the interactive video game, modifying one or more parameters of the neural network; (Nachum discloses in [0006 and 0034] “determining a value update for the current values of the policy parameters from the consistency error and the gradient; and using the value update to adjust the current values of the policy parameters.”).

As per claims 7 and 15, the rejection of the method of claim 1 is incorporated and the method further comprising: responsive to failing to achieve the target observable state of the interactive video game within a predefined number of iterations, modifying one or more parameters of the neural network; (Nachum discloses in [0006 and 0034] “determining a value update for the current values of the policy parameters from the consistency error and the gradient; and using the value update to adjust the current values of the policy parameters.”).

As per claim 8, the rejection of the method of claim 1 is incorporated and the method further comprising: training the neural network by a reinforcement learning process; (Nachum discloses in [0006 and 0034] “training a policy neural network used to select actions to be performed by a reinforcement learning agent interacting with an environment by performing actions from a pre-determined set of actions”).

As per claim 9, a system, comprising: a memory; and a processor, communicatively coupled to the memory, the processor configured to: 
identify a current observable state of an interactive video game; (rejected based on the rationale used in the rejection of claim 1)
compute, by a neural network processing the current observable state, a plurality of user interface actions and their respective action scores; (rejected based on the rationale used in the rejection of claim 1)
wherein an action score associated with a particular user interface action indicates a likelihood of the particular user interface action triggering an observable state transition that belongs to a shortest path from the current observable state to a target observable state; (rejected based on the rationale used in the rejection of claim 1)
select, based on the action scores, a user interface action of the plurality of user interface actions; (rejected based on the rationale used in the rejection of claim 1)
apply the selected user interface action to the interactive video game; (rejected based on the rationale used in the rejection of claim 1) and
iteratively repeat the computing, selecting, and applying operations until the target observable state of the interactive video game is reached; (rejected based on the rationale used in the rejection of claim 1).

As per claim 10, the rejection of the system of claim 9 is incorporated and further wherein the interactive video game is an interactive video game; (Nachum discloses in [0006 and 0034] “the simulated environment may be a video game and the agent may be a simulated user playing the video game. As another example, the simulated environment may be a motion simulation environment, e.g., a driving simulation or a flight simulation, and the agent is a simulated vehicle navigating through the motion simulation environment.”).

As per claim 16, a computer-readable non-transitory storage medium comprising executable instructions that, when executed by a computing device, cause the computing device to: 
identify a current observable state of an interactive video game; (rejected based on the rationale used in the rejection of claim 1)
compute, by a neural network processing the current observable state, a plurality of user interface actions and their respective action scores; (rejected based on the rationale used in the rejection of claim 1)
wherein an action score associated with a particular user interface action indicates a likelihood of the particular user interface action triggering an observable state transition that belongs to a shortest path from the current observable state to a target observable state; (rejected based on the rationale used in the rejection of claim 1)
select, based on the action scores, a user interface action of the plurality of user interface actions; (rejected based on the rationale used in the rejection of claim 1)
apply the selected user interface action to the interactive video game; (rejected based on the rationale used in the rejection of claim 1) and
iteratively repeat the computing, selecting, and applying operations until the target observable state of the interactive video game is reached; (rejected based on the rationale used in the rejection of claim 1).

Claims 3, 12, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Nachum in view of Halder in view of Commons in view of Gendron-Bellemare et al. (US 2021/0110271, filed 10/07/2019, hereinafter as Gendron).

As per claims 3, 12, and 18, the rejection of the method of claim 1 is incorporated and further Nachum does not explicitly disclose wherein the current observable state of the interactive video game is represented by a numeric vector characterizing one or more parameters of a current graphical user interface (GUI) screen. However, Gendron, in an analogous art, discloses in ([0097] “For each training observation, the system determines a final update for the policy network parameters (204). Each final update can be represented in any appropriate numerical format (e.g., as a vector) and includes one or more final update values which (as will be described further with reference to 206) the system can use to adjust the current values of the policy network parameters.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Nachum and Halder with the teaching of Gendron because “Each final update 126 can be represented in any appropriate numerical format (e.g., as a vector) and includes one or more final update values which the training engine 116 can use to adjust the current values of the policy network parameters.” See Gendron [0083].

Claims 5, 13, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Nachum in view of Halder in view of Commons in view of MNIH et al. (US 2015/0100530, pub. 04/09/2015, hereinafter as MNIH).

As per claims 5, 13, and 20, the rejection of the method of claim 1 is incorporated. Nachum does not explicitly disclose the method further comprising: identifying the neural network among a plurality of neural networks associated with the interactive video game, by matching a version identifier of the neural network to a version identifier of the interactive video game. However, MNIH, in an analogous art, discloses in ([0010-0011] “copying some or all of a set of weights learnt by the second neural network to the first neural network. In effect, in embodiments, two instances of the same neural network are maintained, a first instance being used to generate the target values for updating the second, from time to time updating the first instance to match the second.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Nachum with the teaching of MNIH because “An approach of this type is advantageous in itself generating the experience through which the procedure (or data processor) learns: In effect each neural network provides an output which is used by the other.” See MNIH [0014].

Response to Arguments
Applicant’s arguments with respect to at least claim 1 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Argument: applicant's arguments are based on the amendments to at least the independent claims.
Response: a new reference has been cited to teach the amended limitations, as detailed above.


Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. See form 892.
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to AHAMED I NAZAR whose telephone number is (571)270-3174. The examiner can normally be reached 10 am to 7 pm Mon-Fri.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Stephen Hong, can be reached at 571-272-4124. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/AHAMED I NAZAR/Examiner, Art Unit 2178, 12/02/2022

/SHAHID K KHAN/Examiner, Art Unit 2178