Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Objections
Claim 17 is objected to because of the following informality: the claim does not end with a period. Appropriate correction is required.

Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic 
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may:  (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph.

Claim Rejections - 35 USC § 103

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2, 8-9, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Thomson (US 2018/0329998) in view of Sharma (US 2019/0095738).
Thomson discloses:
1. In a digital medium dialog system training environment using a simulated user system, a method implemented by a computing device, the method comprising:
generating, by the simulated user system of the computing device, a simulated user action (Thomson does not use the term “simulated” but discloses user actions such as detecting user actions pertaining to editing digital images, “Touch screen 212 has a touch-sensitive surface, sensor, or set of sensors that accepts input from the user based on haptic and/or tactile contact. Touch screen 212 and display controller 256 (along with any associated modules and/or sets of instructions in memory 202) detect contact (and any movement or breaking of the contact) on touch screen 212 and convert the detected contact [...] images) that are displayed on touch screen 212.”, 0054) that includes interaction with both a dialog system (voice/speech can be used, “The digital assistant can interpret the user's intent from the utterance and operationalize the user's intent into tasks.”, 0003; “The specific tasks or actions that a digital assistant decides to perform in response to a user's natural language input can be based on the policy models implemented by the digital assistant”, 0004; 0032) and an application, the simulated user interaction specifying an adjustment to an attribute of an object included in a digital image by a defined value (“Examples of other applications 236 that are stored in memory 202 include other word processing applications, other image editing applications, drawing applications, presentation applications, JAVA-enabled applications, encryption, digital rights management, voice recognition, and voice replication”, 0103; “In conjunction with touch screen 212, display controller 256, contact/motion module 230, graphics module 232, text input module 234, and camera module 243, image management module 244 includes executable instructions to arrange, modify (e.g., edit), or otherwise manipulate, label, delete, present (e.g., in a digital slide show or album), and store still and/or video images”, 0111);
initiating, by computing device, execution of a system action of the application by the dialog system based on the simulated user action, the system action selected based on a policy of a decision process model of the dialog system (policy models are optimized and decisions are made based on user input which can be speech/voice input e.g., “Systems and processes for optimizing dialogue policy decisions for digital assistants using implicit feedback are provided. In an example process, a user utterance is received. Based on a text representation of the user utterance, one or more user intents corresponding to the user utterance are determined. A policy action is selected from a plurality of candidate policy actions based on a belief state for the one or more user intents and a policy model. The policy action is performed, including outputting results of the policy action for presentation. A success score for the policy action is determined based on whether one or more predetermined types of implicit user feedback are detected after performing the policy action. A set of parameter values of the policy model is modified using the determined success score”, abstract; 
“reinforcement learning techniques can be implemented where user feedback indicating the success or failure of a determined policy action is utilized to optimize the policy model.”, 0025);
determining, by the simulated user system of the computing device, whether the execution of the system action by the application as initiated by dialog system accomplishes a goal of the simulated user action in adjusting the attribute of the object in the digital image by the defined value (“A success score for the policy action is determined based on whether one or more predetermined types of implicit user feedback are detected after performing the policy action. A set of parameter values of the policy model is modified using the determined success score”, abstract;
“reinforcement learning techniques can be implemented where user feedback indicating the success or failure of a determined policy action is utilized to optimize the policy model. The user feedback can be used to optimize a reward function of the policy model such that for subsequent user utterances, the policy action that maximizes the cumulative reward would be more likely to coincide with the user's actual desired goal. In this way, an optimal policy can be developed for the policy models of the digital assistant.”, 0025); 
generating, by the simulated user system of the computing device, reward data based on the determining (reinforcement learning inherently involves rewards and/or feedback, see e.g., abstract or 0025; “In examples where policy decision processing module 770 utilizes POMDPs, a policy action is selected from the set of candidate actions by solving the POMDPs. Specifically, policy decision processing module 770 applies, using policy models 772, a reward function (e.g. Q-function for Q-learning) to each candidate policy action and the policy action that is predicted to maximize the reward function (e.g., the total reward over the entire dialogue) is selected”, 0253); and
training, by the computing device, the policy of the decision process model of the dialog system based on the generated reward data (the various models can be trained using machine learning and/or neural networks, 0060; “unsupervised machine learning techniques can be applied to optimize the policy models implemented by digital assistants. Specifically, reinforcement learning techniques can be implemented where user feedback indicating the success or failure of a determined policy action is utilized to optimize the policy model.”, 0025).
Thomson fails to particularly call for the user input data to be referred to as simulated.   
Sharma teaches data can be labeled as simulated/synthetic (“the DRL algorithm is trained using a plurality of training examples (i.e., trajectories) generated by one or more users who have performed the detailed editing task using standard input devices (e.g., as a mouse or a stylus), or by the proposed system. In other embodiments, the DRL algorithm may be trained "synthetically", i.e., by creating examples by using one or more [...] synthetic data may also be used for training of the DRL algorithm in some embodiments”, 0030).
It would have been obvious to combine the references before the time of filing because they are in the same field of endeavor, and user input, whether by voice-to-text conversion or by hand or stylus pen movement, can be said to be simulated by the system that receives the user input. The user input data can be real-time data used to train and make predictions about the user intent, or the input data can be historical data that was used to initially train a model/algorithm in a reinforcement or feedback system.

2. The method as described in claim 1, wherein the simulated user action is based on the goal as taken from of a plurality of goals of an agenda stack (agenda stack reads on task flow or policy actions, “For example, to act on an inferred user intent, the system performs one or more of the following: identifying a task flow with steps and parameters designed to accomplish the inferred user intent, inputting specific requirements from the inferred user intent into the task flow; executing the task flow by invoking programs, methods, services, APIs, or the like; and generating output responses to the user in an audible (e.g., speech) and/or visual form”, 0032;
“A policy action is selected from a plurality of candidate policy actions based on a belief state for the one or more user intents and a policy model”, abstract;
“Determining a success score for the policy action based on implicit user feedback (e.g., detecting one of a plurality of types of user input that are each other than a response to a structured device query) can enable the accuracy of the policy model to be evaluated objectively. Specifically, in contrast to explicit user feedback, implicit user feedback can be a more reliable and accurate indicia of whether the performed policy action satisfies the user's actual desired goal for providing the user utterance”, 0007).
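
For illustration only, the training loop recited in claims 1 and 2 (simulated user action drawn from an agenda stack, system action selected by a policy, goal check, reward, policy update) can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Thomson, Sharma, or the application: the names TabularPolicy, run_session, and train, the three action labels, and the +1/-1 reward values are all hypothetical, and "accomplishing the goal" is simplified to the system action matching the goal label.

```python
import random
from dataclasses import dataclass, field

ACTIONS = ["open", "adjust", "close"]  # hypothetical system actions


@dataclass
class TabularPolicy:
    """Toy policy: one preference score per (user goal, system action) pair."""
    scores: dict = field(default_factory=dict)
    epsilon: float = 0.2  # exploration rate (assumed value)
    lr: float = 0.5       # learning rate (assumed value)

    def select(self, goal, rng):
        # Explore occasionally, otherwise take the best-scoring action.
        if rng.random() < self.epsilon:
            return rng.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: self.scores.get((goal, a), 0.0))

    def update(self, goal, action, reward):
        # Move the stored score toward the observed reward.
        old = self.scores.get((goal, action), 0.0)
        self.scores[(goal, action)] = old + self.lr * (reward - old)


def run_session(policy, agenda, rng):
    """Simulated user works through its agenda stack; returns the total reward."""
    stack = list(reversed(agenda))  # top of the stack is the first goal
    total = 0.0
    while stack:
        goal = stack.pop()                           # simulated user action
        action = policy.select(goal, rng)            # system action from the policy
        reward = 1.0 if action == goal else -1.0     # goal accomplished?
        policy.update(goal, action, reward)          # train on the reward data
        total += reward
    return total


def train(episodes=200, seed=0):
    rng = random.Random(seed)
    policy = TabularPolicy()
    agenda = ["open", "adjust", "adjust", "close"]
    for _ in range(episodes):
        run_session(policy, agenda, rng)
    policy.epsilon = 0.0  # evaluate greedily after training
    return policy, run_session(policy, agenda, rng)
```

Calling train() returns the trained toy policy together with the total reward of one greedy evaluation session over the four-goal agenda.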

8. The method as described in claim 1, wherein the reward data includes a reward based on whether the goal of the simulated user action of a single dialog of a plurality of dialogs within a dialog session is accomplished (reads on using reinforcement learning with user input that can be speech, “The plurality of candidate policy actions determined by policy decision processing module 770 can represent policy actions that are most likely to achieve the greatest cumulative reward for the entire dialogue. In this way, policy decision processing module 770 need only select from a small subset of all possible policy actions, which can enable a more efficient and accurate selection process”, 0250; “In some examples, policy decision processing module 770 applies reinforcement learning techniques to select the policy action to perform. In examples where policy decision processing module 770 utilizes POMDPs, a policy action is selected from the set of candidate actions by solving the POMDPs. Specifically, policy decision processing module 770 applies, using policy models 772, a reward function (e.g. Q-function for Q-learning) to each candidate policy action and the policy action that is predicted to maximize the reward function (e.g., the total reward over the entire dialogue) is selected”, 0253).

9. The method as described in claim 1, wherein the reward data includes a collective reward based on whether goals of a plurality of said simulated user actions of a single dialog session are accomplished (reads on using reinforcement learning with user input that can be speech, “The plurality of candidate policy actions determined by policy decision processing module 770 can represent policy actions that are most likely to achieve the greatest cumulative reward for the entire dialogue. In this way, policy decision processing module 770 need only select from a small subset of all possible policy actions, which can enable a more efficient and accurate selection process”, 0250; “In some examples, policy decision processing module 770 applies reinforcement learning techniques to select the policy action to perform. In examples where policy decision processing module 770 utilizes POMDPs, a policy action is selected from the set of candidate actions by solving the POMDPs. Specifically, policy decision processing module 770 applies, using policy models 772, a reward function (e.g. Q-function for Q-learning) to each candidate policy action and the policy action that is predicted to maximize the reward function (e.g., the total reward over the entire dialogue) is selected”, 0253).
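
The distinction between the per-dialog reward of claim 8 and the collective reward of claim 9 may be easier to see in a short sketch. The following is illustrative only and rests on assumptions: the function names, the +1/-1 reward values, and the optional discount factor are hypothetical and do not come from the cited references or the application.

```python
def per_dialog_rewards(outcomes, success=1.0, failure=-1.0):
    """One reward per dialog, based on whether that dialog's goal was accomplished."""
    return [success if accomplished else failure for accomplished in outcomes]


def collective_reward(outcomes, success=1.0, failure=-1.0):
    """A single reward for the whole session, summed over all of its goals."""
    return sum(per_dialog_rewards(outcomes, success, failure))


def discounted_return(rewards, gamma=1.0):
    """Cumulative reward over the dialogue, with later rewards weighted by gamma**t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# Example session: first and third goals accomplished, second goal missed.
session = [True, False, True]
rewards = per_dialog_rewards(session)      # one value per dialog
session_total = collective_reward(session)  # one value for the session
```

With gamma left at 1.0, discounted_return over the per-dialog rewards equals the collective reward, which is the sense in which a "total reward over the entire dialogue" aggregates the individual outcomes.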
20. (See rejection of claim 1.) In a digital medium dialog system training environment, a system comprising: means for executing a plurality of system actions (user input speech and edits images, abstract); means for initiating the plurality of system actions based on a policy of a decision process model as part of a dialog session (policy models are optimized and decisions are made based on user input which can be speech/voice input e.g., “Systems and processes for optimizing dialogue policy decisions for digital assistants using implicit feedback are provided. In an example process, a user utterance is received. Based on a text representation of the user utterance, one or more user intents corresponding to the user utterance are determined. A policy action is selected from a plurality of candidate policy actions based on a belief state for the one or more user intents and a policy model. The policy action is performed, including outputting results of the policy action for presentation. A success score for the policy action is determined based on whether one or more predetermined types of implicit user feedback are detected after performing the policy action. A set of parameter values of the policy model is modified using the determined success score”, abstract;
“reinforcement learning techniques can be implemented where user feedback indicating the success or failure of a determined policy action is utilized to optimize the policy model.”, 0025); and means for simulating user actions, as part of the dialog session, that:
cause the initiating means to initiate execution of respective said system actions, the dialog session defined using an agenda stack having a plurality of goals and the simulated user actions including interaction with both the initiating means and the executing means(agenda stack reads on task flow or policy actions, “For example, to act on an inferred user intent, the system performs one or more of the following: identifying a task flow with steps and parameters designed to accomplish the inferred user intent, inputting specific requirements from the inferred user intent into the task flow; executing the task flow by invoking programs, methods, services, APIs, or the like; and generating output responses to the user in an audible (e.g., speech) and/or visual form”, 0032; 
“A policy action is selected from a plurality of candidate policy actions based on a belief state for the one or more user intents and a policy model”, abstract;
“Determining a success score for the policy action based on implicit user feedback (e.g., detecting one of a plurality of types of user input that are each other than a response to a structured device query) can enable the accuracy of the policy model to be evaluated objectively. Specifically, in contrast to explicit user feedback, implicit user feedback can be a more reliable and accurate indicia of whether the performed policy action satisfies the user's actual desired goal for providing the user utterance”, 0007); 
determine whether the execution of the respective said system actions accomplish respective said goals of the simulated user actions (reinforcement learning); generate reward data based on the determination; and train the policy of the decision process model based on the generated reward data (training reads on using machine learning or neural networks). 
Thomson fails to particularly call for the user input data to be referred to as simulated.   
Sharma teaches data can be labeled as simulated/synthetic (“the DRL algorithm is trained using a plurality of training examples (i.e., trajectories) generated by one or more users who have performed the detailed editing task using standard input devices (e.g., as a mouse or a stylus), or by the proposed system. In other embodiments, the DRL algorithm may be trained "synthetically", i.e., by creating examples by using one or more [...] synthetic data may also be used for training of the DRL algorithm in some embodiments”, 0030).
It would have been obvious to combine the references before the time of filing because they are in the same field of endeavor, and user input, whether by voice-to-text conversion or by hand or stylus pen movement, can be said to be simulated by the system that receives the user input. The user input data can be real-time data used to train and make predictions about the user intent, or the input data can be historical data that was used to initially train a model/algorithm in a reinforcement or feedback system.

Claim Rejections - 35 USC § 103
Claims 3-6 are rejected under 35 U.S.C. 103 as being unpatentable over Thomson (US 2018/0329998) in view of Sharma (US 2019/0095738) and Yamanouchi (US 2009/0296137).
3. The method as described in claim 2, wherein the agenda stack starts with an open goal, includes at least one adjust goal, and ends with a close goal.

Yamanouchi teaches a sequence that starts with an open goal, includes at least one adjust goal, and ends with a close goal (open file, edit, close file, see e.g., 0033, 0039, 0052, 0055, 0081).
It would have been obvious to combine the references before the filing date because they are in the same field of endeavor and processors need to open files before reading or working on them.

4. The method as described in claim 2, further comprising generating, by the simulated user system of the computing device, the agenda stack as having an ordered sequence of the plurality of goals within a single dialog session (the agenda stack reads on the sequence of events in the policy model and/or the tasks needed to accept user input, interpret user’s intent/goals, perform action, receive feedback and/or determine reward or success score and repeat, see e.g., “For example, to act on an inferred user intent, the system performs one or more of the following: identifying a task flow with steps and parameters designed to accomplish the inferred user intent, inputting specific requirements from the inferred user intent into the task flow; executing the task flow by invoking programs, methods, services, APIs, or the like; and generating output responses to the user in an audible (e.g., speech) and/or visual form”, 0032).

5. The method as described in claim 4, wherein the plurality of goals is generated using open, adjust, close, undo, or redo goals (the agenda stack reads on the sequence of events in the policy model and/or the tasks needed to accept user input, interpret user’s intent/goals, perform action, receive feedback and/or determine reward or success score and repeat).
The combination of Thomson and Sharma fails to teach details of opening files before working on them.
Yamanouchi teaches a sequence that starts with an open goal, includes at least one adjust goal, and ends with a close goal (open file, edit, close file, see e.g., 0033, 0039, 0052, 0055, 0081).
It would have been obvious to combine the references before the filing date because they are in the same field of endeavor and processors need to open files before reading or working on them.

6. The method as described in claim 5, wherein: the open goal includes a slot specifying data that is to be processed by the application; and the adjust goal includes a slot specifying an attribute and a slot specifying an attribute value for the attribute (reads on inherent time slots; Yamanouchi teaches an open goal, at least one adjust goal, and a close goal (open file, edit, close file, see e.g., 0033, 0039, 0052, 0055, 0081)).
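
For reference, the agenda stack structure recited in claims 3 through 6 (an open goal, one or more adjust goals, and a close goal, each goal carrying slots) can be pictured with the sketch below. It is illustrative only: the Goal class, the slot names "data", "attribute", and "value", and the helper functions are hypothetical and are not drawn from the application or from any cited reference.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Goal:
    kind: str          # "open", "adjust", "close", "undo", or "redo"
    slots: tuple = ()  # (slot name, slot value) pairs


def build_agenda(data, adjustments):
    """Ordered sequence of goals: open, then one adjust per item, then close."""
    agenda = [Goal("open", (("data", data),))]
    for attribute, value in adjustments:
        agenda.append(Goal("adjust", (("attribute", attribute), ("value", value))))
    agenda.append(Goal("close"))
    return agenda


def is_well_formed(agenda):
    """Check the claim 3 shape: starts with open, has an adjust, ends with close."""
    return (
        len(agenda) >= 3
        and agenda[0].kind == "open"
        and agenda[-1].kind == "close"
        and any(goal.kind == "adjust" for goal in agenda[1:-1])
    )


# Example: open an image, raise brightness by 10, lower contrast by 5, close.
agenda = build_agenda("photo.jpg", [("brightness", 10), ("contrast", -5)])
```

Under this reading, a slot is a named field of a goal (for example the attribute to adjust and its value) rather than a position in time.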
Allowable Subject Matter
Claims 7 and 10 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Claim 17 is objected to for not ending with a period.
REASONS FOR ALLOWANCE

The following is an Examiner's statement of reasons for allowance: Claims 11-16, 18-19 are considered allowable since when reading the claims in light of the specification, as per MPEP §2111.01 or Toro Co. v. White Consolidated Industries Inc., 199 F.3d 1295, 1301, 53 USPQ2d 1065, 1069 (Fed. Cir. 1999), none of the references of record alone or in combination disclose or suggest the combination of limitations specified in the independent claims including  generate an agenda stack having an ordered sequence of a plurality of goals by selecting from a plurality of predefined goals and setting values for respective said goals: execute simulated user actions, as part of a dialog session, that cause the dialog system to initiate respective 
Any comments considered necessary by Applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee.  Such submissions should be clearly labeled "Comments on Statement of Reasons for Allowance."

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Mankovskii (US 2020/0134103) teaches using simulated data and training with reinforcement learning (“Or some embodiments may train a model to simulate human feedback based on a training set of human feedback and then train the caption generator 48 on feedback from the model configured to simulate a human agent, e.g., with reinforcement learning.”, 0064; “Some embodiments perform simulation-based generation of natural language”, 0109).


Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DAVID R VINCENT whose telephone number is (571) 272-3080. The examiner can normally be reached Monday through Friday, approximately 8:00 to 4:30.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Alexey Shmatov, can be reached at (571) 270-3428. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.