Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Claims 1-6, 8-13, 15-16 and 18-20 are pending.  Claims 1, 8 and 15 are independent and have been amended.  Claims 3 and 10 have also been amended.
This Application was published as U.S. 2019/0206402.
Priority used for search is 12/29/2017.

This Application is related to a fair number of U.S. applications as follows:
16/233,539, published as U.S. 2019/0202061.
16/233,566, issued as U.S. 10567570.
16/233,640, NOA mailed 7/5/2022.
16/233,678, issued as U.S. 11222632.
16/233,716, NOA mailed 6/3/2022.
16/233,786, issued as U.S. 11003860.
16/233,829, issued as U.S. 11024294.
16/233,939, issued as U.S. 10967508.
16/233,986, issued as U.S. 10994421.
16/234,041, issued as U.S. 11331807.

Applicant’s amendments and arguments are considered but are either unpersuasive or moot in view of the new grounds of rejection.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 9/1/2022 has been entered.
Response to Arguments
Applicant’s arguments are moot in view of the new grounds of rejection.
Breazeal (U.S. 2017/0206064), identified in the PCT search report, is used as the third reference.  To distinguish Breazeal, Applicant may consider the following.  Define the dialog scene with further particularity as an environment in the vicinity of the user in which objects may be detected by the sensors of the device.  Further, “at least one” or “one or more” contradicts the “spatial relationships” of the Claim: if there is only one object, its only relationship is with itself.  Define the objects with particularity so as to exclude IOT objects.  Recite more than one object, define the spatial relationships as being among the detected objects, and have the device discover the locations of the objects.  With these changes, Breazeal is distinguished.  Figure 4C of the instant Application concerns what the user is looking at.  Gaze detection to improve the dialog is well known in the art, including in Breazeal.  Figure 11 and Figures 8A and 8B concern the spatial relationships of Lego parts with one another: the device can see the Lego parts and, based on what it sees, recommends what the user must do next.  At least two objects are needed for that.  Instruction regarding operating a device or equipment is known; see below.
Another reference pertaining to a dialog system that recognizes a device, machine, or piece of equipment and talks to the user about it is Prevost (U.S. 6570555), which helps the user operate a piece of equipment.
Filev (U.S. 2009/0055190) and Perez (U.S. 2017/0316777) help the user with the same device (Vehicle and Phone, respectively) that is used to conduct the dialog.  So, these references do not identify another device in the environment of the user.
Ehsani (U.S. 2014/0108019) communicates with the devices of an IOT system by interfacing with their sensors.  These devices therefore cannot be inanimate objects such as a toy or a chair.  “…The smart home system can receive input from sensors or any other machines with which it is interfaced…”  Abstract.
Chang (U.S. 2008/0172173) knows the locations of user devices and points of interest (such as a library or hospital) and provides them to the user.  These are not objects such as a chair or a toy that are in the vicinity of the user and can be detected by smartphone sensors such as a camera.

Amended Claim 1 provides:
1. A method implemented on at least one machine including at least one processor, storage, and communication platform capable of connecting to a network for an automated dialogue companion, 
the method comprising: 
receiving multimodal input data associated with a user engaged in a dialogue with a predetermined goal on a certain topic in a dialogue scene, wherein the dialogue is managed based on a dialogue tree having a plurality of nodes, each of which is associated with a utility and some of which have branches representing alternative conversations of the dialogue, and the multimodal input data capture a communication from the user and information surrounding the dialogue scene;
analyzing the multimodal input data to recognize one or more objects in the dialogue scene and spatial relationships of the one or more objects in the dialogue scene; 
generating, based on the one or more objects in the dialogue scene and the spatial relationships of the one or more objects in the dialogue scene a current state of the dialogue and a context of the dialogue, wherein the current state of the dialogue corresponds to a node in the dialogue tree; 
accessing first utilities associated with first one or more branches of the node with respect to the current state of the dialogue, wherein the first utilities characterize effectiveness of different dialogue strategies represented by the first one or more branches with respect to the user; and
determining a response communication to be conveyed to the user in response to the communication in accordance with the first utilities and second utilities associated with respectively a plurality of branches of the first one or more branches, wherein the response communication maximizes look-ahead expected utilities given the current state, 
wherein both the first and second utilities are learned based on historic dialogue data with respect to the goal of the dialogue on the certain topic. 
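
For orientation only, the look-ahead selection recited in the last two limitations can be sketched as follows. This is a minimal illustration under stated assumptions, not the Applicant's implementation: the `Node` class, the branch probabilities, and the additive combination of first and second utilities are all hypothetical choices made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One dialogue state. `utility` stands in for a learned utility (hypothetical values)."""
    name: str
    utility: float = 0.0
    # Each branch is (probability that the dialogue follows it, child node).
    branches: list = field(default_factory=list)


def expected_utility(node, depth):
    """Utility of `node` plus the probability-weighted utility of its sub-branches."""
    if depth == 0 or not node.branches:
        return node.utility
    return node.utility + sum(
        p * expected_utility(child, depth - 1) for p, child in node.branches
    )


def best_response(current, depth=2):
    """Pick the branch of `current` with the largest look-ahead expected utility.

    The probabilities on the branches of `current` are ignored because the
    agent, not chance, chooses among its own candidate responses.
    """
    return max(
        (child for _, child in current.branches),
        key=lambda child: expected_utility(child, depth - 1),
    )


# Hypothetical example: "explain" has the higher first utility (0.6 vs 0.4),
# but "hint" wins once the second utilities one level deeper are included.
explain = Node("explain", 0.6, [(1.0, Node("user_confused", 0.2))])
hint = Node("hint", 0.4, [(0.5, Node("user_solves", 1.0)), (0.5, Node("user_stuck", 0.0))])
root = Node("current_state", branches=[(1.0, explain), (1.0, hint)])
print(best_response(root).name)
```

A purely greedy choice on the first utilities would select "explain"; the look-ahead choice differs because it also weighs the utilities of the sub-branches.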

Amended Claim 3 provides:
3. The method of claim 2, wherein the step of analyzing the multimodal input data comprises at least one of: 
analyzing the audio data to recognize content of the communication from the user, characteristics of the communication indicative of an emotion conveyed in the communication, and acoustic sound in the dialogue scene; and 
analyzing the visual data to recognize: a facial expression of the user, an emotion associated with the facial expression, an act performed by the user, and one or more objects in the dialogue scene and the spatial relationships thereof.

Support from the published Application:
“[0083] In this exemplary embodiment, Layer 2 710 may include input understanding processing components such as, e.g., an acoustic information processing unit 710-1, a visual info processing unit 710-2, . . . , and an emotion estimation unit 710-3. The input may be from multiple sensors either deployed on the user device or on the agent device. Such multimodal information is used to understand the surrounding of the dialogue, which may be crucial for dialogue control. For example, the speech processing unit 710-1 may process acoustic input to understand, e.g., the speech uttered by the agent, the tone of the user's speech, or the user or the environmental sound present at the dialogue scene. The visual information processing unit 710-2 may be used to understand, based on e.g., video information, the user's facial expression, the surround of the user in the dialogue scene such as presence of a chair or a toy and such an understanding may be used when the dialogue manager decides how to proceed with the dialogue….”

“[0086] In representing a user's ongoing mindset, different world models 730-5 and a dialogue context 730-6 established from the perspective of the user are used to represent the mindset of the user. The world models 730-5 for a user may be set up based on spatial, temporal, and causal representations. Such representations may be derived based on what is observed in the dialogue and characterizes, e.g., what is observed in the scene (e.g., objects such as desk, chair, computer on the desk, a toy on the floor, etc.) and how they are related (e.g., how objects spatially related). In this exemplary implementation, such representations may be developed using AND-OR graphs or AOG. Spatial representation of objects may be represented by S-AOG. The temporal representation of the objects over time may be represented by T-AOG (e.g., what is done on which object over time). Any causal relationship between temporal actions and spatial representation may be represented by C-AOG, e.g., when an action of moving a chair is performed (the spatial location of the chair is changed).”
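
The point that spatial relationships require at least two objects can be illustrated with a short sketch. It is only an assumption-laden illustration: it does not implement the S-AOG of the Application, and the object names, the 2-D centre coordinates, and the `left_of`/`right_of` labels are hypothetical.

```python
from itertools import combinations


def spatial_relations(objects):
    """Return pairwise left/right relations among detected objects.

    `objects` maps an object name to a hypothetical (x, y) centre position.
    Objects with equal x are reported as "right_of" for simplicity.
    """
    relations = []
    for (a, (ax, _)), (b, (bx, _)) in combinations(sorted(objects.items()), 2):
        relations.append((a, "left_of" if ax < bx else "right_of", b))
    return relations


print(spatial_relations({"chair": (1, 0)}))                 # a single object yields no pairs
print(spatial_relations({"chair": (1, 0), "toy": (3, 0)}))  # two objects yield one relation
```

With a single detected object the list of relations is empty, which is why reciting “one or more objects” sits uneasily with reciting their spatial relationships.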

35 U.S.C. 112(f) Claim Interpretation
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. 
Such claim limitation(s) is/are: “device,” “user interaction engine,” and “dialogue manager” in Claim 15 and “utility learning engine” in Claim 19.  These limitations are generic in the context of the art, do not refer to any specific structure, and serve only as placeholders for the structure that performs the associated function(s) without providing any information about what that structure is.  MPEP 2181 I A.
Applicant has acknowledged the interpretation.  (Applicant’s Response, p. 14.)

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-6, 8-13, 15-16, and 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Akolkar (U.S. 2014/0337010) in view of Moturu (U.S. 2017/0213007) and further in view of Breazeal (U.S. 2017/0206064).
Regarding Claim 1, Akolkar teaches:
1. A method implemented on at least one machine including at least one processor, storage, and communication platform capable of connecting to a network for an automated dialogue companion, [Akolkar, see Figure 1 provided above for the hardware components including “processing unit 16,” “memory 28,” and “I/O interfaces 22.”  Figures 4 and 11 showing the “conversational interface 416” and Figure 4 showing the “dialog engine 420.”]


the method comprising: 
receiving multimodal input data associated with a user engaged in a dialogue with a predetermined goal on a certain topic in a dialogue scene, [Input in Akolkar is in “natural language” form and Figure 17 shows examples of “initial input” such as “I want to pay my employees” or “I want to arrange a company event.”  The examples of Figure 17 show that the user is “engaged in a dialogue with a predetermined goal on a certain topic.” The “candidate services” under the column “#Category Candidates” of Figure 17 teach the “topics” of the Claim such as “Events, Music, Travel, Government, Calendar,” for example.  Akolkar does not specify the spoken or written form of the input only that it is in natural language and subject to semantic analysis.]  [Akolkar does not collect sensor or multimodal data.]


wherein the dialogue is managed based on a dialogue tree having a plurality of nodes, each of which is associated with a utility and some of which have branches representing alternative conversations of the dialogue, the multimodal input data capture a communication from the user and information surrounding the dialogue scene; [ Akolkar does not mention a “dialog tree” but a FSM (finite state machine) is very much like a dialog tree; a FSM includes states/nodes/Vertex and transitions between states/branches/Edge.  “[0109] G(V, E), where vertex V denotes the context state and edge E denotes the transition between contexts. FIG. 12 shows exemplary topology of the logic flow. Based on the discussion with domain experts, initially define eight main context states 1202, 1204, 1230, 1232, 1234, 1236, 1238, 1240 (along with several exception handling states, omitted for clarity and brevity) and fourteen transitions (labeled arrows between the states). One or more embodiments of CSM {cloud services marketplace} use a finite state machine to store the information of the conversation logic flow. The topology of the logic flow can be freely changed by adding more context states and/or updating transitions.”] 


analyzing the multimodal input data to recognize one or more objects in the dialogue scene and spatial relationships of the one or more objects in the dialogue scene;
generating, based on the one or more objects in the dialogue scene and the spatial relationships of the one or more objects in the dialogue scene a current state of the dialogue and a context of the dialogue, wherein the current state of the dialogue corresponds to a node in the dialogue tree;  [Akolkar teaches that the current state of the dialog is represented by a Vertex of FSM which is like the node of a tree.  “[0105] … At the very beginning, the customer 1102 tells CSM about his or her requirement via the conversational interface 416…. Receiving the retrieved data, the Dialog {Engine} updates the meta-data accordingly, including the user's input and the index of candidate services, and, at the same time, updates the conversation context 1104 according to the logic flow in FIG. 12 to continue the conversation….”  In Akolkar “conversation context” and “state” are the same.  “[0110] At the beginning of each conversation (recognized as the creation of a new session), Dialog Engine 420 creates the profile for the current session, and sets the current state as Service Category Identification 1202. In this state, CSM assumes that the customer's input is related to the scenario of looking for proper service categories. ….”  “[0111] More particularly, the customer inputs a requirement at 1202 and the system attempts to identify the corresponding pertinent category. …”  The State/Node/Vertex is updated as more input is provided by the user. ]
accessing first utilities associated with first one or more branches of the node with respect to the current state of the dialogue, [Akolkar, Figure 12, shows going from a node/state/vertex along a branch/transition/edge to another state, and each of these transitions/branches has a certain “effectiveness” / “utility” in getting the user to the user's final goal.  Akolkar maximizes the effectiveness/utility of a line of questions, and that is how it decides which question to ask next / which branch to take.  “First Utilities” are defined as the “effectiveness of a dialog strategy” and Akolkar teaches calculating the “effectiveness of a question sequence” or “eff(Q)” which teaches the “First Utilities” of the Claim.  See below for quotes from Akolkar.  Figure 12 shows the state diagram (FSM) used by Akolkar where the states/vertices teach the “nodes” of the Claim.  From each state/node/vertex, taking each of the possible transitions/branches/edges yields a different degree of effectiveness (eff(Q)) / utility.  See the description of “Service Filtering” [0113]-[0130] and particularly [0125]-[0126] where eff*(Q) is discussed as the “best question sequence” which maximizes the effectiveness of the overall strategy and permits the user to reach the goal faster.  The “Conversation Flow Control” uses a “cloud services marketplace (CSM).”]
wherein the first utilities characterize effectiveness of different dialogue strategies represented by the first one or more branches with respect to the user; and [Akolkar teaches that to best assist the user with the most appropriate service, the system has to ask more questions. But on the other hand, too many questions irritate the user and therefore the system tries to come up with an optimal sequence of questions to ask to get to the goal of the user faster and with the least irritation.  The “sequence of questions Q” in Akolkar teaches the “dialogue strategy” of the Claim.  The “eff(Q)” in Akolkar teaches the “first utilities” of the Claim because it characterizes the “effectiveness of different dialogue strategies.”  See [0119] below.  “[0118] The more questions the customer answers, the more unsatisfactory services can be pruned. However, too many questions may degrade the quality of the user experience. To make the service filtering effective, one or more embodiments provide a novel method referred to as Iteration-Min to reduce the iterations for the questioning and answering.”  “[0119] In one or more embodiments, to reduce the number of iterations, find a sequence of questions Q={Q1, Q2 . . . Qn} with the least length to rule out all unsatisfied candidate services via capability or configuration. Quantitatively, use eff(Q) to evaluate how effectively the sequence can filter the candidates. The effectiveness of a question sequence can be considered as the sum of the effectiveness of all its questions, i.e., eff(Q)=Ʃi eff(Qi). Concretely, the effectiveness of a question is qualified as the expected number of candidates it can prune, i.e. nprune, based on the customer's potential answer.”  “[0123] In the above formulas, the probabilities (p(yes), p(no), p(oi)) can be estimated, for example, via an empirical distribution obtained from customer history. ….”  “[0109] … FIG. 12 shows exemplary topology of the logic flow. Based on the discussion with domain experts, initially define eight main context states 1202, 1204, 1230, 1232, 1234, 1236, 1238, 1240 (along with several exception handling states, omitted for clarity and brevity) and fourteen transitions (labeled arrows between the states)….”]
determining a response communication to be conveyed to the user in response to the communication in accordance with the first utilities and second utilities associated with respectively a plurality of branches of the first one or more branches, [Akolkar, Figure 11, the “dialog Engine 420” decides the “response” of the machine which may be another “question” based on the “exemplary logic flow” shown in Figure 12 which is in the form of a “finite state machine.”  (See [0109].)  The finite state machine of Figure 12 is based on the current state of the dialog and includes “service filtering 1232 and 1204” steps which “prunes” the unsatisfactory services ([0118]).  The pruning is related to the effectiveness of the “sequence of questions Q” which is eff(Q)/utility.  ([0119]-[0121]).  So the response is determined using the finite state machine of Figure 12, shown as G(V,E) as a function of a current state of the conversation and the eff(Q)/utility of a “sequence of questions.”  The “utility” of the Claim is taught by effectiveness function eff(Q) of Akolkar which changes dynamically as the conversation proceeds and teaches first, second, etc utility of the Claim.  “[0126] In at least some instances, the best question sequence cannot be obtained via pre-computing, because the variables … used for computing the effectiveness of a single question would change according to the answer of the previous question, making eff(Qi+1) depend on the answer of Qi. Therefore, one or more embodiments of CSM dynamically compute the next best question based on the previous answer on-the-fly. …”   “Utility” is defined as “first utilities characterize effectiveness of different dialogue strategies represented by the first one or more branches” which determines whether it is better to go down the first branch or the second branch, for example.  
The equation for eff(Q) in Akolkar ([0120], equation 1) includes p(yes) and p(no) and equation 2 in [0121] includes p(oi) which is the probability that the user selects option oi.  Each of Yes, or No or option Oi sends the dialog and hence the FSM (or decision tree) down a different branch and teaches “first utilities and second utilities associated with respectively a plurality of branches “ of the Claim. ]
wherein the response communication maximizes look-ahead expected utilities given the current state, [Akolkar, for “Look-ahead expected utilities given the current state” see [0126] and equation (5) where “Therefore, one or more embodiments of CSM dynamically compute the next best question based on the previous answer on-the-fly.” ]
wherein both the first and second utilities are learned based on historic dialogue data with respect to the goal of the dialogue on the certain topic. [Akolkar, The probability associated with each potential answer by the user in the formula eff(Q) which measures the effectiveness of a question Q (see [0121]) is obtained from the “customer history” which teaches the “historic dialogue data” of the Claim.  “[0123] In the above formulas, the probabilities (p(yes), p(no), p(oi)) {oi is the ith option} can be estimated, for example, via an empirical distribution obtained from customer history. ….”]
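
The quoted description of eff(Q) can be made concrete with a small sketch. The Office action does not reproduce Akolkar's equations (1), (2), or (5), so the expectation below is an assumed form that merely follows the quoted wording (“the expected number of candidates it can prune”); the service names, capability sets, and probabilities are hypothetical.

```python
def question_effectiveness(question, candidates, p_yes):
    """Expected number of candidates pruned by one yes/no question (assumed form).

    A "yes" answer is assumed to prune services lacking the capability, and a
    "no" answer to prune services having it. `p_yes` stands in for probabilities
    estimated from customer history.
    """
    have = sum(1 for caps in candidates.values() if question in caps)
    lack = len(candidates) - have
    return p_yes[question] * lack + (1.0 - p_yes[question]) * have


def next_best_question(questions, candidates, p_yes):
    """Choose the question with the highest expected pruning for the current candidates."""
    return max(questions, key=lambda q: question_effectiveness(q, candidates, p_yes))


def apply_answer(question, answer_is_yes, candidates):
    """Keep only the candidates consistent with the customer's answer."""
    return {
        name: caps
        for name, caps in candidates.items()
        if (question in caps) == answer_is_yes
    }


# Hypothetical candidate services and history-based probabilities.
candidates = {"A": {"payroll", "tax"}, "B": {"payroll"}, "C": {"payroll"}, "D": {"tax"}}
p_yes = {"payroll": 0.9, "tax": 0.5}

first = next_best_question(["payroll", "tax"], candidates, p_yes)
remaining = apply_answer(first, True, candidates)
print(first, sorted(remaining))
```

Because `remaining` changes with each answer, the effectiveness of the following question has to be recomputed after every turn, which mirrors the on-the-fly computation quoted from [0126].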

Akolkar does not teach a multimodal input.
Akolkar does not teach the use of a dialog tree, although a FSM does the same job.
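
The similarity between a finite state machine and a dialog tree noted above can be sketched as follows. This is an illustrative assumption, not Akolkar's implementation: only the two state names are taken from the quoted passages, while the third state and all event labels are hypothetical.

```python
class DialogFSM:
    """A graph G(V, E) stored as a transition table keyed by (state, event)."""

    def __init__(self, initial):
        self.state = initial
        self.transitions = {}

    def add_transition(self, src, event, dst):
        self.transitions[(src, event)] = dst

    def branches(self, state):
        """Outgoing edges of a state, which play the role of a tree node's branches."""
        return {ev: dst for (src, ev), dst in self.transitions.items() if src == state}

    def fire(self, event):
        key = (self.state, event)
        if key not in self.transitions:
            raise ValueError(f"no transition for {event!r} from {self.state!r}")
        self.state = self.transitions[key]
        return self.state


fsm = DialogFSM("service_category_identification")
# Event labels and the "exception_handling" state are hypothetical.
fsm.add_transition("service_category_identification", "category_identified", "service_filtering")
fsm.add_transition("service_category_identification", "input_not_understood", "exception_handling")
fsm.add_transition("service_filtering", "question_answered", "service_filtering")

print(fsm.branches("service_category_identification"))
print(fsm.fire("category_identified"))
```

Viewed from any one state, the outgoing transitions are alternatives in the same way as the branches of a tree node; the difference is that an FSM permits loops and shared successor states.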
Moturu teaches:
1. A method implemented on at least one machine including at least one processor, storage, and communication platform capable of connecting to a network for an automated dialogue companion, [Moturu, the user device in Figure 2 or Figure 6 is shown as a “mobile device” which would inherently include all of the hardware components recited.  See also [0068].]


the method comprising: 
receiving multimodal input data associated with a user engaged in a dialogue with a predetermined goal on a certain topic in a dialogue scene, [Moturu receives input as speech or text of the user, see Figure 6, e.g., and also receives sensor data regarding mobility of the user and his physical condition.  The dialog as shown in Figure 6 is about the topic of the health of the user.  “[0029] In some variations, Block S120 can include receiving one or more of: location information, movement information (e.g., related to physical isolation, related to lethargy), device usage information (e.g., screen usage information, physical movement of the mobile device, etc.), device authentication information (e.g., information associated with authenticated unlocking of the mobile device), and/or any other suitable information….”  “[0030] In some variations, Block S120 can include collecting biometric data associated with user conditions, such as from electronic health records, sensors of mobile devices and/or supplemental medical devices, user inputs (e.g., entries by the user at the mobile device), and/or other suitable sources. Biometric data can include one or more of: electroencephalogram (EEG) data, electrooculogram (EOG) data, electromyogram (EMG) data, electrocardiogram (ECG) data, airflow data (e.g., nasal airflow, oral airflow, measured by pressure transducers, thermocouples, etc.), pulse oximetry data, sound probes, polysomnography data, family conditions, genetic data, microbiome data, and/or any other biometric data.”  “[0033] … Additionally or alternatively, the device event data can include data from sensors (e.g., accelerometer, gyroscope, other motion sensors, other biometric sensors, etc.) implemented with the mobile device and/or other suitable devices,….” “[0051] … In a specific example, the method 100 can include: applying a machine learning communication model to tag a communication with a topic (e.g., where the communication model is trained on a training dataset including text messages and associated topic labels); mapping the topic to a subset of potential automated communications associated with the topic; and selecting an automated communication to transmit to the user from the subset of potential automated communications….”]



wherein the dialogue is managed based on a dialogue tree having a plurality of nodes, each of which is associated with a utility and some of which have branches representing alternative conversations of the dialogue, the multimodal input data capture a communication from the user and information surrounding the dialogue scene; [Moturu teaches that its “communication model” can include a “communication decision tree model” including “nodes and branches.”  A decision tree by definition includes branches that represent alternatives.  Moturu takes in multimodal sensor data including for example “mobility” of the user which teaches the “information surrounding the dialogue scene” of the Claim.  Figure 3, S120: mobility supplemental dataset.  [0028]-[0030].   See “[0050] … applying a communication model can include applying a communication decision tree model, such as a decision tree model including internal nodes and branches selected based on correlations between automated communications and user outcomes (e.g., in relation to user conditions)…..”]
analyzing the multimodal input data to recognize one or more objects in the dialogue scene and spatial relationships of the one or more objects in the dialogue scene; [Moturu teaches multimodal input.]
generating, based on the one or more objects in the dialogue scene and the spatial relationships of the one or more objects in the dialogue scene a current state of the dialogue and a context of the dialogue, wherein the current state of the dialogue corresponds to a node in the dialogue tree;  [Moturu, Figure 1, step S150 determines a tailored communication plan for the user based on the data provided to the mobile device.  Moturu teaches the use of a decision tree which has nodes/states corresponding to the selected communications and responses:  “[0050] Determining a communication plan in Block S150 can additionally or alternatively include generating and/or applying a communication model. Communication models preferably output one or more components of a communication plan based on communication-related features and/or datasets, but any suitable inputs can be leveraged by communication models for generating any suitable outputs. The communication model can include any one or more of: probabilistic properties, heuristic properties, deterministic properties, and/or any other suitable properties. In a variation, the communication model can include weights assigned to different communication-related features and/or datasets. For example, features extracted from user-provider communications can be weighted more heavily than features extracted from communications between a user and a non-care provider. In another example, mobility behaviors associated with promoted therapeutic interventions (e.g., user locations where a therapeutic intervention is provided) can be weighted more heavily than mobility behaviors associated with user daily activities. In another variation, applying a communication model can include applying a communication decision tree model, such as a decision tree model including internal nodes and branches selected based on correlations between automated communications and user outcomes (e.g., in relation to user conditions). 
In a specific example, a communication decision tree model can start with an initial automated communication (e.g., to be transmitted to the user), and subsequent automated communications can be selected and transmitted based on user responses (e.g., associated user meaning, user sentiment, etc.) to communications. However, applying communication decision tree models can be performed in any suitable manner.”]


accessing first utilities associated with first one or more branches of the node with respect to the current state of the dialogue, wherein the first utilities characterize effectiveness of different dialogue strategies represented by the first one or more branches with respect to the user; and [Moturu teaches that its “communication plan” is designed to “optimize user outcomes” ([0051]) which conveys the same idea but is not specific with respect to branches and nodes of the decision tree.]
determining a response communication to be conveyed to the user in response to the communication in accordance with the first utilities and second utilities associated with respectively a plurality of branches of the first one or more branches, wherein the response communication maximizes look-ahead expected utilities given the current state,  [Moturu teaches that the communication model applied at step S150 of Figures 1 and 2, which determines a “tailored communication plan” to converse with the patient/user maximizes a reward which optimizes user outcome based on all the inputs shown in steps S110, S120, S130, S140, and S145 of Figures 1 and 2.  “[0051] … In specific examples, Block S150 can include training and applying a reinforcement learning model (e.g., deep reinforcement learning model), such as a reinforcement learning model for maximizing a reward (e.g., determining components of a communication plan for optimizing user outcomes, improving user conditions, user openness, and/or other suitable user parameters, etc.); a reinforcement learning model (e.g., inverse reinforcement learning model) for mimicking an observed behavior (e.g., care provider communication behavior in user-provider communications where the user was receptive, etc.); and/or any other suitable type of reinforcement learning models. ….”]
wherein both the first and second utilities are learned based on historic dialogue data with respect to the goal of the dialogue on the certain topic. [Moturu teaches that the communication model applied at step S150 of Figures 1 and 2, which determines a “tailored communication plan” to converse with the patient/user, is trained on “historic communications” and their “context” based on all the inputs shown in steps S110, S120, S130, S140, and S145 of Figures 1 and 2.   “[0045] In a variation of Block S150, determining a communication plan (and/or associated components) can be based on historic communications (e.g., historic automated communications, user-provider communications, associated user responses, communications associated with other users such as users sharing a subgroup, etc.). Determining a communication plan based on historic communications can include one or more of, in relation to historic communications: determining contextual parameters (e.g., based on data from Blocks S110-S145, etc.), extracting meaning (e.g., user meaning associated with user inputs), determining sentiment (e.g., emotional sentiment associated with a communication, with a therapeutic intervention, with an application feature, etc.), topic tagging (e.g., detecting, categorizing, and/or otherwise tagging communications with topics, which can be used for determining and/or promoting therapeutic interventions, identifying transition events for transitioning between care providers and an automated communication determination system for transmitting communications, summarizing communications for subsequent analysis, updating communication plans, searching communications, determining content components and/or format components, etc.), summarizing communication content (e.g., for documentation such as in relation to the Health Insurance Portability and Accountability Act and/or other regulations; for supporting care providers by providing summaries of historic communications with the user; for topic tagging; etc.), and/or any other suitable processes.  ….”]

Akolkar and Moturu pertain to natural language conversational dialog systems where a machine is trained to conduct a dialog with a user, and both teach optimizing a conversational model to provide optimized responses or a course of dialog to the user.  It would have been obvious to combine the system of Akolkar, which relies on natural language input alone, with the system of Moturu, which includes multimodal sensor data pertaining to the user, in order to arrive at a more comprehensive set of inputs for determining and optimizing what the next step of dialog should be (when the dialog depends on what is happening to the user in addition to what the user is saying), as combining prior art elements according to known methods to yield predictable results.  It would also have been obvious to replace the FSM of Akolkar, which is used to obtain states of the dialog, with the system of Moturu, which uses a decision tree to obtain a next state of dialog, as an equivalent or simpler system and as simple substitution of one known element for another to obtain predictable results. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.

Akolkar and Moturu do not teach including the spatial data of objects around the user.
Breazeal teaches:
analyzing the multimodal input data to recognize one or more objects in the dialogue scene and spatial relationships of the one or more objects in the dialogue scene; [Breazeal, Figure 27, “Robot Discovery.”  “[0448] As inexpensive IOT devices become common, it will be possible to utilize them in entertaining ways. A PCD 100, with spatial mapping, object detection, and audio detection is ideally equipped to control these devices in coordination with music, video and other entertainment media. A well-orchestrated performance will delight its audience.”  Some of the devices discovered by the Robot are not in the “dialog scene” but some are.]
generating, based on the one or more objects in the dialogue scene and the spatial relationships of the one or more objects in the dialogue scene a current state of the dialogue and a context of the dialogue, wherein the current state of the dialogue corresponds to a node in the dialogue tree; [The PCD 100 of Breazeal requests “permission” to access and control the devices and therefore generates a dialog with the user.  “[0451] Consider a family with a home in which IOT lights and speakers have been installed in, say, the kitchen and adjacent family room. This family, being adopters of new technology, may purchase a personal PCD 100 that may be deployed in the kitchen. As part of its setup procedure, the social robot may discover the types and locations of the family's IOT devices and request permission to access and control them. If permission is granted, the PCD 100 may offer to perform a popular song. The social robot then uses its own sound system and expressive physical animation to begin the performance. Then, to the delight of the family, the IOT lights in the kitchen and family room begin to pulse along with the music, accentuating musical events. Then the IOT speakers begin playing, enhancing the stereo/spatial nature of the music.”]
Akolkar, Moturu, and Breazeal pertain to natural language conversational dialog systems where a machine is trained to conduct a dialog with a user, and it would have been obvious to add the information obtained by the system of Breazeal about the objects around the user in order to enhance the conversational experience and move the dialog forward most expeditiously.  This combination falls under combining prior art elements according to known methods to yield predictable results or use of a known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.

Regarding Claim 2, Akolkar teaches:
2. The method of claim 1, wherein the multimodal input data include at least audio data, visual data, text data, and haptic data. [Akolkar, the input is expressly said to be natural language and not specified as voice or text which are easily interchangeable.  However, in Figure 5, the input is shown to be text input into the interface. ]
This limitation is interpreted as including ALL of the listed modes because there is no “one” after “at least.”
Moturu teaches:
2. The method of claim 1, wherein the multimodal input data include at least audio data, visual data, text data, and haptic data. [Moturu teaches multimodal input including voice or text by the patient/user:  “[0014] As such, variations of the method 100 and/or system 200 can be implemented in characterizing and/or improving user conditions including any one or more of: …  communication-related conditions (e.g., expressive language disorder; stuttering; phonological disorder; autism disorder; voice conditions ….”  “[0027] As such, Block S110 preferably enables collection of one or more of: phone call-related data … media such as images, charts and graphs, audio, video, file, links, emojis, clipart, etc.) … vocal and textual content (e.g., text and/or voice data that can be used to derive features indicative of negative or positive sentiments; textual and/or audio inputs collected from a user in response to automated textual and/or voice communications; etc.) ….”  “[0017] …. For example, the technology can improve tailoring of communication plans (e.g., live and automated communication with users) and associated promotion of therapeutic interventions through leveraging passively collected digital communication data (e.g., text messaging features, phone calling features, user-provider relationship features, etc.) and/or supplementary data (e.g., mobility behavior data extracted from GPS sensors of mobile devices) that would not exist but for advances in mobile devices (e.g., smartphones) and associated digital communication protocols (e.g., WiFi-based phone calling; video conferencing for digital telemedicine; etc.)….”]
Rationale for combination as provided for Claim 1.

Haptic data is not taught by Moturu as input: “[0043] Relating to Block S150, automated communications preferably include one or more format components defining format-related aspects associated with presentation of the automated communication. The format components can include any one or more of: … touch parameters (e.g., braille parameters; haptic feedback parameters; etc.);….”  Note that the smartphone shown in Moturu would include a touchscreen but this is not express.
Breazeal teaches:
2. The method of claim 1, wherein the multimodal input data include at least audio data, visual data, text data, and haptic data. [Breazeal, Figure 5 showing Microphone array 506, Cameras 504, Touch Sensors 508, and  “[0314] … In such instances, PCD 100 may operate to acquire text based/GUI/speech entered information such as during a "getting acquainted" interaction….”  “[0369] … For example, a person may text a message to a PCD 100 associated with a user within which is embedded an emoticon representing an emotion or social action that the sender of the message wishes to convey via PCD 100….”  A keyboard appears as a part of touch sensors 508 but is not expressly mentioned.  ASR 206 of Figure 2 is repeatedly mentioned as generating text.  The claim is interpreted as including all of the modes recited.]
Akolkar/Moturu and Breazeal pertain to natural language conversational dialog systems where a machine is trained to conduct a dialog with a user.  It would have been obvious to combine the system of the Akolkar/Moturu combination, which relies on multimodal communication but does not expressly indicate a haptic input, with the system of Breazeal, which includes a comprehensive multimodal input system expressly enumerating the various modes of input, in order to be more comprehensive with respect to input and as combining prior art elements according to known methods to yield predictable results. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.

Regarding Claim 3, Akolkar does not mention mood or sentiment.  Akolkar finds the movements of the user, which teaches “act performed by the user,” but not by analysis of visual data.
Moturu teaches:
3. The method of claim 2, wherein the step of analyzing the multimodal input data comprises at least one of: 
analyzing the audio data to recognize content of the communication from the user, characteristics of the communication indicative of an emotion conveyed in the communication, and acoustic sound in the dialogue scene; and [Moturu, “[0027] As such, Block S110 preferably enables collection of one or more of: … vocal and textual content (e.g., text and/or voice data that can be used to derive features indicative of negative or positive sentiments ….”  “[0030] In some variations, Block S120 can include collecting biometric data associated with user conditions, such as from electronic health records, sensors of mobile devices and/or supplemental medical devices, user inputs (e.g., entries by the user at the mobile device), and/or other suitable sources. Biometric data can include one or more of: electroencephalogram (EEG) data, electrooculogram (EOG) data, electromyogram (EMG) data, electrocardiogram (ECG) data, airflow data (e.g., nasal airflow, oral airflow, measured by pressure transducers, thermocouples, etc.), pulse oximetry data, sound probes, polysomnography data ….”]
analyzing the visual data to recognize a facial expression of the user, an emotion associated with the facial expression, an act performed by the user, and one or more objects in the dialogue scene and the spatial relationships thereof. 
Rationale as provided for Claim 1.
Moturu teaches determining sentiment associated with communication ([0045]) and also teaches “automatically initiating a visual telemedicine communication” (claim 17), but does not expressly teach determining facial expressions, or sentiment from facial expressions, based on image analysis.  Moturu also teaches collection of eye or leg movement, which may not be by visual data: “[0030] In some variations, Block S120 can include collecting biometric data associated with user conditions, …. Biometric data can include one or more of: … sound probes, polysomnography data …” Polysomnography collects eye and leg movements during sleep.
Breazeal teaches:
3. The method of claim 2, wherein the step of analyzing the multimodal input data comprises at least one of: 
analyzing the audio data to recognize content of the communication from the user, characteristics of the communication indicative of an emotion conveyed in the communication, and acoustic sound in the dialogue scene; and [Breazeal, Figure 2, “ASR 206” leading to “Perceptual Cues /Belief states” which include “emotion” of the speaker.  Figure 9 starting with:  “Interpret User Body/Facial/Speech details to determine his emotional state 902.”   “[0025] FIG. 9 illustrates a flowchart for a method to indicate and/or influence emotional state of a user by use of the PCD.”]
analyzing the visual data to recognize: a facial expression of the user, an emotion associated with the facial expression, an act performed by the user, and one or more objects in the dialogue scene and the spatial relationships thereof. [Breazeal, Figure 2, “Cameras 212, 214.” Figure 9 starting with:  “Interpret User Body/Facial/Speech details to determine his emotional state 902.”  “[0327] In accordance with an exemplary and non-limiting embodiment, PCD 100 may modulate aspects of interaction with a user based, at least in part, upon various physiological and physical attributes and parameters of the user. In some embodiments, PCD 100 may employ gaze tracking to determine the direction of a user's gaze. Such information may be used, for example, to determine a user's interest or to gauge evasiveness. Likewise, a user's heart rate and breathing rate may be acquired. In yet other embodiment's a user's skin tone may be determined from visual sensor data and utilized to ascertain a physical or emotional state of the user. Other behavioral attributes of a user that may be ascertained via sensors 102, 104, 106, 108, 112 include, but are not limited to, vocal prosody and word choice. In other exemplary embodiments, PCD 100 may ascertain and interpret physical gestures of a user, such as waving or pointing, which may be subsequently utilized as triggers for interaction. Likewise, a user's posture may be assessed and analyzed by PCD 100 to determine if the user is standing, slouching, reclining and the like.”  “[0366] … For example a user might say "PCD, when my wife comes into the kitchen this morning, play her X song and tell her that I love her"….”  PCD must detect the wife in the kitchen.  “[0451] Consider a family with a home in which IOT lights and speakers have been installed in, say, the kitchen and adjacent family room. This family, being adopters of new technology, may purchase a personal PCD 100 that may be deployed in the kitchen. As part of its setup procedure, the social robot may discover the types and locations of the family's IOT devices and request permission to access and control them. If permission is granted, the PCD 100 may offer to perform a popular song…..”  PCD and the speakers (objects) and lights (objects) are all in the same place:  Kitchen.]
Rationale as provided for Claim 2.  Breazeal was added to teach the additional modalities of multimodal input and their functions would come from Breazeal.

Regarding Claim 4, Akolkar teaches:
4. The method of claim 3, wherein the current state of the dialogue is generated by: 
obtaining a language parsed graph (Lan-PG) of the dialogue based on the content of the communication from user based on and the dialogue tree; [Akolkar, Figure 4, “Conversation Parser.” Conversation parser determines the services that are being requested and leads to “Service Configurator 412” which can lead to the finite state diagram of Figure 8.]
obtaining a spatial-temporal-causal parsed graph (STC-PG) based on the act performed by the user and the dialogue tree; and [Akolkar, the FSM of Figure 8 includes a state diagram which shows what action causes what transition and outputs what result.  A conversation is temporal.  Thus, the FSM of Figure 8 teaches this limitation.]
generating a joint parsed graph (joint-PG) based on the Lan-PG, the STC-PG, and the information surrounding the dialogue scene. [Akolkar, Figure 8, FSM.  Time is a factor in a conversation and in a sequence of events and therefore the FSM of figure 8 includes time and causation.]
Akolkar does not include a factor of space.
Moturu teaches:
obtaining a spatial-temporal-causal parsed graph (STC-PG) based on the act performed by the user and the dialogue tree; and [Moturu.  Because mobility of the patient is an issue in Moturu and the location of the patient is considered, space/location is a factor:  “8. The method of claim 6, wherein extracting the set of mobility-communication features comprises generating a text messaging location feature based on associating a text messaging parameter from the log of use dataset with a location parameter from the mobility supplementary dataset, ….”  “[0050] … In another variation, applying a communication model can include applying a communication decision tree model, such as a decision tree model including internal nodes and branches selected based on correlations between automated communications and user outcomes (e.g., in relation to user conditions). In a specific example, a communication decision tree model can start with an initial automated communication (e.g., to be transmitted to the user), and subsequent automated communications can be selected and transmitted based on user responses (e.g., associated user meaning, user sentiment, etc.) to communications. However, applying communication decision tree models can be performed in any suitable manner.”]
Akolkar and Moturu pertain to conversational systems, and it would have been obvious to modify Akolkar, which uses a finite state machine graph to show the relationship between the state of the conversation and the next time step, with Moturu, which teaches the use of a decision tree for the same purpose and additionally includes the factor of location/space as another parameter to be taken into consideration for the decision, as combining prior art elements according to known methods to yield predictable results or simple substitution of one known element for another to obtain predictable results. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
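For illustration only, the communication decision tree quoted above from Moturu [0050] can be sketched as follows. The node contents, the sentiment labels, and the names Node and next_node are hypothetical and are not taken from Moturu; the sketch merely shows an initial automated communication followed by a subsequent communication selected based on the user's response.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    """One automated communication plus branches keyed by user response."""
    message: str
    children: Dict[str, "Node"] = field(default_factory=dict)


def next_node(current: Node, user_sentiment: str) -> Optional[Node]:
    """Follow the branch matching the user's response; None if no branch exists."""
    return current.children.get(user_sentiment)


# Hypothetical tree: an initial communication and two follow-ups.
root = Node(
    "How are you feeling today?",
    {
        "negative": Node("Would you like to talk to your care provider?"),
        "positive": Node("Great. Keep up your routine."),
    },
)

follow_up = next_node(root, "negative")
```

Each node plays the role that a state plays in the FSM of Akolkar, and each branch plays the role of a transition.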

Regarding Claim 5, Akolkar does not discuss machine learning.
Moturu teaches and suggests:
5. The method of claim 1, further comprising machine learning the utilities which comprises: [Moturu, “[0051] In another variation of Block S150, applying a communication model can include applying one or more machine learning communication models employing one or more machine learning approaches ….” ]
accessing the historic dialogue data related to past dialogues; [Moturu determines its “tailored communication plan” based on historic/past dialog data. “[0045] In a variation of Block S150, determining a communication plan (and/or associated components) can be based on historic communications (e.g., historic automated communications, user-provider communications, associated user responses, communications associated with other users such as users sharing a subgroup, etc.)….”]
obtaining, via machine learning, the utilities based on the historic dialogue data, wherein the utilities are formulated as the expected utilities with respect to actions specified by the dialogue tree given the current state of the dialogue; [Moturu.  “Utility” is a measure of effectiveness of a “tailored communication plan” which in Moturu is taught by maximizing the “reward” and Moturu performs machine learning to maximize “reward” / utility:  “[0051] … In a specific example, the method 100 can include: applying a machine learning communication model to tag a communication with a topic (e.g., where the communication model is trained on a training dataset including text messages and associated topic labels); mapping the topic to a subset of potential automated communications associated with the topic; and selecting an automated communication to transmit to the user from the subset of potential automated communications. In another specific example, Block S150 can include training a neural network model (e.g., a generative neural network model) with an input neural layer using features derived datasets described in Blocks S110-S145 to dynamically output content components for an automated communication, and/or any other suitable components of a communication plan. In specific examples, Block S150 can include training and applying a reinforcement learning model (e.g., deep reinforcement learning model), such as a reinforcement learning model for maximizing a reward (e.g., determining components of a communication plan for optimizing user outcomes, improving user conditions, user openness, and/or other suitable user parameters, etc.); ….”]
receiving, continuously, updated dialogue data of additional dialogues involving the user; and [Moturu, “[0020] … The technology can continuously collect and utilize specialized datasets unique to internet-enabled, non-generalized mobile devices in order to personalize and automate communications between a user and care provider for facilitating treatment….”  “[0026] Preferably, Block S110 is implemented using a module of a processing subsystem configured to interface with a native data collection application executing on a mobile device (e.g., smartphone, tablet, personal data assistant, personal music player, vehicle, head-mounted wearable computing device, wrist-mounted wearable computing device, etc.) of the user. As such, in one variation, a native data collection application can be installed on the mobile device of the user, can execute substantially continuously while the mobile device is in an active state (e.g., in use, in an on-state, in a sleep state, etc.), and can record communication parameters (e.g., communication times, durations, contact entities) of each inbound and/or outbound communication from the mobile device….”]
updating dynamically the utilities based on the updated dialogue data of the additional dialogues. [Moturu suggests this limitation because the act of adapting or learning can be a one-time training on a pre-determined set of data but is more normally a continuous process as more data becomes available.  Moturu teaches continuous data collection.  Moturu teaches machine learning on history.  These two teachings together suggest that the machine learning is continuous as new data becomes available.]
Akolkar and Moturu pertain to conversational systems, and it would have been obvious to modify Akolkar, which uses previously stored (historic) user behavior, with Moturu, which teaches conducting machine learning based on historic dialog data, as combining prior art elements according to known methods to yield predictable results or simple substitution of one known element for another to obtain predictable results. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
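For illustration only, learning utilities from historic dialog data and then updating them as additional dialog data arrives can be sketched as a running average of observed rewards per (state, action) pair. This is a simplified stand-in chosen for the sketch, not the reinforcement learning model of Moturu; the class UtilityTable, the state and action labels, and the reward values are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Tuple

Key = Tuple[str, str]  # (dialogue state, action)


class UtilityTable:
    """Expected utility per (state, action), kept as an incremental mean."""

    def __init__(self) -> None:
        self.value: Dict[Key, float] = defaultdict(float)
        self.count: Dict[Key, int] = defaultdict(int)

    def update(self, state: str, action: str, reward: float) -> None:
        key = (state, action)
        self.count[key] += 1
        # Incremental mean: shift the estimate toward the new observation.
        self.value[key] += (reward - self.value[key]) / self.count[key]


table = UtilityTable()

# Initial learning pass over hypothetical historic dialogue data.
for state, action, reward in [("greet", "ask_mood", 1.0), ("greet", "ask_mood", 0.0)]:
    table.update(state, action, reward)
after_history = table.value[("greet", "ask_mood")]

# Dynamic update when a further dialogue with the user is observed.
table.update("greet", "ask_mood", 1.0)
after_new_data = table.value[("greet", "ask_mood")]
```

The same update routine serves both the initial pass over history and the later updates, which is the sense in which continuous data collection combined with learning on history suggests continuous learning.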

Regarding Claim 6, Akolkar teaches:
6. The method of claim 5, wherein the step of determining the response communication comprises: 
identifying a plurality of actions associated with a node in the dialogue tree corresponding to the current state of the dialogue; [Akolkar, the “actions” are potentially appropriate responses in each state of the dialog which depend on which service is selected.  Akolkar calls them “candidates” which teach the “actions” of the Claim.  “[0084] FIGS. 5-7 demonstrate how a customer interacts with CSM via the Conversational Interface 416. In some embodiments, the UI is divided into two parts: (1) the conversation area 502, which includes the conversation display area and the text input area 514, and (2) the candidate service list area 504, which displays the qualified candidate services selected based on the conversation.… Accordingly, CSM retrieves all the relevant services (e.g., service 1 through n) and displays them on the right side at 504 under the "Matched Service List" heading. To further filter the matching services list, CSM provides the customer with the next level of details while simultaneously ruling out unqualified candidates and conducting service configuration, through a series of iterative question and answer procedures which guide the customer through the requirements.”]
determining a reward associated with each of the plurality of actions based on the learned utilities associated with the user; and [Akolkar, the “reward” is taught by the “number of candidates a sequence of questions Q can prune” which determines the  “Effectiveness”/ “utility” value (“eff(Q)”) of the sequence Q of questions based on the goal of the user.  “[0113] Service Filtering.”  “[0119] In one or more embodiments, to reduce the number of iterations, find a sequence of questions Q={Q.sub.1, Q.sub.2 . . . Q.sub.n} with the least length to rule out all unsatisfied candidate services via capability or configuration. Quantitatively, use eff(Q) to evaluate how effectively the sequence can filter the candidates. The effectiveness of a question sequence can be considered as the sum of the effectiveness of all its questions, i.e., eff(Q)=.SIGMA..sub.i eff(Q.sub.i). Concretely, the effectiveness of a question is qualified as the expected number of candidates it can prune, i.e. n.sub.prune, based on the customer's potential answer. There are three types of questions and their effectiveness is evaluated differently:”]
selecting an action as the response communication from the plurality of actions, [Akolkar, “[0084] … Accordingly, CSM retrieves all the relevant services (e.g., service 1 through n) and displays them on the right side at 504 under the "Matched Service List" heading. To further filter the matching services list, CSM provides the customer with the next level of details while simultaneously ruling out unqualified candidates and conducting service configuration, through a series of iterative question and answer procedures which guide the customer through the requirements.”]
that corresponds to a maximum utility represented as a function of the reward. [Akolkar, maximum utility is maximum effectiveness eff(Q) which is a function of the “number of candidates it can prune” / “reward.” “[0125] From the perspective of the traditional optimization problem, the goal of picking the question sequence is to find the one that maximizes the effectiveness ….”  The “effectiveness” is the utility of the Claim.]
Akolkar teaches a finite state diagram with an ontology of services in Figure 8, each of which, as shown in Figure 8, may include several functions/actions.  Akolkar optimizes the effectiveness of a sequence of questions and the most effective sequence is the one selected.  Each question is an action.  See [0119] to [0125].   Akolkar uses an FSM and not a dialog tree, which is taught by Moturu.  The dialog tree of Moturu can substitute for the FSM of Akolkar as an equivalent/similar method in the context of conversation/dialog.  (In conversation, states and actions are the same: going from state to state, the action that occurs is the dialog portion that is uttered, whether question or response.  See Heegard in the Conclusion section below.)
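For illustration only, the effectiveness computation quoted above from Akolkar [0119] to [0125] can be sketched as follows. The question names and the per-question values (the expected number of candidates each question prunes) are hypothetical; the sketch covers only the sum eff(Q) = Σ eff(Q_i) and the selection of the sequence with maximum effectiveness, not the three question types or the least-length criterion of Akolkar.

```python
from typing import Dict, List, Sequence


def eff(sequence: Sequence[str], expected_pruned: Dict[str, float]) -> float:
    """eff(Q): sum of eff(Q_i), the expected number of candidates each question prunes."""
    return sum(expected_pruned[q] for q in sequence)


def select_sequence(
    candidates: List[Sequence[str]], expected_pruned: Dict[str, float]
) -> Sequence[str]:
    """Pick the question sequence with maximum effectiveness (the 'utility')."""
    return max(candidates, key=lambda seq: eff(seq, expected_pruned))


# Hypothetical per-question 'reward': expected number of candidates pruned.
expected_pruned = {"q_region": 4.0, "q_storage": 2.5, "q_backup": 1.0}

candidates = [
    ("q_region", "q_backup"),
    ("q_region", "q_storage"),
    ("q_storage", "q_backup"),
]

best = select_sequence(candidates, expected_pruned)
```

Here each question is an action, the expected number of pruned candidates is its reward, and the selected sequence is the one whose summed reward is greatest.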

Claim 8 is a computer program product claim with limitations corresponding to the limitations of method Claim 1 and is rejected under similar rationale.
8. Machine readable and non-transitory medium having information recorded thereon for an automated dialogue companion, wherein the information, when read by the machine, causes the machine to perform: 
….
Claim 9 is a computer program product claim with limitations corresponding to the limitations of method Claim 2 and is rejected under similar rationale.
Claim 10 is a computer program product claim with limitations corresponding to the limitations of method Claim 3 and is rejected under similar rationale.
Claim 11 is a computer program product claim with limitations corresponding to the limitations of method Claim 4 and is rejected under similar rationale.
Claim 12 is a computer program product claim with limitations corresponding to the limitations of method Claim 5 and is rejected under similar rationale.
Claim 13 is a computer program product claim with limitations corresponding to the limitations of method Claim 6 and is rejected under similar rationale.
Claim 15 is a system claim with limitations corresponding to the limitations of Claim 1 and is rejected under similar rationale.
15. A system for an automated dialogue companion, comprising: 
a device configured for receiving multimodal input data associated with a user engaged in a dialogue of a certain topic in a dialogue scene, wherein the multimodal input data capture a communication from the user and information surrounding the dialogue scene; 
a user interaction engine configured for 
…. and 
a dialogue manager configured for …. 
Claim 16 is a system claim with limitations corresponding to the limitations of Claim 2 and is rejected under similar rationale.
Claim 18 is a system claim with limitations corresponding to the limitations of Claim 4 and is rejected under similar rationale.
Claim 19 is a system claim with limitations corresponding to the limitations of Claim 5 and is rejected under similar rationale.
Claim 20 is a system claim with limitations corresponding to the limitations of Claim 6 and is rejected under similar rationale.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to FARIBA SIRJANI whose telephone number is (571)270-1499.  The examiner can normally be reached on 9 to 5, M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Desir, can be reached on 571-272-7799.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Fariba Sirjani/
Primary Examiner, Art Unit 2659