Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Claims 1-9, 11-15, and 17-20 are pending. Claims 1, 11, and 20 are independent and are amended.  Claims 10 and 16 are canceled.
This Application was published as U.S. 2021/0304787.
Apparent priority: March 2020.
Applicant’s amendments and arguments are considered but are either unpersuasive or moot in view of the new grounds of rejection.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 8/2/2022 has been entered.
Response to Amendments and Arguments
Claim 1 as amended provides:
1. A computer-implemented method for causing a virtual personal assistant to interact with a user while assisting the user, the method comprising: 
capturing, by an input device of the virtual personal assistant, a first input that indicates one or more behaviors associated with the user; 
determining, by a processor of the virtual personal assistant, a first emotional state of the user based on the first input; 
selecting, by the processor from a response mapping, a first emotional component for changing the first emotional state of the user to a target emotional state of the user, 
wherein the response mapping includes a machine learning model stored in a memory of the virtual personal assistant, the machine learning model including a plurality of mappings, each of the plurality of mappings including a respective emotional component for changing a respective emotional state of the user to a respective target emotional state for the user;
generating, by the processor, a first vocalization that incorporates the first emotional component,
wherein the first vocalization relates to a first operation that is being performed to assist the user; and 
outputting, by an audio output device of the virtual personal assistant, the first vocalization to the user. 

Support in the Specification:
[0036] In one embodiment, response generator 230 may implement a response mapping 238 that maps input transcription 212 and/or emotional state 222 to one or more operations 232, one or more semantic components 234, and/or one or more emotional components 236. Response mapping 238 may be any technically feasible data structure based on which one or more inputs can be processed to generate one or more outputs. For example, and without limitation, response mapping 238 could include an artificial neural network, a machine learning model, a set of heuristics, a set of conditional statements, and/or one or more look-up tables, among others. In various embodiments, response mapping 238 may be obtained from a cloud-based repository of response mappings that are generated for different users by different instances of system 100. Further, response mapping 238 may be modified using techniques described in greater detail below and then uploaded to the cloud-based repository for use in other instances of system 100.
…
[0039] As a general matter, a given objective function 252 can represent a target behavior for user 140, a target emotional state 222 for user 140, a target state of being of user 140, a target level of engagement with VPA 118, or any other technically feasible objective that can be evaluated based on feedback 242. In embodiments where response generator 230 includes response mapping 238, mapping modifier 250 may update response mapping 238 in order to improve subsequent outputs 132. In the manner described, VPA 118 can adapt to the specific personalities and idiosyncrasies of different users and therefore improve over time at interpreting and engaging with user 140.
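For illustration only, the look-up-table form of response mapping 238 mentioned in [0036] can be sketched as below. This is a minimal sketch and not the Applicant's implementation: the state labels, the field names, and the function select_emotional_component are all hypothetical and are not taken from the Specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmotionalComponent:
    # Hypothetical fields; the Specification does not define this structure.
    pitch: str
    volume: str
    diction_speed: str


# Hypothetical look-up table: detected state -> (target state, component).
RESPONSE_MAPPING = {
    "angry": ("calm", EmotionalComponent("low", "soft", "slow")),
    "sad": ("content", EmotionalComponent("medium", "soft", "slow")),
    "happy": ("happy", EmotionalComponent("high", "normal", "normal")),
}


def select_emotional_component(state):
    """Return (target_state, component) for a detected state, or None if unmapped."""
    return RESPONSE_MAPPING.get(state)
```

In this sketch nothing is learned at selection time; a previously stored table is simply consulted.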

Reply to Arguments regarding the 101 Rejection
Regarding the 35 U.S.C. 101 rejection, the amendments are not sufficient.
The Claim is directed to the abstract idea of two people engaged in a conversation and one side listening to the other side and determining that the other side is in a bad mood of some sort (sad, angry, depressed) and adjusting what he says in response (voice and/or content) to soothe the other side.  

Step 2A, Prong Two: Whether Additional Elements exist that Integrate the Judicial Exception into a Practical Application? And 
Step 2B: Search for Inventive Concept: Whether Additional Elements amount to Significantly More?

Additional elements exist and include: the virtual personal assistant, an input device, a processor, a machine learning model, and an output device.  
The input device and output device behave in their well-understood, routine, and conventional manner for data gathering and data output, respectively, and would not cause the Claim as a whole to amount to significantly more than the underlying abstract idea.  The reference to the processor is an attempt to automate an otherwise mental process using a generic computer component.  
The closest additional element to directing the claim to some improvement would be the machine learning model.  However, the “learning” does not occur in the Claim and the machine learning model represents a simple mapping of vocalizations to moods which can be relied upon in an otherwise mental process.  The machine learning model in the Claim is also not directed towards any specific or new type of machine learning model (i.e., a new type of neural network structure or learning technique).  Instead, the machine learning model is generic and is used in a well-understood, routine, and conventional manner to automate data processing which can otherwise be practically performed as a mental process.   Note [0036] of the Specification provided above.
The Claim amounts to telling a VPA to perform an operation that is normally performed by people without providing sufficient specifics.
Please refer to the response to Arguments in the previous Office action.  Some of the mentioned details and particularities have been added to the Claim but even the additions don’t say much.  
For example, we still don’t know if the input is voice or text or something else because the Claim merely refers to an “input device.”
As another example, the preamble includes a VPA and the following limitation could have included: “wherein the first vocalization relates to a first operation that is being performed by the virtual personal assistant to assist the user,” in order to weave the VPA back into the body of the Claim.

Reply to Arguments regarding the 103 Rejection
Regarding the 35 U.S.C. 103 rejection of Claims, Applicant’s arguments are not persuasive and are moot in view of the modified grounds of rejection.
	Applicant has argued with respect to McDuff, Kim, and Manfredi that they fail to teach that the response changes the user emotion to a target emotional state. 
	IN REPLY: all of the above references intend to make the speaker/user happy and content.  Therefore, there is always a “target emotional state” and it is generally the same in all of them.  
	McDuff, which merely mimics the emotions of the user, does so with the intent to make the user happy.  Target emotion = user content.
Manfredi, for example, makes an offer of a discount to an angry customer to make the customer happy.  See [0010].  Target emotion = user content.
Kim is expressly directed to determining user discontent and responding to it.
Young wants the user to be receptive to the response.  Target emotion = receptivity.

Mapping of user emotion to an emotive/emotional response of the VA using a machine-learned model is taught by Manfredi and also Young.
If the type of Mapping used in the instant Application is special (for example, if the target emotions are varied and the VA actually wants to anger the user in certain circumstances, or if there is a table that says if user emotion = Happy then Target emotion = Mad), then such detail needs to be in the Claim.
Otherwise, the numerous references that gauge the emotion of a user of a Virtual Assistant and respond accordingly do not do so aimlessly; rather, they tailor the response with a Target emotion in mind.
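
By way of illustration only, the distinction drawn above between a single implicit target emotion and a varied set of target emotions can be sketched as follows. The emotion labels, table names, and helper function are hypothetical; they are not taken from the instant Application or from any cited reference.

```python
# Hypothetical table in which every detected emotion maps to the same target,
# as the cited references are read above (Target emotion = user content).
UNIFORM_TARGET = {"angry": "content", "sad": "content", "happy": "content"}

# Hypothetical table with varied targets, including Happy -> Mad,
# which is the kind of detail that would need to appear in the Claim.
VARIED_TARGET = {"angry": "content", "sad": "content", "happy": "mad"}


def distinct_targets(mapping):
    """Return the set of target emotions used by a mapping."""
    return set(mapping.values())
```

Under these assumptions, distinct_targets(UNIFORM_TARGET) contains one target, while distinct_targets(VARIED_TARGET) contains more than one.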

Applicant is possibly referring to the mapping in Figures 3A and 3B of the instant Application.
When a feature is considered key in the Claim, include the points of novelty and non-obviousness with further particularity in the Claim.


[Image: media_image1.png]

[Image: media_image2.png]

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-9, 11-15, and 17-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to a judicial exception (i.e., a law of nature, a natural phenomenon, or an abstract idea) without significantly more. The claim(s) does/do not include additional elements that are sufficient to amount to significantly more than the judicial exception.
Under step 2A, prong 1, the claims fall under mental processes thus falling within a judicial exception (one person talking to another and helping him out with turning the radio on in a car).  Under step 2A, prong 2, the judicial exception needs to be integrated into a practical application. The additional limitation here is simply a computer as noted in the preamble. This is a mere attempt to “apply” the steps to a computer. Therefore, the claims are directed to an abstract idea. Under step 2B, the additional limitations as noted earlier with prong 2 include a mere attempt to apply the exception using a generic computing component which does not result in an inventive step.
Claim 1 is a generic automation of a mental process since a human passenger can sense the emotional state of another passenger and adjust his or her behavior accordingly. We have a new question (prong 2 of step 2A) in the 101 analysis that asks whether the abstract idea is integrated with a practical application. The answer is no in this instance because there is no technological solution in the Claim that “integrates” the abstract idea. The Claim only suggests that the abstract idea be applied. It does not describe an application.   Tie the steps to machine components (microphone, NLP, speaker, spectral analyzer, etc.) and tie them tightly.

Step 1: The independent Claims are directed to statutory categories: 
Claim 1 is a method claim and directed to the process category of patentable subject matter.
Claim 11 is a computer-readable-storage device claim and is directed to the machine or manufacture category of patentable subject matter.
Claim 20 is a system claim and directed to the machine or manufacture category of patentable subject matter.

Step 2A, Prong One: Does the Claim recite a Judicially Recognized Exception (an Abstract Idea)? That is, are these Claims considered Abstract as a Mathematical Concept (mathematical relationships, mathematical formulas or equations, mathematical calculations), a Mental Process (concepts performed in the human mind, including an observation, evaluation, judgment, or opinion), or Certain Methods of Organizing Human Activity (1- fundamental economic principles or practices (including hedging, insurance, mitigating risk), 2- commercial or legal interactions (including agreements in the form of contracts; legal obligations; advertising, marketing or sales activities or behaviors; business relations), 3- managing personal behavior or relationships or interactions between people (including social activities, teaching, and following rules or instructions)), such that they fall under the judicial exception to patentable subject matter?
The rejected Claims are directed to Mental Processes or Methods of Organizing Human Activity.
Step 2A, Prong Two: Additional Elements that Integrate the Judicial Exception into a Practical Application? This prong involves identifying whether there are any additional elements recited in the claim beyond the judicial exception(s), and evaluating those additional elements to determine whether they integrate the exception into a practical application of the exception. “Integration into a practical application” requires an additional element(s) or a combination of additional elements in the claim to apply, rely on, or use the judicial exception in a manner that imposes a meaningful limit on the judicial exception, such that the claim is more than a drafting effort designed to monopolize the exception. This prong uses the considerations laid out by the Supreme Court and the Federal Circuit to evaluate whether the judicial exception is integrated into a practical application.
The rejected Claims do not include additional limitations that point to integration of the abstract idea into a practical application.
Claim 1 is a generic automation of a mental process since a human agent can sense the emotional state of a customer and adjust his or her behavior accordingly. We have a new question (prong 2 of step 2A) in the 101 analysis that asks whether the abstract idea is integrated with a practical application. The answer is no in this instance because there is no technological solution in the Claim that “integrates” the abstract idea. The Claim only suggests that the abstract idea be applied. It does not describe an application. 

1. A computer-implemented method for causing a virtual personal assistant to interact with a user while assisting the user, the method comprising: 
capturing, by an input device of the virtual personal assistant, a first input that indicates one or more behaviors associated with the user;  [Listener is listening to Speaker/User and captures the User’s behavior based on what the User says.]
determining, by a processor of the virtual personal assistant, a first emotional state of the user based on the first input; [Listener can tell that the Speaker is angry or sad based on what the Speaker/User says or based on his tone of voice.]
selecting, by the processor from a response mapping, a first emotional component for changing the first emotional state of the user to a target emotional state of the user, [Listener contemplates his response options based on his past experience (learned) with the Speaker/User and knowing what may calm down the Speaker/User and what may further aggravate her.]
wherein the response mapping includes a machine learning model stored in a memory of the virtual personal assistant, the machine learning model including a plurality of mappings, each of the plurality of mappings including a respective emotional component for changing a respective emotional state of the user to a respective target emotional state for the user; [Listener knows the Speaker/User from past dealings and has “learned” and stored in his “memory” what type of response can cheer up the Speaker/User.  There is no actual “learning” occurring in this step.  Rather, a previously “learned” set of information is stored.]
generating, by the processor, a first vocalization that incorporates the first emotional component, [Listener decides to console the Speaker or calm him down or do a little of each and therefore “incorporates” /takes into account the mood of the Speaker/User.]
wherein the first vocalization relates to a first operation that is being performed to assist the user; and [Listener is trying to help the Speaker/User to do something /perform a first operation and what the Listener says pertains to that operation.]
outputting, by an audio output device of the virtual personal assistant, the first vocalization to the user.  [Listener speaks out.]

Step 2B: Search for Inventive Concept: Additional Elements Do not amount to Significantly More: Claim 1 has no extra limitation and the limitations of "memory" and “processor,” in the system Claim 20, or the limitation of “computer readable medium” in Claim 11 are well-understood, routine, and conventional machine components that are being used for their conventional and rather generic functions. Additionally, these limitations are expressed parenthetically and lack nexus to the Claim language and as such are a separable and divisible mention of a machine. Accordingly, they are not sufficient to cause the Claim to amount to significantly more than the underlying abstract idea. 
The technological aspect has to be claimed with more particularity and woven into the fabric of the Claim. The Claim has to integrate the abstract idea into a technological application. Not: here is the idea of listening to someone and consoling them and here is “computer-implemented” stated broadly, and we put them together. The “How” of each step and the definition of the terms in the Claim, if added, could contribute to integrating the abstract idea into a technological application.
The Dependent Claims do not add limitations that could help the Claim as a whole to amount to significantly more than the Abstract idea identified for the Independent Claim.  For example:
In Claim 2, the User may be trying to do something in his car and the Passenger can talk to him and help him.  
Set the scene from the beginning: the method is to provide spoken natural language assistance to a driver or passenger of a vehicle who is engaged in using a feature of the vehicle and who expresses his frustration or his question by speaking to the car. His voice is captured via a microphone and analyzed by natural language processing software, or his captured voice is converted to the spectral domain and the spectral content is analyzed, and the response is output by the car through a speech synthesizer.  Say something about the machine components that are involved, their respective roles, and their interface with the physical environment of the car.  This application is about an automobile, yet the term “vehicle” is introduced only in Claim 2, and the other claims are not even in the chain of dependency. 
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1, 4-9, 11, 14-15, and 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over McDuff (U.S. 20200279553) in view of Manfredi (U.S. 20080096533).


Regarding Claim 1, McDuff teaches:
1. A computer-implemented method for causing a virtual personal assistant to interact with a user while assisting the user, [McDuff, Figure 1, “Local Computing Device 106” is described as a “virtual agent.”  “[0021] … The local computing device 106 may be any type of computing device such as a smartphone, a smartwatch, a tablet computer, a laptop computer, a desktop computer, a smart TV, a set-top box, a gaming console, a personal digital assistant, a vehicle computing system, a navigation system, or the like….”]
the method comprising: 
capturing, by an input device of the virtual personal assistant, a first input that indicates one or more behaviors associated with the user; [McDuff, Figure 1, “Communication Interface 116” and “Microphone 110.”  “User 102” providing “Speech 104.”  Figure 4, “speech recognizer 206” feeding the “text sentiment recognizer 404,” and Figure 5, “Determine Linguistic Style 508” and “Identify Sentiment of User’s Input 512” both teach “behaviors associated with the user” of the Claim.] [(Note Specification defines “behavior”: “[0021] Input devices 120 are configured to capture input 122 that reflects one or more behaviors associated with a user 140. As referred to herein, a "behavior" includes any voluntary and/or involuntary actions performed by the user. For example, and without limitation, a "behavior" could include explicit commands issued by the user, facial expressions enacted by the user, changes in affect presented consciously or unconsciously by the user, as well as changes in user posture, heart rate, skin conductivity, pupil dilation, and so forth. Input devices 120 may include a wide variety of different types of sensors that are configured to capture different types of data that reflect behaviors associated with the user. ….”)]
determining, by a processor of the virtual personal assistant, a first emotional state of the user based on the first input; [McDuff, Figure 5, “Identify Sentiment of User’s Input 512” and Figure 7, “Sentiment Analysis Module 716.”  Figure 4, “Text Sentiment Recognizer 404.”  [0065].  “Facial Expression Recognizer 416.”  [0075].  Figure 1, “processors 112.”]
selecting, by the processor from a response mapping, a first emotional component for changing the first emotional state of the user to a target emotional state of the user, [McDuff teaches in Figure 5, “512: identify sentiment of the user’s input” and “generate a response dialogue 514,” that the response by the machine is generated in response to the sensed emotion of the user.  The “linguistic style” factors that are applied to the synthesized speech are the same as the “emotional component” types identified by the Specification of the instant Application.  “[0093] … The prosodic qualities of the response dialogue may also be modified based on a facial expression of the user 102 if that data is available. For example, if the user 102 is making a sad face, the tone of the response dialogue may be lowered to make the conversational agent also sound sad… The prosodic qualities of the response dialogue may be selected to mimic the prosodic qualities of the user's 102 linguistic style identified at 508. Alternatively, the prosodic qualities of the response dialogue may be modified (i.e., altered to be more similar to the linguistic style of the user 102) based on linguistic style identified a 508 without mimicking or being the same as the prosodic qualities of the user's 102 speech 104.”  This teaches adjusting the machine response to fit a target emotional state as claimed; it may be that when the user is sad, the machine ought not be too jittery and impertinent to further aggravate a sad user.  See also “[0089] … Acoustic variables considered to identify a linguistic style include, but are not limited to, speech rate, pitch, and loudness. 
Acoustic variables may be referred to as prosodic qualities.”] [(Note the Specification defines “emotional component” as “[0033] … For example, and without limitation, a given emotional component 236 could include a specific pitch, tone, timbre, volume, diction speed, and/or annunciation level with which the vocalization should be synthesized to reflect particular emotional qualities and/or attributes.”)]
wherein the response mapping includes a machine learning model stored in a memory of the virtual personal assistant, [McDuff, “… Utterances by the virtual agent may be based on a combination of predetermined scripted responses and open-ended responses generated by machine learning techniques….”  Abstract.  See also [0035]-[0036] and [0128]-[0129].  The “machine-learning” model used by McDuff determines intent and generates a proper response.  McDuff does not teach that this machine-learning is to determine the prosody/ “emotional component” of the response.]
the machine learning model including a plurality of mappings, each of the plurality of mappings including a respective emotional component for changing a respective emotional state of the user to a respective target emotional state for the user;
generating, by the processor, a first vocalization that incorporates the first emotional component, [McDuff, Figure 4, “Synthesized Output 422” which is fed by the “Emotion and Head Pose Synthesizer 420” which takes into account the sentiment of the User 102 as provided by the “text sentiment recognizer 404” and the “facial expression recognizer 416.”  [0079]-[0080].  Figure 1, “Speaker 108.”]
wherein the first vocalization relates to a first operation that is being performed to assist the user; and [McDuff, the “Local Computing Device 106” is a “Virtual Personal Assistant” and therefore it “performs” some “operation” that assists the “User 102”:  “[0021] FIG. 1 shows a conversational agent system 100 in which a user 102 uses speech 104 to interact with a local computing device 106 such as a smart speaker (e.g., a FUGOO Style-S Portable Bluetooth Speaker). The local computing device 106 may be any type of computing device such as a smartphone, a smartwatch, a tablet computer, a laptop computer, a desktop computer, a smart TV, a set-top box, a gaming console, a personal digital assistant, a vehicle computing system, a navigation system, or the like. In order to participate in audio-based interactions with the user 102, the local computing device 106 includes or is connected to a speaker 108 and a microphone 110. The speaker 108 generates audio output which may be music, a synthesized voice, or other type of output.”  Each of the examples indicates an “operation” such as making a phone call, telling the time, providing navigation assistance, etc.]
outputting, by an audio output device of the virtual personal assistant, the first vocalization to the user. [McDuff, Figure 3, “Speakers 312” and “Display 314” provide the output including the synthesized speech from “Speech Synthesizer 220” of Figure 4.  Figure 7, “Input/Output Devices 708.”]
            
           
McDuff does not expressly teach that the response by the machine is calculated to put the user in a good mood.
McDuff teaches a “machine-learning” model used to determine intent and generate a proper response.  McDuff does not teach that this machine-learning is to determine the prosody/ “emotional component” of the response.

Regarding Claim 1, Manfredi teaches:
1. A computer-implemented method for causing a virtual personal assistant to interact with a user while assisting the user, [Manfredi, Title: “Virtual assistant with Real-Time Emotions.”  Figure 2, “Virtual Assistant Expert System.”  “A modular digital assistant that detects user emotion and modifies its behavior accordingly…..”  Abstract.]
the method comprising: 
capturing, by an input device of the virtual personal assistant, a first input that indicates one or more behaviors associated with the user; [Manfredi, Figure 4, “Input Collection 2.”]  [Manfredi, Figure 2, “[0035] … Examples of 3 input devices are shown, a mobile phone 80, a personal computer 82 and a kiosk 84…..”  At least, the “mobile phone 80” would inherently include a microphone.  “[0156] The system of the invention relies on a voice recognition system (ASR) which interprets the spoken words and generates a written transcription of the speaking….”   “[0019] The virtual agent is able to dynamically construct in real-time a dialogue and related emotional manifestations supported by both precise inputs and a tight objective relevance, including the context of those inputs….”  Either the “inputs” or the “context of inputs” teaches “behaviors associated with the user.”  (See the definition of “behavior” in the instant Application.)]
determining, by a processor of the virtual personal assistant, a first emotional state of the user based on the first input; [Manfredi, Figure 4, “Input Contextualization 4.”  Figure 5, input of “Thanks” with a happy tone and [0142].  “[0147] There are well-established techniques to obtain a user's emotive state from the vocal spectrum. ….”  “User Emotion Calculation” [0202] et seq.] [Manfredi, Figure 1, “Right Brain Neural Net 52” and “[0086] For each typology of emotional analysis methodology, Table 1 below indicates the main elements to be monitored and their relative value as indicators of the user's emotion…..”  The inputs upon which the “first emotional state of the user” is determined include the user’s face and voice.  Table 1 lists a large number of factors that are used to determine the “first emotional state of the user.” See Figure 5 for components of emotion and [0091]-[0092] for how a combination of these components determines an emotional state of the speaker: “[0092] FIG. 3 …. For each basic emotion a percentage (e.g., 37.9% fear, 8.2% disgust, etc.) is provided to the other modules. In this case, the values represent a situation of surprise and fear. For example, as if the virtual assistant is facing a sudden and somehow frightful event. By receiving these data, Janus is able to indicate to different modules how to behave, so that spoken is pronounced consistently to emotion and a similar command is transmitted to the 3D model.”]
selecting, by the processor from a response mapping, a first emotional component for changing the first emotional state of the user to a target emotional state of the user, [Manfredi, Figure 4, “Emotional Status Definition 6.”] [Manfredi adjusts the output of the Virtual Assistant to the input of the user, in content and tone.  Thus, the VA “selects” a “first emotional component,” such as Joy or Surprise in order not to aggravate the user / “change the first emotional state to a target emotional state.”    The “target emotional state” is determined according to the “training set” provided to train the neural network model that determines the emotive response of the VA.  See [0087] for the “training set.”  See also: “[0090] … The virtual assistant will determine that the user feels like he is on familiar terms with the virtual assistant, and can therefore genuinely allow himself a joking approach. In a similar situation, the virtual assistant will choose how to behave on the basis of the training provided to the Right Brain engine. For example, the virtual assistant could laugh, communicating happiness, or, if it's the very first time that the user behaves like this, could alternatively display surprise and a small (but effective) percentage of happiness.”]  [(Note the Specification defines “emotional component” as “[0033] … For example, and without limitation, a given emotional component 236 could include a specific pitch, tone, timbre, volume, diction speed, and/or annunciation level with which the vocalization should be synthesized to reflect particular emotional qualities and/or attributes.”)]
wherein the response mapping includes a machine learning model stored in a memory of the virtual personal assistant, the machine learning model including a plurality of mappings, each of the plurality of mappings including a respective emotional component for changing a respective emotional state of the user to a respective target emotional state for the user; [Manfredi uses a trained neural network to determine the emotive response of the VA according to the emotive state of the user:  “[0122] The Virtual Assistant makes use of an additional model of artificial intelligence representing an emotive map and thus dedicated to identify the emotional status suitable for that situation.”  “Virtual Assistant's Emotion Calculation [0230] An AI engine (based on neural networks) is to compute VA's emotion (selected among catalogued emotions, see § "User's emotion calculation") with regard to: [0231] User's emotional state (dynamically calculated)  …  [0237] VA's emotional model (into A.I. engine) [0238] The outcome is an expressive and emotional dynamic nature of the VA which, based on some consolidated elements (emotive valence of discussed subject, answer to be provided and VA's emotional model) may dynamically vary, real time, with regard to interaction with the interface and the context.”  “[0083] By means of a suitable selection of network training examples, we are able to coach it to answer in a way that corresponds to a desired "emotive profile."….”  “[0087] … Training a neural network requires a training set, or a set (range) of input values together with their correct related output values, to be submitted to network so that it is autonomously enabled to learn how to behave as per the training examples….”  A trained neural network is a machine-learned model.]
generating, by the processor, a first vocalization that incorporates the first emotional component, [Manfredi, see [0007]-[0008].  “[0007] The present invention provides a digital assistant that detects user emotion and modifies its behavior accordingly. … For example, a happy emotion may be translated to … a cheerful tone of voice for a voice response unit over the telephone….”  See also [0089]-[0090].  “[0008] … there can be percentage variation in the degree of the emotion …. The percentage can be determined to match the detected percentage of the user's emotion…..”]
wherein the first vocalization relates to a first operation that is being performed to assist the user; and [Manfredi is directed to Virtual Assistants that execute commands or respond to questions.  “[0178] …As an example it is possible to set an arbitrary language with a gestures or words sequence, that send a not explicit command to the VA….”  This teaches an implicit command and implies an explicit command.]
outputting, by an audio output device of the virtual personal assistant, the first vocalization to the user. [Manfredi, Figure 4, “output presentation 8.”]

McDuff and Manfredi both pertain to detecting the emotion of a user at a VA and adjusting the response accordingly.  It would have been obvious to combine the feature of Manfredi, which adjusts the response of the machine to the user to please the user (i.e., a customer), with the system of McDuff, which teaches adjusting the response to sound sad in order to possibly sympathize with the speaker/user, an indirect way of putting the user in a better mood.  It would also have been obvious to use the trained neural network system of Manfredi, which is a type of machine-learned model, for performing such a task.  This combination falls under simple substitution of one known element for another to obtain predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.

Regarding Claim 4, McDuff teaches:
4. The computer-implemented method of claim 1, wherein the steps of determining the first emotional state of the user comprises the steps of: 
determining a first feature of the first input; and [McDuff, Figure 4.  Two features of the image input are determined:  “Facial Expression Recognizer 416” and “Head Pose Estimator 418” are both extracted from the Video input / First input.]
determining a first type of emotion corresponding to the first feature. [McDuff, Figure 4. “Emotion and Head Pose Synthesizer 420” takes into account both 416 and 418.]  

Regarding Claim 5, McDuff teaches:
5. The computer-implemented method of claim 4, 
wherein the first input comprises an audio input, and [McDuff, Figure 4, “speech recognizer 206” feeding the “text sentiment recognizer 404.”  McDuff teaches determining sentiment from speech of the user:  “[0091] At 512, a sentiment of the user's 102 (i.e. speech 104 or text) may be identified….” ]
wherein the first feature comprises a tone of voice associated with the user. 
McDuff, Figures 2 and 4, teaches the input of audio and extracting prosody/tone of the speech by “prosody style extractor 218.”  It also teaches “features of speech such as emphasis and intonation.”  See [0038] and [0131].  But it does not elaborate on determining emotion from the tone/prosody of the voice.
Manfredi teaches:
wherein the first input comprises an audio input, and [Manfredi, Figure 5, input of “Voice” to “Phone 102.”]
wherein the first feature comprises a tone of voice associated with the user. [Manfredi, Table 1, [0086]: the emotional analysis of Voice includes, among its “main Factors,” “a) Alteration of voice tone from initial value or from reference one.”]
McDuff and Manfredi pertain to evaluation of emotion of a user while interacting with a machine and responding to the user/speaker based on the parameters corresponding to the user’s conversational style and emotion and it would have been obvious to modify the system of McDuff which receives speech as input and evaluates speech but does not elaborate on the evaluation of prosodical features of speech (such as tone of voice) with the system of Manfredi which is more elaborate with respect to the evaluation of prosodical aspects of speech, and specifically mentions “tone,” for completeness.  This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.

Regarding Claim 6, McDuff teaches:
6. The computer-implemented method of claim 4, 
wherein the first input comprises a video input, and [McDuff, “Video Input 410.”]
wherein the first feature comprises a facial expression made by the user. [McDuff, “Face Detector 412” to “Facial Expression Recognizer 416” which feeds the “Emotion and Head Pose Synthesizer 420.”]

Regarding Claim 7, McDuff teaches a “Prosody Recognizer 208” and teaches the elements of “prosody” as: “[0042] The prosody style extractor 218 uses the acoustic variables identified from the speech 104 of the user 102 to modify the utterance of the conversational agent. The prosody style extractor 218 may modify that SSML file to adjust the pitch, loudness, and speech rate of the conversational agent's utterances. For example, the representation of the utterance may include five different levels for both pitch and loudness (or a greater or lesser number of variations). …”   [See also [0006] and [0029].  McDuff uses prosody to match the prosody style of the synthesized output to that of the user and does not discuss determination of emotion/sentiment from prosody.]
Manfredi more expressly teaches the “spectrum of emotion types”:
7. The computer-implemented method of claim 1, wherein the step of determining the first emotional state of the user comprises the steps of: 
determining a first valence value based on the first input that indicates a location within a spectrum of emotion types; and [Manfredi is directed to: “[0007] … a digital assistant that detects user emotion and modifies its behavior accordingly….”  Manfredi provides a “spectrum” of emotions and calculates the particular emotion of the speaker as a combination of different emotions.  “[0011] In one embodiment, various primary emotional input indicators are combined to determine a more complex emotion or secondary emotional state. For example, primary emotions may include fear, disgust, anger, joy, etc. Secondary emotions may include outrage, cruelty, betrayal, disappointment, etc. If there is ambiguity because of different emotional inputs, additional prompting, as described above, can be used to resolve the ambiguity.”  “Janus” is the name of the Virtual Assistant.]
determining a first intensity value based on the first input that indicates a location within a range of intensities corresponding to the location within the spectrum of emotion types. [Manfredi, See Figure 3 teaching that the percentage/ “first intensity value” of each emotion, within a range of 100%, in the speech is determined:  “[0092] FIG. 3 is a diagram of an embodiment of an array which is passed to Janus as a result of neural network computation. Each position of the array represents a basic emotion. For each basic emotion a percentage (e.g., 37.9% fear, 8.2% disgust, etc.) is provided to the other modules. In this case, the values represent a situation of surprise and fear…..”]
McDuff and Manfredi pertain to evaluation of emotion of a user while interacting with a virtual personal assistant and it would have been obvious to modify the system of McDuff which receives speech as input and evaluates sentiment with the system of Manfredi which expressly shows the types of emotions and their combinations to form secondary emotions for more granularity and particularity.  This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
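For illustration only, the mapping of the Figure 3 array of Manfredi to the claimed “location within a spectrum of emotion types” and “first intensity value” can be sketched as follows.  This is a minimal sketch under stated assumptions and is not code from Manfredi: the function name dominant_emotion and the dictionary layout are hypothetical, and only the 37.9% fear and 8.2% disgust values are taken from the quoted paragraph [0092].

```python
from typing import Dict, Tuple


def dominant_emotion(emotion_percentages: Dict[str, float]) -> Tuple[str, float]:
    """Return the (emotion type, intensity) pair with the largest percentage.

    Hypothetical helper: the emotion type stands in for the location within
    the spectrum of emotion types, and the percentage stands in for the
    intensity value within a 0-100% range.
    """
    if not emotion_percentages:
        raise ValueError("emotion array is empty")
    for name, pct in emotion_percentages.items():
        if not 0.0 <= pct <= 100.0:
            raise ValueError(f"percentage for {name!r} is outside 0-100")
    # Pick the basic emotion with the highest percentage.
    emotion = max(emotion_percentages, key=emotion_percentages.get)
    return emotion, emotion_percentages[emotion]


# The two values below are the ones quoted from [0092]; a full array would
# hold one entry per basic emotion.
emotion_type, intensity = dominant_emotion({"fear": 37.9, "disgust": 8.2})
```

A single dominant entry is only one possible reading; Manfredi also combines primary emotions into secondary emotions, which this sketch does not attempt.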

Regarding Claim 8, McDuff teaches that the output voice is modified to match the linguistic style and speech of the user and his emotion.  McDuff does not elaborate on the emotion being derived from voice.
Manfredi teaches:
8. The computer-implemented method of claim 7, wherein the first emotional component corresponds to the first valence value and the first intensity value. [Manfredi teaches that the voice of the Virtual Assistant (Janus) including the “first emotional component” is adjusted according to the emotion (and its intensity) obtained from the input voice of the user.  “[0147] There are well-established techniques to obtain a user's emotive state from the vocal spectrum. ….”  “7. A virtual assistant comprising: a user input device for providing input information from a user; an emotion detection module configured to detect a user's emotion from said input information; a core module for producing a virtual assistant emotion for the virtual assistant based on said user's emotion.”  See [0202] to [0225] regarding the type of primary and secondary emotions that are detected and calculated.  Then:  “[0230] An AI engine (based on neural networks) is to compute VA's emotion (selected among catalogued emotions, see § "User's emotion calculation") with regard to …”  “Emotional Status Definition” is done by:  “[0119] This is performed in two ways: [0120] by extracting emotional valence by proposed stimulus (valence is a static value previously allocated to stimulus) [0121] by dynamically deducing a emotional status by dialogue flow status and by context”  “Virtual Assistant’s Emotion Calculation” is based on:  “[0235] Emotive valence of discussed subject (taken from knowledge base)” “[0236] Emotive valence of answer to be provided (taken from knowledge base)”.]
Rationale for combination as provided for Claim 7.

Regarding Claim 9, McDuff suggests this feature but Manfredi was cited and Manfredi teaches: 
9. The computer-implemented method of claim 7, wherein the first emotional component corresponds to at least one of a second valence value or a second intensity value. [Manfredi, Figure 3, the percentages of various primary emotions are calculated and the secondary emotions are calculated based on the primary emotions.  The percentage pertaining to a second type of primary emotion can be mapped to “a second intensity value.”  The “first emotional component” is the emotion reflected in the voice of the Virtual Assistant and it is based on all the first, second, etc. intensity values of different types of primary emotions.  The secondary emotions that are calculated as a combination of primary emotions, also have a % associated with them, and can also teach the “second intensity value” of the Claim:  “[0027] … So, if the system has computed a state of disappointment at 57% (combination of basic emotions of Sadness and Surprise) than VA could directly ask: "Are you disappointed by my answer?".”]
Rationale for combination as provided for Claim 7.

Claim 11 is a computer program product system claim with limitations corresponding to the limitations of method Claim 1 and is rejected under similar rationale.  Additionally:
11. A non-transitory computer-readable medium storing program instructions that, when executed by a processor, cause the processor to interact with a user while assisting the user by performing the steps of: [McDuff, Figure 1, “Local Computing Device 110.”  “Processor 112” and “Memory 114.” Figure 7, “Processor 702” and “Memory 704.”  “[0118] Computer-readable media can also store instructions executable by external processing units such as by an external CPU ….” “[0156] Clause 8. A computer-readable storage medium having computer-executable instructions stored thereupon, when executed by one or more processors of a computing system, cause the computing system to perform the method of any of clauses 1-6.”]
…

Claim 14 is a computer program product system claim with limitations corresponding to the limitations of method Claims 4 and 5 OR 6 and is rejected under similar rationale.
14. The non-transitory computer-readable medium of claim 11, wherein the step of determining the first emotional state of the user comprises the steps of:
 determining a first feature of the first input; and [Claim 4]
determining a first type of emotion corresponding to the first feature, [Claim 4]
wherein the first feature comprises a tone of voice associated with the user or a facial expression made by the user. [Claim 5 Or Claim 6]
This Claim has the scope of Claim 6 and is rejected under similar mapping.

Claim 15 is a computer program product system claim with limitations corresponding to the limitations of method Claim 7 and is rejected under similar rationale.

Regarding Claim 18, McDuff teaches:
18. The non-transitory computer-readable medium of claim 11, 
wherein the step of generating the first vocalization comprises the steps of combining the first emotional component with a first semantic component. [McDuff, Figure 4, the “synthesized output 422” uses the “semantics” / meaning of the input speech, obtained at “Text Sentiment Recognizer 404” combined with the emotion obtained from the “Facial Expression Recognizer 416” to generate the output.]

Regarding Claim 19, McDuff teaches:
19. The non-transitory computer-readable medium of claim 18, further comprising: 
generating a transcription of the first input that indicates one or more semantic components included in the first input; and  [McDuff, Figure 4, “Text Sentiment Recognizer 404” determines sentiment from the meaning of the text of the speech.  “[0065] The text sentiment recognizer 404 recognizes sentiments in the content of an input by the user 102. The sentiment as identified by the text sentiment recognizer 404 may be a part of the conversational context. The input is not limited to the user's 102 speech 104 but may include of the forms of input such as text (e.g., typed on the keyboard 310 or entered using any other type of input device). Text output by the speech recognizer 206 or text entered as text is processed by the text sentiment recognizer 404 according to any suitable sentiment analysis technique. Sentiment analysis makes use of natural language processing, text analysis, and computational linguistics, to systematically identify, extract, and quantify affective states and subjective information. The sentiment of the text may be identified using a classifier model trained on a large number of labeled utterances. The sentiment may be mapped to categories such as positive, neutral, and negative. Alternatively, the model used for sentiment analysis may include a greater number of classifications such as specific emotions like anger, disgust, fear, joy, sadness, surprise, and neutral. The text sentiment recognizer 404 is a point of crossover from the audio pipeline to the visual pipeline and is discussed more below.”]
generating the first semantic component based on the one or more semantic components. [McDuff, Figure 5, the “Identify A Sentiment of the User’s Input 512” feeds the “Generate a Synthetic Facial Expression 616” in Figure 6 in addition to “Generate a Response Dialogue 514” in Figure 5.]

Claim 20 is a system claim with limitations corresponding to the limitations of method Claim 1 and is rejected under similar rationale.  Additionally:
20. A system, comprising: 
a memory storing a software application; and [McDuff, Figure 1, “Local Computing Device 110.”  “Memory 114.” Figure 7, “Memory 704.”  “[0118] Computer-readable media can also store instructions executable by external processing units such as by an external CPU ….”   “[0155] Clause 7. A system comprising one or more processors and memory storing instructions that, when executed by the one or more processors, cause the one or more processors perform the method of any of clauses 1-6.”]
a processor that, when executing the software application, [McDuff, Figure 1, “Local Computing Device 110.”  “Processor 112.”   Figure 7, “Processors 702.”]
is configured to perform the steps of: 
…

Claims 2-3, 12-13, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over McDuff and Manfredi in view of Kim (U.S. 20190355351).
Regarding Claim 2, McDuff teaches:
2. The computer-implemented method of claim 1, 
wherein the user resides within a vehicle where the first input is captured, and [McDuff teaches that the “local computing device 106” of Figure 1 may be “a vehicle computing system” ([0021]) and also teaches that the User 102 may be driving:  “[0076] The emotion identified by the facial expression recognizer 416 may be provided to the conversational style manager 402 to modify the utterance of the embodied conversational agent 302. … For example, a forward-facing camera on a smartphone may provide the video input 410 of the user's 102 face, but the conversational agent app on the smartphone may provide audio-only output without displaying an embodied conversational agent 302 (e.g., in a "driving mode" that is designed to minimize visual distractions to a user 102 who is operating vehicle).”]
wherein the first operation is performed on behalf of the user by a vehicle subsystem included in the vehicle. 
McDuff teaches that the “local computing device 106” of Figure 1 may be “a vehicle computing system” ([0021]) and that it may be used in a “driving mode” ([0076]). Accordingly, McDuff at the least suggests that the operations performed by the computing device would pertain to “a vehicle subsystem.”
McDuff does not teach this expressly.
Instant Application, “Description of Related Art” includes an example of the VPA (virtual personal assistant) activating the air conditioner of the car and another example of the VPA changing the volume of the car radio.  (Published Application [0003]-[0004].) Accordingly, the combination of the Applicant’s Admitted Prior Art and McDuff can teach this Claim.
Manfredi does not mention the use of its system in a car.

Kim teaches:
wherein the user resides within a vehicle where the first input is captured, and [Kim, Figure 1, “user 162” inside the “vehicle 160.”  “[0019] The processor 102, the memory 104, and the sensors 110, 112 are implemented in a vehicle 160, such as a car….”]
wherein the first operation is performed on behalf of the user by a vehicle subsystem included in the vehicle. [Kim logs and responds to user reaction to an operation of a vehicle subsystem such as navigation or turning on the radio.  “[0020] The memory 104 includes a mapping unit 130 and a user experience evaluation unit 132. The mapping unit 130 is executable by the processor 102 to map received commands, such as a command 142, into operations (also referred to as "tasks" or "skills") to be performed responsive to the command 142. Examples of skills that may be supported by the system 100 include "navigate to home," "turn on radio," "call Mom," or "find a gas station near me."….”  “[0022] The navigation engine 122 is configured to perform one or more operations associated with the vehicle 160. For example, the navigation engine 122 may be configured to determine a position of the vehicle 160 relative to one or more electronic maps, plot a route from a current location to a user-selected location, or navigate the vehicle 160 (e.g., in an autonomous mode of vehicle operation), as illustrative, non-limiting examples.”  “[0023] The experience manager 124 is configured to receive the experience data 146 from the user experience evaluation unit 132. The experience data 146 may include a classifier of the user experience as "good" or "bad" (e.g., data having a value between 0 and 1, with a "1" value indicating the user experience is positive and a "0" value indicating the user experience is negative)….”]
McDuff/Manfredi and Kim pertain to evaluation of emotion of a user while interacting with a virtual personal assistant and responding to the user/speaker accordingly and it would have been obvious to modify the system of McDuff/Manfredi which can be used in a car with the system of Kim that specifies that the functions/operations requested from the VPA are functions pertaining to the vehicle in order to draw vehicle-related utility from the VPA.  This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
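For illustration only, the command-to-skill mapping and the “good”/“bad” experience classification quoted from Kim [0020] and [0023] can be sketched as follows.  This is a minimal sketch under stated assumptions and is not code from Kim: the names SUPPORTED_SKILLS, map_command_to_skill, and classify_experience are hypothetical, the exact-match lookup is a simplification, and the 0.5 cut-off is an assumption, since Kim only describes the two endpoint values 0 and 1.

```python
from typing import Optional

# Example skills quoted from Kim [0020]; the tuple itself is a hypothetical
# stand-in for the mapping unit 130.
SUPPORTED_SKILLS = (
    "navigate to home",
    "turn on radio",
    "call Mom",
    "find a gas station near me",
)


def map_command_to_skill(command: str) -> Optional[str]:
    """Map a received command to a supported skill, or None if unsupported."""
    # Normalize case and whitespace before comparing.
    normalized = " ".join(command.lower().split())
    for skill in SUPPORTED_SKILLS:
        if skill.lower() == normalized:
            return skill
    return None


def classify_experience(score: float, threshold: float = 0.5) -> str:
    """Classify experience data in [0, 1] as "good" or "bad".

    The threshold is an assumed value; Kim states only that 1 indicates a
    positive experience and 0 indicates a negative experience.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("experience score must lie between 0 and 1")
    return "good" if score >= threshold else "bad"
```

Under these assumptions, a command such as "turn on radio" resolves to a vehicle-related skill, and a score near 0 after the skill is performed would be the condition under which a remedial action is considered.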

Regarding Claim 3, McDuff teaches, Figure 2, [0035]-[0037], a “custom intent recognizer 214” which generates “domain-specific scripted dialogue.”  The emotion as detected in Figure 4 is used to impact the dialog that is generated in response.  The emotion does not change the performance of a physical operation.
Manfredi teaches that “…The detected emotion can be used for the commercial purposes the virtual assistant is helping the user with….”  Abstract.
Kim is expressly directed to this aspect and teaches:
3. The computer-implemented method of claim 1, further comprising the steps of: 
determining the first operation based on the first emotional state; and [Kim, Figure 3, “evaluate user experience 306” and “perform remedial action 308.”  Kim teaches that an operation such as playing soothing music, adjusting the voice interface to speak calmly, or driving to the house of the user’s sister makes the user feel better.  So, if the emotion detected is negative, the device suggests the operation of driving to the sister’s house.  “[0027] In another example, the remedial action 126 is selected to reduce a negative aspect of the user experience by improving a mood of the user 162. To illustrate, the remedial action 126 may include one or more of playing soothing music, adjusting a voice interface to speak to the user 162 in a calming manner or to have a calming effect, or recommending a relaxing activity for the user 162….”  “[0028] … As an example, the processor 102 may determine, during analysis of a history of interactions with the user 162, a high correlation between travelling to a house of a sister of the user 162 and a detected transition from a negative user experience to a positive user experience. As a result, the experience manager 124 may generate an output to be presented to the user 162, such as "Would you like to visit your sister today?" as the remedial action 126….”]
performing the first operation to assist the user. [Kim, Figure 3, “perform remedial action 308.”  “[0055] In the event that the user experience is evaluated to be a negative user experience, a remedial action is performed, at 308….”  “16. The method of claim 10, wherein the remedial action includes at least one of: playing soothing music, adjusting a voice interface to generate speech to have a calming effect, or recommending a relaxing activity for a user.”  [0026]-[0028].  Note also Figure 1, “Vehicle 160,” and “[0019] The processor 102, the memory 104, and the sensors 110, 112 are implemented in a vehicle 160, such as a car….”  “[0049] In response to receiving the non-audio input prompted by the GUI 218, the processor 102 is configured to process the user command. To illustrate, when the user command corresponds to a car-related command (e.g., "go home"), the processor 102 may process the user command by performing the user-selected skill to control the car (e.g., a navigation task to route the car to a "home" location).”]
McDuff/Manfredi and Kim pertain to evaluation of emotion of a user while interacting with a virtual personal assistant and responding to the user/speaker based on the parameters corresponding to the user’s emotion and it would have been obvious to modify the system of McDuff/Manfredi which adjusts the output speech of the VPA to the style and emotion of the user with the system of Kim which actually performs a remedial action in response to detecting a negative user emotion.  This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.

Claim 12 is a computer program product system claim with limitations corresponding to the limitations of method Claim 2 and is rejected under similar rationale.

Claim 13 is a computer program product system claim with limitations corresponding to the limitations of method Claim 3 and is rejected under similar rationale.

Regarding Claim 17, McDuff teaches that the “synthesized output 422” of Figure 4 is being modified according to the input (voice and image) by the “user 102.”  Accordingly, the output will adjust according to the next incoming input.  Additionally, “dialogue” implies a back and forth and more than one turn of speech.  
Manfredi expressly includes the term “dialogue.”  “[0018] Embodiments of the present invention provide a Software Anthropomorphous (human-like) Agent able to hold a dialogue with human end-users in order to both identify their need and provide the best response to it. This is accomplished by means of the agent's capability to manage a natural dialogue. The dialogue both (1) collects and passes on informative content as well as (2) provides emotional elements typical of a common conversation between humans. This is done using a homogeneous mode (way) communication technology.”  Manfredi includes training which means modifying the response according to inputs and reactions but does not include the express steps of modifying the response in the same dialog right then.
Kim expressly teaches:
17. The non-transitory computer-readable medium of claim 16, further comprising the steps of: 
capturing a second input that indicates at least one behavior the user performs in response to the outputting of the first vocalization; and [Kim, Figure 3 shows a training process where the models are updated based on second, third, etc. inputs by the user.  The user inputs (Figure 2) are in response to the output by the device because Kim is directed to “User Experience Evaluation” in response to a task performed corresponding to a user input command.  The first user input is the command; the second user input is the user’s reaction to the performance of the command.]
modifying the response mapping based on the second input and a first objective function that is evaluated to determine how closely the at least one behavior corresponds to a target behavior. [Kim, Figures 2 and 3.  Figure 2 shows a “Mapping Unit 130” which maps the commands to skills/tasks and a “User experience evaluation unit 132” which includes an “emotion analyzer 266” for analyzing the emotion of the user in response to the experience of receiving the result.  Figure 3 shows that based on the response of the user both “User Experience Model” and “Skill Model” are updated/modified.  The “Skill Model” teaches the “Response Mapping” because it determines which function/skill will be invoked in response to a command.  The “User Experience Model” teaches the “Objective Function” because it determines, based on the detected emotion of the user (See Figure 4, 3rd and 4th processing stages 406, 408), how close the emotion/behavior of the user is to the target behavior/emotion of a “good experience” (Figure 4, 450).  See [0056]-[0057] and “[0068] … The user experience model and the skill matching model may be updated based on the user feedback, such as described with reference to FIG. 3.”]
McDuff/Manfredi and Kim pertain to evaluation of emotion of a user while interacting with a virtual personal assistant and adjusting the response.  It would have been obvious to combine the evaluation of the user experience/behavior in response to the task (skill) detected and performed by the Virtual Assistant and update of the model that shows how satisfied the user is (User behavior Corresponding to Target behavior of satisfaction) from Kim with the system of McDuff/Manfredi in order to improve the accuracy of the model that detects the emotion of the user.  This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1, 11, and 20 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Young (U.S. 10,817,316).
Regarding Claim 1, Young teaches:
1. A computer-implemented method for causing a virtual personal assistant to interact with a user while assisting the user, [Young, Title: “Virtual assistant mood tracking and adaptive responses.”  Figure 1A, Figure 1B, “Virtual Assistant 150.”]
the method comprising: 
capturing, by an input device of the virtual personal assistant, a first input that indicates one or more behaviors associated with the user; [Young, Figure 2, “receive user input 205.”  Figure 1B shows the modes of input as including “voice 146” and “text 148” and Figure 3 shows “input device 312.”  The “behavior” is speech or entry of text or a loud voice of the user or a keyword in the speech.  “In the method 200 shown in FIG. 2, the system (e.g., server computer system 110 in FIG. 1A) receives an input from a user directed to a virtual assistant operating on the system (205). A variety of inputs from the user may be received, such as a request for information from the virtual assistant (e.g., "where is the closest restaurant?", "what is the balance of my checking account?", etc.), and/or a request for the virtual assistant to perform a task ("reserve a table for me at the restaurant you just identified," "move $100 from savings to checking," etc.). Inputs from a user may be received in a variety of different formats, such as text and audio.”  Col. 3, lines 55-67.]
determining, by a processor of the virtual personal assistant, a first emotional state of the user based on the first input; [Young, Figure 2, “predict mood of user 215.”  Figure 3, “processor 302.”  All of column 4 and to top half of column 5 describe the various and varied factors used in determining the “emotional state of the user.”  “The system may also perform a voice stress analysis on a user's audio input. In some embodiments, the VA system compares the user's latest voice input to a baseline recording of the user's voice. Machine learning techniques are used to determine, based on the VA's prior interactions with the user, the manner in which different voice stress conditions reflect the mood of different users. For example, one user may naturally speak loudly, while a second user raising their voice may be determined to be indicative of the user being upset or angry.”  Col. 4, lines 29-38.]
selecting, by the processor from a response mapping, a first emotional component for changing the first emotional state of the user to a target emotional state of the user, [In Young, the goal is to gauge the emotion of the user and provide a response to which the user will be “receptive” and that would “change the emotional state of the user” from angry to less angry.  Col. 1, lines 12-17.]  [(Note the Specification defines “emotional component” as “[0033] … For example, and without limitation, a given emotional component 236 could include a specific pitch, tone, timbre, volume, diction speed, and/or annunciation level with which the vocalization should be synthesized to reflect particular emotional qualities and/or attributes.”)]
wherein the response mapping includes a machine learning model stored in a memory of the virtual personal assistant, the machine learning model including a plurality of mappings, each of the plurality of mappings including a respective emotional component for changing a respective emotional state of the user to a respective target emotional state for the user; [Young uses a “machine-learned model” to evaluate the degree of “receptivity” / “target emotion” associated with different responses that are generated based on the current emotion of a user:  “1….  generating a plurality of responses to the received input based on the predicted mood of the first user; determining, for each particular one of the plurality of responses, a probability that the response will be well received by the first user, the probability using a machine-learned model trained to find a correlation between receptivity and mood; selecting a response from the plurality of response that has a highest probability that the response will be well received ….”   “…  The VA system may utilize machine learning techniques to find a correlation between receptivity and mood for particular users.”  Col. 5, lines 30-40.  “For example, a VA overseeing a user's financial transactions may fail to recognize a user is upset or angry, and deliver an inappropriate (if perhaps accurate) response to a question or request from the user, thus further antagonizing the user. Embodiments of the present disclosure address these and other issues.”  Col. 1, lines 12-17.]
generating, by the processor, a first vocalization that incorporates the first emotional component, [Young, Figure 2, “generate response 220.”  The response may be in different formats which includes voice/vocalization and the goal is for the response to have “high receptivity” by the user.  “The system may generate (220) a variety of different types of responses, different formats of responses, and different content within the responses…. The system may automatically pick the response having the highest likelihood to be received well by the user (i.e., the 80% probability response), or it may select from responses that have a probability of acceptance that meets or exceeds a threshold (e.g., either the 60% or 80% response where the minimum threshold is 60%).”  Col. 5, lines 11-30.]
wherein the first vocalization relates to a first operation that is being performed to assist the user; and [Young, Figure 2, “…In various embodiments, the system generates content and responses, and performs tasks and other actions based at least in part on the determined mood of the user….” Col. 5, lines 11-30.]
outputting, by an audio output device of the virtual personal assistant, the first vocalization to the user. [Young, Figure 1B, “response 155” and Figure 2, “provide response 225.”  “…The system may provide (225) a response to the user in a variety of different ways. In some embodiments, the system provides a response to a user's input in the same format (e.g., audio, text, etc.) as the input….”  Col. 5, lines 40-52.]
Claims 11 and 20 are counterparts of Claim 1; additionally, Young, Figure 1A, teaches the Processor 112 and Memory 114 that are recited in these other Claims.
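For illustration only, the response selection quoted from Young at Col. 5, lines 11-30 (pick the response with the highest probability of being well received, or select from the responses that meet or exceed a minimum threshold) can be sketched as follows. This sketch is not taken from Young or from the present application; the function name `select_response`, the tuple layout, and the 40% candidate are hypothetical, and only the 60% and 80% figures and the 60% threshold come from the quoted passage.

```python
from typing import List, Optional, Tuple

# Hypothetical layout: (response text, predicted probability of being well received).
Candidate = Tuple[str, float]


def select_response(
    candidates: List[Candidate],
    min_threshold: Optional[float] = None,
) -> List[Candidate]:
    """Return the responses that are eligible to be provided to the user.

    With no threshold, only the single highest-probability response is returned.
    With a threshold, every response whose probability meets or exceeds it is returned.
    """
    if not candidates:
        return []
    if min_threshold is None:
        # Highest likelihood of being well received.
        return [max(candidates, key=lambda c: c[1])]
    # All responses at or above the minimum threshold, in their original order.
    return [c for c in candidates if c[1] >= min_threshold]


# Illustrative candidates; the 0.40 entry is an assumption added for contrast.
candidates = [("response A", 0.40), ("response B", 0.60), ("response C", 0.80)]

best = select_response(candidates)
eligible = select_response(candidates, min_threshold=0.60)
```

Here `best` holds only the 80% response, while `eligible` holds both the 60% and 80% responses, matching the two alternatives in the quoted passage.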
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Cabrera-Cord (U.S. 10554590).
Charlap (U.S. 11404170): Figure 4, “VA decision and empathy module 440.”
Horling (U.S. 11322143)


Hwang (U.S. 11423895)


Ito (U.S. 11186290): Emotion Inference Device … where the device is a motorcycle.
Un (U.S. 9786299)

Regarding Claim 1, Kim teaches:
1. A computer-implemented method for causing a virtual personal assistant to interact with a user while assisting the user, [Kim, Figure 1, showing the user 162 in the “vehicle 160.”  “[0019] The processor 102, the memory 104, and the sensors 110, 112 are implemented in a vehicle 160, such as a car. (In other implementations, the processor 102, the memory 104, and the sensors 110, 112 are implemented in other devices or systems, such as a smart speaker system or a mobile device, as described further below). The first sensor 110 and the second sensor 112 are each configured to capture user input received from a user 162, such as an operator of the vehicle 160….”  “[0087] One or more of the disclosed aspects may be implemented in a system or an apparatus, such as the device 600, that may include a communications device, a fixed location data unit, a mobile location data unit, a mobile phone, a cellular phone, a satellite phone, a computer, a tablet, a portable computer, a display device, a media player, or a desktop computer. Alternatively or additionally, the device 600 may include a set top box, an entertainment unit, a navigation device, a personal digital assistant (PDA) …”]
the method comprising: 
capturing, by an input device of the virtual personal assistant, a first input that indicates one or more behaviors associated with the user; [Kim, Figures 1 and 2.  “Sensors 110/112” capturing “user speech 108” as the “first input.”  “[0019] …For example, the first sensor 110 may include a microphone configured to capture user speech 108, and the second sensor 112 may include a camera configured to capture images or video of the user 162. The first sensor 110 and the second sensor 112 are configured to provide user input to the processor 102. For example, the first sensor 110 is configured to capture and provide to the processor 102 a first user input 140 (e.g., first audio data) indicating a user's command. The user speech 108 may be an utterance from the user 162, such as a driver or passenger of the vehicle 160. In a particular implementation, the first user input 140 corresponds to keyword-independent speech (e.g., speech that does not include a keyword as the first word). The second sensor 112 is configured to provide a second user input 152 (e.g., a video input including non-verbal user information) to the processor 102.”] [(Note the definition of “behavior” from the Specification:  “[0021] …  As referred to herein, a "behavior" includes any voluntary and/or involuntary actions performed by the user. For example, and without limitation, a "behavior" could include explicit commands issued by the user, facial expressions enacted by the user, changes in affect presented consciously or unconsciously by the user, as well as changes in user posture, heart rate, skin conductivity, pupil dilation, and so forth. Input devices 120 may include a wide variety of different types of sensors that are configured to capture different types of data that reflect behaviors associated with the user….”)]
determining, by a processor of the virtual personal assistant, a first emotional state of the user based on the first input; [Kim, Figure 2, “emotion analyzer 266” as part of the “user experience evaluation unit 132.”  “Processor 102.”  Figure 4, “Prosody Analysis 430” and “Keyword detection 432” and “video analytics 434” are used by the “emotion analysis 440” to determine the “emotional state” of the speaker based on his voice and words.  See [0062]-[0064].]
selecting, by the processor from a response mapping, a first emotional component for changing the first emotional state of the user to a target emotional state of the user, [Kim, Figures 2 and 4, “Experience Manager 124” including “remedial action 126” teaches “selecting … first emotional component,” which addresses the dissatisfaction / “emotional state of the user” and changes it to satisfaction / “target emotional state of the user.”  The selected “first emotional component” may be a soothing voice.] [(Note the Specification defines “emotional component” as “[0033] … For example, and without limitation, a given emotional component 236 could include a specific pitch, tone, timbre, volume, diction speed, and/or annunciation level with which the vocalization should be synthesized to reflect particular emotional qualities and/or attributes.”)]
wherein the response mapping includes a machine learning model stored in a memory of the virtual personal assistant, the machine learning model including a plurality of mappings, each of the plurality of mappings including a respective emotional component for changing a respective emotional state of the user to a respective target emotional state for the user; [Considering that the “remedial action” of Kim is in response to a detected emotion, and considering that Kim keeps monitoring the responses of the user and keeps updating its “user experience model 310” shown in Figure 3, Kim teaches a machine-learned model.  However, the model of Kim maps the input commands to the reaction of the user to the identification of the command by the machine in order to improve command detection.  It does not map an emotional output by the machine to an improvement in the emotion of the user.]
generating, by the processor, a first vocalization that incorporates the first emotional component, [Kim, Figure 2, “Remedial Action 126” as part of the “Experience Manager 124” incorporates the “calming” / “first emotional component” in the generated response and provides a response in a soothing voice.  “[0027] In another example, the remedial action 126 is selected to reduce a negative aspect of the user experience by improving a mood of the user 162. To illustrate, the remedial action 126 may include one or more of playing soothing music, adjusting a voice interface to speak to the user 162 in a calming manner or to have a calming effect, or recommending a relaxing activity for the user 162…”]
wherein the first vocalization relates to a first operation that is being performed to assist the user; and [Kim, Figure 2 or Figure 4.  The remedial action occurs in response to sensed user input which may be in response to his experience with some function/operation of the vehicle.  “[0020] The memory 104 includes a mapping unit 130 and a user experience evaluation unit 132. The mapping unit 130 is executable by the processor 102 to map received commands, such as a command 142, into operations (also referred to as "tasks" or "skills") to be performed responsive to the command 142. Examples of skills that may be supported by the system 100 include "navigate to home," "turn on radio," "call Mom," or "find a gas station near me." ….”]
outputting, by an audio output device of the virtual personal assistant, the first vocalization to the user. [Kim, Figure 2, “speaker 238” outputting “speech 209.”  “[0038] …  The speaker 238 may be configured to output audible information to the user 162, such as speech 209.”]

Regarding Claim 3, Young teaches:
3. The computer-implemented method of claim 1, further comprising the steps of: 
determining the first operation based on the first emotional state; and [Young teaches that the VA may take an action without the user necessarily providing an input and merely in response to the user’s emotional state and according to his profile:  “The system may provide (225) a response to the user in a variety of different ways. In some embodiments, the system provides a response to a user's input in the same format (e.g., audio, text, etc.) as the input. In this context, a "response" generally refers to any output provided by the system to the user. Accordingly, the virtual assistant system may provide a user information, perform a task, or take other action without a user necessarily providing any input. …..”  Col. 5, lines 40-52.]
performing the first operation to assist the user. [Young: “In some embodiments, the VA system uses its determination of a user's current or predicted future mood to determine whether to engage the user and, if so, how. Determining the likelihood that a response will be well-received by a user may vary depending on the user. For example, some users may be more receptive when angry, other users may prefer to be left alone when angry. The VA system may utilize machine learning techniques to find a correlation between receptivity and mood for particular users.”  Col. 5, lines 30-40.]

Any inquiry concerning this communication or earlier communications from the examiner should be directed to FARIBA SIRJANI whose telephone number is (571)270-1499. The examiner can normally be reached on 9 to 5, M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Desir, can be reached on 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Fariba Sirjani/
Primary Examiner, Art Unit 2659