Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Claims 1-20 are pending. Claims 1, 11, and 20 are independent and are amended.  Dependent Claims are amended to add “steps of” to the preamble. Claim 17 is amended to clarify that the second input is in response to the output of the first vocalization by the machine.
This Application was published as U.S. 2021/0304787.
Apparent priority: March 2020.

The Claims are abstract as provided in the 35 U.S.C. 101 rejection below.  They could pertain to two people engaged in a conversation and one adjusting his voice to the emotion detected in the voice of the other.  The claimed steps need to be tied to the technological environment and tools that are intended by the Application in an inseparable and indivisible form.  See rejection below. 
The amendments provide: “generating a first vocalization that incorporates a first emotional component based on the first emotional state and a target emotional state of the user,” which is not sufficient to overcome the rejection.  See below.
 Applicant was invited to participate in the DSMER Pilot Program but does not appear to have taken the DSMER route.  The invitation to participate in the DSMER Pilot Program only applies to the first action and is hereby withdrawn.

Applicant’s amendments and arguments are considered but are either unpersuasive or moot in view of the new grounds of rejection that, if presented, were necessitated by the amendments to the Claims.
This action is Final.
Response to Amendments and Arguments
1. A computer-implemented method for interacting with a user while assisting the user, the method comprising: 
capturing a first input that indicates one or more behaviors associated with the user; 
determining a first emotional state of the user based on the first input; 
generating a first vocalization that incorporates a first emotional component based on the first emotional state and a target emotional state of the user,
wherein the first vocalization relates to a first operation that is being performed to assist the user; and 
outputting the first vocalization to the user. 

Reply to Arguments regarding the 101 Rejection
The amendments to the independent Claims provide: “generating a first vocalization that incorporates a first emotional component based on the first emotional state and a target emotional state of the user,” which is not sufficient to overcome the rejection.  The previous mapping already provided that “Listener decides to console the Speaker or calm him down or do a little of each and therefore “incorporates” /takes into account the mood of the Speaker/User.”  Thus, the Listener took into account the current “angry” state of the Speaker and the target “calm” state when addressing the Speaker.
The rejection did not assert abstract idea based on a mathematical concept and thus Applicant’s arguments in this respect are not on point.  See Response at 8.
The rejection was on the basis of the Claim being directed to a “mental process” or “method of organizing human activity.”  
Applicant argues with respect to mental process:

[Image: excerpt of Applicant’s argument, media_image1.png]

Response at 8.
The arguments are conclusory and unpersuasive.
Mental Process is defined as concepts performed in the human mind (including an observation, evaluation, judgment, opinion).
The method may be performed in the mind of a human Listener by listening to and watching the “one or more behaviors …user.”  
The Claim includes no technological method or component for its “capture.”  Is there a “microphone” recording the voice of the user?  Is there a video camera capturing the image of the user?  Not that such components alone would have been sufficient but even such minor and well-known technological ties are absent.
The Claim provides no technological method or component for the “vocalization.”  Is there a speaker involved? 
The Claim provides no technological method or algorithm for the “generating a first vocalization that incorporates a first emotional component based on the first emotional state and a target emotional state of the user.”  How does it go from the capture of the emotional state to the vocalization?  What is the “first emotional component” and how is it arrived at?
The Claim provides no technological method or component for the “first operation that is performed to assist the user.”  Applicant mentions a “virtual personal assistant” in the arguments.  The Claim does not include a “virtual personal assistant.”  For all we know the vocalization may be by a grandmother who observes that her grandchild is in distress and says “there, there you are going to be ok” in order to take the grandchild from a state of distress to a state of calm.
All of the claimed steps may be performed by a human being listening to and watching another human being and saying something to soothe the other human being.  The Claim is starkly devoid of technological components and ties and thus Abstract.

Applicant argues with respect to methods of organizing human activity:
[Image: excerpt of Applicant’s argument, media_image2.png]

Response at 8.
Certain Methods of Organizing Human Activity include: “3- managing personal behavior or relationships or interactions between people (including social activities, teaching, and following rules or instructions).”
Applicant refers back to the “personal virtual assistant” that is entirely absent from the Claim language. 
Applicant argues that the claimed approach enables a virtual personal assistant to more accurately determine operations to perform on behalf of a user based on the user’s current and target emotional states and provides the example of use in a vehicle.  In reply, please note that neither a virtual personal assistant nor a vehicle is present in the Claim.  Additionally, even when added to the Claim, a tie must be established by providing technological details pertinent to a machine.  A mere reference to a technological component without establishing an indivisible nexus would be tantamount to:  this is what people normally do; we tell the machine to do it.  Details of analysis and operation that are specific to a machine must be in the Claim.  As is, the independent Claim lacks both the technological and machine-related components and any detail related to enabling a machine component to perform what humans generally do.

In the “B. Supplemental Alice Analysis,” the Applicant repeatedly mentions features, components, and functions that are not present in the Claim language.  Refer to the language of Claim 1 that is provided at the outset:  no “virtual personal assistant,” no “vehicle” or “driving” or “interaction with the vehicle features,” may be found in the claimed language.  There is no recitation of steps that would lead to a “more accurate determination of operations to perform.”  Rather the Claim jumps from “determining a first emotional state of the user” to “generating a first vocalization that … relates to a first operation that is being performed to assist the user.”  The Claim indeed lacks so many connections between the steps that it cannot be performed by anything other than a human being who knows what he is doing and is not in need of instructions.  

See the following for examples of the gaps and missing particularities that render the Claim inordinately broad and abstract:
1. A computer-implemented method for interacting with a user while assisting the user,  [Where is the user?  What is the user doing?  With what does he need assistance?  Is there a car involved?  Is there a VPA involved?  They need to be in the Claim.]
the method comprising: 
capturing a first input that indicates one or more behaviors associated with the user; [What type of input?  Is the user texting his input via keyboard?  No microphone; no camera; no specifying of what the input might be?  What are the “behaviors” associated with the user?  Is it his voice? His image?  His gestures?]
determining a first emotional state of the user based on the first input; [How?  What is the Claim analyzing and according to which algorithm to arrive at this “determination”?  Is it extracting acoustic features of the voice of the user as the user is providing a command to the VPA?  How are the acoustic features being analyzed?  Is it conducting semantic analysis on the content of the speech?  Is there speech recognition involved?  Is the content of speech subject to keyword search?  Is there a more complex natural language processing involved?  Is the emotional state determined from a frown extracted from the visual features of an image of the user?  Is it determined from motions and gestures of the user obtained by a video camera?  How are such images analyzed?]
generating a first vocalization that incorporates a first emotional component based on the first emotional state and a target emotional state of the user, [How is the target determined?  Is it input to the system?  Does the machine decide what target emotional state is desirable?  On what basis?  Is the user driving?  The Claim does not say that.  Is the target state always a state of calm?  The Claim says nothing about how the target emotional state is arrived at.  What is the “first emotional component”?  What type of thing is it?  Is it a sound? An image? A haptic sensation? A drug injected into the driver?]
wherein the first vocalization relates to a first operation that is being performed to assist the user; and [What operation?  So far we don’t know where the user is or what he is doing.  For all we know he could be skiing in the Alps.  Is the Claim providing ski instructions?]
outputting the first vocalization to the user. [With what?]

Further note: while some of the above-posed questions are addressed in some of the dependent Claims (e.g. Claim 2 makes clear the user is in a vehicle but still does not mention a VPA or clarify what the role of the user is in said vehicle) a critical mass of particularity is required to overcome the Abstract Idea rejection and such particularity was not found in the dependent Claims.  The scene in which the Claims operate needs to be set in the independent Claim and after that the dependent Claims may add details to the scene.  The scene is not set in the independent Claim and therefore the details that come in through the dependent Claims happen to flail.

Reply to Arguments regarding the 103 Rejection
	With respect to the obviousness rejection, the Applicant focuses on the primary reference McDuff and argues that McDuff does not teach the added limitation.  Response at 14-15.
	The added limitation is addressed by modified grounds of rejection that render the arguments moot.
McDuff teaches in Figure 5, “512: identify sentiment of the user’s input” and “generate a response dialogue 514,” that the response by the machine is generated in response to the sensed emotion of the user.  McDuff teaches:  “[0093] … The prosodic qualities of the response dialogue may also be modified based on a facial expression of the user 102 if that data is available. For example, if the user 102 is making a sad face, the tone of the response dialogue may be lowered to make the conversational agent also sound sad….”  This teaching suggests a type of adjusting the machine response to fit a target emotional state as claimed; it may be that when the user is sad, the machine ought not be too jittery and impertinent to further aggravate a sad user.
	McDuff does not expressly teach that the response by the machine is calculated to put the user in a good mood.
	Kim, which is a 102 reference, is added but is applied as a secondary reference to maintain the previous grounds of rejection.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to a judicial exception (i.e., a law of nature, a natural phenomenon, or an abstract idea) without significantly more. The claim(s) does/do not include additional elements that are sufficient to amount to significantly more than the judicial exception.
Under step 2A, prong 1, the claims fall under mental processes thus falling within a judicial exception (one person talking to another and helping him out with turning the radio on in a car).  Under step 2A, prong 2, the judicial exception needs to be integrated into a practical application. The additional limitation here is simply a computer as noted in the preamble. This is a mere attempt to “apply” the steps to a computer. Therefore, the claims are directed to an abstract idea. Under step 2B, the additional limitations, as noted earlier with prong 2, include a mere attempt to apply the exception using a generic computing component which does not result in an inventive step.
Claim 1 is a generic automation of a mental process since a human passenger can sense the emotional state of another passenger and adjust his or her behavior accordingly. We have a new question (prong 2 of step 2A) in the 101 analysis that asks whether the abstract idea is integrated with a practical application. The answer is no in this instance because there is no technological solution in the Claim that “integrates” the abstract idea. The Claim only suggests that the abstract idea be applied. It does not describe an application.   Tie the steps to machine components (microphone; NLP, speaker, spectral analyzer; etc.) and tie them tightly.

Step 1: The independent Claims are directed to statutory categories: 
Claim 1 is a method claim and directed to the process category of patentable subject matter.
Claim 11 is a computer-readable-storage device claim and is directed to the machine or manufacture category of patentable subject matter.
Claim 20 is a system claim and directed to the machine or manufacture category of patentable subject matter.

Step 2A, Prong One: Does the Claim recite a Judicially Recognized Exception (Abstract Idea)? Are these Claims nevertheless considered Abstract as a Mathematical Concept (mathematical relationships, mathematical formulas or equations, mathematical calculations), Mental Process (concepts performed in the human mind, including an observation, evaluation, judgment, opinion), or Certain Methods of Organizing Human Activity (1-fundamental economic principles or practices (including hedging, insurance, mitigating risk), 2-commercial or legal interactions (including agreements in the form of contracts; legal obligations; advertising, marketing or sales activities or behaviors; business relations), 3- managing personal behavior or relationships or interactions between people (including social activities, teaching, and following rules or instructions)), and thus fall under the judicial exception to patentable subject matter?
The rejected Claims are directed to Mental Processes or Methods of Organizing Human Activity.
Step 2A, Prong Two: Additional Elements that Integrate the Judicial Exception into a Practical Application? This prong identifies whether there are any additional elements recited in the claim beyond the judicial exception(s), and evaluates those additional elements to determine whether they integrate the exception into a practical application of the exception. “Integration into a practical application” requires an additional element(s) or a combination of additional elements in the claim to apply, rely on, or use the judicial exception in a manner that imposes a meaningful limit on the judicial exception, such that the claim is more than a drafting effort designed to monopolize the exception. This prong uses the considerations laid out by the Supreme Court and the Federal Circuit to evaluate whether the judicial exception is integrated into a practical application.
The rejected Claims do not include additional limitations that point to integration of the abstract idea into a practical application.
Claim 1 is a generic automation of a mental process since a human agent can sense the emotional state of a customer and adjust his or her behavior accordingly. We have a new question (prong 2 of step 2A) in the 101 analysis that asks whether the abstract idea is integrated with a practical application. The answer is no in this instance because there is no technological solution in the Claim that “integrates” the abstract idea. The Claim only suggests that the abstract idea be applied. It does not describe an application. 

1. A computer-implemented method for interacting with a user while assisting the user, the method comprising: 
capturing a first input that indicates one or more behaviors associated with the user; [Listener is listening to Speaker/User and captures the User’s behavior based on what the User says.]
determining a first emotional state of the user based on the first input; [Listener can tell that the Speaker is angry or sad based on what the Speaker/User says or based on his tone of voice.]
generating a first vocalization that incorporates a first emotional component based on the first emotional state and a target emotional state of the user, [Listener decides to console the Speaker or calm him down or do a little of each and therefore “incorporates” /takes into account the mood of the Speaker/User.]
wherein the first vocalization relates to a first operation that is being performed to assist the user; and [Listener is trying to help the Speaker/User to do something /perform a first operation and what the Listener says pertains to that operation.]
outputting the first vocalization to the user. [Listener speaks out.]

Step 2B: Search for Inventive Concept: Additional Elements Do Not Amount to Significantly More: Claim 1 has no extra limitation and the limitations of “memory” and “processor,” in the system Claim 20, or the limitation of “computer readable medium” in Claim 11 are well-understood, routine, and conventional machine components that are being used for their conventional and rather generic functions. Additionally, these limitations are expressed parenthetically and lack nexus to the Claim language and as such are a separable and divisible mention of a machine. Accordingly, they are not sufficient to cause the Claim to amount to significantly more than the underlying abstract idea. 
The technological aspect has to be claimed with more particularity and woven into the fabric of the Claim. The Claim has to integrate the abstract idea into a technological application. Not: here is the idea of listening to someone and consoling them and here is “computer-implemented” stated broadly, and we put them together. The “How” of each step and the definition of the terms in the Claim if added contribute to possibly integrating the abstract idea into a technological application.
The Dependent Claims do not add limitations that could help the Claim as a whole to amount to significantly more than the Abstract idea identified for the Independent Claim.  For example:
In Claim 2, the User may be trying to do something in his car and the Passenger can talk to him and help him.  
Set the scene from the beginning:  the method is to provide spoken natural language assistance to a driver or passenger of a vehicle who is engaged in using a feature of the vehicle and expresses his frustration or his question by speaking to the car and whose voice is captured via a microphone and analyzed by a natural language processing software or his captured voice is converted to the spectral domain and the spectral content is analyzed and the response is output by the car through a speech synthesizer.  Say something about the machine components that are involved; their respective roles; their interface with the physical environment of the car.  This application is about an automobile and the term “vehicle” is introduced only in Claim 2 and the other claims are not even in the chain of dependency. 
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1-6, 11-14, and 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over McDuff (U.S. 20200279553) in view of Kim (U.S. 20190355351).
Regarding Claim 1, McDuff teaches:
1. A computer-implemented method for interacting with a user while assisting the user, [McDuff, Figure 1, “Local Computing Device 106.”]
the method comprising: 
capturing a first input that indicates one or more behaviors associated with the user; [McDuff, Figure 1, “Communication Interface 116.”  “User 102” providing “Speech 104” and Figure 5, “Determine Linguistic Style 508” and “Identify Sentiment of User’s Input 512” both teach “behaviors associated with the user” of the Claim.]
determining a first emotional state of the user based on the first input; [McDuff, Figure 5, “Identify Sentiment of User’s Input 512” and Figure 7, “Sentiment Analysis Module 716.”  Figure 4, “Text Sentiment Recognizer 404.”  [0065].  “Facial Expression Recognizer 416.”  [0075].]
generating a first vocalization that incorporates a first emotional component based on the first emotional state and a target emotional state of the user, [McDuff, Figure 4, “Synthesized Output 422” which is fed by the “Emotion and Head Pose Synthesizer 420” which takes into account the sentiment of the User 102 as provided by the “text sentiment recognizer 404” and the “facial expression recognizer 416.”  [0079]-[0080].]
wherein the first vocalization relates to a first operation that is being performed to assist the user; and [McDuff, the “Local Computing Device 106” is a “Virtual Personal Assistant” and therefore it “performs” some “operation” that assists the “User 102”:  “[0021] FIG. 1 shows a conversational agent system 100 in which a user 102 uses speech 104 to interact with a local computing device 106 such as a smart speaker (e.g., a FUGOO Style-S Portable Bluetooth Speaker). The local computing device 106 may be any type of computing device such as a smartphone, a smartwatch, a tablet computer, a laptop computer, a desktop computer, a smart TV, a set-top box, a gaming console, a personal digital assistant, a vehicle computing system, a navigation system, or the like. In order to participate in audio-based interactions with the user 102, the local computing device 106 includes or is connected to a speaker 108 and a microphone 110. The speaker 108 generates audio output which may be music, a synthesized voice, or other type of output.”  Each of the examples indicates an “operation” such as making a phone call, telling the time, providing navigation assistance, etc.]
outputting the first vocalization to the user. [McDuff, Figure 3, “Speakers 312” and “Display 314” provide the output including the synthesized speech from “Speech Synthesizer 220” of Figure 4.]
	
	McDuff teaches in Figure 5, “512: identify sentiment of the user’s input” and “generate a response dialogue 514,” that the response by the machine is generated in response to the sensed emotion of the user.  McDuff teaches:  “[0093] … The prosodic qualities of the response dialogue may also be modified based on a facial expression of the user 102 if that data is available. For example, if the user 102 is making a sad face, the tone of the response dialogue may be lowered to make the conversational agent also sound sad….”  This teaching suggests a type of adjusting the machine response to fit a target emotional state as claimed; it may be that when the user is sad, the machine ought not be too jittery and impertinent to further aggravate a sad user.
	McDuff does not expressly teach that the response by the machine is calculated to put the user in a good mood.
	Kim teaches:
1. A computer-implemented method for interacting with a user while assisting the user, [Kim, Figure 1, showing the user 162 in the “vehicle 160.”  “[0019] The processor 102, the memory 104, and the sensors 110, 112 are implemented in a vehicle 160, such as a car. (In other implementations, the processor 102, the memory 104, and the sensors 110, 112 are implemented in other devices or systems, such as a smart speaker system or a mobile device, as described further below). The first sensor 110 and the second sensor 112 are each configured to capture user input received from a user 162, such as an operator of the vehicle 160….”]
the method comprising: 
capturing a first input that indicates one or more behaviors associated with the user; [Kim, Figures 1 and 2.  “Sensors 110/112” capturing “user speech 108” as the “first input.”  “[0019] …For example, the first sensor 110 may include a microphone configured to capture user speech 108, and the second sensor 112 may include a camera configured to capture images or video of the user 162. The first sensor 110 and the second sensor 112 are configured to provide user input to the processor 102. For example, the first sensor 110 is configured to capture and provide to the processor 102 a first user input 140 (e.g., first audio data) indicating a user's command. The user speech 108 may be an utterance from the user 162, such as a driver or passenger of the vehicle 160. In a particular implementation, the first user input 140 corresponds to keyword-independent speech (e.g., speech that does not include a keyword as the first word). The second sensor 112 is configured to provide a second user input 152 (e.g., a video input including non-verbal user information) to the processor 102.”]
determining a first emotional state of the user based on the first input; [Kim, Figure 2, “emotion analyzer 266” as part of the “user experience evaluation unit 132.”]
generating a first vocalization that incorporates a first emotional component based on the first emotional state and a target emotional state of the user, [Kim, Figure 2, “Remedial Action 126” as part of the “Experience Manager 124” incorporates the detected emotion in the voice of the user and tries to provide a response in a soothing voice.  “[0027] In another example, the remedial action 126 is selected to reduce a negative aspect of the user experience by improving a mood of the user 162. To illustrate, the remedial action 126 may include one or more of playing soothing music, adjusting a voice interface to speak to the user 162 in a calming manner or to have a calming effect, or recommending a relaxing activity for the user 162….”]
wherein the first vocalization relates to a first operation that is being performed to assist the user; and [Kim, Figure 2 or Figure 4.  The remedial action occurs in response to sensed user input which may be in response to his experience with some function/operation of the vehicle.  “[0020] The memory 104 includes a mapping unit 130 and a user experience evaluation unit 132. The mapping unit 130 is executable by the processor 102 to map received commands, such as a command 142, into operations (also referred to as "tasks" or "skills") to be performed responsive to the command 142. Examples of skills that may be supported by the system 100 include "navigate to home," "turn on radio," "call Mom," or "find a gas station near me." The mapping unit 130 is executable to return a particular skill 144 that corresponds to a received command 142. The user experience evaluation unit 132 is configured to determine, based on one or more received inputs, experience data 146 indicating aspects of an experience of the user 162. For example, the user experience evaluation unit 132 may evaluate a user experience based on at least one of speech keyword detection, audio emotion analytics, video emotion analytics, prosody analytics, or audio event detection. In some implementations, the mapping unit 130, the user experience evaluation unit 132, or both, is dynamically adjustable and updated based on user feedback. Implementations of the mapping unit 130 and the user experience evaluation unit 132 are described in further detail with reference to FIGS. 2-4.”]
outputting the first vocalization to the user. [Kim, Figure 2, “speaker 238” outputting “speech 209.”  “[0038] …  The speaker 238 may be configured to output audible information to the user 162, such as speech 209.”]

McDuff and Kim pertain to detection of emotion from a user and adjusting the response accordingly, and it would have been obvious to combine the feature of Kim, which adjusts the response of the machine to soothe and calm the user in response to detecting irritation in the sensory inputs from the user, with the system of McDuff, which teaches adjusting the response to sound sad in order to possibly sympathize with the speaker/user, which is an indirect way of putting the user in a better mood.  This combination falls under simple substitution of one known element for another to obtain predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.

Regarding Claim 2, McDuff teaches:
2. The computer-implemented method of claim 1, 
wherein the user resides within a vehicle where the first input is captured, and [McDuff teaches that the “local computing device 106” of Figure 1 may be “a vehicle computing system” ([0021]) and also teaches that the User 102 may be driving:  “[0076] The emotion identified by the facial expression recognizer 416 may be provided to the conversational style manager 402 to modify the utterance of the embodied conversational agent 302. … For example, a forward-facing camera on a smartphone may provide the video input 410 of the user's 102 face, but the conversational agent app on the smartphone may provide audio-only output without displaying an embodied conversational agent 302 (e.g., in a "driving mode" that is designed to minimize visual distractions to a user 102 who is operating vehicle).”]
wherein the first operation is performed on behalf of the user by a vehicle subsystem included in the vehicle. 
McDuff teaches that the “local computing device 106” of Figure 1 may be “a vehicle computing system” ([0021]) and that it may be used in a “driving mode” ([0076]). Accordingly, McDuff at the least suggests that the operations performed by the computing device would pertain to “a vehicle subsystem.”
McDuff does not teach this expressly.
Instant Application, “Description of Related Art” includes an example of the VPA activating the air conditioner of the car via the VPA (virtual personal assistant) and another example of the VPA changing the volume of the car radio.  (Published Application [0003]-[0004].) Accordingly, the combination of the Applicant’s Admitted Prior art and McDuff can teach this Claim.  
Another reference is also added.
Kim teaches:
wherein the user resides within a vehicle where the first input is captured, and [Kim, Figure 1, “user 162” inside the “vehicle 160.”  “[0019] The processor 102, the memory 104, and the sensors 110, 112 are implemented in a vehicle 160, such as a car….”]
wherein the first operation is performed on behalf of the user by a vehicle subsystem included in the vehicle. [Kim logs and responds to user reaction to an operation of a vehicle subsystem such as navigation or turning on the radio.  “[0020] The memory 104 includes a mapping unit 130 and a user experience evaluation unit 132. The mapping unit 130 is executable by the processor 102 to map received commands, such as a command 142, into operations (also referred to as "tasks" or "skills") to be performed responsive to the command 142. Examples of skills that may be supported by the system 100 include "navigate to home," "turn on radio," "call Mom," or "find a gas station near me."….”  “[0022] The navigation engine 122 is configured to perform one or more operations associated with the vehicle 160. For example, the navigation engine 122 may be configured to determine a position of the vehicle 160 relative to one or more electronic maps, plot a route from a current location to a user-selected location, or navigate the vehicle 160 (e.g., in an autonomous mode of vehicle operation), as illustrative, non-limiting examples.”  “[0023] The experience manager 124 is configured to receive the experience data 146 from the user experience evaluation unit 132. The experience data 146 may include a classifier of the user experience as "good" or "bad" (e.g., data having a value between 0 and 1, with a "1" value indicating the user experience is positive and a "0" value indicating the user experience is negative)….”]
McDuff and Kim pertain to evaluation of emotion of a user while interacting with a virtual personal assistant and responding to the user/speaker accordingly and it would have been obvious to modify the system of McDuff which can be used in a car with the system of Kim that specifies that the functions/operations requested from the VPA are functions pertaining to the vehicle in order to draw vehicle-related utility from the VPA.  This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.

Regarding Claim 3, McDuff teaches, Figure 2, [0035]-[0037], a “custom intent recognizer 214” which generates “domain-specific scripted dialogue.”  The emotion as detected in Figure 4 is used to impact the dialog that is generated in response.  The emotion does not change the performance of a physical operation. 
Kim teaches:
3. The computer-implemented method of claim 1, further comprising the steps of: 
determining the first operation based on the first emotional state; and [Kim, Figure 3, “evaluate user experience 306” and “perform remedial action 308.”  Kim teaches that an operation such as playing soothing music, adjusting the voice interface to speak calmly, or driving to the user’s sister’s house makes the user feel better.  So, if the emotion detected is negative, the device suggests the operation of driving to the sister’s house.  “[0027] In another example, the remedial action 126 is selected to reduce a negative aspect of the user experience by improving a mood of the user 162. To illustrate, the remedial action 126 may include one or more of playing soothing music, adjusting a voice interface to speak to the user 162 in a calming manner or to have a calming effect, or recommending a relaxing activity for the user 162….”  “[0028] … As an example, the processor 102 may determine, during analysis of a history of interactions with the user 162, a high correlation between travelling to a house of a sister of the user 162 and a detected transition from a negative user experience to a positive user experience. As a result, the experience manager 124 may generate an output to be presented to the user 162, such as "Would you like to visit your sister today?" as the remedial action 126….”]
performing the first operation to assist the user. [Kim, Figure 3, “perform remedial action 308.”  “[0055] In the event that the user experience is evaluated to be a negative user experience, a remedial action is performed, at 308….”  “16. The method of claim 10, wherein the remedial action includes at least one of: playing soothing music, adjusting a voice interface to generate speech to have a calming effect, or recommending a relaxing activity for a user.”  [0026]-[0028].  Note also Figure 1, “Vehicle 160,” and “[0019] The processor 102, the memory 104, and the sensors 110, 112 are implemented in a vehicle 160, such as a car….”  “[0049] In response to receiving the non-audio input prompted by the GUI 218, the processor 102 is configured to process the user command. To illustrate, when the user command corresponds to a car-related command (e.g., "go home"), the processor 102 may process the user command by performing the user-selected skill to control the car (e.g., a navigation task to route the car to a "home" location).”]
McDuff and Kim pertain to evaluation of emotion of a user while interacting with a virtual personal assistant and responding to the user/speaker based on the parameters corresponding to the user’s emotion and it would have been obvious to modify the system of McDuff which adjusts the output speech of the VPA to the style and emotion of the user with the system of Kim which actually performs a remedial action in response to detecting a negative user emotion.  This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
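For illustration only, the flow relied upon in Kim for Claim 3 (evaluate the user experience, then select a remedial action when the experience is negative) can be sketched as follows.  This is a minimal sketch and not Kim’s implementation: the function name, the 0.5 cutoff, and the ordering of the actions are assumptions made here; only the 0-to-1 experience value and the three named remedial actions come from the passages quoted from Kim.

```python
from typing import Optional

# Assumed cutoff: Kim describes experience data between 0 ("bad") and 1 ("good")
# but the quoted passages give no threshold; 0.5 is a hypothetical choice.
NEGATIVE_THRESHOLD = 0.5

# Remedial actions named in Kim [0027]; the ordering is an assumption.
REMEDIAL_ACTIONS = [
    "play soothing music",
    "speak in a calming manner",
    "recommend a relaxing activity",
]


def select_remedial_action(experience_score: float) -> Optional[str]:
    """Return a remedial action for a negative experience, else None."""
    if not 0.0 <= experience_score <= 1.0:
        raise ValueError("experience_score must be between 0 and 1")
    if experience_score >= NEGATIVE_THRESHOLD:
        # Positive experience: no first operation is selected on this basis.
        return None
    # Negative experience: pick the first listed action (hypothetical policy).
    return REMEDIAL_ACTIONS[0]
```

In this sketch the selected action plays the role of the “first operation” that is determined based on the “first emotional state.”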

Regarding Claim 4, McDuff teaches:
4. The computer-implemented method of claim 1, wherein the steps of determining the first emotional state of the user comprises the steps of: 
determining a first feature of the first input; and [McDuff, Figure 4.  Two features of the image input are determined:  “Facial Expression Recognizer 416” and “Head Pose Estimator 418” are both extracted from the Video input / First input.]
determining a first type of emotion corresponding to the first feature. [McDuff, Figure 4. “Emotion and Head Pose Synthesizer 420” takes into account both 416 and 418.]  

Regarding Claim 5, McDuff, Figures 2 and 4, teaches the input of audio and extracting prosody/tone of the speech by “prosody style extractor 218.”  McDuff does mention determining sentiment from speech of the user:  “[0091] At 512, a sentiment of the user's 102 (i.e. speech 104 or text) may be identified….”  It also teaches “features of speech such as emphasis and intonation.”  See [0038] and [0131].  But it does not elaborate on determining emotion from the tone/prosody of the voice. 
Kim teaches:
5. The computer-implemented method of claim 4, 
wherein the first input comprises an audio input, and [Kim teaches that it extracts the tone of the input speech by the user. “[0062] The second processing stage 404 includes prosody analysis 430, keyword detection 432, and video analytics 434. The prosody analysis 430 is configured to process the audio and speech data 420 to detect one or more prosody elements, such as emphasis, tonality, pitch, speech rate, or one or more other elements that may provide contextual information regarding the detected text 424, such as a particularly long duration. … A relatively complex mapping correlation may exist between all features of prosody and good/bad user experience. The prosody analysis 430 extracts prosody related features as one of the inputs to an emotion analysis 440, as described below.”]
wherein the first feature comprises a tone of voice associated with the user. [Kim teaches that it adjusts the tone of voice in the response: “[0044] … The voice interface 220 may adjust a tone, rate of speech, vocabulary, one or more other factors, or a combination thereof, to present speech 209 having qualities designed to improve an emotional state of the user 162.”]
McDuff and Kim pertain to evaluation of emotion of a user while interacting with a machine and responding to the user/speaker based on the parameters corresponding to the user’s conversational style and emotion and it would have been obvious to modify the system of McDuff which receives speech as input and evaluates speech but does not elaborate on the evaluation of prosodic features of speech (such as tone of voice) with the system of Kim which is more elaborate with respect to the evaluation of prosodic aspects of speech, and specifically mentions “tonality,” for completeness.  This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
(Manfredi: “[0007] The present invention provides a digital assistant that detects user emotion and modifies its behavior accordingly. In one embodiment, a modular system is provided, with the desired emotion for the virtual assistant being produced in a first module. A transforming module then converts the emotion into the desired output medium. For example, a happy emotion may be translated to a smiling face for a video output on a website, a cheerful tone of voice for a voice response unit over the telephone, or smiley face emoticon for a text message to a mobile phone. Conversely, input from these various media is normalized to present to the first module the user reaction.”)

Regarding Claim 6, McDuff teaches:
6. The computer-implemented method of claim 4, 
wherein the first input comprises a video input, and [McDuff, “Video Input 410.”]
wherein the first feature comprises a facial expression made by the user. [McDuff, “Face Detector 412” to “Facial Expression Recognizer 416” which feeds the “Emotion and Head Pose Synthesizer 420.”]

Claim 11 is a computer program product claim with limitations corresponding to the limitations of method Claim 1 and is rejected under similar rationale.  Additionally:
11. A non-transitory computer-readable medium storing program instructions that, when executed by a processor, cause the processor to interact with a user while assisting the user by performing the steps of: [McDuff, Figure 1, “Local Computing Device 110.”  “Processor 112” and “Memory 114.” Figure 7, “Processor 702” and “Memory 704.”  “[0118] Computer-readable media can also store instructions executable by external processing units such as by an external CPU ….” “[0156] Clause 8. A computer-readable storage medium having computer-executable instructions stored thereupon, when executed by one or more processors of a computing system, cause the computing system to perform the method of any of clauses 1-6.”]
capturing a first input that indicates one or more behaviors associated with the user; 
determining a first emotional state of the user based on the first input; 
generating a first vocalization that incorporates a first emotional component based on the first emotional state, 
wherein the first vocalization relates to a first operation that is being performed to assist the user; and 
outputting the first vocalization to the user. 
Claim 12 is a computer program product claim with limitations corresponding to the limitations of method Claim 2 and is rejected under similar rationale.

Claim 13 is a computer program product claim with limitations corresponding to the limitations of method Claim 3 and is rejected under similar rationale.

Claim 14 is a computer program product claim with limitations corresponding to the limitations of method Claims 4 and 5 or 6 and is rejected under similar rationale.
14. The non-transitory computer-readable medium of claim 11, wherein the step of determining the first emotional state of the user comprises the steps of:
determining a first feature of the first input; and [Claim 4]
determining a first type of emotion corresponding to the first feature, [Claim 4]
wherein the first feature comprises a tone of voice associated with the user or a facial expression made by the user. [Claim 5 Or Claim 6]
This Claim has the scope of Claim 6 and is rejected under similar mapping.

Regarding Claim 18, McDuff teaches:
18. The non-transitory computer-readable medium of claim 11, 
wherein the step of generating the first vocalization comprises the steps of combining the first emotional component with a first semantic component. [McDuff, Figure 4, the “synthesized output 422” uses the “semantics” / meaning of the input speech, obtained at “Text Sentiment Recognizer 404” combined with the emotion obtained from the “Facial Expression Recognizer 416” to generate the output.]

Regarding Claim 19, McDuff teaches:
19. The non-transitory computer-readable medium of claim 18, further comprising: 
generating a transcription of the first input that indicates one or more semantic components included in the first input; and  [McDuff, Figure 4, “Text Sentiment Recognizer 404” determines sentiment from the meaning of the text of the speech.  “[0065] The text sentiment recognizer 404 recognizes sentiments in the content of an input by the user 102. The sentiment as identified by the text sentiment recognizer 404 may be a part of the conversational context. The input is not limited to the user's 102 speech 104 but may include other forms of input such as text (e.g., typed on the keyboard 310 or entered using any other type of input device). Text output by the speech recognizer 206 or text entered as text is processed by the text sentiment recognizer 404 according to any suitable sentiment analysis technique. Sentiment analysis makes use of natural language processing, text analysis, and computational linguistics, to systematically identify, extract, and quantify affective states and subjective information. The sentiment of the text may be identified using a classifier model trained on a large number of labeled utterances. The sentiment may be mapped to categories such as positive, neutral, and negative. Alternatively, the model used for sentiment analysis may include a greater number of classifications such as specific emotions like anger, disgust, fear, joy, sadness, surprise, and neutral. The text sentiment recognizer 404 is a point of crossover from the audio pipeline to the visual pipeline and is discussed more below.”]
generating the first semantic component based on the one or more semantic components. [McDuff, Figure 5, the “Identify A Sentiment of the User’s Input 512” feeds the “Generate a Synthetic Facial Expression 616” in Figure 6 in addition to “Generate a Response Dialogue 514” in Figure 5.]

Claim 20 is a system claim with limitations corresponding to the limitations of method Claim 1 and is rejected under similar rationale.  Additionally:
20. A system, comprising: 
a memory storing a software application; and [McDuff, Figure 1, “Local Computing Device 110.”  “Memory 114.” Figure 7, “Memory 704.”  “[0118] Computer-readable media can also store instructions executable by external processing units such as by an external CPU ….”   “[0155] Clause 7. A system comprising one or more processors and memory storing instructions that, when executed by the one or more processors, cause the one or more processors perform the method of any of clauses 1-6.”]
a processor that, when executing the software application, [McDuff, Figure 1, “Local Computing Device 110.”  “Processor 112.”   Figure 7, “Processors 702.”]
is configured to perform the steps of: 
…


Claims 7-10 and 15-17 are rejected under 35 U.S.C. 103 as being unpatentable over McDuff and Kim in view of Manfredi (U.S. 20080096533).
Regarding Claim 7, McDuff teaches a “Prosody Recognizer 208” and teaches the elements of “prosody” as: “[0042] The prosody style extractor 218 uses the acoustic variables identified from the speech 104 of the user 102 to modify the utterance of the conversational agent. The prosody style extractor 218 may modify that SSML file to adjust the pitch, loudness, and speech rate of the conversational agent's utterances. For example, the representation of the utterance may include five different levels for both pitch and loudness (or a greater or lesser number of variations). …”   See also [0006] and [0029].  McDuff uses prosody to match the prosody style of the synthesized output to that of the user and does not discuss determination of emotion/sentiment from prosody.  
Kim teaches detection of emotion from voice and image of the speaker/user.  “4. The device of claim 1, wherein the user experience evaluation unit is executable to perform at least one of: speech keyword detection, audio emotion analytics, video emotion analytics, prosody analytics, or audio event detection.”  Kim further teaches some form of calculation for emotions:  “[0023] The experience manager 124 is configured to receive the experience data 146 from the user experience evaluation unit 132. The experience data 146 may include a classifier of the user experience as "good" or "bad" (e.g., data having a value between 0 and 1, with a "1" value indicating the user experience is positive and a "0" value indicating the user experience is negative). In other examples the experience data 146 may include multiple values, such as a first value indicating a measurement of happiness, a second value indicating a measurement of anger, a third value indicating a measurement of frustration, a fourth value indicating a measurement of sadness, and a fifth value indicating a measurement of excitement, as illustrative, non-limiting examples.”
Kim arguably teaches or suggests the “valence value” of Claim 7.
Manfredi more expressly teaches the “spectrum of emotion types”:
7. The computer-implemented method of claim 1, wherein the step of determining the first emotional state of the user comprises the steps of: 
determining a first valence value based on the first input that indicates a location within a spectrum of emotion types; and [Manfredi is directed to: “[0007] … a digital assistant that detects user emotion and modifies its behavior accordingly….”  Manfredi provides a “spectrum” of emotions and calculates the particular emotion of the speaker as a combination of different emotions.  “[0011] In one embodiment, various primary emotional input indicators are combined to determine a more complex emotion or secondary emotional state. For example, primary emotions may include fear, disgust, anger, joy, etc. Secondary emotions may include outrage, cruelty, betrayal, disappointment, etc. If there is ambiguity because of different emotional inputs, additional prompting, as described above, can be used to resolve the ambiguity.”  “Janus” is the name of the Virtual Assistant.]
determining a first intensity value based on the first input that indicates a location within a range of intensities corresponding to the location within the spectrum of emotion types. [Manfredi, See Figure 3 teaching that the percentage/ “first intensity value” of each emotion, within a range of 100%, in the speech is determined:  “[0092] FIG. 3 is a diagram of an embodiment of an array which is passed to Janus as a result of neural network computation. Each position of the array represents a basic emotion. For each basic emotion a percentage (e.g., 37.9% fear, 8.2% disgust, etc.) is provided to the other modules. In this case, the values represent a situation of surprise and fear…..”]
McDuff/Kim and Manfredi pertain to evaluation of emotion of a user while interacting with a virtual personal assistant and it would have been obvious to modify the system of McDuff/Kim which receives speech as input and evaluates speech and prosody but does not elaborate on the evaluation of prosodic features of speech with the system of Manfredi which is more elaborate with respect to the evaluation of sound/audio aspects of speech for completeness.  This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
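As a non-authoritative illustration of the Claim 7 mapping, the sketch below takes an array of per-emotion percentages of the kind described for Manfredi’s Figure 3, treats the dominant emotion as the location within the spectrum of emotion types, and treats its percentage as the intensity.  Only the 37.9% fear and 8.2% disgust figures come from the quoted passage; the other values, the function name, and the rule of picking the largest percentage are assumptions made for this sketch.

```python
from typing import Dict, Tuple


def dominant_emotion(percentages: Dict[str, float]) -> Tuple[str, float]:
    """Return (emotion type, intensity in 0..1) for the largest percentage."""
    if not percentages:
        raise ValueError("at least one emotion percentage is required")
    # Location within the spectrum of emotion types: the highest-scoring emotion.
    emotion = max(percentages, key=lambda name: percentages[name])
    # Intensity: that emotion's percentage rescaled to the range 0..1.
    return emotion, percentages[emotion] / 100.0


# 37.9 and 8.2 are from Manfredi [0092]; the other two values are hypothetical.
example = {"fear": 37.9, "disgust": 8.2, "surprise": 30.0, "joy": 5.0}
emotion_type, intensity = dominant_emotion(example)
```

With these example values the dominant type is fear, which is consistent with the “situation of surprise and fear” described in the quoted paragraph.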

Regarding Claim 8, McDuff teaches that the output voice is modified to match the linguistic style and speech of the user and his emotion.  McDuff does not elaborate on the emotion being derived from voice.
Kim is more elaborate but does not teach the “valence value” which is claimed.
Manfredi teaches:
8. The computer-implemented method of claim 7, wherein the first emotional component corresponds to the first valence value and the first intensity value. [Manfredi teaches that the voice of the Virtual Assistant (Janus) including the “first emotional component” is adjusted according to the emotion (and its intensity) obtained from the input voice of the user.  “[0147] There are well-established techniques to obtain a user's emotive state from the vocal spectrum. ….”  “7. A virtual assistant comprising: a user input device for providing input information from a user; an emotion detection module configured to detect a user's emotion from said input information; a core module for producing a virtual assistant emotion for the virtual assistant based on said user's emotion.”  See [0202] to [0225] regarding the type of primary and secondary emotions that are detected and calculated.  Then:  “[0230] An AI engine (based on neural networks) is to compute VA's emotion (selected among catalogued emotions, see § "User's emotion calculation") with regard to …”  “Emotional Status Definition” is done by:  “[0119] This is performed in two ways: [0120] by extracting emotional valence by proposed stimulus (valence is a static value previously allocated to stimulus) [0121] by dynamically deducing a emotional status by dialogue flow status and by context”  “Virtual Assistant’s Emotion Calculation” is based on:  “[0235] Emotive valence of discussed subject (taken from knowledge base)” “[0236] Emotive valence of answer to be provided (taken from knowledge base)”.]
Rationale for combination as provided for Claim 7.

Regarding Claim 9, McDuff and Kim suggest this feature but Manfredi was cited and Manfredi teaches: 
9. The computer-implemented method of claim 7, wherein the first emotional component corresponds to at least one of a second valence value or a second intensity value. [Manfredi, Figure 3, the percentages of various primary emotions are calculated and the secondary emotions are calculated based on the primary emotions.  The percentage pertaining to a second type of primary emotion can be mapped to “a second intensity value.”  The “first emotional component” is the emotion reflected in the voice of the Virtual Assistant and it is based on all the first, second, etc. intensity values of different types of primary emotions.  The secondary emotions that are calculated as a combination of primary emotions also have a % associated with them, and can also teach the “second intensity value” of the Claim:  “[0027] … So, if the system has computed a state of disappointment at 57% (combination of basic emotions of Sadness and Surprise) than VA could directly ask: "Are you disappointed by my answer?".”]
Rationale for combination as provided for Claim 7.

Regarding Claim 10, McDuff teaches that the synthesized speech for the response dialogue (Figure 5, 516) is modified to convey emotion (“first emotional component”) based on the identified sentiment (“first emotional state”) of the user input (Figure 5, 512).  Therefore, an inherent mapping has to occur.  The same is true of Kim.
Manfredi, however, expressly uses the word “mapping.” 
Manfredi expressly teaches: 
10. The computer-implemented method of claim 1, further comprising the steps of
generating the first emotional component based on the first emotional state and [Manfredi, Figure 4, “input collection 2” to “input contextualizing 4” determines the emotional state of the speaker/user which teaches the “first emotional state” of the Claim.  Figure 4, “emotional status definition 6” and “output preparation 7” generates the emotional tone of the output by the Virtual Assistant (Janus) and teaches the generating of the “first emotional component” of the Claim.  “[0118] Before sending a further stimulus (question, sentence, action) or the answer, there is an "emotional part loading." That is, the Virtual Assistant is provided with an emotional status appropriate for dialogue flow, stimulus to be sent or the answer.”]
a response mapping that translates emotional states to emotional components. [Manfredi, the “Emotional Status Definition 6” which sets the emotional status of the response of the VA is obtained from mapping of the emotions of the speaker to the response:  “[0122] The Virtual Assistant makes use of an additional model of artificial intelligence representing an emotive map and thus dedicated to identify the emotional status suitable for that situation.”]
McDuff/Kim and Manfredi pertain to evaluation of emotion of a user while interacting with a virtual personal assistant and it would have been obvious to modify the system of McDuff/Kim which would inherently perform a mapping between the emotional state detected in the voice/face of the speaker to an “emotional component” for the output with the system of Manfredi which expressly teaches an “emotive map” for doing the same.  This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
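For illustration, a “response mapping that translates emotional states to emotional components” can be pictured as a lookup table.  The sketch below is hedged accordingly: the state labels, the fields of the component, the numeric values, and the neutral fallback are all assumptions made here and are not taken from McDuff, Kim, Manfredi, or the Application.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class EmotionalComponent:
    """Hypothetical description of how the output voice should sound."""
    tone: str
    speech_rate: float  # relative to a neutral rate of 1.0


# Hypothetical response mapping: detected user state -> component of the reply.
RESPONSE_MAPPING: Dict[str, EmotionalComponent] = {
    "angry": EmotionalComponent(tone="calm", speech_rate=0.9),
    "sad": EmotionalComponent(tone="warm", speech_rate=0.95),
    "happy": EmotionalComponent(tone="cheerful", speech_rate=1.05),
}

NEUTRAL = EmotionalComponent(tone="neutral", speech_rate=1.0)


def emotional_component_for(state: str) -> EmotionalComponent:
    """Translate an emotional state into an emotional component."""
    # Unknown states fall back to a neutral delivery (an assumed policy).
    return RESPONSE_MAPPING.get(state, NEUTRAL)
```

An explicit table of this kind corresponds to what the rejection describes as an express mapping, as opposed to a mapping that is only inherent in the system.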

Claim 15 is a computer program product claim with limitations corresponding to the limitations of method Claim 7 and is rejected under similar rationale.

Claim 16 is a computer program product claim with limitations corresponding to the limitations of method Claim 10 and is rejected under similar rationale.

Regarding Claim 17, McDuff teaches that the “synthesized output 422” of Figure 4 is being modified according to the input (voice and image) by the “user 102.”  Accordingly, the output will adjust according to the next incoming input.  Additionally, “dialogue” implies a back and forth and more than one turn of speech.  
Kim expressly teaches:
17. The non-transitory computer-readable medium of claim 16, further comprising the steps of: 
capturing a second input that indicates at least one behavior the user performs in response to the outputting of the first vocalization; and [Kim, Figure 3 shows a training process where the models are updated based on second, third, etc. inputs by the user.  The user inputs (Figure 2) are in response to the output by the device because Kim is directed to “User Experience Evaluation” in response to a task performed corresponding to a user input command.  First user input is the command; second user input is his reaction to the performance of the command.]
modifying the response mapping based on the second input and a first objective function that is evaluated to determine how closely the at least one behavior corresponds to a target behavior. [Kim, Figures 2 and 3.  Figure 2 shows a “Mapping Unit 130” which maps the commands to skills/tasks and a “User experience evaluation unit 132” which includes an “emotion analyzer 266” for analyzing the emotion of the user in response to the experience of receiving the result.  Figure 3 shows that based on the response of the user both “User Experience Model” and “Skill Model” are updated/modified.  The “Skill Model” teaches the “Response Mapping” because it determines which function/skill will be invoked in response to a command.  The “User Experience Model” teaches the “Objective Function” because it determines, based on the detected emotion of the user (See Figure 4, 3rd and 4th processing stages 406, 408), how close the emotion/behavior of the user is to the target behavior/emotion of a “good experience” (Figure 4, 450).  See [0056]-[0057] and “[0068] … The user experience model and the skill matching model may be updated based on the user feedback, such as described with reference to FIG. 3.”]
McDuff and Kim pertain to evaluation of emotion of a user while interacting with a virtual personal assistant and adjusting the response.  It would have been obvious to combine the evaluation of the user experience/behavior in response to the task (skill) detected and performed by the Virtual Assistant and update of the model that shows how satisfied the user is (User behavior Corresponding to Target behavior of satisfaction) from Kim with the system of McDuff in order to improve the accuracy of the model that detects the emotion of the user.  This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
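The Claim 17 limitation (modifying the response mapping based on a second input and an objective function that measures how closely the user’s behavior corresponds to a target behavior) can be sketched generically as follows.  This is not Kim’s model update: the score-based objective, the update rate, the per-response weights, and all names are assumptions chosen only to make the claimed relationship concrete.

```python
from typing import Dict


def objective(observed: float, target: float) -> float:
    """Closeness of observed behavior to target behavior (1.0 means a match).

    Both scores are assumed to lie between 0 and 1.
    """
    return 1.0 - abs(target - observed)


def update_mapping(
    weights: Dict[str, float],
    used_response: str,
    observed: float,
    target: float = 1.0,
    rate: float = 0.5,
) -> Dict[str, float]:
    """Return a copy of the mapping weights adjusted by the second input."""
    updated = dict(weights)  # leave the caller's mapping untouched
    closeness = objective(observed, target)
    # Move the weight of the response that was used toward the closeness score.
    updated[used_response] += rate * (closeness - updated[used_response])
    return updated


# Hypothetical example: a calm reply was followed by a poor user reaction.
before = {"calm": 0.5, "cheerful": 0.5}
after = update_mapping(before, "calm", observed=0.0)
```

Here `observed` stands in for the behavior captured in the second input and `target` for the target behavior; a response that moves the user away from the target loses weight in the mapping.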

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
For Claim 17 see also Yang (U.S. 2021/0074261): Figure 14.
Manfredi expressly teaches: 
17. The non-transitory computer-readable medium of claim 16, further comprising the steps of: 
capturing a second input that indicates at least one behavior the user performs in response to the outputting of the first vocalization; and [Manfredi, Figure 4, first there is a “Self-Introduction 1” by the machine, in response to which the user provides input to “Input Collection 2.”  Manfredi also teaches disambiguation when the Virtual Assistant is unable to decide the emotion of the user in a determinative way and asks further questions followed by “second input” from the user.  “19. The method of claim 14 further comprising: detecting an ambiguous user emotion; and forming a virtual assistant question, unrelated to a current dialogue with said user, to elicit more information on an emotion of said user.”  “[0011] …If there is ambiguity because of different emotional inputs, additional prompting, as described above, can be used to resolve the ambiguity.”  See also [0226].] [The added language of “in response to outputting of the first vocalization” further clarifies the language and is consistent with the previous interpretation of the language.]
modifying the response mapping based on the second input and a first objective function that is evaluated to determine how closely the at least one behavior corresponds to a target behavior.  [Manfredi adjusts the output according to modified evaluation of emotion: “A modular digital assistant that detects user emotion and modifies its behavior accordingly. …”  Abstract.  “[0118] Before sending a further stimulus (question, sentence, action) or the answer, there is an "emotional part loading." That is, the Virtual Assistant is provided with an emotional status appropriate for dialogue flow, stimulus to be sent or the answer.”  See [0226]-[0227].  “[0238] The outcome is an expressive and emotional dynamic nature of the VA which, based on some consolidated elements (emotive valence of discussed subject, answer to be provided and VA's emotional model) may dynamically vary, real time, with regard to interaction with the interface and the context.”  The “first objective function” of the Claim is taught or suggested by the “VA’s emotional model.”]

Regarding Claim 2, McDuff teaches:
2. The computer-implemented method of claim 1, 
wherein the user resides within a vehicle where the first input is captured, and [McDuff teaches that the “local computing device 106” of Figure 1 may be “a vehicle computing system” ([0021]) and also teaches that the User 102 may be driving:  “[0076] The emotion identified by the facial expression recognizer 416 may be provided to the conversational style manager 402 to modify the utterance of the embodied conversational agent 302. … For example, a forward-facing camera on a smartphone may provide the video input 410 of the user's 102 face, but the conversational agent app on the smartphone may provide audio-only output without displaying an embodied conversational agent 302 (e.g., in a "driving mode" that is designed to minimize visual distractions to a user 102 who is operating vehicle).”]
wherein the first operation is performed on behalf of the user by a vehicle subsystem included in the vehicle. 
McDuff teaches that the “local computing device 106” of Figure 1 may be “a vehicle computing system” ([0021]) and that it may be used in a “driving mode” ([0076]). Accordingly, McDuff at the least suggests that the operations performed by the computing device would pertain to “a vehicle subsystem.”
McDuff does not teach this expressly.
Instant Application, “Description of Related Art” includes an example of the VPA (virtual personal assistant) activating the air conditioner of the car and another example of the VPA changing the volume of the car radio.  (Published Application [0003]-[0004].) Accordingly, the combination of the Applicant’s Admitted Prior Art and McDuff can teach this Claim.  
Another reference is also added.
Wang (U.S. 20180314689) teaches:
wherein the user resides within a vehicle where the first input is captured, and [Wang, Figure 36, the User/Speaker is the driver inside a car/vehicle and is speaking to the car VPA and the car VPA is responding.  “[0522] FIG. 36 illustrates an example where a virtual personal assistant has been implemented in an automobile 3610….  In various implementations, given that a driver 3600 may ask for any information, or may desire to execute tasks unrelated to driving, the domain may be broader….”]
wherein the first operation is performed on behalf of the user by a vehicle subsystem included in the vehicle. [Wang teaches that when the VPA is implemented in a vehicle, the functions/operations that it performs could be primarily related to the functions/ “first operation” of a car:  “[0522] … In this example, the domain can be defined as primarily vehicle and travel related. As such, the domain may include information about the functionality of the automobile 3610, about vehicle maintenance, and about driving, among other things. The domain may also include information such as maps, address books, routes, and so on. In implementations where the automobile 3610 is able to communicate with a network (e.g., through a cellular connection, radio signals, and/or through a mobile device 3632 that has been connected to the automobile 3610), the domain information may also include, for example, weather, road conditions, traffic, and other real-time information….”]
McDuff and Wang both pertain to evaluating the emotion of a user who is interacting with a virtual personal assistant and responding to the user/speaker accordingly. It would have been obvious to modify the system of McDuff, which can be used in a car, with the system of Wang, which specifies that the functions/operations requested from the VPA pertain to the vehicle, in order to draw vehicle-related utility from the VPA.  This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.




Andruszkiewicz (U.S. 202002927) teaches:
5. The computer-implemented method of claim 4, 
wherein the first input comprises an audio input, and [Andruszkiewicz, Figures 6, 7, 8, 9A and 16A.  The input includes “User’s Utterance”  at 601, 701, 910.  “[0212] According to an embodiment, the emotion identification module 1650 may identify the user's emotion based on information 1610 about the user's utterance.”  One type of information obtained from the utterance is audio features: 1612, another type is the text 1611 obtained from the input utterance.  Information regarding both audio characteristics (e.g. frequency) and text helps determine which words are emphasized.]
wherein the first feature comprises a tone of voice associated with the user. [Andruszkiewicz, Figure 16A and Figure 7 (intonation).  “[0215] The emotion identification module 1650 may analyze the frequency of the voice signal 1612. For example, the emotion identification module 1650 may identify a high-frequency portion and a low-frequency portion in the voice signal and identify where the speech is emphasized according to the amplitude or pitch of frequency. The emotion identification module 1650 may identify a word or text corresponding to the emphasized portion in the voice through the frequency signal and determine that the emotion corresponding to the word is the user's emotion.”  “[0224] According to an embodiment, the conversation style identification module 1750 may analyze at least one of conversation content and conversation history corresponding to the user's utterance. The conversation style identification module 1750 may also analyze the intonation of the user's utterance….”]
McDuff/Kim and Andruszkiewicz pertain to evaluating the emotion of a user who is interacting with a virtual personal assistant and responding to the user/speaker based on parameters corresponding to the user’s conversational style and emotion (Andruszkiewicz, Abstract). It would have been obvious to modify the system of McDuff/Kim, which receives and evaluates speech as input but does not elaborate on the evaluation of prosodic features of speech (such as tone of voice), with the system of Andruszkiewicz, which is more elaborate with respect to the evaluation of the sound/audio aspects of speech and specifically mentions “intonation,” for completeness.  This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to FARIBA SIRJANI whose telephone number is (571)270-1499. The examiner can normally be reached on 9 to 5, M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Desir, can be reached on 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Fariba Sirjani/
Primary Examiner, Art Unit 2659