DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
2.	Applicant’s arguments with respect to claims 1, 5 – 7, 10, 11, 13, 14, 17, 18, 20 – 23, 25, and 27 – 31 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Applicant argues that the prior art of record does not teach assigning a prosodic quality to the response dialogue of the conversational agent based on the facial expression of the user and on the acoustic variables of the speech/conversational input of the user, or that the linguistic style comprises utterance length (Amendment, pages 9 – 16).

Claim Rejections - 35 USC § 103
3.	The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.
4.	Claims 1, 5 – 7, 10, 11, 13, 21, 22, 25, 27, 28, and 30 are rejected under 35 U.S.C. 103 as being unpatentable over Amini et al. (US PAP 2018/0144761) in view of Galley et al. (US PAP 2016/0352656); and further in view of Mahoor et al. (US PAP 2020/0114521).
As per claim 1, Amini et al. teach a method comprising:

identifying an acoustic variable of the speech (“One or more vocal characteristics (e.g., volume, pitch, speed, frequency, energy, and/or intonation) of the user can be classified into one or more voice emotion categories (e.g., happy, angry, sad, surprised, disgusted, and/or scared).”; paragraphs 64, 102, 108);
receiving video input including a face of the user; identifying a facial expression of the user (“better understanding by the virtual agent of user utterances based on affective context (e.g., emotion, mood, personality and/or satisfaction of the user)… determining, by the computer, at least one of a facial expression, body gesture, vocal expression, or verbal expression for the virtual agent based on a content of the particular user utterance”; paragraphs 10, 11, claim 1);
generating a response dialogue of a conversational agent based on the content of the speech (“better understanding by the virtual agent of user utterances based on affective context (e.g., emotion, mood, personality and/or satisfaction of the user)… determining, by the computer, at least one of a facial expression, body gesture, vocal expression, or verbal expression for the virtual agent based on a content of the particular user utterance”; paragraphs 10, 11, claim 1).
However, Amini et al. do not specifically teach generating a response dialogue based on the content of the speech, through use of a neural network.
Galley et al. disclose that the neural network 322 can be trained from end to end on massive amounts of social media conversational data.  In this example, the response 
Therefore, it would have been obvious to one of ordinary skill in the art at the time the invention was made to generate a response dialogue based on the content of the speech, through use of a neural network as taught by Galley et al. in Amini et al., because that would help provide improved quality and accuracy of machine-generated responses, which enables more efficient communication between users and the response generation systems (Galley et al., paragraph 37).
However, Amini et al., in view of Galley et al. do not specifically teach assigning a prosodic quality to the response dialogue of the conversational agent based on the facial expression of the user and on the acoustic variables of the speech of the user; outputting synthesized speech representing the response dialogue having the prosodic quality.
Mahoor et al. disclose a dialog manager may use a task file that determines how the companion robot may interact with a user in response to the user's speech and affect (e.g., engagement, frustration, excitement, tone, facial expression, etc.)… In some embodiments, a dialog manager may interpret a user's facial expressions, eye gaze, and/or speech prosody to convey affect and/or produce corresponding facial expressions, eye gaze, and/or speech prosody.  For example, when the user is speaking, the companion robot can smile and nod to indicate it understands or agrees with what the user is saying.  As another example, the head may be moved by neck mechanisms to produce head nods or rotation of the head while 
Therefore, it would have been obvious to one of ordinary skill in the art at the time the invention was made to output synthesized speech representing the response dialogue having the prosodic quality as taught by Mahoor et al. in Amini et al., in view of Galley et al., because that would help provide improved quality and accuracy of machine-generated responses, which enables more efficient communication between users and the response generation systems (Galley et al., paragraph 37).

As per claim 5, Amini et al., in view of Galley et al. further disclose generating a synthetic facial expression for an embodied conversational agent based on a sentiment identified from the response dialogue (“facial expression”; Amini et al., paragraphs 11, 22; Galley et al. paragraph 137).

As per claim 6, Amini et al., in view of Galley et al. further disclose generating a synthetic facial expression for an embodied conversational agent based on the facial expression of the user (“Applying the emotion vector of the user, the mood vector of the user and/or the personality vector of the user to the virtual agent can involve instructing the virtual agent to modify one or more statements, facial expressions, vocal expressions, or body language to match and/or change an emotional state of the user.”; Amini et al., paragraphs 41, 65, 82).

As per claim 7, Amini et al. teach a system comprising:

memory storing instructions that, when executed by the one or more processors, cause the one or more processors (paragraphs 248, 249) to:
detect speech in the audio signal; recognize a content of the speech (“a content of the user utterance”; paragraph 11);
determine a conversational context associated with the speech (“better understanding by the virtual agent of user utterances based on affective context (e.g., emotion, mood, personality and/or satisfaction of the user)… determining, by the computer, at least one of a facial expression, body gesture, vocal expression, or verbal expression for the virtual agent based on a content of the particular user utterance”; paragraphs 10, 11, claim 1);
identifying a facial expression of a user in the image generated by the camera (“determining, by the computer, at least one of a facial expression, body gesture, vocal expression, or verbal expression for the virtual agent based on a content of the particular user utterance”; paragraphs 10, 11, claim 1).
However, Amini et al. do not specifically teach the conversational context comprises physical factors of an environment sensed by the system; assigning a prosodic quality to the response dialogue of the conversational agent based on the facial expression of the user and on the acoustic variables of the speech of the user; cause the speaker to generate the response dialogue of the conversational agent having the prosodic quality.

Therefore, it would have been obvious to one of ordinary skill in the art at the time the invention was made to detect physical factors of an environment sensed by the system as taught by Galley et al. in Amini et al., because that would help provide improved quality and accuracy of machine-generated responses, which enables more efficient communication between users and the response generation systems (Galley et al., paragraph 37).
However, Amini et al. in view of Galley et al. do not specifically teach assigning a prosodic quality to the response dialogue of the conversational agent based on the facial expression of the user and on the acoustic variables of the speech of the user; cause the speaker to generate the response dialogue of the conversational agent having the prosodic quality.
Mahoor et al. disclose a dialog manager may use a task file that determines how the companion robot may interact with a user in response to the user's speech and affect (e.g., engagement, frustration, excitement, tone, facial expression, etc.)… In some embodiments, a dialog manager may interpret a user's facial expressions, eye gaze, and/or speech prosody to convey affect and/or produce corresponding facial expressions, eye gaze, and/or speech prosody.  For example, when the user is speaking, the companion robot can smile and nod to indicate it understands or agrees 
Therefore, it would have been obvious to one of ordinary skill in the art at the time the invention was made to output synthesized speech representing the response dialogue having the prosodic quality as taught by Mahoor et al. in Amini et al., in view of Galley et al., because that would help provide improved quality and accuracy of machine-generated responses, which enables more efficient communication between users and the response generation systems (Galley et al., paragraph 37).

As per claim 10, Amini et al. further disclose a display, wherein the instructions cause the one or more processors to generate an embodied conversational agent on the display, and wherein the embodied conversational agent has a synthetic facial expression based on the conversational context associated with the speech (“automatically generating at least one of facial expressions, body gestures, vocal expressions, or verbal expressions for a virtual agent”; Abstract; paragraphs 11, 22, 134).

As per claim 11, Amini et al., further disclose the conversational context comprises a sentiment identified from the response dialogue and the synthetic facial expression of the embodied conversational agent is based on the sentiment (“determining an emotion vector for the virtual agent based on an emotion vector 


As per claim 13, Amini et al., in view of Mahoor et al. further disclose a camera, wherein the instructions cause the one or more processors to identify a head orientation of the user in the image generated by the camera, and wherein the embodied conversational agent has a head pose based on the head orientation of the user (Amini et al., paragraphs 41, 133, 166 – 169; Mahoor et al., paragraph 84).

As per claim 21, Amini et al., in view of Galley et al. further disclose the neural network uses a neural model built from a large-scale unconstrained database of human conversations (Galley et al., paragraphs 58, 59, 98, 137).

As per claim 22, Amini et al., in view of Mahoor et al. further disclose that the conversational context comprises an indication of where the user is looking as determined by eye tracking (Mahoor et al. paragraph 84).



As per claims 27 and 30, Amini et al., in view of Galley et al., and further in view of Mahoor et al. further disclose identifying an emotion from the facial expression of the user and wherein the prosodic quality of the response dialogue is based on the emotion (Mahoor et al., paragraphs 84, 85).

As per claim 28, Amini et al., in view of Galley et al., and further in view of Mahoor et al. further disclose the emotion is sadness and the prosodic quality is a lowering of tone (“select a voice tone or words, phrases, or sentences based on a mood to the user via the speakers… The integrated emotion may include the emotions of anger, anxiety, disgust, dejection, fear, grief, guilt, joy, loneliness, love, sadness, shame, and/or surprise, etc.”; Mahoor et al., paragraphs 68, 84, 85).

5.	Claims 14, 17, 18, 20, and 31 are rejected under 35 U.S.C. 103 as being unpatentable over Amini et al. (US PAP 2018/0144761) in view of Mahoor et al. (US PAP 2020/0114521).
As per claim 14, Amini et al. teach a computer-readable storage medium having computer-executable instructions stored thereupon, when executed by one or more processors of a computing system, cause the computing system to:

determine a linguistic style of the conversational input of the user, wherein the linguistic style comprises acoustic variables (“One or more vocal characteristics (e.g., volume, pitch, speed, frequency, energy, and/or intonation) of the user can be classified into one or more voice emotion categories (e.g., happy, angry, sad, surprised, disgusted, and/or scared).”; paragraphs 64, 102, 108);
determine a facial expression of the user (“better understanding by the virtual agent of user utterances based on affective context (e.g., emotion, mood, personality and/or satisfaction of the user)… determining, by the computer, at least one of a facial expression, body gesture, vocal expression, or verbal expression for the virtual agent based on a content of the particular user utterance”; paragraphs 10, 11, claim 1). 
	However, Amini et al. do not specifically teach assigning a prosodic quality to a response dialogue of a conversational agent based on the facial expression of the user and on the acoustic variables of the conversational input of the user; cause the conversational agent to generate the response dialogue having the prosodic quality.
Mahoor et al. disclose a dialog manager may use a task file that determines how the companion robot may interact with a user in response to the user's speech and affect (e.g., engagement, frustration, excitement, tone, facial expression, etc.)… In some embodiments, a dialog manager may interpret a user's facial expressions, eye gaze, and/or speech prosody to convey affect and/or produce corresponding facial expressions, eye gaze, and/or speech prosody.  For example, when the user is 
Therefore, it would have been obvious to one of ordinary skill in the art at the time the invention was made to output synthesized speech representing the response dialogue having the prosodic quality as taught by Mahoor et al. in Amini et al., because that would help produce contextually appropriate responses, facial expressions, and/or neck movements (Mahoor et al., paragraph 28).

	As per claim 17, Amini et al., in view of Mahoor et al. further disclose determination of the facial expression of the user comprises identifying an emotion of the user and wherein the prosodic quality of the response dialogue is based on the emotion (“a companion robot may express facial expressions, track a user's body and/or face, recognize a user's expressions, and/or react appropriately to the user's emotional state.”; Mahoor et al., paragraphs 25, 84; Amini et al., Abstract).

	As per claim 18, Amini et al., in view of Mahoor et al. further disclose the computing system is further caused to: identify a head orientation of the user; and cause the embodied conversational agent to have a head pose that is based on the head orientation of the user (“the head may be moved by neck mechanisms to produce head nods or rotation of the head while listening or thinking, which may be synchronized 

	As per claim 20, Amini et al., in view of Mahoor et al. further disclose the conversational agent is an embodied conversational agent and wherein a synthetic facial expression of the embodied conversational agent is based on a sentiment identified in the speech of the user (“determining an emotion vector for the virtual agent based on an emotion vector of a user for a user utterance…determining at least one of a facial expression, body gesture, vocal expression, or verbal expression for the virtual agent based on a content of the user utterance and at least one of the emotion vector for the virtual agent, the mood vector for the virtual agent and the personality vector for the virtual agent... determining the emotion vector of the user further comprises determining a sentiment score of the utterance by determining a sentiment score for each word in the utterance”; Amini et al., paragraphs 11, 22).

	As per claim 31, Amini et al., in view of Mahoor et al. further disclose the emotion is sadness and the prosodic quality is a lowering of tone (“select a voice tone or words, phrases, or sentences based on a mood to the user via the speakers… The integrated emotion may include the emotions of anger, anxiety, disgust, dejection, fear, grief, guilt, joy, loneliness, love, sadness, shame, and/or surprise, etc.”; Mahoor et al. paragraphs 68, 84, 85).

6.	Claim 23 is rejected under 35 U.S.C. 103 as being unpatentable over Amini et al. (US PAP 2018/0144761) in view of Filev et al. (US PAP 2014/0313208).
As per claim 23, Amini et al., in view of Mahoor et al. do not specifically teach the linguistic style comprises utterance length. 
Filev et al. disclose that syntactic analysis algorithms may use factors in the spoken speech such as sentence length (paragraphs 90, 140).
	Therefore, it would have been obvious to one of ordinary skill in the art at the time the invention was made to determine utterance length as taught by Filev et al. in Amini et al., in view of Mahoor et al., because that would help produce contextually appropriate responses (Mahoor et al., paragraph 28).

7.	Claim 29 is rejected under 35 U.S.C. 103 as being unpatentable over Amini et al. (US PAP 2018/0144761) in view of Galley et al. (US PAP 2016/0352656); further in view of Mahoor et al. (US PAP 2020/0114521); and further in view of Kuramitsu et al. (WO/2018230669).
However, Amini et al., in view of Galley et al. do not specifically teach detecting speech of the user during the outputting of the synthesized speech; and ceasing the outputting of the synthesized speech.
Kuramitsu et al. disclose that when a response other than a response to prompt reproduction of the next partial content is input (S36: end), the voice analysis unit 511 instructs the processing unit 510 to stop the output of the voice. In step S37, the processing unit 510 temporarily stops the output of the synthesized speech of the partial 
Therefore, it would have been obvious to one of ordinary skill in the art at the time the invention was made to cease the outputting of the synthesized speech when detecting speech of the user as taught by Kuramitsu et al. in Amini et al., in view of Galley et al., because that would help provide improved quality and accuracy of machine-generated responses, which enables more efficient communication between users and the response generation systems (Galley et al., paragraph 37).

Conclusion
8.	Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 


Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Richemond Dorvil, can be reached on (571) 272-7602.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.