DETAILED ACTION
Notice of Pre-AIA or AIA Status
1.	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
2.	A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 04/20/2021 has been entered.
 
Response to Arguments
3.	Applicant’s arguments with respect to claims 1 – 3, 5 – 7, 10 – 14, 17 – 26 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Applicant argues that the prior art of record does not teach: outputting synthesized speech representing the response dialogue; detecting speech of the user during the outputting of the synthesized speech; and ceasing the outputting of the synthesized speech; the conversational context comprises physical factors of the environment sensed by the system; assign a prosodic quality to the response dialogue.

4.	Applicant's arguments filed 04/20/21 have been fully considered but they are not persuasive. 
Applicant argues that Amini et al. in view of Horling et al. do not teach determine a conversational context associated with the speech; selecting one of the multiple response dialog choices based on the conversational context associated with the speech (Amendment, pages 12, 13).
The examiner disagrees, since Amini et al. disclose “Some advantages of the invention can include generating a BML for a virtual agent during a dialogue with a user.  Some advantages of the invention can include better understanding by the virtual agent of user utterances based on affective context (e.g., emotion, mood, personality and/or satisfaction of the user).” (paragraph 10).  In addition, Horling et al. disclose “a response selected from a plurality of candidate responses based on the state expressed by the user.  In various implementations, the state expressed by the user may be a negative sentiment (context), and the response selected from the plurality of candidate responses may include an empathetic statement” (paragraph 8).
Thus, the combination of Amini et al. in view of Horling et al. teaches all parts of the limitation.
 
Claim Rejections - 35 USC § 103
5.	The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.
6.	Claims 1 – 3, 5, 6, 21 are rejected under 35 U.S.C. 103 as being unpatentable over Amini et al. (US PAP 2018/0144761) in view of Galley et al. (US PAP 2016/0352656); and further in view of Kuramitsu et al. (WO/2018230669).
As per claim 1, Amini et al. teach a method comprising:
receiving audio input representing speech of a user; recognizing a content of the speech (“a content of the user utterance”; paragraph 11);
determining a linguistic style of the speech; generating a response dialogue based on the content of the speech; and modifying the response dialogue based on the linguistic style of the speech/prosodic qualities (“better understanding by the virtual agent of user utterances based on affective context (e.g., emotion, mood, personality and/or satisfaction of the user)… determining, by the computer, at least one of a facial expression, body gesture, vocal expression, or verbal expression for the virtual agent based on a content of the particular user utterance”; paragraphs 10, 11, claim 1).
However, Amini et al. do not specifically teach generating a response dialogue based on the content of the speech, through use of a neural network.
Galley et al. disclose that the neural network 322 can be trained from end to end on massive amounts of social media conversational data.  In this example, the response generation engine 318 utilizes the neural network 322 model to improve open-domain response generation in conversations (paragraph 98).

However, Amini et al. in view of Galley et al. do not specifically teach outputting synthesized speech representing the response dialogue; detecting speech of the user during the outputting of the synthesized speech; and ceasing the outputting of the synthesized speech.
Kuramitsu et al. disclose that when a response other than a response to prompt reproduction of the next partial content is input (S36: end), the voice analysis unit 511 instructs the processing unit 510 to stop the output of the voice.  In step S37, the processing unit 510 temporarily stops the output of the synthesized speech of the partial content. In step S38, the processing unit 510 performs processing according to the input voice of the user (paragraphs 68, 69).
Therefore, it would have been obvious to one of ordinary skill in the art at the time the invention was made to cease the outputting of the synthesized speech when detecting speech of the user as taught by Kuramitsu et al. in Amini et al. in view of Galley et al., because that would help provide improved quality and accuracy of machine-generated responses, enabling more efficient communication between users and the response generation systems (Galley et al., paragraph 37).
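
For illustration only, the behavior relied upon above (ceasing output of synthesized speech when speech of the user is detected during that output) might be sketched as follows. This is a minimal sketch and is not taken from Kuramitsu et al. or any other reference of record; the names `play_with_barge_in`, `chunks`, and `speech_detected` are hypothetical, and text strings stand in for audio.

```python
from typing import Callable, Iterable, List, Tuple


def play_with_barge_in(
    chunks: Iterable[str],
    speech_detected: Callable[[int], bool],
) -> Tuple[List[str], bool]:
    """Output synthesized-speech chunks until user speech is detected.

    Returns the chunks actually output and whether output was interrupted.
    Both parameters are hypothetical stand-ins for a real audio pipeline.
    """
    played: List[str] = []
    for index, chunk in enumerate(chunks):
        # Check for user speech before emitting the next chunk.
        if speech_detected(index):
            return played, True  # cease the output (barge-in)
        played.append(chunk)  # stand-in for sending audio to a speaker
    return played, False


if __name__ == "__main__":
    # Hypothetical detector: the user starts speaking at chunk index 2.
    played, interrupted = play_with_barge_in(
        ["Hello,", "here is", "a long", "answer."], lambda i: i >= 2
    )
    print(played, interrupted)
```

In this sketch the detector is polled once per chunk, so output stops at a chunk boundary; a real system would presumably poll at a much finer granularity.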



As per claim 3, Amini et al. in view of Galley et al. further disclose the content variables include at least one of repetition, or utterance length (“the one or more dialogue metrics relative length of dialogue, number of misunderstandings, number of repetitions”; Amini et al., paragraph 21).

As per claim 5, Amini et al., in view of Galley et al. further disclose generating a synthetic facial expression for an embodied conversational agent based on a sentiment identified from the response dialogue (“facial expression”; Amini et al., paragraphs 11, 22; Galley et al. paragraph 137).

As per claim 6, Amini et al., in view of Galley et al. further disclose identifying a facial expression of the user; and generating a synthetic facial expression for an embodied conversational agent based on the facial expression of the user (“Applying the emotion vector of the user, the mood vector of the user and/or the personality vector of the user to the virtual agent can involve instructing the virtual agent to modify one or more statements, facial expressions, vocal expressions, or body language to match and/or change an emotional state of the user.”; Amini et al., paragraphs 41, 65, 82).


7.	Claims 7, 10 – 13 are rejected under 35 U.S.C. 103 as being unpatentable over Amini et al. (US PAP 2018/0144761) in view of Horling et al. (US PAP 2018/0197542); and further in view of Galley et al. (US PAP 2016/0352656).
As per claim 7, Amini et al. teach a system comprising:
a microphone configured to generate an audio signal representative of sound; a speaker configured to generate audio output; one or more processors (paragraphs 3, 41, 245); and
memory storing instructions that, when executed by the one or more processors, cause the one or more processors (paragraphs 248, 249) to:
detect speech in the audio signal; recognize a content of the speech (“a content of the user utterance”; paragraph 11);
determine a conversational context associated with the speech (“better understanding by the virtual agent of user utterances based on affective context (e.g., emotion, mood, personality and/or satisfaction of the user)… determining, by the computer, at least one of a facial expression, body gesture, vocal expression, or verbal expression for the virtual agent based on a content of the particular user utterance”; paragraphs 10, 11, claim 1).
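However, Amini et al. do not specifically teach generate multiple response dialogue choices having response content based on the content of the speech; and select one of the multiple response dialogue choices based on the conversational context associated with the speech.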

Horling et al. disclose a response selected from a plurality of candidate responses based on the state expressed by the user.  In various implementations, the state expressed by the user may be a negative sentiment, and the response selected from the plurality of candidate responses may include an empathetic statement… In various implementations, the one or more signals may include detection of a change in a context of the user since a last interaction between the user and the chatbot (paragraphs 8, 9).
	Therefore, it would have been obvious to one of ordinary skill in the art at the time the invention was made to select one response dialog based on prosodic qualities as taught by Horling et al. in Amini et al., because that would help provide an appropriate response (paragraph 39).
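
For illustration only, selecting one response from a plurality of candidate responses based on a state expressed by the user (for example, choosing an empathetic statement when the state is a negative sentiment) might be sketched as follows. This is a minimal sketch under stated assumptions and is not the method of Horling et al.; the `Candidate` type, the `empathetic` flag, and the `"negative"` state label are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """A hypothetical candidate response with one style attribute."""
    text: str
    empathetic: bool


def select_response(candidates: List[Candidate], user_state: str) -> Candidate:
    """Pick one candidate response based on the user's expressed state."""
    if not candidates:
        raise ValueError("at least one candidate response is required")
    # Assumption: a negative state calls for an empathetic response.
    want_empathy = user_state == "negative"
    for candidate in candidates:
        if candidate.empathetic == want_empathy:
            return candidate
    return candidates[0]  # fall back to the first candidate


if __name__ == "__main__":
    options = [
        Candidate("Here is your schedule.", empathetic=False),
        Candidate("Sorry to hear that. Here is your schedule.", empathetic=True),
    ]
    print(select_response(options, "negative").text)
```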
However, Amini et al. in view of Horling et al. do not specifically teach the conversational context comprises physical factors of the environment sensed by the system.
Galley et al. disclose deriving conversational context data further comprises capturing non-linguistic context data from a set of sensors associated with the user in real-time, wherein the set of sensors comprises at least one of a camera, an audio sensor, a global positioning system (GPS) sensor, an infrared sensor, a pressure sensor, a motion sensor, an orientation sensor, temperature sensor, medical sensor, physical activity sensor, or speed sensor (paragraph 221).


As per claim 10, Amini et al., further disclose a display, and wherein the instructions cause the one or more processors to generate an embodied conversational agent on the display, and wherein the embodied conversational agent has a synthetic facial expression based on the conversational context associated with the speech (“automatically generating at least one of facial expressions, body gestures, vocal expressions, or verbal expressions for a virtual agent”; Abstract; paragraphs 11, 22, 134).

As per claim 11, Amini et al. further disclose the conversational context comprises a sentiment identified from the response dialog and the synthetic facial expression of the embodied conversational agent is based on the sentiment (“determining an emotion vector for the virtual agent based on an emotion vector of a user for a user utterance…determining at least one of a facial expression, body gesture, vocal expression, or verbal expression for the virtual agent based on a content of the user utterance and at least one of the emotion vector for the virtual agent, the mood vector for the virtual agent and the personality vector for the virtual agent... determining the emotion vector of the user further comprises determining a sentiment score of the utterance by determining a sentiment score for each word in the utterance”; paragraphs 11, 22).

As per claim 12, Amini et al., further disclose a camera, wherein the instructions cause the one or more processors to identify a facial expression of a user in an image generated by the camera, and wherein the conversational context comprises the facial expression of the user (“can involve receiving voice of the user (e.g., via a microphone) and/or facial expression of the user (e.g., via a camera).  In some embodiments, the method involves receiving one or more dialogue performance metrics (e.g., from a dialogue manager of the virtual agent).”; paragraphs 41, 65, 82).

As per claim 13, Amini et al., further disclose a camera, wherein the instructions cause the one or more processors to identify a head orientation of a user in an image generated by the camera, and wherein the embodied conversational agent has head pose based on the head orientation of the user (paragraphs 41, 133, 166- 169).

As per claim 25, Amini et al., in view of Galley et al., and further in view of Horling et al. further disclose the physical factors of the environment comprise at least one of location, movement, acceleration, orientation, ambient light levels, network connectivity, temperature, or humidity (Galley et al., paragraph 221).

8.	Claims 14, 17, 18, 20, 24 are rejected under 35 U.S.C. 103 as being unpatentable over Amini et al. (US PAP 2018/0144761) in view of Horling et al. (US PAP 2018/0197542); and further in view of Heckmann (US PAP 2013/0262117).
As per claim 14, Amini et al. teach a computer-readable storage medium having computer-executable instructions stored thereupon that, when executed by one or more processors of a computing system, cause the computing system to:
receive conversational input from a user, wherein the input comprises speech of the user; receive video input including a face of the user (“a content of the user utterance”; paragraphs 11, 41);
determine a linguistic style of the conversational input of the user, wherein the linguistic style comprises acoustic variables (“One or more vocal characteristics (e.g., volume, pitch, speed, frequency, energy, and/or intonation) of the user can be classified into one or more voice emotion categories (e.g., happy, angry, sad, surprised, disgusted, and/or scared).”; paragraphs 64, 102, 108);
determine a facial expression of the user (“better understanding by the virtual agent of user utterances based on affective context (e.g., emotion, mood, personality and/or satisfaction of the user)… determining, by the computer, at least one of a facial expression, body gesture, vocal expression, or verbal expression for the virtual agent based on a content of the particular user utterance”; paragraphs 10, 11, claim 1); and 
generate an embodied conversational agent having lip movement based on the response dialogue and a synthetic facial expression based on the facial expression of the user (“applying, by the computer, the facial expression, body gesture, 

	However, Amini et al. do not specifically teach generating a plurality of response dialogue choices based on the conversational input of the user, wherein each of the response dialogue choices is a possible response to the conversational input of the user and each is characterized by linguistic style variables; select a response dialog from the plurality of response dialog choices based on linguistic style variables of the plurality of response dialog choices and the linguistic style of conversational input of the user.
	Horling et al. disclose a response selected from a plurality of candidate responses based on the state expressed by the user.  In various implementations, the state expressed by the user may be a negative sentiment, and the response selected from the plurality of candidate responses may include an empathetic statement… In various implementations, the one or more signals may include detection of a change in a context of the user since a last interaction between the user and the chatbot… online semantic processor 54 may handle other types of voice inputs, such as conversational statements from a user expressing the user's state (e.g., sentiment) (paragraphs 8, 9, 35).
	Therefore, it would have been obvious to one of ordinary skill in the art at the time the invention was made to select one response dialog based on linguistic style variables as taught by Horling et al. in Amini et al., because that would help provide an appropriate response (paragraph 39).

Heckmann discloses that the prosodic cues may be either extracted from the speech/acoustical signal, the video signal, e.g. representing a recording of a user's upper body, preferably including the head and face…the invention uses prosodic speech features in a human machine dialog to infer the importance of different parts of an utterance and to use this information to make the dialog more intuitive and more robust. The dialog is more intuitive because users can speak more natural, i.e. using prosody, and the system also uses prosodic cues to give feedback to the user (paragraphs 27, 70, 71).
	Therefore, it would have been obvious to one of ordinary skill in the art at the time the invention was made to assign a prosodic quality to the response dialogue as taught by Heckmann in Horling et al. in view of Amini et al., because that would help improve spoken dialog systems (paragraph 19).

	As per claim 17, Amini et al., further disclose determination of the facial expression of the user comprises identifying an emotional expression of the user (Abstract).

	As per claim 18, Amini et al., further disclose the computing system is further caused to: identify a head orientation of the user; and cause the embodied 

	As per claim 20, Amini et al., further disclose the synthetic facial expression is based on a sentiment identified in the speech of the user (“determining an emotion vector for the virtual agent based on an emotion vector of a user for a user utterance…determining at least one of a facial expression, body gesture, vocal expression, or verbal expression for the virtual agent based on a content of the user utterance and at least one of the emotion vector for the virtual agent, the mood vector for the virtual agent and the personality vector for the virtual agent... determining the emotion vector of the user further comprises determining a sentiment score of the utterance by determining a sentiment score for each word in the utterance”; paragraphs 11, 22).
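
For illustration only, determining a sentiment score of an utterance from a sentiment score for each word, as recited in the quoted passage, might be sketched as follows. This is a minimal sketch and not the implementation of Amini et al.; the lexicon `WORD_SCORES`, the function names, and the choice to combine word scores by averaging are all assumptions made for the example.

```python
import re
from typing import Dict, List

# Hypothetical per-word sentiment lexicon; unknown words score 0.0.
WORD_SCORES: Dict[str, float] = {
    "love": 1.0,
    "great": 0.5,
    "bad": -0.5,
    "hate": -1.0,
}


def word_scores(utterance: str) -> List[float]:
    """Return one sentiment score per word in the utterance."""
    words = re.findall(r"[a-z']+", utterance.lower())
    return [WORD_SCORES.get(word, 0.0) for word in words]


def utterance_score(utterance: str) -> float:
    """Combine per-word scores into one score (assumption: simple mean)."""
    scores = word_scores(utterance)
    if not scores:
        return 0.0  # no words, treat as neutral
    return sum(scores) / len(scores)


if __name__ == "__main__":
    print(utterance_score("I hate this bad service"))
```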

	As per claim 24, Amini et al. in view of Horling et al. further disclose the linguistic style further comprises content variables and the response dialog is selected from the plurality of response dialog choices based on the content variables (“the two or more inputs further comprises one or more dialogue metrics, the one or more dialogue metrics relative length of dialogue, number of misunderstandings, number of repetitions, and/or number of clarification requests by dialogue manager.”; Amini et al., paragraphs 21, 59; Horling et al., paragraphs 8, 9, 35).

9.	Claims 22, 23 are rejected under 35 U.S.C. 103 as being unpatentable over Amini et al. (US PAP 2018/0144761) in view of Horling et al. (US PAP 2018/0197542); and further in view of Gong (US PAP 2003/0167167).
As per claim 22, Amini et al. do not specifically teach the conversational context comprises an indication of where the user is looking as determined by eye tracking.
Gong discloses that a video camera or a vision tracking device may provide non-verbal data about the user's eye focus, head orientation, and other body position information (paragraph 25).  
Therefore, it would have been obvious to one of ordinary skill in the art at the time the invention was made to use an eye tracking device as taught by Gong in Horling et al. in view of Amini et al., because that would help provide an improved experience for the user as the agent assists the user in operating a computing device or computing device application program (paragraph 21).

As per claim 23, Amini et al., in view of Horling et al. do not specifically teach the   comprise word choice and utterance length. 
Gong discloses that the verbal extractor 322 also parses the verbal content to determine the linguistic style of the user, such as word choice, grammar choice, and syntax style. Speech style may include speech rate, pitch average, pitch range, intensity, voice quality, pitch changes, and level of articulation (paragraphs 27, 69).
	Therefore, it would have been obvious to one of ordinary skill in the art at the time the invention was made to determine linguistic style variables as taught by Gong in Horling et al. in view of Amini et al., because that would help provide an improved .

10.	Claim 26 is rejected under 35 U.S.C. 103 as being unpatentable over Amini et al. (US PAP 2018/0144761) in view of Galley et al. (US PAP 2016/0352656); further in view of Kuramitsu et al. (WO/2018230669); and further in view of Heckmann (US PAP 2013/0262117).
As per claim 26, Amini et al. in view of Galley et al. do not specifically teach assigning a prosodic quality to synthesized speech based on the facial expression of the user and on the acoustic variables of the speech of the user.
Heckmann discloses that the prosodic cues may be either extracted from the speech/acoustical signal, the video signal, e.g. representing a recording of a user's upper body, preferably including the head and face…the invention uses prosodic speech features in a human machine dialog to infer the importance of different parts of an utterance and to use this information to make the dialog more intuitive and more robust. The dialog is more intuitive because users can speak more natural, i.e. using prosody, and the system also uses prosodic cues to give feedback to the user (paragraphs 27, 70, 71).
Therefore, it would have been obvious to one of ordinary skill in the art at the time the invention was made to assign a prosodic quality to the response dialogue as taught by Heckmann in Galley et al. in view of Amini et al., because that would help improve spoken dialog systems (paragraph 19).


Conclusion
11.	Any inquiry concerning this communication or earlier communications from the examiner should be directed to LEONARD SAINT CYR whose telephone number is (571) 272-4247.  The examiner can normally be reached Monday through Friday.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Richemond Dorvil can be reached on (571) 272-7602.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/LEONARD SAINT CYR/Primary Examiner, Art Unit 2658