DETAILED ACTION
Notice of Pre-AIA or AIA Status
	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
Applicant’s arguments with respect to claims 1-19 have been considered but are moot because of the new ground of rejection in view of Lindahl and Chandrasekaran for claims 1-2, 4-9, 11-16 and 19; and Lindahl, Chandrasekaran and Gong for claims 10 and 18. Specifically, the claim limitations regarding the language “wherein the contextually-adjusted language adjustment changes what is said by the voice assistant, wherein generating the contextually-adjusted language adjustment comprises: accessing a library of phrases; comparing the base characteristics and/or the media content characteristics to a phrase characteristic associated with a phrase; and selecting the phrase based on a similarity between the base characteristics and/or the media content characteristics and the phrase characteristic; generating a contextually-adjusted speech adjustment based at least in part on the base characteristics and the media content characteristics, wherein contextually-adjusted speech adjustment changes how something is said by the voice assistant; and generating the synthesized speech based on the contextually-adjusted language adjustment and the contextually-adjusted speech adjustment” in claims 1, 11 and 19 caused the new grounds of rejection.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2, 4-9, 11-16 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Lindahl et al. [US PG Pub 20110066438] in view of Chandrasekaran et al. [US PG Pub 20190266999].

	As per Claim 1, Lindahl discloses:
A method for generating synthesized speech of a voice assistant having a contextually-adjusted audio output using a voice-enabled device, the method comprising:
	identifying media content currently being played by a media playback system (Lindahl; Fig. 8, item 146; p. 0077 - The method 144 may include a step 146 of analyzing primary audio material (e.g., music, speech, or a video soundtrack) of a media item);
	identify first media content characteristics associated with the first media content currently being played (Lindahl; Fig. 8, item 148; p. 0077-0078 - From such analysis, a reverberation characteristic of the primary audio material may be determined in a step 148);
	identify first base characteristics of audio output (Lindahl; Fig. 8; p. 0079 - if it is determined that a music track has significant reverberation, the reverberation of a voiceover announcement associated with the music track (e.g., a song title, artist name, or playlist name) may be increased to make the voiceover announcement sound as if it were recorded in the same venue as the music track; see also p. 0069);
	generating a contextually-adjusted language adjustment output based at least in part on the base characteristics and the media content characteristics (Lindahl; Fig. 8, item 150; p. 0079 - The reverberation characteristic of the voiceover announcement may be modified to more closely approximate that of the primary audio material, which may result in a user perceiving a voiceover announcement (played concurrently with or close in time to the primary audio material) to be more natural; p. 0089 - the voice feedback may be altered to add different linguistic accents to the speech depending on the genre or some other contextual aspect of the media item).
	Lindahl, however, fails to disclose wherein the contextually-adjusted language adjustment changes what is said by the voice assistant, wherein generating the contextually-adjusted language adjustment comprises: accessing a library of phrases; comparing the base characteristics and/or the media content characteristics to a phrase characteristic associated with a phrase; and selecting the phrase based on a similarity between the base characteristics and/or the media content characteristics and the phrase characteristic; generating a contextually-adjusted speech adjustment based at least in part on the base characteristics and the media content characteristics, wherein contextually-adjusted speech adjustment changes how something is said by the voice assistant; and generating the synthesized speech based on the contextually-adjusted language adjustment and the contextually adjusted speech adjustment.
	Chandrasekaran does teach wherein the contextually-adjusted language adjustment changes what is said by the voice assistant, wherein generating the contextually-adjusted language adjustment comprises: accessing a library of phrases; comparing the base characteristics and/or the media content characteristics to a phrase characteristic associated with a phrase; and selecting the phrase based on a similarity between the base characteristics and/or the media content characteristics and the phrase characteristic; generating a contextually-adjusted speech adjustment based at least in part on the base characteristics and the media content characteristics, wherein contextually-adjusted speech adjustment changes how something is said by the voice assistant (Chandrasekaran; Fig. 4; p. 0062 - FIG. 4 illustrates how, given a conversational state 20 and inferred emotional state 24 (base characteristics), response selector 26 may vary the text response sent to the TTS subsystem 28 whereby responses may be varied based on rules or lookups against pre-selected responses or encoded as responses generated by a learned model such as a deep neural network to generate conversationally and emotionally appropriate responses having the appropriately contextualized text and tone according to an example embodiment (the contextually-adjusted language adjustment changes what is said by the voice assistant). In sample embodiments, the neural network generates a representation of the user's emotional state. For example, the representation of the user's emotional state may be a high-dimensional vector of weights that can richly represent the emotional state of the user. In one instantiation, this vector can be compared against similar vectors computed over the space of possible responses (library of phrases). The distance between the two vectors (e.g., using cosine similarity) is one way in which the deep neural network system could select the best response (closer=better); p. 0112 - Example 5 is an example as in Example 1 wherein the memory device further comprises instructions stored therein, which when executed by the processing circuitry, configure the processing circuitry to output the generated response to the user using language, volume, and tone that accounts for the determined emotional state of the user by modifying at least one of a speed, tone, and language in the response to the user); and generating the synthesized speech based on the contextually-adjusted language adjustment and the contextually adjusted speech adjustment (Chandrasekaran; p. 0112 - Example 5 is an example as in Example 1 wherein the memory device further comprises instructions stored therein, which when executed by the processing circuitry, configure the processing circuitry to output the generated response to the user using language, volume, and tone that accounts for the determined emotional state of the user by modifying at least one of a speed, tone, and language in the response to the user).
	Therefore, it would have been obvious to one of ordinary skill in the art to modify the method of Lindahl to include wherein the contextually-adjusted language adjustment changes what is said by the voice assistant, wherein generating the contextually-adjusted language adjustment comprises: accessing a library of phrases; comparing the base characteristics and/or the media content characteristics to a phrase characteristic associated with a phrase; and selecting the phrase based on a similarity between the base characteristics and/or the media content characteristics and the phrase characteristic; generating a contextually-adjusted speech adjustment based at least in part on the base characteristics and the media content characteristics, wherein contextually-adjusted speech adjustment changes how something is said by the voice assistant; and generating the synthesized speech based on the contextually-adjusted language adjustment and the contextually adjusted speech adjustment, as taught by Chandrasekaran, because an empathetic response from a PVA (personalized for the user based on previous interactions) will drive outcomes such as more engaging interactions, increased user satisfaction, and increased usage (Chandrasekaran; p. 0002).

	As per Claim 2, Lindahl in view of Chandrasekaran discloses: 
	The method of claim 1, wherein the contextually-adjusted characteristics of audio output are further based on user-specific adjustments to the base characteristics of audio output (Lindahl; p. 0058 - once the primary media item is received (step 84), a user may select an option to speak a desired voice feedback announcement into an audio receiver, such as a microphone device connected to the host device 68, or the audio input/output elements 44 on the handheld device 10. The spoken portion recorded through the audio receiver may be saved as the voice feedback audio data that may be played back concurrently with the primary media item).  

	As per Claim 4, Lindahl in view of Chandrasekaran discloses: 
The method of claim 1, wherein identifying the media content characteristics comprises: analyzing audio of the media content to determine musical characteristics of the media content (Lindahl; p. 0068 - The primary audio material 96 may include a song or other music, an audiobook, a podcast, or any other audio and/or video data that is electronically stored for future playback; see also p. 0077); and analyzing media content metadata to determine metadata-based characteristics (Lindahl; p. 0068 - The media file 94 may also include metadata 98, such as various tags that store data pertaining to the primary audio material 96).		As per Claim 5, Lindahl in view of Chandrasekaran discloses:
The method of claim 4, wherein generating a contextually-adjusted audio output is based at least in part upon the musical characteristics of the media content (Lindahl; p. 0068 - The primary audio material 96 may include a song or other music, an audiobook, a podcast, or any other audio and/or video data that is electronically stored for future playback; see also p. 0077). 
	As per Claim 6, Lindahl in view of Chandrasekaran discloses:
The method of claim 5, wherein generating the contextually- adjusted audio output comprises generating mood-related attributes that are compatible with the musical characteristics of the media content currently being played (Lindahl; p. 0087 - the audio filter may be applied in a step 200 to make the speech of the voiceover announcement sound more "smooth" (mood), such as by varying the relative intensities of overtones of the voiceover announcement to emphasize harmonic overtones).	As per Claim 7, Lindahl in view of Chandrasekaran discloses:
	The method of claim 5, wherein generating the contextually- adjusted audio output comprises generating mood-related attributes that are compatible with metadata-based characteristics of the media content currently being played (Lindahl; p. 0087 - the audio filter may be applied in a step 200 to make the speech of the voiceover announcement sound more "smooth" (mood), such as by varying the relative intensities of overtones of the voiceover announcement to emphasize harmonic overtones).

	As per Claim 8, Lindahl in view of Chandrasekaran discloses:
	The method of claim 1, upon which claim 8 depends.
	And further, Chandrasekaran teaches wherein the user-specific adjustments are based on the user's listening history (Chandrasekaran; p. 0052 - As also depicted in FIG. 3, the PVA system may include a data store 53 that stores a history of past user interactions with the PVA and/or emotional states for use with the heuristics 50. The data store 53 is updated with interaction data, emotional state inferences, etc. over time. As described herein, the historical data is useful for establishing a baseline emotional state for each user. Also, the history of past interactions (defined at the session/conversation level, for each dialog turn, or for each user action (in a non-conversational instantiation)) can be used to normalize feature values in the ML classifier 30, among other things that will be apparent to those skilled in the art).
	Therefore, it would have been obvious to one of ordinary skill in the art to modify the method of Lindahl to include the contextually-adjusted characteristics including a language adjustment and a speech adjustment, as taught by Chandrasekaran, because an empathetic response from a PVA (personalized for the user based on previous interactions) will drive outcomes such as more engaging interactions, increased user satisfaction, and increased usage (Chandrasekaran; p. 0002).

	As per Claim 9, Lindahl in view of Chandrasekaran discloses:
	The method of claim 1, wherein using the contextually-adjusted audio output to generate synthesized speech further comprises: determining a pronunciation and an emotion for speaking the words based upon the speech adjustment associated with the contextually-adjusted audio output characteristics (Lindahl; p. 0089 - the voice feedback may be altered to add different linguistic accents to the speech depending on the genre or some other contextual aspect of the media item).
	And further, Chandrasekaran teaches selecting words to be spoken by the voice assistant using a natural language generator based upon the language adjustment associated with the contextually-adjusted audio output characteristics (Chandrasekaran; Fig. 4; p. 0062 - FIG. 4 illustrates how, given a conversational state 20 and inferred emotional state 24, response selector 26 may vary the text response sent to the TTS subsystem 28 whereby responses may be varied based on rules or lookups against pre-selected responses or encoded as responses generated by a learned model such as a deep neural network to generate conversationally and emotionally appropriate responses having the appropriately contextualized text and tone according to an example embodiment. In sample embodiments, the neural network generates a representation of the user's emotional state. For example, the representation of the user's emotional state may be a high-dimensional vector of weights that can richly represent the emotional state of the user. In one instantiation, this vector can be compared against similar vectors computed over the space of possible responses. The distance between the two vectors (e.g., using cosine similarity) is one way in which the deep neural network system could select the best response (closer=better); p. 0112 - Example 5 is an example as in Example 1 wherein the memory device further comprises instructions stored therein, which when executed by the processing circuitry, configure the processing circuitry to output the generated response to the user using language, volume, and tone that accounts for the determined emotional state of the user by modifying at least one of a speed, tone, and language in the response to the user).
	Therefore, it would have been obvious to one of ordinary skill in the art to modify the method of Lindahl to include selecting words to be spoken by the voice assistant using a natural language generator based upon the language adjustment associated with the contextually-adjusted audio output characteristics, as taught by Chandrasekaran, because an empathetic response from a PVA (personalized for the user based on previous interactions) will drive outcomes such as more engaging interactions, increased user satisfaction, and increased usage (Chandrasekaran; p. 0002).
  
	As per Claim 11, Lindahl discloses:
A voice assistant system comprising:
	at least one processing device (Lindahl; Fig. 2, item 50; p. 0039-0040); and
	at least one computer readable storage device storing data instructions that, when executed by the at least one processing device (Lindahl; p. 0042 - Various software programs may be stored in the memory 52 and/or the non-volatile storage 54 (or in some other memory or storage of a different device, such as host device 68 (FIG. 3)), and may include application instructions for execution by a processor), cause the at least one processing device to:
at a first time (Lindahl; p. 0073 - a feedback event may be a track change or playlist change (a track change means that subsequent media is identified) that is manually initiated by a user or automatically initiated by a media player application (e.g., upon detecting the end of a primary media track)):
	identify a first media content currently being played by a media playback system (Lindahl; Fig. 8, item 146; p. 0077 - The method 144 may include a step 146 of analyzing primary audio material (e.g., music, speech, or a video soundtrack) of a media item);
	identify first media content characteristics associated with the first media content currently being played (Lindahl; Fig. 8, item 148; p. 0077-0078 - From such analysis, a reverberation characteristic of the primary audio material may be determined in a step 148);
	identify first base characteristics of audio output (Lindahl; Fig. 8; p. 0079 - if it is determined that a music track has significant reverberation, the reverberation of a voiceover announcement associated with the music track (e.g., a song title, artist name, or playlist name) may be increased to make the voiceover announcement sound as if it were recorded in the same venue as the music track; see also p. 0069);
	generate first contextually-adjusted audio output characteristics based at least in part on the base characteristics of audio output and the media content characteristics (Lindahl; Fig. 8, item 150; p. 0079 - The reverberation characteristic of the voiceover announcement may be modified to more closely approximate that of the primary audio material, which may result in a user perceiving a voiceover announcement (played concurrently with or close in time to the primary audio material) to be more natural);
	generate first synthesized speech based on the first contextually-adjusted audio output characteristics (Lindahl; Fig. 8, item 150; p. 0079 - The reverberation characteristic of the voiceover announcement may be modified to more closely approximate that of the primary audio material, which may result in a user perceiving a voiceover announcement (played concurrently with or close in time to the primary audio material) to be more natural; p. 0056 - The voice synthesis program may process the extracted information to generate one or more audio files representing synthesized speech, such that when played back, a user may hear the song title, album name, and/or artist name being spoken);
at a second time (Lindahl; p. 0073 - a feedback event may be a track change or playlist change (a track change means that subsequent media is identified) that is manually initiated by a user or automatically initiated by a media player application (e.g., upon detecting the end of a primary media track)):
	identify a second media content currently being played by the media playback system (Lindahl; Fig. 8, item 148; p. 0077-0078 - From such analysis, a reverberation characteristic of the primary audio material may be determined in a step 148);
	identify second media content characteristics associated with the second media content currently being played, the first media content characteristics are different from the second media content characteristics (Lindahl; Fig. 8, item 148; p. 0077-0078 - From such analysis, a reverberation characteristic of the primary audio material may be determined in a step 148; Lindahl; p. 0073 - a feedback event may be a track change or playlist change (a track change means that subsequent media is identified) that is manually initiated by a user or automatically initiated by a media player application (e.g., upon detecting the end of a primary media track));
	identify second base characteristics of audio output (Lindahl; Fig. 8; p. 0079 - if it is determined that a music track has significant reverberation, the reverberation of a voiceover announcement associated with the music track (e.g., a song title, artist name, or playlist name) may be increased to make the voiceover announcement sound as if it were recorded in the same venue as the music track; see also p. 0069);
	generate second contextually-adjusted audio output characteristics based at least in part on the base characteristics of audio output and the second media content characteristics (Lindahl; Fig. 8, item 150; p. 0079 - The reverberation characteristic of the voiceover announcement may be modified to more closely approximate that of the primary audio material, which may result in a user perceiving a voiceover announcement (played concurrently with or close in time to the primary audio material) to be more natural).
Reply to Final Office Action of July 7, 2021
	Lindahl, however, fails to disclose wherein the contextually-adjusted language adjustment changes what is said by the voice assistant, wherein generating the contextually-adjusted language adjustment comprises: accessing a library of phrases; comparing the base characteristics and/or the media content characteristics to a phrase characteristic associated with a phrase; and selecting the phrase based on a similarity between the base characteristics and/or the media content characteristics and the phrase characteristic; generating a contextually-adjusted speech adjustment based at least in part on the base characteristics and the media content characteristics, wherein contextually-adjusted speech adjustment changes how something is said by the voice assistant; and generating the synthesized speech based on the contextually-adjusted language adjustment and the contextually adjusted speech adjustment; and transition from the first synthesized speech to second synthesized speech, wherein the first synthesized speech is different from the second synthesized speech.
	Chandrasekaran does teach wherein the contextually-adjusted language adjustment changes what is said by the voice assistant, wherein generating the contextually-adjusted language adjustment comprises: accessing a library of phrases; comparing the base characteristics and/or the media content characteristics to a phrase characteristic associated with a phrase; and selecting the phrase based on a similarity between the base characteristics and/or the media content characteristics and the phrase characteristic; generating a contextually-adjusted speech adjustment based at least in part on the base characteristics and the media content characteristics, wherein contextually-adjusted speech adjustment changes how something is said by the voice assistant (Chandrasekaran; Fig. 4; p. 0062 - FIG. 4 illustrates how, given a conversational state 20 and inferred emotional state 24 (base characteristics), response selector 26 may vary the text response sent to the TTS subsystem 28 whereby responses may be varied based on rules or lookups against pre-selected responses or encoded as responses generated by a learned model such as a deep neural network to generate conversationally and emotionally appropriate responses having the appropriately contextualized text and tone according to an example embodiment (the contextually-adjusted language adjustment changes what is said by the voice assistant). In sample embodiments, the neural network generates a representation of the user's emotional state. For example, the representation of the user's emotional state may be a high-dimensional vector of weights that can richly represent the emotional state of the user. In one instantiation, this vector can be compared against similar vectors computed over the space of possible responses (library of phrases). The distance between the two vectors (e.g., using cosine similarity) is one way in which the deep neural network system could select the best response (closer=better); p. 0112 - Example 5 is an example as in Example 1 wherein the memory device further comprises instructions stored therein, which when executed by the processing circuitry, configure the processing circuitry to output the generated response to the user using language, volume, and tone that accounts for the determined emotional state of the user by modifying at least one of a speed, tone, and language in the response to the user); and generating the synthesized speech based on the contextually-adjusted language adjustment and the contextually adjusted speech adjustment; and transition from the first synthesized speech to second synthesized speech, wherein the first synthesized speech is different from the second synthesized speech (Chandrasekaran; p. 0112 - Example 5 is an example as in Example 1 wherein the memory device further comprises instructions stored therein, which when executed by the processing circuitry, configure the processing circuitry to output the generated response to the user using language, volume, and tone that accounts for the determined emotional state of the user by modifying at least one of a speed, tone, and language in the response to the user).
Therefore, it would have been obvious to one of ordinary skill in the art to modify the system of Lindahl to include wherein the contextually-adjusted language adjustment changes what is said by the voice assistant, wherein generating the contextually-adjusted language adjustment comprises: accessing a library of phrases; comparing the base characteristics and/or the media content characteristics to a phrase characteristic associated with a phrase; and selecting the phrase based on a similarity between the base characteristics and/or the media content characteristics and the phrase characteristic; generating a contextually-adjusted speech adjustment based at least in part on the base characteristics and the media content characteristics, wherein contextually-adjusted speech adjustment changes how something is said by the voice assistant; and generating the synthesized speech based on the contextually-adjusted language adjustment and the contextually adjusted speech adjustment; and transition from the first synthesized speech to second synthesized speech, wherein the first synthesized speech is different from the second synthesized speech, as taught by Chandrasekaran, because an empathetic response from a PVA (personalized for the user based on previous interactions) will drive outcomes such as more engaging interactions, increased user satisfaction, and increased usage (Chandrasekaran; p. 0002).

	As per Claim 12, Lindahl in view of Chandrasekaran discloses:
	The voice assistant system of claim 11, further comprising a voice-enabled device configured for interaction with a user via voice (Lindahl; p. 0038 - The electronic device 10 may also include various audio input and output elements. For example, the audio input/output elements, depicted generally by reference numeral 44, may include an input receiver, which may be provided as one or more microphone devices), wherein the voice-enabled device comprises the at least one processing device and the at least one computer readable storage device (Lindahl; Fig. 2, item 50; p. 0039-0042).  

	As per Claim 13, Lindahl in view of Chandrasekaran discloses:
	The voice assistant system of claim 11, further comprising a media delivery system comprising at least one server computing device comprising the at least one processing device (Lindahl; Fig. 3; p. 0051 - a networked system 66 through which media items may be transferred between a host device (e.g., a personal desktop computer) 68, the portable handheld device 10, or a digital media content provider 76 is illustrated).  

As per Claim 14, Lindahl in view of Chandrasekaran discloses: 
The voice assistant system of claim 11, upon which claim 14 depends.	And, further Chandrasekaran discloses wherein the base characteristics of audio output are user-specific characteristics of audio output generated based at least in part on a listening history of a user and brand characteristics of audio output (Chandrasekaran; p. 0052 - As also depicted in FIG. 3, the PVA system may include a data store 53 that stores a history of past user interactions with the PVA and/or emotional states for use with the heuristics 50. The data store 53 is updated with interaction data, emotional state inferences, etc. over time. As described herein, the historical data is useful for establishing a baseline emotional state for each user. Also, the history of past interactions (defined at the session/conversation level, for each dialog turn, or for each user action (in a non-conversational instantiation)) can be used to normalize feature values in the ML classifier 30, among other things that will be apparent to those skilled in the art).	Therefore, it would have been obvious to one of ordinary skill in the art to modify the system of Lindahl to include wherein the base characteristics of audio output are user-specific characteristics of audio output generated based at least in part on a listening history of a user and brand characteristics of audio output, as taught by Chandrasekaran, because an empathetic response from a PVA (personalized for the user based on previous interactions) will drive outcomes such as more engaging interactions, increased user satisfaction, and increased usage (Chandrasekaran; p. 0002).	Claim 15 contains subject matter respectively similar to claims 4-7, and thus, is rejected under similar rationale.		As per Claim 16, Lindahl in view of Chandrasekaran discloses:
The voice assistant system of claim 11, wherein generating the first or second synthesized speech is performed by a contextual audio output adjuster, and wherein the contextual audio output adjuster further comprises data instructions that cause the at least one processing device to: send the speech adjustments to a text-to-speech engine, the speech adjustments defining pronunciation adjustments and emotion adjustments to be applied to the words when spoken by the voice assistant (Lindahl; p. 0089 - the voice feedback may be altered to add different linguistic accents to the speech depending on the genre or some other contextual aspect of the media item; p. 0087 - the audio filter may be applied in a step 200 to make the speech of the voiceover announcement sound more "smooth" (mood), such as by varying the relative intensities of overtones of the voiceover announcement to emphasize harmonic overtones).
Further, Chandrasekaran teaches sending the language adjustments to a natural language generator to select words to be spoken by the voice assistant (Chandrasekaran; Fig. 4; p. 0062 - FIG. 4 illustrates how, given a conversational state 20 and inferred emotional state 24 (base characteristics), response selector 26 may vary the text response sent to the TTS subsystem 28 whereby responses may be varied based on rules or lookups against pre-selected responses or encoded as responses generated by a learned model such as a deep neural network to generate conversationally and emotionally appropriate responses having the appropriately contextualized text and tone according to an example embodiment (the contextually-adjusted language adjustment changes what is said by the voice assistant). In sample embodiments, the neural network generates a representation of the user's emotional state. For example, the representation of the user's emotional state may be a high-dimensional vector of weights that can richly represent the emotional state of the user. In one instantiation, this vector can be compared against similar vectors computed over the space of possible responses (library of phrases). The distance between the two vectors (e.g., using cosine similarity) is one way in which the deep neural network system could select the best response (closer=better); p. 0112 - Example 5 is an example as in Example 1 wherein the memory device further comprises instructions stored therein, which when executed by the processing circuitry, configure the processing circuitry to output the generated response to the user using language, volume, and tone that accounts for the determined emotional state of the user by modifying at least one of a speed, tone, and language in the response to the user).
Therefore, it would have been obvious to one of ordinary skill in the art to modify the system of Lindahl to include sending the language adjustments to a natural language generator to select words to be spoken by the voice assistant and generating language adjustments based on the first or second contextually-adjusted audio output, as taught by Chandrasekaran, because an empathetic response from a PVA (personalized for the user based on previous interactions) will drive outcomes such as more engaging interactions, increased user satisfaction, and increased usage (Chandrasekaran; p. 0002).
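For illustration only, the vector comparison described in Chandrasekaran at p. 0062 can be sketched as follows. The phrase library, the three-dimensional vectors, and the function names below are hypothetical and do not appear in the reference or in the claims; the sketch merely shows selection of the library phrase whose characteristic vector is closest, by cosine similarity, to a vector representing the user's emotional state.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


def select_phrase(state_vector, phrase_library):
    """Pick the phrase whose vector is most similar to the state vector."""
    return max(
        phrase_library,
        key=lambda phrase: cosine_similarity(state_vector, phrase_library[phrase]),
    )


# Hypothetical library: each phrase maps to a made-up characteristic vector.
PHRASE_LIBRARY = {
    "Sorry to hear that. Here is something calm.": [0.9, 0.1, 0.0],
    "Great! Let's keep the energy up.": [0.1, 0.9, 0.2],
    "Here is your next track.": [0.3, 0.3, 0.9],
}

# Hypothetical vector standing in for the inferred emotional state.
user_state = [0.8, 0.2, 0.1]
selected = select_phrase(user_state, PHRASE_LIBRARY)
print(selected)
```

In this sketch a higher cosine similarity corresponds to the "closer=better" criterion quoted above; a learned model would supply the vectors in practice.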

Claim 19 contains subject matter respectively similar to claim 11, and thus, is rejected under similar rationale.

Claims 10 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Lindahl et al. [US PG Pub 20110066438] in view of Chandrasekaran et al. [US PG Pub 20170125008] and further in view of Gong [US PG Pub 20030167167].

	As per Claim 10, Lindahl in view of Chandrasekaran discloses:
The method of claim 1, upon which claim 10 depends.
Lindahl in view of Chandrasekaran, however, fails to disclose generating a mood associated with the contextually-adjusted audio output, the mood comprising: the contextually-adjusted audio output; one or more audio cues; and one or more visual representations.
Gong teaches generating a mood associated with the contextually-adjusted audio output, the mood comprising: the contextually-adjusted audio output (Gong; p. 0047 - The affect generator 360 produces facial expressions and vocal expressions for the intelligent social agent 350 based on an indication from the dynamic adaptor module 336 as to what emotion the intelligent social agent 350 should express); one or more audio cues (Gong; p. 0047 - The affect generator 360 produces facial expressions and vocal expressions for the intelligent social agent 350 based on an indication from the dynamic adaptor module 336 as to what emotion the intelligent social agent 350 should express); and one or more visual representations (Gong; p. 0073 - The processor then applies an appeal rule to further analyze the basic user profile and to select a visual appearance for the intelligent social agent that may be appealing to the target user population).
Therefore, it would have been obvious to one of ordinary skill in the art to modify the method of Lindahl and Chandrasekaran to include generating a mood associated with the contextually-adjusted audio output, the mood comprising: the contextually-adjusted audio output; one or more audio cues; and one or more visual representations, as taught by Gong, because creating the visual appearance, voice, and personality of an intelligent social agent that is based on the personal and professional characteristics of the target user population may help the intelligent social agent be appealing to the target users (Gong; p. 0021).

	As per Claim 18, Lindahl in view of Chandrasekaran discloses:
The method of claim 1, wherein the speech adjustment includes pronunciation adjustments to be applied to the words of the synthesized speech (Lindahl; p. 0089 - the voice feedback may be altered to add different linguistic accents to the speech depending on the genre or some other contextual aspect of the media item).
Lindahl in view of Chandrasekaran, however, fails to disclose, but Gong teaches, wherein the speech adjustment includes emotion adjustments to be applied to the words of the synthesized speech (Gong; p. 0047 - The affect generator 360 produces facial expressions and vocal expressions for the intelligent social agent 350 based on an indication from the dynamic adaptor module 336 as to what emotion the intelligent social agent 350 should express).
Therefore, it would have been obvious to one of ordinary skill in the art to modify the method of Lindahl and Chandrasekaran to include wherein the speech adjustment includes emotion adjustments to be applied to the words of the synthesized speech, as taught by Gong, because creating the visual appearance, voice, and personality of an intelligent social agent that is based on the personal and professional characteristics of the target user population may help the intelligent social agent be appealing to the target users (Gong; p. 0021).
	
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
McDuff (US PG Pub 20200279553) discloses a conversational agent that is implemented as a voice-only agent or embodied with a face and that may match the speech and facial expressions of a user. Linguistic style-matching by the conversational agent may be implemented by identifying prosodic characteristics of the user's speech and synthesizing speech for the virtual agent with the same or similar characteristics. The facial expressions of the user can be identified and mimicked by the face of an embodied conversational agent. Utterances by the virtual agent may be based on a combination of predetermined scripted responses and open-ended responses generated by machine learning techniques. A conversational agent that aligns with the conversational style and facial expressions of the user may be perceived as more trustworthy, easier to understand, and create a more natural human-machine interaction. (McDuff; Abstract)
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Rodrigo A Chavez whose telephone number is (571)270-0139.  The examiner can normally be reached on Monday - Friday 9-6 ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Richemond Dorvil, can be reached at (571)272-7602.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/RODRIGO A CHAVEZ/Examiner, Art Unit 2658

/RICHEMOND DORVIL/Supervisory Patent Examiner, Art Unit 2658