DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
In response to the Office Action mailed 11/27/2020, applicant has submitted an amendment filed 2/1/2021.
Claims 1, 2, 6, 9-10, 14, and 17 have been amended.  Claims 7-8 and 15-16 have been cancelled.
Response to Arguments
	Applicant argues that “Osotio does not disclose collecting the audio data from the user as the first audio data every preset time interval in such a case” “in which the user is talking but does not dialogue with the electronic device” (Amendment, page 8) and similarly argues that Gong and Ueyama do not teach these limitations (Amendment, pages 9-10).
	Independent claims 1 and 9 are rejected based on Osotio, Ueyama, and new prior art references as discussed below (new rejection necessitated by Applicant’s amendment to include, among other things, “in a case that the user is talking but does not dialogue with the electronic device” which was not previously claimed).
	Some things worth noting are:
	“collecting audio data from the user as the first audio data every preset time interval” does not exclude receiving/capturing audio at a particular sampling rate (because sampling captures/”collects” audio every sampling period/”time interval”), and 
	“collecting audio data from the user as the first audio data when the audio data is detected” also does not exclude receiving/capturing audio at a particular sampling rate (because the samples are collected at a time when the audio data that forms the samples are detected by the microphone and then converted into the samples).
	Therefore, to the extent that Applicant is trying to claim where two different types of audio data collection are performed under two different conditions (i.e. when the user is dialoguing to the device and when the user is talking but is not dialoguing to the device), this is not required by the claim language because the claim language used to describe the audio data collection in the two cases can be two accurate descriptions of the same audio data collection.
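	For illustration only, the sampling interpretation noted above can be sketched as follows. This is a minimal sketch under stated assumptions: the 16 kHz rate and the function names are hypothetical and are not taken from any cited reference or from Applicant’s Specification. It shows only that sampling at a fixed frequency captures one sample per fixed period, so the same collection can be described both as occurring “every preset time interval” and as occurring when the audio is present at the microphone.

```python
from fractions import Fraction
from typing import List


def sample_times(sampling_hz: int, n_samples: int) -> List[Fraction]:
    """Return the capture time, in seconds, of each of the first n samples."""
    # The sampling period is fixed ("preset") once the sampling frequency is chosen.
    period = Fraction(1, sampling_hz)
    return [k * period for k in range(n_samples)]


def intervals(times: List[Fraction]) -> List[Fraction]:
    """Return the time elapsed between each pair of consecutive samples."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


# 16 kHz is an assumed, illustrative rate only.
times = sample_times(16000, 5)
gaps = intervals(times)
```

	Under these assumptions every entry in `gaps` equals 1/16000 of a second, i.e. one sample is collected per sampling period.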

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 4, 5, 9, 12, 13, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Osotio (US 2018/0061393), in view of Ogaz (US 2017/0221336), Cameron, and Ueyama.

As per Claims 1, 9, and 17, Osotio teaches (along with its apparatus and medium equivalents) A method for dialoguing based on a mood of a user, executed by an electronic device, comprising: collecting first audio data from the user; determining the mood of the user according to a feature of the first audio data; and dialoguing with the user using second audio data corresponding to the mood of the user (paragraphs 5-12, 46, 48, 53-57, 59, 61-63, 65-66, 70; Figures 1 and 4;
[all paragraphs and Figures are cited for each limitation with “key” paragraphs and Figures pertaining to each limitation identified below, i.e. all other paragraphs and Figures not specifically referenced for any particular limitation are eligible to provide context and additional support]
“A method for dialoguing based on a mood of a user, executed by an electronic device”: Paragraphs 53-54 describe comparing a determined user emotion to an evolution threshold and, if the evolution threshold is breached, the sound of the currently utilized AI voice is modified based on, among other things, user emotion, where modifying the sound of the voice includes modifying one or more of duration, pitch, volume, and timbre based on user emotion.  Paragraphs 48 and 55 describe where the AI voice “reflects or responds to a determined user emotion” and examples of where an AI voice is modified to have particular duration/volume/characteristics based on particular determined user emotions.  Paragraph 55 describes where the modified/evolved AI voice is used to respond to the user input.  Figure 4 also describes 
“comprising: collecting first audio data from the user;”: Paragraph 6 describes receiving a user input via a microphone, and paragraph 46 describes where input received via a microphone is spoken language input from a user, and paragraph 48 describes where user input may be spoken language/voice input.  Osotio thus describes receiving/”collecting” speech [“first audio data”] from a user [paragraphs 6, 46, 48, 57]
“determining the mood of the user according to a feature of the first audio data;”: Paragraphs 6-7 describe determining a user emotion/”mood” based on evaluating the user input received via a microphone, and paragraph 46 describes where user input received by a microphone is spoken language input, and paragraph 48 describes determining a user’s emotion/”mood” by evaluating the user input [in one embodiment, a 
“and dialoguing with the user using second audio data corresponding to the mood of the user”: Paragraphs 53-54 describe comparing a determined user emotion to an evolution threshold and, if the evolution threshold is breached, the sound of the currently utilized AI voice is modified based on, among other things, user emotion, where modifying the sound of the voice includes modifying one or more of duration, pitch, volume, and timbre based on user emotion.  Paragraphs 48 and 55 describe where the AI voice “reflects or responds to a determined user emotion” and examples of where an AI voice is modified to have particular duration/volume/characteristics based on particular determined user emotions.  Paragraph 55 describes where the modified/evolved AI voice is used to respond to the user input.  Figure 4 also describes where responding with a modified AI voice [414] leads to receiving a user input [402] which leads [through 404 and 406 and 408] to responding with a previously utilized AI voice [410] which leads to receiving user input.  Paragraph 61 more specifically recites where “the” previously utilized AI voice “that was utilize[d] to respond to the last user input” is provided to respond to the user input.  Osotio thus describes where a modified AI voice corresponding to a determined user emotion/”mood” is used to provide multiple responses to multiple user inputs [thus conducting a “dialog” “based on a mood of a user”].  Osotio thus describes “dialoguing with the user using second audio data corresponding to the mood of the user” [conducting a dialog with a user using modified AI voice characteristics that correspond to the determined user emotion, where the modified AI voice characteristics can be interpreted as “second audio data” which is not 
For claim 9: paragraphs 46, 56, 65-66, and 70 and Figure 1 describe a client-side computing device embodiment where a client device may be, among other things, a personal computer and other devices, and paragraphs 65-66 and 70 describe where the method of Figure 4 can be performed by executing programs using a processing unit.)
wherein the collecting the first audio data from the user comprises: in a case that the user is dialoguing with the electronic device, collecting audio data from the user as the first audio data…;  (paragraphs 5-12, 46, 48, 53-57, 59, 61-63; Figures 1 and 4;
Paragraph 6 describes receiving a user input via a microphone, and paragraph 46 describes where input received via a microphone is spoken language input from a user, and paragraph 48 describes where user input may be spoken language/voice input.  Figure 4 depicts a sequence where a user provides an input, a response is provided with a voice, and then a user provides another input [suggesting a “dialogue” involving the user].  Paragraph 46 describes AI voice system 100 implemented on a client computing device and where a client computing device is configured to receive spoken language input from a user, and Figure 4 and paragraphs 56-57 describe where 
These portions describe “wherein the collecting of the first audio data from the user comprises: in a case that the user is dialoguing with the electronic device, collecting audio data from the user as the first audio data” [collecting a spoken language input/”audio data”/“first audio data” from the user via a microphone when/while/”in a case that” the user is dialoguing with the computing device implementing AI voice system 100 and method 400]).
Osotio does not, but Ogaz suggests collecting first audio data from the user; determining the mood of the user according to a feature of the first audio data; and dialoguing with the user using second audio data corresponding to the mood of the user;… and in a case that the user is talking…, collecting audio data from the user as the first audio data… (Paragraphs 5, 8, 9, 12, 100, 101, 132, 141;
Paragraph 8 describes monitoring a user’s voice and providing feedback to the user, where a device monitors the voice of a user and includes a sensor.  Paragraph 12 describes where a sensor is a microphone and where collected data relates to a voice of a user [suggesting, together with paragraph 8, that monitoring a user’s voice involves collecting audio data with a microphone].  Paragraph 9 describes analyzing “data regarding the user’s voice” included in sensor data, analyzing the data to determine an emotional state of the user, and providing an alert.  Paragraph 5 describes where some people have difficulty controlling emotions or tone/pitch of their voice, may become angry without being aware that they are angry, or sound angry without knowledge of how their voice sounds to others [suggesting that the monitored user’s voice is not 
Ogaz suggests where the functions of Osotio’s computing device which conducts a dialog with a user additionally include “collecting first audio data from the user; determining the mood of the user according to a feature of the first audio data; and dialoguing with the user using second audio data corresponding to the mood of the user;…in a case that the user is talking…, collecting audio data from the user as the first audio data…” [where Osotio's system, in addition to determining emotion from a dialogue spoken input and responding using a voice based on the determined emotion, additionally monitors the user’s speech/voice collected by the microphone to determine an emotion based on voice parameters and generates an audible phrase alert based on the determined emotion, where the audible phrase alert is, for example, “calm down”/”remain calm” if the determined emotion is angry, where collecting the monitored user’s speech/voice, determining an emotion of the monitored speech/voice based on voice parameters, and providing the audible phrase alert is also a form of “collecting first 
In Applicant’s Specification, paragraphs 45-47 describe, as part of describing “dialoguing with the user using second audio data corresponding to the mood of the user”, “Alternatively, the second audio data determined by the electronic device may be independent from the content of the first audio data, that is, for example, when the electronic device determines that the mood of the user is ‘sad’ according to the first audio data, the electronic device determines the second audio data corresponding to ‘sad’ is ‘Don't be sad, tell you a joke’, so that the electronic device may inquire the user according to the mood of the user positively, instead of passively answering the user's question.” and paragraphs 39-40 describe where “the first audio data” is collected when “the user is talking all the time, but does not dialogue with the electronic device”.  One of the described embodiments is, therefore, where the first audio data is not speech that is part of a dialogue with the electronic device and where the second audio data is provided in response to speech that is not part of a dialogue with the electronic device.  Therefore, a described embodiment of Applicant’s claimed “dialoguing with the user using second audio data corresponding to the mood of the user” is providing a word output to the user in response to speech which is not directed to the system as part of a dialogue, and therefore Ogaz’s audible phrase alert based on the emotion determined from the monitored user speech/voice falls within the scope of “dialoguing with the user 
	Therefore, it would have been obvious to one of ordinary skill in the art at the time of effective filing to combine prior art elements according to known methods because the prior art included each element claimed, although not necessarily in a single prior art reference, with the only difference between the claimed invention and the prior art being the lack of actual combination of the elements in a single prior art reference (Osotio teaches a dialog function of a computing device which determines emotion from microphone-input speech based on voice characteristics and responds using a voice based on the determined emotion, and Ogaz teaches a monitoring function of a computing device which determines emotion from microphone-input speech based on voice parameters and responds with an audible phrase alert corresponding to the determined emotion).  One of ordinary skill in the art could have combined the elements as claimed by known methods (by adding the monitoring function in Ogaz to the functions performed by the computing device in Osotio), and in combination, each element merely performs the same function as it does separately (the monitoring function is a separate function from the dialog function).  The combination yields the predictable results of a system which receives a speech input from a user using a microphone, determines an emotion of the user based on voice characteristics of the speech input, determines user-related information based on meaning of words in the speech input, and responds to the speech input using a voice having characteristics corresponding to the emotion of the user (as per Osotio) where the system also monitors speech input, determines emotion based on voice parameters 
	Osotio, in view of Ogaz, does not, but Cameron suggests in a case that the user is talking but does not dialogue with the electronic device, collecting audio data from the user as the first audio data… (Paragraphs 441 and 447; 
	Ogaz teaches monitoring speech and suggests that the speech may not be directed to the device, but does not specifically teach that the monitored speech is speech spoken while the user is not talking-to/”dialoguing with” the electronic device.
	Cameron [paragraph 447] similarly describes identifying emotion/mood in speech audio, where the identified emotion/mood of the speech audio is used to cause a system to respond based on the identified mood/emotion [specifically selecting calming music in response to identifying angry or aggressive mood/emotion in the live speech audio].  Cameron [paragraph 447] also describes where the speech audio is “live speech audio, such as a live conversation in a meeting room or other formal or informal setting or a phone conversation” and paragraph 441 further describes speech audio such as “a real-time phone call” or “formal or informal meeting or conversation between people” [at least suggested to be speech of at least two people talking to each other and not to the system].  
	Cameron thus suggests where the monitored speech/voice in Ogaz’s monitoring function [added to the dialog function performed by the computing device of Osotio] is, more specifically speech/voice of the user conversing with another person and not with the computing device [i.e. such that “in a case that the user is talking but does not dialogue with the electronic device, collecting audio data from the user as the first audio 
	Therefore, it would have been obvious to one of ordinary skill in the art at the time of effective filing to perform a simple substitution of one type of input audio with another because the prior art teaches the claimed invention except for the substitution of input audio which is not necessarily speech that is not part of a dialog with an electronic device with input audio which is.  Cameron teaches that input audio which is speech that is not part of a dialog with an electronic device was known in the art.  One of ordinary skill in the art could have substituted one type of input audio with another to obtain the predictable results of a system which receives a speech input from a user using a microphone, determines an emotion of the user based on voice characteristics of the speech input, determines user-related information based on meaning of words in the speech input, and responds to the speech input using a voice having characteristics corresponding to the emotion of the user (as per Osotio) where the system also monitors speech input, determines emotion based on voice parameters of the monitored speech input, and provides an audible phrase alert based on the determined emotion (as per Ogaz) where the monitored speech input is live speech audio of a conversation between people (as per Cameron).
	Osotio, in view of Ogaz and Cameron, do not, but Ueyama suggests wherein the collecting the first audio data from the user comprises: in a case that the user is dialoguing with the electronic device, collecting audio data from the user as the first audio data when the audio data is detected; and in a case that the user is talking but does not dialogue with the electronic device, collecting audio data from the user as the first audio data every preset time interval (paragraph 48; “The microphone 100 inputs speech information [language such as Japanese, English, or the like] spoken by the user. The A/D converter 101 samples speech information supplied from the microphone 100 at a predetermined sampling frequency to convert it into digital speech information”, paragraph 52; “the A/D converter 101 receives speech information spoken by the user via the microphone 100, samples the speech information at a predetermined sampling frequency, and converts it into digital speech information. The digital speech information is supplied to the speech processing unit 102”, paragraph 86; “The speech processing unit 102 acoustically analyzes speech information supplied from the A/D converter 101 to obtain speech parameters [to be also referred to as feature parameters] in a predetermined format. The unit 102 then compression-codes the speech parameters. The compression-coded speech parameters are supplied to the network interface”, paragraph 53;
Osotio describes where spoken language input is received via a microphone [paragraph 46] and also suggests analyzing the spoken input for emotion-indicating voice characteristics and words of the spoken input [paragraph 48, where, to determine that a spoken request is for a list of funeral homes, the system is at least suggested to determine that the user spoke the words “funeral homes”] but does not specifically teach where audio data from the user is collected “every preset time interval”.  
Ogaz similarly describes where voice data of a user’s voice is collected via a microphone and is analyzed to determine emotional state [paragraphs 8-9, 12]

Sampled microphone speech information which is subsequently processed by a speech processing unit also suggests audio data which is “collect[ed]… from the user as the first audio data” “when the audio data is detected” [where the samples in the “audio data” are collected at a time when the “audio data” that forms the samples are detected by the microphone and then converted into the samples].  Paragraph 48 further describes a speech recognition system which commonly/conventionally processes speech immediately [as opposed to capturing audio data and then sampling it significantly later when the results of speech recognition may no longer be useful, see e.g. Osotio where a response to a user’s input is logically best provided very shortly after the user’s input]

Ueyama thus suggests where the spoken-language/speech input [“first audio data from the user”] in Osotio and the live speech audio [as per Ogaz and Cameron] which are analyzed/processed to determine, among other things, emotion [and which is suggested to be analyzed/processed to recognize spoken words], are “collected” by sampling the microphone signal at a sampling frequency, thereby “collecting” a sample of “the first audio data from the user” “every preset time interval” [i.e. collecting a sample of the spoken-language/speech input once every fixed sampling “period”/”time interval” which is “preset” by the sampling frequency] “when the audio data is detected” [the samples in the “audio data” are collected at a time when the “audio data” that forms the samples are detected by the microphone and then converted into the samples]
such that “in a case that the user is dialoguing with the electronic device, collecting audio data from the user as the first audio data when the audio data is detected; and in a case that the user is talking but does not dialogue with the electronic device, collecting audio data from the user as the first audio data every preset time interval”)


As per Claim 4, Osotio teaches wherein the dialoguing with the user using the second audio data corresponding to the mood of the user comprises: determining the second audio data corresponding to the mood of the user by searching a first mapping relationship, wherein the first mapping relationship comprises at least one correspondence between a mood and audio data; and dialoguing with the user using the second audio data (paragraphs 5-12, 46, 48, 53-57, 59, 61-63; Figures 1 and 4;
As discussed in the rejection of claim 1, Osotio teaches using a modified AI voice defined by audio characteristics [where the audio characteristics can be interpreted as “second audio data”] that corresponds to a determined user emotion/”mood” to conduct a dialog with the user [paragraphs 48, 53-55, 61-63, 6-7, 46; Figure 4]
The examples in paragraph 55 describe where particular AI voice characteristics correspond to particular determined emotions [including particular emotions listed in paragraph 48].  This implies [or at least suggests] where the system includes data indicating a mapping relationship that maps/links/relates an emotion/”mood” to corresponding AI voice characteristics [“audio data”] such that the determined emotion can be used by the system to determine what AI voice characteristics to use [by “searching”/analyzing the mapping relationship data to determine the AI voice characteristics corresponding to the determined emotion]
As discussed above, the AI voice characteristics [“second audio data”] corresponding to the determined emotion/”mood” [i.e. determined based on the mapping/correspondence data] are used to conduct a dialog with the user).
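For illustration only, the interpretation of “searching a first mapping relationship” applied above can be sketched as a simple lookup. This is a minimal sketch under stated assumptions: the mood labels, the characteristic names, the numeric values, and the function name are all hypothetical and are not disclosed by Osotio or by Applicant’s Specification. It shows only that data relating each mood to a set of voice characteristics can be searched with a determined mood to obtain the corresponding characteristics.

```python
from typing import Dict, Optional

# Hypothetical "first mapping relationship": each mood corresponds to a set of
# voice characteristics (interpreted in the rejection as "second audio data").
# All labels and values here are illustrative assumptions.
FIRST_MAPPING: Dict[str, Dict[str, float]] = {
    "sad": {"volume": 0.6, "pitch": 0.9, "duration": 1.2},
    "happy": {"volume": 1.0, "pitch": 1.1, "duration": 0.9},
}


def find_second_audio_data(mood: str) -> Optional[Dict[str, float]]:
    """Search the mapping for the voice characteristics tied to the mood."""
    # Returns None when the mapping holds no correspondence for the mood.
    return FIRST_MAPPING.get(mood)


characteristics = find_second_audio_data("sad")
```

In this sketch a mood with no entry yields `None`, which a caller could treat as a signal to keep the previously utilized voice; that fallback is likewise an assumption.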

As per Claim 12, its limitations are similar to those in claim 4 and so is rejected under similar rationale.

As per Claim 5, Osotio teaches wherein the dialoguing with the user using the second audio data corresponding to the mood of the user comprises: determining an audio data processing manner corresponding to the mood of the user by searching a second mapping relationship, wherein the second mapping relationship comprises at least one correspondence between a mood and an audio data processing manner; processing the second audio data with the audio data processing manner; and dialoguing with the user using the processed second audio data (paragraphs 5-12, 46, 48, 53-57, 59, 61-63; Figures 1 and 4;
As discussed in the rejection of claim 1, Osotio teaches using a modified AI voice defined by audio characteristics [where the audio characteristics can be interpreted as “second audio data”] that corresponds to a determined user emotion/”mood” to conduct a dialog with the user [paragraphs 48, 53-55, 61-63, 6-7, 46; Figure 4]
The actual generation of a voice response using a particular set of characteristics corresponding to a determined emotion can be interpreted as an “audio data processing manner”
The examples in paragraph 55 describe where particular voices with particular AI voice characteristics correspond to particular determined emotions [including particular emotions listed in paragraph 48].  This implies [or at least suggests] where the system includes data that maps/links/relates an emotion/”mood” to corresponding AI voice characteristics [“audio data”] such that the determined emotion can be used by the system to determine what AI voice characteristics to use to generate a particular type of voice.  Since determining particular voice characteristics [“second audio data”] leads to using those particular voice characteristics to generate a particular type of voice 
The collective set of implied [or at least suggested] data that associates/maps/links a determined emotion [“mood”] to a voice response generation process that generates a voice response using particular voice characteristics corresponding to the determined emotion [“audio data processing manner”] can be interpreted as “a second mapping relationship” that comprises a “correspondence between a mood and an audio data processing manner” which is “searched”/analyzed to “determin[e] an audio data processing manner corresponding to the mood of the user” [i.e. the collective set of data that maps a determined emotion to particular voice characteristics and which causes those particular voice characteristics to be used in a voice response generation process that uses those particular voice characteristics is “searched”/analyzed for what the system should do when the determined emotion is detected/identified/determined, and that “searching” leads to the system determining that it should use a voice generation process that uses those particular voice characteristics to produce a response]
Additionally/alternatively, the collective set of data discussed in the previous paragraph can be interpreted as the “second mapping relationship”, and “determining an audio data processing manner corresponding to the mood of the user by searching” 
The actual generation of voice using the voice response generation process that generates a voice response using particular voice characteristics corresponding to the determined emotion can be interpreted as “processing the second audio data with the audio data processing manner” [i.e. the voice response generation process that uses the particular voice characteristics processes the particular voice characteristics to generate a response with a particular type of voice having those particular voice characteristics]
Figure 4 and paragraph 61 describe where a previously provided voice used to respond to the last user input is provided again, which, as discussed in the rejection of claim 1, describes where a modified AI voice corresponding to a determined user emotion/”mood” is used to provide multiple responses to multiple user inputs [thus conducting a “dialog” “based on a mood of a user”].  Generating another response using the same determined-emotion-based voice characteristics can be interpreted as generating another response 

As per Claim 13, its limitations are similar to those in claim 5 and so is rejected under similar rationale.

Claims 2-3, 6, 10-11, and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Osotio, in view of Ogaz, Cameron, and Ueyama, as applied to Claims 1, 5, 9, and 13, above, and further in view of Gong (US 2003/0167167).

As per Claim 2, Osotio teaches wherein the determining of the mood of the user according to the feature of the first audio data comprises: determining the mood of the user according to an attribute of the first audio data (paragraphs 5-12, 46, 48, 53-57, 59, 61-63; Figures 1 and 4;
As discussed in the rejection of claim 1: 
Paragraphs 6-7 describe determining a user emotion/”mood” based on evaluating the user input received via a microphone, and paragraph 46 describes where user input received by a microphone is spoken language input, and paragraph 48 describes determining a user’s emotion/”mood” by evaluating the user input [in one 
Characteristics/”features” can also be interpreted as “attributes”, such that the voice characteristics of the spoken language input can each be interpreted as “an attribute of the first audio data” which is used to “determin[e] the mood of the user”)
Osotio, in view of Ogaz, Cameron, and Ueyama, do not, but Gong suggests wherein the determining of the mood of the user according to the feature of the first audio data comprises: determining the mood of the user according to an attribute of the first audio data; wherein the attribute comprises at least one of the following: amplitude, tone, timbre, frequency, and duration of the first audio data (paragraphs 29, 50-52;
Osotio teaches determining a user’s emotion from voice characteristics of a spoken input but does not specifically teach where the voice characteristics that are analyzed to determine emotion are one or more of the attributes/features listed in claim 2.
In Gong, paragraph 29 describes determining an “affective state” of the user from “voice feature data such as speech rate and amplitude” and a user’s verbal content [at least suggested to be words whose meaning indicates emotion], and paragraphs 50-52 describe determining affective state based on, among other things, vocal analysis data such as pitch range, volume, degree of breathiness, where louder and faster speech compared to the user’s basic pattern may indicate that a user is happy, and where quieter and slower speech than normal may indicate that a user is sad [at least 
Gong thus suggests where the voice characteristics of the spoken input which are analyzed to determine a user’s emotion in Osotio can be, among other things, amplitude [where volume/loudness of speech is typically determined based on amplitude, which is one of the listed types of attributes in claim 2])
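For illustration only, the relationship noted above between amplitude and volume/loudness can be sketched as follows. This is a minimal sketch under stated assumptions: root-mean-square amplitude is used here as one common loudness measure, and the baseline comparison, the 1.5 margin, and the function names are hypothetical. It does not reproduce the method of Gong or of any other cited reference.

```python
import math
from typing import Sequence


def rms_amplitude(samples: Sequence[float]) -> float:
    """Return the root-mean-square amplitude of a block of audio samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_louder_than_baseline(
    samples: Sequence[float], baseline_rms: float, margin: float = 1.5
) -> bool:
    """Report whether the block is louder than the speaker's usual level.

    The margin of 1.5 is an assumed, illustrative threshold only.
    """
    return rms_amplitude(samples) > baseline_rms * margin


level = rms_amplitude([3.0, -3.0, 3.0, -3.0])
```

A comparison of this kind is one way that speech “louder” than a user’s basic pattern could be detected from sample amplitude alone.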
	Therefore, it would have been obvious to one of ordinary skill in the art at the time of effective filing to perform a simple substitution of one type of feature/attribute used to determine emotion with another because the prior art teaches the claimed invention except for the substitution of a feature/attribute used to determine emotion which is not necessarily amplitude with a feature/attribute used to determine emotion which is.  Gong teaches that a feature/attribute used to determine emotion which is amplitude was known in the art.  One of ordinary skill in the art could have substituted one type of feature/attribute used to determine emotion with another to obtain the predictable results of a system which receives a speech input from a user using a microphone, determines an emotion of the user based on voice characteristics of the speech input, determines user-related information based on meaning of words in the speech input, and responds to the speech input using a voice having characteristics corresponding to the emotion of the user (as per Osotio) where the system also monitors speech input, determines emotion based on voice parameters of the monitored speech input, and provides an audible phrase alert based on the determined emotion (as per Ogaz) where the monitored speech input is live speech audio of a conversation 
	
As per Claim 3, Osotio teaches wherein the determining of the mood of the user according to the feature of the first audio data comprises: determining the mood of the user and suggests determining… of the user according to semantics of the first audio data (paragraphs 5-12, 46, 48, 53-57, 59, 61-63; Figures 1 and 4;
As discussed in the rejection of claim 1: 
Paragraphs 6-7 describe determining a user emotion/“mood” based on evaluating the user input received via a microphone, paragraph 46 describes where user input received by a microphone is spoken language input, and paragraph 48 describes determining a user’s emotion/“mood” by evaluating the user input [in one embodiment, a received spoken language/voice input] based on voice characteristics [“features”] of the spoken language input [“first audio data”] [paragraphs 6-7, 46, 48, 59].
Paragraph 48 further describes determining a “context of the user” based on the user input [which may be a spoken language/voice input] requesting a list of nearby funeral homes, which at least suggests that the meaning/semantics of the words of the user input are used to determine a “user context” [because a voice request’s substantive content is typically defined by the meaning of the words spoken in the voice request, and therefore a voice request is at least suggested to be a request for funeral homes based on the meaning of at least one word in the voice request that is directed 
Osotio thus suggests determining user context and user emotion based on semantics and voice characteristics [respectively] of the same spoken language input [i.e. based on the “first audio data”])
Osotio, in view of Ogaz, Cameron, and Ueyama, does not teach, but Gong suggests, wherein the determining of the mood of the user according to the feature of the first audio data comprises: determining the mood of the user according to semantics of the first audio data (paragraphs 29, 50-52, 54;
As discussed above, Osotio teaches determining the emotion/mood of a user based on voice characteristics/features of a spoken input, and also suggests where user context is determined based on semantics of the same spoken user input [e.g. a user’s request is suggested to be a request for funeral homes based on the meaning of words used to speak the request]
In Gong, paragraph 29 describes determining an “affective state” of the user from “voice feature data such as speech rate and amplitude” and a user’s verbal content [at least suggested to be words whose meaning indicates emotion], and paragraphs 50-52 and 54 describe determining affective state based on a combination of vocal analysis data [features/attributes] and verbal content [at least suggested to be based on semantics of words])

Therefore, it would have been obvious to one of ordinary skill in the art at the time of effective filing to perform a simple substitution of one type of emotion determination with another because the prior art teaches the claimed invention except for the substitution of emotion determination which does not necessarily use both semantics and voice features to determine emotion with emotion determination which does.  Gong teaches that emotion determination which uses both semantics and voice features to determine emotion was known in the art.  One of ordinary skill in the art could have substituted one type of emotion determination with another to obtain the predictable results of a system which receives a speech input from a user using a microphone, determines an emotion of the user based on voice characteristics of the speech input, determines user-related information based on meaning of words in the speech input, and responds to the speech input using a voice having characteristics corresponding to the emotion of the user (as per Osotio) where the system also monitors speech input, determines emotion based on voice parameters of the monitored speech input, and provides an audible phrase alert based on the determined emotion (as per Ogaz) where the monitored speech input is live speech audio of a conversation between people (as per Cameron) where the speech input and monitored speech input are received by sampling speech information collected using the microphone at a predetermined sampling frequency (as per Ueyama) where the emotion of the user is 

As per Claims 10-11, their limitations are similar to those in claims 2-3 and so are rejected under similar rationale.

As per Claim 6, Osotio teaches wherein before the determining of the audio data processing manner corresponding to the mood of the user by searching the second mapping relationship, further comprising: determining the second audio data according to… the first audio data and suggests determining… according to semantics of the first audio data (paragraphs 5-12, 46, 48, 53-57, 59, 61-63; Figures 1 and 4;
As discussed in the rejection of claim 5, Osotio teaches [or at least suggests] “determining… the audio data processing manner corresponding to the mood of the user by searching the second mapping relationship” [i.e. the collective set of data that maps a determined emotion to particular voice characteristics and which causes those particular voice characteristics to be used in a voice response generation process that uses those particular voice characteristics is “searched”/analyzed for what the system should do when the determined emotion is detected/identified/determined, and that “searching” leads to the system determining that it should use a voice generation process that uses those particular voice characteristics to produce a response].  
Additionally/alternatively, the collective set of data discussed in the previous paragraph can be interpreted as the “second mapping relationship”, and “determining 
Based on the alternate/additional interpretation discussed in the previous paragraph:
Logically, before the system can determine that it should use a voice generation process that uses those particular determined-emotion-based voice characteristics to produce a response, the system determines what those particular voice characteristics are, based on the determined user emotion that is determined based on voice characteristics of the spoken language input [i.e. “before the determining of the audio data processing manner corresponding to the mood of the user by searching the second mapping relationship”, the system “determines the second audio data according to… the first audio data”].  
As discussed in the rejection of claim 1: 

Paragraph 48 further describes determining a “context of the user” based on the user input [which may be a spoken language/voice input] requesting a list of nearby funeral homes, which at least suggests that the meaning/semantics of the words of the user input are used to determine a “user context” [because a voice request’s substantive content is typically defined by the meaning of the words spoken in the voice request, and therefore a voice request is at least suggested to be a request for funeral homes based on the meaning of at least one word in the voice request that is directed to funeral homes], and also determining user emotion based on “voice characteristics of the spoken language input of above” and “based on the same user input above” [at least suggested to be the request for a list of funeral homes, where a user can be sad or angry because someone close to the user has passed away].  
Paragraphs 48 and 55 further describe where a determined emotion [determined based on voice characteristics of the spoken language input] leads to using particular AI voice characteristics [“second audio data”] to respond to the user.
Osotio thus suggests determining user context and user emotion based on semantics and voice characteristics [respectively] of the same spoken language input 
Osotio, in view of Ogaz, Cameron, and Ueyama, does not teach, but Gong suggests, wherein before the determining of the audio data processing manner corresponding to the mood of the user by searching the second mapping relationship, further comprising: determining the second audio data according to the semantics of the first audio data (paragraphs 29, 50-52, 54;
As discussed above, Osotio teaches determining the emotion/mood of a user based on voice characteristics/features of a spoken input, and also suggests where user context is determined based on semantics of the same spoken user input [e.g. a user’s request is suggested to be a request for funeral homes based on the meaning of words used to speak the request] and where “second audio data”/AI-voice-characteristics-for-a-response-voice are determined based on a user emotion determined from voice characteristics of the spoken language input [i.e. “according to… the first audio data”]
In Gong, paragraph 29 describes determining an “affective state” of the user from “voice feature data such as speech rate and amplitude” and a user’s verbal content [at least suggested to be words whose meaning indicates emotion], and paragraphs 50-52 and 54 describe determining affective state based on a combination of vocal analysis data [features/attributes] and verbal content [at least suggested to be based on semantics of words])

Therefore, it would have been obvious to one of ordinary skill in the art at the time of effective filing to perform a simple substitution of one type of emotion determination with another because the prior art teaches the claimed invention except for the substitution of emotion determination which does not necessarily use both semantics and voice features to determine emotion with emotion determination which does.  Gong teaches that emotion determination which uses both semantics and voice features to determine emotion was known in the art.  One of ordinary skill in the art could have substituted one type of emotion determination with another to obtain the predictable results of a system which receives a speech input from a user using a microphone, determines an emotion of the user based on voice characteristics of the speech input, determines user-related information based on meaning of words in the speech input, and responds to the speech input using a voice having characteristics corresponding to the emotion of the user (as per Osotio) where the system also monitors speech input, determines emotion based on voice parameters of the monitored speech input, and provides an audible phrase alert based on the determined emotion (as per Ogaz) where the monitored speech input is live speech audio of a conversation between people (as per Cameron) where the speech input and monitored speech input 

As per Claim 14, its limitations are similar to those in claim 6 and so it is rejected under similar rationale.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
2014/0214421 teaches “Dialog systems are continually evolving to handle less constrained spoken input, interpret user intent, and engage in natural dialog to accomplish complex tasks. Addressee detection is used in spoken dialog systems to detect whether or not user speech is directed toward the system. In single-user human-computer (H-C) contexts, the alternate addressee may be the user (self-talk), or others in the environment who are not interacting with the system. When multiple users interact jointly with a system (H-H-C dialog), addressee detection becomes even more of a challenge. Human-human (H-H) conversation about the shared task may contain the same keywords a system would listen for. When system-addressed utterances contain more than only commands or keywords, word sequences can begin to look more like those in H-H speech. Other cues such as gaze may also become less reliable. For example, when the users are looking at a system display while talking with each other” 
	2015/0179168 teaches “Addressee detection attempts to differentiate between utterances addressed to another human or to the computer allowing users to speak to the computer naturally without any intervention, such as requiring the user to speak an addressing term (e.g., "computer") or make an addressing gesture (e.g., pushing a button or looking at a camera) in conjunction with making a request” (paragraph 3) 
2007/0192097 teaches “The ability to determine the affect of a person can be helpful or even very important in certain situations. For example, the ability to determine an angry state of a driver could be used to reduce the probability of an accident that is caused by the direct or side affects of the anger, such as by alerting the driver to calm down. One aspect of human behavior that could be useful to determine the affect of a person is a change of speech characteristics that occurs when the person's affect changes. However, the benefits available from determining a person's affect are difficult to achieve using current methods of detecting a persons affect from the person's speech, because the methods use static methods (i.e, statistics) of speech signal characteristics, which are difficult to be implemented in real-time and are not very reliable” (paragraph 3).  Paragraph 15 describes where, when a driver advocacy processor for a vehicle is an electronic device and an affect is “anger”, the driver advocacy processor may be programmed to provide an audible message to the driver of a vehicle that is intended to reduce the probability of an accident.  This reference does not appear to specifically describe that the speech being analyzed is speech that is not 
5647834 teaches monitoring an emotional reaction of a child and “IF the child is angry, this will be reflected in his voice and the toy can respond ‘calm down, don’t be angry!’” (col. 2, lines 22-29)
2017/0047063 teaches a control unit performing “speech recognition on an audio signal collected by the microphone 12 using the speech recognition unit 10a to determine whether or not there is the user's speech directed to the system” (paragraph 79, i.e. where speech recognition is used to determine whether the user is speaking to the system)
9293134 teaches “In certain implementations, the base device 102 may provide either the base device audio signal 210 or the remote device audio signal 212 to the speech service 108 at any given time, depending on whether the user is directing speech to the base device 102 or the handheld device 104. For example, the base device audio signal 210 may be provided to the speech service 108 after a preconfigured keyword or wake word is detected by the base device 102 as having been spoken by the user 106. The remote device audio signal 212 may be provided to the speech service 108 during times when the PTT button 206 is pressed by the user 106. When providing one or the other of the base device audio signal 210 or the remote device audio signal 212, the base device 102 may provide an indication to the speech service 108 regarding which of the audio signals is being provided” (suggesting that the system can determine whether a user is directing speech at a particular device)
2017/0372695 teaches “When the button for indicating an instruction for starting voice recognition is not provided, for example, the voice recognition unit 20 constantly receives the voice collected by the microphone 6, and detects a speaking period corresponding to the content spoken by the user B, to thereby recognize the voice in the speaking period” (paragraph 67)
2015/0154964 teaches a machine listening component that continuously captures voice commands while a human speaker is talking on another application for human listening (paragraph 24).
2018/0047395 teaches where operations including capturing ambient sounds, detecting presence of speech, determining a source of speech, determining whether there is a change in the source, can be performed continuously or periodically at a sampling frequency when the system and the audio sensors are turned on (paragraph 182; Figure 14A)

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ERIC YEN whose telephone number is (571)272-4249.  The examiner can normally be reached on M-F 9:00AM-5:30PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, RICHEMOND DORVIL, can be reached on (571)272-7602.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access 






EY 3/1/2021
/ERIC YEN/Primary Examiner, Art Unit 2658