DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
The information disclosure statement(s) (IDS) submitted on June 01, 2020 and April 28, 2021 is/are being considered by the examiner.

Claim Objections
Claim 14 is objected to because of the following informalities:
In the preamble of claim 14, the phrase “he system” should read “The system.” 
Appropriate correction is required.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. §112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. §112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 2 and 3 are rejected under 35 U.S.C. §112(b) or 35 U.S.C. §112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. §112, the applicant), regards as the invention.
At lines 5-6 of claim 2 and at numerous other points throughout the claim, applicant indicates “a third NLU hypothesis and a fourth NLU hypothesis corresponding to the third ASR 
At lines 5-6 of claim 3 and at numerous other points throughout the claim, applicant recites “a third NLU hypothesis corresponding to the third ASR output data.” However, because claim 1 already recites “a third NLU hypothesis corresponding to the second ASR output data,” this element of claim 3 is unclear.
Appropriate correction is required.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. §102 and 103 (or as subject to pre-AIA 35 U.S.C. §102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. §103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1-2 and 4 are rejected under 35 U.S.C. §103 as being unpatentable over Divakaran (U.S. Pat. App. Pub. No. 2017/0160813, hereinafter Divakaran) in view of Sinha (U.S. Pat. App. Pub. No. 2014/0365226, hereinafter Sinha), Khan (U.S. Pat. App. Pub. No. 2016/0019915, hereinafter Khan), and Aleksic (U.S. Pat. App. Pub. No. 2017/0270929, hereinafter Aleksic).

Regarding claim 1, Divakaran discloses A computer-implemented method comprising (The systems and methods for speech recognition described with reference to the virtual personal assistant.; Divakaran, ¶¶ [0062]) : receiving first audio data representing a first utterance (In embodiments describing a request for a prescription, the system discloses “the person tells the system, ‘I’d like to refill a prescription.’ {a first utterance}” where “The system detects {receiving…} that the person is speaking slowly and hesitantly. {first audio data representing the first utterance}”; Divakaran, ¶¶ [0063]); associating the first audio data with a first dialogue session identifier (Though not expressly indicated as having an identifier, the example in FIG. 3 is a dialogue session and all dialogue represented in FIG. 3 is understood by the system to be part of the same dialogue session (as indicated by the description of each of the interactions as “dialog sessions” in changing the “dialog approach.” Thus the first audio, represented at element 310 is associated with the first dialogue session identifier.; Divakaran, ¶¶ [0063], [0297], FIG. 3); determining, using automatic speech recognition (ASR) processing, first ASR output data corresponding to the first audio data (“The automatic speech recognition 412 component can identify natural language in audio input” such as in the first utterance described above “and provide the identified words as text {first ASR output} to the rest of the system 400.”; Divakaran, ¶¶ [0063], [0056], [0075]); determining, using natural language understanding (NLU) processing, a first NLU hypothesis corresponding to the first ASR output data (In further embodiments, the system can use “a natural language recognition system {determine, using natural language understanding (NLU) processing}... 
to understand what the person wants {a first NLU hypothesis corresponding to the first ASR output}” which corresponds to the “natural language in the audio input {the first ASR output data}”; Divakaran, ¶¶ [0063], [0056], [0075]), the first NLU hypothesis associated with a first confidence score (“The interpreter’s 1016 can produce an output what the interpreter 1016 determined, with a statistically high degree of confidence {a first confidence score}, most closely matched the person’s actual Divakaran, ¶¶ [0134])… performing a first action corresponding to the first NLU hypothesis (“Based on the conclusions that the system has made about the speaker’s emotional or cognitive state {corresponding to the first NLU hypothesis}, at step 312, the system determines to change its dialog approach by asking direct yes/no questions {performing a first action}, and responds, ‘Sure, happy to help you with that. I’ll need to ask you some questions first.’ “; Divakaran, ¶¶ [0063]); receiving second audio data representing a second utterance (“At step 330, the person responds, “I think so, I found something here.. but.. <sigh>.’ “; Divakaran, ¶¶ [0068]); associating the second audio data with the first dialogue session identifier (The system changes approach in the dialog session (e.g., “...perhaps a different approach is needed”) where change in approach is responsive to changes in the dialog. Therefore, the system associates the second audio data (which the system understands as indicating the need for a change in approach) with the first dialog session. Further evidence can be found in FIG. 3, which displays a continuing dialog between the system and the user.; Divakaran, ¶¶ [0068], FIG. 
3); determining second ASR output data corresponding to the second audio data (“The automatic speech recognition 412 component can identify natural language in audio input” such as in the second utterance described above “and provide the identified words as text {second ASR output} to the rest of the system 400.”; Divakaran, ¶¶ [0068], [0056], [0075]); determining a third NLU hypothesis corresponding to the second ASR output data (In further embodiments, the system can use “a natural language recognition system {determine, using natural language understanding (NLU) processing}... to understand what the person wants {a third NLU hypothesis corresponding to the second ASR output}” which corresponds to the “natural language in the audio input {the second ASR output data}”; Divakaran, ¶¶ [0068], [0056], [0075])… receiving sentiment data indicating a sentiment based on acoustic characteristics of the second audio data (“From this reply {based on… the second audio data}, the system may detect audible {thus, acoustic characteristics} frustration” where frustration indicates a sentiment which is derived from the reply.; Divakaran, ¶¶ [0068]); determining that the sentiment data indicates frustration (“the system may detect audible frustration,” thus the sentiment data indicates frustration.; Divakaran, ¶¶ [0068]).
However, Divakaran fails to expressly recite determining, using natural language understanding (NLU) processing, a second NLU hypothesis corresponding to the first ASR output data, the second NLU hypothesis associated with a second confidence score; associating at least the first ASR output data, the first NLU hypothesis and the second NLU hypothesis with the first dialogue session identifier… determining a fourth NLU hypothesis corresponding to the second ASR output data; associating at least the second ASR output data, the third NLU hypothesis and the fourth NLU hypothesis with the first dialogue session identifier; determining, using the first dialogue session identifier, that the second utterance is a repeat of the first utterance based at least in part on a comparison of the first NLU hypothesis and the third NLU hypothesis …determining, using the first dialogue session identifier, that the second NLU hypothesis corresponds to the fourth NLU hypothesis; in response to determining that the sentiment data indicates frustration... determining output text data including a representation of a second action corresponding to the second NLU hypothesis… determining output audio data corresponding to the output text data using text-to-speech (TTS) processing and sending the output audio data to a device. 
Sinha teaches “systems and methods for detecting errors in speech interactions with a digital assistant.” (Sinha, ¶ [0002]). Regarding claim 1, Sinha teaches determining, using natural language understanding (NLU) processing, a second NLU hypothesis corresponding to the first ASR output data (“The natural language processing module 332 (“natural language processor”) of the digital assistant takes the sequence of words or tokens (“token sequence”) generated by the speech-to-text processing module 330, and attempts to associate the token sequence with one or more “actionable intents” recognized by the digital assistant,” where one or more actionable intents includes a first actionable intent {a first NLU hypothesis} and a second actionable intent {a second NLU hypothesis}; Sinha, ¶¶ [0073]), the second NLU hypothesis associated with a second confidence score (“the natural language processing module 332 will select one of the actionable intents as the task that the user intended the digital assistant to perform... 
In some implementations, the domain having the highest confidence value (e.g., based on the relative importance of its various triggered nodes) is selected.” As the actionable intent is selected based on the “highest confidence value,” each of the actionable intents {e.g., the second NLU hypothesis} have a confidence value {a second confidence value}.; Sinha, ¶¶ [0083]); associating at least the first ASR output data, the first NLU hypothesis and the second NLU hypothesis with the first dialogue session identifier (The first, second, and third inputs are indicated as being “received within the same dialog session {first dialog session identifier} with the digital assistant.” Thus, the first input {first audio data}, as well as the “actionable intents” {the first NLU hypothesis and the second NLU hypothesis} and the “speech-to-text processing” {first ASR output data} of the first input, are associated with the dialog session {first dialog session identifier}; Sinha, ¶¶ [0127])… determining a fourth NLU hypothesis corresponding to the second ASR output data (“The natural language processing module 332 (“natural language processor”) of the digital assistant takes the sequence of words or tokens (“token sequence”) generated by the speech-to-text processing module 330, and attempts to associate the token sequence with one or more “actionable intents” recognized by the digital assistant,” where one or more actionable intents for the second ASR output includes a third actionable intent {a third NLU hypothesis} and a second actionable intent {a fourth NLU hypothesis}; Sinha, ¶¶ [0073]); associating at least the second ASR output data, the third NLU hypothesis and the fourth NLU hypothesis with the first dialogue session identifier (The first, second, and third inputs are indicated as being “received within the same dialog session {first dialog session identifier} with the digital assistant.” As well, the NLU hypotheses are acted upon within the first dialogue session. 
Thus, the second input {second audio data}, as well as the “actionable intents” {the third NLU hypothesis and the fourth NLU hypothesis} and the “speech-to-text processing” {second ASR output data} of the second input, are associated with the dialog Sinha, ¶¶ [0127]); determining, using the first dialogue session identifier, that the second utterance is a repeat of the first utterance based at least in part on a comparison of the first NLU hypothesis and the third NLU hypothesis (“Users may also indicate dissatisfaction by repeating the same speech input multiple times {determine... that the second utterance is a repeat of the first utterance} in an effort to make the digital assistant understand his or her words or intent. Accordingly, detecting the same input from a user multiple times within a short period of time and/or within the same dialog with the digital assistant {determining, using the first dialogue session identifier} can indicate that the user is not being properly understood, or that the digital assistant is not properly identifying the user’s intent from the speech input” even when “the words in the first and second speech input may be somewhat different from one another {comparison of the first NLU hypothesis and the second NLU hypothesis}”; Sinha, ¶¶ [0126])…determining, using the first dialogue session identifier, that the second NLU hypothesis corresponds to the fourth NLU hypothesis (Using the fact that the first, second, and third inputs are part of the same dialog session {determining, using the first dialogue session identifier}, the system determines at least two intents for the second input, being the natural language understanding of the second input and the second input being used to indicate that the selected actionable intent of first input was incorrect (i.e., “determining that the second speech input indicates dissatisfaction with the at least one action” from the first speech input) {where either may be the third and fourth NLU 
hypotheses}. Correspondingly, the first input has at least two intents, being the selected actionable intent {the first NLU hypothesis} and actionable intent corresponding to further information is required {the second NLU hypothesis}. In the case of dissatisfaction, the indication of dissatisfaction {the fourth NLU hypothesis} causes the system to provide a prompt requesting confirmation of the error and further explanation.; Sinha, ¶¶ [0120], [0135]); in response to determining that the sentiment data indicates frustration... determining output text data including a representation of a second action corresponding to the second NLU hypothesis… (“The digital assistant performs at least one  Sinha, ¶¶ [0116]-[0117])... determining output audio data corresponding to the output text data using text-to-speech (TTS) processing (“speech synthesis module 265 synthesizes speech outputs based on text provided by the digital assistant. For example, the digital assistant generates text to provide as an output to a user, and the speech synthesis module 265 converts the text to an audible speech output.”; Sinha, ¶¶ [0049]); and sending the output audio data to a device. (“In some implementations, instead of (or in addition to) using the local speech synthesis module 265, speech synthesis is performed on a remote device (e.g., the server system 108), and the synthesized speech is sent to the user device 104 for output to the user.”; Sinha, ¶¶ [0049]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the virtual personal assistant of Divakaran to incorporate the teachings of Sinha to include determining, using natural language understanding (NLU) processing, a second NLU hypothesis corresponding to the first ASR output data, the second NLU hypothesis associated with a second confidence score; associating at least the first ASR output data, the first NLU hypothesis and the second NLU hypothesis with the first dialogue session identifier… determining a fourth NLU hypothesis corresponding to the second ASR output data; associating at least the second ASR output data, the third NLU hypothesis and the fourth NLU hypothesis with the first dialogue session identifier; determining, using the first dialogue session identifier, that the second utterance is a repeat of the first utterance based at least in part on a comparison of the first NLU hypothesis and the third NLU hypothesis …determining, using the first dialogue session identifier, that the second NLU hypothesis corresponds to the fourth NLU hypothesis; in response to determining that the sentiment data indicates frustration... determining output text data including a representation of a second action corresponding to the second NLU hypothesis… determining output audio data corresponding to the output text data Sinha can “identify particular instances where errors have occurred, so that the source of the errors can be identified and addressed,” which is helpful for “improv[ing] the quality of digital assistants,” as recognized by Sinha. (Sinha, ¶ [0005]).
However, Divakaran and Sinha fail to expressly recite determining that the second confidence score satisfies a threshold value; in response to determining that the sentiment data indicates frustration and that the second confidence score satisfies the threshold value, determining output text data including a representation of a second action corresponding to the second NLU hypothesis.
Khan teaches systems and methods for “recognizing emotion in audio signals.” (Khan, ¶ [0016]). Regarding claim 1, Khan teaches determining that the second confidence score satisfies a threshold value (“Confidence scores for one or more defined emotions are computed, as indicated at block 414, based upon the computed audio fingerprint.” where defined emotions can include “emotion of ‘anger’ “; Khan, ¶¶ [0054]); in response to determining that the sentiment data indicates frustration and that the second confidence score satisfies the threshold value, determining output text data including a representation of a second action corresponding to the second NLU hypothesis (“The action initiating component 232 is configured to initiate any of a number of different actions in response to associating one or more emotions with an audio signal.” where “The matching component 230 is configured to associate one or more emotions with the audio signal based upon the computed confidence scores and whether or not one or more confidence score thresholds has been met or exceeded.”; Khan, ¶¶ [0043], [0044]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the virtual personal assistant of Divakaran as modified by the speech error detection systems of Sinha to incorporate the teachings of Khan to include determining that the second confidence score satisfies a threshold value; in response to determining that the sentiment data indicates frustration and that the second confidence score Khan. (Khan, ¶ [0005]). However, Divakaran, Sinha, and Khan fail to expressly recite wherein a first dialogue session includes a first dialogue session identifier.
Aleksic teaches systems and methods for “determining dialog states that correspond to voice inputs and for biasing a language model based on the determined dialog states.” (Aleksic, ¶ [0003]). Regarding claim 1, Aleksic teaches wherein a first dialogue session includes a first dialogue session identifier (The system can include a “dialog session identifier [which] is data that indicates a particular dialog session associated with the request 212. The dialog session identifier may be used by the speech recognizer 202 to correlate a series of transcription requests {first audio data} that relate to a same dialog session. {associating... with a first dialogue session identifier}”; Aleksic, ¶¶ [0053]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the virtual personal assistant of Divakaran, as modified by the speech error detection systems of Sinha and the emotion recognition systems of Khan, to incorporate the teachings of Aleksic to include wherein a first dialogue session includes a first dialogue session identifier. Use of dialogue states can allow a “speech recognizer [to] generate more accurate transcriptions of voice inputs,” as recognized by Aleksic. (Aleksic, ¶ [0024]). 

Regarding claim 2, the rejection of claim 1 is incorporated. Divakaran, Sinha, Khan, and Aleksic disclose all of the elements of the current invention as stated above. However, Divakaran fails to expressly recite further comprising: receiving third audio data representing a third utterance; associating the third audio data with a second dialogue session identifier; determining, 
The relevance of Sinha is described above with relation to claim 1. Regarding claim 2, Sinha teaches further comprising: receiving third audio data representing a third utterance (The system “receives, from a user, a speech input containing a request (402).”; Sinha, ¶¶ [0114], [0110]); associating the third audio data with a second dialogue session identifier (“the digital assistant initiates a new information provision process upon receipt of each new user input, and each existing information provision process terminates either (1) when all of the sub-responses of a complete response to the user request have been provided to the user or (2) when the digital assistant provides a request for additional information or clarification to the user regarding a previous user.” Therefore, the receipt of the request 402, occurring after a request for additional information or a clarification, can be associated with a new dialog session {second dialogue session}; Sinha, ¶¶ [0110]); determining, using ASR processing, third ASR output data corresponding to the third audio data (The system generates a “sequence of words or Sinha, ¶¶ [0073]); determining, using NLU processing, a third NLU hypothesis and a fourth NLU hypothesis corresponding to the third ASR output data, the third NLU hypothesis associated with a third confidence score and the fourth NLU hypothesis associated with a fourth confidence score (“The natural language processing module 332 (“natural language processor”) of the digital assistant takes the sequence of words or tokens (“token sequence”) generated by the speech-to-text processing module 330, and attempts to associate the token sequence with one or more “actionable intents” recognized by the digital assistant,” where one or more actionable intents includes a first actionable intent {a third NLU hypothesis} and a second actionable intent {a fourth NLU hypothesis}; Sinha, ¶¶ [0073]); performing a third action corresponding to the third NLU hypothesis (“The 
digital assistant performs at least one action in furtherance of satisfying the request (404).”; Sinha, ¶¶ [0116]); receiving fourth audio data representing a fourth utterance (“The digital assistant detects a user interaction (406)” where “ detecting the user interaction comprises detecting a second speech input {fourth audio data representing a fourth utterance}”; Sinha, ¶¶ [0119]-[0120]); associating the fourth audio data with the second dialogue session identifier (“The digital assistant determines whether the user interaction is indicative of a problem in the performing of the at least one action (407),” as the one action was derived from the third audio data, the fourth audio data part of the same dialogue session {second dialogue session} as the third audio data; Sinha, ¶¶ [0119]); determining, using ASR, fourth ASR output data corresponding to the fourth audio data (The system generates a “sequence of words or tokens” from the speech input using “the speech-to-text processing module 330”; Sinha, ¶¶ [0073]); determining, using the second dialogue session identifier, that the fourth utterance is a repeat of the third utterance (“In some implementations, detecting the user interaction comprises detecting a second speech input, and determining whether the user interaction is indicative of a problem comprises determining that the second speech input indicates dissatisfaction with the at least one action (408).” where “Users may also Sinha, ¶¶ [0120], [0126]); receiving second sentiment data corresponding to the fourth audio data (The system receives sentiment data as user dissatisfaction, where dissatisfaction is determined from the second speech input {fourth audio data}.; Sinha, ¶¶ [0126]), the second sentiment data indicating a sentiment based on acoustic characteristics of the fourth audio data (“determining whether the second speech input indicates dissatisfaction includes determining a volume of the second speech input”; Sinha, ¶¶ [0122]); determining that 
the second sentiment data indicates frustration (Indicates that “users may raise their voices in the second speech input... out of frustration”; Sinha, ¶¶ [0122]); determining, using the second dialogue session identifier, that the fourth NLU hypothesis corresponds to the fourth ASR output data (The system indicates that the second speech input is a repeat of the first input. Thus the ASR output of the fourth ASR output data is the same as the third ASR output data and corresponds to the fourth NLU hypothesis of the third ASR output data.; Sinha, ¶¶ [0126]); determining second output audio data corresponding to the second output text data using TTS processing (“speech synthesis module 265 synthesizes speech outputs based on text provided by the digital assistant. For example, the digital assistant generates text to provide as an output to a user, and the speech synthesis module 265 converts the text to an audible speech output.”; Sinha, ¶¶ [0049]); and sending the second output audio data to the device (“In some implementations, instead of (or in addition to) using the local speech synthesis module 265, speech synthesis is performed on a remote device (e.g., the server system 108), and the synthesized speech is sent to the user device 104 for output to the user.”; Sinha, ¶¶ [0049]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the virtual personal assistant of Divakaran, as modified by the speech error detection systems of Sinha, the emotion recognition systems of Khan, and by the dialogue state determination of Aleksic, to further incorporate the teachings of Sinha to include further comprising: receiving third audio data representing a third utterance; associating the third audio data with a second dialogue session identifier; determining, using ASR processing, third ASR output data corresponding to the third audio data; determining, using NLU processing, a third NLU hypothesis and a fourth NLU hypothesis corresponding to the third ASR output data, the third NLU hypothesis associated with a third confidence score and the fourth NLU hypothesis associated with a fourth confidence score; performing a third action corresponding to the third NLU hypothesis; receiving fourth audio data representing a fourth utterance; associating the fourth audio data with the second dialogue session identifier; determining, using ASR, fourth ASR output data corresponding to the fourth audio data; determining, using the second dialogue session identifier, that the fourth utterance is a repeat of the third utterance; receiving second sentiment data corresponding to the fourth audio data, the second sentiment data indicating a sentiment based on acoustic characteristics of the fourth audio data; determining that the second sentiment data indicates frustration; determining, using the second dialogue session identifier, that the fourth NLU hypothesis corresponds to the fourth ASR output data; determining second output audio data corresponding to the second output text data using TTS processing; and sending the second output audio data to the device. 
The systems and methods described in Sinha can “identify particular instances where errors have occurred, so that the source of the errors can be identified and addressed,” which is helpful for “improv[ing] the quality of digital assistants,” as recognized by Sinha. (Sinha, ¶ [0005]). However, Divakaran and Sinha fail to expressly recite determining that the fourth confidence score does not satisfy the threshold value in response to determining that the second sentiment data indicates frustration, [and] determining second output text data including a confirmation to perform the third action corresponding to the third NLU hypothesis.
The relevance of Khan is described above with relation to claim 1. Regarding claim 2, Khan teaches determining that the fourth confidence score does not satisfy the threshold value (“The matching component 230 is configured to associate one or more emotions with the audio signal based upon the computed confidence scores and whether or not one or more confidence score thresholds has been met or exceeded,” thus the confidence score confirms the presence of the emotion, and where the confidence score can either satisfy a threshold or fail to satisfy said threshold (thus, confirming or failing to confirm the hypothesis of the emotion).; Khan, ¶¶ [0043]); in response to determining that the second sentiment data indicates frustration (The system “may associate an emotion of “anger” {frustration} with his/her tone in dictating” where associating an emotion includes “comput[ing] an audio fingerprint from the received audio signal {the second sentiment data indicates...}” where the audio fingerprint indicates the anger {frustration} and the system associates the anger {frustration} “with the audio signal based upon the computed confidence scores and whether or not one or more confidence score thresholds has been met or exceeded.”; Khan, ¶¶ [0040], [0043]-[0044]), determining second output text data including a confirmation to perform the third action corresponding to the third NLU hypothesis (When an audio fingerprint indicates anger and the confidence score threshold is not met, the system will not “prompt the speaker with an “are you sure?” type of message” and continue performing the desired function; Khan, ¶¶ [0044]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the virtual personal assistant of Divakaran, as modified by the speech error detection systems of Sinha, the emotion recognition systems of Khan, and by the dialogue state determination of Aleksic, to further incorporate the teachings of Khan to include determining that the fourth confidence score does not satisfy the threshold value in response to determining that the second sentiment data indicates frustration, [and] determining second output text data including a confirmation to perform the third action corresponding to the third NLU hypothesis. “Understanding of the emotional state of a speaker… provides for improved accuracy and reduced error rate, thus resulting in improved efficiency in carrying out processes reliant on audio detection and interpretation,” as recognized by Khan. (Khan, ¶ [0005]).

Regarding claim 4, the rejection of claim 1 is incorporated. Divakaran, Sinha, Khan, and Aleksic disclose all of the elements of the current invention as stated above. Divakaran further discloses further comprising: receiving third audio data representing a third utterance (The system receives the third audio data at step 326 where “the person says, “Yes, it’s here somewhere, let me.. here it is.”; Divakaran, ¶¶ [0067]); associating the third audio data with the first dialogue session identifier (The third audio data at step 326 is a continuation of the same discussion regarding a prescription refill. As such, the third audio data is associated with the first dialogue session; Divakaran, ¶¶ [0067]); determining, using ASR processing, third ASR output data corresponding to the third audio data (“The automatic speech recognition 412 component can identify natural language in audio input” such as in the third utterance described above “and provide the identified words as text {first ASR output} to the rest of the system 400.”; Divakaran, ¶¶ [0067], [0056], [0075]); determining, using NLU processing, that the third ASR output data corresponds to negative feedback (In further embodiments, the system can use “a natural language recognition system {determine, using NLU processing}... to understand what the person wants {feedback corresponding to the third ASR output}” which corresponds to the “natural language in the audio input {the third ASR output data}” where “the system” detects from the ASR output using the NLU processing “audible frustration {corresponds to negative feedback}”; Divakaran, ¶¶ [0067], [0056], [0075]). 
However, Divakaran fails to expressly recite determining, using NLU processing, that the third ASR output data corresponds to negative feedback; determining second output text data representing an apology; determining second output audio data corresponding to the second output text data using TTS processing; sending the second output audio data to the device; determining to end a dialogue corresponding to the first dialogue session identifier; and associating subsequently received fourth audio data with a second dialogue session identifier.
The relevance of Sinha is described above with relation to claim 1. Regarding claim 4, Sinha teaches determining, using NLU processing, that the third ASR output data corresponds to negative feedback (“In some implementations, detecting the user interaction comprises detecting a second speech input and a third speech input… [and] determining that the second speech input and the third speech input indicate dissatisfaction {negative feedback} with the at least one action”; Sinha, ¶¶ [0127]); determining second output text data representing an apology (In response to determining dissatisfaction “the digital assistant prompts the user to provide the speech input, such as by saying ‘Sorry about that—can you please describe what went wrong?’ {...representing an apology}” where the speech synthesis includes “generating text to provide as an output to the user {determining second text output data}”; Sinha, ¶¶ [0127], [0131], [0049]); determining second output audio data corresponding to the second output text data using TTS processing (“speech synthesis module 265 synthesizes speech outputs based on text provided by the digital assistant. For example, the digital assistant generates text to provide as an output to a user, and the speech synthesis module 265 converts the text to an audible speech output.”; Sinha, ¶¶ [0049]); sending the second output audio data to the device (“In some implementations, instead of (or in addition to) using the local speech synthesis module 265, speech synthesis is performed on a remote device (e.g., the server system 108), and the synthesized speech is sent to the user device 104 for output to the user.”; Sinha, ¶¶ [0049]); determining to end a dialogue corresponding to the first dialogue session identifier (“In some implementations, the digital assistant initiates a new information provision process upon receipt of each new user input, and each existing information provision process terminates... when the digital assistant provides a request for additional information or clarification to the user regarding a previous user request that started the existing information provision process {determining to end a dialogue}.” Thus, when the system asked “Sorry about that—can you please describe what went wrong?” the request for clarification ended the current dialog session {the first dialogue session identifier}; Sinha, ¶¶ [0110]); and associating subsequently received fourth audio data with a second dialogue session identifier (“In some implementations, the digital assistant initiates a new information provision process {associating… with a second dialogue session identifier} upon receipt of each new user input {subsequently received fourth audio data}, and each existing information provision process terminates”; Sinha, ¶¶ [0110]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the virtual personal assistant of Divakaran, as modified by the speech error detection systems of Sinha, the emotion recognition systems of Khan, and by the dialogue state determination of Aleksic, to further incorporate the teachings of Sinha to include determining, using NLU processing, that the third ASR output data corresponds to negative feedback; determining second output text data representing an apology; determining second output audio data corresponding to the second output text data using TTS processing; sending the second output audio data to the device; determining to end a dialogue corresponding to the first dialogue session identifier; and associating subsequently received fourth audio data with a second dialogue session identifier. The systems and methods described in Sinha can “identify particular instances where errors have occurred, so that the source of the errors can be identified and addressed,” which is helpful for “improv[ing] the quality of digital assistants,” as recognized by Sinha. (Sinha, ¶ [0005]).

Claim 3 is rejected under 35 U.S.C. §103 as being unpatentable over Divakaran, Sinha, Khan, and Aleksic as applied to claim 1 above, and further in view of Fox (U.S. Pat. App. Pub. No. 2011/0246189, hereinafter Fox).

Regarding claim 3, the rejection of claim 1 is incorporated. Divakaran, Sinha, Khan, and Aleksic disclose all of the elements of the current invention as stated above. However, Divakaran fails to expressly recite further comprising: receiving third audio data representing a third utterance; associating the third audio data with a second dialogue session identifier; determining, 
The relevance of Sinha is described above with relation to claim 1. Regarding claim 3, Sinha teaches further comprising: receiving third audio data representing a third utterance (The system “receives, from a user, a speech input containing a request (402).”; Sinha, ¶¶ [0114], [0110]); associating the third audio data with a second dialogue session identifier (“the digital assistant initiates a new information provision process upon receipt of each new user input, and each existing information provision process terminates either (1) when all of the sub-responses of a complete response to the user request have been provided to the user or (2) when the digital assistant provides a request for additional information or clarification to the user regarding a previous user.” Therefore, the receipt of the request 402, occurring after a request for additional information or a clarification, can be associated with a new dialog session {second dialogue session}; Sinha, ¶¶ [0110]); determining, using ASR processing, third ASR output data corresponding to the third audio data (The system generates a “sequence of words or tokens” from the speech input using “the speech-to-text processing module 330”; Sinha, ¶¶ [0073]); determining, using NLU processing, a third NLU hypothesis corresponding to the third ASR output data (“The natural language processing module 332 (“natural language processor”) of the digital assistant takes the sequence of words or tokens (“token sequence”) generated by the speech-to-text processing module 330, and attempts to associate the token sequence with one or more “actionable intents” recognized by the digital assistant,” where one or more actionable intents includes at least one actionable intent {a third NLU hypothesis}; Sinha, ¶¶ [0073]), the third NLU hypothesis associated with a third confidence score (“In some Sinha, ¶¶ [0083]); receiving signal-to-noise ratio (SNR) data corresponding to the third audio data (“In some implementations, the context information that accompanies the user input includes sensor information, e.g., lighting, ambient noise, ambient temperature, images or videos of the surrounding environment, etc. In some implementations, the context information also includes the physical state of the device, e.g., device orientation, device location, device temperature, power level, speed, acceleration, motion patterns, cellular signals strength, etc.”; Sinha, ¶¶ [0052]); determining second output audio data corresponding to the second output text data using TTS processing (“speech synthesis module 265 synthesizes speech outputs based on text provided by the digital assistant. For example, the digital assistant generates text to provide as an output to a user, and the speech synthesis module 265 converts the text to an audible speech output.”; Sinha, ¶¶ [0049]); and sending the second output audio data to the device (“In some implementations, instead of (or in addition to) using the local speech synthesis module 265, speech synthesis is performed on a remote device (e.g., the server system 108), and the synthesized speech is sent to the user device 104 for output to the user.”; Sinha, ¶¶ [0049]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the virtual personal assistant of Divakaran, as modified by the speech error detection systems of Sinha, the emotion recognition systems of Khan, and by the dialogue state determination of Aleksic, to further incorporate the teachings of Sinha to include further comprising: receiving third audio data representing a third utterance; associating the third audio data with a second dialogue session identifier; determining, using ASR processing, third ASR output data corresponding to the third audio data; determining, using NLU processing, a third NLU hypothesis corresponding to the third ASR output data, the third NLU hypothesis associated with a third confidence score; receiving signal-to-noise ratio (SNR) data Sinha can “identify particular instances where errors have occurred, so that the source of the errors can be identified and addressed,” which is helpful for “improv[ing] the quality of digital assistants,” as recognized by Sinha. (Sinha, ¶ [0005]). However, Divakaran, Sinha, Khan, and Aleksic fail to expressly recite determining that the SNR data exceeds a second threshold value indicating signal energy associated with the third utterance is low; in response to determining that the SNR data exceeds the second threshold value, determining second output text data including a request to move closer to the device and repeat the third utterance.
Fox teaches systems and methods for providing feedback during speech recognition. (Fox, ¶ [0004]). Regarding claim 3, Fox teaches determining that the SNR data exceeds a second threshold value indicating signal energy associated with the third utterance is low (“The audio quality manager 200 also may monitor the signal to noise ratio (SNR). Generally, the signal to noise ratio is a comparison of the power of a desired signal to the power of the noise signal {indicating signal energy associated with the third utterance...}. High signal to noise ratios generally mean it is easier to filter the noise from the signal. A low signal to noise ratio may, for example, indicate that the audio is not sufficiently loud, or too quiet for the system to adequately distinguish the signal from the noise.” The difference between high and low SNR is a threshold value, where low SNR exceeds a threshold value and indicates that signal energy associated with the utterance {the third utterance} is low.; Fox, ¶¶ [0037]); in response to determining that the SNR data exceeds the second threshold value, determining second output text data including a request to move closer to the device and repeat the third utterance (“audio quality manager 200 may provide {determining second output data...} feedback to the user {in response to determining that the SNR data exceeds the second threshold value...} to, for example, adjust the microphone location to provide more distance between the microphone and Fox, ¶¶ [0036], [0037]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the virtual personal assistant of Divakaran, as modified by the speech error detection systems of Sinha, the emotion recognition systems of Khan, and by the dialogue state determination of Aleksic, to incorporate the teachings of Fox to include determining that the SNR data exceeds a second threshold value indicating signal energy associated with the third utterance is low; in response to determining that the SNR data exceeds the second threshold value, determining second output text data including a request to move closer to the device and repeat the third utterance. Providing feedback regarding audio quality during transcription of dictation can “allow correction while dictation is on-going” to improve the resulting audio quality, as recognized by Fox. (Fox, ¶¶ [0004], [0009]).

Claims 5, 8, 10, 12-13, 16, 18, and 20 are rejected under 35 U.S.C. §103 as being unpatentable over Divakaran in view of Sinha and Aleksic.

Regarding claim 5, Divakaran discloses A computer-implemented method comprising (The systems and methods for speech recognition described with reference to the virtual personal assistant.; Divakaran, ¶¶ [0062]): receiving first audio data representing a first utterance (In embodiments describing a request for a prescription, the system discloses “the person tells the system, ‘I’d like to refill a prescription.’ {a first utterance}” where “The system detects {receiving…} that the person is speaking slowly and hesitantly. {first audio data representing the first utterance}”; Divakaran, ¶¶ [0063]); determining, using natural language understanding (NLU) processing, first NLU data corresponding to the first audio data (In further embodiments, the system can use “a natural language recognition system {determine, Divakaran, ¶¶ [0063], [0056], [0075]); causing a first action to be performed corresponding to the first NLU data (“Based on the conclusions that the system has made about the speaker’s emotional or cognitive state {corresponding to the first NLU data}, at step 312, the system determines to change its dialog approach by asking direct yes/no questions {causing a first action to be performed}, and responds, ‘Sure, happy to help you with that. I’ll need to ask you some questions first.’”; Divakaran, ¶¶ [0063]); receiving second audio data representing a second utterance (“At step 330, the person responds, ‘I think so, I found something here.. but.. <sigh>.’”; Divakaran, ¶¶ [0068]); receiving sentiment data corresponding to the second audio data (“From this reply {corresponding to… the second audio data}, the system may detect audible frustration” where frustration indicates a sentiment which is derived from the reply {sentiment data}.; Divakaran, ¶¶ [0068]); determining that the sentiment data indicates frustration (“the system may detect audible frustration,” thus the sentiment data indicates frustration.; Divakaran, ¶¶ [0068]).
However, Divakaran fails to expressly recite determining a repeat indicator based on the second utterance being semantically similar to the first utterance …and in response to the repeat indicator and the sentiment data indicating frustration, determining output data other than performing the first action, wherein the output data corresponds to a system-generated dialogue.
The relevance of Sinha is described above with relation to claim 1. Regarding claim 5, Sinha teaches determining a repeat indicator based on the second utterance being semantically similar to the first utterance (“Users may also indicate dissatisfaction by repeating the same speech input multiple times {determine... that the second utterance is a repeat of the first utterance} in an effort to make the digital assistant understand his or her words or intent. Accordingly, detecting the same input from a user multiple times within a short period of time and/or within the same dialog with the digital assistant {determining, using the first dialogue Sinha, ¶¶ [0126])…and in response to the repeat indicator and the sentiment data indicating frustration, determining output data other than performing the first action, (“In some implementations, upon determining that the user interaction is indicative of a problem (in step (407)) {in response to the repeat indicator and the sentiment data indicating frustration}, the digital assistant provides a first prompt requesting {determining...} the user to confirm whether there was a problem in the performing of the at least one action (430) {output other than performing the first action}.”; Sinha, ¶¶ [0135]) wherein the output data corresponds to a system-generated dialogue (“speech synthesis module 265 synthesizes speech outputs based on text provided by the digital assistant. For example, the digital assistant generates text to provide as an output to a user, and the speech synthesis module 265 converts the text to an audible speech output.”; Sinha, ¶¶ [0049]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the virtual personal assistant of Divakaran to further incorporate the teachings of Sinha to include determining a repeat indicator based on the second utterance being semantically similar to the first utterance …and in response to the repeat indicator and the sentiment data indicating frustration, determining output data other than performing the first action, wherein the output data corresponds to a system-generated dialogue. The systems and methods described in Sinha can “identify particular instances where errors have occurred, so that the source of the errors can be identified and addressed,” which is helpful for “improv[ing] the quality of digital assistants,” as recognized by Sinha. (Sinha, ¶ [0005]). However, Divakaran and Sinha fail to expressly recite wherein a first dialogue session includes a first dialogue session identifier.
The relevance of Aleksic is described above with relation to claim 1. Regarding claim 5, Aleksic teaches wherein a first dialogue session includes a first dialogue session identifier  Aleksic, ¶¶ [0053]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the virtual personal assistant of Divakaran, as modified by the speech error detection systems of Sinha, to incorporate the teachings of Aleksic to include wherein a first dialogue session includes a first dialogue session identifier. Use of dialogue states can allow a “speech recognizer [to] generate more accurate transcriptions of voice inputs,” as recognized by Aleksic. (Aleksic, ¶ [0024]). 

Regarding claim 8, the rejection of claim 5 is incorporated. Divakaran, Sinha, and Aleksic disclose all of the elements of the current invention as stated above. However, Divakaran fails to expressly recite further comprising: receiving third audio data representing a third utterance; associating the third audio data with a dialogue session identifier; performing a second action responsive to the third utterance; receiving fourth audio data representing a fourth utterance; associating the fourth audio data with the dialogue session identifier; determining that the fourth utterance interrupts performance of the second action; receiving second sentiment data corresponding to the fourth audio data; the second sentiment data indicating frustration; and determining second output data other than performance of the second action.
The relevance of Sinha is described above with relation to claim 1. Regarding claim 8, Sinha teaches further comprising: receiving third audio data representing a third utterance (The system “receives, from a user, a speech input containing a request (402).”; Sinha, ¶¶ [0114], [0110]); associating the third audio data with a dialogue session identifier (“the digital assistant initiates a new information provision process upon receipt of each new user input, and Sinha, ¶¶ [0110]); performing a second action responsive to the third utterance (“The digital assistant performs at least one action in furtherance of satisfying the request (404).”; Sinha, ¶¶ [0116]); receiving fourth audio data representing a fourth utterance (“The digital assistant detects a user interaction (406)” where “detecting the user interaction comprises detecting a second speech input {fourth audio data representing a fourth utterance}”; Sinha, ¶¶ [0119]-[0120]); associating the fourth audio data with the dialogue session identifier (“The digital assistant determines whether the user interaction is indicative of a problem in the performing of the at least one action (407),” as the one action was derived from the third audio data, the fourth audio data is part of the same dialogue session {second dialogue session} as the third audio data; Sinha, ¶¶ [0119]); determining that the fourth utterance interrupts performance of the second action (“if a user becomes aware that the digital assistant is not going to properly satisfy the user’s intent, the user will simply terminate the dialog with the assistant and perform the intended action manually (or simply forgo the action altogether)” where the termination can be determined from an utterance (e.g., the example of correcting “Call Jim Carpenter” to “Tim Carpenter”); Sinha, ¶¶ [0133], [0134]); receiving second sentiment data corresponding to the fourth audio data (“In some implementations, the termination of the dialog session occurs prior to satisfying the user’s intent” where “rejection of the task may indicate that the user was dissatisfied with the proposed task, and that the digital assistant may have made an error. {second sentiment data corresponding to the fourth audio data}”; Sinha, ¶¶ [0133]), the second sentiment data indicating frustration (User dissatisfaction {the second sentiment data} can indicate frustration.; Sinha, ¶¶ [0125]); and determining second output data other than performance of the second action (“upon determining that the user interaction is indicative of a problem (in step (407)), the digital assistant provides {determining...} a first prompt {...second output data other than the performance of the second action} requesting the user to confirm whether there was a problem in the performing of the at least one action (430).”; Sinha, ¶¶ [0135]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the virtual personal assistant of Divakaran, as modified by the speech error detection systems of Sinha, and by the dialogue state determination of Aleksic, to further incorporate the teachings of Sinha to include further comprising: receiving third audio data representing a third utterance; associating the third audio data with a dialogue session identifier; performing a second action responsive to the third utterance; receiving fourth audio data representing a fourth utterance; associating the fourth audio data with the dialogue session identifier; determining that the fourth utterance interrupts performance of the second action; receiving second sentiment data corresponding to the fourth audio data; the second sentiment data indicating frustration; and determining second output data other than performance of the second action. The systems and methods described in Sinha can “identify particular instances where errors have occurred, so that the source of the errors can be identified and addressed,” which is helpful for “improv[ing] the quality of digital assistants,” as recognized by Sinha. (Sinha, ¶ [0005]).

Regarding claim 10, the rejection of claim 5 is incorporated. Divakaran, Sinha, and Aleksic disclose all of the elements of the current invention as stated above. Divakaran further discloses further comprising: receiving third audio data representing a third utterance (The system receives the third audio data with “the person 2400 asks: ‘Can you find me a Chinese restaurant in Menlo Park?’”; Divakaran, ¶¶ [0321]); associating the third audio data with a dialogue session identifier (The third audio data is associated with the dialogue session, indicated at FIG. 24; Divakaran, ¶¶ [0321], FIG. 24); determining, using NLU processing, second NLU data corresponding to the third audio data (The system teaches that “the virtual personal assistant can determine that the person’s 2400 intent is to location a particular type of restaurant (Chinese) in a particular city (Menlo Park),” where “the NLU system 2262 can analyze the words and/or phrases produced by the ASR system 2250 and determine the meaning most likely intended by the speaker, given the previous words or phrases spoken by the participant or others involved in the interaction.”; Divakaran, ¶¶ [0321], [0284]); determining second output data representing a confirmation to perform a second action corresponding to the second NLU data (“the virtual personal assistant may draw the conclusion that a restaurant of some other type may satisfy the person’s 2400 request. Using an ontology, the virtual personal assistant may relate “Chinese restaurant” to “Asian restaurant” and further determine that there are Japanese restaurants in Menlo Park. The virtual personal assistant may thus, at step 2406, suggest: ‘I couldn’t find a Chinese restaurant in Menlo Park. How about a Japanese restaurant?’” which is a request for confirmation regarding offering directions to a Japanese restaurant, corresponding to the intent to visit a Chinese restaurant {the second NLU data}.; Divakaran, ¶¶ [0321]); sending the second output data to a device (The virtual assistant outputs “I couldn’t find a Chinese restaurant in Menlo Park. How about a Japanese restaurant?”; Divakaran, ¶¶ [0321]). However, Divakaran fails to expressly recite receiving fourth audio data representing a fourth utterance; associating the fourth audio data with the dialogue session identifier; determining that the fourth utterance corresponds to negative feedback; determining third output data representing an
The relevance of Sinha is described above with relation to claim 1. Regarding claim 10, Sinha teaches receiving fourth audio data representing a fourth utterance (“In some implementations, detecting the user interaction comprises detecting a second speech input and a third speech input”; Sinha, ¶¶ [0127]); associating the fourth audio data with the dialogue session identifier (“determining that the second speech input and the third speech input indicate dissatisfaction {negative feedback} with the at least one action” where the one action is in response to the first speech input. Ergo, the first, second, and third speech input are associated with the same dialog session.; Sinha, ¶¶ [0127]); determining that the fourth utterance corresponds to negative feedback (“In some implementations, detecting the user interaction comprises detecting a second speech input and a third speech input… [and] determining that the second speech input and the third speech input indicate dissatisfaction {negative feedback} with the at least one action”; Sinha, ¶¶ [0127]); determining third output data representing an acknowledgement of the negative feedback (In response to determining dissatisfaction “the digital assistant prompts the user to provide the speech input, such as by saying ‘Sorry about that—can you please describe what went wrong?’ {...representing an apology}” where the speech synthesis includes “generating text to provide as an output to the user {determining second text output data}”; Sinha, ¶¶ [0127], [0131], [0049]); sending the third output data to the device (In response to determining dissatisfaction “the digital assistant prompts the user to provide the speech input, such as by saying ‘Sorry about that—can you please describe what went wrong?’” where the speech synthesis includes “generating text to provide as an output to the user {determining third output data}”; Sinha, ¶¶ [0127], [0131], [0049]); and determining to end a dialogue corresponding to the dialogue session identifier (The system teaches that “each existing information provision process {dialogue corresponding to the dialogue session} terminates...when the digital assistant provides a request for additional information or clarification Sinha, ¶¶ [0110]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the virtual personal assistant of Divakaran, as modified by the speech error detection systems of Sinha, and by the dialogue state determination of Aleksic, to further incorporate the teachings of Sinha to include receiving fourth audio data representing a fourth utterance; associating the fourth audio data with the dialogue session identifier; determining that the fourth utterance corresponds to negative feedback; determining third output data representing an acknowledgement of the negative feedback; sending the third output data to the device; and determining to end a dialogue corresponding to the dialogue session identifier. The systems and methods described in Sinha can “identify particular instances where errors have occurred, so that the source of the errors can be identified and addressed,” which is helpful for “improv[ing] the quality of digital assistants,” as recognized by Sinha. (Sinha, ¶ [0005]).

Regarding claim 12, the rejection of claim 5 is incorporated. Divakaran, Sinha, and Aleksic disclose all of the elements of the current invention as stated above. Divakaran further discloses further comprising: determining, using NLU processing, second NLU data corresponding to the second audio data (In further embodiments, the system can use “a natural language recognition system {determine, using natural language understanding (NLU) processing}... to understand what the person wants {a second NLU data}” which corresponds to the “natural language in the audio input {the second audio data}”; Divakaran, ¶¶ [0063], [0056], [0075]), the first NLU data including first intent data and the second NLU data including second intent data (the system can use “a natural language recognition system {determine, using natural language understanding (NLU) processing}... to understand what the person wants”; Divakaran, ¶¶ [0063], [0056], [0075]). However, Divakaran fails to expressly recite wherein determining the repeat indicator comprises processing the first NLU data with respect to the second NLU data to determine that the second utterance is similar to the first utterance based at least in part on the first intent data corresponding to the second intent data.
The relevance of Sinha is described above with relation to claim 1. Regarding claim 12, Sinha teaches and wherein determining the repeat indicator comprises processing the first NLU data with respect to the second NLU data (In determining dissatisfaction based on repetition, the system can determine “that the digital assistant is not properly identifying the user’s intent from the speech input” even when “the words in the first and second speech input may be somewhat different from one another {comparison of the first NLU hypothesis and the second NLU hypothesis}”; Sinha, ¶¶ [0126]) to determine that the second utterance is similar to the first utterance based at least in part on the first intent data corresponding to the second intent data (“additional factors are considered in selecting the node as well, such as whether the digital assistant has previously correctly interpreted a similar request from a user.”; Sinha, ¶¶ [0083]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the virtual personal assistant of Divakaran, as modified by the speech error detection systems of Sinha, and by the dialogue state determination of Aleksic, to further incorporate the teachings of Sinha to include wherein determining the repeat indicator comprises processing the first NLU data with respect to the second NLU data to determine that the second utterance is similar to the first utterance based at least in part on the first intent data corresponding to the second intent data. The systems and methods described in Sinha can “identify particular instances where errors have occurred, so that the source of the errors can be identified and addressed,” which is helpful for “improv[ing] the quality of digital assistants,” as recognized by Sinha. (Sinha, ¶ [0005]). 

Regarding claim 13, Divakaran discloses a system comprising (The systems and methods for speech recognition described with reference to the virtual personal assistant.; Divakaran, ¶¶ [0062]): at least one processor (“A processor(s), implemented in an integrated circuit, may perform the necessary tasks”; Divakaran, ¶¶ [0340]); and at least one memory including instructions that, when executed by the at least one processor, cause the system to (“When implemented in software, firmware, middleware or microcode, the program code or code segments to perform the necessary tasks (e.g., a computer-program product) may be stored in a computer-readable or machine-readable medium.”; Divakaran, ¶¶ [0340]): receive first audio data representing a first utterance (In embodiments describing a request for a prescription, the system discloses “the person tells the system, ‘I’d like to refill a prescription.’ {a first utterance}” where “The system detects {receiving…} that the person is speaking slowly and hesitantly. {first audio data representing the first utterance}”; Divakaran, ¶¶ [0063]); determine, using natural language understanding (NLU) processing, first NLU data corresponding to the first audio data (In further embodiments, the system can use “a natural language recognition system {determine, using natural language understanding (NLU) processing}... to understand what the person wants {a first NLU hypothesis corresponding to the first ASR output}” which corresponds to the “natural language in the audio input {the first ASR output data}”; Divakaran, ¶¶ [0063], [0056], [0075]); cause a first action to be performed corresponding to the first NLU data (“Based on the conclusions that the system has made about the speaker’s emotional or cognitive state {corresponding to the first NLU data}, at step 312, the system determines to change its dialog approach by asking direct yes/no questions {causing a first action to be performed}, and responds, ‘Sure, happy to help you with that. I’ll need to ask you some questions first.’”; Divakaran, ¶¶ [0063]); receive second audio data representing a second utterance (Divakaran, ¶¶ [0068]); receive sentiment data corresponding to the second audio data (“From this reply {corresponding to… the second audio data}, the system may detect audible frustration” where frustration indicates a sentiment which is derived from the reply {sentiment data}.; Divakaran, ¶¶ [0068]); determine that the sentiment data indicates frustration (“the system may detect audible frustration,” thus the sentiment data indicates frustration.; Divakaran, ¶¶ [0068]). However, Divakaran fails to expressly recite determining a repeat indicator based on the second utterance being semantically similar to the first utterance …and in response to the repeat indicator and the sentiment data indicating frustration, determining output data other than performing the first action, wherein the output data corresponds to a system-generated dialogue.
The relevance of Sinha is described above with relation to claim 1. Regarding claim 13, Sinha teaches determine a repeat indicator based on the second utterance being semantically similar to the first utterance (“Users may also indicate dissatisfaction by repeating the same speech input multiple times {determine... that the second utterance is a repeat of the first utterance} in an effort to make the digital assistant understand his or her words or intent. Accordingly, detecting the same input from a user multiple times within a short period of time and/or within the same dialog with the digital assistant {determining, using the first dialogue session identifier} can indicate that the user is not being properly understood, or that the digital assistant is not properly identifying the user’s intent from the speech input”; Sinha, ¶¶ [0126])…and in response to the repeat indicator and the sentiment data indicating frustration, determine output data other than performing the first action (“In some implementations, upon determining that the user interaction is indicative of a problem (in step (407)) {in response to the repeat indicator and the sentiment data indicating frustration}, the digital assistant provides a first prompt requesting {determining...} the user to confirm whether there was a problem in the performing of the at least one action (430) {output other than performing the first action}.”; Sinha, ¶¶ [0135]), wherein the output data corresponds to a system-generated dialogue (“speech synthesis module 265 synthesizes speech outputs based on text provided by the digital assistant. For example, the digital assistant generates text to provide as an output to a user, and the speech synthesis module 265 converts the text to an audible speech output.”; Sinha, ¶¶ [0049]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the virtual personal assistant of Divakaran to further incorporate the teachings of Sinha to include determining a repeat indicator based on the second utterance being semantically similar to the first utterance …and in response to the repeat indicator and the sentiment data indicating frustration, determining output data other than performing the first action, wherein the output data corresponds to a system-generated dialogue. The systems and methods described in Sinha can “identify particular instances where errors have occurred, so that the source of the errors can be identified and addressed,” which is helpful for “improv[ing] the quality of digital assistants,” as recognized by Sinha. (Sinha, ¶ [0005]). However, Divakaran and Sinha fail to expressly recite wherein a first dialogue session includes a first dialogue session identifier.
The relevance of Aleksic is described above with relation to claim 1. Regarding claim 13, Aleksic teaches wherein a first dialogue session includes a first dialogue session identifier (The system can include a “dialog session identifier [which] is data that indicates a particular dialog session associated with the request 212. The dialog session identifier may be used by the speech recognizer 202 to correlate a series of transcription requests {first audio data} that relate to a same dialog session. {associating... with a first dialogue session identifier}”; Aleksic, ¶¶ [0053]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the virtual personal assistant of Divakaran, as modified by the speech error detection systems of Sinha, to incorporate the teachings of Aleksic to include wherein a first dialogue session includes a first dialogue session identifier. Use Aleksic. (Aleksic, ¶ [0024]). 

Regarding claim 16, the rejection of claim 13 is incorporated. Claim 16 is substantially the same as claim 8 and is therefore rejected under the same rationale as above.

Regarding claim 18, the rejection of claim 13 is incorporated. Claim 18 is substantially the same as claim 10 and is therefore rejected under the same rationale as above.

Regarding claim 20, the rejection of claim 13 is incorporated. Claim 20 is substantially the same as claim 12 and is therefore rejected under the same rationale as above.

Claims 6-7, 9, 14-15, 17 is/are rejected under 35 U.S.C. §103 as being unpatentable over Divakaran in view of Sinha and Aleksic as applied to claims 5 and 13 above, and further in view of Khan.

Regarding claim 6, the rejection of claim 5 is incorporated. Divakaran, Sinha, and Aleksic disclose all of the elements of the current invention as stated above. However, Divakaran fails to expressly recite further comprising: determining, using NLU processing, second NLU data corresponding to the first audio data, the second NLU data different than the first NLU data; determining that the first NLU data satisfies a first condition; and wherein determining the output data comprises determining the output data representing a confirmation of the first action.
The relevance of Sinha is described above with relation to claim 1. Regarding claim 6, Sinha teaches further comprising: determining, using NLU processing, second NLU data corresponding to the first audio data (“The natural language processing module 332 (“natural language processor”) of the digital assistant takes the sequence of words or tokens (“token Sinha, ¶¶ [0073]), the second NLU data different than the first NLU data (“scope of a digital assistant’s capabilities is dependent... on the number and variety of ‘actionable intents’,” thus indicating that the actionable intents are different { the second NLU data different than the first NLU data}; Sinha, ¶¶ [0073]); determining that the first NLU data satisfies a first condition (“the domain having the highest confidence value (e.g., based on the relative importance of its various triggered nodes) is selected.” Thus the condition of having the highest confidence value can be satisfied by the first NLU data.; Sinha, ¶¶ [0083]); and wherein determining the output data comprises determining the output data representing a confirmation of the first action (“In some implementations, after all of the tasks needed to fulfill the user’s request have been performed, the digital assistant 326 formulates a confirmation response, and sends the response back to the user through the I/O processing module 328.”; Sinha, ¶¶ [0096]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the virtual personal assistant of Divakaran, as modified by the speech error detection systems of Sinha, and by the dialogue state determination of Aleksic, to further incorporate the teachings of Sinha to include determining, using NLU processing, second NLU data corresponding to the first audio data, the second NLU data different than the first NLU data; determining that the first NLU data satisfies a first condition; and wherein determining the output data comprises determining the output data representing a confirmation of the first action. The systems and methods described in Sinha can “identify particular instances where errors have occurred, so that the source of the errors can be identified and addressed,” which is helpful for “improv[ing] the quality of digital assistants,” as recognized by Sinha. (Sinha, ¶ [0005]). However, Divakaran, Sinha, and Aleksic fail to expressly recite determining that the second NLU data does not satisfy a second condition.
The relevance of Khan is described above with relation to claim 1. Regarding claim 6, Khan teaches determining that the second NLU data does not satisfy a second condition (“The matching component 230 is configured to associate one or more emotions with the audio signal based upon the computed confidence scores and whether or not one or more confidence score thresholds has been met or exceeded,” thus the confidence score confirms the presence of the emotion, and where the confidence score can either satisfy a threshold or fail to satisfy said threshold (thus, confirming or failing to confirm the hypothesis of the emotion {does not satisfy a second condition}).; Khan, ¶¶ [0043]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the virtual personal assistant of Divakaran, as modified by the speech error detection systems of Sinha, and by the dialogue state determination of Aleksic, to further incorporate the teachings of Khan to include determining that the second NLU data does not satisfy a second condition. “Understanding of the emotional state of a speaker… provides for improved accuracy and reduced error rate, thus resulting in improved efficiency in carrying out processes reliant on audio detection and interpretation,” as recognized by Khan. (Khan, ¶ [0005]).

Regarding claim 7, the rejection of claim 5 is incorporated. Divakaran, Sinha, and Aleksic disclose all of the elements of the current invention as stated above. However, Divakaran fails to expressly recite further comprising: determining, using NLU processing, second NLU data corresponding to the first audio data, the second NLU data different than the first NLU data.
The relevance of Sinha is described above with relation to claim 1. Regarding claim 7, Sinha teaches further comprising: determining, using NLU processing, second NLU data corresponding to the first audio data (“The natural language processing module 332 (“natural Sinha, ¶¶ [0073]), the second NLU data different than the first NLU data (“scope of a digital assistant’s capabilities is dependent... on the number and variety of ‘actionable intents’,” thus indicating that the actionable intents are different {the second NLU data different than the first NLU data}; Sinha, ¶¶ [0073]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the virtual personal assistant of Divakaran, as modified by the speech error detection systems of Sinha, and by the dialogue state determination of Aleksic, to further incorporate the teachings of Sinha to include determining, using NLU processing, second NLU data corresponding to the first audio data, the second NLU data different than the first NLU data. The systems and methods described in Sinha can “identify particular instances where errors have occurred, so that the source of the errors can be identified and addressed,” which is helpful for “improv[ing] the quality of digital assistants,” as recognized by Sinha. (Sinha, ¶ [0005]). However, Divakaran, Sinha, and Aleksic fail to expressly recite determining that the second NLU data satisfies a condition, and wherein determining the output data comprises determining the output data including a representation of a second action corresponding to the second NLU data. 
The relevance of Khan is described above with relation to claim 1. Regarding claim 7, Khan teaches determining that the second NLU data satisfies a condition (“The matching component 230 is configured to associate one or more emotions with the audio signal based upon the computed confidence scores and whether or not one or more confidence score thresholds has been met or exceeded,” thus the confidence score confirms the presence of the emotion, and where the confidence score can either satisfy a threshold or fail to satisfy said threshold (thus, Khan, ¶¶ [0043]), and wherein determining the output data comprises determining the output data including a representation of a second action corresponding to the second NLU data (“The action initiating component 232 is configured to initiate any of a number of different actions in response to associating one or more emotions with an audio signal.” where “The matching component 230 is configured to associate one or more emotions with the audio signal based upon the computed confidence scores and whether or not one or more confidence score thresholds has been met or exceeded.”; Khan, ¶¶ [0043], [0044]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the virtual personal assistant of Divakaran, as modified by the speech error detection systems of Sinha, and by the dialogue state determination of Aleksic, to further incorporate the teachings of Khan to include determining that the second NLU data satisfies a condition, and wherein determining the output data comprises determining the output data including a representation of a second action corresponding to the second NLU data. “Understanding of the emotional state of a speaker… provides for improved accuracy and reduced error rate, thus resulting in improved efficiency in carrying out processes reliant on audio detection and interpretation,” as recognized by Khan. (Khan, ¶ [0005]).

Regarding claim 9, the rejection of claim 5 is incorporated. Divakaran, Sinha, and Aleksic disclose all of the elements of the current invention as stated above. However, Divakaran fails to expressly recite further comprising: determining, using automatic speech recognition (ASR) processing, an ASR confidence score corresponding to the first audio data.
The relevance of Sinha is described above with relation to claim 1. Regarding claim 9, Sinha teaches further comprising: determining, using automatic speech recognition (ASR) processing, an ASR confidence score corresponding to the first audio data (“In some Sinha, ¶¶ [0083]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the virtual personal assistant of Divakaran, as modified by the speech error detection systems of Sinha, and by the dialogue state determination of Aleksic, to further incorporate the teachings of Sinha to include determining, using automatic speech recognition (ASR) processing, an ASR confidence score corresponding to the first audio data. The systems and methods described in Sinha can “identify particular instances where errors have occurred, so that the source of the errors can be identified and addressed,” which is helpful for “improv[ing] the quality of digital assistants,” as recognized by Sinha. (Sinha, ¶ [0005]). However, Divakaran, Sinha, and Aleksic fail to expressly recite determining a NLU confidence score associated with the first NLU data; receiving alternative representation data corresponding to the first utterance; and determining the output data other than performing the first action based at least in part on the sentiment data, the ASR confidence score, the NLU confidence score, and the alternative representation data.
The relevance of Khan is described above with relation to claim 1. Regarding claim 9, Khan teaches determining a NLU confidence score associated with the first NLU data (“Confidence scores for one or more defined emotions are computed, as indicated at block 414, based upon the computed audio fingerprint.” where defined emotions can include “emotion of ‘anger’ “; Khan, ¶¶ [0054]); receiving alternative representation data corresponding to the first utterance (“The action initiating component 232 is configured to initiate any of a number of different actions {alternative representation data} in response to associating one or more emotions with an audio signal {corresponding to the first utterance}” where “The matching component 230 is configured to associate one or more emotions with the audio signal based upon the computed confidence scores and whether or not one or more confidence score thresholds has been met or exceeded.”; Khan, ¶¶ [0043], [0044]); and determining the output data other than performing the first action based at least in part on the sentiment data, the ASR confidence score, the NLU confidence score, and the alternative representation data (When {based at least in part on...} an audio fingerprint indicates anger {the sentiment data} and the computed confidence score {NLU confidence score} threshold is met, the system will “prompt the speaker with an “are you sure?” type of message {determining the output data other than performing the first action}”; Khan, ¶¶ [0044]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the virtual personal assistant of Divakaran, as modified by the speech error detection systems of Sinha, and by the dialogue state determination of Aleksic, to further incorporate the teachings of Khan to include determining a NLU confidence score associated with the first NLU data; receiving alternative representation data corresponding to the first utterance; and determining the output data other than performing the first action based at least in part on the sentiment data, the ASR confidence score, the NLU confidence score, and the alternative representation data. “Understanding of the emotional state of a speaker… provides for improved accuracy and reduced error rate, thus resulting in improved efficiency in carrying out processes reliant on audio detection and interpretation,” as recognized by Khan. (Khan, ¶ [0005]).

Regarding claim 14, the rejection of claim 13 is incorporated. Claim 14 is substantially the same as claim 6 and is therefore rejected under the same rationale as above.

Regarding claim 15, the rejection of claim 13 is incorporated. Claim 15 is substantially the same as claim 7 and is therefore rejected under the same rationale as above.

Regarding claim 17, the rejection of claim 13 is incorporated. Claim 17 is substantially the same as claim 9 and is therefore rejected under the same rationale as above.

Claims 11 and 19 is/are rejected under 35 U.S.C. §103 as being unpatentable over Divakaran in view of Sinha and Aleksic as applied to claims 5 and 13 above, and further in view of Fox.

Regarding claim 11, the rejection of claim 5 is incorporated. Divakaran, Sinha, and Aleksic disclose all of the elements of the current invention as stated above. However, Divakaran fails to expressly recite further comprising: receiving third audio data representing a third utterance; determining second output text data representing a system request; determining second output audio data corresponding to the second output text data using speech synthesis processing; and sending the second output audio data to a device.
The relevance of Sinha is described above with relation to claim 1. Regarding claim 11, Sinha teaches further comprising: receiving third audio data representing a third utterance (The system “receives, from a user, a speech input containing a request (402).”; Sinha, ¶¶ [0114], [0110]); determining second output text data representing a system request (“audio quality manager 200 may provide {determining second output data...} feedback to the user {in response to determining that the SNR data exceeds the second threshold value...} to, for example, adjust the microphone location to provide more distance between the microphone and the mouth or the user as the input signal amplitude will be decreased with distance {including a request to move closer to the device}” and “a request that the user modulate his/her voice to a lower volume {repeat the third utterance}; Fox, ¶¶ [0036], [0037]); determining second output audio data corresponding to the second output text data using speech synthesis processing (“speech synthesis module 265 synthesizes speech outputs based on text provided by the digital assistant. For example, the digital assistant generates text to provide as an output to a user, and the speech synthesis module 265 converts the text to an audible speech output.”; Sinha, ¶¶ [0049]); and sending the second output audio data to a device (“In some implementations, instead of (or Sinha, ¶¶ [0049]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the virtual personal assistant of Divakaran, as modified by the speech error detection systems of Sinha, and by the dialogue state determination of Aleksic, to further incorporate the teachings of Sinha to include receiving third audio data representing a third utterance; determining second output text data representing a system request; determining second output audio data corresponding to the second output text data using speech synthesis processing; and sending the second output audio data to a device. The systems and methods described in Sinha can “identify particular instances where errors have occurred, so that the source of the errors can be identified and addressed,” which is helpful for “improv[ing] the quality of digital assistants,” as recognized by Sinha. (Sinha, ¶ [0005]). However, Divakaran, Sinha, and Aleksic fail to expressly recite receiving signal quality data corresponding to the third audio data; [and] determining that the signal quality data corresponds to a potential error in ASR processing of the third audio data.
The relevance of Fox is described above with relation to claim 3. Regarding claim 11, Fox teaches receiving signal quality data corresponding to the third audio data (“The audio quality manager 200 also may monitor the signal to noise ratio (SNR). Generally, the signal to noise ratio is a comparison of the power of a desired signal to the power of the noise signal {indicating signal energy associated with the third utterance...}.”; Fox, ¶¶ [0037]); determining that the signal quality data corresponds to a potential error in ASR processing of the third audio data (“High signal to noise ratios generally mean it is easier to filter the noise from the signal. A low signal to noise ratio may, for example, indicate that the audio is not sufficiently loud, or too quiet for the system to adequately distinguish the signal from the noise.” The difference between high and low SNR is a threshold value, where low SNR exceeds a threshold value and Fox, ¶¶ [0037]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the virtual personal assistant of Divakaran, as modified by the speech error detection systems of Sinha, and by the dialogue state determination of Aleksic, to further incorporate the teachings of Fox to include receiving signal quality data corresponding to the third audio data; [and] determining that the signal quality data corresponds to a potential error in ASR processing of the third audio data. Providing feedback regarding audio quality during transcription of dictation can “allow correction while dictation is on-going” to improve the resulting audio quality, as recognized by Fox. (Fox, ¶ [0004], [0009]).

Regarding claim 19, the rejection of claim 13 is incorporated. Claim 19 is substantially the same as claim 11 and is therefore rejected under the same rationale as above.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Kulkarni et al. (U.S. Pat. App. Pub. No. 2018/0254035) discloses systems and methods for detecting whether hyperarticulation is present in repetitive voice queries, which can indicate frustration.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Sean E. Serraguard whose telephone number is (313)446-6627. The examiner can normally be reached 07:00-17:00 M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is 
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel C. Washburn, can be reached at (571) 272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/Sean E Serraguard/Patent Examiner, Art Unit 2657

/DANIEL C WASHBURN/Supervisory Patent Examiner, Art Unit 2657