Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
All objections/rejections not mentioned in this Office Action have been withdrawn by the Examiner.

Response to Amendments 
Applicant’s amendment filed on March 21, 2022 has been entered. 
The amendments to claims 1, 5, 15, and 20 have been acknowledged and entered.
In view of the amendment to claim(s) 5, the objection to claim(s) 5 is withdrawn.
In view of the amendment to claim(s) 5, the rejection of claim(s) 5 under 35 U.S.C. §112 is withdrawn.
In view of the amendments to claims 1, 15 and 20, new grounds for rejection under 35 U.S.C. §112 are provided in the response below.
In view of the amendment to claim(s) 1, 15, and 20, the rejections of claims 1-20 under 35 U.S.C. §102 and 103 are withdrawn.
In light of the amended/newly added claims, new grounds for rejection under 35 U.S.C. §103 are provided in the response below. 

Response to Arguments
Applicant’s arguments regarding the prior art rejections under 35 U.S.C. §102/103, see pages 8-10 of the Response to Non-Final Office Action dated December 21, 2021, which was received on March 21, 2022 (hereinafter Response and Office Action, respectively), have been fully considered.
With respect to the rejection(s) of claim(s) 1 and 15 under 35 U.S.C. §102(a)(1) as anticipated by Azara (U.S. Pat. App. Pub. No. 2005/0182619, hereinafter Azara), applicant asserts that the prior art of record, specifically Azara, Hu (U.S. Pat. App. Pub. No. 2020/0251097, hereinafter Hu), McDaniel (U.S. Pat. No. 11,132,993, hereinafter McDaniel), Hoffmeister (U.S. Pat. No. 10,388,274, hereinafter Hoffmeister), and Non-Patent Literature to Joshi (Aditya Joshi, Pushpak Bhattacharyya, and Mark J. Carman. 2017. Automatic Sarcasm Detection: A Survey. ACM Comput. Surv. 50, 5, Article 73 (September 2017), 22 pages. DOI:https://doi.org/10.1145/3124420, hereinafter Joshi), fails to teach or suggest “wherein either the first interpretation or the second interpretation includes a sarcastic interpretation based on the acoustic characteristics” and “wherein the response includes a preliminary response and the response accounts for the ambiguation.” However, this argument is not persuasive.
Regarding “wherein the response includes a preliminary response and the response accounts for the ambiguation,” Azara discloses the cited element. With relation to the alternate possible meanings, Azara discloses “The presence of more than one set of candidate discourse functions reflects alternate possible meanings associated with the speech information. Thus, if the recognized speech contains an ambiguity, the candidate discourse functions include the alternate candidate sets of discourse functions corresponding to the identified ambiguities.” (Azara, ¶ [0048]). Thus, Azara discloses multiple preliminary responses in that the system includes a set of candidate discourse functions for each of the alternate possible meanings, and those candidate sets account for the ambiguities.
Further, though examiner agrees that “wherein either the first interpretation or the second interpretation includes a sarcastic interpretation based on the acoustic characteristics” is not expressly recited by Azara, at least McDaniel cures the deficiency of Azara. As previously presented, McDaniel discloses the use of “prosodic features… extracted 145 from the words and inter-word boundaries of the segment in various embodiments” to determine “the emotional state of the speaker, the form of an utterance (whether a statement, question, or command), the presence of irony or sarcasm, and emphasis, contrast, and focus.” (McDaniel, Col. 6, lines 21-34). McDaniel further explains that “Prosodic features are normally identified as either an auditory measure or an acoustic measure”. (McDaniel, Col. 6, lines 31-34).
Therefore, the rejection of claims 1 and 15 under 35 U.S.C. §102(a)(1) is withdrawn. However, in response to the amendment, new grounds for rejection under 35 U.S.C. §103 are presented below for claims 1 and 15 as obvious in light of Azara in view of McDaniel.
Applicant further argues that dependent claims 2-14 and 16-20 are allowable for at least the same reasons as independent claims 1 and 15. Applicant’s arguments in light of the amended claims are persuasive with regard to Azara. As such, the rejections of claims 2-14 and 16-20 under 35 U.S.C. §102 and 103 are withdrawn and re-presented in light of Azara in view of McDaniel, as discussed with reference to claims 1 and 15.
In response to applicant's argument regarding the number of cited references, reliance on a large number of references in a rejection does not, without more, weigh against the obviousness of the claimed invention.  See In re Gorman, 933 F.2d 982, 18 USPQ2d 1885 (Fed. Cir. 1991).
The Applicant has not provided any further statement; therefore, the Examiner directs the Applicant to the rationale below.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claim(s) 1-3, 13, and 15-16 is/are rejected under 35 U.S.C. 103 as being unpatentable over Azara (U.S. Pat. App. Pub. No. 2005/0182619, hereinafter Azara) in view of McDaniel (U.S. Pat. No. 11,132,993, hereinafter McDaniel).

Regarding claim 1, Azara discloses A method of response generation, comprising (Discloses systems and methods for “resolv[ing] ambiguities… before the command can be properly executed {for response generation}”; Azara, ¶ [0037]): receiving audible human speech from a user (“The speech information {audible human speech} may be obtained from...human-computer commands, human-computer dictation {from a user} and the like.”; Azara, ¶ [0050]); determining textual speech data based on the audible human speech (“In step S20, the speech information is determined. The speech information may be obtained from any source of natural language information…. [and] is recognized using an automatic speech recognition system.”; Azara, ¶ [0050]); extracting, from the audible human speech, signal speech data that is indicative of acoustic characteristics which correspond to the textual speech data (“After the speech information has been recognized, control continues to step S30 where the prosodic features are determined.”; Azara, ¶ [0050]); based on the textual speech data, determining, using a natural language understanding model, a text string comprising an ambiguation, (“if the recognized speech contains an ambiguity, the candidate discourse functions include the alternate candidate sets of discourse functions corresponding to the identified ambiguities.”; Azara, ¶ [0048]) wherein the ambiguation comprises a first interpretation of the text string and a second interpretation of the text string, (the ambiguity can include “alternate possible meanings associated with the speech information,” thus including a first interpretation and a second interpretation of the text string; Azara, ¶ [0048]) wherein the first interpretation differs from the second interpretation (the interpretations have “alternate possible meanings,” thus, the first interpretation differs from the second interpretation; Azara, ¶ [0048]); determining the first interpretation is most accurate by corresponding word boundaries determined from the text string with the acoustic characteristics determined from the signal speech data (“A relation is determined between the prosodic features identified in the speech information and the expected prosodic features.” where “The prosodic features include but are not limited to pitch frequency, rate of speech, stress, number of intonational boundaries or any other known or later developed prosodic feature useful in determining discourse functions.” and where “The speech information is disambiguated or resolved based on the rank of the sets of candidate discourse functions. Sets of candidate discourse functions that are more likely prosodically will rank higher.”; Azara, ¶¶ [0051], [0053], [0055]); and generating a response to the audible human speech based on the first interpretation (“based on the correlation of identified prosodic features {based on the first interpretation}…” from the “candidate discourse functions... [indicating] possible alternate meaning of a speaker’s utterance {to the audible human speech},” “a natural language interface implementing the system for resolving ambiguity might decide whether to continue in a dictation or data mode when processing this sentence {generate a response…}.”; Azara, ¶ [0080]), wherein the response includes a preliminary response and the response accounts for the ambiguation (With relation to the alternate possible meanings, Azara discloses “The presence of more than one set of candidate discourse functions reflects alternate possible meanings associated with the speech information. Thus, if the recognized speech contains an ambiguity, the candidate discourse functions include the alternate candidate sets of discourse functions corresponding to the identified ambiguities.” Further, by acknowledging and providing alternative responses based on the ambiguity, Azara accounts for the ambiguation; Azara, ¶ [0048]).
However, Azara fails to expressly recite wherein either the first interpretation or the second interpretation includes a sarcastic interpretation based on the acoustic characteristics.
McDaniel teaches “analyzing an audio to capture semantic and non-semantic characteristics of the audio.” (McDaniel, Col. 1, lines 45-50). Regarding claim 1, McDaniel teaches wherein either the first interpretation or the second interpretation includes a sarcastic interpretation based on the acoustic characteristics (discloses the use of “prosodic features… extracted 145 from the words and inter-word boundaries of the segment in various embodiments” to determine “the emotional state of the speaker, the form of an utterance (whether a statement, question, or command), the presence of irony or sarcasm, and emphasis, contrast, and focus,” where prosodic features include an “auditory measure or an acoustic measure”; McDaniel, Col. 6, lines 21-34).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the systems and methods of ambiguity resolution of Azara to incorporate the teachings of McDaniel to include wherein either the first interpretation or the second interpretation includes a sarcastic interpretation based on the acoustic characteristics. “Capturing the semantic and non-semantic characteristics [of a voice] along with the corresponding relationships between the characteristics can enable the IVR to better identify the party's intention,” as recognized by McDaniel. (McDaniel, Col. 46, lines 20-25).

Regarding claim 2, the rejection of claim 1 is incorporated. Azara and McDaniel disclose all of the elements of the current invention as stated above. However, Azara fails to expressly recite wherein the signal speech data comprises at least one of sarcasm information, emotion information, pause information, or emphasis information.
The relevance of McDaniel is described above with relation to claim 1. Regarding claim 2, McDaniel teaches wherein the signal speech data comprises at least one of sarcasm information, emotion information, pause information, or emphasis information (“Prosodic features may reflect various characteristics of a speaker such as, for example, the emotional state of the speaker, the form of an utterance (whether a statement, question, or command), the presence of irony or sarcasm, and emphasis, contrast, and focus.”; McDaniel, Col. 6, lines 24-31).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the systems and methods of ambiguity resolution of Azara to incorporate the teachings of McDaniel to include wherein the signal speech data comprises at least one of sarcasm information, emotion information, pause information, or emphasis information. “Capturing the semantic and non-semantic characteristics [of a voice] along with the corresponding relationships between the characteristics can enable the IVR to better identify the party's intention,” as recognized by McDaniel. (McDaniel, Col. 46, lines 20-25).

Regarding claim 3, the rejection of claim 1 is incorporated. Azara and McDaniel disclose all of the elements of the current invention as stated above. However, Azara fails to expressly recite further comprising determining the first interpretation using pause information in the signal speech data, wherein the pause information corresponds to at least one word boundary in the text string.
The relevance of McDaniel is described above with relation to claim 1. Regarding claim 3, McDaniel teaches further comprising determining the first interpretation using pause information in the signal speech data, (“the extraction module extracts prosodic features reflecting pause durations {using pause information in the signal speech data}, phone durations, and pitch information,” where prosodic features are used to determine “the emotional state of the speaker, the form of an utterance (whether a statement, question, or command), the presence of irony or sarcasm, and emphasis, contrast, and focus {e.g., the first interpretation}”; McDaniel, Col. 20, lines 62-64) wherein the pause information corresponds to at least one word boundary in the text string (“pause features are extracted at inter-word boundaries found in the segment, that is to say, the pause features are extracted at spaces/pauses occurring between two consecutive words found in the segment”; McDaniel, Col. 20, lines 64-67).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the systems and methods of ambiguity resolution of Azara to incorporate the teachings of McDaniel to include further comprising determining the first interpretation using pause information in the signal speech data, wherein the pause information corresponds to at least one word boundary in the text string. “Capturing the semantic and non-semantic characteristics [of a voice] along with the corresponding relationships between the characteristics can enable the IVR to better identify the party's intention,” as recognized by McDaniel. (McDaniel, Col. 46, lines 20-25).
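
For illustration only, the pause information discussed with regard to claim 3 (pause features taken at the boundaries between consecutive words) can be sketched as follows. This is a hypothetical Python example under stated assumptions, not code from the application, Azara, or McDaniel: the function name `pauses_at_word_boundaries` and the `(text, start_sec, end_sec)` word timing format are assumed, and the word timings are taken as already supplied by a speech recognizer.

```python
from typing import List, Tuple

# Assumed input format (hypothetical): one tuple per recognized word,
# holding the word text and its start and end times in seconds.
Word = Tuple[str, float, float]


def pauses_at_word_boundaries(words: List[Word]) -> List[Tuple[str, str, float]]:
    """Return (previous word, next word, pause in seconds) for each
    boundary between two consecutive words."""
    pauses = []
    for (prev_text, _, prev_end), (next_text, next_start, _) in zip(words, words[1:]):
        # Overlapping timings are clamped to a zero-length pause.
        pauses.append((prev_text, next_text, max(0.0, next_start - prev_end)))
    return pauses


if __name__ == "__main__":
    timed_words = [("call", 0.0, 0.5), ("John", 1.0, 1.5), ("Smith", 1.5, 2.0)]
    for prev_text, next_text, pause in pauses_at_word_boundaries(timed_words):
        print(f"{prev_text} | {next_text}: {pause:.2f} s")
```

In this sketch, a longer pause between two words is simply reported per boundary; how such a value would be weighed when selecting an interpretation is not addressed here.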

Regarding claim 13, the rejection of claim 1 is incorporated. Azara and McDaniel disclose all of the elements of the current invention as stated above. Azara further discloses data regarding a context of the audible human speech (In the example provided, the system detects that “The emphasis on “John” may be used to subordinate the phrase “MAX FELL” to the phrase “JOHN PUSHED HIM”.” where the phrase “john pushed him” is data regarding the context of “Max fell” where “The subordination is then used to infer that John’s push was the cause of Max’s fall.”; Azara, ¶ [0087]). However, Azara fails to expressly recite further comprising: prior to generating the response, evaluating, at a dialog management model the first interpretation in light of one or more of: sarcasm information, emotion information, emphasis information, data regarding the user…, or external data relevant to the user or the audible human speech, wherein the external data comprises data regarding a time of the audible human speech, data regarding a location of the audible human speech, or both.
The relevance of McDaniel is described above with relation to claim 1. Regarding claim 13, McDaniel teaches further comprising: prior to generating the response, evaluating, at a dialog management model the first interpretation in light of one or more of: sarcasm information (“prosodic features are extracted... [and] may reflect various characteristics of a speaker such as...the presence of irony or sarcasm”; McDaniel, Col. 6, lines 21-31), emotion information (“prosodic features are extracted... [and] may reflect various characteristics of a speaker such as, for example, the emotional state of the speaker”; McDaniel, Col. 6, lines 21-31), emphasis information (“prosodic features are extracted... [and] may reflect various characteristics of a speaker such as… emphasis”; McDaniel, Col. 6, lines 21-31), data regarding the user (“An auditory measure represents a subjective impression produced in the mind of the listener. Popular variables in auditory terms include the pitch of the voice, length of sounds, loudness, and timbre.”; McDaniel, Col. 6, lines 34-35), or external data relevant to the user or the audible human speech, wherein the external data comprises data regarding a time of the audible human speech, data regarding a location of the audible human speech, or both (“the semantic and non-semantic characteristics may be displayed to a third party who may be interested in reviewing the transcript 170. For example, the transcript 170 may be displayed as text along a timeline representing what was spoken by the first party” where a timeline indicates that the time of the audible speech is collected and evaluated {such that a timeline can be produced}; McDaniel, Col. 7, lines 21-23).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the systems and methods of ambiguity resolution of Azara to incorporate the teachings of McDaniel to include further comprising: prior to generating the response, evaluating, at a dialog management model the first interpretation in light of one or more of: sarcasm information, emotion information, emphasis information, data regarding the user…, or external data relevant to the user or the audible human speech, wherein the external data comprises data regarding a time of the audible human speech, data regarding a location of the audible human speech, or both. “Capturing the semantic and non-semantic characteristics [of a voice] along with the corresponding relationships between the characteristics can enable the IVR to better identify the party's intention,” as recognized by McDaniel. (McDaniel, Col. 46, lines 20-25).


Regarding claim 15, Azara discloses A non-transitory computer-readable medium comprising a plurality of computer-executable instructions and memory for maintaining the plurality of computer-executable instructions, (“The system for resolving ambiguity 100 and the various circuits discussed above can also be implemented by physically incorporating the system for resolving ambiguity 100 into software and/or a hardware system, such as the hardware and software systems of a web server or a client device” using “any appropriate combination of alterable, volatile or non-volatile memory or non-alterable, or fixed memory;” Azara, ¶¶ [0104]-[0105]); wherein the plurality of computer-executable instructions, when executed by one or more processors of a computer, perform the following function(s) (Discloses systems and methods for “resolv[ing] ambiguities… before the command can be properly executed {for response generation}”; Azara, ¶ [0037]): receive audible human speech (“The speech information {audible human speech} may be obtained from...human-computer commands, human-computer dictation {from a user} and the like.”; Azara, ¶ [0050]); determine textual speech data based on the audible human speech (“In step S20, the speech information is determined. The speech information may be obtained from any source of natural language information… [and] is recognized using an automatic speech recognition system.”; Azara, ¶ [0050]); extract, from the audible human speech, signal speech data that is indicative of acoustic characteristics which correspond to the textual speech data (“After the speech information has been recognized, control continues to step S30 where the prosodic features are determined.”; Azara, ¶ [0050]); based on the textual speech data, determine, using a natural language understanding model, a text string comprising an ambiguation, (“if the recognized speech contains an ambiguity, the candidate discourse functions include the alternate candidate sets of discourse functions corresponding to the identified ambiguities.”; Azara, ¶ [0048]) wherein the ambiguation comprises a first interpretation of the text string and a second interpretation of the text string, (the ambiguity can include “alternate possible meanings associated with the speech information,” thus including a first interpretation and a second interpretation of the text string; Azara, ¶ [0048]) wherein the first interpretation differs from the second interpretation (the interpretations have “alternate possible meanings,” thus, the first interpretation differs from the second interpretation; Azara, ¶ [0048]); determine the first interpretation is most accurate by corresponding word boundaries determined from the text string with the acoustic characteristics determined from the signal speech data (“A relation is determined between the prosodic features identified in the speech information and the expected prosodic features.” where “The prosodic features include but are not limited to pitch frequency, rate of speech, stress, number of intonational boundaries or any other known or later developed prosodic feature useful in determining discourse functions.” and where “The speech information is disambiguated or resolved based on the rank of the sets of candidate discourse functions. Sets of candidate discourse functions that are more likely prosodically will rank higher.”; Azara, ¶¶ [0051], [0053], [0055]); and generate a response to the audible human speech based on the first interpretation (“based on the correlation of identified prosodic features {based on the first interpretation}…” from the “candidate discourse functions... [indicating] possible alternate meaning of a speaker’s utterance {to the audible human speech},” “a natural language interface implementing the system for resolving ambiguity might decide whether to continue in a dictation or data mode when processing this sentence {generate a response…}.”; Azara, ¶ [0080]), wherein the response includes a preliminary response and the response accounts for the ambiguation (With relation to the alternate possible meanings, Azara discloses “The presence of more than one set of candidate discourse functions reflects alternate possible meanings associated with the speech information. Thus, if the recognized speech contains an ambiguity, the candidate discourse functions include the alternate candidate sets of discourse functions corresponding to the identified ambiguities.” Further, by acknowledging and providing alternative responses based on the ambiguity, Azara accounts for the ambiguation; Azara, ¶ [0048]). However, Azara fails to expressly recite wherein either the first interpretation or the second interpretation includes a sarcastic interpretation based on the acoustic characteristics.
The relevance of McDaniel is described above with relation to claim 1. Regarding claim 15, McDaniel teaches wherein either the first interpretation or the second interpretation includes a sarcastic interpretation based on the acoustic characteristics (discloses the use of “prosodic features… extracted 145 from the words and inter-word boundaries of the segment in various embodiments” to determine “the emotional state of the speaker, the form of an utterance (whether a statement, question, or command), the presence of irony or sarcasm, and emphasis, contrast, and focus,” where prosodic features include an “auditory measure or an acoustic measure”; McDaniel, Col. 6, lines 21-34).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the systems and methods of ambiguity resolution of Azara to incorporate the teachings of McDaniel to include wherein either the first interpretation or the second interpretation includes a sarcastic interpretation based on the acoustic characteristics. “Capturing the semantic and non-semantic characteristics [of a voice] along with the corresponding relationships between the characteristics can enable the IVR to better identify the party's intention,” as recognized by McDaniel. (McDaniel, Col. 46, lines 20-25).

Regarding claim 16, the rejection of claim 15 is incorporated. Claim 16 is substantially the same as claim 3 and is therefore rejected under the same rationale as above.

Claim(s) 4-5 and 17 is/are rejected under 35 U.S.C. 103 as being unpatentable over Azara and McDaniel as applied to claims 1, 3, and 15 above, and further in view of Hu (U.S. Pat. App. Pub. No. 2020/0251097, hereinafter Hu).

Regarding claim 4, the rejection of claim 3 is incorporated. Azara and McDaniel disclose all of the elements of the current invention as stated above. However, Azara and McDaniel fail(s) to expressly recite wherein determining the first interpretation further comprises: using a name entity recognition (NER) system to evaluate at least one Name Entity of the text string and determining the pause information (of the signal speech data) at the at least one word boundary of the Name Entity.
Hu teaches a “named entity recognition method, a named entity recognition device, a named entity recognition equipment and a medium.” (Hu, ¶ [0002]). Regarding claim 4, Hu teaches wherein determining the first interpretation further comprises: using a name entity recognition (NER) system to evaluate at least one Name Entity of the text string (Discloses a named entity recognition system which includes “extracting a voice word feature vector in a voice signal…”; Hu, ¶ [0057]); and determining the pause information (of the signal speech data) at the at least one word boundary of the Name Entity (“the voice word feature vector may include a global sequence number of the word, a start time point of the word, a duration of the pronunciation, a pause time length from a previous word.”; Hu, ¶ [0057]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the systems and methods of ambiguity resolution of Azara, as modified by the semantic and non-semantic audio analysis system of McDaniel, to incorporate the teachings of Hu to include wherein determining the first interpretation further comprises: using a name entity recognition (NER) system to evaluate at least one Name Entity of the text string and determining the pause information (of the signal speech data) at the at least one word boundary of the Name Entity. By incorporating “voice information that is not included in the literalness, such as accent, pause, and intonation…the complex special name's effect on sentence structure determination and entity recognition in special scenarios is solved, precision and accuracy of entity recognition are improved, and the application scope of entity recognition is further enlarged,” as recognized by Hu. (Hu, ¶ [0039]).

Regarding claim 5, the rejection of claim 4 is incorporated. Azara and McDaniel disclose all of the elements of the current invention as stated above. However, Azara and McDaniel fail to expressly recite wherein determining the first interpretation further comprises: determining the at least one Name Entity from among a plurality of Name Entities in the text string, wherein the at least one Name Entity is one of a first word of the Name Entity having a probability that is between a first threshold and a second threshold, or wherein the at least one Name Entity is one of a word following the first word of the Name Entity having a probability that is between a third threshold and a fourth threshold.
The relevance of Hu is described above with relation to claim 4. Regarding claim 5, Hu teaches wherein determining the first interpretation further comprises: determining the at least one Name Entity from among a plurality of Name Entities in the text string, (“The word feature vector of the literalness characterizes each word recognized {at least one Name Entity}” where each is a selection from among a plurality of named entities in a text string; Hu, ¶ [0064]) wherein the at least one Name Entity is one of a first word of the Name Entity having a probability that is between a first threshold and a second threshold, or (“it can also indicate the first {the at least one Name Entity is one of a first word of the Name Entity}, middle, and end words in a phrase by the positive and negative values” where “literalness feature vector (word vector of literalness-word segment embedding vector), splicing can be implemented, for example, by presetting a length of each feature vector,” and where a preset length is between a first threshold and a second threshold; Hu, ¶¶ [0064], [0078]) wherein the at least one Name Entity is one of a word following the first word of the Name Entity having a probability that is between a third threshold and a fourth threshold (“it can also indicate the first, middle {the at least one Name Entity is one of a word following the first word of the Name Entity}, and end words in a phrase by the positive and negative values” where “literalness feature vector (word vector of literalness-word segment embedding vector), splicing can be implemented, for example, by presetting a length of each feature vector,” and where a preset length is between a third threshold and a fourth threshold; Hu, ¶¶ [0064], [0078]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the systems and methods of ambiguity resolution of Azara, as modified by the semantic and non-semantic audio analysis system of McDaniel, to incorporate the teachings of Hu to include wherein determining the first interpretation further comprises: determining the at least one Name Entity from among a plurality of Name Entities in the text string, wherein the at least one Name Entity is one of a first word of the Name Entity having a probability that is between a first threshold and a second threshold, or wherein the at least one Name Entity is one of a word following the first word of the Name Entity having a probability that is between a third threshold and a fourth threshold. By incorporating “voice information that is not included in the literalness, such as accent, pause, and intonation…the complex special name's effect on sentence structure determination and entity recognition in special scenarios is solved, precision and accuracy of entity recognition are improved, and the application scope of entity recognition is further enlarged,” as recognized by Hu. (Hu, ¶ [0039]).

Regarding claim 17, the rejection of claim 16 is incorporated. Claim 17 is substantially the same as claim 4 and is therefore rejected under the same rationale as above.

Claim(s) 6 and 18 is/are rejected under 35 U.S.C. 103 as being unpatentable over Azara, McDaniel and Hu as applied to claims 4 and 17 above, and further in view of Hoffmeister (U.S. Pat. No. 10,388,274, hereinafter Hoffmeister), and Non-Patent Literature to Joshi (Aditya Joshi, Pushpak Bhattacharyya, and Mark J. Carman. 2017. Automatic Sarcasm Detection: A Survey. ACM Comput. Surv. 50, 5, Article 73 (September 2017), 22 pages. DOI:https://doi.org/10.1145/3124420, hereinafter Joshi).

Regarding claim 6, the rejection of claim 4 is incorporated. Azara, McDaniel, and Hu disclose all of the elements of the current invention as stated above. However, Azara fails to expressly recite wherein generating the response comprises: generating a first preliminary response using the NER system; determining a second preliminary response based on a sarcasm evaluation of the audible human speech; and determining a final response based on a ranking of the first and second preliminary responses, wherein the sarcasm evaluation comprises: determining that a text-based sentiment is Positive or Neutral by processing the textual speech data using a text-based sentiment analysis tool; determining that a signal-based sentiment is Negative by processing the signal speech data using a signal-based sentiment analysis tool; and detecting sarcasm based on the text-based sentiment being Positive or Neutral while the signal-based sentiment is Negative.
The relevance of Hu is described above with relation to claim 4. Regarding claim 6, Hu further teaches wherein generating the response comprises: generating a first preliminary response using the NER system (Discloses recognizing a named entity using NER, thus generating a first preliminary response where “The word feature vector of the literalness characterizes each word recognized”; Hu, ¶¶ [0064], [0087]; FIG. 7).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the systems and methods of ambiguity resolution of Azara, as modified by the semantic and non-semantic audio analysis system of McDaniel, and as modified by the systems and methods for named entity recognition of Hu, to further incorporate the teachings of Hu to include wherein generating the response comprises: generating a first preliminary response using the NER system. By incorporating “voice information that is not included in the literalness, such as accent, pause, and intonation…the complex special name's effect on sentence structure determination and entity recognition in special scenarios is solved, precision and accuracy of entity recognition are improved, and the application scope of entity recognition is further enlarged,” as recognized by Hu. (Hu, ¶ [0039]). However, Azara and Hu fail to expressly recite determining a second preliminary response based on a sarcasm evaluation of the audible human speech; and determining a final response based on a ranking of the first and second preliminary responses, wherein the sarcasm evaluation comprises: determining that a text-based sentiment is Positive or Neutral by processing the textual speech data using a text-based sentiment analysis tool; determining that a signal-based sentiment is Negative by processing the signal speech data using a signal-based sentiment analysis tool; and detecting sarcasm based on the text-based sentiment being Positive or Neutral while the signal-based sentiment is Negative.
The relevance of McDaniel is described above with relation to claim 1. Regarding claim 6, McDaniel teaches determining a second preliminary response based on a sarcasm evaluation of the audible human speech (Discloses systems and methods to “capture and display the semantic and non-semantic characteristics of an audio” by using both text and prosodic features from the audio, where “prosodic features are extracted from the words and inter-word boundaries of the segment in various embodiments... [and where] prosodic features may reflect various characteristics of a speaker such as, for example, the emotional state of the speaker, the form of an utterance (whether a statement, question, or command), the presence of irony or sarcasm, and emphasis, contrast, and focus.” and where “the result of the semantic model... and the result of the non-semantic model... are combined using an ensemble {determining a second preliminary response}”; McDaniel, ¶¶ Col. 8, lines 30-32; Col. 6, lines 20-22, 26-31; Col. 32, lines 5-14); wherein the sarcasm evaluation comprises: determining that a text-based sentiment is Positive or Neutral by processing the textual speech data using a text-based sentiment analysis tool (“Next, the identify emotions module applies an emotion lexicon to the word [including]... two sentiments: negative and positive...{determining positive... by processing textual speech data} to identify any emotions associated with the word.{text based sentiment analysis tool}” and “a class is used to indicate when no emotion is identified for a particular utterance segment. For instance, the class “neutral” may be assigned to an utterance segment in which no emotion has been identified as expressed in the segment” {determining...neutral by processing textual speech data}; McDaniel, ¶¶ Col. 28, line 64, Col. 29, line 7; Col. 29, lines 30-35); determining that a signal-based sentiment is Negative by processing the signal speech data using a signal-based sentiment analysis tool (“Prosodic features are normally identified as either an auditory measure or an acoustic measure,” as such the sarcasm evaluation can be text-based {auditory measure} or signal-based {acoustic measure}. Thus, the signal speech data can be determined to be positive, negative, or neutral.; McDaniel, ¶¶ Col. 6, lines 31-33).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the systems and methods of ambiguity resolution of Azara as modified by the semantic and non-semantic audio analysis system of McDaniel, and as modified by the systems and methods for named entity recognition of Hu, to further incorporate the teachings of McDaniel to include determining a second preliminary response based on a sarcasm evaluation of the audible human speech wherein the sarcasm evaluation comprises: determining that a text-based sentiment is Positive or Neutral by processing the textual speech data using a text-based sentiment analysis tool, and determining that a signal-based sentiment is Negative by processing the signal speech data using a signal-based sentiment analysis tool. “Capturing the semantic and non-semantic characteristics [of a voice] along with the corresponding relationships between the characteristics can enable the IVR to better identify the party's intention,” as recognized by McDaniel. (McDaniel, Col. 46, lines 20-25). However, Azara, McDaniel, and Hu fail to expressly recite determining a final response based on a ranking of the first and second preliminary responses and detecting sarcasm based on the text-based sentiment being Positive or Neutral while the signal-based sentiment is Negative.
Hoffmeister teaches systems and methods for confidence checking in ASR systems. (Hoffmeister, Col. 3, lines 30-35). Regarding claim 6, Hoffmeister teaches determining a final response based on a ranking of the first and second preliminary responses (“A device configured for NLU processing may include a named entity recognition (NER) module 252 and intent classification (IC) module 264, a result ranking and distribution module 266,” where “The responses based on the query produced by each set of models is scored, with the overall highest ranked result from all applied domains is ordinarily selected to be the correct result.”; Hoffmeister, ¶¶ Col. 8, lines 20-26; Col. 9, lines 40-45).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the systems and methods of ambiguity resolution of Azara as modified by the semantic and non-semantic audio analysis system of McDaniel, and as modified by the systems and methods for named entity recognition of Hu, to incorporate the teachings of Hoffmeister to include determining a final response based on a ranking of the first and second preliminary responses. The systems and methods described in Hoffmeister “improve the ability of the system to answer user queries by expanding the information available to the system.” (Hoffmeister, Col. 3, lines 25-30). However, Azara, McDaniel, Hu, and Hoffmeister fail to expressly recite detecting sarcasm based on the text-based sentiment being Positive or Neutral while the signal-based sentiment is Negative.
Joshi teaches systems and methods for automatic sarcasm detection. (Joshi, Abstract). Regarding claim 6, Joshi teaches detecting sarcasm based on the text-based sentiment being Positive or Neutral while the signal-based sentiment is Negative (“Sarcasm has a negative implied sentiment,” thus, read in light of McDaniel, sarcasm has negative non-semantic characteristics {signal-based sentiment is Negative} “but may not have a negative surface sentiment,” and similarly, sarcasm has non-negative semantic characteristics {text-based sentiment being positive or neutral}.; Joshi, ¶¶ Pg. 1, para. 1).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the systems and methods of ambiguity resolution of Azara, as modified by the semantic and non-semantic audio analysis system of McDaniel, as modified by the systems and methods for named entity recognition of Hu, and as modified by the ASR confidence checking of Hoffmeister to incorporate the teachings of Joshi to include detecting sarcasm based on the text-based sentiment being Positive or Neutral while the signal-based sentiment is Negative. Detecting sarcasm can “improve the performance of sentiment classification” thus giving a better understanding of the speaker’s intent, as recognized by Joshi. (Joshi, pg. 17, para. 4).
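For illustration only, the sarcasm evaluation and ranking limitations of claim 6 can be sketched as follows. This is a minimal reading of the claim language under stated assumptions: the sentiment labels are assumed to arrive as lowercase strings from separate text-based and signal-based tools, and the `Candidate` type, its `score` field, and both function names are hypothetical. Nothing here is taken from McDaniel, Hoffmeister, or Joshi.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A preliminary response with a hypothetical ranking score."""
    text: str
    score: float


def detect_sarcasm(text_sentiment: str, signal_sentiment: str) -> bool:
    """Sarcasm is flagged when the text reads Positive or Neutral
    while the acoustic signal reads Negative."""
    return text_sentiment in {"positive", "neutral"} and signal_sentiment == "negative"


def final_response(first: Candidate, second: Candidate) -> Candidate:
    """Rank the two preliminary responses and return the higher-scored one."""
    return max([first, second], key=lambda c: c.score)


# Hypothetical usage: the words say "positive" but the tone says "negative".
sarcastic = detect_sarcasm("positive", "negative")
chosen = final_response(
    Candidate("Glad you liked it.", 0.35),
    Candidate("Sorry that went badly.", 0.65 if sarcastic else 0.10),
)
```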

Regarding claim 18, the rejection of claim 17 is incorporated. Claim 18 is substantially the same as claim 6 and is therefore rejected under the same rationale as above.

Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over Azara, McDaniel, Hu, Hoffmeister, and Joshi as applied to claim 6 above, and further in view of Deoras (U.S. Pat. App. Pub. No. 2015/0066496, hereinafter Deoras).

Regarding claim 7, the rejection of claim 6 is incorporated. Azara, Hu, McDaniel, Hoffmeister, and Joshi disclose all of the elements of the current invention as stated above. However, Azara, Hu, Hoffmeister, and Joshi fail to expressly recite wherein the second preliminary response is determined using an end-to-end neural network, wherein, when sarcasm is detected, an input to the neural network comprises a sarcasm token and a one-hot vector which represents that the audible human speech comprises sarcasm.
The relevance of McDaniel is described above with relation to claim 1. Regarding claim 7, McDaniel teaches wherein the second preliminary response is determined using… [a] neural network, (Discloses detecting sarcasm and using a neural network in response generation.; McDaniel, ¶¶ Col. 32, lines 53-60) wherein, when sarcasm is detected, an input to the neural network comprises... [the sarcasm result] (Describes using “base classifiers (e.g., the semantic model and the non-semantic model) and then using another classifier to combine their predictions... [where] the semantic and non-semantic models may be combined into a neural network built (trained, validated, and tested) using utterance segments annotated with emotions.”; McDaniel, ¶¶ Col. 32, lines 53-65).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the systems and methods of ambiguity resolution of Azara as modified by the semantic and non-semantic audio analysis system of McDaniel, as modified by the systems and methods for named entity recognition of Hu, as modified by the ASR confidence checking of Hoffmeister, and as modified by the automatic sarcasm detection systems of Joshi, to further incorporate the teachings of McDaniel to include wherein the second preliminary response is determined using… [a] neural network, wherein, when sarcasm is detected, an input to the neural network comprises... [the sarcasm result]. “Capturing the semantic and non-semantic characteristics [of a voice] along with the corresponding relationships between the characteristics can enable the IVR to better identify the party's intention,” as recognized by McDaniel. (McDaniel, Col. 46, lines 20-25). However, Azara, Hu, Hoffmeister, Joshi, and McDaniel fail to expressly recite wherein the neural network is an end-to-end neural network, and … the input to the neural network comprises a sarcasm token and a one-hot vector which represents that the audible human speech comprises sarcasm.
Deoras teaches systems and methods for “assignment of semantic labels to words in a natural language utterance.” (Deoras, ¶ [0004]). Regarding claim 7, Deoras teaches wherein the second preliminary response is determined using an end-to-end neural network, (The output is produced using “a spatio-temporally deep neural network… [which] combines features of both a DNN and a RNN {end-to-end neural network}.”; Deoras, ¶¶ [0058]) wherein, when sarcasm is detected, an input to the neural network comprises a sarcasm token and a one-hot vector which represents that the audible human speech comprises sarcasm (“the input layer of the RNN 128 can be... a “one-hot” representation” where “the semantic feature extractor component” can output a “sequence of tokens” which are received by the STDNN, where “semantic labels” are assigned “to tokens in a sequence of tokens”; Deoras, ¶¶ [0037], [0055], [0058], [0024]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the systems and methods of ambiguity resolution of Azara as modified by the semantic and non-semantic audio analysis system of McDaniel, as modified by the systems and methods for named entity recognition of Hu, as modified by the ASR confidence checking of Hoffmeister, and as modified by the automatic sarcasm detection systems of Joshi to incorporate the teachings of Deoras to include wherein the second preliminary response is determined using an end-to-end neural network, wherein, when sarcasm is detected, an input to the neural network comprises a sarcasm token and a one-hot vector which represents that the audible human speech comprises sarcasm. The systems and methods described in Deoras allow for better generalization on complex combinations of patterns. (Deoras, ¶ [0002], [0006]).
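For illustration only, the input limitation of claim 7 (a sarcasm token together with a one-hot vector indicating sarcasm) can be sketched as below. The marker string, the two-class vector layout, and the function name are assumptions made for this sketch; they are not taken from Deoras, McDaniel, or the application, and no network is actually constructed.

```python
from typing import List, Tuple

SARCASM_TOKEN = "<sarcasm>"  # hypothetical marker string


def build_network_input(tokens: List[str], sarcasm_detected: bool) -> Tuple[List[str], List[int]]:
    """Prepend a sarcasm token when sarcasm is detected and return a
    two-class one-hot vector: [1, 0] = not sarcastic, [0, 1] = sarcastic."""
    one_hot = [0, 1] if sarcasm_detected else [1, 0]
    sequence = list(tokens)  # copy so the caller's list is not mutated
    if sarcasm_detected:
        sequence.insert(0, SARCASM_TOKEN)
    return sequence, one_hot


# Hypothetical usage with a two-word utterance.
sequence, indicator = build_network_input(["great", "job"], sarcasm_detected=True)
```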

Claims 8-10 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Azara and McDaniel as applied to claim 3 above, and further in view of Kershaw (U.S. Pat. App. Pub. No. 2019/0171660, hereinafter Kershaw).

Regarding claim 8, the rejection of claim 3 is incorporated. Azara and McDaniel disclose all of the elements of the current invention as stated above. However, Azara and McDaniel fail to expressly recite wherein determining the first interpretation further comprises: identifying a first word boundary and a second word boundary using a chunking analysis.
Kershaw teaches systems and methods for “analyzing provided text representing conversations … for sentiments and performing categorizations.” (Kershaw, ¶ [0002]). Regarding claim 8, Kershaw teaches wherein determining the first interpretation further comprises: identifying a first word boundary and a second word boundary using a chunking analysis (“A chunk parser 410 is capable of and responsible for partitioning provided buffered input into chunks”; Kershaw, ¶¶ [0051]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the systems and methods of ambiguity resolution of Azara as modified by the semantic and non-semantic audio analysis system of McDaniel, to incorporate the teachings of Kershaw to include wherein determining the first interpretation further comprises: identifying a first word boundary and a second word boundary using a chunking analysis. The “improved categorization and sentiment analysis” overcomes the deficiencies in the prior art regarding both efficiency and effectiveness for achieving an acceptable level of accuracy, as recognized by Kershaw. (Kershaw, ¶ [0023], [0026]).

Regarding claim 9, the rejection of claim 8 is incorporated. Azara and McDaniel disclose all of the elements of the current invention as stated above. However, Azara and McDaniel fail to expressly recite wherein determining the first interpretation further comprises: analyzing the first and second word boundaries using a classification algorithm.
The relevance of Kershaw is described above with relation to claim 8. Regarding claim 9, Kershaw teaches wherein determining the first interpretation further comprises: analyzing the first and second word boundaries using a classification algorithm (chunks are “determined by e.g. parts of speech (PoS) parsing and may be configured to ignore certain words, characters, or sentences based on content, according to a preferred aspect.”; Kershaw, ¶¶ [0051]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the systems and methods of ambiguity resolution of Azara, as modified by the semantic and non-semantic audio analysis system of McDaniel, to incorporate the teachings of Kershaw to include wherein determining the first interpretation further comprises: analyzing the first and second word boundaries using a classification algorithm. The “improved categorization and sentiment analysis” overcomes the deficiencies in the prior art regarding both efficiency and effectiveness for achieving an acceptable level of accuracy, as recognized by Kershaw. (Kershaw, ¶ [0023], [0026]).

Regarding claim 10, the rejection of claim 9 is incorporated. Azara and McDaniel disclose all of the elements of the current invention as stated above. However, Azara and McDaniel fail to expressly recite wherein determining the first interpretation further comprises: determining a binary prediction that either the first word boundary or the second word boundary is most accurate.
The relevance of Kershaw is described above with relation to claim 8. Regarding claim 10, Kershaw teaches wherein determining the first interpretation further comprises: determining a binary prediction that either the first word boundary or the second word boundary is most accurate (“Input chunks are received from a deterministic rules engine 415 and fed into a chunk to embedding sequence reducer 610 which reduces the chunk further into a possibly reduced “sequence” of words (e.g. by selecting only the nouns in the sequence or up to and including the whole chunk).” The prediction is binary in that either the original sequence boundaries are correct {the first word boundary}, or the sequence can be further reduced {the second word boundary}, where the embedding sequence reducer determines which of the two is more accurate. The binary nature of the decision is further clarified in the example, “‘North America’ has a distinct meaning to simply the presence of the words ‘America’ and ‘North’.”; Kershaw, ¶¶ [0053]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the systems and methods of ambiguity resolution of Azara, as modified by the semantic and non-semantic audio analysis system of McDaniel, to incorporate the teachings of Kershaw to include wherein determining the first interpretation further comprises: determining a binary prediction that either the first word boundary or the second word boundary is most accurate. The “improved categorization and sentiment analysis” overcomes the deficiencies in the prior art regarding both efficiency and effectiveness for achieving an acceptable level of accuracy, as recognized by Kershaw. (Kershaw, ¶ [0023], [0026]).
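For illustration only, the binary prediction of claim 10 (choosing between two candidate word boundaries) can be sketched as below, using the “North America” example quoted from Kershaw. The phrase lexicon stands in for a classification algorithm and is an assumption of this sketch, as are the `Boundary` type and the function name; none of it is Kershaw's implementation.

```python
from typing import List, Set, Tuple

Boundary = Tuple[int, int]  # (start, end) word indices, end exclusive


def pick_boundary(
    words: List[str],
    first: Boundary,
    second: Boundary,
    known_phrases: Set[str],  # hypothetical lexicon standing in for a trained classifier
) -> Boundary:
    """Binary prediction: return whichever boundary spans a known phrase,
    preferring the first boundary when the scores tie."""
    first_span = " ".join(words[first[0]:first[1]])
    second_span = " ".join(words[second[0]:second[1]])
    first_score = 1.0 if first_span in known_phrases else 0.0
    second_score = 1.0 if second_span in known_phrases else 0.0
    return first if first_score >= second_score else second


# Hypothetical usage: the two-word chunk "North America" versus the reduced chunk "America".
words = ["visit", "North", "America"]
best = pick_boundary(words, first=(1, 3), second=(2, 3), known_phrases={"North America"})
```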

Regarding claim 19, the rejection of claim 16 is incorporated. Claim 19 is substantially the same as claim 8 and is therefore rejected under the same rationale as above.

Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Azara, McDaniel, and Kershaw as applied to claim 10 above, and further in view of Hoffmeister and Joshi.

Regarding claim 11, the rejection of claim 10 is incorporated. Azara, McDaniel, and Kershaw disclose all of the elements of the current invention as stated above. However, Azara fails to expressly recite wherein generating the response comprises: generating a first preliminary response based on the chunking analysis; determining a second preliminary response based on a sarcasm evaluation of the audible human speech; and determining a final response based on a ranking of the first and second preliminary responses, wherein the sarcasm evaluation comprises: determining that a text-based sentiment is Positive or Neutral by processing the textual speech data using a text-based sentiment analysis tool; determining that a signal-based sentiment is Negative by processing the signal speech data using a signal-based sentiment analysis tool; and detecting sarcasm based on the text-based sentiment being Positive or Neutral while the signal-based sentiment is Negative.
The relevance of Kershaw is described above with relation to claim 8. Regarding claim 11, Kershaw teaches wherein generating the response comprises: generating a first preliminary response based on the chunking analysis (“The resulting embedding sequence of words is then sent to the sequence embedder 620, which will embed each input word sequence into a high dimensional vector according to the chosen sequence embedding model (e.g. Phrase2Vec as discussed above) which provides a numeric form of measuring the semantic meanings of a particular word sequence, to be used to determine the category...[by] A semantic distance comparator 630 {generating a first preliminary response}”; Kershaw, ¶¶ [0053]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the systems and methods of ambiguity resolution of Azara, as modified by the semantic and non-semantic audio analysis system of McDaniel, to incorporate the teachings of Kershaw to include wherein generating the response comprises: generating a first preliminary response based on the chunking analysis. The “improved categorization and sentiment analysis” overcomes the deficiencies in the prior art regarding both efficiency and effectiveness for achieving an acceptable level of accuracy, as recognized by Kershaw. (Kershaw, ¶ [0023], [0026]). However, Azara and Kershaw fail to expressly recite determining a second preliminary response based on a sarcasm evaluation of the audible human speech; and determining a final response based on a ranking of the first and second preliminary responses, wherein the sarcasm evaluation comprises: determining that a text-based sentiment is Positive or Neutral by processing the textual speech data using a text-based sentiment analysis tool; determining that a signal-based sentiment is Negative by processing the signal speech data using a signal-based sentiment analysis tool; and detecting sarcasm based on the text-based sentiment being Positive or Neutral while the signal-based sentiment is Negative.
The relevance of McDaniel is described above with relation to claim 1. Regarding claim 11, McDaniel teaches determining a second preliminary response based on a sarcasm evaluation of the audible human speech (Discloses systems and methods to “capture and display the semantic and non-semantic characteristics of an audio” by using both text and prosodic features from the audio, where “prosodic features are extracted from the words and inter-word boundaries of the segment in various embodiments... [and where] prosodic features may reflect various characteristics of a speaker such as, for example, the emotional state of the speaker, the form of an utterance (whether a statement, question, or command), the presence of irony or sarcasm, and emphasis, contrast, and focus.” and where “the result of the semantic model... and the result of the non-semantic model... are combined using an ensemble {determining a second preliminary response}”; McDaniel, ¶¶ Col. 8, lines 30-32; Col. 6, lines 20-22, 26-31; Col. 32, lines 5-14); wherein the sarcasm evaluation comprises: determining that a text-based sentiment is Positive or Neutral by processing the textual speech data using a text-based sentiment analysis tool (“Next, the identify emotions module applies an emotion lexicon to the word [including]... two sentiments: negative and positive...{determining positive... by processing textual speech data} to identify any emotions associated with the word.{text based sentiment analysis tool}” and “a class is used to indicate when no emotion is identified for a particular utterance segment. For instance, the class “neutral” may be assigned to an utterance segment in which no emotion has been identified as expressed in the segment” {determining...neutral by processing textual speech data}; McDaniel, ¶¶ Col. 28, line 64, Col. 29, line 7; Col. 29, lines 30-35); determining that a signal-based sentiment is Negative by processing the signal speech data using a signal-based sentiment analysis tool (“Prosodic features are normally identified as either an auditory measure or an acoustic measure,” as such the sarcasm evaluation can be text-based {auditory measure} or signal-based {acoustic measure}. Thus, the signal speech data can be determined to be positive, negative, or neutral.; McDaniel, ¶¶ Col. 6, lines 31-33).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the systems and methods of ambiguity resolution of Azara, as modified by the semantic and non-semantic audio analysis system of McDaniel, and as modified by the systems and methods for sentiment analysis of Kershaw, to further incorporate the teachings of McDaniel to include determining a second preliminary response based on a sarcasm evaluation of the audible human speech wherein the sarcasm evaluation comprises: determining that a text-based sentiment is Positive or Neutral by processing the textual speech data using a text-based sentiment analysis tool, and determining that a signal-based sentiment is Negative by processing the signal speech data using a signal-based sentiment analysis tool. “Capturing the semantic and non-semantic characteristics [of a voice] along with the corresponding relationships between the characteristics can enable the IVR to better identify the party's intention,” as recognized by McDaniel. (McDaniel, Col. 46, lines 20-25). However, Azara, McDaniel, and Kershaw fail to expressly recite determining a final response based on a ranking of the first and second preliminary responses and detecting sarcasm based on the text-based sentiment being Positive or Neutral while the signal-based sentiment is Negative.
The relevance of Hoffmeister is described above with relation to claim 6. Regarding claim 11, Hoffmeister teaches determining a final response based on a ranking of the first and second preliminary responses (“A device configured for NLU processing may include a named entity recognition (NER) module 252 and intent classification (IC) module 264, a result ranking and distribution module 266,” where “The responses based on the query produced by each set of models is scored, with the overall highest ranked result from all applied domains is ordinarily selected to be the correct result.”; Hoffmeister, ¶¶ Col. 8, lines 20-26; Col. 9, lines 40-45).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the systems and methods of ambiguity resolution of Azara, as modified by the semantic and non-semantic audio analysis system of McDaniel, and as modified by the systems and methods for sentiment analysis of Kershaw, to incorporate the teachings of Hoffmeister to include determining a final response based on a ranking of the first and second preliminary responses. The systems and methods described in Hoffmeister “improve the ability of the system to answer user queries by expanding the information available to the system.” (Hoffmeister, Col. 3, lines 25-30). However, Azara, McDaniel, Kershaw, and Hoffmeister fail to expressly recite detecting sarcasm based on the text-based sentiment being Positive or Neutral while the signal-based sentiment is Negative.
The relevance of Joshi is described above with relation to claim 6. Regarding claim 11, Joshi teaches detecting sarcasm based on the text-based sentiment being Positive or Neutral while the signal-based sentiment is Negative (“Sarcasm has a negative implied sentiment,” thus, read in light of McDaniel, sarcasm has negative non-semantic characteristics {signal-based sentiment is Negative} “but may not have a negative surface sentiment,” and similarly, sarcasm has non-negative semantic characteristics {text-based sentiment being positive or neutral}.; Joshi, ¶¶ Pg. 1, para. 1).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the systems and methods of ambiguity resolution of Azara, as modified by the semantic and non-semantic audio analysis system of McDaniel, as modified by the systems and methods for sentiment analysis of Kershaw, and as modified by the ASR confidence checking of Hoffmeister, to incorporate the teachings of Joshi to include detecting sarcasm based on the text-based sentiment being Positive or Neutral while the signal-based sentiment is Negative. Detecting sarcasm can “improve the performance of sentiment classification” thus giving a better understanding of the speaker’s intent, as recognized by Joshi. (Joshi, pg. 17, para. 4).

Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Azara, McDaniel, Kershaw, Hoffmeister, and Joshi as applied to claim 11 above, and further in view of Deoras.

Regarding claim 12, the rejection of claim 11 is incorporated. Azara, McDaniel, Kershaw, Hoffmeister, and Joshi disclose all of the elements of the current invention as stated above. However, Azara, Kershaw, Hoffmeister, and Joshi fail to expressly recite wherein the second preliminary response is determined using an end-to-end neural network, wherein, when sarcasm is detected, an input to the neural network comprises a sarcasm token and a one-hot vector which represents that the audible human speech comprises sarcasm.
The relevance of McDaniel is described above with relation to claim 1. Regarding claim 12, McDaniel teaches wherein the second preliminary response is determined using… [a] neural network, (Discloses detecting sarcasm and using a neural network in response generation.; McDaniel, ¶¶ Col. 32, lines 53-60) wherein, when sarcasm is detected, an input to the neural network comprises... [the sarcasm result] (Describes using “base classifiers (e.g., the semantic model and the non-semantic model) and then using another classifier to combine their predictions... [where] the semantic and non-semantic models may be combined into a neural network built (trained, validated, and tested) using utterance segments annotated with emotions.”; McDaniel, ¶¶ Col. 32, lines 53-65).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the systems and methods of ambiguity resolution of Azara, as modified by the semantic and non-semantic audio analysis system of McDaniel, as modified by the systems and methods for sentiment analysis of Kershaw, as modified by the ASR confidence checking of Hoffmeister, and by the automatic sarcasm detection systems of Joshi to further incorporate the teachings of McDaniel to include wherein the second preliminary response is determined using… [a] neural network, wherein, when sarcasm is detected, an input to the neural network comprises... [the sarcasm result]. “Capturing the semantic and non-semantic characteristics [of a voice] along with the corresponding relationships between the characteristics can enable the IVR to better identify the party's intention,” as recognized by McDaniel. (McDaniel, Col. 46, lines 20-25). However, Azara, McDaniel, Kershaw, Hoffmeister, and Joshifail to expressly recite wherein the neural network is an end-to-end neural network, and…. the input to the neural network comprises a sarcasm token and a one-hot vector which represents that the audible human speech comprises sarcasm.
The relevance of Deoras is described above with relation to claim 6. Regarding claim 12, Deoras teaches wherein the second preliminary response is determined using an end-to-end neural network, (The output is produced using “a spatio-temporally deep neural network… [which] combines features of both a DNN and a RNN {end-to-end neural network}.”; Deoras, ¶¶ [0058]) wherein, when sarcasm is detected, an input to the neural network comprises a sarcasm token and a one-hot vector which represents that the audible human speech comprises sarcasm (“the input layer of the RNN 128 can be... a “one-hot” representation” where “the semantic feature extractor component” can output a “sequence of tokens” which are received by the STDNN, where “semantic labels” are assigned “to tokens in a sequence of tokens”; Deoras, ¶¶ [0037], [0055], [0058], [0024]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the systems and methods of ambiguity resolution of Azara, as modified by the semantic and non-semantic audio analysis system of McDaniel, by the systems and methods for sentiment analysis of Kershaw, by the ASR confidence checking of Hoffmeister, and by the automatic sarcasm detection systems of Joshi, to incorporate the teachings of Deoras to include wherein the second preliminary response is determined using an end-to-end neural network, wherein, when sarcasm is detected, an input to the neural network comprises a sarcasm token and a one-hot vector which represents that the audible human speech comprises sarcasm. The systems and methods described in Deoras allow for better generalization on complex combinations of patterns. (Deoras, ¶¶ [0002], [0006]).
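For illustration only, the claimed input to the neural network (a sarcasm token together with a one-hot vector indicating sarcasm) can be sketched in Python as follows. This sketch is not taken from Deoras, McDaniel, or the application. The names `SARCASM_TOKEN`, `LABELS`, `one_hot`, and `build_network_input` are hypothetical, as are the choice to prepend the token and the handling of the case where no sarcasm is detected.

```python
from typing import Dict, List, Sequence

# Hypothetical marker token and label set; neither is recited in the cited art.
SARCASM_TOKEN = "<sarcasm>"
LABELS = ("not_sarcastic", "sarcastic")


def one_hot(label: str, labels: Sequence[str] = LABELS) -> List[int]:
    """Return a vector with a single 1 at the position of `label`."""
    if label not in labels:
        raise ValueError(f"unknown label: {label}")
    return [1 if candidate == label else 0 for candidate in labels]


def build_network_input(tokens: Sequence[str], sarcasm_detected: bool) -> Dict[str, list]:
    """Assemble a hypothetical network input from recognized tokens.

    When sarcasm is detected, the token sequence is prefixed with the
    sarcasm token and the one-hot vector marks the "sarcastic" label.
    The behavior when no sarcasm is detected is an assumption.
    """
    if sarcasm_detected:
        return {
            "tokens": [SARCASM_TOKEN, *tokens],
            "sarcasm_one_hot": one_hot("sarcastic"),
        }
    return {
        "tokens": list(tokens),
        "sarcasm_one_hot": one_hot("not_sarcastic"),
    }
```

Under these assumptions, `build_network_input(["great", "job"], True)` yields the tokens `["<sarcasm>", "great", "job"]` and the one-hot vector `[0, 1]`.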

Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Azara and McDaniel as applied to claim 1, and in further view of Kershaw.

Regarding claim 14, the rejection of claim 1 is incorporated. Azara and McDaniel disclose all of the elements of the current invention as stated above. Azara further discloses wherein the audible human speech is received via or the response is generated via one of: a table-top device (“The speech information may be acquired from a lavaliere microphone, a microphone array or any other natural language input device.”; Azara, ¶¶ [0056]). However, Azara and McDaniel fail to expressly recite a kiosk, a mobile device, a vehicle, or a robotic machine.
The relevance of Kershaw is described above with relation to claim 8. Regarding claim 14, Kershaw teaches a kiosk, a mobile device, a vehicle, or a robotic machine (“It should be appreciated that some or all components illustrated may be combined, such as in various integrated applications, for example Qualcomm or Samsung system-on-a-chip (SOC) devices, or whenever it may be appropriate to combine multiple capabilities or functions into a single hardware device (for instance, in mobile devices such as smartphones, video game consoles, in-vehicle computer systems such as navigation or multimedia systems in automobiles {a vehicle}, or other integrated hardware devices).” as well as for “autonomous operation of vehicles {robotic machine}”; Kershaw, ¶¶ [0074], [0040]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the systems and methods of ambiguity resolution of Azara, as modified by the semantic and non-semantic audio analysis system of McDaniel, to incorporate the teachings of Kershaw to include a kiosk, a mobile device, a vehicle, or a robotic machine. The “improved categorization and sentiment analysis” overcomes the deficiencies in the prior art regarding both efficiency and effectiveness for achieving an acceptable level of accuracy, as recognized by Kershaw. (Kershaw, ¶¶ [0023], [0026]).

Claim 20 is rejected under 35 U.S.C. 103 as being unpatentable over Azara in view of McDaniel and Joshi.

Regarding claim 20, Azara discloses A method of response generation, comprising (Discloses systems and methods for “resolv[ing] ambiguities… before the command can be properly executed {for response generation}”; Azara, ¶¶ [0037]): receiving audible human speech (“The speech information {audible human speech} may be obtained from...human-computer commands, human-computer dictation {from a user} and the like.”; Azara, ¶¶ [0050]); determining textual speech data based on the audible human speech (“In step S20, the speech information is determined. The speech information may be obtained from any source of natural language information… [and] is recognized using an automatic speech recognition system.”; Azara, ¶¶ [0050]); extracting, from the audible human speech, signal speech data that is indicative of acoustic characteristics which correspond to the textual speech data (“After the speech information has been recognized, control continues to step S30 where the prosodic features are determined.”; Azara, ¶¶ [0050]); and generating a response to the audible human speech, wherein the response accounts for the ambiguation (With relation to the alternate possible meanings, Azara discloses “The presence of more than one set of candidate discourse functions reflects alternate possible meanings associated with the speech information. Thus, if the recognized speech contains an ambiguity, the candidate discourse functions include the alternate candidate sets of discourse functions corresponding to the identified ambiguities,” where the candidate discourse functions are generating a response to the audible human speech. Further, by acknowledging and providing alternative responses based on the ambiguity, the system accounts for the ambiguation; Azara, ¶ [0048]). 
However, Azara fails to expressly recite using a text-based sentiment analysis tool, determining that a sentiment analysis of the textual speech data is Positive or Neutral; using a signal-based sentiment analysis tool, determining that a sentiment analysis of the signal speech data is Negative; and based on the sentiment analyses of the textual and signal speech data, determining that the audible human speech comprises sarcasm.
The relevance of McDaniel is described above with relation to claim 1. Regarding claim 20, McDaniel teaches using a text-based sentiment analysis tool, determining that a sentiment analysis of the textual speech data is Positive or Neutral (“Next, the identify emotions module applies an emotion lexicon to the word [including]... two sentiments: negative and positive...{determining positive... by processing textual speech data} to identify any emotions associated with the word {text-based sentiment analysis tool}” and “a class is used to indicate when no emotion is identified for a particular utterance segment. For instance, the class “neutral” may be assigned to an utterance segment in which no emotion has been identified as expressed in the segment” {determining...neutral by processing textual speech data}; McDaniel, ¶¶ Col. 28, line 64, Col. 29, line 7; Col. 29, lines 30-35); using a signal-based sentiment analysis tool, determining that a sentiment analysis of the signal speech data is Negative (“Prosodic features are normally identified as either an auditory measure or an acoustic measure,” as such the sarcasm evaluation can be text-based {auditory measure} or signal-based {acoustic measure}. Thus, the identify emotions module can determine the signal speech data to be positive, negative, or neutral; McDaniel, ¶¶ Col. 6, lines 31-33).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the systems and methods of ambiguity resolution of Azara to incorporate the teachings of McDaniel to include using a text-based sentiment analysis tool, determining that a sentiment analysis of the textual speech data is Positive or Neutral; and using a signal-based sentiment analysis tool, determining that a sentiment analysis of the signal speech data is Negative. “Capturing the semantic and non-semantic characteristics [of a voice] along with the corresponding relationships between the characteristics can enable the IVR to better identify the party's intention,” as recognized by McDaniel. (McDaniel, Col. 46, lines 20-25). However, Azara and McDaniel fail to expressly recite based on the sentiment analyses of the textual and signal speech data, determining that the audible human speech comprises sarcasm.
The relevance of Joshi is described above with relation to claim 6. Regarding claim 20, Joshi teaches based on the sentiment analyses of the textual and signal speech data, determining that the audible human speech comprises sarcasm (“Sarcasm has a negative implied sentiment,” thus, read in light of McDaniel, sarcasm has negative non-semantic characteristics {signal-based sentiment is Negative} “but may not have a negative surface sentiment,” and similarly, sarcasm has non-negative semantic characteristics {text-based sentiment being positive or neutral}. Thus, it would be understood that when the text has a non-negative surface sentiment and the implied sentiment is negative, the user input is sarcastic.; Joshi, ¶¶ Pg. 1, para. 1).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the systems and methods of ambiguity resolution of Azara, as modified by the semantic and non-semantic audio analysis system of McDaniel, to incorporate the teachings of Joshi to include based on the sentiment analyses of the textual and signal speech data, determining that the audible human speech comprises sarcasm. Detecting sarcasm can “improve the performance of sentiment classification” thus giving a better understanding of the speaker’s intent, as recognized by Joshi. (Joshi, pg. 17, para. 4).
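For illustration only, the determination recited in claim 20 (textual sentiment that is Positive or Neutral, combined with signal sentiment that is Negative, indicating sarcasm) can be sketched in Python as a simple decision rule. This sketch is not an implementation from Azara, McDaniel, or Joshi. The names `Sentiment` and `comprises_sarcasm` are hypothetical, and the two sentiment labels are assumed to have already been produced by separate text-based and signal-based analysis tools.

```python
from enum import Enum


class Sentiment(Enum):
    """Sentiment classes named in the claim language."""
    POSITIVE = "Positive"
    NEUTRAL = "Neutral"
    NEGATIVE = "Negative"


def comprises_sarcasm(text_sentiment: Sentiment, signal_sentiment: Sentiment) -> bool:
    """Hypothetical rule: flag sarcasm when the text-based sentiment is
    non-negative (Positive or Neutral) and the signal-based sentiment
    is Negative."""
    text_non_negative = text_sentiment in (Sentiment.POSITIVE, Sentiment.NEUTRAL)
    return text_non_negative and signal_sentiment is Sentiment.NEGATIVE
```

Under this rule, an utterance whose text is scored Positive while its acoustic signal is scored Negative would be flagged as sarcastic, whereas an utterance scored Negative on both analyses would not.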

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Johnson et al. (U.S. Pat. App. Pub. No. 2021/0082414) discloses systems and methods for dialog processing using contextual data including the identification of a term from an utterance for disambiguation.

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Sean E. Serraguard whose telephone number is (313)446-6627. The examiner can normally be reached 07:00-17:00 M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel C. Washburn, can be reached on (571) 272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/Sean E Serraguard/Patent Examiner, Art Unit 2657

/DANIEL C WASHBURN/Supervisory Patent Examiner, Art Unit 2657