Notice of Pre-AIA  or AIA  Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Drawings
The drawing submitted on 02/10/2020 is being considered by the examiner.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 10/07/2021 has been entered.
Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 

Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. Such claim limitation(s) is/are: “trained model configured to determine an emotion corresponding to speech associated with the user identifier” and “first component configured to process the NLU results data” in claims 21 and 31.
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may:  (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph.

Claim Rejections - 35 USC § 112
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.

The following is a quotation of the first paragraph of pre-AIA  35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.

Claims 41-42 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA), first paragraph, as failing to comply with the written description requirement. The claim(s) contains subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA 35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention. Claim 41, which depends on Claim 21, recites “determining the user profile data indicates the first emotion is permitted to be used to perform processing related to the first audio data; and after determining the user profile data indicates the first emotion is permitted to be used to perform processing related to the first audio data…” and Claim 42, which depends on Claim 31, similarly recites “determine the user profile data indicates the first speech quality is permitted to be used to perform processing related to the first audio data; and after determining the user profile data indicates the first speech quality is permitted to be used to perform processing related to the first audio data…”. Claims 41-42 were both added as new claims; however, the disclosure is silent on any requirement that user profile data indicate permission for an emotion, whisper, or speech quality to be used, or not used, in processing. The recited subject matter is therefore new matter, and the disclosure does not reasonably convey that the inventor, at the time the application was filed, had possession of the claimed invention.
Therefore, Claims 41-42 will not be further considered with respect to the prior art.

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 21-22, 24, 26-34, and 36-40 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Basye et al. (US 2016/0379638 A1).

Regarding Claim 21, Basye et al. teach: A computer-implemented method, comprising: receiving, from a first device (microphone), first audio data representing first speech (a spoken utterance) ([0021] An audio capture component, such as a microphone of device 110, captures audio 11 corresponding to a spoken utterance. The device sends audio data 111 corresponding to the utterance, to an ASR module 250.); determining a user identifier associated with the first audio data; determining a trained model associated with the user identifier, the trained model configured to determine an emotion (whisper) corresponding to speech associated with the user identifier ([0022] An ASR process 250 converts the audio data 111 into text. The ASR transcribes audio data into text data representing the words of the speech contained in the audio data. The text data may then be used by other components for various purposes, such as executing system commands, inputting data, etc. A spoken utterance in the audio data is input to a processor configured to perform ASR which then interprets the utterance based on the similarity between the utterance and pre-established language models 254 stored in an ASR model knowledge base (ASR Models Storage 252). For example, the ASR process may compare the input audio data with models for sounds (e.g., subword units or phonemes) and sequences of sounds to identify words that match the sequence of sounds spoken in the utterance of the audio data.  [0052] Thus the system may determine that the input speech has an input speech quality corresponding to a whisper/approximated whisper. The system may train components to analyze paralinguistic feature data to make a decision as to whether the speech is whispered. The speech quality detector may implement a model trained using machine learning techniques to determine a label describing the speech. For example the detector may determine that the speech is whispered. 
The label (or other indicator of the speech quality) may then be sent to downstream components to alter the output of the device. [0062] As shown in FIG. 3, the system may also employ customized models 354 that are customized for particular users. Each user may have multiple such models. The user models 354 may be used by the speech quality detector 220 to select a speech quality in a manner more customized for a specific user. For example, the system may track a user's utterances to determine how they normally speak, or how they speak under certain conditions, and use that information to train user-specific models 354. Thus the system may determine the speech quality using some representation of a reference of how a user speaks. The user models 354 may incorporate both audio and non-audio data, which may incorporate not only how a user speaks, but how a user speaks under particular circumstances (i.e., with many individuals present, at different locations, under different lighting conditions, etc.). The user models 354 may also take into account eventual commands and/or speech output by the system so that the system may determine how user commands are processed under certain conditions. Each user model 354 may be associated with a user ID, which may be linked to a user profile containing various other information about a particular user. Such profile information may also be used to train the user model 354.); processing, using the trained model, the first audio data to determine a first indicator (indicator of the whisper) representing a first emotion (whisper) corresponding to the first speech; performing speech processing on the first audio data to determine natural language understanding (NLU) (semantic interpretation of the text) results data (intent) ([0030] The NLU process takes textual input (such as processed from ASR 250 based on the utterance 11) and attempts to make a semantic interpretation of the text. 
That is, the NLU process determines the meaning behind the text based on the individual words and then implements that meaning. NLU processing 260 interprets a text string to derive an intent or a desired action from the user as well as the pertinent pieces of information in the text that allow a device (e.g., device 110) to complete that action. For example, if a spoken utterance is processed using ASR 250 and outputs the text "call mom" the NLU process may determine that the user intended to activate a telephone in his/her device and to initiate a call with a contact matching the entity "mom." [0037] An intent classification (IC) module 264 parses the query to determine an intent or intents for each identified domain, where the intent corresponds to the action to be performed that is responsive to the query. Each domain is associated with a database (278a-278n) of words linked to intents. [0060] The model may be constructed based on the training utterances and then disseminated to individual devices 110 or to server(s) 120. A speech quality detector 220 may then use the model(s) to make decisions at runtime as to whether the utterance was whispered. An indicator of the whisper may then be output from the speech quality detector 220 to downstream components such as a command processor 290, TTS module 314, etc.); determining a first component (destination command processor 290) configured to process the NLU results data ([0047] The output from the NLU processing (which may include tagged text, commands, etc.) may then be sent to a command processor 290, which may be located on a same or separate server 120 as part of system 100. The destination command processor 290 may be determined based on the NLU output.); sending, to the first component, the NLU results data ([0047] The output from the NLU processing (which may include tagged text, commands, etc.) 
may then be sent to a command processor 290, which may be located on a same or separate server 120 as part of system 100. The destination command processor 290 may be determined based on the NLU output. For example, if the NLU output includes a command to play music, the destination command processor 290 may be a music playing application, such as one located on device 110 or in a music playing appliance, configured to execute a music playing command. [0052] The speech quality detector may implement a model trained using machine learning techniques to determine a label describing the speech. For example the detector may determine that the speech is whispered. The label (or other indicator of the speech quality) may then be sent to downstream components to alter the output of the device. [0063] The speech quality detector 220 may use the models 353, 354 to process audio data 111 and/or non-audio data 302 to determine one or more speech qualities to associate with an input spoken utterance. The speech quality detector 220 may then create an indicator for the determined speech quality/ies. The indicator may then be sent to a downstream command processor 290 so that a command/query may be processed using the indicator and based on the speech quality/ies. The command processor 290 receives the indicator, as well as text and possible other semantic notation related to the utterance, as discussed above in reference to FIG. 2.); receiving (TTS module), from the first component (command processor 290), first content (textual answer) responsive to the first speech (a spoken utterance); determining, based at least in part on the first indicator, that the first content is to be output according to the first emotion; and causing output of the first content according to the first emotion ([0047] The output from the NLU processing (which may include tagged text, commands, etc.) 
may then be sent to a command processor 290, which may be located on a same or separate server 120 as part of system 100. The destination command processor 290 may be determined based on the NLU output. For example, if the NLU output includes a command to play music, the destination command processor 290 may be a music playing application, such as one located on device 110 or in a music playing appliance, configured to execute a music playing command. [0066] A command processor 290 may then process the text to determine a textual answer responding to the query. The textual answer may be sent to the TTS module 314 so the TTS module 314 may synthesize speech corresponding to the textual answer. However the TTS module 314 may, based on the indicator, synthesize whispered speech (or speech configured to approximate a whisper) to output to the user.).

Regarding Claim 22, Basye et al. teach: The computer-implemented method of claim 21, further comprising: generating metadata representing text-to-speech (TTS) processing is to be performed based at least in part on the first emotion; and performing, based at least in part on the metadata, TTS processing on the first content to generate second audio data (synthesize whispered speech), wherein causing output of the first content according to the first emotion comprises causing output of audio corresponding to the second audio data (See rejection of Claim 21).

Regarding Claim 24, Basye et al. teach: The computer-implemented method of claim 22, wherein the second audio data (synthesize whispered speech) matches the first emotion (whisper) (See rejection of Claim 22).

Regarding Claim 26, Basye et al. teach: The computer-implemented method of claim 21, further comprising: determining the first indicator based at least in part on processing image data corresponding to the first speech to determine the first indicator ([0019] The system may also determine (144) non-audio data corresponding to the utterance, for example time data as to when the utterance was received, location data of the utterance, image data associated with the user 10 at the time the utterance was spoken, etc. The system may perform (146) ASR to determine utterance text. The system may then determine (148) one or more utterance speech qualities using the trained model(s), the audio data and the non-audio data. [0055] The system may also consider non-audio data and non-audio features when determining a quality of the speech. For example, if a camera detects the speaker, the system may analyze the video data (for example, the video data may be input to the speech quality detector 220) to determine some quality of the speaker (agitated, subdued, angry, etc.) that the speech quality detector 220 may consider. Other non-audio data may also be input to the speech quality detector 220. For example, time/date data, location data (for example GPS location or relative indoor room location), ambient light data from a light sensor, the identity of other nearby individuals to the speaker, proximity of the user to a device (for example, if a user is leaning in close to a device to speak an utterance, or if a user is far away from the device), etc. The types of acoustic and non-audio data considered by the speech quality detector 220 depends on the types of such data available to the system 100 when processing an utterance. [0063] The speech quality detector 220 may use the models 353, 354 to process audio data 111 and/or non-audio data 302 to determine one or more speech qualities to associate with an input spoken utterance. 
The speech quality detector 220 may then create an indicator for the determined speech quality/ies. The indicator may then be sent to a downstream command processor 290 so that a command/query may be processed using the indicator and based on the speech quality/ies. The command processor 290 receives the indicator, as well as text and possible other semantic notation related to the utterance, as discussed above in reference to FIG. 2. The command processor 290 may be a component capable of acting on the utterance.).

Regarding Claim 27, Basye et al. teach: The computer-implemented method of claim 26, wherein the image data represents a gesture (agitated, subdued, angry, etc.) of a user ([0019] The system may also determine (144) non-audio data corresponding to the utterance, for example time data as to when the utterance was received, location data of the utterance, image data associated with the user 10 at the time the utterance was spoken, etc. [0051] For example, based on audio (and possibly non-audio) paralinguistic feature data a system may determine that an input utterance was whispered. Whispered speech is typically "unvoiced," that is words are spoken using the articulators (mouth, lips, tongue, etc.) as normal, but without use/vibration of vocal cords such that an utterance has no resonance, or resonance below a certain threshold. Vocal resonance is when the product of voicing (i.e., phonation) is enhanced in tone quality (i.e., timbre) and/or intensity by the air-filled cavities through which speech passes on the speech's way to the outside air. During whispering, air comes through the throat without being modulated by the vocal cords so that what is left is motion of the articulators resulting in a stream of air without valve structure. [0055] The system may also consider non-audio data and non-audio features when determining a quality of the speech. For example, if a camera detects the speaker, the system may analyze the video data (for example, the video data may be input to the speech quality detector 220) to determine some quality of the speaker (agitated, subdued, angry, etc.) that the speech quality detector 220 may consider. Other non-audio data may also be input to the speech quality detector 220.).

Regarding Claim 28, Basye et al. teach: The computer-implemented method of claim 26, wherein the image data represents a face (image inherently includes face to determine agitated, subdued, angry) of a user ([0055] The system may also consider non-audio data and non-audio features when determining a quality of the speech. For example, if a camera detects the speaker, the system may analyze the video data (for example, the video data may be input to the speech quality detector 220) to determine some quality of the speaker (agitated, subdued, angry, etc.) that the speech quality detector 220 may consider. Other non-audio data may also be input to the speech quality detector 220.).

Regarding Claim 29, Basye et al. teach: The computer-implemented method of claim 21, wherein the first content corresponds to second audio data and the computer-implemented method further comprises: generating, based at least in part on the first indicator, metadata representing the second audio data (synthesize whispered speech) is to be output at a first volume level; sending the second audio data to the first device; and sending the metadata to the first device, the metadata causing the first device to output audio corresponding to the second audio data at the first volume level (See rejection of Claim 22 and [0047] The output from the NLU processing (which may include tagged text, commands, etc.) may then be sent to a command processor 290, which may be located on a same or separate server 120 as part of system 100. The destination command processor 290 may be determined based on the NLU output. [0051] As noted below, a machine learning model may be trained to recognize whispered speech based on resonance, volume, and/or other features of input audio. While certain spoken whispered sounds may differ from voiced sounds more than others as a result of the lack of voicing or low volume, ASR performance may not necessarily be impacted. That is, current ASR systems may be able to process whispered speech. [0052] The system may be configured to recognize that input audio is whispered (which is separate from recognizing the words of whispered speech). For example the system may determine that the input speech has resonance below a threshold and/or a volume below a threshold. Thus the system may determine that the input speech has an input speech quality corresponding to a whisper/approximated whisper. The system may train components to analyze paralinguistic feature data to make a decision as to whether the speech is whispered. 
While the system may determine whether speech is whispered based on whether a particular paralinguistic feature value(s) are above a threshold (for example, whether input speech has a resonance under a particular threshold and/or a volume under a particular threshold, etc.), more complex decision making is possible using machine learning models and training techniques. Thus, paralinguistic feature values (whether from audio data or non-audio data) are input as features to a speech quality detector. The speech quality detector may implement a model trained using machine learning techniques to determine a label describing the speech. For example the detector may determine that the speech is whispered. The label (or other indicator of the speech quality) may then be sent to downstream components to alter the output of the device. [0064] For example, if the command processor 290 is a music player, and the utterance included a request to play music, only did not specify a particular music title, the command processor 290 may use the indicator of speech quality to select a music title. Specifically, if a user shouts, in an excited manner, "PLAY SOME MUSIC!!" the speech quality detector 220 may send an indicator to the command processor that the speech had a quality of excitement and the NLU module 260 may send the command processor 290 text and semantic indicators that the utterance included a request to play music. The command processor 290 may then select a music title to play based on the quality of excitement and may thus select a rock song or similar up-tempo song from a user's catalog. In another example, if a user whispers "play some music," the speech quality detector 220 may send an indicator to the command processor that the speech was whispered and the NLU module 260 may send the command processor 290 text and semantic indicators that the utterance included a request to play music. 
The command processor 290 may then select a music title to play based on the quality of being whispered and may thus select a mellow or calm song from a user's catalog. Similar selections of actions by different command processors 290 outside the domain of music are also envisioned. As another example, volume of output may be decreased as a result of whispered input speech, or volume increased as a result of excited speech, or the like.).

Regarding Claim 30, Basye et al. teach: The computer-implemented method of claim 21, wherein causing output of the first content comprises causing a second device (TTS module), different from the first device, to output the first content ([0047] The output from the NLU processing (which may include tagged text, commands, etc.) may then be sent to a command processor 290, which may be located on a same or separate server 120 as part of system 100. The destination command processor 290 may be determined based on the NLU output. For example, if the NLU output includes a command to play music, the destination command processor 290 may be a music playing application, such as one located on device 110 or in a music playing appliance, configured to execute a music playing command. [0064] In another example, if a user whispers "play some music," the speech quality detector 220 may send an indicator to the command processor that the speech was whispered and the NLU module 260 may send the command processor 290 text and semantic indicators that the utterance included a request to play music. The command processor 290 may then select a music title to play based on the quality of being whispered and may thus select a mellow or calm song from a user's catalog. Similar selections of actions by different command processors 290 outside the domain of music are also envisioned. As another example, volume of output may be decreased as a result of whispered input speech, or volume increased as a result of excited speech, or the like. [0066] A TTS module 314 may receive the indicator of input speech quality and may configure an output speech quality (if output speech is called for) to correspond to (or even match or approximate) the input speech quality. For example, if a user whispers an utterance including a query to a device 110, the device may send the audio to a server 120. 
The server may process the audio with a speech quality detector 220 to determine the utterance was whispered and to send an indicator that the speech was whispered to the TTS module 314. The server (or another server) may perform ASR and NLU processing to identify text associated with the query. A command processor 290 may then process the text to determine a textual answer responding to the query. The textual answer may be sent to the TTS module 314 so the TTS module 314 may synthesize speech corresponding to the textual answer. However the TTS module 314 may, based on the indicator, synthesize whispered speech (or speech configured to approximate a whisper) to output to the user. In a broader example, the TTS module 314 may synthesize speech based on one or more speech qualities of the input speech as detected by the speech quality detector 220. Speech may be synthesized by the TTS module as described below.).

Regarding Claim 31, Basye et al. teach: A system comprising: at least one processor; and at least one memory comprising instructions that, when executed by the at least one processor, cause the system to: receive, from a first device (device 110 including microphone 104), first audio data representing first speech (a spoken utterance) ([0021] An audio capture component, such as a microphone of device 110, captures audio 11 corresponding to a spoken utterance. The device sends audio data 111 corresponding to the utterance, to an ASR module 250.); determine a user identifier associated with the first audio data; determine a trained model associated with the user identifier, the trained model configured to determine a speech quality (quality of speech or whisper) corresponding to speech associated with the user identifier ([0022] An ASR process 250 converts the audio data 111 into text. The ASR transcribes audio data into text data representing the words of the speech contained in the audio data. The text data may then be used by other components for various purposes, such as executing system commands, inputting data, etc. A spoken utterance in the audio data is input to a processor configured to perform ASR which then interprets the utterance based on the similarity between the utterance and pre-established language models 254 stored in an ASR model knowledge base (ASR Models Storage 252). For example, the ASR process may compare the input audio data with models for sounds (e.g., subword units or phonemes) and sequences of sounds to identify words that match the sequence of sounds spoken in the utterance of the audio data. [0052] Thus the system may determine that the input speech has an input speech quality corresponding to a whisper/approximated whisper. The system may train components to analyze paralinguistic feature data to make a decision as to whether the speech is whispered. 
The speech quality detector may implement a model trained using machine learning techniques to determine a label describing the speech. For example the detector may determine that the speech is whispered. The label (or other indicator of the speech quality) may then be sent to downstream components to alter the output of the device. [0062] As shown in FIG. 3, the system may also employ customized models 354 that are customized for particular users. Each user may have multiple such models. The user models 354 may be used by the speech quality detector 220 to select a speech quality in a manner more customized for a specific user. For example, the system may track a user's utterances to determine how they normally speak, or how they speak under certain conditions, and use that information to train user-specific models 354. Thus the system may determine the speech quality using some representation of a reference of how a user speaks. The user models 354 may incorporate both audio and non-audio data, which may incorporate not only how a user speaks, but how a user speaks under particular circumstances (i.e., with many individuals present, at different locations, under different lighting conditions, etc.). The user models 354 may also take into account eventual commands and/or speech output by the system so that the system may determine how user commands are processed under certain conditions. Each user model 354 may be associated with a user ID, which may be linked to a user profile containing various other information about a particular user. 
Such profile information may also be used to train the user model 354.); process, using the trained model, the first audio data to determine a first indicator (indicator of the whisper) representing a first speech quality (quality of speech or whisper) corresponding to the first speech; perform speech processing on the first audio data to determine natural language understanding (NLU) (semantic interpretation of the text) results data (intent) ([0030] The NLU process takes textual input (such as processed from ASR 250 based on the utterance 11) and attempts to make a semantic interpretation of the text. That is, the NLU process determines the meaning behind the text based on the individual words and then implements that meaning. NLU processing 260 interprets a text string to derive an intent or a desired action from the user as well as the pertinent pieces of information in the text that allow a device (e.g., device 110) to complete that action. For example, if a spoken utterance is processed using ASR 250 and outputs the text "call mom" the NLU process may determine that the user intended to activate a telephone in his/her device and to initiate a call with a contact matching the entity "mom." [0037] An intent classification (IC) module 264 parses the query to determine an intent or intents for each identified domain, where the intent corresponds to the action to be performed that is responsive to the query. Each domain is associated with a database (278a-278n) of words linked to intents. [0060] The model may be constructed based on the training utterances and then disseminated to individual devices 110 or to server(s) 120. A speech quality detector 220 may then use the model(s) to make decisions at runtime as to whether the utterance was whispered. 
An indicator of the whisper may then be output from the speech quality detector 220 to downstream components such as a command processor 290, TTS module 314, etc.); determine a first component (destination command processor 290) configured to process the NLU results data ([0047] The output from the NLU processing (which may include tagged text, commands, etc.) may then be sent to a command processor 290, which may be located on a same or separate server 120 as part of system 100. The destination command processor 290 may be determined based on the NLU output.); send, to the first component, the NLU results data ([0052] The speech quality detector may implement a model trained using machine learning techniques to determine a label describing the speech. For example the detector may determine that the speech is whispered. The label (or other indicator of the speech quality) may then be sent to downstream components to alter the output of the device. [0063] The speech quality detector 220 may use the models 353, 354 to process audio data 111 and/or non-audio data 302 to determine one or more speech qualities to associate with an input spoken utterance. The speech quality detector 220 may then create an indicator for the determined speech quality/ies. The indicator may then be sent to a downstream command processor 290 so that a command/query may be processed using the indicator and based on the speech quality/ies. The command processor 290 receives the indicator, as well as text and possible other semantic notation related to the utterance, as discussed above in reference to FIG. 2.); 
receive (TTS module), from the first component (command processor 290), first content (textual answer) responsive to the first speech (a spoken utterance); determine, based at least in part on the first indicator, that the first content is to be output according to the first speech quality; and cause output of the first content according to the first speech quality ([0047] The output from the NLU processing (which may include tagged text, commands, etc.) may then be sent to a command processor 290, which may be located on a same or separate server 120 as part of system 100. The destination command processor 290 may be determined based on the NLU output. For example, if the NLU output includes a command to play music, the destination command processor 290 may be a music playing application, such as one located on device 110 or in a music playing appliance, configured to execute a music playing command. [0066] A command processor 290 may then process the text to determine a textual answer responding to the query. The textual answer may be sent to the TTS module 314 so the TTS module 314 may synthesize speech corresponding to the textual answer. However the TTS module 314 may, based on the indicator, synthesize whispered speech (or speech configured to approximate a whisper) to output to the user. Note: the textual answer generated by the command processor 290 is independent of the first emotion, since the TTS module 314 receives the textual answer from the command processor 290 and then synthesizes that answer as whispered speech according to the whisper indicator.).

Regarding Claim 32, Basye et al. teach: The system of claim 31, wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the system to: generate metadata representing the TTS processing is to be performed based at least in part on the first speech quality; and perform, based at least in part on the metadata, TTS processing on the first content to generate second audio data, wherein the instructions that cause output of the first content according to the first speech quality further cause the system to cause output of audio corresponding to the second audio data (See rejection of Claim 31 and [0060] For example, a model, such as an SVM classifier, may be trained to recognize when an input speech utterance is whispered using many different training utterances, each labeled either "whispered" or "not whispered." Each training utterance may also be associated with various feature data corresponding to the respective utterance, where the feature data indicates values for the acoustic and/or non-audio paralinguistic features that may be used to determine if a future utterance was whispered. The model may be constructed based on the training utterances and then disseminated to individual devices 110 or to server(s) 120. A speech quality detector 220 may then use the model(s) to make decisions at runtime as to whether the utterance was whispered. An indicator of the whisper may then be output from the speech quality detector 220 to downstream components such as a command processor 290, TTS module 314, etc. The system may then tailor its operations and/or output based on the fact that the utterance was, or was not, whispered. [0066] A command processor 290 may then process the text to determine a textual answer responding to the query. The textual answer may be sent to the TTS module 314 so the TTS module 314 may synthesize speech corresponding to the textual answer. 
However the TTS module 314 may, based on the indicator, synthesize whispered speech (or speech configured to approximate a whisper) to output to the user.).

Regarding Claim 33, Basye et al. teach: The system of claim 31, wherein the first speech quality corresponds to a first emotion (whisper) corresponding to the first speech (See rejection of Claim 31).

Regarding Claim 34, Basye et al. teach: The system of claim 32, wherein the second audio data matches the first speech quality (see rejection of Claim 32, specifically [0060] A speech quality detector 220 may then use the model(s) to make decisions at runtime as to whether the utterance was whispered. An indicator of the whisper may then be output from the speech quality detector 220 to downstream components such as a command processor 290, TTS module 314, etc. The system may then tailor its operations and/or output based on the fact that the utterance was, or was not, whispered.).

Regarding Claim 36, Basye et al. teach: The system of claim 31, wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the system to: determine the first indicator based at least in part on processing image data corresponding to the first speech ([0019] The system may also determine (144) non-audio data corresponding to the utterance, for example time data as to when the utterance was received, location data of the utterance, image data associated with the user 10 at the time the utterance was spoken, etc. The system may perform (146) ASR to determine utterance text. The system may then determine (148) one or more utterance speech qualities using the trained model(s), the audio data and the non-audio data. For example, a model configured to determine whether speech was whispered may analyze various audio data feature values to classify the utterance as whispered. The system may then perform (150) one or more operations resulting in output based on the utterance text and the speech quality/ies. [0055] The system may also consider non-audio data and non-audio features when determining a quality of the speech. For example, if a camera detects the speaker, the system may analyze the video data (for example, the video data may be input to the speech quality detector 220) to determine some quality of the speaker (agitated, subdued, angry, etc.) that the speech quality detector 220 may consider. Other non-audio data may also be input to the speech quality detector 220. For example, time/date data, location data (for example GPS location or relative indoor room location), ambient light data from a light sensor, the identity of other nearby individuals to the speaker, proximity of the user to a device (for example, if a user is leaning in close to a device to speak an utterance, or if a user is far away from the device), etc. 
The types of acoustic and non-audio data considered by the speech quality detector 220 depends on the types of such data available to the system 100 when processing an utterance. [0063] The speech quality detector 220 may use the models 353, 354 to process audio data 111 and/or non-audio data 302 to determine one or more speech qualities to associate with an input spoken utterance. The speech quality detector 220 may then create an indicator for the determined speech quality/ies. The indicator may then be sent to a downstream command processor 290 so that a command/query may be processed using the indicator and based on the speech quality/ies. The command processor 290 receives the indicator, as well as text and possible other semantic notation related to the utterance, as discussed above in reference to FIG. 2. The command processor 290 may be a component capable of acting on the utterance.).

Regarding Claim 37, Basye et al. teach: The system of claim 36, wherein the image data represents a gesture (agitated, subdued, angry, etc.) of a user ([0019] The system may also determine (144) non-audio data corresponding to the utterance, for example time data as to when the utterance was received, location data of the utterance, image data associated with the user 10 at the time the utterance was spoken, etc. [0051] For example, based on audio (and possibly non-audio) paralinguistic feature data a system may determine that an input utterance was whispered. Whispered speech is typically "unvoiced," that is words are spoken using the articulators (mouth, lips, tongue, etc.) as normal, but without use/vibration of vocal cords such that an utterance has no resonance, or resonance below a certain threshold. Vocal resonance is when the product of voicing (i.e., phonation) is enhanced in tone quality (i.e., timbre) and/or intensity by the air-filled cavities through which speech passes on the speech's way to the outside air. During whispering, air comes through the throat without being modulated by the vocal cords so that what is left is motion of the articulators resulting in a stream of air without valve structure. [0055] The system may also consider non-audio data and non-audio features when determining a quality of the speech. For example, if a camera detects the speaker, the system may analyze the video data (for example, the video data may be input to the speech quality detector 220) to determine some quality of the speaker (agitated, subdued, angry, etc.) that the speech quality detector 220 may consider. Other non-audio data may also be input to the speech quality detector 220.).

Regarding Claim 38, Basye et al. teach: The system of claim 36, wherein the image data represents a face (the image inherently includes a face in order to determine agitated, subdued, angry) of a user ([0055] The system may also consider non-audio data and non-audio features when determining a quality of the speech. For example, if a camera detects the speaker, the system may analyze the video data (for example, the video data may be input to the speech quality detector 220) to determine some quality of the speaker (agitated, subdued, angry, etc.) that the speech quality detector 220 may consider. Other non-audio data may also be input to the speech quality detector 220.).

Regarding Claim 39, Basye et al. teach: The system of claim 31, wherein the first content corresponds to second audio data and the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the system to: generate, based at least in part on the first indicator, metadata representing the second audio data is to be output at a first volume level based at least in part on the first indicator; send the second audio data to the first device; and send the metadata to the first device, the metadata causing the first device to output audio corresponding to the second audio data at the first volume level (See rejection of Claim 32 and [0047] The output from the NLU processing (which may include tagged text, commands, etc.) may then be sent to a command processor 290, which may be located on a same or separate server 120 as part of system 100. The destination command processor 290 may be determined based on the NLU output. [0051] As noted below, a machine learning model may be trained to recognize whispered speech based on resonance, volume, and/or other features of input audio. While certain spoken whispered sounds may differ from voiced sounds more than others as a result of the lack of voicing or low volume, ASR performance may not necessarily be impacted. That is, current ASR systems may be able to process whispered speech. [0052] The system may be configured to recognize that input audio is whispered (which is separate from recognizing the words of whispered speech). For example the system may determine that the input speech has resonance below a threshold and/or a volume below a threshold. Thus the system may determine that the input speech has an input speech quality corresponding to a whisper/approximated whisper. The system may train components to analyze paralinguistic feature data to make a decision as to whether the speech is whispered. 
While the system may determine whether speech is whispered based on whether a particular paralinguistic feature value(s) are above a threshold (for example, whether input speech has a resonance under a particular threshold and/or a volume under a particular threshold, etc.), more complex decision making is possible using machine learning models and training techniques. Thus, paralinguistic feature values (whether from audio data or non-audio data) are input as features to a speech quality detector. The speech quality detector may implement a model trained using machine learning techniques to determine a label describing the speech. For example the detector may determine that the speech is whispered. The label (or other indicator of the speech quality) may then be sent to downstream components to alter the output of the device. [0064] For example, if the command processor 290 is a music player, and the utterance included a request to play music, only did not specify a particular music title, the command processor 290 may use the indicator of speech quality to select a music title. Specifically, if a user shouts, in an excited manner, "PLAY SOME MUSIC!!" the speech quality detector 220 may send an indicator to the command processor that the speech had a quality of excitement and the NLU module 260 may send the command processor 290 text and semantic indicators that the utterance included a request to play music. The command processor 290 may then select a music title to play based on the quality of excitement and may thus select a rock song or similar up-tempo song from a user's catalog. In another example, if a user whispers "play some music," the speech quality detector 220 may send an indicator to the command processor that the speech was whispered and the NLU module 260 may send the command processor 290 text and semantic indicators that the utterance included a request to play music. 
The command processor 290 may then select a music title to play based on the quality of being whispered and may thus select a mellow or calm song from a user's catalog. Similar selections of actions by different command processors 290 outside the domain of music are also envisioned. As another example, volume of output may be decreased as a result of whispered input speech, or volume increased as a result of excited speech, or the like.).

Regarding Claim 40, Basye et al. teach: The system of claim 31, wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the system to: cause a second device, different from the first device, to output the first content ([0047] The output from the NLU processing (which may include tagged text, commands, etc.) may then be sent to a command processor 290, which may be located on a same or separate server 120 as part of system 100. The destination command processor 290 may be determined based on the NLU output. For example, if the NLU output includes a command to play music, the destination command processor 290 may be a music playing application, such as one located on device 110 or in a music playing appliance, configured to execute a music playing command. [0064] In another example, if a user whispers "play some music," the speech quality detector 220 may send an indicator to the command processor that the speech was whispered and the NLU module 260 may send the command processor 290 text and semantic indicators that the utterance included a request to play music. The command processor 290 may then select a music title to play based on the quality of being whispered and may thus select a mellow or calm song from a user's catalog. Similar selections of actions by different command processors 290 outside the domain of music are also envisioned. As another example, volume of output may be decreased as a result of whispered input speech, or volume increased as a result of excited speech, or the like. [0066] A TTS module 314 may receive the indicator of input speech quality and may configure an output speech quality (if output speech is called for) to correspond to (or even match or approximate) the input speech quality. For example, if a user whispers an utterance including a query to a device 110, the device may send the audio to a server 120. 
The server may process the audio with a speech quality detector 220 to determine the utterance was whispered and to send an indicator that the speech was whispered to the TTS module 314. The server (or another server) may perform ASR and NLU processing to identify text associated with the query. A command processor 290 may then process the text to determine a textual answer responding to the query. The textual answer may be sent to the TTS module 314 so the TTS module 314 may synthesize speech corresponding to the textual answer. However the TTS module 314 may, based on the indicator, synthesize whispered speech (or speech configured to approximate a whisper) to output to the user. In a broader example, the TTS module 314 may synthesize speech based on one or more speech qualities of the input speech as detected by the speech quality detector 220. Speech may be synthesized by the TTS module as described below.).



Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Huang et al. (US 10565994 B2) teach: (Abstract) A method, computer-readable medium, and system including a speech-to-text module to receive an input of speech including one or more words generated by a human and to output data including text, sentiment information, and other parameters corresponding to the speech input; a processing module like Artificial Intelligence to generate a reply to the speech input, the reply including a textual component, sentimental information associated with the textual component, and contextual information associated with the textual component; and a text-to-speech module to receive the textual component, sentimental information, and contextual information and to generate, based on the received textual component and its associated sentimental information and contextual information, a speech output including one or more spoken words, the spoken words to be presented with at least one of a pace, a tone, a volume, and an emphasis representative of the sentimental information and contextual information associated with the textual component.
McCord et al. (US 2018/0082679 A1) teach: A system and method for emotion-enhanced natural speech using dilated convolutional neural networks, wherein an audio processing server receives a raw audio waveform from a dilated convolutional artificial neural network, associates text-based emotion content markers with portions of the raw audio waveform to produce an emotion-enhanced audio waveform, and provides the emotion-enhanced audio waveform to the dilated convolutional artificial neural network for use as a new input data set.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MOHAMMAD K ISLAM, whose telephone number is (571) 270-5878. The examiner can normally be reached Monday-Friday, EST (IFP).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Bhavesh Mehta, can be reached at 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/MOHAMMAD K ISLAM/Primary Examiner, Art Unit 2656