Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Drawings
The drawing submitted on 02/10/2020 is being considered by the examiner.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 21-40 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Basye et al. (US 2016/0379638 A1).

Regarding Claim 21, Basye et al. teach: A computer-implemented method, comprising: receiving, from a first device (microphone), first audio data representing first speech (a spoken utterance) ([0021] An audio capture component, such as a microphone of device 110, captures audio 11 corresponding to a spoken utterance. The device sends audio data 111 corresponding to the utterance, to an ASR module 250.); determining a first indicator (indicator of the whisper) representing a first emotion (whisper) corresponding to the first speech; performing speech processing on the first audio data to determine natural language understanding (NLU) (semantic interpretation of the text) results data (intent) ([0030] The NLU process takes textual input (such as processed from ASR 250 based on the utterance 11) and attempts to make a semantic interpretation of the text. That is, the NLU process determines the meaning behind the text based on the individual words and then implements that meaning. NLU processing 260 interprets a text string to derive an intent or a desired action from the user as well as the pertinent pieces of information in the text that allow a device (e.g., device 110) to complete that action. For example, if a spoken utterance is processed using ASR 250 and outputs the text "call mom" the NLU process may determine that the user intended to activate a telephone in his/her device and to initiate a call with a contact matching the entity "mom." [0037] An intent classification (IC) module 264 parses the query to determine an intent or intents for each identified domain, where the intent corresponds to the action to be performed that is responsive to the query. Each domain is associated with a database (278a-278n) of words linked to intents. [0060] The model may be constructed based on the training utterances and then disseminated to individual devices 110 or to server(s) 120. 
A speech quality detector 220 may then use the model(s) to make decisions at runtime as to whether the utterance was whispered. An indicator of the whisper may then be output from the speech quality detector 220 to downstream components such as a command processor 290, TTS module 314, etc.); determining a first component (destination command processor 290) associated with the NLU results data ([0047] The output from the NLU processing (which may include tagged text, commands, etc.) may then be sent to a command processor 290, which may be located on a same or separate server 120 as part of system 100. The destination command processor 290 may be determined based on the NLU output.); sending, to the first component, the NLU results data; sending, to the first component, the first indicator ([0052] The speech quality detector may implement a model trained using machine learning techniques to determine a label describing the speech. For example the detector may determine that the speech is whispered. The label (or other indicator of the speech quality) may then be sent to downstream components to alter the output of the device. [0063] The speech quality detector 220 may use the models 353, 354 to process audio data 111 and/or non-audio data 302 to determine one or more speech qualities to associate with an input spoken utterance. The speech quality detector 220 may then create an indicator for the determined speech quality/ies. The indicator may then be sent to a downstream command processor 290 so that a command/query may be processed using the indicator and based on the speech quality/ies. The command processor 290 receives the indicator, as well as text and possible other semantic notation related to the utterance, as discussed above in reference to FIG. 2.);
receiving (TTS module), from the first component (command processor 290), first content (textual answer) responsive to the first speech (a spoken utterance); determining, based on the first indicator, the first content is to be output according to the first emotion; and causing output of the first content according to the first emotion ([0066] A command processor 290 may then process the text to determine a textual answer responding to the query. The textual answer may be sent to the TTS module 314 so the TTS module 314 may synthesize speech corresponding to the textual answer. However the TTS module 314 may, based on the indicator, synthesize whispered speech (or speech configured to approximate a whisper) to output to the user.).
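
For illustration only, the data flow mapped in the rejection of Claim 21 (audio in, speech quality indicator, NLU result, routing to a command processor, indicator-conditioned output) may be sketched as follows. This is a hypothetical sketch and not code from Basye et al.: every function name, the volume threshold, and the dictionary-based routing are assumptions chosen for readability, and a single volume value stands in for the trained model of paragraph [0060].

```python
from dataclasses import dataclass


@dataclass
class NluResult:
    intent: str
    text: str


def detect_speech_quality(volume: float, whisper_threshold: float = 0.2) -> str:
    """Hypothetical stand-in for a speech quality detector: label by volume only."""
    return "whispered" if volume < whisper_threshold else "normal"


def run_nlu(text: str) -> NluResult:
    """Toy intent classifier standing in for NLU processing."""
    intent = "play_music" if "music" in text.lower() else "answer_query"
    return NluResult(intent=intent, text=text)


def music_processor(nlu: NluResult, indicator: str) -> str:
    # The component receives both the NLU result and the indicator.
    return "Playing a calm song" if indicator == "whispered" else "Playing a song"


def query_processor(nlu: NluResult, indicator: str) -> str:
    return f"Here is an answer to: {nlu.text}"


# The destination component is selected from the NLU output (here, the intent).
COMMAND_PROCESSORS = {
    "play_music": music_processor,
    "answer_query": query_processor,
}


def synthesize(text: str, indicator: str) -> dict:
    """Stand-in for TTS: returns the text tagged with the output speaking style."""
    return {"text": text, "style": indicator}


def handle_utterance(text: str, volume: float) -> dict:
    indicator = detect_speech_quality(volume)   # first indicator
    nlu = run_nlu(text)                         # NLU results data
    processor = COMMAND_PROCESSORS[nlu.intent]  # first component
    content = processor(nlu, indicator)         # first content
    return synthesize(content, indicator)       # output according to the indicator


if __name__ == "__main__":
    print(handle_utterance("play some music", volume=0.05))
```

In this sketch the indicator is passed both to the command processor and to the synthesis step, which mirrors the two downstream uses cited from paragraphs [0063] and [0066].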

Regarding Claim 22, Basye et al. teach: The computer-implemented method of claim 21, further comprising: generating first metadata representing text-to-speech (TTS) processing is to be performed based at least in part on the first emotion; and performing, based at least in part on the first metadata, TTS processing on text data corresponding to the first content to generate second audio data (synthesize whispered speech), wherein causing output of the first content according to the first emotion comprises causing output of audio corresponding to the second audio data (See rejection of Claim 21).

Regarding Claim 23, Basye et al. teach: The computer-implemented method of claim 22, further comprising: determining the first content does not correspond to the first emotion, wherein generating the first metadata is performed in response to determining that the first content does not correspond to the first emotion (See rejection of Claim 21 and [0060] For example, a model, such as an SVM classifier, may be trained to recognize when an input speech utterance is whispered using many different training utterances, each labeled either "whispered" or "not whispered." Each training utterance may also be associated with various feature data corresponding to the respective utterance, where the feature data indicates values for the acoustic and/or non-audio paralinguistic features that may be used to determine if a future utterance was whispered. The model may be constructed based on the training utterances and then disseminated to individual devices 110 or to server(s) 120. A speech quality detector 220 may then use the model(s) to make decisions at runtime as to whether the utterance was whispered. An indicator of the whisper may then be output from the speech quality detector 220 to downstream components such as a command processor 290, TTS module 314, etc. The system may then tailor its operations and/or output based on the fact that the utterance was, or was not, whispered.).

Regarding Claim 24, Basye et al. teach: The computer-implemented method of claim 22, wherein the second audio data (synthesize whispered speech) matches the first emotion (whisper) (See rejection of Claim 22).

Regarding Claim 25, Basye et al. teach: The computer-implemented method of claim 21, further comprising: processing the first audio data to determine the first indicator (See rejection of Claim 21).

Regarding Claim 26, Basye et al. teach: The computer-implemented method of claim 21, further comprising: processing image data corresponding to the first speech to determine the first indicator ([0019] The system may also determine (144) non-audio data corresponding to the utterance, for example time data as to when the utterance was received, location data of the utterance, image data associated with the user 10 at the time the utterance was spoken, etc. The system may perform (146) ASR to determine utterance text. The system may then determine (148) one or more utterance speech qualities using the trained model(s), the audio data and the non-audio data. For example, a model configured to determine whether speech was whispered may analyze various audio data feature values to classify the utterance as whispered. The system may then perform (150) one or more operations resulting in output based on the utterance text and the speech quality/ies. [0049] The present system is actually configured to detect speech quality/qualities and determine a label corresponding to the detected qualities that may be applied to an utterance in the speech and used for later processing. The speech quality may be based on paralinguistic metrics that describe some quality/feature other than the specific words spoken. [0051] For example, based on audio (and possibly non-audio) paralinguistic feature data a system may determine that an input utterance was whispered. [0052] Thus, paralinguistic feature values (whether from audio data or non-audio data) are input as features to a speech quality detector. The speech quality detector may implement a model trained using machine learning techniques to determine a label describing the speech. For example the detector may determine that the speech is whispered. The label (or other indicator of the speech quality) may then be sent to downstream components to alter the output of the device. 
[0055] The system may also consider non-audio data and non-audio features when determining a quality of the speech. For example, if a camera detects the speaker, the system may analyze the video data (for example, the video data may be input to the speech quality detector 220) to determine some quality of the speaker (agitated, subdued, angry, etc.) that the speech quality detector 220 may consider. Other non-audio data may also be input to the speech quality detector 220. For example, time/date data, location data (for example GPS location or relative indoor room location), ambient light data from a light sensor, the identity of other nearby individuals to the speaker, proximity of the user to a device (for example, if a user is leaning in close to a device to speak an utterance, or if a user is far away from the device), etc. The types of acoustic and non-audio data considered by the speech quality detector 220 depends on the types of such data available to the system 100 when processing an utterance. [0063] The speech quality detector 220 may use the models 353, 354 to process audio data 111 and/or non-audio data 302 to determine one or more speech qualities to associate with an input spoken utterance. The speech quality detector 220 may then create an indicator for the determined speech quality/ies. The indicator may then be sent to a downstream command processor 290 so that a command/query may be processed using the indicator and based on the speech quality/ies. The command processor 290 receives the indicator, as well as text and possible other semantic notation related to the utterance, as discussed above in reference to FIG. 2. The command processor 290 may be a component capable of acting on the utterance.).

Regarding Claim 27, Basye et al. teach: The computer-implemented method of claim 26, wherein the image data represents a gesture (agitated, subdued, angry, etc.) of a user ([0019] The system may also determine (144) non-audio data corresponding to the utterance, for example time data as to when the utterance was received, location data of the utterance, image data associated with the user 10 at the time the utterance was spoken, etc. [0051] For example, based on audio (and possibly non-audio) paralinguistic feature data a system may determine that an input utterance was whispered. Whispered speech is typically "unvoiced," that is words are spoken using the articulators (mouth, lips, tongue, etc.) as normal, but without use/vibration of vocal cords such that an utterance has no resonance, or resonance below a certain threshold. Vocal resonance is when the product of voicing (i.e., phonation) is enhanced in tone quality (i.e., timbre) and/or intensity by the air-filled cavities through which speech passes on the speech's way to the outside air. During whispering, air comes through the throat without being modulated by the vocal cords so that what is left is motion of the articulators resulting in a stream of air without valve structure. [0055] The system may also consider non-audio data and non-audio features when determining a quality of the speech. For example, if a camera detects the speaker, the system may analyze the video data (for example, the video data may be input to the speech quality detector 220) to determine some quality of the speaker (agitated, subdued, angry, etc.) that the speech quality detector 220 may consider. Other non-audio data may also be input to the speech quality detector 220.).

Regarding Claim 28, Basye et al. teach: The computer-implemented method of claim 26, wherein the image data represents a face (image inherently includes face to determine agitated, subdued, angry) of a user ([0055] The system may also consider non-audio data and non-audio features when determining a quality of the speech. For example, if a camera detects the speaker, the system may analyze the video data (for example, the video data may be input to the speech quality detector 220) to determine some quality of the speaker (agitated, subdued, angry, etc.) that the speech quality detector 220 may consider. Other non-audio data may also be input to the speech quality detector 220.).

Regarding Claim 29, Basye et al. teach: The computer-implemented method of claim 21, wherein the first content corresponds to second audio data, and wherein the method further comprises: generating first metadata representing the second audio data (synthesize whispered speech) is to be output at a first volume level based at least in part on the first indicator; sending, to the first device, the second audio data; and sending the first metadata to the first device, the first metadata causing the first device to output audio corresponding to the second audio data at the first volume level (See rejection of Claim 22 and [0047] The output from the NLU processing (which may include tagged text, commands, etc.) may then be sent to a command processor 290, which may be located on a same or separate server 120 as part of system 100. The destination command processor 290 may be determined based on the NLU output. [0051] As noted below, a machine learning model may be trained to recognize whispered speech based on resonance, volume, and/or other features of input audio. While certain spoken whispered sounds may differ from voiced sounds more than others as a result of the lack of voicing or low volume, ASR performance may not necessarily be impacted. That is, current ASR systems may be able to process whispered speech. [0052] The system may be configured to recognize that input audio is whispered (which is separate from recognizing the words of whispered speech). For example the system may determine that the input speech has resonance below a threshold and/or a volume below a threshold. Thus the system may determine that the input speech has an input speech quality corresponding to a whisper/approximated whisper. The system may train components to analyze paralinguistic feature data to make a decision as to whether the speech is whispered. 
While the system may determine whether speech is whispered based on whether a particular paralinguistic feature value(s) are above a threshold (for example, whether input speech has a resonance under a particular threshold and/or a volume under a particular threshold, etc.), more complex decision making is possible using machine learning models and training techniques. Thus, paralinguistic feature values (whether from audio data or non-audio data) are input as features to a speech quality detector. The speech quality detector may implement a model trained using machine learning techniques to determine a label describing the speech. For example the detector may determine that the speech is whispered. The label (or other indicator of the speech quality) may then be sent to downstream components to alter the output of the device. [0064] For example, if the command processor 290 is a music player, and the utterance included a request to play music, only did not specify a particular music title, the command processor 290 may use the indicator of speech quality to select a music title. Specifically, if a user shouts, in an excited manner, "PLAY SOME MUSIC!!" the speech quality detector 220 may send an indicator to the command processor that the speech had a quality of excitement and the NLU module 260 may send the command processor 290 text and semantic indicators that the utterance included a request to play music. The command processor 290 may then select a music title to play based on the quality of excitement and may thus select a rock song or similar up-tempo song from a user's catalog. In another example, if a user whispers "play some music," the speech quality detector 220 may send an indicator to the command processor that the speech was whispered and the NLU module 260 may send the command processor 290 text and semantic indicators that the utterance included a request to play music. 
The command processor 290 may then select a music title to play based on the quality of being whispered and may thus select a mellow or calm song from a user's catalog. Similar selections of actions by different command processors 290 outside the domain of music are also envisioned. As another example, volume of output may be decreased as a result of whispered input speech, or volume increased as a result of excited speech, or the like.).
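
For illustration only, the volume adjustment cited from paragraph [0064] (output volume decreased for whispered input, increased for excited input) may be sketched as metadata generated from the indicator. This is a hypothetical sketch and not code from Basye et al.: the function name, the 0 to 100 volume scale, the default level, and the adjustment amounts are all assumptions.

```python
def volume_metadata(indicator: str, default_volume: int = 50) -> dict:
    """Build hypothetical metadata telling a device what volume level to use."""
    # Assumed adjustments: quieter for whispered input, louder for excited input.
    adjustments = {"whispered": -30, "excited": 20}
    level = default_volume + adjustments.get(indicator, 0)
    # Clamp to the assumed 0-100 scale.
    level = max(0, min(100, level))
    return {"volume_level": level, "based_on": indicator}


if __name__ == "__main__":
    for label in ("whispered", "excited", "normal"):
        print(label, volume_metadata(label))
```

The metadata is kept separate from the audio itself so that, as in the claim language, the audio data and the volume instruction can be sent to the device independently.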

Regarding Claim 30, Basye et al. teach: The computer-implemented method of claim 21, wherein causing output of the first content comprises causing a second device (TTS module), different from the first device, to output the first content ([0047] The output from the NLU processing (which may include tagged text, commands, etc.) may then be sent to a command processor 290, which may be located on a same or separate server 120 as part of system 100. The destination command processor 290 may be determined based on the NLU output. For example, if the NLU output includes a command to play music, the destination command processor 290 may be a music playing application, such as one located on device 110 or in a music playing appliance, configured to execute a music playing command. [0064] In another example, if a user whispers "play some music," the speech quality detector 220 may send an indicator to the command processor that the speech was whispered and the NLU module 260 may send the command processor 290 text and semantic indicators that the utterance included a request to play music. The command processor 290 may then select a music title to play based on the quality of being whispered and may thus select a mellow or calm song from a user's catalog. Similar selections of actions by different command processors 290 outside the domain of music are also envisioned. As another example, volume of output may be decreased as a result of whispered input speech, or volume increased as a result of excited speech, or the like. [0066] A TTS module 314 may receive the indicator of input speech quality and may configure an output speech quality (if output speech is called for) to correspond to (or even match or approximate) the input speech quality. For example, if a user whispers an utterance including a query to a device 110, the device may send the audio to a server 120. 
The server may process the audio with a speech quality detector 220 to determine the utterance was whispered and to send an indicator that the speech was whispered to the TTS module 314. The server (or another server) may perform ASR and NLU processing to identify text associated with the query. A command processor 290 may then process the text to determine a textual answer responding to the query. The textual answer may be sent to the TTS module 314 so the TTS module 314 may synthesize speech corresponding to the textual answer. However the TTS module 314 may, based on the indicator, synthesize whispered speech (or speech configured to approximate a whisper) to output to the user. In a broader example, the TTS module 314 may synthesize speech based on one or more speech qualities of the input speech as detected by the speech quality detector 220. Speech may be synthesized by the TTS module as described below.).

Regarding Claim 31, Basye et al. teach: A system comprising: at least one processor; and at least one memory comprising instructions that, when executed by the at least one processor, cause the system to: receiving, from a first device (device 110 including microphone 104), first audio data representing first speech (a spoken utterance) ([0021] An audio capture component, such as a microphone of device 110, captures audio 11 corresponding to a spoken utterance. The device sends audio data 111 corresponding to the utterance, to an ASR module 250.); determining a first indicator (indicator of the whisper) representing a first speech quality (quality of speech or whisper) corresponding to the first speech; performing speech processing on the first audio data to determine natural language understanding (NLU) (semantic interpretation of the text) results data (intent) ([0030] The NLU process takes textual input (such as processed from ASR 250 based on the utterance 11) and attempts to make a semantic interpretation of the text. That is, the NLU process determines the meaning behind the text based on the individual words and then implements that meaning. NLU processing 260 interprets a text string to derive an intent or a desired action from the user as well as the pertinent pieces of information in the text that allow a device (e.g., device 110) to complete that action. For example, if a spoken utterance is processed using ASR 250 and outputs the text "call mom" the NLU process may determine that the user intended to activate a telephone in his/her device and to initiate a call with a contact matching the entity "mom." [0037] An intent classification (IC) module 264 parses the query to determine an intent or intents for each identified domain, where the intent corresponds to the action to be performed that is responsive to the query. Each domain is associated with a database (278a-278n) of words linked to intents. 
[0060] The model may be constructed based on the training utterances and then disseminated to individual devices 110 or to server(s) 120. A speech quality detector 220 may then use the model(s) to make decisions at runtime as to whether the utterance was whispered. An indicator of the whisper may then be output from the speech quality detector 220 to downstream components such as a command processor 290, TTS module 314, etc.); determining a first component (destination command processor 290) associated with the NLU results data ([0047] The output from the NLU processing (which may include tagged text, commands, etc.) may then be sent to a command processor 290, which may be located on a same or separate server 120 as part of system 100. The destination command processor 290 may be determined based on the NLU output.); sending, to the first component, the NLU results data ([0052] The speech quality detector may implement a model trained using machine learning techniques to determine a label describing the speech. For example the detector may determine that the speech is whispered. The label (or other indicator of the speech quality) may then be sent to downstream components to alter the output of the device. [0063] The speech quality detector 220 may use the models 353, 354 to process audio data 111 and/or non-audio data 302 to determine one or more speech qualities to associate with an input spoken utterance. The speech quality detector 220 may then create an indicator for the determined speech quality/ies. The indicator may then be sent to a downstream command processor 290 so that a command/query may be processed using the indicator and based on the speech quality/ies. The command processor 290 receives the indicator, as well as text and possible other semantic notation related to the utterance, as discussed above in reference to FIG. 
2.); receiving (TTS module), from the first component (command processor 290), first content (textual answer) responsive to the first speech (a spoken utterance); performing, based at least in part on the first indicator, text-to-speech (TTS) processing using text data corresponding to the first content to determine second audio data (synthesize whispered speech) ([0066] A command processor 290 may then process the text to determine a textual answer responding to the query. The textual answer may be sent to the TTS module 314 so the TTS module 314 may synthesize speech corresponding to the textual answer. However the TTS module 314 may, based on the indicator, synthesize whispered speech (or speech configured to approximate a whisper) to output to the user.).

Regarding Claim 32, Basye et al. teach: The system of claim 31, wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the system to: determine the first content does not correspond to the first speech quality; and in response to determining that the first content does not correspond to the first speech quality, generate first metadata representing the TTS processing is to be performed based at least in part on the first speech quality (See rejection of Claim 31 and [0060] For example, a model, such as an SVM classifier, may be trained to recognize when an input speech utterance is whispered using many different training utterances, each labeled either "whispered" or "not whispered." Each training utterance may also be associated with various feature data corresponding to the respective utterance, where the feature data indicates values for the acoustic and/or non-audio paralinguistic features that may be used to determine if a future utterance was whispered. The model may be constructed based on the training utterances and then disseminated to individual devices 110 or to server(s) 120. A speech quality detector 220 may then use the model(s) to make decisions at runtime as to whether the utterance was whispered. An indicator of the whisper may then be output from the speech quality detector 220 to downstream components such as a command processor 290, TTS module 314, etc. The system may then tailor its operations and/or output based on the fact that the utterance was, or was not, whispered.).

Regarding Claim 33, Basye et al. teach: The system of claim 31, wherein the first speech quality corresponds to a first emotion (whisper) of the first speech (See rejection of Claim 31).

Regarding Claim 34, Basye et al. teach: The system of claim 31, wherein the second audio data matches the first speech quality (See rejection of Claim 32, specifically [0060] A speech quality detector 220 may then use the model(s) to make decisions at runtime as to whether the utterance was whispered. An indicator of the whisper may then be output from the speech quality detector 220 to downstream components such as a command processor 290, TTS module 314, etc. The system may then tailor its operations and/or output based on the fact that the utterance was, or was not, whispered.).

Regarding Claim 35, Basye et al. teach: The system of claim 31, wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the system to: process the first audio data to determine the first indicator (See rejection of Claim 31 and [0063] The speech quality detector 220 may use the models 353, 354 to process audio data 111 and/or non-audio data 302 to determine one or more speech qualities to associate with an input spoken utterance. The speech quality detector 220 may then create an indicator for the determined speech quality/ies. The indicator may then be sent to a downstream command processor 290 so that a command/query may be processed using the indicator and based on the speech quality/ies. The command processor 290 receives the indicator, as well as text and possible other semantic notation related to the utterance, as discussed above in reference to FIG. 2.).

Regarding Claim 36, Basye et al. teach: The system of claim 31, wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the system to: process image data corresponding to the first speech to determine the first indicator ([0019] The system may also determine (144) non-audio data corresponding to the utterance, for example time data as to when the utterance was received, location data of the utterance, image data associated with the user 10 at the time the utterance was spoken, etc. The system may perform (146) ASR to determine utterance text. The system may then determine (148) one or more utterance speech qualities using the trained model(s), the audio data and the non-audio data. For example, a model configured to determine whether speech was whispered may analyze various audio data feature values to classify the utterance as whispered. The system may then perform (150) one or more operations resulting in output based on the utterance text and the speech quality/ies. [0049] The present system is actually configured to detect speech quality/qualities and determine a label corresponding to the detected qualities that may be applied to an utterance in the speech and used for later processing. The speech quality may be based on paralinguistic metrics that describe some quality/feature other than the specific words spoken. [0051] For example, based on audio (and possibly non-audio) paralinguistic feature data a system may determine that an input utterance was whispered. [0052] Thus, paralinguistic feature values (whether from audio data or non-audio data) are input as features to a speech quality detector. The speech quality detector may implement a model trained using machine learning techniques to determine a label describing the speech. For example the detector may determine that the speech is whispered. 
The label (or other indicator of the speech quality) may then be sent to downstream components to alter the output of the device. [0055] The system may also consider non-audio data and non-audio features when determining a quality of the speech. For example, if a camera detects the speaker, the system may analyze the video data (for example, the video data may be input to the speech quality detector 220) to determine some quality of the speaker (agitated, subdued, angry, etc.) that the speech quality detector 220 may consider. Other non-audio data may also be input to the speech quality detector 220. For example, time/date data, location data (for example GPS location or relative indoor room location), ambient light data from a light sensor, the identity of other nearby individuals to the speaker, proximity of the user to a device (for example, if a user is leaning in close to a device to speak an utterance, or if a user is far away from the device), etc. The types of acoustic and non-audio data considered by the speech quality detector 220 depends on the types of such data available to the system 100 when processing an utterance. [0063] The speech quality detector 220 may use the models 353, 354 to process audio data 111 and/or non-audio data 302 to determine one or more speech qualities to associate with an input spoken utterance. The speech quality detector 220 may then create an indicator for the determined speech quality/ies. The indicator may then be sent to a downstream command processor 290 so that a command/query may be processed using the indicator and based on the speech quality/ies. The command processor 290 receives the indicator, as well as text and possible other semantic notation related to the utterance, as discussed above in reference to FIG. 2. The command processor 290 may be a component capable of acting on the utterance.).

Regarding Claim 37, Basye et al. teach: The system of claim 36, wherein the image data represents a gesture (agitated, subdued, angry, etc.) of a user ([0019] The system may also determine (144) non-audio data corresponding to the utterance, for example time data as to when the utterance was received, location data of the utterance, image data associated with the user 10 at the time the utterance was spoken, etc. [0051] For example, based on audio (and possibly non-audio) paralinguistic feature data a system may determine that an input utterance was whispered. Whispered speech is typically "unvoiced," that is words are spoken using the articulators (mouth, lips, tongue, etc.) as normal, but without use/vibration of vocal cords such that an utterance has no resonance, or resonance below a certain threshold. Vocal resonance is when the product of voicing (i.e., phonation) is enhanced in tone quality (i.e., timbre) and/or intensity by the air-filled cavities through which speech passes on the speech's way to the outside air. During whispering, air comes through the throat without being modulated by the vocal cords so that what is left is motion of the articulators resulting in a stream of air without valve structure. [0055] The system may also consider non-audio data and non-audio features when determining a quality of the speech. For example, if a camera detects the speaker, the system may analyze the video data (for example, the video data may be input to the speech quality detector 220) to determine some quality of the speaker (agitated, subdued, angry, etc.) that the speech quality detector 220 may consider. Other non-audio data may also be input to the speech quality detector 220.).

Regarding Claim 38, Basye et al. teach: The system of claim 36, wherein the image data represents a face (the image inherently includes a face in order to determine agitated, subdued, angry) of a user ([0055] The system may also consider non-audio data and non-audio features when determining a quality of the speech. For example, if a camera detects the speaker, the system may analyze the video data (for example, the video data may be input to the speech quality detector 220) to determine some quality of the speaker (agitated, subdued, angry, etc.) that the speech quality detector 220 may consider. Other non-audio data may also be input to the speech quality detector 220.).

Regarding Claim 39, Basye et al. teach: The system of claim 31, wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the system to: generating first metadata representing the second audio data is to be output at a first volume level based at least in part on the first indicator; sending, to the first device, the second audio data; and sending the first metadata to the first device, the first metadata causing the first device to output audio corresponding to the second audio data at the first volume level (See rejection of Claim 32 and [0047] The output from the NLU processing (which may include tagged text, commands, etc.) may then be sent to a command processor 290, which may be located on a same or separate server 120 as part of system 100. The destination command processor 290 may be determined based on the NLU output. [0051] As noted below, a machine learning model may be trained to recognize whispered speech based on resonance, volume, and/or other features of input audio. While certain spoken whispered sounds may differ from voiced sounds more than others as a result of the lack of voicing or low volume, ASR performance may not necessarily be impacted. That is, current ASR systems may be able to process whispered speech. [0052] The system may be configured to recognize that input audio is whispered (which is separate from recognizing the words of whispered speech). For example the system may determine that the input speech has resonance below a threshold and/or a volume below a threshold. Thus the system may determine that the input speech has an input speech quality corresponding to a whisper/approximated whisper. The system may train components to analyze paralinguistic feature data to make a decision as to whether the speech is whispered. 
While the system may determine whether speech is whispered based on whether a particular paralinguistic feature value(s) are above a threshold (for example, whether input speech has a resonance under a particular threshold and/or a volume under a particular threshold, etc.), more complex decision making is possible using machine learning models and training techniques. Thus, paralinguistic feature values (whether from audio data or non-audio data) are input as features to a speech quality detector. The speech quality detector may implement a model trained using machine learning techniques to determine a label describing the speech. For example the detector may determine that the speech is whispered. The label (or other indicator of the speech quality) may then be sent to downstream components to alter the output of the device. [0064] For example, if the command processor 290 is a music player, and the utterance included a request to play music, only did not specify a particular music title, the command processor 290 may use the indicator of speech quality to select a music title. Specifically, if a user shouts, in an excited manner, "PLAY SOME MUSIC!!" the speech quality detector 220 may send an indicator to the command processor that the speech had a quality of excitement and the NLU module 260 may send the command processor 290 text and semantic indicators that the utterance included a request to play music. The command processor 290 may then select a music title to play based on the quality of excitement and may thus select a rock song or similar up-tempo song from a user's catalog. In another example, if a user whispers "play some music," the speech quality detector 220 may send an indicator to the command processor that the speech was whispered and the NLU module 260 may send the command processor 290 text and semantic indicators that the utterance included a request to play music. 
The command processor 290 may then select a music title to play based on the quality of being whispered and may thus select a mellow or calm song from a user's catalog. Similar selections of actions by different command processors 290 outside the domain of music are also envisioned. As another example, volume of output may be decreased as a result of whispered input speech, or volume increased as a result of excited speech, or the like.).
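For illustration only, the threshold-based whisper determination and the resulting volume adjustment described in the cited paragraphs [0052] and [0064] might be sketched as follows. This sketch is not taken from Basye et al.: the names `Utterance`, `speech_quality_indicator`, and `output_volume_metadata`, the numeric thresholds, the volume scale, and the choice to require both conditions (the reference says "and/or") are all assumptions.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the cited reference gives no numeric values.
RESONANCE_THRESHOLD = 0.3
VOLUME_THRESHOLD_DB = 40.0


@dataclass
class Utterance:
    """Paralinguistic feature values for one input utterance (assumed shape)."""
    resonance: float
    volume_db: float


def speech_quality_indicator(utterance: Utterance) -> str:
    """Return an indicator of speech quality using simple thresholds.

    Assumption: speech counts as whispered only when both resonance and
    volume fall below their thresholds.
    """
    if (utterance.resonance < RESONANCE_THRESHOLD
            and utterance.volume_db < VOLUME_THRESHOLD_DB):
        return "whispered"
    return "normal"


def output_volume_metadata(indicator: str, default_volume: int = 5) -> dict:
    """Build metadata telling the device what volume level to use for output.

    Volume is lowered for whispered input and raised for excited input,
    clamped to an assumed 1-10 scale.
    """
    if indicator == "whispered":
        return {"volume_level": max(1, default_volume - 3)}
    if indicator == "excited":
        return {"volume_level": min(10, default_volume + 3)}
    return {"volume_level": default_volume}
```

Under these assumptions, an utterance with low resonance and low volume yields the indicator "whispered", and the metadata sent to the device carries a reduced volume level.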

Regarding Claim 40, Basye et al. teach: The system of claim 31, wherein the instructions that cause output of the first content comprise instructions that, when executed by the at least one processor, further cause the system to cause a second device, different from the first device, to output the first content ([0047] The output from the NLU processing (which may include tagged text, commands, etc.) may then be sent to a command processor 290, which may be located on a same or separate server 120 as part of system 100. The destination command processor 290 may be determined based on the NLU output. For example, if the NLU output includes a command to play music, the destination command processor 290 may be a music playing application, such as one located on device 110 or in a music playing appliance, configured to execute a music playing command. [0064] In another example, if a user whispers "play some music," the speech quality detector 220 may send an indicator to the command processor that the speech was whispered and the NLU module 260 may send the command processor 290 text and semantic indicators that the utterance included a request to play music. The command processor 290 may then select a music title to play based on the quality of being whispered and may thus select a mellow or calm song from a user's catalog. Similar selections of actions by different command processors 290 outside the domain of music are also envisioned. As another example, volume of output may be decreased as a result of whispered input speech, or volume increased as a result of excited speech, or the like. [0066] A TTS module 314 may receive the indicator of input speech quality and may configure an output speech quality (if output speech is called for) to correspond to (or even match or approximate) the input speech quality. For example, if a user whispers an utterance including a query to a device 110, the device may send the audio to a server 120. 
The server may process the audio with a speech quality detector 220 to determine the utterance was whispered and to send an indicator that the speech was whispered to the TTS module 314. The server (or another server) may perform ASR and NLU processing to identify text associated with the query. A command processor 290 may then process the text to determine a textual answer responding to the query. The textual answer may be sent to the TTS module 314 so the TTS module 314 may synthesize speech corresponding to the textual answer. However the TTS module 314 may, based on the indicator, synthesize whispered speech (or speech configured to approximate a whisper) to output to the user. In a broader example, the TTS module 314 may synthesize speech based on one or more speech qualities of the input speech as detected by the speech quality detector 220. Speech may be synthesized by the TTS module as described below.).

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. McCord et al. (US 2018/0082679 A1) teach a system and method for emotion-enhanced natural speech using dilated convolutional neural networks, wherein an audio processing server receives a raw audio waveform from a dilated convolutional artificial neural network, associates text-based emotion content markers with portions of the raw audio waveform to produce an emotion-enhanced audio waveform, and provides the emotion-enhanced audio waveform to the dilated convolutional artificial neural network for use as a new input data set.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MOHAMMAD K ISLAM, whose telephone number is (571)270-5878. The examiner can normally be reached Monday-Friday, EST (IFP).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached at 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




/MOHAMMAD K ISLAM/Primary Examiner, Art Unit 2656