DETAILED ACTION
Claims 1-28 are pending.
This communication is in response to the communication filed 10/29/2019.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Objections
Claim 9 is objected to because of the following informalities: the claim has unclear antecedent basis. It is suggested that claim 9 be amended to recite “audio input signal.” Appropriate correction is required.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claims 10, 14, 24, and 28 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Basye (US 9548053 B1).

As per claim 10, Basye teaches a method comprising:
receiving, at a hotword detector of a user device (see Basye FIG. 1, item 102/114) audio input data containing a hotword, the hotword configured to initiate a wake-up process on the user device for processing the hotword and/or one or more other terms following the hotword in the audio input data (see Basye col. 2, lines 26-59, which notes A device that is configured to recognize the wake word may detect the wake word and inadvertently activate in response to detecting the wake word in the audio of the advertisement and execute any commands following the wake word. The people with devices that react to the wake word may experience their devices inadvertently waking up and potentially executing commands following the wake word thereby interfering with the television watching experience. Further, if the advertisement is shown as part of a broadcast watched by a large population (such as during a popular sporting event), the advertisement may result in many devices “waking up” at the same time. If those devices are programmed to connect to a central server upon waking up, the central server may become overwhelmed with many devices activating at the same time. These same principals apply to non-wake words and other commands that may inadvertently cause one or more devices to wake up and/or execute commands. Offered are a number of techniques to avoid responding to inadvertent wake words and executing inadvertent commands.  To avoid responding to an inadvertent wake word and executing an inadvertent audible command (for example, those of advertisements, broadcasts, etc.), a device may have access to stored audio recordings corresponding to inadvertent wake words and inadvertent audible commands, for example, audio samples of commercials, programs, etc. that include the particular wake words and/or command(s). When the device detects a wake word or an audible command, it may compare the audio of the detected wake word or command to audio stored in a data store. 
If the detected audio matches the stored audio the device may determine that the detected audio is part of an advertisement, etc. and is therefore an inadvertent wake word or command that the device may ignore/disregard/abort/cancel; see Basye FIG. 1, item 114; and see Basye FIG. 4, items 404 and 406);
determining, by the hotword detector, whether the audio input data comprises synthesized speech using a hotword detector model configured to detect the hotword in the audio input data and a presence of synthesized speech (see Basye col. 2, lines 26-59, which notes A device that is configured to recognize the wake word may detect the wake word and inadvertently activate in response to detecting the wake word in the audio of the advertisement and execute any commands following the wake word. The people with devices that react to the wake word may experience their devices inadvertently waking up and potentially executing commands following the wake word thereby interfering with the television watching experience. Further, if the advertisement is shown as part of a broadcast watched by a large population (such as during a popular sporting event), the advertisement may result in many devices “waking up” at the same time. If those devices are programmed to connect to a central server upon waking up, the central server may become overwhelmed with many devices activating at the same time. These same principals apply to non-wake words and other commands that may inadvertently cause one or more devices to wake up and/or execute commands. Offered are a number of techniques to avoid responding to inadvertent wake words and executing inadvertent commands.  To avoid responding to an inadvertent wake word and executing an inadvertent audible command (for example, those of advertisements, broadcasts, etc.), a device may have access to stored audio recordings corresponding to inadvertent wake words and inadvertent audible commands, for example, audio samples of commercials, programs, etc. that include the particular wake words and/or command(s). When the device detects a wake word or an audible command, it may compare the audio of the detected wake word or command to audio stored in a data store. 
If the detected audio matches the stored audio the device may determine that the detected audio is part of an advertisement, etc. and is therefore an inadvertent wake word or command that the device may ignore/disregard/abort/cancel; see Basye FIG. 1, item 116; and see Basye FIG. 4, items 408 and 410); and
when the audio input data comprises synthesized speech, preventing, by the hotword detector, initiation of the wake-up process on the user device for processing the hotword and/or the one or more other terms following the hotword in the audio input data (see Basye col. 2, lines 26-59, which notes A device that is configured to recognize the wake word may detect the wake word and inadvertently activate in response to detecting the wake word in the audio of the advertisement and execute any commands following the wake word. The people with devices that react to the wake word may experience their devices inadvertently waking up and potentially executing commands following the wake word thereby interfering with the television watching experience. Further, if the advertisement is shown as part of a broadcast watched by a large population (such as during a popular sporting event), the advertisement may result in many devices “waking up” at the same time. If those devices are programmed to connect to a central server upon waking up, the central server may become overwhelmed with many devices activating at the same time. These same principals apply to non-wake words and other commands that may inadvertently cause one or more devices to wake up and/or execute commands. Offered are a number of techniques to avoid responding to inadvertent wake words and executing inadvertent commands.  To avoid responding to an inadvertent wake word and executing an inadvertent audible command (for example, those of advertisements, broadcasts, etc.), a device may have access to stored audio recordings corresponding to inadvertent wake words and inadvertent audible commands, for example, audio samples of commercials, programs, etc. that include the particular wake words and/or command(s). When the device detects a wake word or an audible command, it may compare the audio of the detected wake word or command to audio stored in a data store. 
If the detected audio matches the stored audio the device may determine that the detected audio is part of an advertisement, etc. and is therefore an inadvertent wake word or command that the device may ignore/disregard/abort/cancel; see Basye FIG. 1, items 124 and 126; and see Basye FIG. 4, item 412).
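
For illustration only, the comparison described in the cited passage of Basye (detected audio compared against stored audio, with the wake word disregarded on a match) may be sketched as follows. The fingerprint representation as a list of numbers, the cosine similarity measure, the 0.9 threshold, and the function names are assumptions made for this sketch; none of them is taken from Basye or from the application.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric fingerprints."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero fingerprint matches nothing
    return dot / (norm_a * norm_b)


def decide_wake(captured, stored_fingerprints, threshold=0.9):
    """Return "ignore" if the captured audio matches any stored sample.

    captured: hypothetical fingerprint of the detected wake-word audio.
    stored_fingerprints: hypothetical fingerprints of known broadcasts
    or advertisements that contain the wake word.
    threshold: assumed similarity cutoff for a "substantial" match.
    """
    for fingerprint in stored_fingerprints:
        if cosine_similarity(captured, fingerprint) >= threshold:
            return "ignore"  # treated as an inadvertent wake word
    return "process"  # no match: treated as a user utterance


if __name__ == "__main__":
    stored = [[1.0, 0.0, 1.0]]
    print(decide_wake([1.0, 0.0, 1.0], stored))
    print(decide_wake([0.0, 1.0, 0.0], stored))
```

In this sketch the "ignore" branch corresponds loosely to the disregard/abort outcome and the "process" branch to ordinary handling of the wake word; the mapping is schematic and is not a characterization of any figure.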

As per claim 24, Basye teaches a system comprising:
data processing hardware of a user device; and memory hardware in communication with the data processing hardware, the memory hardware storing instructions that when executed by the data processing hardware cause the data processing hardware to perform operations (see Basye col. 5, lines 30-37, which notes Computer instructions for operating the computing device 200 and its various components may be executed by the controller(s)/processor(s) 212, using the memory 214 as temporary “working” storage at runtime. The computer instructions may be stored in a non-transitory manner in non-volatile memory 214, storage 216, or an external device. Alternatively, some or all of the executable instructions may be embedded in hardware or firmware in addition to or instead of software) comprising:
receiving, at a hotword detector of the user device (see Basye FIG. 1, item 102/114), audio input data containing a hotword, the hotword configured to initiate a wake-up process on the user device for processing the hotword and/or one or more other terms following the hotword in the audio input data (see Basye col. 2, lines 26-59, which notes A device that is configured to recognize the wake word may detect the wake word and inadvertently activate in response to detecting the wake word in the audio of the advertisement and execute any commands following the wake word. The people with devices that react to the wake word may experience their devices inadvertently waking up and potentially executing commands following the wake word thereby interfering with the television watching experience. Further, if the advertisement is shown as part of a broadcast watched by a large population (such as during a popular sporting event), the advertisement may result in many devices “waking up” at the same time. If those devices are programmed to connect to a central server upon waking up, the central server may become overwhelmed with many devices activating at the same time. These same principals apply to non-wake words and other commands that may inadvertently cause one or more devices to wake up and/or execute commands. Offered are a number of techniques to avoid responding to inadvertent wake words and executing inadvertent commands.  To avoid responding to an inadvertent wake word and executing an inadvertent audible command (for example, those of advertisements, broadcasts, etc.), a device may have access to stored audio recordings corresponding to inadvertent wake words and inadvertent audible commands, for example, audio samples of commercials, programs, etc. that include the particular wake words and/or command(s). When the device detects a wake word or an audible command, it may compare the audio of the detected wake word or command to audio stored in a data store. 
If the detected audio matches the stored audio the device may determine that the detected audio is part of an advertisement, etc. and is therefore an inadvertent wake word or command that the device may ignore/disregard/abort/cancel; see Basye FIG. 1, item 114; and see Basye FIG. 4, items 404 and 406);
determining, by the hotword detector, whether the audio input data comprises synthesized speech using a hotword detector model configured to detect the hotword in the audio input data and a presence of synthesized speech (see Basye col. 2, lines 26-59, which notes A device that is configured to recognize the wake word may detect the wake word and inadvertently activate in response to detecting the wake word in the audio of the advertisement and execute any commands following the wake word. The people with devices that react to the wake word may experience their devices inadvertently waking up and potentially executing commands following the wake word thereby interfering with the television watching experience. Further, if the advertisement is shown as part of a broadcast watched by a large population (such as during a popular sporting event), the advertisement may result in many devices “waking up” at the same time. If those devices are programmed to connect to a central server upon waking up, the central server may become overwhelmed with many devices activating at the same time. These same principals apply to non-wake words and other commands that may inadvertently cause one or more devices to wake up and/or execute commands. Offered are a number of techniques to avoid responding to inadvertent wake words and executing inadvertent commands.  To avoid responding to an inadvertent wake word and executing an inadvertent audible command (for example, those of advertisements, broadcasts, etc.), a device may have access to stored audio recordings corresponding to inadvertent wake words and inadvertent audible commands, for example, audio samples of commercials, programs, etc. that include the particular wake words and/or command(s). When the device detects a wake word or an audible command, it may compare the audio of the detected wake word or command to audio stored in a data store. 
If the detected audio matches the stored audio the device may determine that the detected audio is part of an advertisement, etc. and is therefore an inadvertent wake word or command that the device may ignore/disregard/abort/cancel; see Basye FIG. 1, item 116; and see Basye FIG. 4, items 408 and 410); and
when the audio input data comprises synthesized speech, preventing, by the hotword detector, initiation of the wake-up process on the user device for processing the hotword and/or the one or more other terms following the hotword in the audio input data (see Basye col. 2, lines 26-59, which notes A device that is configured to recognize the wake word may detect the wake word and inadvertently activate in response to detecting the wake word in the audio of the advertisement and execute any commands following the wake word. The people with devices that react to the wake word may experience their devices inadvertently waking up and potentially executing commands following the wake word thereby interfering with the television watching experience. Further, if the advertisement is shown as part of a broadcast watched by a large population (such as during a popular sporting event), the advertisement may result in many devices “waking up” at the same time. If those devices are programmed to connect to a central server upon waking up, the central server may become overwhelmed with many devices activating at the same time. These same principals apply to non-wake words and other commands that may inadvertently cause one or more devices to wake up and/or execute commands. Offered are a number of techniques to avoid responding to inadvertent wake words and executing inadvertent commands.  To avoid responding to an inadvertent wake word and executing an inadvertent audible command (for example, those of advertisements, broadcasts, etc.), a device may have access to stored audio recordings corresponding to inadvertent wake words and inadvertent audible commands, for example, audio samples of commercials, programs, etc. that include the particular wake words and/or command(s). When the device detects a wake word or an audible command, it may compare the audio of the detected wake word or command to audio stored in a data store. 
If the detected audio matches the stored audio the device may determine that the detected audio is part of an advertisement, etc. and is therefore an inadvertent wake word or command that the device may ignore/disregard/abort/cancel; see Basye FIG. 1, items 124 and 126; and see Basye FIG. 4, item 412).

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 11-13 and 25-27 are rejected under 35 U.S.C. 103 as being unpatentable over Basye (US 9548053 B1) in view of Foerster (US 9443517 A1).

As per claims 11 and 25, Basye teaches all of the limitations of claims 10 and 24 above.
Basye further teaches, wherein
the hotword detector model is trained on a plurality of training samples comprising:
Basye fails to specifically teach all of positive training samples comprising human-generated audio data corresponding to one or more users speaking the hotword assigned to the user device.
However, Foerster does teach 
positive training samples comprising human-generated audio data corresponding to one or more users speaking the hotword assigned to the user device  (see Foerster Abstract, which notes One of the methods includes accessing a first neural network that was trained to recognize a given keyword or keyphrase using a set of hotword training data, wherein the hotword training data includes positive hotword training data that correspond to utterances of the keyword or keyphrase, and negative hotword training data that corresponds to utterances of words or phrases that are other than the keyword or keyphrase, selecting a seed hotsound, mapping, to a feature space, (i) the positive hotword training data, (ii) the negative hotword training data, and (iii) the seed hotsound, performing an optimization of a position of the seed hotsound within the feature space to generate a modified seed hotsound, generating a set of hotsound training data using the modified seed hotsound, training a second neural network to recognize the modified seed hotsound using the generated set of hotsound training data, and using the trained second neural network to recognize the modified hotsound); and
negative training samples comprising synthesized speech utterances output from one or more speech synthesizer devices (see Foerster Abstract, which notes One of the methods includes accessing a first neural network that was trained to recognize a given keyword or keyphrase using a set of hotword training data, wherein the hotword training data includes positive hotword training data that correspond to utterances of the keyword or keyphrase, and negative hotword training data that corresponds to utterances of words or phrases that are other than the keyword or keyphrase, selecting a seed hotsound, mapping, to a feature space, (i) the positive hotword training data, (ii) the negative hotword training data, and (iii) the seed hotsound, performing an optimization of a position of the seed hotsound within the feature space to generate a modified seed hotsound, generating a set of hotsound training data using the modified seed hotsound, training a second neural network to recognize the modified seed hotsound using the generated set of hotsound training data, and using the trained second neural network to recognize the modified hotsound). 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the systems and methods as taught by Basye with the modified seed hotsound of Foerster in order to move a modified hotsound to a position in feature space away from nearby clusters of hotword training data (see Foerster, col. 2, lines 7-18, which notes, in additional aspects, performing an optimization of the position of the seed hotsound in the feature space comprises identifying clusters of hotword training data in the feature space by performing a clustering algorithm; and performing a gradient descent algorithm to gradually modify the position of the seed hotsound such that the position of the seed hotsound moves further away from nearby clusters of hotword training data.  In some implementations the clusters of hotword training data correspond to hotsounds that will cause false activations for the seed hotsound).
The combination of Basye with Foerster includes predictable results, such as an activation of a device using a modified hotsound.
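
For illustration only, the training-sample arrangement recited in claims 11 and 25 (positive samples of human-generated audio speaking the hotword, negative samples of synthesized speech utterances) may be sketched as a labeling step. The `TrainingSample` record, its fields, and the labeling functions are hypothetical names introduced for this sketch; they are not drawn from Basye, Foerster, or the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingSample:
    """Hypothetical metadata record for one audio clip."""
    audio_id: str
    source: str  # assumed values: "human" or "synthesized"
    contains_hotword: bool


def label_sample(sample):
    """Return 1 (positive), 0 (negative), or None (not used).

    Assumption for this sketch: every synthesized utterance is a
    negative sample whether or not it pronounces the hotword, and
    only human utterances of the hotword are positive samples.
    """
    if sample.source == "synthesized":
        return 0
    if sample.source == "human" and sample.contains_hotword:
        return 1
    return None


def build_training_set(samples):
    """Pair each usable clip identifier with its label."""
    labeled = []
    for sample in samples:
        label = label_sample(sample)
        if label is not None:
            labeled.append((sample.audio_id, label))
    return labeled


if __name__ == "__main__":
    clips = [
        TrainingSample("a", "human", True),
        TrainingSample("b", "synthesized", True),
        TrainingSample("c", "synthesized", False),
    ]
    print(build_training_set(clips))
```

The sketch covers only how samples might be labeled; it does not train a model and does not represent the neural network training described in the Foerster Abstract.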

As per claims 12 and 26, Basye in view of Foerster teaches all of the limitations of claims 11 and 25 above.
Basye further teaches, wherein 
at least one of the synthesized speech utterances of the negative training samples pronounce the hotword assigned to the user device (see Basye col. 2, line 46—col. 3, line 15, which notes To avoid responding to an inadvertent wake word and executing an inadvertent audible command (for example, those of advertisements, broadcasts, etc.), a device may have access to stored audio recordings corresponding to inadvertent wake words and inadvertent audible commands, for example, audio samples of commercials, programs, etc. that include the particular wake words and/or command(s). When the device detects a wake word or an audible command, it may compare the audio of the detected wake word or command to audio stored in a data store. If the detected audio matches the stored audio the device may determine that the detected audio is part of an advertisement, etc. and is therefore an inadvertent wake word or command that the device may ignore/disregard/abort/cancel. The device may also send a recording of the detected audio to a remote device for determining whether the detected audio corresponds to a stored audio recording. In some aspects, the device may record and buffer portions of detected audio that precede and/or follow the detected wake word. The preceding and/or following portion(s) may be used to assist in identifying if the detected audio matches a stored audio sample, for example from the advertisement. Similarly, only portions of the detected audio may be captured and compared to stored audio, for example in the case where comparison of only small portions of audio may be sufficient to identify a detected wake word or command as inadvertent. Other techniques may also assist in preventing a device from responding to an inadvertent wake word or executing an inadvertent command. 
For example, the audio of the advertisement/program including the wake word or other audible command may also be configured to include an audio signal (for example a signal inaudible to humans) that indicates to a detecting device that the audio of the wake word or other audible command in the program is an inadvertent wake word or command and that the device should disregard that particular wake word or command. Other techniques are also possible). 

As per claims 13 and 27, Basye in view of Foerster teaches all of the limitations of claims 11 and 25 above.
Basye further teaches, wherein 
none of the synthesized speech utterances of the negative training samples pronounce the hotword assigned to the user device (see Basye FIG. 4, item 414; and see Basye col. 9, lines 43-58, which notes Alternatively, when the captured audio does not substantially match the stored audio fingerprint (i.e., the comparison is less than the threshold), the wake word and/or audible command corresponding to the captured audio may be processed, for example, by the local device, illustrated as block 414. In this case, the wake word and/or audible command corresponding to the captured audio is determined to be an utterance of a wake word and/or audible command by a user, and the local device may execute the command. It should be appreciated that one or more steps may be performed by the local device and one or more steps may be performed by other devices. For example, the steps described with respect to blocks 402 and 408-410 may be performed by a remote device and an instruction relating to blocks 412 or 414 may be sent from the remote device to the local device.).

As per claims 14 and 28, Basye teaches all of the limitations of claims 10 and 24 above.  
Basye further teaches, wherein determining whether the audio input data comprises synthesized speech comprises 
using the hotword detector model to detect the presence of synthesized speech in the audio input data through an analysis of acoustic features of the audio input data without transcribing or semantically interpreting the audio input data (see Basye col. 2, lines 46-59, which notes To avoid responding to an inadvertent wake word and executing an inadvertent audible command (for example, those of advertisements, broadcasts, etc.), a device may have access to stored audio recordings corresponding to inadvertent wake words and inadvertent audible commands, for example, audio samples of commercials, programs, etc. that include the particular wake words and/or command(s). When the device detects a wake word or an audible command, it may compare the audio of the detected wake word or command to audio stored in a data store. If the detected audio matches the stored audio the device may determine that the detected audio is part of an advertisement, etc. and is therefore an inadvertent wake word or command that the device may ignore/disregard/abort/cancel). 

Claims 1-3, 6, 15-17, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Wightman (US 10453460 B1) in view of Moore (US 20190149987 A1) and Lang (US 20190043492 A1).

As per claim 1, Wightman teaches a method comprising:
receiving, at data processing hardware of a speech synthesis device (see Wightman US 10453460 B1 col. 9, lines 4-9, which notes Backend system 100 may also include various modules that store software, hardware, logic, instructions, and/or commands for backend system 100 to perform, such as, for example, a speech-to-text ("STT") module and/or a text-to-speech ("TTS") module. A more detailed description of backend system 100 is provided below.  The text data may be analyzed to determine what command, action, or instruction is included within command 4.), text input data for conversion into synthesized speech (see Wightman US 10453460 B1 col. 9, lines 22-26, which notes After the audio data is analyzed and a response to command 4 is generated, speech, such as a response or answer to command 4 may be generated and converted from text into responsive audio data representing the response using TTS techniques); 
determining, by the data processing hardware and using a hotword-aware model trained to detect (see Wightman col. 2, lines 33-45, which notes As used herein, the term "wakeword" may correspond to a "keyword" or "key phrase," an "activation word" or "activation words," or a "trigger," "trigger word," or "trigger expression." One exemplary wakeword may be a name, such as the name, "Alexa," however persons of ordinary skill in the art will recognize that the any word (e.g., "Amazon"), or series of words (e.g., "Wake Up" or "Hello, Alexa") may alternatively be used as the wakeword. Furthermore, the wakeword may be set or programmed by an individual operating a voice activated electronic device, and in some embodiments more than one wakeword (e.g., two or more different wakewords) may be available to activate a voice activated electronic device; and see Wightman col. 13, lines 11-20, which notes In some embodiments, sound profiles for different words, phrases, commands, or audio compositions are also capable of being stored within storage/memory 204, such as within a sound profile database. For example, a sound profile of a video or of audio may be stored within the sound profile database of storage/memory 204 on voice activated electronic device 10. In this way, if a particular sound (e.g., a wakeword or phrase) is detected by voice activated electronic device 10, a corresponding command or request may be ignored, for example; and see Wightman col. 13, lines 45-51, which notes In some embodiments, a keyword spotter may use simplified ASR techniques. For example, an expression detector may use a Hidden Markov Model ("HMM") recognizer that performs acoustic modeling of the audio signal and compares the HMM model of the audio signal to one or more reference HMM models that have been created by training for specific trigger/hotword expressions)
a presence of at least one hotword assigned to a user device (see Wightman col. 13, lines 11-20, which notes In some embodiments, sound profiles for different words, phrases, commands, or audio compositions are also capable of being stored within storage/memory 204, such as within a sound profile database. For example, a sound profile of a video or of audio may be stored within the sound profile database of storage/memory 204 on voice activated electronic device 10. In this way, if a particular sound (e.g., a wakeword or phrase) is detected by voice activated electronic device 10, a corresponding command or request may be ignored, for example), 
whether a pronunciation of the text input data includes the hotword (see Wightman col. 7, lines 27-29, which notes Command 4 may include a wakeword; and see Wightman col. 8, lines 52-59, which notes backend system 100 may include automatic speech recognition functionality that may convert the audio data representing command 4 into text data, and may use the text data to determine one or more of the word(s) within command 4. Furthermore, backend system 100 may also include natural language understanding functionality for further processing and analyzing the text data representing command 4 to determine an intent of command 4), 
the hotword, when included in audio input data received by the user device, configured to initiate a wake-up process on the user device for processing the hotword and/or one or more other terms following the hotword in the audio input data (see Wightman col. 2, lines 33-45, which notes As used herein, the term "wakeword" may correspond to a "keyword" or "key phrase," an "activation word" or "activation words," or a "trigger," "trigger word," or "trigger expression." One exemplary wakeword may be a name, such as the name, "Alexa," however persons of ordinary skill in the art will recognize that the any word (e.g., "Amazon"), or series of words (e.g., "Wake Up" or "Hello, Alexa") may alternatively be used as the wakeword. Furthermore, the wakeword may be set or programmed by an individual operating a voice activated electronic device, and in some embodiments more than one wakeword (e.g., two or more different wakewords) may be available to activate a voice activated electronic device).
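
For illustration only, the determining step of claim 1 (whether a pronunciation of text input data includes a hotword before the text is converted into synthesized speech) may be sketched with a simple token comparison. This is a deliberately simplified stand-in: the claim recites a trained hotword-aware model, whereas the sketch below matches normalized word tokens. The function names and the example hotword list are assumptions made for this sketch and do not appear in Wightman or the application.

```python
import re


def normalize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def text_includes_hotword(text, hotwords):
    """Return True if any hotword occurs as a whole-token sequence.

    text: text input data intended for text-to-speech conversion.
    hotwords: hypothetical list of hotwords assigned to nearby devices.
    Token matching is an assumption standing in for a trained model.
    """
    tokens = normalize(text)
    for hotword in hotwords:
        hotword_tokens = normalize(hotword)
        n = len(hotword_tokens)
        if n == 0:
            continue  # skip empty hotword entries
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == hotword_tokens:
                return True
    return False


if __name__ == "__main__":
    hotwords = ["Alexa", "Hello Alexa"]
    print(text_includes_hotword("Please say Hello, Alexa now", hotwords))
    print(text_includes_hotword("Thanks for setting me up!", hotwords))
```

A text-only match of this kind cannot account for words that merely sound like a hotword when pronounced, which is one reason the sketch should not be read as equivalent to the claimed model.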
Wightman fails to specifically teach all of the following: when the pronunciation of the text input data includes the hotword: generating an audio output signal from the text input data; and providing, by the data processing hardware, the audio output signal to an audio output device to output the audio output signal, the audio output signal when captured by an audio capture device of the user device, configured to prevent initiation of the wake-up process on the user device.
However, Moore does teach 
when the pronunciation of the text input data includes the hotword: 
generating an audio output signal from the text input data (see Moore 20190149987 [0033], which notes After successfully setting up the second device 102(2), the second device 102(2) may emit a first text-to-speech (TTS) output 128(1) from a speaker(s) of the second device 102(2) (e.g., “Thanks for setting me up!”), and a microphone of the first device 102(1) may detect this first TTS output 128(1) (possibly with the inclusion of a predefined “wake word” in the first TTS output 128(1)), and respond by emitting a second TTS output 128(2) from a speaker(s) of the first device 102(1) (e.g., “No problem!”). The first TTS output 128(1) may be triggered by a command received, by the second device 102(2), from the remote system 108); and 
providing, by the data processing hardware, 
the audio output signal to an audio output device to output the audio output signal (see Moore 20190149987 [0033], which notes After successfully setting up the second device 102(2), the second device 102(2) may emit a first text-to-speech (TTS) output 128(1) from a speaker(s) of the second device 102(2) (e.g., “Thanks for setting me up!”), and a microphone of the first device 102(1) may detect this first TTS output 128(1) (possibly with the inclusion of a predefined “wake word” in the first TTS output 128(1)), and respond by emitting a second TTS output 128(2) from a speaker(s) of the first device 102(1) (e.g., “No problem!”). The first TTS output 128(1) may be triggered by a command received, by the second device 102(2), from the remote system 108.), 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed inventions to modify the systems and methods as taught by Wightman with the resource-conserving network credentialing of Moore in order to increase network security while conserving network resources  (see Moore [0101], which notes At 1108, the first device 102(1) may determine whether it should start sending the temporary authentication token 122 and the network credentials 116 for receipt by the secondary device. This determination may be based on a trigger from the secondary device, and may act as a resource (e.g., power resource, network bandwidth resource, processing resource, etc.) conservation measure to avoid transmitting data before the secondary device is ready to receive the data; and see Moore [0021] which notes Other features disclosed herein are directed to security measures that ensure an unauthorized user and/or device cannot be setup using the techniques described herein, as well as resource conservation measures that allow the first and/or second device in the environment to conserve resources with respect to communications bandwidth resources, processing resources, memory resources, power resources, and/or other computing resources. Furthermore, the techniques described herein may be utilized to register a secondary device that is headless (i.e., a device that does not have a display), yet, it is to be appreciated that the techniques described herein may be implemented to register any suitable type of networked computing device, including those that include a display or multiple displays.).
The combination of Wightman with Moore includes predictable results, such as credentialing of a display-less device.
The combination of Wightman and Moore fails to specifically teach all of the audio output signal when captured by an audio capture device of the user device, configured to prevent initiation of the wake-up process on the user device. 
However, Lang does teach
the audio output signal when captured by an audio capture device of the user device, configured to prevent initiation of the wake-up process on the user device (see Lang 20190043492, Abstract, which notes An example implementation includes a playback device receiving data representing audio content for playback by the playback device. Before the audio content is played back by the playback device, the playback device detects, in the audio content, one or more wake words for one or more voice services. The playback device causes one or more networked microphone devices to disable its respective wake response to the detected one or more wake words during playback of the audio content by the playback device and plays back the audio content via one or more speakers. When enabled, the wake response of a given networked microphone device to a particular wake word causes the given networked microphone device to listen, via a microphone, for a voice command following the particular wake word). 

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed inventions to modify the systems and methods as taught by Wightman and Moore with the playback zones of Lang in order to achieve a user experience of enjoying audio content with seamless transitions as a user traverses zones (see Lang [0053], which notes In one example, one or more playback zones in the environment of FIG. 1 may each be playing different audio content. For instance, the user may be grilling in the balcony zone and listening to hip hop music being played by the playback device 102 while another user may be preparing food in the kitchen zone and listening to classical music being played by the playback device 114. In another example, a playback zone may play the same audio content in synchrony with another playback zone. For instance, the user may be in the office zone where the playback device 118 is playing the same rock music that is being playing by playback device 102 in the balcony zone. In such a case, playback devices 102 and 118 may be playing the rock music in synchrony such that the user may seamlessly (or at least substantially seamlessly) enjoy the audio content that is being played out-loud while moving between different playback zones).
The combination of Wightman and Moore with Lang includes predictable results, such as achieving synchronization of playback devices among various playback zones.

As per claim 15, Wightman teaches a system comprising:
data processing hardware of a speech synthesis device (see Wightman US 10453460 B1 col. 9, lines 4-9, which notes Backend system 100 may also include various modules that store software, hardware, logic, instructions, and/or commands for backend system 100 to perform, such as, for example, a speech-to-text ("STT") module and/or a text-to-speech ("TTS") module. A more detailed description of backend system 100 is provided below.  The text data may be analyzed to determine what command, action, or instruction is included within command 4.); and
memory hardware in communication with the data processing hardware, the memory hardware storing instructions that when executed by the data processing hardware cause the data processing hardware to perform operations (see Wightman col. 12, lines 29-36, which notes storage/memory 204 may be implemented as computer-readable storage media (“CRSM”), which may be any available physical media accessible by processor(s) 202 to execute one or more instructions stored within storage/memory 204. In some embodiments, one or more applications (e.g., gaming, music, video, calendars, lists, etc.) may be run by processor(s) 202, and may be stored in memory 204) comprising: 
receiving text input data for conversion into synthesized speech (see Wightman US 10453460 B1 col. 9, lines 22-26, which notes After the audio data is analyzed and a response to command 4 is generated, speech, such as a response or answer to command 4 may be generated and converted from text into responsive audio data representing the response using TTS techniques); 
determining using a hotword-aware model trained to detect (see Wightman col. 2, lines 33-45, which notes As used herein, the term "wakeword" may correspond to a "keyword" or "key phrase," an "activation word" or "activation words," or a "trigger," "trigger word," or "trigger expression." One exemplary wakeword may be a name, such as the name, "Alexa," however persons of ordinary skill in the art will recognize that the any word (e.g., "Amazon"), or series of words (e.g., "Wake Up" or "Hello, Alexa") may alternatively be used as the wakeword. Furthermore, the wakeword may be set or programmed by an individual operating a voice activated electronic device, and in some embodiments more than one wakeword (e.g., two or more different wakewords) may be available to activate a voice activated electronic device; and see Wightman col. 13, lines 11-20, which notes In some embodiments, sound profiles for different words, phrases, commands, or audio compositions are also capable of being stored within storage/memory 204, such as within a sound profile database. For example, a sound profile of a video or of audio may be stored within the sound profile database of storage/memory 204 on voice activated electronic device 10. In this way, if a particular sound (e.g., a wakeword or phrase) is detected by voice activated electronic device 10, a corresponding command or request may be ignored, for example; and see Wightman col. 13, lines 45-51, which notes In some embodiments, a keyword spotter may use simplified ASR techniques. For example, an expression detector may use a Hidden Markov Model ("HMM") recognizer that performs acoustic modeling of the audio signal and compares the HMM model of the audio signal to one or more reference HMM models that have been created by training for specific trigger/hotword expressions)
a presence of at least one hotword assigned to a user device (see Wightman col. 13, lines 11-20, which notes In some embodiments, sound profiles for different words, phrases, commands, or audio compositions are also capable of being stored within storage/memory 204, such as within a sound profile database. For example, a sound profile of a video or of audio may be stored within the sound profile database of storage/memory 204 on voice activated electronic device 10. In this way, if a particular sound (e.g., a wakeword or phrase) is detected by voice activated electronic device 10, a corresponding command or request may be ignored, for example), 
whether a pronunciation of the text input data includes the hotword (see Wightman col. 7, lines 27-29, which notes Command 4 may include a wakeword; and see Wightman col. 8, lines 52-59, which notes backend system 100 may include automatic speech recognition functionality that may convert the audio data representing command 4 into text data, and may use the text data to determine one or more of the word(s) within command 4. Furthermore, backend system 100 may also include natural language understanding functionality for further processing and analyzing the text data representing command 4 to determine an intent of command 4), 
the hotword, when included in audio input data received by the user device, configured to initiate the wake-up process on the user device for processing the hotword and/or one or more other terms following the hotword in the audio input data (see Wightman col. 2, lines 33-45, which notes As used herein, the term "wakeword" may correspond to a "keyword" or "key phrase," an "activation word" or "activation words," or a "trigger," "trigger word," or "trigger expression." One exemplary wakeword may be a name, such as the name, "Alexa," however persons of ordinary skill in the art will recognize that the any word (e.g., "Amazon"), or series of words (e.g., "Wake Up" or "Hello, Alexa") may alternatively be used as the wakeword. Furthermore, the wakeword may be set or programmed by an individual operating a voice activated electronic device, and in some embodiments more than one wakeword (e.g., two or more different wakewords) may be available to activate a voice activated electronic device).

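Wightman fails to specifically teach all of when the pronunciation of the text input data includes the hotword: generating an audio output signal from the text input data; and providing the audio output signal to an audio output device to output the audio output signal, the audio output signal when captured by an audio capture device of the user device, configured to prevent initiation of the wake-up process on the user device.
However, Moore does teach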
when the pronunciation of the text input data includes the hotword: 
generating an audio output signal from the text input data (see Moore 20190149987 [0033], which notes After successfully setting up the second device 102(2), the second device 102(2) may emit a first text-to-speech (TTS) output 128(1) from a speaker(s) of the second device 102(2) (e.g., “Thanks for setting me up!”), and a microphone of the first device 102(1) may detect this first TTS output 128(1) (possibly with the inclusion of a predefined “wake word” in the first TTS output 128(1)), and respond by emitting a second TTS output 128(2) from a speaker(s) of the first device 102(1) (e.g., “No problem!”). The first TTS output 128(1) may be triggered by a command received, by the second device 102(2), from the remote system 108); and 
providing 
the audio output signal to an audio output device to output the audio output signal (see Moore 20190149987 [0033], which notes After successfully setting up the second device 102(2), the second device 102(2) may emit a first text-to-speech (TTS) output 128(1) from a speaker(s) of the second device 102(2) (e.g., “Thanks for setting me up!”), and a microphone of the first device 102(1) may detect this first TTS output 128(1) (possibly with the inclusion of a predefined “wake word” in the first TTS output 128(1)), and respond by emitting a second TTS output 128(2) from a speaker(s) of the first device 102(1) (e.g., “No problem!”). The first TTS output 128(1) may be triggered by a command received, by the second device 102(2), from the remote system 108.), 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed inventions to modify the systems and methods as taught by Wightman with the resource-conserving network credentialing of Moore in order to increase network security while conserving network resources  (see Moore [0101], which notes At 1108, the first device 102(1) may determine whether it should start sending the temporary authentication token 122 and the network credentials 116 for receipt by the secondary device. This determination may be based on a trigger from the secondary device, and may act as a resource (e.g., power resource, network bandwidth resource, processing resource, etc.) conservation measure to avoid transmitting data before the secondary device is ready to receive the data; and see Moore [0021] which notes Other features disclosed herein are directed to security measures that ensure an unauthorized user and/or device cannot be setup using the techniques described herein, as well as resource conservation measures that allow the first and/or second device in the environment to conserve resources with respect to communications bandwidth resources, processing resources, memory resources, power resources, and/or other computing resources. Furthermore, the techniques described herein may be utilized to register a secondary device that is headless (i.e., a device that does not have a display), yet, it is to be appreciated that the techniques described herein may be implemented to register any suitable type of networked computing device, including those that include a display or multiple displays.).
The combination of Wightman with Moore includes predictable results, such as credentialing of a display-less device.
The combination of Wightman and Moore fails to specifically teach all of the audio output signal when captured by an audio capture device of the user device, configured to prevent initiation of the wake-up process on the user device. 
However, Lang does teach
the audio output signal when captured by an audio capture device of the user device, configured to prevent initiation of the wake-up process on the user device (see Lang 20190043492, Abstract, which notes An example implementation includes a playback device receiving data representing audio content for playback by the playback device. Before the audio content is played back by the playback device, the playback device detects, in the audio content, one or more wake words for one or more voice services. The playback device causes one or more networked microphone devices to disable its respective wake response to the detected one or more wake words during playback of the audio content by the playback device and plays back the audio content via one or more speakers. When enabled, the wake response of a given networked microphone device to a particular wake word causes the given networked microphone device to listen, via a microphone, for a voice command following the particular wake word). 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed inventions to modify the systems and methods as taught by Wightman and Moore with the playback zones of Lang in order to achieve a user experience of enjoying audio content with seamless transitions as a user traverses zones (see Lang [0053], which notes In one example, one or more playback zones in the environment of FIG. 1 may each be playing different audio content. For instance, the user may be grilling in the balcony zone and listening to hip hop music being played by the playback device 102 while another user may be preparing food in the kitchen zone and listening to classical music being played by the playback device 114. In another example, a playback zone may play the same audio content in synchrony with another playback zone. For instance, the user may be in the office zone where the playback device 118 is playing the same rock music that is being playing by playback device 102 in the balcony zone. In such a case, playback devices 102 and 118 may be playing the rock music in synchrony such that the user may seamlessly (or at least substantially seamlessly) enjoy the audio content that is being played out-loud while moving between different playback zones).


As per claims 2 and 16, the combination of Wightman, Moore, and Lang teaches all of the limitations of claims 1 and 15.
Wightman further teaches wherein determining whether the pronunciation of the text input data includes the hotword comprises 
determining that at least one of a word, a sub-word, or a text-to-speech sequence of the text input data is associated with the hotword (see Wightman col. 7, lines 27-29, which notes Command 4 may include a wakeword; and see Wightman col. 8, lines 52-59, which notes backend system 100 may include automatic speech recognition functionality that may convert the audio data representing command 4 into text data, and may use the text data to determine one or more of the word(s) within command 4. Furthermore, backend system 100 may also include natural language understanding functionality for further processing and analyzing the text data representing command 4 to determine an intent of command 4; and see Wightman col. 13, lines 45-51, which notes In some embodiments, a keyword spotter may use simplified ASR techniques. For example, an expression detector may use a Hidden Markov Model ("HMM") recognizer that performs acoustic modeling of the audio signal and compares the HMM model of the audio signal to one or more reference HMM models that have been created by training for specific trigger/hotword expressions).

As per claims 3 and 17, the combination of Wightman, Moore, and Lang teaches all of the limitations of claims 1 and 15.
Wightman further teaches wherein 
the hotword-aware model is trained on a text-to-speech sequence or audio representation of the hotword assigned to the user device (see Wightman col. 13, lines 45-51, which notes In some embodiments, a keyword spotter may use simplified ASR techniques. For example, an expression detector may use a Hidden Markov Model ("HMM") recognizer that performs acoustic modeling of the audio signal and compares the HMM model of the audio signal to one or more reference HMM models that have been created by training for specific trigger/hotword expressions). 

As per claims 6 and 20, the combination of Wightman, Moore, and Lang teaches all of the limitations of claims 1 and 15.
Wightman further teaches
querying, by the data processing hardware, a remote hotword repository to obtain at least the hotword assigned to the user device for training the hotword-aware model (see Wightman col. 18, lines 5-19, which notes Sound profile generation module 272, in one embodiment, may be used to generate a sound profile, such as an audio fingerprint, of a specific audio signal or sound. For example, a media event, such as a commercial, which may include an utterance of the wakeword (e.g., “Alexa”) of voice activated electronic device 10, and a sound profile of the audio of that commercial may be generated using sound profile generation module 272. The generated sound profile may then be provided to, and stored within, sound profile database 270. This may enable backend system 100 to prevent any future occurrences of the media event from erroneously triggering voice activated electronic device and/or causing unwanted speech processing to occur from audio emanating from the media event; and see Wightman col. 27, line 59 to col. 28, line 5, which notes In some embodiments, after step 622, the audio data corresponding to command 4 may be provided to sound profile generation module 272. Sound profile generation module 272 may, for instance, generate a sound profile unique to the audio data. For example, an audio fingerprint of command 4 may be generated by sound profile generation module 272. Furthermore, in one embodiment, the generated sound profile may be stored within sound profile database 270 on backend system 100. This may enable any future instances of audio data also representing command 4 being received by backend system 100 to be more readily matched as corresponding to a command that is non-human in origin, and therefore a command that is to be ignored by backend system 100).

Claims 4 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Wightman (US 10453460 B1) in view of Moore (US 20190149987 A1) and Lang (US 20190043492 A1) and in further view of Porsbo (US 20170257712 A1).


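As per claims 4 and 18, the combination of Wightman, Moore, and Lang teaches all of the limitations of claims 1 and 15.
The combination of Wightman, Moore, and Lang fails to specifically teach all of the text input data comprises a first language and the audio output signal comprises a translation of the text input data in a different language.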
However, Porsbo does teach wherein 
the text input data comprises a first language (see Porsbo [0027], which notes an embodiment where an event detector is a part of the external device and is configured to detect a beginning or end of a predefined process performed by the external device, or a storing of a predefined content or a predefined content characteristic on a memory that is functionally connected to the external device, or an activation or deactivation of the external device, or a combination thereof as a trigger event. The detection of a beginning of a predefined process performed by the external device is performed in a variant of this embodiment by a detection unit within a reading device, preferably a wireless reading device such as a translator pen, that forms the external device and sends an event signal that represents a spoken/first language version of a text scanned by the reading device. Thereby, a hearing aid user who is also visually impaired can hear a respective text via the hearing aid in a properly amplified manner) and
 the audio output signal comprises a translation of the text input data in a different language (see Porsbo [0027], which notes The relay server or the communication unit might also be configured to translate the spoken version of the text, which is represented by a respective audio signal, into a language chosen by the hearing aid user by generating a translated audio signal. In a further variant, the external device is formed by a mobile phone or a group of mobile phones which are activated, by a respective user input, in order to detect a speech which is held in front of an audience). 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed inventions to modify the systems and methods as taught by the combination of Wightman, Moore, and Lang with the rule generation channel of Porsbo in order to generate a rule by defining an action to be triggered in response to a trigger event so that a hearing aid user can generate rules for a hearing aid according to his or her own hearing limitations (see Porsbo [0023], which notes In an embodiment of the communication system, the definition of rules that refer to a specific hearing aid is provided by a further communication channel of the rule processing server, e.g. a communication channel between communication device and rule processing server, which is arranged and configured to allow a user of the hearing aid to generate a rule by defining an action to be triggered in response to a trigger event. The further communication channel thus forms a rule generation channel. This user generated rule either applies only to the hearing aid of the user or to a predefined group of hearing aids, which might also comprise the hearing aid of the user. Thereby the user can generate rules according to his or her own hearing limitations).

Claims 5 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Wightman (US 10453460 B1) in view of Moore (US 20190149987 A1) and Lang (US 20190043492 A1) and in further view of Douglas (US 10649727 B1).

As per claims 5 and 19, the combination of Wightman, Moore, and Lang teaches all of the limitations of claims 1 and 15.
Wightman fails to specifically teach all of detecting, by the data processing hardware, a presence of the user device within an operating environment of the speech synthesis device.
Moore further teaches: 
detecting, by the data processing hardware, a presence of the user device within an operating environment of the speech synthesis device (see Moore FIG. 6, which shows an already-setup first device 102(1), a not-setup second device 102(2), and a threshold distance/operating environment 603 extending from the already-setup first device 102(1) and encompassing the not-setup second device 102(2); and see Moore [0066], which notes At 602, a user 106 may bring the second device 102(2) within a threshold distance 603 of the first device 102(1). The threshold distance 603 may vary depending upon the wireless data transfer technique and/or protocol used to transfer information/data between the two devices 102(1) and 102(2). In some embodiments, the second device 102(2) may utilize its speaker(s) 216 as a wireless data transmission component to output a signal in the form of a TTS/speech synthesis device output or one or more HFA tones, and the first device 102 may utilize its microphone(s) 218 to receive the audio that is output from the speaker(s) 216 of the second device 102(2). In this scenario, the threshold distance may be about 10 meters (m)).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed inventions to modify the systems and methods as taught by Wightman with the resource-conserving network credentialing of Moore in order to increase network security while conserving network resources  (see Moore [0101], which notes At 1108, the first device 102(1) may determine whether it should start sending the temporary authentication token 122 and the network credentials 116 for receipt by the secondary device. This determination may be based on a trigger from the secondary device, and may act as a resource (e.g., power resource, network bandwidth resource, processing resource, etc.) conservation measure to avoid transmitting data before the secondary device is ready to receive the data; and see Moore [0021] which notes Other features disclosed herein are directed to security measures that ensure an unauthorized user and/or device cannot be setup using the techniques described herein, as well as resource conservation measures that allow the first and/or second device in the environment to conserve resources with respect to communications bandwidth resources, processing resources, memory resources, power resources, and/or other computing resources. Furthermore, the techniques described herein may be utilized to register a secondary device that is headless (i.e., a device that does not have a display), yet, it is to be appreciated that the techniques described herein may be implemented to register any suitable type of networked computing device, including those that include a display or multiple displays).
The combination of Wightman with Moore includes predictable results, such as credentialing of a display-less device.
The combination of Wightman, Moore, and Lang fails to specifically teach all of querying, by the data processing hardware, the user device to obtain the hotword assigned to the user device for training the hotword-aware model.
However, Douglas does teach:
querying, by the data processing hardware, the user device to obtain the hotword assigned to the user device for training the hotword-aware model (see Douglas col. 1, line 53—col. 2, line 3, which notes Systems and methods for wake word detection configuration are described herein. Take, for example, an electronic device, such as a mobile phone, that includes components that facilitate the detection of wake words from user utterances. The electronic device may include a microphone that may be configured to capture audio representing the user utterance and generate corresponding audio data. The electronic device may also include a digital signal processor and a digital-signal-processor component configured to cause the digital signal processor to detect a wake word in the audio data. The electronic device may also have stored thereon one or more wake word application programming interfaces (APIs) that are associated with a remote speech-processing system. When the digital signal processor, using the digital-signal-processor component, detects the wake word, an indication that the wake word has been detected may be sent to the wake word APIs along with the audio data; and see Douglas col. 3 lines 15-23, which notes Additionally, or alternatively, the wake word APIs may be configured to receive data from the speech-processing application, such as wake word model data representing one or more wake word models. The wake word APIs may provide all or a portion of the wake word model data to the digital-signal-processor component for use in detecting the wake word. The wake word model data may be updated, and/or new wake word model data may be provided, by the remote system, such as at the speech-processing application).   
 Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed inventions to modify the systems and methods as taught by Wightman, Moore, and Lang with the wake word APIs of Douglas in order to detect and disable out-of-date or “custom” speech-processing applications  (see Douglas col. 13, lines 52-67, which notes At block 510, the process 500 may include determining whether a second application associated with a remote speech-processing system is installed on the electronic device 102. For example, the wake word APIs may be utilized to determine if a speech-processing application has been installed on the electronic device. Additionally, or alternatively, in examples, one or more electronic devices may have installed in memory thereon a custom speech-processing application and/or an outdated speech-processing application. In these examples, the wake word APIs may determine whether an up-to-date and/or non-custom speech-processing application is installed. If not, the up-to-date and/or non-custom speech-processing application may be installed on the electronic device and the previous version of the speech-processing application may be removed and/or otherwise not utilized by the electronic device).
The combination of Wightman, Moore, and Lang with Douglas includes predictable results, such as removal of out-of-date or custom speech-processing applications and installation of up-to-date and non-custom speech-processing applications.

Claims 7 and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Wightman (US 10453460 B1) in view of Moore (US 20190149987 A1) and Lang (US 20190043492 A1) and in further view of O’Malley (US 10152966 B1).

As per claims 7 and 21, the combination of Wightman, Moore, and Lang teaches all of the limitations of claims 1 and 15.
The combination of Wightman, Moore, and Lang fails to specifically teach all of wherein generating the audio output signal from the text input data comprises inserting a watermark to the audio output signal that indicates the audio output signal corresponds to synthesized speech and instructs a hotword detector of the user device to ignore detection of the hotword in the synthesized speech.
O’Malley further teaches wherein generating the audio output signal from the text input data comprises 
inserting a watermark to the audio output signal (see O’Malley col. 2, lines 13-29, which notes a voice activated device may be configured to receive a trigger and a voice command spoken by a user of the device, and verification of the trigger may cause the voice activated device to output a response based on the received voice command. The voice activated device may also receive one or more triggers from an unintended source, such as a nearby television set that outputs an audio signal comprising the trigger and a voice command. Triggers received from unintended sources may be referred to herein as “false triggers.” The false trigger may cause the voice activated device to operate in an unintended manner, for example, by executing or responding to the voice command received from the nearby television.  In order to prevent the execution of voice commands associated with false triggers, one or more signal markers may be inserted into the audio signal at a location corresponding to the trigger and/or the voice command) that 
indicates the audio output signal corresponds to synthesized speech (see O’Malley col. 2, lines 25-29, which notes in order to prevent the execution of voice commands associated with false triggers, one or more signal markers may be inserted into the audio signal at a location corresponding to the trigger and/or the voice command) and
instructs a hotword detector of the user device to ignore detection of the hotword in the synthesized speech (see O’Malley col. 2, lines 29-33, which notes upon receiving an audio signal with the inserted signal markers, the voice activated device may be configured to ignore the trigger and/or the voice command, thereby preventing the occurrence of a false trigger). 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the systems and methods as taught by Wightman, Moore, and Lang with the location-based watermarking trigger-override of O’Malley in order to execute a voice command given by a user at a time when a watermarking/trigger signal has been detected, where the user gives the voice command from a location other than a location of a known synthetic voice generator (e.g., television), where the user device detects that the user voice command is generated in a location other than that of the synthetic voice generator, and where the user device executes the user’s voice command rather than ignoring it on the basis of the concurrently detected trigger/wakeword signal (see O’Malley col. 10, lines 21-41, which notes For example, the user device may determine, based on a sampling of a received audio signal, a location of a television set at a location proximate to the user device (e.g., in the same room as the user device). The user device may determine, based on a decibel level or a frequency of the audio signal, that this audio signal is being received from a television set. Thus, if a trigger is received from a location that the user device has determined to be a location of a television set, the user device may be configured to ignore the voice command following the trigger. In contrast, if the trigger is received from a location that is different than the location of the known television set, the user device may be configured to execute the voice command following the trigger).
The combination of Wightman, Moore, and Lang with O’Malley includes predictable results, such as reducing false triggering from non-human voice sources.

Claims 8 and 22 are rejected under 35 U.S.C. 103 as being unpatentable over Wightman (US 10453460 B1) in view of Moore (US 20190149987 A1) and Lang (US 20190043492 A1) and in further view of Lockhart (US 10186265 B1).

As per claims 8 and 22, the combination of Wightman, Moore, and Lang teaches all of the limitations of claims 1 and 15.
Wightman further teaches wherein generating the audio output signal from the text input data comprises: 
determining a speech waveform that represents a text-to-speech output for the text input data (see Wightman US 10453460 B1 col. 9, lines 22-26, which notes After the audio data is analyzed and a response to command 4 is generated, speech, such as a response or answer to command 4 may be generated and converted from text into responsive audio data representing the response using TTS techniques).
The combination of Wightman, Moore, and Lang does not specifically teach wherein generating the audio output signal from the text input data comprises: altering the speech waveform by removing or altering any sounds associated with the hotword to evade detection of the hotword by a hotword detector of the user device.
Lockhart further teaches wherein generating the audio output signal from the text input data comprises: 
altering the speech waveform by removing or altering any sounds associated with the hotword to evade detection of the hotword by a hotword detector of the user device (see Lockhart col. 23, lines 30-43, which notes In another example embodiment, if a wakeword is detected in the output audio data 151, and the output audio data 151 is being transmitted to a remote speaker (e.g., Bluetooth speaker, wireless speaker connected to the device 110), a signal such as a beep or chirp (which may be inaudible to humans but detectable by devices) may be output from the local speaker 101 as an indication that an upcoming wakeword will be output soon thereafter. The microphone 103 (or other component) may detect the inaudible beep or chirp and transfer the indication to the primary wakeword detector 220a/wakeword synch module 222 to disable wakeword detection during a time interval in which the wakeword will be output from the remote speaker). 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the systems and methods as taught by Wightman, Moore, and Lang with the differently tuned wakeword detectors of Lockhart in order to selectively adjust system operation to a desired tradeoff, such as between missed positives and false positives (see Lockhart col. 10, lines 21-41, which notes The device 110 may include an audio processing module 522 and wakeword detection modules 220. The audio processing module 522 and wakeword detection modules 220 may perform the various functions described above. The primary wakeword detection module 220a may receive audio data captured by a microphone 103 (which itself may have been processed, for example by AEC, prior to reaching the wakeword detection module 220a). The secondary wakeword detection module 220b may receive audio data intended for speaker 101. The primary wakeword detection module 220a and secondary wakeword detection module 220b may be configured similarly, e.g., tuned to a similar level of wakeword-detection aggressiveness, or they may be configured differently, e.g. tuned to a similar level of wakeword-detection aggressiveness. In the later situation, one detector 220 may potentially detect a wakeword in audio data that the other detector 220 may not. This may be done in order to adjust system operation to a desired tradeoff, such as between missed positives and false positives).
The combination of Wightman, Moore, and Lang with Lockhart includes predictable results, such as tuning a system operation to a desired tradeoff between missed positives and false positives.

Claims 9 and 23 are rejected under 35 U.S.C. 103 as being unpatentable over Wightman (US 10453460 B1) in view of Moore (US 20190149987 A1) and Lang (US 20190043492 A1) and in further view of Wang (US 10580405 B1).

As per claims 9 and 23, the combination of Wightman, Moore, and Lang teaches all of the limitations of claims 1 and 15.
Wightman further teaches wherein generating the audio output signal from the text input data comprises: 
determining a speech waveform that represents a text-to-speech output for the text input data (see Wightman US 10453460 B1 col. 9, lines 22-26, which notes After the audio data is analyzed and a response to command 4 is generated, speech, such as a response or answer to command 4 may be generated and converted from text into responsive audio data representing the response using TTS techniques).
The combination of Wightman, Moore, and Lang does not specifically teach wherein generating the audio output signal from the text input data comprises: filtering the audio waveform to evade detection of the hotword by a hotword detector of the user device.
Wang further teaches wherein generating the audio output signal from the text input data comprises: 
filtering the audio waveform to evade detection of the hotword by a hotword detector of the user device (see Wang col. 38, lines 20-53, which notes In order to avoid inadvertent wakeword detection, the device 110 may be configured to temporarily disable wakeword detection during the time in which the wakeword will be output from the speaker(s) 114 and detectable by the microphone(s) 112 (e.g., during playback of output audio data 115 upon detecting the wakeword in the output audio data 115). Such a configuration may involve multiple wakeword detectors (e.g., wakeword detection components 220), as illustrated in FIG. 16B. For example, FIG. 16B illustrates that the device 110 may include a primary wakeword detection component 220a and a secondary wakeword detection component 220b, which may detect a wakeword in the output audio data 115 before it is output from the speaker(s) 114. The secondary wakeword detection component 220b may receive the output audio data 115 from the server(s) 120b via the network(s) 10 during a communication session (e.g., incoming audio data corresponding to a conversation). Upon receipt of the output audio data 115, the secondary wakeword detection component 220b may determine that the output audio data 115 includes the wakeword (e.g., detect wakeword 1604). In response to determining that the output audio data 115 includes the wakeword, the secondary wakeword detection component 220b and/or a wakeword synchronization component 1622 may send data (e.g., indicator to ignore incoming wakeword 1606) corresponding to instructions to the primary wakeword detection component 220a, wherein the instructions are to ignore/filter the incoming wakeword. In other words, the instructions to ignore the incoming wakeword may disable the primary wakeword detection component 220a during a time interval in which the wakeword will be output from the speaker(s) 114 as part of the output audio data 115. 
Thus, the wakeword detection component 220a will temporarily ignore the wakeword represented in the audio data 111).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the systems and methods as taught by Wightman, Moore, and Lang with the remote control-based disabling of wakeword detection of Wang in order to interpret voice commands generated by a remote caller (see Wang col. 10, lines 21-41, which notes In some examples, when remote control is granted to a caller device, the device 110 may be configured to detect a wakeword in incoming audio data and send command audio data corresponding to the incoming audio data to the server(s) 120a. Thus, the server(s) 120a receive the command audio data from the device 110 and may interpret a voice command represented in the command audio data as though the voice command originated from the device 110. In order to enable the device 110 to detect the wakeword in the audio data 111, the device 110 may temporarily disable the secondary wakeword detection component 220b when remote control is granted).
The combination of Wightman, Moore, and Lang with Wang includes predictable results, such as interpreting voice commands received from a caller device.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to MARK R HENNINGS whose telephone number is (571) 272-9676. The examiner can normally be reached on Monday-Friday 8:00 am-5:00 pm. 
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Pierre-Louis Desir, can be reached on 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished 

/MARK HENNINGS/
Examiner, Art Unit 2659
/PIERRE LOUIS DESIR/
Supervisory Patent Examiner, Art Unit 2659