DETAILED ACTION
This communication is in response to the Application filed on 19 July 2019. Claims 1-20 are pending and have been examined.

Information Disclosure Statement
The information disclosure statement (IDS) submitted on 19 July 2019 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2, 6, 13, 15, 17, and 19 is/are rejected under 35 U.S.C. 103 as being unpatentable over US 20210104221, hereinafter referred to as Sharifi et al., in view of US 9484014, hereinafter referred to as Kaszczuk et al.

Regarding claim 1, Sharifi et al. discloses a method for automatically creating a wake word detection algorithm (Sharifi et al., Fig. 5), the method comprising: 

(“Through the use of a "hotword" (also referred to as an "attention word", "wake-up phrase/word", "trigger phrase", or "voice action initiation command"), in which by agreement a predetermined term (i.e., keyword) that is spoken to invoke the attention of the system is reserved, the system is able to discern between utterances directed to the system (i.e., to initiate a wake-up process for processing one or more terms following the hotword in the utterance) and utterances directed to an individual in the environment,” Sharifi et al., para [0023].); 

generating a plurality of samples associated with the user input, wherein the plurality of samples are generated using a plurality of text-to-speech services (“In some examples, the speech synthesizer 300 on one device 110 (e.g., the first user device 110, 110a) is trained on a text-to-speech sequence or audio representation of a hotword 130 assigned to the other user device 110 (e.g., the second user device 110b). For instance, a training pipeline (e.g., a hotword-aware trainer 310) of the speech synthesizer 300 (e.g., a TTS system) associated with one device 110 may generate a hotword-aware model 320 for use in detecting a presence of hotwords 130,” Sharifi et al., para [0036].); 

training a machine learning model for the custom wake word using the plurality of samples (“In some implementations in order to efficiently and to effectively detect hotwords 130, the hotword detector 200 is trained by a hotword detector model 220 with data or examples of speech to learn how to identify whether an utterance 150 includes a hotword 130. For example, the hotword detector 200 is taught by a machine learning model to identify a hotword 130,” Sharifi et al., para [0033]. And, para [0043]-[0044].); and 

deploying a wake word detection algorithm that is the result of the machine learning model for the custom wake word to the computing device, wherein the wake word detection algorithm facilitates the computing device in recognizing when the custom wake word when spoken by the user (“The hotword detector model 220 is a synthesized speech aware model 220 generated by the hotword detector trainer 210 based on training examples 212, 212a-b,” Sharifi et al., para [0043].).

Sharifi et al., though, does not specifically disclose that the wake word spoken by the user is a custom wake word.
Kaszczuk et al. is cited to disclose that the wake word spoken by the user is a custom wake word (“Other information may also be stored in the TTS storage 220 for use in speech recognition. The contents of the TTS storage 220 may be prepared for general TTS use or may be customized to include sounds and words that are likely to be used in a particular application. For example, for TTS processing by a global positioning system (GPS) device, the TTS storage 220 may include customized speech specific to location and navigation. In certain instances the TTS storage 220 may be customized for an individual user based on his/her individualized desired speech output. For example a user may prefer a speech output voice to be a specific gender, have a specific accent, speak at a specific speed, have a distinct emotive quality (e.g., a happy voice), or other customizable characteristic. The speech synthesis engine 218 may include specialized databases or models to account for such user preferences,” Kaszczuk et al., col. 8, lines 24-40.). Kaszczuk et al. benefits Sharifi et al. by allowing the user to create a user-customizable wake word (Kaszczuk et al., col. 8, lines 24-40), thereby enhancing the security of the hotword-aware device of Sharifi et al. Therefore, it would be obvious for one skilled in the art to combine the teachings of Sharifi et al. with those of Kaszczuk et al. to enhance the hotword-aware speech synthesis of Sharifi et al.

Regarding claim 2, Sharifi et al., as modified by Kaszczuk et al., discloses the method of claim 1, wherein the text-to-speech services generate samples that modify how the user input would be spoken based on different pitches (“For example, a related art TTS module receives text and outputs only an audio signal according to an acoustic feature initially set in the DB of the TTS module. However, a TTS module according to an embodiment may be trained to output an audio signal corresponding to a specific keyword, based on acoustic factors representing acoustic features such as a tone, intensity, pitch, formant frequency, speech speed, and voice quality of a user voice that utters the specific keyword, and accordingly may output an audio signal in which the acoustic feature of a specific user is properly reflected,” Sharifi et al., para [0074].).  

Regarding claim 6, Sharifi et al., as modified by Kaszczuk et al., discloses the method of claim 1, wherein the text-to-speech services generate samples that modify how the user input would be spoken based on different speeds (“For example, a related art TTS module receives text and outputs only an audio signal according to an acoustic feature initially set in the DB of the TTS module. However, a TTS module according to an embodiment may be trained to output an audio signal corresponding to a specific keyword, based on acoustic factors representing acoustic features such as a tone, intensity, pitch, formant frequency, speech speed, and voice quality of a user voice that utters the specific keyword, and accordingly may output an audio signal in which the acoustic feature of a specific user is properly reflected,” Sharifi et al., para [0074].).  


Regarding claim 13, Sharifi et al. discloses a non-transitory computer-readable medium comprising instructions generating and automatically training a custom wake word (Sharifi et al., fig. 3A-B), the instructions (Sharifi et al., para [0009]), when executed by a computing system, cause the computing system to: 

receive a user input from a user associated with a custom wake word, wherein the user input includes one or more words that will be spoken by the user in a vicinity of a computing device, and wherein the custom wake word is used to initiate a virtual assistant associated with the computing device (“Through the use of a "hotword" (also referred to as an "attention word", "wake-up phrase/word", "trigger phrase", or "voice action initiation command"), in which by agreement a predetermined term (i.e., keyword) that is spoken to invoke the attention of the system is reserved, the system is able to discern between utterances directed to the system (i.e., to initiate a wake-up process for processing one or more terms following the hotword in the utterance) and utterances directed to an individual in the environment,” Sharifi et al., para [0023].); 

generate a plurality of samples associated with the user input, wherein the plurality of samples are generated using a plurality of different text-to-speech services (“In some examples, the speech synthesizer 300 on one device 110 (e.g., the first user device 110, 110a) is trained on a text-to-speech sequence or audio representation of a hotword 130 assigned to the other user device 110 (e.g., the second user device 110b). For instance, a training pipeline (e.g., a hotword-aware trainer 310) of the speech synthesizer 300 (e.g., a TTS system) associated with one device 110 may generate a hotword-aware model 320 for use in detecting a presence of hotwords 130,” Sharifi et al., para [0036].); 

train a wake word detection algorithm for the custom wake word using the plurality of samples (“In some implementations in order to efficiently and to effectively detect hotwords 130, the hotword detector 200 is trained by a hotword detector model 220 with data or examples of speech to learn how to identify whether an utterance 150 includes a hotword 130. For example, the hotword detector 200 is taught by a machine learning model to identify a hotword 130,” Sharifi et al., para [0033]. And, para [0043]-[0044].); and 

deploy the wake word detection algorithm for the custom wake word to the computing device, wherein the wake word detection algorithm facilitates the computing device in recognizing when the custom wake word when spoken by the user (“The hotword detector model 220 is a synthesized speech aware model 220 generated by the hotword detector trainer 210 based on training examples 212, 212a-b,” Sharifi et al., para [0043].).

Sharifi et al., though, does not specifically disclose that the wake word spoken by the user is a custom wake word.   
Kaszczuk et al. is cited to disclose that the wake word spoken by the user is a custom wake word (“Other information may also be stored in the TTS storage 220 for use in speech recognition. The contents of the TTS storage 220 may be prepared for general TTS use or may be customized to include sounds and words that are likely to be used in a particular application. For example, for TTS processing by a global positioning system (GPS) device, the TTS storage 220 may include customized speech specific to location and navigation. In certain instances the TTS storage 220 may be customized for an individual user based on his/her individualized desired speech output. For example a user may prefer a speech output voice to be a specific gender, have a specific accent, speak at a specific speed, have a distinct emotive quality (e.g., a happy voice), or other customizable characteristic. The speech synthesis engine 218 may include specialized databases or models to account for such user preferences,” Kaszczuk et al., col. 8, lines 24-40.). Kaszczuk et al. benefits Sharifi et al. by allowing the user to create a user-customizable wake word (Kaszczuk et al., col. 8, lines 24-40), thereby enhancing the security of the hotword-aware device of Sharifi et al. Therefore, it would be obvious for one skilled in the art to combine the teachings of Sharifi et al. with those of Kaszczuk et al. to enhance the hotword-aware speech synthesis of Sharifi et al.
As to claim 17, system claim 17 and method claim 1 are related as method and system of using same, with each claimed element’s function corresponding to the method step. Accordingly claim 17 is similarly rejected under the same rationale as applied above with respect to method claim. And, Sharifi et al., para [0065]-[0066] teaches processor, memory, and CRM. 

Regarding claim 15, Sharifi et al., as modified by Kaszczuk et al., discloses the non-transitory computer-readable medium of claim 13, wherein the training of the wake word detection algorithm is performed using a neural network, and a classifier is used to classify speech samples as containing the wake word or not (“In some implementations, the hotword detector trainer 210 trains the hotword detector model 220 by negative training examples 212a and positive training examples 212b. A negative training example 212a is a sample of audio that the hotword detector trainer 210 teaches the hotword detector model 220 to ignore. Here, in order to prevent inadvertent wake-up initiation for a user device 110 based on synthesized speech 160, the negative training examples 212a are samples of audio corresponding to synthesized speech 160. The synthesized speech 160 of one or more negative training example(s) 212a may be synthesized speech 160 that includes the hotword 130 (i.e. pronounces the hotword 130) or synthesized speech that does not include the hotword 130. In either scenario, the hotword detector 200 is taught to disregard synthesized speech 160 so that a wake-up process based on utterances 150 is not inadvertently initiated by synthesized speech 160 containing a hotword or one or more words/sub-words that sound like the hotword 130. By disregarding synthesized speech 160, the hotword detector 200 prevents the initiation of the wake-up process on the user device 110 for processing the hotword 130 and/or the one or more other terms following the hotword 130 in the audio input data,” Sharifi et al., para [0044].).  
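As to claim 19, system claim 19 and method claim 15 are related as method and system of using same, with each claimed element’s function corresponding to the method step. Accordingly claim 19 is similarly rejected under the same rationale as applied above with respect to method claim. And, Sharifi et al., para [0065]-[0066] teaches processor, memory, and CRM.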

Claims 3 and 5 is/are rejected under 35 U.S.C. 103 as being unpatentable over US 20210104221, hereinafter referred to as Sharifi et al., in view of US 9484014, hereinafter referred to as Kaszczuk et al., and further in view of US 10170116, hereinafter referred to as Kelly et al.

Regarding claim 3, Sharifi et al., as modified by Kaszczuk et al., discloses the method of claim 1, but not wherein the text-to-speech services generate samples that modify how the user input would be spoken based on different accents. Kelly et al. is cited to disclose wherein the text-to-speech services generate samples that modify how the user input would be spoken based on different accents (“Each portion of text data may be represented by multiple potential states corresponding to different known pronunciations of phonemes and their parts (such as the phoneme identity, stress, accent, position, etc.),” Kelly et al., col. 18, lines 64-67. And, “For example a user may prefer a speech output voice to be a specific gender, have a specific accent, speak at a specific speed, have a distinct emotive quality (e.g., a happy voice), or other customizable characteristic(s),” Kelly et al., col. 20, lines 35-39.). Kelly et al. benefits Sharifi et al. by providing solutions to improve a user experience when switching between different processes, thereby allowing devices to switch between multiple processes to perform different tasks (Kelly et al., col. 1, lines 20-26). Therefore, it would be obvious for one skilled in the art to combine the teachings of Sharifi et al. with those of Kelly et al. to enhance the versatility of Sharifi et al. 

Regarding claim 5, Sharifi et al., as modified by Kaszczuk et al., discloses the method of claim 1, but not wherein the text-to-speech services generate samples that modify how the user input would be spoken based on the gender of the user. Kelly et al. is cited to disclose wherein the text-to-speech services generate samples that modify how the user input would be spoken based on the gender of the user (“For example a user may prefer a speech output voice to be a specific gender, have a specific accent, speak at a specific speed, have a distinct emotive quality (e.g., a happy voice), or other customizable characteristic(s),” Kelly et al., col. 20, lines 35-39.). Kelly et al. benefits Sharifi et al. by providing solutions to improve a user experience when switching between different processes, thereby allowing devices to switch between multiple processes to perform different tasks (Kelly et al., col. 1, lines 20-26). Therefore, it would be obvious for one skilled in the art to combine the teachings of Sharifi et al. with those of Kelly et al. to enhance the versatility of Sharifi et al.


Claim 4 is/are rejected under 35 U.S.C. 103 as being unpatentable over US 20210104221, hereinafter referred to as Sharifi et al., in view of US 9484014, hereinafter referred to as Kaszczuk et al., and further in view of US 10580405, hereinafter referred to as Wang et al.

Regarding claim 4, Sharifi et al., as modified by Kaszczuk et al., discloses the method of claim 1, but not wherein the text-to-speech services generate samples that modify For example, for TTS processing by a global positioning system (GPS) device, the TTS storage 420 may include customized speech specific to location and navigation,” Wang et al., col. 28, lines 22-25.). Wang et al. benefits Sharifi et al. by allowing a user to customize a wake word according to the vocabulary of a region (Wang et al. col. 28, lines 22-25), thereby extending the word customization capability of Sharifi et al. Therefore, it would be obvious for one skilled in the art to combine the teachings of Sharifi et al. with those of Wang et al. to enhance the hotword-aware speech synthesis of Sharifi et al.    


Claims 7-8 is/are rejected under 35 U.S.C. 103 as being unpatentable over US 20210104221, hereinafter referred to as Sharifi et al., in view of US 9484014, hereinafter referred to as Kaszczuk et al., and further in view of US 20180374477, hereinafter referred to as Kim et al.

Regarding claim 7, Sharifi et al., as modified by Kaszczuk et al., discloses the method of claim 1, but not generating the plurality of samples also uses services and/or algorithms that simulate background noise. Kim et al. is cited to disclose generating the plurality of samples also uses services and/or algorithms that simulate background noise (“In some cases, a speech simulation module (e.g., speech simulation module 565 shown in FIG. 5) may be used during the model training process to generate simulated audible sounds at various different distances from a speaker device. The speech simulation module may utilize different room configuration parameters (e.g., room size, room shape, microphone locations, noise levels) while generating the simulated audible sounds during the training of acoustic model 112,” Kim et al., para [0042]. The room simulation includes noise levels (i.e., background noise).). Kim et al. benefits Sharifi et al. by providing a room simulation application that is capable of simulating audio under various environment conditions (Kim et al., para [0002]), thereby training the speech synthesizer for more realistic speaker environments. Therefore, it would be obvious for one skilled in the art to combine the teachings of Sharifi et al. with those of Kim et al. to improve the hotword-aware speech synthesis of Sharifi et al.

Regarding claim 8, Sharifi et al., as modified by Kaszczuk et al., discloses the method of claim 1, but not generating the plurality of samples also uses services and/or algorithms that simulate room acoustics. Kim et al. is cited to disclose generating the plurality of samples also uses services and/or algorithms that simulate room acoustics (“In some cases, a speech simulation module (e.g., speech simulation module 565 shown in FIG. 5) may be used during the model training process to generate simulated audible sounds at various different distances from a speaker device. The speech simulation module may utilize different room configuration parameters (e.g., room size, room shape, microphone locations, noise levels) while generating the simulated audible sounds during the training of acoustic model 112,” Kim et al., para [0042].). Kim et al. benefits Sharifi et al. by providing a room simulation application that is capable of simulating audio under various environment conditions (Kim et al., para [0002]), thereby training the speech synthesizer for more realistic speaker environments. Therefore, it would be obvious for one skilled in the art to combine the teachings of Sharifi et al. with those of Kim et al. to improve the hotword-aware speech synthesis of Sharifi et al.
  
Claims 9, 16, and 20 is/are rejected under 35 U.S.C. 103 as being unpatentable over US 20210104221, hereinafter referred to as Sharifi et al., in view of US 9484014, hereinafter referred to as Kaszczuk et al., and further in view of US 20180233150, hereinafter referred to as Gruenstein et al.

Regarding claim 9, Sharifi et al., as modified by Kaszczuk et al., discloses the method of claim 1, but not further comprising: 

identifying one or more "white-listed" words, wherein the "white-listed" words do not initiate the virtual assistant associated with the computing device, and wherein the "white-listed" words are similar to the custom wake word, and 

training the machine learning model associated with the one or more "white-listed" words, whereby the wake word detection algorithm can recognize the one or more "white-listed" words as different from the wake word when spoken by the user within the vicinity of the computing device.

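Gruenstein et al. is cited to disclose identifying one or more "white-listed" words, wherein the "white-listed" words do not initiate the virtual assistant associated with the computing device, and wherein the "white-listed" words are similar to the custom wake word (Gruenstein et al., para [0024] - threshold can be configured to accept “ok” or “ok google” (i.e., similar key phrases) such that only one key phrase is detected.), and 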

training the machine learning model associated with the one or more "white-listed" words, whereby the wake word detection algorithm can recognize the one or more "white-listed" words as different from the wake word when spoken by the user within the vicinity of the computing device (Gruenstein et al., para [0024] – multiple different (but similar) key phrases are used to train the model.). Gruenstein et al. benefits Sharifi et al. by distinguishing between a hotword used to wake a device and other similar sounding expressions (Gruenstein et al., para [0024]). Therefore, it would be obvious for one skilled in the art to combine the teachings of Sharifi et al. with those of Gruenstein et al. to improve the hotword detection of Sharifi et al.


Regarding claim 16, Sharifi et al., as modified by Kaszczuk et al., discloses the non-transitory computer-readable medium of claim 13, but not wherein the instructions further cause the computing system to: 

identify one or more "white-listed" words, wherein the "white-listed" words do not initiate the virtual assistant associated with the computing device, and wherein the "white-listed" words are similar to the custom wake word, and 

train the machine learning model associated with the one or more "white-listed" words, whereby the wake word detection algorithm can recognize the one or more "white-listed" words as different from the wake word when spoken by the user within the vicinity of the computing device.

Gruenstein et al. is cited to disclose identify one or more "white-listed" words, wherein the "white-listed" words do not initiate the virtual assistant associated with the computing device, and wherein the "white-listed" words are similar to the custom wake word (Gruenstein et al., para [0024] - threshold can be configured to accept “ok” or “ok google” (i.e., similar key phrases) such that only one key phrase is detected.), and 

train the machine learning model associated with the one or more "white-listed" words, whereby the wake word detection algorithm can recognize the one or more "white-listed" words as different from the wake word when spoken by the user within the vicinity of the computing device (Gruenstein et al., para [0024] – multiple different (but similar) key phrases are used to train the model.). Gruenstein et al. benefits Sharifi et al. by distinguishing between a hotword used to wake a device and other similar sounding expressions (Gruenstein et al., para [0024]). Therefore, it would be obvious for one skilled in the art to combine the teachings of Sharifi et al. with those of Gruenstein et al. to improve the hotword detection of Sharifi et al.
As to claim 20, system claim 20 and method claim 16 are related as method and system of using same, with each claimed element’s function corresponding to the method step. Accordingly claim 20 is similarly rejected under the same rationale as applied above with respect to method claim. And, Sharifi et al., para [0065]-[0066] teaches processor, memory, and CRM. 

Claims 10, 14, and 18 is/are rejected under 35 U.S.C. 103 as being unpatentable over US 20210104221, hereinafter referred to as Sharifi et al., in view of .

Regarding claim 10, Sharifi et al., as modified by Kaszczuk et al., discloses the method of claim 1, but not wherein the generating the plurality of samples using the plurality of text-to-speech services comprises: 

modifying how the custom wake word can be pronounced; and 

varying at least one parameter of at least one of the plurality of text-to-speech services to result in different variations in pronunciation for the custom wake word output by the at least one of the plurality of text-to-speech services.

Marple et al. is cited to disclose modifying how the custom wake word can be pronounced (Marple et al., para [0070]); and 

varying at least one parameter of at least one of the plurality of text-to-speech services to result in different variations in pronunciation for the custom wake word output by the at least one of the plurality of text-to-speech services (Marple et al., para [0070]). Marple et al. benefits Sharifi et al. by providing a speech synthesizer method which can generate high quality, prosodic synthetic speech from input text (Marple et al., para [0012]). Therefore, it would be obvious for one skilled in the art to combine the teachings of Sharifi et al. with those of Marple et al. to improve the speech synthesizer of Sharifi et al.


Regarding claim 14, Sharifi et al., as modified by Kaszczuk et al., discloses the non-transitory computer-readable medium of claim 13, but not wherein the different text-to-speech services modify the user input in order to create the variations in pronouncing the custom wake word. Marple et al. is cited to disclose wherein the different text-to-speech services modify the user input in order to create the variations in pronouncing the custom wake word (Marple et al., para [0070]). Marple et al. benefits Sharifi et al. by providing a speech synthesizer method which can generate high quality, prosodic synthetic speech from input text (Marple et al., para [0012]). Therefore, it would be obvious for one skilled in the art to combine the teachings of Sharifi et al. with those of Marple et al. to improve the speech synthesizer of Sharifi et al.
As to claim 18, system claim 18 and method claim 14 are related as method and system of using same, with each claimed element’s function corresponding to the method step. Accordingly claim 18 is similarly rejected under the same rationale as applied above with respect to method claim. And, Sharifi et al., para [0065]-[0066] teaches processor, memory, and CRM. 

Claim 11 is/are rejected under 35 U.S.C. 103 as being unpatentable over US 20210104221, hereinafter referred to as Sharifi et al., in view of US 9484014, hereinafter referred to as Kaszczuk et al., and further in view of US 20190341067, hereinafter referred to as Rajendran et al.

Regarding claim 11, Sharifi et al., as modified by Kaszczuk et al., discloses the method of claim 1, but not further comprising defining an even distribution of different types of In a typical neural network based approach, the weights and biases of the neural network may be adjusted or trained based on a large speech database…During the training, neural network may generate probability distributions of the speech samples, given the conditional inputs comprising at least one of 531, 551, 532, 561…The goal of a properly trained generative model during the inference stage may be to find the probability distribution having a maximum likelihood, given the test conditionals. This probability distribution may be sampled to generate the synthesized speech signal 591,” Rajendran et al., para [0067].). Rajendran et al. benefits Sharifi et al. by accounting for distribution biases during the training of the speech sample neural network (Rajendran et al., para [0067]). Therefore, it would be obvious for one skilled in the art to combine the teachings of Sharifi et al. with those of Rajendran et al. to improve the wakeword detection of Sharifi et al.   


Claim 12 is/are rejected under 35 U.S.C. 103 as being unpatentable over US 20210104221, hereinafter referred to as Sharifi et al., in view of US 9484014, hereinafter referred to as Kaszczuk et al., and further in view of US 20160071510, hereinafter referred to as Li et al.

Regarding claim 12, Sharifi et al., as modified by Kaszczuk et al., discloses the method of claim 1, but not further comprising receiving crowd-sourced samples of the custom In an alternative exemplary embodiment, crowd-sourcing techniques may be utilized to generate the plurality of emotionally diverse candidate speech segments, as further described hereinbelow with reference to FIG. 5,” Li et al., para [0046].). Li et al. benefits Sharifi et al. by providing efficient and robust techniques for generating voice with emotional content to enhance user experience (Li et al., para [0005]). Therefore, it would be obvious for one skilled in the art to combine the teachings of Sharifi et al. with those of Li et al. to enhance the speech synthesizer of Sharifi et al.

Conclusion
Other related prior art is listed in the attached PTO-892. In particular, the examiner notes that several other Google patents/applications disclose similar information to that of Sharifi et al. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ANNE L THOMAS-HOMESCU whose telephone number is (571)272-0899.  The examiner can normally be reached on Mon-Fri 8-6.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.


/ANNE L THOMAS-HOMESCU/Primary Examiner, Art Unit 2659