Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
The amendment filed 09/28/2022 has been entered. Claims 1-20 remain pending in the application. Applicant’s amendments to the Claims have overcome each and every 112(b) rejection and 35 U.S.C. 101 rejection previously set forth in the Non-Final Office Action mailed 06/28/2022.
Response to Arguments
Applicant's arguments filed 09/28/2022 have been fully considered but they are not persuasive. 
With respect to rejections under 35 U.S.C. §103 of claims 1 and 19 (see Applicant’s remarks, “35 U.S.C. §103 Rejections,” pages 10-14), Applicant argues that the cited aspects of Garcia and Gruenstein fail to teach or suggest "suppressing processing of a query included in the audio data" based on both "(i) detecting the watermark and (ii) the speech transcription feature or intermediate embedding", as recited in claim 1. Applicant further argues that according to the relied-upon portion of Garcia, "determining [...] whether to perform speech recognition" is "based on analyzing the audio watermark", and "[i]n instances where the system determines to perform speech recognition [...] [t]he system executes a command included in the transcription", and therefore the relied-upon portion of Garcia teaches away from any such use of both "the watermark" and "the speech transcription feature or intermediate embedding" as a basis for "suppressing processing of a query included in the audio data". Additionally, Applicant submits that the cited aspects of Gruenstein also fail to teach or suggest use of both "(i) detecting the watermark and (ii) the speech transcription feature" as a basis for "suppressing processing of a query included in the audio data", as claimed, and instead the relied-upon portion of Gruenstein merely compares an "audio fingerprint [...] to the one or more audio fingerprints in the fingerprint database". Applicant further submits that the relied-upon portion of Gruenstein does not indicate that the "audio fingerprint" is a "speech transcription feature or intermediate embedding corresponding to one of the plurality of stored speech transcription features or intermediate embeddings", as claimed.
Examiner respectfully disagrees. MPEP section 2143.01 Suggestion or Motivation To Modify the References discusses establishing obviousness by combining or modifying the teachings of the prior art. In particular, subsections I. and V. discuss potential instances of the prior art teaching away from claim limitations. Examiner argues that Garcia does not teach away from the use of both "the watermark" and "the speech transcription feature or intermediate embedding" as a basis for "suppressing processing of a query included in the audio data" because Garcia neither criticizes nor disparages the use of both "the watermark" and "the speech transcription feature or intermediate embedding," nor does the proposed modification render the prior art invention being modified unsatisfactory for its intended purpose. Regarding the audio fingerprint, Gruenstein describes the audio fingerprint as a compact digital signature summarizing the audio data (Spec. page 3, [0026], lines 1-5), which can be stored in a table embedded in a binary format (Spec. page 3, [0027], lines 1-3). Under its broadest reasonable interpretation, this can be considered to be an intermediate embedding. Such fingerprints are compared to stored fingerprints as a basis for suppressing query processing (Spec. page 3, [0028], lines 1-8); therefore, the audio fingerprint of Gruenstein can be considered a "speech transcription feature or intermediate embedding corresponding to one of the plurality of stored speech transcription features or intermediate embeddings." Additionally, Applicant’s amendments to the claims alter the scope of the invention. Claim 1 now includes the limitation “based on (i) detecting the watermark and (ii) the speech transcription feature or intermediate embedding,” which had not been previously considered.
With respect to rejections under 35 U.S.C. §103 of claim 10 (see Applicant’s remarks, “35 U.S.C. §103 Rejections,” page 14), Applicant argues that similar to claims 1 and 19 above, the relied-upon portions of Garcia and Gruenstein fail to teach or suggest using both "(i) detecting the watermark and (ii) the speech transcription feature", as claimed. Additionally, Applicant argues that the relied-upon portion of Gruenstein does not indicate that the "audio fingerprint" is a "speech transcription feature or intermediate embedding corresponding to one of the plurality of stored speech transcription features or intermediate embeddings". Applicant further argues that the cited aspects of Kim do not teach or suggest to "modify a threshold" "based on (i) detecting the watermark and (ii) the speech transcription feature" (emphasis added), as claimed.
Examiner respectfully disagrees. Applicant’s amendments to the claims alter the scope of the invention. Claim 1 now includes the limitation “based on (i) detecting the watermark and (ii) the speech transcription feature or intermediate embedding,” which had not been previously considered.
With respect to dependent claims (see Applicant’s remarks, “General Comments on the Dependent Claims,” pages 14-15), Applicant argues that each of the dependent claims depends from a base claim that is believed to be in condition for allowance.
Similar to claims 1, 10, and 19, Applicant’s amendments to the claims alter the scope of the invention. Claim 1 now includes the limitation “based on (i) detecting the watermark and (ii) the speech transcription feature or intermediate embedding,” which had not been previously considered.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1-4, 8, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Basye et al (Patent No. US 9548053 B1), hereinafter Basye, in view of Garcia (Doc. ID US 20180350356 A1).

Regarding claim 1, Basye teaches a method (Abstract) implemented by one or more processors (Spec. Col. 5, lines 11-15; computing device 200, which may be either local device 102 or remote device 104. Each of the methods described with reference to FIGS. 1 and 4-8, which are disclosed as performed on one or more components described in Fig. 2, may be combined with one or more of the other methods, and one or more steps of a method may be incorporated into the other methods; col. 8 lines 7 - 16), the method comprising:
receiving, via one or more microphones of a client device, audio data that captures a spoken utterance (e.g. listen for and capture audio; 604 of Fig. 6; note computing device 200 of Fig. 2 may include one or more audio capture device(s), such as a microphone or an array of microphones 202, for receiving and capturing wake words and audible commands and other audio; Spec. Col. 4, lines 45-48);
processing the audio data using one or more models (Spec. Col. 6, lines 36 – 62; Audio processing module 222 uses various models, speech recognition… models may be applied to compare the audio input to one or more acoustic models that are based on stored audio of inadvertent wake words and/or audible commands) to generate a predicted output that indicates a probability of one or more hotwords being present in the audio data (Spec. Col. 5, lines 55 – 60; audio processing module 222 receives captured audio of detected wake words and audible commands and any additional audio captured in the recording, and processes the audio to determine whether the recording corresponds to an utterance of the wake words and/or audible command. Col. 6 lines 43-44; HMM is used to determine probabilities that feature vectors may match phonemes); 
determining that the predicted output satisfies a threshold that is indicative of the one or more hotwords being present in the audio data (Spec. Col. 6, lines 36 – 62; use of HMMs to determine probabilities that feature vectors may match phonemes. A threshold is implied by virtue of the fact that the process is seeking to identify a “match,” and one way to do this is described using probabilities, thus it is understood that a match will be identified with a satisfactory probability, or a certain probability level, or a “threshold”);
in response to determining that the predicted output satisfies the threshold (Fig. 6, element 606; recognize wake word and/or command; when this block results in a “yes,” [i.e. matches] proceeding to 608, i.e. “in response to”), processing the audio data using automatic speech recognition to generate a speech transcription feature or intermediate embedding (Fig. 6, element 608, Spec. Col. 11, lines 9 – 11; when a wake word and/or audible command is recognized, the captured audio may be processed. Spec. Col. 6, lines 36 – 62; Audio processing is further detailed indicating processing of new feature vectors. Col. 6, lines 15-26; to process… a digital summary of audio including an inadvertent wake word and/or audible command may be generated based on frequency, intensity, time, and other parameters of the audio);
detecting a watermark that is embedded in the audio data (Spec. Col. 6, lines 8-14; the audio processing module 222 may process the captured audio to detect a signal inaudible to humans embedded in the audible command, i.e. a watermark); 
in response to detecting the watermark (Fig. 6, element 610; signal detected?; if yes, proceeding to 614, i.e. “in response to”): determining that the speech transcription feature or intermediate embedding corresponds to one of a plurality of stored speech transcription features or intermediate embeddings (Spec. Col. 11, lines 15-29, Fig. 6; element 614- the signal may be compared to the one or more stored known signals to determine whether the captured audio corresponds to a known inadvertent wake word and/or command-element 616. In order to make this comparison and identify these correspondences, the system uses the various audio comparison/processing techniques, for example finger printing, HMMs, speech recognition etc., detailed in col. 6; in other words, matching the processed feature vectors, or using the digital summary); and 
based on (i) detecting the watermark (e.g. answering yes in 610 Fig. 6) and (ii) the speech transcription feature or intermediate embedding corresponding to one of the plurality of stored speech transcription features or intermediate embeddings (comparison and determination in 614 and 616, “matching,” resulting in an inadvertent determination), suppressing processing of a query included in the audio data (Spec. Col. 11, lines 25-30, Fig. 6 element 618; the wake word and/or audible command corresponding to the captured audio may be an inadvertent wake word and/or command and may be disregarded/aborted/cancelled).
Basye does not explicitly teach, however, that the models used for processing the audio data are machine learning models.
In a related field of endeavor, Garcia teaches a method for suppressing hotword triggers detected in recorded media playback based on analysis of a watermark detected in the media (Abstract). Garcia further teaches processing audio data using one or more machine learning models (Spec. page 3, [0025], lines 14-16; the hotworder may use a neural network to process the audio) to generate a predicted output that indicates a probability of one or more hotwords being present in the audio data (Spec. page 3, [0025], lines 1-2; the computing device contains a hotworder. Lines 18-21; the hotworder generates a hotword confidence score for the audio to determine if the audio contains a hotword, i.e. a predicted output that indicates a probability of one or more hotwords being present in the audio data); and determining that the predicted output satisfies a threshold that is indicative of the one or more hotwords being present in the audio data (Spec. page 3, [0025], lines 18-21; the hotworder determines that the audio includes a hotword if the hotword confidence score satisfies a hotword confidence score threshold).
Adapting Basye’s techniques for disregarding wake words and commands in media to incorporate the hotword detection techniques as detailed by Garcia further discloses:
processing the audio data using machine learning models (e.g. the models of Basye, now adapted to implement the techniques disclosed by Garcia, namely the use of a neural network to process the audio; para 25).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Basye by incorporating the teachings of Garcia. Both Garcia and Basye are directed to hotword trigger suppression techniques for hotwords detected in recorded media playback. Further, Basye teaches that the determination of whether captured audio contains a wake word, i.e. a hotword, may be accomplished by a variety of known audio processing techniques (Spec. Col. 8, lines 54-57). Garcia teaches such a technique for accomplishing the goal of hotword detection. Given the overlap, in particular, the detection of hotword triggers in captured audio for the purposes of suppressing triggers in media playback, incorporation of the features of Garcia into Basye would have been predictable to one of ordinary skill in the art at the time of filing. 
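For illustration only, the decision sequence mapped to claim 1 above can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of Basye, Garcia, or the application: the names `AudioAnalysis` and `should_suppress_query`, the default threshold of 0.8, and the use of exact string membership for the stored-feature comparison are all hypothetical and do not appear in the cited references.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class AudioAnalysis:
    """Hypothetical container for the results of the upstream processing steps."""
    hotword_probability: float            # predicted output of the hotword model
    watermark_detected: bool              # result of watermark detection
    transcription_feature: Optional[str]  # ASR output, e.g. normalized text


def should_suppress_query(analysis: AudioAnalysis,
                          stored_features: Set[str],
                          hotword_threshold: float = 0.8) -> bool:
    """Return True when processing of the query should be suppressed."""
    # The predicted output must first satisfy the hotword threshold
    # (assumed here to mean "greater than or equal to").
    if analysis.hotword_probability < hotword_threshold:
        return False
    # Suppression is based on BOTH (i) a detected watermark and
    # (ii) a match against the stored features.
    if not analysis.watermark_detected:
        return False
    return analysis.transcription_feature in stored_features
```

In this sketch, a watermark alone or a stored-feature match alone does not suppress the query; only the conjunction of the two does, which is the point in dispute in the Response to Arguments above.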

Regarding claim 2, in addition to the elements stated above regarding claim 1, the combination of Basye and Garcia above further teaches wherein the detecting the watermark is in response to determining that the predicted output satisfies the threshold (Basye: Fig. 6; element 608, “Process Audio to Detect Signal Embedded in Audio” is done in response to a wake word recognized in the captured audio in element 606. As detailed above with respect to claim 1, the recognition of the wake word can be determined according to the features as taught by Garcia such that the wake word is recognized if the predicted output satisfies the threshold).

Regarding claim 3, in addition to the elements stated above regarding claim 1, the combination of Basye and Garcia above further teaches wherein the watermark is an audio watermark that is imperceptible to humans (Basye: Spec. Col. 6, lines 8-14; the audio processing module may process the captured audio to detect a signal inaudible to humans embedded in the audible command, i.e. a watermark).

Regarding claim 4, in addition to the elements stated above regarding claim 1, the combination of Basye and Garcia above further teaches wherein the plurality of stored speech transcription features or intermediate embeddings is stored on the client device (Basye, Spec. Col. 4, lines 24-37; the local device, i.e. the client device, may process the audio to compare the captured audio to stored audio data without sending the captured audio data to the remote device, and the local device may have a local storage component storing the relevant files).

Regarding claim 8, in addition to the elements stated above regarding claim 1, the combination of Basye and Garcia above further teaches determining whether a current time or date is within an active window of the watermark (Basye, Spec. Col. 8, lines 10-14; methods described with respect to Figs. 1 and 4-8 may be combined with one or more other methods and steps of methods may be combined with other methods. Col. 11, lines 49-51 and 59-66; the time of received requests to verify wake words and/or audible commands may be compared to known time frames, i.e. active windows, for broadcasting of inadvertent wake words/commands to determine whether the time of the requests corresponds to the known time frames. This can be combined with the inaudible signal, i.e. watermark, detection detailed above with respect to claim 1 such that the device determines whether a current time or date is within an active window of the watermark), and 
wherein suppressing processing of the query included in the audio data is further in response to determining that the current time or date is within the active window of the watermark (Basye, Spec. Col. 12, lines 3-7; the device cancels a response to the wake word and/or command if the time of the request for verification is within the time frame, i.e. the active window).
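The additional time-based condition of claim 8 can be illustrated with a short sketch. It assumes, purely for illustration, that an active window is represented as a start and end `datetime` and that both endpoints are inclusive; the function names are hypothetical and are not taken from Basye or Garcia.

```python
from datetime import datetime


def within_active_window(now: datetime,
                         window_start: datetime,
                         window_end: datetime) -> bool:
    """Return True when `now` falls inside the window (endpoints inclusive)."""
    return window_start <= now <= window_end


def suppress_with_window(watermark_detected: bool,
                         feature_matches: bool,
                         now: datetime,
                         window_start: datetime,
                         window_end: datetime) -> bool:
    """Suppress only when all three conditions hold."""
    return (watermark_detected
            and feature_matches
            and within_active_window(now, window_start, window_end))
```

Under these assumptions, a watermark detected outside its window does not by itself lead to suppression.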

Regarding claim 19, the claim is directed to a system comprising: a processor, a computer-readable memory, one or more computer-readable storage media, and program instructions collectively stored on the one or more computer-readable storage media, the program instructions executable to perform features of the claimed method of claim 1. Basye teaches a system comprising these elements (Spec. Col. 5 lines 30-38) for performing the method of claim 1, therefore claim 19 is rejected under the same grounds.

Claims 5-6 are rejected under 35 U.S.C. 103 as being unpatentable over Basye in view of Garcia and further in view of Bar-Yossef, Ziv, et al. (“Approximating Edit Distance Efficiently”), hereinafter Bar-Yossef.

Regarding claim 5, the combination of Basye and Garcia teaches the elements stated above regarding claim 1. The combination further teaches the use of a score relating to the captured audio matching a known utterance of the stored inadvertent wake words and commands by comparing the score to a threshold value (Basye, Spec. Col. 9, lines 7-12). However, the combination does not explicitly teach the use of an edit distance to determine that the speech transcription feature or intermediate embedding corresponds to one of the plurality of stored speech transcription features or intermediate embeddings.
Bar-Yossef teaches algorithms to improve the computation of edit distance between two strings (Abstract, page 1). The techniques involve the embedding of edit distance space into Hamming space (Page 2, Col. 1, “Techniques,” lines 1-3) and estimating a Hamming distance between the strings (Page 4, Col. 2, Section 2 “Overview,” lines 19-22), i.e. determining an edit distance between two strings by determining an embedding distance between the embeddings of the strings. 
Adapting the combination of Basye and Garcia to incorporate the teachings of Bar-Yossef for estimating edit distance between strings provides the method according to claim 1, wherein determining that the speech transcription feature or intermediate embedding corresponds to one of the plurality of stored speech transcription features or intermediate embeddings comprises determining that an edit distance between the speech transcription feature and one of the plurality of stored speech transcription features satisfies a threshold edit distance (The method of Basye, Spec. Col. 6, lines 27-31; the comparison process involving processing the captured audio with speech recognition to convert the audio to text for comparison to the stored text of advertisements and other media as described with reference to claim 1 above and Basye, Spec. Col. 9, lines 7-12; comparing a score relating to the captured audio matching a known utterance of the stored inadvertent wake words and commands to a threshold value, now adapted to use the algorithm of Bar-Yossef as taught above to determine the score using an edit distance between the text of the captured audio and the stored text corresponding to advertisements, etc.).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Basye and Garcia by incorporating the teachings of Bar-Yossef to provide the claimed invention of claim 5. Basye teaches a comparison of text using a score to rate a potential match between text and Bar-Yossef teaches a particular method for determining edit distance between two strings which can be used as the score. As Bar-Yossef teaches a known technique, the computation of edit distance for comparing the likeness of strings, to a known method, text comparison, incorporation of the features of Bar-Yossef into the combination of Basye and Garcia would have been predictable to one of ordinary skill in the art at the time of filing.
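As an illustration of the threshold edit distance comparison discussed for claim 5, the following sketch computes an exact Levenshtein distance by dynamic programming and compares it to a threshold. Note that this is not the algorithm of Bar-Yossef, which approximates edit distance; the exact computation is swapped in here only because it is short and well known. The function names and the default threshold of 2 are hypothetical.

```python
from typing import Iterable


def edit_distance(a: str, b: str) -> int:
    """Exact Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of `a`
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def matches_stored(transcript: str,
                   stored: Iterable[str],
                   max_distance: int = 2) -> bool:
    """True when any stored transcription is within the threshold edit distance."""
    return any(edit_distance(transcript, s) <= max_distance for s in stored)
```

A threshold greater than zero lets a slightly misrecognized transcription still correspond to a stored one.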

Regarding claim 6, in addition to the elements stated above regarding claim 1, the combination of Basye, Garcia and Bar-Yossef further teaches wherein determining that the speech transcription feature or intermediate embedding corresponds to one of the plurality of stored speech transcription features or intermediate embeddings comprises determining that an embedding-based distance satisfies a threshold embedding-based distance (The method of Basye, Spec. Col. 6, line 63 – Col. 7, line 3; the audio processing module may determine if the captured wake word and/or audible command matches a stored inadvertent wake word and/or command by comparing their respective acoustic models. The models are considered to be intermediate embeddings. The combination further teaches the use of a score relating to the captured audio matching a known utterance of the stored inadvertent wake words and commands by comparing the score to a threshold value [Basye, Spec. Col. 9, lines 7-12]. The method can now be adapted to use the algorithm of Bar-Yossef as taught above to determine an embedding distance between the acoustic models of the captured audio and the acoustic models of the stored inadvertent wake words and/or commands).
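The embedding-based comparison of claim 6 can be sketched in a similar way. The sketch below assumes, for illustration only, that embeddings are fixed-length numeric vectors and that cosine distance is the metric; neither the metric nor the default threshold of 0.1 is taken from Basye, Garcia, or Bar-Yossef (the latter works with Hamming distance), and the function names are hypothetical.

```python
import math
from typing import Iterable, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity; defined as 1.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def embedding_matches(embedding: Sequence[float],
                      stored_embeddings: Iterable[Sequence[float]],
                      max_distance: float = 0.1) -> bool:
    """True when any stored embedding is within the threshold distance."""
    return any(cosine_distance(embedding, s) <= max_distance
               for s in stored_embeddings)
```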

Claims 7 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Basye in view of Garcia and further in view of Mahmood et al. (Doc. ID US 20210090575 A1), hereinafter Mahmood.

Regarding claim 7, the combination of Basye and Garcia teaches the method according to claim 1 as detailed above for suppressing processing of a query. However, the combination does not teach in response to detecting the watermark: 
using speaker identification on the audio data to determine a speaker vector corresponding to the query included in the audio data; and 
determining that the speaker vector corresponds to one of a plurality of stored speaker vectors, 
wherein suppressing processing of the query included in the audio data is further in response to determining that the speaker vector corresponds to one of the plurality of stored speaker vectors.
Mahmood teaches techniques for a natural language processing system to implement multiple assistants during dialog with one or more users (Abstract). The system also includes a user recognition component (Spec. page 5, [0067]).
Adapting the combination of Basye and Garcia to incorporate the teachings of Mahmood for user recognition provides the method according to claim 1, further comprising, in response to detecting the watermark: 
using speaker identification on the audio data to determine a speaker vector corresponding to the query included in the audio data (Mahmood: Spec. page 5, [0068], lines 1-7; a user recognition component compares speech characteristics in audio data to stored speech characteristics of users to identify a speaker. Page 19, [0213], lines 12-14; user recognition is done with user recognition feature vector data); and 
determining that the speaker vector corresponds to one of a plurality of stored speaker vectors (Mahmood, Spec. page 5, [0070], lines 1-3; the user recognition component outputs a single user identifier corresponding to the most likely user that originated the natural language input. [0072], lines 1-3; the user identifier is associated with a user profile in a plurality of user profiles in profile storage), 
wherein suppressing processing of the query included in the audio data is further in response to determining that the speaker vector corresponds to one of the plurality of stored speaker vectors (the method of wake word trigger suppression as taught by the combination of Basye and Garcia now adapted to use the features taught by Mahmood to suppress processing the query based on determining that the speaker vector corresponds to one of the plurality of stored speaker vectors).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Basye and Garcia by incorporating the teachings of Mahmood to provide the claimed invention of claim 7. Basye and Mahmood are both directed to the use of natural language queries to perform tasks on a user device. Basye recognizes that a flaw of the use of wake words to trigger a device configured to accept commands from multiple users is the device’s inability to distinguish an actual user from other audio not meant to trigger the device (Spec. Col. 2, lines 1-7). Mahmood teaches a particular technique for identifying a speaker for received audio. Given the overlap, in particular, identifying when a user is attempting to trigger a device in natural language processing, incorporation of the features of Mahmood into the combination of Basye and Garcia would have been predictable to one of ordinary skill in the art at the time of filing.

Regarding claim 20, the claim is directed to the system according to claim 19 for performing the features of the claimed method of claim 7 and is rejected under the same grounds.

Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Basye in view of Garcia and further in view of Bartosik et al. (US 20070033026 A1), hereinafter Bartosik.

Regarding claim 9, the combination of Basye and Garcia teaches the method according to claim 1 as detailed above for suppressing processing of a query, however the combination does not explicitly teach wherein the plurality of stored speech transcription features or intermediate embeddings includes erroneous transcriptions.
Bartosik teaches a speech recognition and correction system which creates a lexicon of alternatives for frequently incorrect utterance transcriptions (Abstract).
Adapting the combination of Basye and Garcia to incorporate the teachings of Bartosik provides the method according to claim 1, wherein the plurality of stored speech transcription features or intermediate embeddings includes erroneous transcriptions (The stored data corresponding to the inadvertent wake words and/or commands of Basye, now adapted to include erroneous transcriptions of the fingerprints as taught by Bartosik in the Spec. page 1, [0003]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Basye and Garcia by incorporating the teachings of Bartosik to provide the claimed invention of claim 9. All three disclosures are directed to the processing of natural language input. Basye is directed to the prevention of a user device performing an action incorrectly due to received input. Similarly, Bartosik is directed to the prevention of erroneous action by providing a list of alternatives to replace incorrectly recognized text (Spec. page 1, [0004]). Furthermore, inclusion of the features of Bartosik would have improved the ability to correctly identify when an action should be suppressed, as Bartosik notes that a ready list of alternatives for incorrectly identified input eases correction by making it so that correction can be done more quickly (Spec. page 1, [0006]).

Claims 10, 12-13, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Basye in view of Kim et al. (US 20160077794 A1), hereinafter Kim.

Regarding claim 10, Basye teaches a computer program product comprising one or more non-transitory computer-readable storage media having program instructions collectively stored on the one or more non-transitory computer-readable storage media (Spec. Col. 5 lines 30-38), the program instructions executable to: 
receive, via one or more microphones of a client device, first audio data that captures a first spoken utterance (Spec. Col. 4, lines 45-48; Spec. Col. 8, lines 51-52); 
process the first audio data using automatic speech recognition to generate a speech transcription feature or intermediate embedding (Spec. Col. 6, lines 27-31; the audio processing module may process the captured audio with speech recognition to convert the audio to text for comparison to the stored text of advertisements and other media);  
detect a watermark that is embedded in the first audio data (Spec. Col. 6, lines 8-14; the audio processing module may process the captured audio to detect a signal inaudible to humans embedded in the audible command, i.e. a watermark);
in response to detecting the watermark, determine that the speech transcription feature or intermediate embedding corresponds to one of a plurality of stored speech transcription features or intermediate embeddings (Spec. Col. 8, lines 10-14; methods described with respect to Figs. 1 and 4-8 may be combined with one or more other methods and steps of methods may be combined with other methods. Spec. Col. 11, lines 16-20; when the inaudible signal, i.e. the watermark, is detected, the signal is compared to stored signals to determine if the captured audio corresponds to a known inadvertent wake word and/or command. This can be combined with the processing the captured audio with speech recognition to convert the audio to text for comparison to the stored text of advertisements and other media).
However, Basye does not explicitly teach modifying a threshold that is indicative of one or more hotwords being present in audio data.
In a related field of endeavor, hotword trigger suppression to avoid false positive trigger activation, Kim teaches systems and processes for dynamically adjusting a speech trigger threshold for triggering a virtual assistant in response to perceived events to minimize missed and false positive triggers (Abstract). 
Adapting Basye’s hotword trigger suppression techniques to incorporate the features for adjusting a speech trigger threshold for triggering a virtual assistant as detailed by Kim further discloses: based on (i) detecting a watermark and (ii) the speech transcription feature or intermediate embedding corresponding to one of the plurality of stored speech transcription features or intermediate embeddings, modifying a threshold that is indicative of one or more hotwords being present in audio data (Kim’s method of modifying a threshold indicating that a sampled audio input includes a spoken command trigger as detailed in Spec. page 1, [0008], now adapted to be performed in response to (i) detecting a watermark and (ii) the speech transcription feature or intermediate embedding corresponding to one of the plurality of stored speech transcription features or intermediate embeddings as detailed in Basye in Col. 6, lines 5-14; the audio processing module may use both the detection of a signal inaudible to humans embedded in the audible command, i.e. a watermark, and the comparison of the captured audio to stored utterances in order to determine if the wake word, i.e. hotword, and audible command in the captured audio should be processed).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Basye by incorporating the teachings of Kim. Basye and Kim are directed to speech triggers for client devices. In particular, Basye pertains to the avoidance of activating the user’s device in response to false triggers. Kim is also concerned with the avoidance of false positive trigger activation, and teaches a process that can respond to a change in circumstances by adjusting the threshold of the trigger command recognition as needed to solve this problem (Abstract). Therefore, it would have been predictable to one of ordinary skill in the art at the time of filing to combine the disclosures to further minimize false trigger activation.
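For illustration only, the combined logic mapped above (modify the hotword threshold only when both a watermark is detected and the transcription feature matches a stored one) can be sketched as follows. This is a minimal sketch under stated assumptions, not an implementation taken from Basye or Kim: the class name `HotwordGate`, the default threshold, the use of plain strings as features, and the signed adjustment `delta` are all hypothetical, and the direction of the adjustment is left to the caller.

```python
from dataclasses import dataclass
from typing import Collection


@dataclass
class HotwordGate:
    # Hypothetical starting threshold; neither reference specifies a value.
    threshold: float = 0.5

    def maybe_modify(
        self,
        watermark_detected: bool,
        feature: str,
        stored_features: Collection[str],
        delta: float,
    ) -> bool:
        """Adjust the threshold only if BOTH conditions hold.

        (i) a watermark was detected, and
        (ii) the transcription feature matches a stored feature.
        Returns True when the threshold was modified.
        """
        if watermark_detected and feature in stored_features:
            # Clamp to [0, 1] so the value stays a valid confidence threshold.
            self.threshold = min(1.0, max(0.0, self.threshold + delta))
            return True
        return False
```

With `HotwordGate(threshold=0.5)`, a call with a detected watermark, a matching feature, and `delta=0.25` moves the threshold to 0.75; if either condition fails, the threshold is left unchanged.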

Regarding claim 12, the claim is directed to the computer program product according to claim 10 for performing the features of the claimed method of claim 3 and is rejected under the same grounds.

Regarding claim 13, the claim is directed to the computer program product according to claim 10 for performing the features of the claimed method of claim 4 and is rejected under the same grounds.

Regarding claim 17, the combination of Basye and Kim further teaches the computer program product according to claim 10, wherein the program instructions are further executable to determine whether a current time or date is within an active window of the watermark (Basye, Spec. Col. 8, lines 10-14; methods described with respect to Figs. 1 and 4-8 may be combined with one or more other methods and steps of methods may be combined with other methods. Col. 11, lines 49-51 and 59-66; the time of received requests to verify wake words and/or audible commands may be compared to known time frames, i.e. active windows, for broadcasting of inadvertent wake words/commands to determine whether the time of the requests corresponds to the known time frames. This can be combined with the inaudible signal, i.e. watermark, detection detailed above with respect to claim 1 such that the device determines whether a current time or date is within an active window of the watermark), and 
wherein modifying the threshold that is indicative of the one or more hotwords being present in audio data is further in response to determining that the current time or date is within the active window of the watermark (Basye’s system for determining whether the current time or date is within an active window of the watermark as detailed above, now adapted to modify the threshold that is indicative of the one or more hotwords being present in audio data as taught by Kim in the Spec. page 1, [0008] in response to determining that the current time or date is within the active window of the watermark).
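For illustration only, the active-window condition of claim 17 can be sketched as a time-range check that gates the threshold adjustment. This is a minimal sketch under stated assumptions: the function names, the inclusive window bounds, and the signed adjustment `delta` are hypothetical and are not drawn from Basye or Kim.

```python
from datetime import datetime


def in_active_window(now: datetime, start: datetime, end: datetime) -> bool:
    # Assumes an inclusive window; the references do not specify the bounds.
    return start <= now <= end


def threshold_after_window_check(
    threshold: float,
    delta: float,
    watermark_detected: bool,
    now: datetime,
    start: datetime,
    end: datetime,
) -> float:
    """Return the adjusted threshold only when a watermark was detected
    and the current time falls inside the watermark's active window."""
    if watermark_detected and in_active_window(now, start, end):
        # Clamp to [0, 1] so the result remains a valid threshold.
        return min(1.0, max(0.0, threshold + delta))
    return threshold
```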

Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Basye in view of Kim and further in view of Garcia.

Regarding claim 11, the combination of Basye and Garcia as detailed above with respect to claim 1 further teaches the computer program product according to claim 10, as detailed by Basye and Kim, wherein the program instructions are further executable to: 
receive, via the one or more microphones of the client device, second audio data that captures a second spoken utterance (Kim, Spec. page 5, [0051], lines 7-10); 
process the second audio data using one or more machine learning models (Garcia, Spec. page 3, [0025], lines 14-16; the hotworder may use a neural network to process the audio) to generate a predicted output that indicates a probability of one or more hotwords being present in the second audio data (the hotworder of Garcia as detailed in the Spec. page 3, [0025], lines 1-2; the computing device contains a hotworder. Lines 18-21: now adapted to generate a hotword confidence score for the second audio to determine if the audio contains a hotword, i.e. a predicted output that indicates a probability of one or more hotwords being present in the second audio data); 
determine that the predicted output satisfies the modified threshold that is indicative of the one or more hotwords being present in the second audio data (Garcia, Spec. page 3, [0025], lines 18-21; the hotworder adapted to determine that the audio includes a hotword if the hotword confidence score satisfies a modified hotword confidence score threshold, modified as taught by Kim); and 
in response to determining that the predicted output satisfies the modified threshold, process a query included in the second audio data (Kim, Spec. page 5, [0051], lines 18-22; the threshold is lowered such that on the second try, the user utterance is more likely to satisfy the modified threshold and trigger activation of the device and process any associated query).
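For illustration only, the claim 11 sequence mapped above (score the second audio data, compare the score to the modified threshold, and process the query only when the threshold is satisfied) can be sketched as follows. This is a minimal sketch under stated assumptions: `score_hotword` stands in for any hotword model and `process_query` for any query handler, both hypothetical callables, and "satisfies" is assumed to mean greater than or equal to the threshold.

```python
from typing import Callable, Optional


def handle_second_utterance(
    audio: bytes,
    score_hotword: Callable[[bytes], float],
    modified_threshold: float,
    process_query: Callable[[bytes], str],
) -> Optional[str]:
    """Process the query only if the hotword score meets the modified threshold."""
    # Predicted output: probability that a hotword is present in the audio.
    score = score_hotword(audio)
    # Assumption: "satisfies" is interpreted as score >= threshold.
    if score >= modified_threshold:
        return process_query(audio)
    # Below the threshold: no hotword trigger, so the query is not processed.
    return None
```

Under this sketch, lowering `modified_threshold` makes the same score more likely to trigger processing on a second attempt, which is the effect attributed to Kim in the mapping above.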

Claims 14-15 are rejected under 35 U.S.C. 103 as being unpatentable over Basye in view of Kim and further in view of Bar-Yossef.

Regarding claim 14, the claim is directed to the computer program product according to claim 10 for performing the features of the claimed method of claim 5 and is rejected under the same grounds.

Regarding claim 15, the claim is directed to the computer program product according to claim 10 for performing the features of the claimed method of claim 6 and is rejected under the same grounds.

Claim 16 is rejected under 35 U.S.C. 103 as being unpatentable over Basye in view of Kim and further in view of Mahmood.

Regarding claim 16, the combination of Basye and Mahmood as detailed above with respect to claim 7 further teaches the computer program product according to claim 10, as detailed by Basye and Kim, wherein the program instructions are further executable to, in response to detecting the watermark: 
use speaker identification on the audio data to determine a speaker vector corresponding to the query included in the audio data (Mahmood: Spec. page 5, [0068], lines 1-7; a user recognition component compares speech characteristics in audio data to stored speech characteristics of users to identify a speaker. Page 19, [0213], lines 12-14; user recognition is done with user recognition feature vector data); and 
determining that the speaker vector corresponds to one of a plurality of stored speaker vectors (Mahmood, Spec. page 5, [0070], lines 1-3; the user recognition component outputs a single user identifier corresponding to the most likely user that originated the natural language input. [0072], lines 1-3; the user identifier is associated with a user profile in a plurality of user profiles in profile storage),
 wherein modifying the threshold that is indicative of the one or more hotwords being present in audio data is further in response to determining that the speaker vector corresponds to one of the plurality of stored speaker vectors (Mahmood’s system for user recognition as detailed above, now adapted to modify the threshold that is indicative of the one or more hotwords being present in audio data as taught by Kim in the Spec. page 1, [0008] in response to determining that the speaker vector corresponds to one of the plurality of stored speaker vectors).

Claim 18 is rejected under 35 U.S.C. 103 as being unpatentable over Basye in view of Kim and further in view of Bartosik.

Regarding claim 18, the claim is directed to the computer program product according to claim 10 for performing the features of the claimed method of claim 9 and is rejected under the same grounds.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Tai et al. (Pub. No. US 2020/0098380 A1) teaches a system for embedding and detecting audio watermarks in audio data to enable wakeword suppression or signal transmission between devices in proximity with one another (Abstract).
Salem et al. (Patent No. US 11,100,930 B1) teaches a method for avoiding false wake word triggers from remote devices during communication sessions (Abstract). 
Gruenstein et al. (Doc. ID. US 2018/0130469 A1) teaches a method and system for hotword trigger suppression for hotwords detected in playback of media content (Abstract).

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to PARKER L MAYFIELD whose telephone number is (571)272-4745. The examiner can normally be reached Monday - Thursday 8:00 AM-6:00 PM, Friday 8:00 AM-12:00 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Flanders, can be reached at (571)272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/PARKER L MAYFIELD/
Examiner
Art Unit 2655



/ANDREW C FLANDERS/
Supervisory Patent Examiner, Art Unit 2655