DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claim(s) 1-6, 8-10, 13-19, 21-23 and 26 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Georges U.S. PAP 2019/0266240 A1.
Regarding claim 1 Georges teaches a method comprising: 
receiving, at data processing hardware of a user device associated with a user, audio data corresponding to an utterance spoken by the user and captured by the user device (detect a phrase in an electronic representation of an audio stream based on a pre-defined vocabulary, see abstract), the utterance comprising a command for a digital assistant to perform an operation (personal digital assistant, see par. [0057]);
during each of a plurality of fixed-duration time (a time stamp with the detected phrase, see abstract) windows of the audio data: 
determining, by the data processing hardware, using a hotphrase detector configured to detect each trigger word in a set of trigger words associated with a hotphrase (automatic speech recognition (ASR) systems may include a wake-phrase detector, see par. [0029]), whether any of the trigger words in the set of trigger words are detected in the audio data during the corresponding fixed-duration time window (the wake-phrase detector listens for a wake-up phrase. The speech recognizer starts recognizing the speech signal when the wake-phrase is detected. The recognized word hypotheses are then analyzed by the NLU module, see par. [0029]); 
when one of the trigger words in the set of trigger words associated with the hotphrase is detected in the audio data during the corresponding fixed-duration time window, determining, by the data processing hardware, whether each other trigger word in the set of trigger words associated with the hotphrase was also detected in the audio data (detect a set of keywords in a sequence with suitable timing and make an intent decision from the detected keywords/sequence by using enhanced features, see par. [0042]); 
and when each other trigger word in the set of trigger words was also detected in the audio data, identifying, by the data processing hardware, in the audio data corresponding to the utterance, the hotphrase (the phrase detection module 36 detects the in-domain vocabulary in the continuous audio stream, determines the corresponding phrase text and also determine a relative, quantized time stamps to previously spotted phrases of the continuous audio stream, see par. [0032]);
and triggering, by the data processing hardware, an automated speech recognizer (ASR) to perform speech recognition on the audio data when the hotphrase is identified in the audio data corresponding to the utterance (some embodiments may determine if a detected intent is outside the scope of the in-domain vocabulary and provide the speech signal and/or speech text data to a full ASR system for further processing, see par. [0031]).
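For illustration only, the control flow recited in claim 1 can be sketched as follows. This is a minimal sketch and is taken neither from the application nor from Georges; the function name `hotphrase_identified` and the representation of each fixed-duration window as a set of detected words are assumptions made for the example.

```python
from typing import Iterable, Set


def hotphrase_identified(windows: Iterable[Set[str]], trigger_words: Set[str]) -> bool:
    """Return True once every trigger word has been detected across the windows.

    Each element of `windows` is the (hypothetical) set of words that a
    detector reported for one fixed-duration time window.
    """
    seen: Set[str] = set()
    for detected in windows:
        hits = detected & trigger_words
        if not hits:
            continue  # no trigger word detected in this window
        seen |= hits
        # A trigger word was detected: check whether each other one was too.
        if seen == trigger_words:
            return True  # hotphrase identified; a caller would trigger ASR here
    return False
```

Under these assumptions, `hotphrase_identified([{"play"}, set(), {"music"}], {"play", "music"})` evaluates to `True`, and a stream that never contains "music" evaluates to `False`.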
Regarding claim 2 Georges teaches the method of claim 1, wherein: the user device is in a low-power state when the user spoke the utterance (detects speaker intentions in voice queries for applications with low power constraint, see par. [0030]); 
and the utterance spoken by the user does not include a predetermined hotword that is configured to trigger the user device to wake up from the low-power state (wake-phrase detected, see par. [0029]).
Regarding claim 3 Georges teaches the method of claim 1, wherein determining whether any of the trigger words in the set of trigger words are detected in the audio data comprises, for each trigger word in the set of trigger words: 
generating, using the hotphrase detector, a respective trigger word confidence score indicating a likelihood that the corresponding trigger word is present in the audio data during the corresponding fixed-duration time window (keyword model 55 includes sub-phonetic units and a non-keyword model to reject non-keywords, see par. [0047]);
detecting the corresponding trigger word in the audio data during the corresponding fixed-duration time window when the respective trigger word confidence score satisfies a trigger word confidence threshold (final score of a keyword model is calculated, a phrase is reported when the score at the final HMM stage of each sequence exceeds a threshold, see par. [0048]); 
and buffering, in memory hardware in communication with the data processing hardware, the audio data and a respective trigger event for the corresponding trigger word detected in the audio data, the respective detection event indicating the respective trigger word confidence score and a respective timestamp indicating when the corresponding trigger word was detected in the audio data (store an electronic representation of an audio stream, detect a phrase in the audio stream based on a pre-defined vocabulary, associate a time stamp with the detected phrase, and classify a spoken intent based on a sequence of detected phrases and the respective associated time stamps. For example, the logic 13 may be further configured to monitor a continuous audio stream, detect the phrase in the continuous audio stream, and compute a quantized time stamp for the detected phrase which is relative to previously detected phrase, see par. [0020]).
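The three steps of claim 3 (scoring, thresholding, and buffering a trigger event with its score and timestamp) can be pictured with the sketch below. It is an illustrative assumption, not the disclosed implementation of either the application or Georges; `TriggerEvent`, `buffer_detections`, and the per-word score dictionary are hypothetical names.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict


@dataclass(frozen=True)
class TriggerEvent:
    word: str         # trigger word that was detected
    score: float      # trigger word confidence score
    timestamp: float  # time (seconds) at which the word was detected


def buffer_detections(scores: Dict[str, float], timestamp: float,
                      threshold: float, buffer: Deque[TriggerEvent]) -> None:
    """Append an event for every word whose score satisfies the threshold."""
    for word, score in scores.items():
        if score >= threshold:
            buffer.append(TriggerEvent(word, score, timestamp))
```

A bounded buffer such as `deque(maxlen=16)` could hold the events; the bound is likewise an assumption chosen for the example.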
Regarding claim 4 Georges teaches the method of claim 3, further comprising, when one of the trigger words in the set of trigger words associated with the hotphrase is detected in the audio data during the corresponding fixed-duration time window, executing, by the data processing hardware, a trigger word aggregation routine configured to: 
determine whether a respective trigger event for each other corresponding trigger word in the set of trigger words is also buffered in the memory hardware (phrase detection module 36 may incorporate or access a defined in-domain vocabulary (e.g., a pre-defined vocabulary for a particular application or set of applications), see par. [0031]);
and when the respective trigger event for each other corresponding trigger word in the set of trigger words is also buffered in the memory hardware, determine a hotphrase confidence score indicating a likelihood that the utterance spoken by the user includes the hotphrase, wherein identifying the hotphrase comprises identifying the hotphrase when the hotphrase confidence score satisfies a hotphrase confidence threshold (The intent with the highest probability is passed to the application 43 e.g., to turn on/off lights, to control a music application, etc., see par. [0036]).
Regarding claim 5 Georges teaches the method of claim 4, wherein the trigger phrase aggregation routine is configured to determine the hotphrase confidence score based on the respective trigger word confidence score and the respective time stamp indicated by the respective detection event buffered in the memory hardware for each corresponding trigger word in the set of trigger words (The intent classifier apparatus 70 takes a sequence of tuples and returns the probability of each intent. A tuple comprises the spotted phrase and a quantized time stamp. A feature front end may be as described above. Some embodiments of the intent classifier apparatus, see par. [0052]).
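As a hedged illustration of the aggregation recited in claims 4 and 5, the sketch below derives a hotphrase confidence score from buffered per-word scores and timestamps. The combination rule (mean score, zeroed when the words span more than `max_span_s` seconds) is purely an assumption for the example; neither the claims nor the cited paragraphs of Georges specify it, and all names are hypothetical.

```python
from typing import Dict, Optional, Sequence, Tuple


def hotphrase_confidence(events: Dict[str, Tuple[float, float]],
                         trigger_words: Sequence[str],
                         max_span_s: float) -> Optional[float]:
    """Combine buffered (score, timestamp) pairs into one confidence score.

    Returns None when some trigger word has no buffered event.
    """
    if any(word not in events for word in trigger_words):
        return None  # not every trigger event is buffered yet
    scores = [events[word][0] for word in trigger_words]
    times = [events[word][1] for word in trigger_words]
    if max(times) - min(times) > max_span_s:
        return 0.0  # assumed rule: words too far apart in time
    return sum(scores) / len(scores)  # assumed rule: mean of word scores
```

The hotphrase would then be identified when the returned value satisfies a hotphrase confidence threshold.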
Regarding claim 6 Georges teaches the method of claim 3, wherein executing the trigger word aggregation routine comprises executing a neural network-based model (In some embodiments, the logic 13 may include a first neural network with an acoustic model and a hidden Markov model (HMM) to detect the phrase in the audio stream, see par. [0020]).
Regarding claim 8 Georges teaches the method of claim 1, further comprising, when each other trigger word in the set of trigger words was also detected in the audio data: 
determining, by the data processing hardware, whether a sequence of the set of trigger words detected in the audio data matches a predefined sequential order associated with the hotphrase, wherein identifying the hotphrase in the audio data corresponding to the utterance comprises identifying the hotphrase when the sequence of the set of trigger words detected in the audio data matches the predefined sequential order associated with the hotphrase (phrase in the audio stream based on a pre-defined vocabulary, associate a time stamp with the detected phrase, and classify a spoken intent based on a sequence of detected phrases and the respective associated time stamps, see par. [0020]).
Regarding claim 9 Georges teaches the method of claim 8, further comprising: determining, by the data processing hardware, a respective time period between each pair of adjacent trigger words in the set of trigger words that were detected in the audio data, wherein identifying the hotphrase in the audio data corresponding to the utterance is based on the respective time period between each pair of adjacent trigger words (monitor a continuous audio stream, detect the phrase in the continuous audio stream, and compute a quantized time stamp for the detected phrase which is relative to previously detected phrase, see par. [0020]; the phrase detection module 36 detects the in-domain vocabulary in the continuous audio stream, determines the corresponding phrase text and also determine a relative, quantized time stamps to previously spotted phrases of the continuous audio stream. The sequence of detected phrases and time stamps are features for the intent classification module 37, see par. [0032]).
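The order check of claim 8 and the adjacent-word time periods of claim 9 can be illustrated together. The sketch assumes each detection is a `(word, timestamp)` pair and that a single maximum gap `max_gap_s` is applied to every adjacent pair; both are assumptions for the example rather than features drawn from the application or from Georges.

```python
from typing import Sequence, Tuple


def matches_order_and_timing(events: Sequence[Tuple[str, float]],
                             expected: Sequence[str],
                             max_gap_s: float) -> bool:
    """Check the detected sequence against a predefined order and gap limit."""
    ordered = sorted(events, key=lambda event: event[1])  # sort by timestamp
    if [word for word, _ in ordered] != list(expected):
        return False  # sequence does not match the predefined order
    # Time period between each pair of adjacent trigger words.
    gaps = [later[1] - earlier[1] for earlier, later in zip(ordered, ordered[1:])]
    return all(gap <= max_gap_s for gap in gaps)
```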
Regarding claim 10 Georges teaches the method of claim 1, wherein triggering the ASR to perform speech recognition on the audio data comprises:
generating a transcription of the utterance by processing the audio data (where sufficient power and/or network bandwidth is available, some embodiments may determine if a detected intent is outside the scope of the in-domain vocabulary and provide the speech signal and/or speech text data to a full ASR system for further processing, see par. [0031]);
determining whether each trigger word in the set of trigger words associated with the hotphrase is recognized in the transcription of the utterance (the phrase detection module 36 may monitor a continuous audio stream, and output an enable signal, phrase text, and time stamp information to the intent classification module, see par. [0032]); 
and when each trigger word in the set of trigger words associated with the hotphrase is recognized in the transcription, performing query interpretation on the transcription to identify that the transcription includes the command for the digital assistant to perform the operation (The intent classification module 37 receives the signals from the phrase detection module 36 and outputs an intent class decision. For example, the intent class decision may include a multi-bit binary digital signal where a value of zero indicates ‘DO NOTHING’ while non-zero indicates “SEND TO APPLICATION” e.g., corresponding to a recognized intent which may be acted on by an application without further processing by a full ASR system, see par. [0033]).
Regarding claim 13 Georges teaches the method of claim 1, wherein the hotphrase detector comprises a trigger word detection model trained to detect each trigger word in the set of trigger words associated with the hotphrase (multi-keyword model, see figure 5 and par. [0047]).
Regarding claim 14 Georges teaches a system (spoken intent detection device, see abstract) comprising: 
data processing hardware; 
and memory hardware in communication with the data processing hardware, the memory hardware storing instructions that when executed on the data processing hardware cause the data processing hardware to perform operations comprising: 
receiving audio data corresponding to an utterance spoken by a user and captured by a user device associated with the user (detect a phrase in an electronic representation of an audio stream based on a pre-defined vocabulary, see abstract), the utterance comprising a command for a digital assistant to perform an operation (personal digital assistant, see par. [0057]);
during each of a plurality of fixed-duration time (a time stamp with the detected phrase, see abstract) windows of the audio data: 
determining, using a hotphrase detector configured to detect each trigger word in a set of trigger words associated with a hotphrase (automatic speech recognition (ASR) systems may include a wake-phrase detector, see par. [0029]), whether any of the trigger words in the set of trigger words are detected in the audio data during the corresponding fixed-duration time window (the wake-phrase detector listens for a wake-up phrase. The speech recognizer starts recognizing the speech signal when the wake-phrase is detected. The recognized word hypotheses are then analyzed by the NLU module, see par. [0029]);
when one of the trigger words in the set of trigger words associated with the hotphrase is detected in the audio data during the corresponding fixed-duration time window, determining, by the data processing hardware, whether each other trigger word in the set of trigger words associated with the hotphrase was also detected in the audio data (detect a set of keywords in a sequence with suitable timing and make an intent decision from the detected keywords/sequence by using enhanced features, see par. [0042]); 
and when each other trigger word in the set of trigger words was also detected in the audio data, identifying, by the data processing hardware, in the audio data corresponding to the utterance, the hotphrase (the phrase detection module 36 detects the in-domain vocabulary in the continuous audio stream, determines the corresponding phrase text and also determine a relative, quantized time stamps to previously spotted phrases of the continuous audio stream, see par. [0032]);
and triggering an automated speech recognizer (ASR) to perform speech recognition on the audio data when the hotphrase is identified in the audio data corresponding to the utterance (some embodiments may determine if a detected intent is outside the scope of the in-domain vocabulary and provide the speech signal and/or speech text data to a full ASR system for further processing, see par. [0031]).
Regarding claim 15 Georges teaches the system of claim 14, wherein: the user device is in a low-power state when the user spoke the utterance (detects speaker intentions in voice queries for applications with low power constraint, see par. [0030]); 
and the utterance spoken by the user does not include a predetermined hotword that is configured to trigger the user device to wake up from the low-power state (wake-phrase detected, see par. [0029]).
Regarding claim 16 Georges teaches the system of claim 14, wherein determining whether any of the trigger words in the set of trigger words are detected in the audio data comprises, for each trigger word in the set of trigger words: 
generating, using the hotphrase detector, a respective trigger word confidence score indicating a likelihood that the corresponding trigger word is present in the audio data during the corresponding fixed-duration time window (keyword model 55 includes sub-phonetic units and a non-keyword model to reject non-keywords, see par. [0047]);
detecting the corresponding trigger word in the audio data during the corresponding fixed-duration time window when the respective trigger word confidence score satisfies a trigger word confidence threshold (final score of a keyword model is calculated, a phrase is reported when the score at the final HMM stage of each sequence exceeds a threshold, see par. [0048]); 
and buffering, in memory hardware in communication with the data processing hardware, the audio data and a respective trigger event for the corresponding trigger word detected in the audio data, the respective detection event indicating the respective trigger word confidence score and a respective timestamp indicating when the corresponding trigger word was detected in the audio data (store an electronic representation of an audio stream, detect a phrase in the audio stream based on a pre-defined vocabulary, associate a time stamp with the detected phrase, and classify a spoken intent based on a sequence of detected phrases and the respective associated time stamps. For example, the logic 13 may be further configured to monitor a continuous audio stream, detect the phrase in the continuous audio stream, and compute a quantized time stamp for the detected phrase which is relative to previously detected phrase, see par. [0020]).
Regarding claim 17 Georges teaches the system of claim 16, wherein the operations further comprise, when one of the trigger words in the set of trigger words associated with the hotphrase is detected in the audio data during the corresponding fixed-duration time window, executing a trigger word aggregation routine configured to: 
determine whether a respective trigger event for each other corresponding trigger word in the set of trigger words is also buffered in the memory hardware (phrase detection module 36 may incorporate or access a defined in-domain vocabulary (e.g., a pre-defined vocabulary for a particular application or set of applications), see par. [0031]);
and when the respective trigger event for each other corresponding trigger word in the set of trigger words is also buffered in the memory hardware, determine a hotphrase confidence score indicating a likelihood that the utterance spoken by the user includes the hotphrase, wherein identifying the hotphrase comprises identifying the hotphrase when the hotphrase confidence score satisfies a hotphrase confidence threshold (The intent with the highest probability is passed to the application 43 e.g., to turn on/off lights, to control a music application, etc., see par. [0036]).
Regarding claim 18 Georges teaches the system of claim 17, wherein the trigger phrase aggregation routine is configured to determine the hotphrase confidence score based on the respective trigger word confidence score and the respective time stamp indicated by the respective detection event buffered in the memory hardware for each corresponding trigger word in the set of trigger words (The intent classifier apparatus 70 takes a sequence of tuples and returns the probability of each intent. A tuple comprises the spotted phrase and a quantized time stamp. A feature front end may be as described above. Some embodiments of the intent classifier apparatus, see par. [0052]).
Regarding claim 19 Georges teaches the system of claim 17, wherein executing the trigger word aggregation routine comprises executing a neural network-based model (In some embodiments, the logic 13 may include a first neural network with an acoustic model and a hidden Markov model (HMM) to detect the phrase in the audio stream, see par. [0020]).
Regarding claim 21 Georges teaches the system of claim 14, wherein the operations further comprise, when each other trigger word in the set of trigger words was also detected in the audio data: 
determining whether a sequence of the set of trigger words detected in the audio data matches a predefined sequential order associated with the hotphrase, wherein identifying the hotphrase in the audio data corresponding to the utterance comprises identifying the hotphrase when the sequence of the set of trigger words detected in the audio data matches the predefined sequential order associated with the hotphrase (phrase in the audio stream based on a pre-defined vocabulary, associate a time stamp with the detected phrase, and classify a spoken intent based on a sequence of detected phrases and the respective associated time stamps, see par. [0020]).
Regarding claim 22 Georges teaches the system of claim 21, wherein the operations further comprise: determining a respective time period between each pair of adjacent trigger words in the set of trigger words that were detected in the audio data, wherein identifying the hotphrase in the audio data corresponding to the utterance is based on the respective time period between each pair of adjacent trigger words (monitor a continuous audio stream, detect the phrase in the continuous audio stream, and compute a quantized time stamp for the detected phrase which is relative to previously detected phrase, see par. [0020]; the phrase detection module 36 detects the in-domain vocabulary in the continuous audio stream, determines the corresponding phrase text and also determine a relative, quantized time stamps to previously spotted phrases of the continuous audio stream. The sequence of detected phrases and time stamps are features for the intent classification module 37, see par. [0032]).
Regarding claim 23 Georges teaches the system of claim 14, wherein triggering the ASR to perform speech recognition on the audio data comprises: generating a transcription of the utterance by processing the audio data (phrase text, and time stamp information to the intent classification module 37. For example, the phrase detection module 36 detects the in-domain vocabulary in the continuous audio stream, determines the corresponding phrase text, see par. [0032]); 
determining whether each trigger word in the set of trigger words associated with the hotphrase is recognized in the transcription of the utterance (the phrase detection module 36 may monitor a continuous audio stream, and output an enable signal, phrase text, and time stamp information to the intent classification module, see par. [0032]); 
and when each trigger word in the set of trigger words associated with the hotphrase is recognized in the transcription, performing query interpretation on the transcription to identify that the transcription includes the command for the digital assistant to perform the operation (The intent classification module 37 receives the signals from the phrase detection module 36 and outputs an intent class decision. For example, the intent class decision may include a multi-bit binary digital signal where a value of zero indicates ‘DO NOTHING’ while non-zero indicates “SEND TO APPLICATION” e.g., corresponding to a recognized intent which may be acted on by an application without further processing by a full ASR system, see par. [0033]).

Regarding claim 26 Georges teaches the system of claim 14, wherein the hotphrase detector comprises a trigger word detection model trained to detect each trigger word in the set of trigger words associated with the hotphrase (multi-keyword model, see figure 5 and par. [0047]).

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 7 and 20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Georges U.S. PAP 2019/0266240 A1 in view of Bocklet U.S. PAP 2019/0043488 A1.
Regarding claim 7 Georges does not teach the method of claim 3, wherein executing the trigger word aggregation routine comprises executing a heuristic-based model.
In a similar field of endeavor Bocklet teaches a method which uses neural network keyphrase detection, see abstract. Keyphrase detection or hot word detection systems may be used to detect a word or phrase or the like, which may initiate an activity by a device. For example, the device may wake by transitioning from a low power or sleep mode to an active mode, and/or may wake a particular computer program such as a personal assistant (PA) application. In this case, the detection of a waking keyphrase may activate an automatic speech recognition application to understand a command incoming from a user. For example, a user may state “Alexa, what is the weather?” where the word “Alexa” is the waking keyphrase, see par. [0001]. Once the current multiple element state score vector 624 is generated, the elements on the vector may be used in a final decision as to whether or not a keyphrase is present in the audio input being analyzed. By one form, as mentioned, the element S.sub.0 representing the rejection model and one of the elements representing the keyphrase model, such as the last score S.sub.N may be used to determine a final score S.sub.Final. Alternatively, for the state scores forming the recurrent layer 650 and current multiple element state score vector 624 here, the decoder 600 may use an equation that compares the difference between the rejection score and last score to a threshold to obtain the final score S.sub.Final 628 as follows:
S.sub.Final = (S.sub.0*(−1) + S.sub.N*1) − thr  (2)
where thr is a threshold value which may be determined by heuristics, and by one form, to set the thresholds for specific needs or applications, see par. [0094].
It would have been obvious to one of ordinary skill in the art to combine the Georges invention with the teachings of Bocklet for the benefit of customizing the model thresholds for specific applications, see par. [0094].
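For illustration, equation (2) is read here as the difference between the last keyphrase score and the rejection score, less the threshold, consistent with the quoted description in par. [0094]. The sketch below is not Bocklet's decoder; the function name and the sample values are assumptions, and treating a positive result as indicating a keyphrase is a further assumption not stated in the quoted passage.

```python
def final_score(s0: float, s_n: float, thr: float) -> float:
    """Evaluate (S_0 * (-1) + S_N * 1) - thr for one state score vector.

    s0  : score of the rejection model element
    s_n : last score of the keyphrase model
    thr : threshold value, e.g. one chosen by heuristics
    """
    return (s0 * -1.0 + s_n * 1.0) - thr


# Assumed decision rule for the example: a positive final score signals a keyphrase.
keyphrase_present = final_score(s0=2.0, s_n=5.0, thr=1.0) > 0.0
```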

Regarding claim 20 Georges does not teach the system of claim 17, wherein executing the trigger word aggregation routine comprises executing a heuristic-based model.

In a similar field of endeavor Bocklet teaches a method which uses neural network keyphrase detection, see abstract. Keyphrase detection or hot word detection systems may be used to detect a word or phrase or the like, which may initiate an activity by a device. For example, the device may wake by transitioning from a low power or sleep mode to an active mode, and/or may wake a particular computer program such as a personal assistant (PA) application. In this case, the detection of a waking keyphrase may activate an automatic speech recognition application to understand a command incoming from a user. For example, a user may state “Alexa, what is the weather?” where the word “Alexa” is the waking keyphrase, see par. [0001]. Once the current multiple element state score vector 624 is generated, the elements on the vector may be used in a final decision as to whether or not a keyphrase is present in the audio input being analyzed. By one form, as mentioned, the element S.sub.0 representing the rejection model and one of the elements representing the keyphrase model, such as the last score S.sub.N may be used to determine a final score S.sub.Final. Alternatively, for the state scores forming the recurrent layer 650 and current multiple element state score vector 624 here, the decoder 600 may use an equation that compares the difference between the rejection score and last score to a threshold to obtain the final score S.sub.Final 628 as follows:
S.sub.Final = (S.sub.0*(−1) + S.sub.N*1) − thr  (2)
where thr is a threshold value which may be determined by heuristics, and by one form, to set the thresholds for specific needs or applications, see par. [0094].
It would have been obvious to one of ordinary skill in the art to combine the Georges invention with the teachings of Bocklet for the benefit of customizing the model thresholds for specific applications, see par. [0094].

Claim(s) 11, 12, 24 and 25 is/are rejected under 35 U.S.C. 103 as being unpatentable over Georges U.S. PAP 2019/0266240 A1 in view of Miyazaki U.S. Patent No. 7,487,091 B2.

Regarding claim 11 Georges does not teach the method of claim 10, wherein generating the transcription comprises:
rewinding the audio data buffered in memory hardware in communication with the data processing hardware to a time at or before the first trigger word in the set of trigger words was detected in the audio data; and processing the audio data commencing at the time at or before the first trigger word in the sequence of trigger words to generate the transcription of the utterance.
In the same field of endeavor Miyazaki teaches a speech recognition device which can preferably be used for reducing the memory capacity required for speaker-independent speech recognition, see abstract. The system teaches rewinding the audio data buffered in memory hardware in communication with the data processing hardware to a time at or before the first trigger word in the set of trigger words was detected in the audio data; and processing the audio data commencing at the time at or before the first trigger word in the sequence of trigger words to generate the transcription of the utterance (the speech recognition means specifies as a recognition speech model a speech model of which the occurrence probability is the highest from the first speech model group, loads speech models belonging to one of the second speech model group and the third speech model group, which has a link relationship with the recognition speech model in the speech model loading storage means, calculates time required until a change in the occurrence probability is propagated from the recognition speech model to the unspecified speech recognizing speech model, and rewinds to the position of read of the speech parameter in the speech parameter storage means by a number corresponding to the time required, see col. 7 lines 40-54).
It would have been obvious to one of ordinary skill in the art to combine the Georges invention with the teachings of Miyazaki for the benefit of reducing the memory capacity required for speaker-independent speech recognition, see abstract.
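The rewinding limitation of claim 11 can be pictured with the sketch below, offered only as an illustration. It is not Miyazaki's mechanism, which rewinds a read position in the speech parameter storage means by a computed propagation time; here the buffer is assumed to hold `(start_time, frame)` pairs, and `lead_in_s` is a hypothetical margin that places the restart point at or before the first trigger word.

```python
from typing import List, Sequence, Tuple


def rewind_frames(frames: Sequence[Tuple[float, bytes]],
                  first_trigger_time: float,
                  lead_in_s: float = 0.0) -> List[bytes]:
    """Return buffered frames starting at or before the first trigger word.

    frames             : buffered (start_time_seconds, audio_frame) pairs
    first_trigger_time : time at which the first trigger word was detected
    lead_in_s          : assumed extra margin to rewind before that time
    """
    start = first_trigger_time - lead_in_s
    return [frame for start_time, frame in frames if start_time >= start]
```

The returned frames would then be passed to the recognizer to generate the transcription.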

Regarding claim 12 Georges does not teach the method of claim 10, wherein the transcription comprises, between a first trigger word in the set of trigger words recognized in the transcription and a last trigger word in the set of trigger words recognized in the transcription, one or more other words not associated with the hotphrase.
In a similar field of endeavor Miyazaki teaches that a garbage model 350 is linked to the rear of the first speech model network 300. The garbage model 350 is modeled so that the occurrence probability increases when a speech parameter representing speech other than specified speech capable of being recognized by speech models belonging to the speech model group 304, the speech model group 306 and the speech model group 308 is given, and is linked to the speech model group 306 or speech model group 308. The garbage model 350 is a filler model for an unknown redundant word, and has a nature such that the occurrence probability (likelihood) increases if a word that does not exist in a sequence of speech models existing on any path in the speech model network is spoken. For example, the example shown in FIG. 4 shows a speech model group recognizing only the names of prefectures, and improves the rate of recognition of the names of prefectures against a redundant word such as "anoh" or "sonoh" before the name of the prefecture, see col. 14 lines 26-47.
It would have been obvious to one of ordinary skill in the art to combine the Georges invention with the teachings of Miyazaki for the benefit of recognizing words not present in the language models, see col. 14 lines 26-47.
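For illustration only, the role of a filler that absorbs words not associated with the hotphrase can be sketched as follows. This is a deliberately simplified, deterministic sketch: Miyazaki's garbage model is described in terms of occurrence probabilities, whereas the hypothetical function `match_with_filler` below simply matches exact tokens in order. The example tokens and trigger words are likewise hypothetical.

```python
def match_with_filler(tokens, trigger_words):
    """Match trigger_words in order, letting a filler absorb every other token.

    Returns (found, matched_positions, filler_tokens), where found is True only
    if every trigger word was seen in the given order.
    """
    matched_positions = []
    filler_tokens = []
    next_trigger = 0
    for position, token in enumerate(tokens):
        if next_trigger < len(trigger_words) and token == trigger_words[next_trigger]:
            # The token is the next expected trigger word.
            matched_positions.append(position)
            next_trigger += 1
        else:
            # Any other token is treated as a redundant (filler) word.
            filler_tokens.append(token)
    found = next_trigger == len(trigger_words)
    return found, matched_positions, filler_tokens


# Usage: a redundant word precedes the phrase and other words fall between triggers.
result = match_with_filler(["anoh", "turn", "the", "lights", "off"], ["turn", "off"])
```

Here the words falling between the first and last matched trigger words are collected as filler rather than causing the match to fail.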

Regarding claim 24, Georges does not teach the system of claim 23, wherein generating the transcription comprises: rewinding the audio data buffered in memory hardware in communication with the data processing hardware to a time at or before the first trigger word in the set of trigger words was detected in the audio data; and processing the audio data commencing at the time at or before the first trigger word in the sequence of trigger words to generate the transcription of the utterance.
In the same field of endeavor, Miyazaki teaches a speech recognition device that can preferably be used for reducing the memory capacity required for speaker-independent speech recognition, see abstract. Miyazaki teaches rewinding the audio data buffered in memory hardware in communication with the data processing hardware to a time at or before the first trigger word in the set of trigger words was detected in the audio data; and processing the audio data commencing at the time at or before the first trigger word in the sequence of trigger words to generate the transcription of the utterance (the speech recognition means specifies as a recognition speech model a speech model of which the occurrence probability is the highest from the first speech model group, loads speech models belonging to one of the second speech model group and the third speech model group, which has a link relationship with the recognition speech model in the speech model loading storage means, calculates time required until a change in the occurrence probability is propagated from the recognition speech model to the unspecified speech recognizing speech model, and rewinds to the position of read of the speech parameter in the speech parameter storage means by a number corresponding to the time required, see col. 7 lines 40-54).
It would have been obvious to one of ordinary skill in the art to combine the Georges invention with the teachings of Miyazaki for the benefit of reducing the memory capacity required for speaker-independent speech recognition, see abstract.

Regarding claim 25, Georges does not teach the system of claim 23, wherein the transcription comprises, between a first trigger word in the set of trigger words recognized in the transcription and a last trigger word in the set of trigger words recognized in the transcription, one or more other words not associated with the hotphrase.
In a similar field of endeavor, Miyazaki teaches that a garbage model 350 is linked to the rear of the first speech model network 300. The garbage model 350 is modeled so that the occurrence probability increases when a speech parameter representing speech other than specified speech capable of being recognized by speech models belonging to the speech model group 304, the speech model group 306 and the speech model group 308 is given, and is linked to the speech model group 306 or speech model group 308. The garbage model 350 is a filler model for an unknown redundant word, and has a nature such that the occurrence probability (likelihood) increases if a word that does not exist in a sequence of speech models existing on any path in the speech model network is spoken. For example, the example shown in FIG. 4 shows a speech model group recognizing only the names of prefectures, and improves the rate of recognition of the names of prefectures against a redundant word such as "anoh" or "sonoh" before the name of the prefecture, see col. 14 lines 26-47.
It would have been obvious to one of ordinary skill in the art to combine the Georges invention with the teachings of Miyazaki for the benefit of recognizing words not present in the language models, see col. 14 lines 26-47.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Pertinent prior art is available on form 892.
Zhou '447 teaches keyphrase detection in audio, which identifies keywords and, based on the score of each word, determines a probability score for the entire phrase, see abstract.
Marcinkiewicz '913 teaches a smart device with an AI system which responds to user speech requests, see par. [0178].
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Michael Ortiz-Sanchez whose telephone number is (571) 270-3711. The examiner can normally be reached Monday-Friday, 9 AM-6 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached on 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/MICHAEL ORTIZ-SANCHEZ/
Primary Examiner, Art Unit 2656