DETAILED ACTION
Applicant’s arguments filed on 11/24/2021 were received and fully considered. Claims 1, 4, 19, and 20 were amended. As such, claims 1-20 have been examined.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority
Acknowledgment is made of applicant’s claim for foreign priority under 35 U.S.C. 119 (a)-(d). The certified copy has been filed in parent Application No. PCT/US2019/034917, filed on 5/6/2019.
Drawings
The drawings filed on 12/11/2019 have been accepted and considered by the examiner.

Response to Amendment
Applicant’s arguments with respect to the prior art rejections raised in the previous Office action have been considered but are moot because the new ground of rejection does not rely on the combination of references applied in the previous Office action. Please see the claim rejections below for details, including updated citations and obviousness rationale.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1, 2, 5-13, 15, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Ganong (US20140274203A1) in view of Johnson et al. (US20150109191A1) (hereinafter "Johnson").

Ganong was applied in the previous Office Action.
	Regarding claim 1, Ganong teaches a method performed by an automated assistant application of a client device, the method performed using one or more processors of the client device, and the method comprising: (Ganong, Par. 0001:” Many mobile communications devices, such as smart phones, are equipped with a voice response system [e.g., a virtual assistant or agent] that can recognize speech and respond to voice commands to perform desired tasks [perform an Internet search, make a phone call, provide directions, answer questions, make recommendations, schedule appointments, etc.].”, and Par. 0123:” Exemplary system components of a mobile device may include a primary processor 115, a secondary processor 125 and an audio codec 105, all illustrated for convenience and clarity of illustration as being interconnected via a common bus 155.”).
	determining to activate on-device speech recognition, wherein determining to activate the on-device speech recognition is in response to determining satisfaction of one or more conditions, (Ganong, Par. 0057: “Other techniques may be used to assist in minimizing false positive and false negative rates while keeping power consumption relatively low, when performing act 220.... It should be appreciated that any one or combination of techniques described herein may be used to determine whether the acoustic input includes a voice command, as the aspects are not limited to using any particular technique or combination of techniques.”, and Par. 0058: “The voice response system may then continue to monitor the acoustic environment to obtain further acoustic input [e.g., the voice response system may return to or continue to perform act 210].”, and Par. 0130: “The at least one first processing [...] speech processing stages, provided the secondary processor has the processing power and/or functionality implemented to do so. For example, the secondary processor may be configured to perform limited vocabulary ASR on the acoustic input such as detecting an explicit voice trigger or keyword spotting.”).
	determining the satisfaction of the one or more conditions comprising determining the satisfaction based on processing of both: hot-word free audio data detected by one or more microphones of the client device, and additional sensor data that is based on output from at least one non-microphone sensor of the client device, (Ganong, Par. 0047: “a mobile device having a voice response system that evaluates received acoustic input to ascertain whether a user has spoken a voice command, without requiring an explicit trigger [hot-word free audio]... For example, one or more microphones may sense acoustic activity in the environment and obtain the resulting acoustic input for further processing to assess whether the acoustic input includes a voice command.”, and Par. 0026: “... conventional voice response systems require one or more explicit triggers to engage the voice response system. An "explicit trigger" refers herein to one or more specific, designated and predetermined actions required to engage a voice response system, and includes manual triggers [i.e., actions performed on the mobile device via a user's hands] and explicit voice triggers [i.e., speaking a specific, designated word or phrase to engage the voice response system].”, and Par. 0051: “In act 220, the acoustic input is processed to determine whether the acoustic input includes a voice command, without requiring an explicit trigger to do so. That is, the user is not required to manually engage the voice response system [e.g., by performing one or more manual triggers such as manipulating one or more interface controls by hand], nor is the user required to speak an explicit voice trigger to notify the voice response system that the user is uttering or will imminently utter an actionable voice command [though in some embodiments, the user can optionally use an explicit voice trigger if the user so desires, while not requiring the user to do so].”, and Par. 0053: “According to some embodiments, act 220 may include performing one or more voice activity detection [VAD] processing stages that evaluate whether the acoustic input has the characteristics of voice/speech or whether the acoustic input is more likely the result of non-voice acoustic activity in the environment.”, and Par. 0078: “Any of a variety of mobile device components capable of providing one or more contextual cues may also be activated as part of a staged or incremental wake-up when the mobile device is operating in a low power mode including, but not limited to, a GPS system, an accelerometer, or a clock to provide location information, motion information and time of day, respectively.”, and Par. 0105: “Many mobile devices are equipped with one or more components that can detect motion of the mobile device, typically by sensing acceleration [e.g., using a gyroscope or other component that responds to acceleration forces].”).
	generating, using the on-device speech recognition, recognized text from a spoken utterance captured by the audio data and/or captured by additional hot-word free audio data detected by one or more of the microphones following the audio data, (Ganong, Par. 0174:” ASR component 930 may be configured to process received audio input [e.g., audio input representing the acoustic input] to form a textual representation of the audio input [e.g., a textual representation of the constituent words in the acoustic input that can be further processed to understand the meaning of the constituent words]. Such processing to produce a textual representation may be performed in any suitable way. In some embodiments, ASR convert speech to a representation other than a textual representation, or the speech may not be recognized as words, but instead a sequence or collection of abstract concepts.”, and Par. 0041:”Mobile device 100 includes one or more transducers 130 for converting acoustic energy to electrical energy and vice versa. For example, transducers 130 may include one or more speakers and/or one or more microphones arranged on the mobile device to allow input/output [I/O] of acoustic information.”).
	generating the recognized text comprising performing the on-device speech recognition on the audio data and/or the additional audio data; (Ganong, Par. 0145: “In the example illustrated in FIGS. 7A and 7B, VAD processing stage[s] 710 determine that acoustic input 705 includes voice content and the voice response system further evaluates acoustic input 705 using one or more speech processing stages 720 to determine whether the acoustic input includes a voice command. As discussed above, speech processing stages may include ASR, classification [e.g., using one or more statistical classifiers], NLP, etc. For example, according to some embodiments, acoustic input 705 may undergo limited vocabulary ASR to perform keyword spotting, any technique for which may be used to identify whether acoustic input 705 contains any words deemed suggestive of a voice command and/or to identify words needed to perform classification. Other ASR techniques may be utilized depending on the processing strategy being used to recognize one or more words in speech contained in the acoustic input.”, and Par. 0174: “ASR component 930 may be configured to process received audio input [e.g., audio input representing the acoustic input] to form a textual representation of the audio input [e.g., a textual representation of the constituent words in the acoustic input that can be further processed to understand the meaning of the constituent words]. Such processing to produce a textual representation may be performed in any suitable way.”).
	determining, based on the recognized text, whether to activate on-device natural language understanding of the recognized text and/or to activate on-device fulfillment that is based on the on-device natural language understanding; (Ganong, Par. 0059:” Initiating one or more further processes when acoustic input is determined to include a voice command may include, but is not limited to, engaging one or more language processing stages to understand the meaning of the voice command, initiating one or more tasks needed to carry out the voice command such as initiating a search, launching one or more applications or processes to, for example, initiate a search, schedule an appointment, update a calendar, create an alert, alarm or other electronic reminder, generate a text or email, make a telephone call, access a website, etc.,”).
	when it is determined to activate the on-device natural language understanding and/or to activate the on-device fulfillment: performing the on-device natural language understanding and/or initiating, on-device, the fulfillment; (Ganong, Par. 0148:” After concluding that a voice command is present, voice response system 750 may perform NLP stages 730 to evaluate the semantic content of the acoustic input to understand what the user intended the voice response system to do when speaking the voice command. In this respect, the acoustic input may be fully recognized to the extent that previous ASR stages were unable to [or not configured to] fully recognize the speech contained in the acoustic input prior to being ASR may be performed as part of the NLP processing]. In the example shown in FIG. 7B, NLP stage[s] 730 may ascertain that the user would like to view his/her calendar to check what appointments are scheduled for tomorrow. As a result, voice response system 750 may check to see what tomorrow's date is and launch a calendar application [see process 740] and pass to the calendar application any appropriate parameters 785, such as tomorrow's date so that the calendar can display the day that the user is interested in and/or list appointments on the calendar scheduled on the indicated date.”).
	when it is determined to not activate the on-device natural language understanding and/or to not activate the on-device fulfillment: deactivating the on-device speech recognition. (Ganong, Par. 0063:” A processing stage that determines that the acoustic input likely corresponds to spurious acoustic activity may terminate further processing of the acoustic input to avoid consuming additional power.”).
	Ganong does not teach wherein the one or more conditions comprise: a determination, based on a probability generated using the hot-word free audio data, that the hot-word free audio data includes an utterance that is directed to the client device as opposed to including an utterance that is not directed to the client device, and/or a detection of a user's gaze being directed at the client device, the detection being based on sensor frames from one or more vision sensors from among the at least one non-microphone sensor.
	Johnson teaches wherein the one or more conditions comprise: a determination, based on a probability generated using the hot-word free audio data, that the hot-word free audio data includes an utterance that is directed to the client device as opposed to including an utterance that is not directed to the client device, and/or a detection of a user's gaze being directed at the client device, the detection being based on sensor frames from one or more vision sensors from among the at least one non-microphone sensor. (Johnson: “... gazing at an electromagnetic emissions sensor [EES] or a camera can toggle activation of the voice interface. For example, suppose a deactivated speech recognition system is equipped with a camera for detecting gazes. Then, in response to a first gaze at the camera, the speech recognition system can detect the first gaze as being in a voice-activation gaze direction and activate the speech recognition system. Later, in response to a second gaze at the camera, the speech recognition system can detect the second gaze as being in a voice-activation gaze direction and deactivate the speech recognition system. Subsequent gazes detected in voice-activation gaze directions can continue toggling an activation state [e.g., activated or deactivated] of the speech recognition system.”).
	Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Ganong in view of Johnson to detect a user's gaze being directed at the client device, the detection being based on sensor frames from one or more vision sensors from among the at least one non-microphone sensor, in order to provide an indication of respective active interfaces, as evidenced by Johnson (see Par. 0088).
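
For illustration only, the gaze-toggling behavior described in the Johnson passage quoted above can be sketched as follows. This is a minimal sketch under stated assumptions: the class name `GazeToggle` and the method `on_gaze` are hypothetical and appear in neither reference, and the detection of whether a gaze is in a voice-activation gaze direction is assumed to be performed elsewhere and passed in as a boolean.

```python
class GazeToggle:
    """Hypothetical model of a speech recognizer whose activation state is toggled by gaze."""

    def __init__(self) -> None:
        # The recognizer is assumed to start deactivated, as in the quoted example.
        self.active = False

    def on_gaze(self, in_activation_direction: bool) -> bool:
        """Flip the activation state for each gaze in a voice-activation direction.

        Gazes in any other direction leave the state unchanged.
        Returns the activation state after handling the gaze.
        """
        if in_activation_direction:
            self.active = not self.active
        return self.active
```

Under these assumptions, a first qualifying gaze activates the recognizer, a second deactivates it, and subsequent qualifying gazes continue to toggle the state.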

	Regarding claim 2, Ganong teaches the method of claim 1, wherein the at least one non-microphone sensor on which the additional sensor data is based comprises an accelerometer, a magnetometer, and/or a gyroscope. (Ganong, Par. 0078: “Any of a variety of mobile device components capable of providing one or more contextual cues may also be activated as part of a staged or incremental wake-up when the mobile device is operating in a low power mode including, but not limited to, a GPS system, an accelerometer, or a clock to provide location information, motion information and time of day, respectively.”, and Par. 0105: “Many mobile devices are equipped with one or more components that can detect motion of the mobile device, typically by sensing acceleration [e.g., using a gyroscope or other component that responds to acceleration forces].”).

	Regarding claim 5, Ganong teaches the method of claim 1, wherein determining the satisfaction of the one or more conditions based on processing the hot-word free audio data comprises: processing the hot-word free audio data using a voice activity detector to detect the presence of human speech; and (Ganong, Par. 0053:” According to some embodiments, act 220 may include performing one or more voice activity detection [VAD] processing stages that evaluate whether the acoustic input has the characteristics of voice/speech or whether the acoustic input is more likely the result of non-voice acoustic activity in the environment.”).
	determining the satisfaction of the one or more conditions based in part on detecting the presence of human speech. (Ganong, Par. 0053:” The result of performing one or more VAD processing stages may include assessing a likelihood that the acoustic input includes voice content, which assessment may be used to determine whether the acoustic input can be ignored as spurious acoustic activity, or whether the acoustic input should be further processed to determine the content of the acoustic input [e.g., determine the content and/or understand the content of speech].”).

	Regarding claim 6, Ganong teaches the method of claim 1, wherein determining the satisfaction of the one or more conditions based on processing the hot-word free audio data comprises: processing the hot-word free audio data using text-independent speaker identification model to generate a voice embedding; (Ganong, Par. 0118: “Voice has been used as a biometric signature to facilitate verifying or authenticating the identity of a speaker electronically. Techniques for performing such speaker recognition often utilize a stored "voice print" of the speaker which can be compared to a received audio signal to assess whether the characteristics of the audio signal match those captured by the voice print.”).
	comparing the voice embedding to a recognized voice embedding stored locally on the client device; (Ganong, Par. 0118:” A voice print is typically comprised of one or more characteristics that have a facility for distinguishing one speaker from another. When acoustic input is received, one or more characteristics may be extracted and compared to the voice print to assess whether it is believed the acoustic input came from the same speaker from which the voice print was obtained.”).
	and determining the satisfaction of the one or more conditions based in part on the comparing. (Ganong, Par. 0119:” Speaker recognition techniques may be used as part of the process of determining whether acoustic input includes an actionable voice command. According to some embodiments, the voice response system may be configured to respond only to voice commands spoken by a particular user of the mobile device [e.g., the owner]. As such, when acoustic input has been determined to likely contain speech [e.g., using one or more VAD techniques], the acoustic input may undergo speaker recognition to determination whether the speech came from the user or from one or more other speakers. The system may be configured to disregard the acoustic input if it is determined that it did not originate from the specific user, whether it includes a voice command or not.”).
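
For illustration only, the comparing step recited in claim 6 can be sketched as follows. This is a minimal sketch under stated assumptions: cosine similarity and the threshold of 0.8 are choices made here for the example, since neither the claim nor the cited Ganong paragraphs specify a similarity measure or threshold, and the function names are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector carries no direction; treat it as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


def speaker_condition_satisfied(
    voice_embedding: Sequence[float],
    stored_embedding: Sequence[float],
    threshold: float = 0.8,  # assumed value, for illustration only
) -> bool:
    """Compare a new voice embedding with a locally stored one."""
    return cosine_similarity(voice_embedding, stored_embedding) >= threshold
```

In this sketch the condition is satisfied only when the embedding generated from the incoming audio is sufficiently close to the locally stored embedding, which parallels the voice print comparison described in Ganong, Par. 0118 and 0119.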

	Regarding claim 7, Ganong teaches the method of claim 1, wherein determining, based on the recognized text, whether to activate on-device natural language understanding and/or to activate the on-device fulfillment comprises: determining whether the text matches one or more phrases stored in a locally stored assistant language model, the locally stored assistant language model including a plurality of phrases that are each interpretable by an automated assistant. (Ganong, Par. 0087:” Limited vocabulary ASR may also be used in contexts other than detecting an explicit voice trigger, alternatively or in addition to explicit voice trigger detection. For example, limited vocabulary ASR may be performed using a restricted vocabulary having a desired number of key words that are frequently uttered people when speaking a voice command. For example, terms such as "what," "where," "how," etc., may be frequently used when speaking a voice query. Action words such as "search," "schedule," "locate," "call," "contact," "remind," etc., may also be common words uttered when speaking a voice command. It should be appreciated that any word deemed suggestive of a voice command may be included the limited vocabulary to facilitate relatively fast, relatively low power ASR to obtain information about whether acoustic input includes a voice command.”).
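
For illustration only, the limited-vocabulary check described in Ganong, Par. 0087 can be sketched as follows. This is a minimal sketch under stated assumptions: the vocabulary contains only the example words quoted above, the function name is hypothetical, and matching is done per word rather than per full phrase, which simplifies the phrase matching recited in claim 7.

```python
import re

# Example words taken from the quoted passage of Ganong, Par. 0087.
LIMITED_VOCABULARY = frozenset(
    {"what", "where", "how", "search", "schedule",
     "locate", "call", "contact", "remind"}
)


def suggests_voice_command(recognized_text: str,
                           vocabulary: frozenset = LIMITED_VOCABULARY) -> bool:
    """Return True if any word of the recognized text is in the limited vocabulary."""
    words = re.findall(r"[a-z']+", recognized_text.lower())
    return any(word in vocabulary for word in words)
```

A positive result here would correspond to deciding that further processing (natural language understanding and/or fulfillment) is warranted; a negative result would correspond to treating the recognized text as not directed to the assistant.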

	Regarding claim 8, Ganong teaches the method of claim 1, wherein determining, based on the recognized text, whether to activate on-device natural language understanding and/or [...] (Ganong: “... perform explicit voice trigger detection. For example, an exemplary speech processing stage may include performing ASR using a vocabulary restricted to the words in the explicit voice trigger phrase [which may include as few as a single word]. For example, for the explicit voice trigger "Hello, Dragon," the vocabulary may be restricted to the two words "Hello" and "Dragon." By limiting the vocabulary to the words permitted in an explicit voice trigger, ASR may be performed using little processing to assess whether the acoustic input includes a voice command [e.g., whether the acoustic input includes the explicit voice trigger].”).

	Regarding claim 9, Ganong teaches the method of claim 1, wherein determining based on the recognized text, whether to activate on-device natural language understanding and/or to activate the on-device fulfillment comprises: determining one or more related action phrases based on the one or more related action phrases each having a defined correspondence to a recent action performed, at the client device responsive to prior user input; (Ganong, Par. 0150:” ... some mobile devices are capable of rendering music while in a low power mode. Voice commands such as "next track," "previous track," "repeat track," "pause music," "decrease volume," "increase volume," etc. may be performed without having to exit a low power mode. Thus, the acoustic input may be processed in a low power mode [e.g., where certain components are activated on an "as-needed" basis] to detect the voice command, and the voice command may be carried out without needing to further transition the mobile device into an active mode.”).
	[...] (Ganong: “... wherein the limited vocabulary is selected to include terms frequently associated with controlling a music player such as one or any combination of "track," "volume," "resume," "pause," "repeat," "skip," "shuffle," etc., or any other word or term deemed suggestive of a voice command to control the music player.”).

	Regarding claim 10, Ganong teaches the method of claim 1, wherein determining, based on the recognized text, whether to activate on-device natural language understanding and/or to activate the on-device fulfillment comprises: determining whether at least part of the recognized text conforms to content being rendered at the client device during the spoken utterance. (Ganong, Par. 0112:” In response to an incoming telephone call, the user may want to handle the interaction via voice with instructions such as "Answer call," "Send to voicemail," "Mute phone," etc. A user may want to respond via voice when a text is received by commanding the mobile device to "Respond to latest text," or may want to respond to an alert that a voicemail was just received by speaking the command "Listen to voicemail.").

	Regarding claim 11, Ganong teaches the method of claim 10, wherein the content being rendered at the client device comprises a graphically rendered suggested automated assistant action. (Ganong, Par. 0112: “After a calendar alert has activated, the user may be inclined to take some sort of action such as "Show me my Calendar," or in response to a reminder to call [...] "Call John Doe," to initiate a telephone call. In response to an incoming telephone call, the user may want to handle the interaction via voice with instructions such as "Answer call," "Send to voicemail," "Mute phone," etc. A user may want to respond via voice when a text is received by commanding the mobile device to "Respond to latest text," or may want to respond to an alert that a voicemail was just received by speaking the command "Listen to voicemail."”).

	Regarding claim 12, Ganong teaches the method of claim 1, wherein determining, based on the recognized text, whether to activate on-device natural language understanding and/or to activate the on-device fulfillment comprises: determining, on-device, the fulfillment, and further comprising: executing the fulfillment on-device. (Ganong, Par. 0059: “Initiating one or more further processes when acoustic input is determined to include a voice command may include, but is not limited to, engaging one or more language processing stages to understand the meaning of the voice command, initiating one or more tasks needed to carry out the voice command such as initiating a search, launching one or more applications or processes to, for example, initiate a search, schedule an appointment, update a calendar, create an alert, alarm or other electronic reminder, generate a text or email, make a telephone call, access a website, etc., responding to the user with a request for more information regarding the voice command or to confirm an understanding of the voice command, and/or initiating or performing any other task that the voice response system is capable of initiating, engaging and/or performing, either locally on the mobile device and/or remotely via one or more networks that the mobile device is capable of connecting to and interacting with. Initiating further processing may include evaluating or modifying the evaluation of subsequently received acoustic input, for example, when the detected voice command includes an explicit voice trigger.”).

	Regarding claim 13, Ganong teaches the method of claim 12, wherein executing the fulfillment on-device comprises providing a command to a separate application on the client device. (Ganong, Par. 0059:” Initiating one or more further processes when acoustic input is determined to include a voice command may include, but is not limited to, engaging one or more language processing stages to understand the meaning of the voice command, initiating one or more tasks needed to carry out the voice command such as initiating a search, launching one or more applications or processes to, for example, initiate a search, schedule an appointment, update a calendar, create an alert, alarm or other electronic reminder, generate a text or email, make a telephone call, access a website, etc., responding to the user with a request for more information regarding the voice command or to confirm an understanding of the voice command, and/or initiating or performing any other task that the voice response system is capable of initiating, engaging and/or performing, either locally on the mobile device and/or remotely via one or more networks that the mobile device is capable of connecting to and interacting with. Initiating further processing may include evaluating or modifying the evaluation of subsequently received acoustic input, for example, when the detected voice command includes an explicit voice trigger.”).

	Regarding claim 15, Ganong teaches the method of claim 1, wherein performing the on-device natural language understanding and/or the on-device fulfillment comprises: performing the on-device natural language [...] processing stages to ascertain the semantic meaning of speech recognized using one or more ASR processing stages.”).
	performing the on-device fulfillment using the natural language understanding data. (Ganong, Par. 0093:” NLP stages may be used either to evaluate whether speech contained in acoustic input corresponds to a voice command, or to determine the meaning of the voice command and/or intent of the user so that the voice command can be carried out.”).

	Regarding claim 18, Ganong teaches the method of claim 1, further comprising altering the graphical interface when it is determined to activate the on-device natural language understanding and/or to activate the on-device fulfillment. (Ganong, Par. 0167:” Receipt of acoustic input may also be performed using visual indicators such as using one or more LEDs, flashing the display, or via non-visual indicators such as vibration to let the user know that acoustic input was received. In some embodiments, one or more indicators may immediately provide feedback to the user based on any acoustic activity sensed by the mobile device. For example, one or more LEDs on the mobile device may be powered in correspondence to the amplitude of incoming acoustic information, thereby providing a meter of sorts to show the user the intensity of acoustic information being sensed by the mobile device.”).

Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over Ganong and Johnson, as applied to claim 1, and further in view of Dolignon (US20200012916A1).

Dolignon was applied in the previous Office Action.
	Regarding claim 3, Ganong does not teach the method of claim 1, wherein the at least one non-microphone sensor on which the additional sensor data is based comprises a laser-based vision sensor.
Dolignon teaches wherein the at least one non-microphone sensor on which the additional sensor data is based comprises a laser-based vision sensor. (Dolignon, Par. 0029 of US Patent Application Publication US20200012916A1, or equivalently Par. 0015 of provisional application number 62/694177 filed in July 2018, teaches: “The virtual assistant kiosk 104 can include one or more sensors 114 including one or more video cameras, audio recording devices [e.g., microphones], motion detectors, IR sensors, WiFi/Bluetooth receivers, three-dimensional depth sensors [e.g., LIDAR], or the like. The one or more sensors 114 can be used by the holographic virtual assistant system 102 to interact with a user 116. One or more of the sensors 114 can be used to detect a presence of a user 116 in the proximity [e.g., within an encounter area 118] of the virtual assistant kiosk 104 and the holographic virtual assistant system 102 can determine that the user 116 is requesting an encounter with the virtual assistant kiosk 104.”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Ganong and Johnson in view of Dolignon such that the at least one non-microphone sensor on which the additional sensor data is based comprises a laser-based vision sensor, in order to provide more effective information/guidance to the user such as improving the effectiveness of the exchange between [...].

Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Ganong and Johnson, as applied to claim 1, and further in view of Metallinou et al. (US10515625B1) (hereinafter “Metallinou”), Nakadai (US20090018828A1), and Prasad et al. (US9697828B1) (hereinafter “Prasad”).

Nakadai was applied in the previous Office Action.
Regarding claim 4, neither Ganong nor Johnson teaches the method of claim 1, wherein the one or more conditions comprise the determination that the hot-word free audio data includes an utterance that is directed to the client device, and wherein determining the satisfaction of the one or more conditions based on processing the hot-word free audio data comprises: processing the hot-word free audio data using an acoustic model to generate a directed speech metric, the acoustic model trained to differentiate between spoken utterances that are directed to a client device and spoken utterances that are not directed to a client device; and determining, the probability based at least in part on the directed speech metric; and determining, based on the probability, that the hot-word free audio data includes an utterance that is directed to the client device.
Metallinou teaches wherein the one or more conditions comprise the determination that the hot-word free audio data includes an utterance that is directed to the client device, and wherein determining the satisfaction of the one or more conditions based on processing [...] (Metallinou: “... utterance 106 of the user via one or more microphones. In certain implementations, the utterance 106 may include or be preceded by a wakeword or other trigger expression or event that is spoken by the user 104 to indicate that subsequent user speech is device-directed speech [e.g., speech intended to be received and acted upon by the voice-enabled device 102 and/or speech processing system 200]. The voice-enabled device 102 may detect the wakeword and begin streaming audio signals to the speech processing system 200.”).
Therefore it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Ganong and Johnson in view of Metallinou to employ wherein the one or more conditions comprise the determination that the hot-word free audio data includes an utterance that is directed to the client device, in order to improve the accuracy of natural language understanding, as evidenced by Metallinou (see Col. 2, lines 22 - 23).
Ganong, Johnson, and Metallinou do not teach processing the hot-word free audio data using an acoustic model to generate a directed speech metric, the acoustic model trained to differentiate between spoken utterances that are directed to a client device and spoken utterances that are not directed to a client device; and determining, the probability based at least in part on the directed speech metric; and determining, based on the probability, that the hot-word free audio data includes an utterance that is directed to the client device.
Nakadai teaches processing the hot-word free audio data using an acoustic model to generate a directed speech metric, the acoustic model trained to differentiate between spoken sound direction, the acoustic model composition module composes an acoustic model adjusted to a direction based on the sound direction and direction-dependent acoustic models and the speech recognition module performs speech recognition with the acoustic model.”, and Par. 0118:”Direction dependent acoustic models H[θ_n], which are adjusted to respective directions θ_n with respect to the front of a robot RB, are stored in the acoustic model memory 49. A direction-dependent acoustic model H[θ_n] is trained on speech of a person uttered from a particular direction θ_n by way of Hidden Markov Model [HMM]. As shown in FIG. 14, a direction-dependent acoustic model H[θ_n] employs a phoneme as a unit for recognition, storing a corresponding sub-model h[m,θ_n] for the phoneme. In this connection, it may be possible that other units for recognition such as monophone, PTM, biphone, triphone and the like are adopted for generating a sub-model.”).
Therefore it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Ganong, Johnson, and Metallinou in view of Nakadai to process the hot-word free audio data using an acoustic model to generate a directed speech metric, the acoustic model trained to differentiate between spoken utterances that are directed to a client device and spoken utterances that are not directed to a client device, in order to provide an automatic speech recognition system which is able to recognize with high accuracy while a speaker and a moving object are moving around, as evidenced by Nakadai (see Par. 0008).

Prasad teaches determining, the probability based at least in part on the directed speech metric; and (Prasad, Col. 5, lines 20 – 23:”In order to reduce or minimize false detections, the wake word detector 100 may use information in addition to acoustic features associated with the wake word when computing wake word detection scores.").
determining, based on the probability, that the hot-word free audio data includes an utterance that is directed to the client device. (Prasad, Col. 5, lines 5 – 9:”The wake word detector 100 may calculate detection scores [e.g., confidence scores, likelihoods, probabilities] reflecting the likelihood that an utterance was directed at the computing device or, more generally, that an audio signal included the wake word.”).
Therefore it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Ganong, Johnson, Metallinou, and Nakadai in view of Prasad to determine, the probability based at least in part on the directed speech metric; and determining, based on the probability, that the hot-word free audio data includes an utterance that is directed to the client device, in order to improve the wake word detection model, as evidenced by Prasad (see Col. 10, lines 33-34).
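
For illustration, the two determining steps recited in claim 4 (a probability derived from a directed speech metric, then a decision based on that probability) can be sketched as follows. This is a minimal sketch and is not taken from Prasad, Nakadai, or the application: the logistic mapping, the 0.5 threshold, and the names `directed_probability` and `is_directed_at_device` are assumptions made only for this example.

```python
import math


def directed_probability(directed_speech_metric: float,
                         scale: float = 1.0,
                         bias: float = 0.0) -> float:
    """Map a raw directed-speech metric to a probability in (0, 1).

    The logistic function is an assumed mapping; the cited references
    do not prescribe one.
    """
    return 1.0 / (1.0 + math.exp(-(scale * directed_speech_metric + bias)))


def is_directed_at_device(directed_speech_metric: float,
                          threshold: float = 0.5) -> bool:
    """Decide, based on the probability, whether the utterance is device-directed."""
    return directed_probability(directed_speech_metric) >= threshold
```

In this sketch a higher metric yields a higher probability, and the device-directed determination is a simple comparison against the assumed threshold.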

Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Ganong and Johnson, as applied to claim 1, and in further view of Woo (US20190066680A1).

Woo was applied in the previous Office Action.
	
Regarding claim 14, Ganong teaches the method of claim 1, wherein deactivating the on-device speech recognition comprises deactivating the on-device speech recognition when it is determined to not activate the on-device natural language understanding and/or the fulfillment, and (Ganong, Par. 0103:” On the other hand, if the current time corresponds to a time of day when the user infrequently utters voice commands, the comparison may be used to influence the evaluation to discourage, to an extent desired, the conclusion that the acoustic input includes a voice command. It should be appreciated that a history of the times of past voice commands may be collected and utilized in other ways to influence the determination of whether acoustic input includes a voice command…”)
Ganong does not teach further based on at least a threshold duration of time passing without further voice activity detection and/or further recognized text.
Woo teaches further based on at least a threshold duration of time passing without further voice activity detection and/or further recognized text. (Woo, Par. 0085:” The instructions may be configured to count an activation standby time after processing a task for the voice information, and when a voice is not detected during the activation standby time, deactivate the voice recognition service.”, and Par. 0111:” When no user's speech is detected until the activation standby time passes after the first task information 415 is provided, the processor may deactivate [or stop] the voice recognition service.”).
Therefore it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Ganong, and Johnson in view of .
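
For illustration, the time-based deactivation discussed for claim 14 (deactivating recognition once a threshold duration passes without further voice activity) can be sketched as follows. This is a minimal sketch under stated assumptions, not Woo's implementation: the class name `RecognitionSession`, its methods, and the use of caller-supplied timestamps in seconds are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RecognitionSession:
    """Tracks whether speech recognition stays active (hypothetical helper)."""
    standby_seconds: float      # assumed threshold duration
    last_activity: float = 0.0  # timestamp of the most recent voice activity
    active: bool = False

    def activate(self, now: float) -> None:
        """Turn recognition on and start the standby window."""
        self.active = True
        self.last_activity = now

    def on_voice_activity(self, now: float) -> None:
        """Restart the standby window when further voice activity is detected."""
        if self.active:
            self.last_activity = now

    def tick(self, now: float) -> bool:
        """Deactivate if the standby window elapsed with no voice activity."""
        if self.active and now - self.last_activity >= self.standby_seconds:
            self.active = False
        return self.active
```

Passing timestamps in explicitly, rather than reading a clock inside the class, keeps the sketch deterministic and easy to check.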

Claim 16 is rejected under 35 U.S.C. 103 as being unpatentable over Ganong and Johnson, as applied to claim 1, and in further view of Braden (US9997086B1).

Braden was applied in the previous Office Action.
Regarding claim 16, Ganong and Johnson do not teach the method of claim 1, further comprising, during generating the recognized text using the on-device speech recognition: causing a streaming transcription of the recognized text to be rendered in a graphical interface at a display of the client device.
Braden teaches causing a streaming transcription of the recognized text to be rendered in a graphical interface at a display of the client device. (Braden, Col. 6, lines 41-46:"The operating system 118 can implement the speech-to-text software 120 and can be configured to have the display 108 show text outputted by the speech-to-text software 120. The operating system 118 can include a messaging application for storing, displaying, and creating messages.").
Therefore it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to substitute the teaching of Ganong and Johnson with Braden .

Claim 17 is rejected under 35 U.S.C. 103 as being unpatentable over Ganong, Johnson, and Braden, as applied to claim 16, and in further view of Lee (US20160217789A1).

Lee was applied in the previous Office Action.
Regarding claim 17, Ganong, Johnson, and Braden do not teach the method of claim 16, further comprising rendering, in the graphical interface with the streaming transcription, a selectable interface element that, when selected, causes the on-device speech recognition to halt.
Lee teaches the method of claim 16, further comprising rendering, in the graphical interface with the streaming transcription, a selectable interface element that, when selected, causes the on-device speech recognition to halt. (Lee, Par. 0114:” Meanwhile, the controller 180 may terminate a voice recognition function, during the process of the operation according to the voice recognition function, when there is a user input by an interface [e.g., an end button] configured to terminate the voice recognition function, or when the voice signal input to the microphone 143 is not detected for a preconfigured time [for example, T seconds, T is a natural number] or more. In addition, when the voice recognition function is terminated, or during the processing of the voice recognition function, the controller 180 may output a 
Therefore it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Ganong, Johnson, and Braden in view of Lee to employ a selectable interface element that, when selected, causes the on-device speech recognition to halt, in order to improve the usability, accessibility and competitiveness of the electronic device, as evidenced by Lee (see Par. 0172).


Claim 19 is rejected under 35 U.S.C. 103 as being unpatentable over Ganong (US20140274203A1) and Aleksic et al. (US20170270929A1) (hereinafter "Aleksic").

Ganong was applied in the previous Office Action.
	Regarding claim 19, Ganong teaches a method performed by an automated assistant application of a client device, the method performed using one or more processors of the client device, and the method comprising: (Ganong, Par. 0001:” Many mobile communications devices, such as smart phones, are equipped with a voice response system [e.g., a virtual assistant or agent] that can recognize speech and respond to voice commands to perform desired tasks [perform an Internet search, make a phone call, provide directions, answer questions, make recommendations, schedule appointments, etc.].”, and Par. 0123:” Exemplary system components of a mobile device may include a primary processor 115, a secondary processor 125 and an audio codec 105, all illustrated for convenience and clarity of illustration as being interconnected via a common bus 155.”).
determining to activate on-device speech recognition, wherein determining to activate the on-device speech recognition is in response to determining satisfaction of one or more conditions, (Ganong, Par. 0057:” Other techniques may be used to assist in minimizing false positive and false negative rates while keeping power consumption relatively low, when performing act 220…. It should be appreciated that anyone or combination of techniques described herein may be used to determine whether the acoustic input includes a voice command, as the aspects are limited to using any particular technique or combination of techniques.”, and Par. 0058:” The voice response system may then continue to monitor the acoustic environment to obtain further acoustic input [e.g., the voice response system may return to or continue to perform act 210].”, and Par. 0130:” The at least one first processing stage may also include one or more speech processing stages, provided the secondary processor has the processing power and/or functionality implemented to do so. For example, the secondary processor may be configured to perform limited vocabulary ASR on the acoustic input such as detecting an explicit voice trigger or keyword spotting.”).
determining the satisfaction of the one or more conditions comprising determining the satisfaction based on processing of one or both of: hot-word free audio data detected by one or more microphones of the client device, and additional sensor data that is based on output from at least one non-microphone sensor of the client device; (Ganong, Par. 0047:” a mobile device having a voice response system that evaluates received acoustic input to ascertain whether a user has spoken a voice command, without requiring an explicit trigger [hot-word free audio]... one or more microphones may sense acoustic activity in the environment and obtain the resulting acoustic input for further processing to assess whether the acoustic input includes a voice command.”, and Par. 0026:” ... conventional voice response systems require one or more explicit triggers to engage the voice response system. An "explicit trigger" refers herein to one or more specific, designated and predetermined actions required to engage a voice response system, and includes manual triggers [i.e., actions performed on the mobile device via a user's hands] and explicit voice triggers [i.e., speaking a specific, designated word or phrase to engage the voice response system].”, and Par. 0051:” In act 220, the acoustic input is processed to determine whether the acoustic input includes a voice command, without requiring an explicit trigger to do so. That is, the user is not required to manually engage the voice response system [e.g., by performing one or more manual triggers such as manipulating one or more interface controls by hand], nor is the user required to speak an explicit voice trigger to notify the voice response system that the user is uttering or will immanently utter an actionable voice command [though in some embodiments, the user can optionally use an explicit voice trigger if the user so desires, while not requiring the user to do so].”, and Par. 0053:” According to some embodiments, act 220 may include performing one or more voice activity detection [VAD] processing stages that evaluate whether the acoustic input has the characteristics of voice/speech or whether the acoustic input is more likely the result of non-voice acoustic activity in the environment.”, and Par. 0078:” Any of a variety of mobile device components capable of providing one or more contextual cues may also be activated as part of a staged or incremental wake-up when the mobile device is operating in a low power mode including, but not limited to, a GPS system, an accelerometer, or a clock to provide location motion information and time of day, respectively.”, and Par. 0105:” Many mobile devices are equipped with one or more components that can detect motion of the mobile device, typically by sensing acceleration [e.g., using a gyroscope or other component that responds to acceleration forces].”).
generating, using the on-device speech recognition, recognized text from a spoken utterance captured by the audio data and/or captured by additional hot-word free audio data detected by one or more of the microphones following the audio data, (Ganong, Par. 0174:” ASR component 930 may be configured to process received audio input [e.g., audio input representing the acoustic input] to form a textual representation of the audio input [e.g., a textual representation of the constituent words in the acoustic input that can be further processed to understand the meaning of the constituent words]. Such processing to produce a textual representation may be performed in any suitable way. In some embodiments, ASR component 930 may convert speech to a representation other than a textual representation, or the speech may not be recognized as words, but instead a sequence or collection of abstract concepts.”, and Par. 0041:”Mobile device 100 includes one or more transducers 130 for converting acoustic energy to electrical energy and vice versa. For example, transducers 130 may include one or more speakers and/or one or more microphones arranged on the mobile device to allow input/output [I/O] of acoustic information.”).
generating the recognized text comprising performing the on-device speech recognition on the audio data and/or the additional audio data; (Ganong, Par. 0145:”In the example illustrated in FIGS. 7A and 7B, VAD processing stage[s] 710 determine that acoustic input 705 includes voice content and the voice response system further evaluates acoustic input 705 using one or more speech processing stages 720 to determine whether the acoustic input includes a voice command. As discussed above, speech processing stages may include ASR, classification [e.g., using one or more statistical classifiers], NLP, etc. For example, according to some embodiments, acoustic input 705 may undergo limited vocabulary ASR to perform keyword spotting, any technique for which may be used to identify whether acoustic input 705 contains any words deemed suggestive of a voice command and/or to identify words needed to perform classification. Other ASR techniques may be utilized depending on the processing strategy being used to recognize one or more words in speech contained in the acoustic input.”, and Par. 0174:”ASR component 930 may be configured to process received audio input [e.g., audio input representing the acoustic input] to form a textual representation of the audio input [e.g., a textual representation of the constituent words in the acoustic input that can be further processed to understand the meaning of the constituent words]. Such processing to produce a textual representation may be performed in any suitable way. In some embodiments, ASR component 930 may convert speech to a representation other than a textual representation, or the speech may not be recognized as words, but instead a sequence or collection of abstract concepts.”).
determining, based on the recognized text, to activate on-device natural language understanding of the recognized text; (Ganong, Par. 0059:” Initiating one or more further processes when acoustic input is determined to include a voice command may include, but is not limited to, engaging one or more language processing stages to understand the meaning of the voice command, initiating one or more tasks needed to carry out the voice command such as initiating a search, launching one or more applications or processes to, for example, initiate a 
performing the activated on-device natural language understanding of the recognized text; and (Ganong, Par. 0148:” After concluding that a voice command is present, voice response system 750 may perform NLP stages 730 to evaluate the semantic content of the acoustic input to understand what the user intended the voice response system to do when speaking the voice command. In this respect, the acoustic input may be fully recognized to the extent that previous ASR stages were unable to [or not configured to] fully recognize the speech contained in the acoustic input prior to being processed by NLP stage[s] 730 [or large vocabulary and/or unrestricted ASR may be performed as part of the NLP processing]. In the example shown in FIG. 7B, NLP stage[s] 730 may ascertain that the user would like to view his/her calendar to check what appointments are scheduled for tomorrow. As a result, voice response system 750 may check to see what tomorrow's date is and launch a calendar application [see process 740] and pass to the calendar application any appropriate parameters 785, such as tomorrow's date so that the calendar can display the day that the user is interested in and/or list appointments on the calendar scheduled on the indicated date.”, and par. 0093:” Speech processing stages that may be utilized to evaluate the content of input also include one or more natural language processing stages to ascertain the semantic meaning of speech recognized using one or more ASR processing stages. NLP stages may be used either to evaluate whether speech contained in acoustic input corresponds to a voice command, or to determine the meaning of the voice command and/or intent of the user so that the voice command can be carried out.”).
Ganong does not teach wherein determining, based on the recognized text, to activate on-device natural language understanding of the recognized text comprises: determining whether at least part of the recognized text conforms to content text, the content text being rendered at the client device during the spoken utterance or being related to an entity being rendered at the client device during the spoken utterance.
Aleksic teaches wherein determining, based on the recognized text, to activate on-device natural language understanding of the recognized text comprises: determining whether at least part of the recognized text conforms to content text, the content text being rendered at context data assigned to a given dialog state may be analogized to a fingerprint that uniquely identifies the dialog state. Thus, when a speech recognizer receives a voice input transcription request that includes context data, the context data from the request may be compared to the respective sets of context data assigned to each of the dialog states. If a match or strong correlation is determined between the context data in the request and one of the assigned sets of context data, then speech recognizer may identify that the request pertains to the dialog state that corresponds to the matching set of context data. In some implementations, the set of context data that the computing system assigns to a dialog state may be based on the context data associated with all or some of the transcriptions in the group that corresponds to the dialog state. For example, if a significant plurality or a majority of the transcriptions in a given group are associated with a first screen signature value, then the first screen signature value may be assigned to the dialog state corresponding to that group.”, and Par. 0038:”As described further below, examples of context data include user account information, anonymized user profile information [e.g., gender, age, browsing history data, data indicating previous queries submitted on the device 108], location information, and a screen signature [i.e., data that indicates content displayed by the device 108 at or near a time when the voice input 110 was detected by the device 108].”).
Therefore it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Ganong in view of Aleksic to determine whether at least part of the recognized text conforms to content text, the content .
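
For illustration, the limitation of claim 19 that recognized text is checked for conformance to content text rendered at the client device can be sketched as follows. This is a minimal sketch and is not Aleksic's screen-signature technique: the token-overlap test, the `min_overlap` parameter, and the function names are assumptions made only for this example.

```python
import re


def _tokens(text: str) -> set[str]:
    """Lowercase a string and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def conforms_to_content(recognized_text: str,
                        content_texts: list[str],
                        min_overlap: float = 1.0) -> bool:
    """Return True if the recognized text covers enough of any rendered content text.

    min_overlap is the assumed fraction of a content text's tokens that must
    appear in the recognized text (1.0 means every token must appear).
    """
    spoken = _tokens(recognized_text)
    for content in content_texts:
        content_tokens = _tokens(content)
        if not content_tokens:
            continue
        overlap = len(spoken & content_tokens) / len(content_tokens)
        if overlap >= min_overlap:
            return True
    return False
```

Under this sketch, a match against rendered content would be the signal to activate on-device natural language understanding, while non-matching text would not trigger it on this basis alone.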

Claim 20 is rejected under 35 U.S.C. 103 as being unpatentable over Ganong and Aleksic, as applied to claim 19, and in further view of Gleaves et al. (US5548681A) (hereinafter "Gleaves").

	Regarding claim 20, Ganong and Aleksic do not teach the method of claim 19, wherein the content text is being rendered at the client device during the spoken utterance and comprises a graphically rendered suggested automated assistant action.
Gleaves teaches wherein the content text is being rendered at the client device during the spoken utterance and comprises a graphically rendered suggested automated assistant action. (Gleaves, Col. 2, lines 27 – 38:” … a speech recognition unit 5 for recognizing the content of the speech input entered by the human speaker according to the output of the synthetic speech response cancellation unit 2; a dialogue control unit 6 for selectively controlling the synthetic speech response appropriate for the content of the speech input recognized at the speech recognition unit 5; a synthetic speech response generation unit 7 for outputting the synthetic speech response selected by the dialogue control unit 6 to the loudspeaker 8 as well as to the synthetic speech response cancellation unit 2; and a display unit 16 for displaying visual data such as graphic data and image data to the human speaker.”).
.


Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure. Mohajer et al. (US Patent Application No: 20200410983A1) teaches (Par. 0013): ”Beyond using different wake-up phrases to invoke speech recognition using different language vocabularies, embodiments of the invention support wake-up phrase selection to vary many other features or components of a virtual assistant, as well as attributes of such components. Various embodiments configure one or more of: their text-to-speech (TTS) system, such as by speech morphing; the vocabulary that they recognize; the vocabulary that they use for responses; their ASR acoustic model; a graphic animation; parameters controlling the personality of a virtual character; the use of a particular user profile; and authentication functionalities. Various embodiments further perform configuration based on: a voice characteristic; the immediate state of a dialog system; and the location of the speaker.”
THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DARIOUSH AGAHI whose telephone number is (408)918-7689. The examiner can normally be reached Monday - Thursday and alternate Fridays, 7:30-4:30 PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta can be reached on 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available 





/DARIOUSH AGAHI/Examiner, Art Unit 2656
/HUYEN X VO/Primary Examiner, Art Unit 2656