DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
All objections/rejections not mentioned in this Office Action have been withdrawn by the Examiner.

Information Disclosure Statement
The information disclosure statements (IDS) submitted on 26 January 2022, 28 March 2022, and 21 April 2022 are being considered by the examiner.

Response to Amendments 
Applicant’s amendment filed on March 28, 2022 has been entered. 
The amendments to claims 1, 3, 5, 7, 8, 15, 17-19, and 24 have been acknowledged and entered.
After entry of the amendments, claims 1-11, 13-20, and 24 remain pending.
In view of the amendment to claim 15, the rejection of claim 15 under 35 U.S.C. §112 is withdrawn.
In view of the amendments to claims 1, 3, 5, 7, 8, 15, 17-19, and 24, the rejection of claims 1, 3, 5, 7, 8, 15, 17-19, and 24 under 35 U.S.C. §103 is withdrawn.
In light of the amended/newly added claims, new grounds for rejection under 35 U.S.C. §103 and under 35 U.S.C. §112 are provided in the response below. 

Response to Arguments
Applicant’s arguments regarding the prior art rejections under 35 U.S.C. §103, see pages 11-15 of the Response to Non-Final Office Action dated 27 January 2022, which was received on 28 March 2022 (hereinafter Response and Office Action, respectively), have been fully considered.
As Applicant has amended independent claim 1 to incorporate the limitations of claim 3, the rejection of claim 1 has been amended to incorporate the rejection of the respective limitations of claim 3, as appropriate.
With respect to the rejection(s) of claim(s) 1 under 35 U.S.C. §103 in light of White (U.S. Pat. App. Pub. No. 2019/0187787, hereinafter White) in view of Teller (U.S. Pat. App. Pub. No. 2013/0304479, hereinafter Teller), applicant asserts that “White fails to render obvious ‘determining ... based on detecting the occurrence of the gaze ... and based on the distance ... to perform: certain processing of audio data ..., wherein the audio data includes buffered audio data that is detected prior to detecting the gaze and that is buffered, prior to detecting the gaze, in a temporary buffer local to the client device; and ... wherein prior to initiating the certain processing of the audio data, the buffered audio data is buffered without performing the certain processing on the buffered audio data’.” Applicant’s argument in light of the amendments has been considered and is persuasive. Therefore, the rejection of claim 1 is withdrawn.
With respect to the rejection of claim 18 under 35 U.S.C. §103 in light of White in view of Teller, applicant asserts that Teller “fail[s] to teach or suggest doing anything ‘responsive to ... detecting one or multiple of the voice activity, the co-occurrence of the mouth movement of the user and the voice activity, and the gesture of the user’ - much less performing such detecting ‘while rendering the first human perceptible cue’ or ‘supplanting, at the display of the client device, rendering of the first human perceptible visual cue [that was rendered responsive to detecting the occurrence of the gaze] with rendering of a second human perceptible visual cue’.” This argument is not persuasive.
In response to applicant's arguments against the references individually, one cannot show nonobviousness by attacking references individually where the rejections are based on combinations of references. See In re Keller, 642 F.2d 413, 208 USPQ 871 (CCPA 1981); In re Merck & Co., 800 F.2d 1091, 231 USPQ 375 (Fed. Cir. 1986). Applicant cannot assert the deficiencies of Teller with regard to limitations determined to be taught by White as the basis for patentability of claims reciting said limitations.
However, in light of the amendments to claim 18, the rejection under 35 U.S.C. §103 in light of White in view of Teller is withdrawn.
Applicant further argues that the rejections of dependent claims 2-11, 13-17, and 19-20 should be withdrawn for at least the same reasons as independent claims 1 and 18. Applicant’s arguments in light of the amended claims are persuasive. As such, the rejections of claims 1-11, 13-17, and 19-20 under 35 U.S.C. §103 are withdrawn.
However, upon further consideration, new ground(s) of rejection under 35 U.S.C. §103 are made in light of combinations of White in view of Teller and newly cited reference Scanlon (U.S. Pat. App. Pub. No. 2020/0286484, hereinafter Scanlon).
With respect to the rejection(s) of claim(s) 24 under 35 U.S.C. §103 in light of White in view of Teller, applicant asserts that paragraph [0064] of Teller “fails to disclose ‘determining that the gesture is assigned to a plurality of responsive actions; selecting from the plurality of responsive actions, a single responsive action, wherein selecting the single responsive action is based on the content being rendered by the client device at the time of the gesture; and generating the response to cause performance of the selected single responsive action’ as set forth in amended independent claim 24.” However, this argument is not persuasive. 
As shown in the rejection below, White at paragraph [0064] teaches the cited elements of amended claim 24. As such, the rejection is maintained and is presented below as updated in light of the amendments.
Applicant has not provided any further statement; therefore, the Examiner directs Applicant to the rationale below.

Claim Rejections - 35 USC § 112
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.

The following is a quotation of the first paragraph of pre-AIA  35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.

Claims 18-20 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA ), first paragraph, as failing to comply with the written description requirement. The claim(s) contains subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA  35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention. 
Regarding claim 18, applicant recites “wherein the first human perceptible visual cue is rendered without simultaneous rendering of a second human perceptible visual cue at the display of the client device.” Applicant further proffers that support for the amendments to claim 18 can be found “in at least FIGS. 3B1, 3B2 and paragraphs [0091], [0092].” However, “without simultaneous rendering” is not recited or taught at the cited paragraphs. Paragraph [0091] focuses on regions of an image creating possible false detections and systems for mitigation of those false detections. Paragraph [0092] discloses examples of visual cues with relation to FIGS. 3B1, 3B2, and 3B1. The figures, FIGS. 3B1 and 3B2, show exemplary embodiments of the visual cues. Further review of the specification, as a whole, fails to provide support for the “without simultaneous rendering…” recited in amended claim 18. As such, the amendments to claim 18 constitute new matter and the claim is rejected.
Claims 19 and 20 stand rejected in light of their dependence from a rejected base claim. 
Further regarding claim 19, claim 19 recites “wherein the third human perceptible visual cue is not rendered simultaneously…” The same explanations described above with relation to claim 18 are applicable, mutatis mutandis, to this element of claim 19. As such, the amendments to claim 19 constitute new matter and the claim is further rejected on independent grounds.
Appropriate correction is required.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status. 
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1-5, 7-10, and 13-17 are rejected under 35 U.S.C. 103 as being unpatentable over White (U.S. Pat. App. Pub. No. 2019/0187787, hereinafter White) in view of Scanlon (U.S. Pat. App. Pub. No. 2020/0286484, hereinafter Scanlon).

Regarding claim 1, White discloses A method that facilitates hot-word free interaction between a user and an automated assistant (“a method for non-verbally engaging a virtual assistant” as performed by a virtual assistant device; White, ¶¶ [0032]), the method implemented by one or more processors and comprising (The “embodiments of the disclosure may be practiced in... a circuit utilizing a microprocessor, or on a single chip containing electronic elements or microprocessors.”; White, ¶¶ [0090]): receiving, at a client device,… image frames that are based on output from one or more cameras of the client device (“the input data received and/or the processing results may be stored locally {receiving}” at the client device where input data includes “currently captured data of classified color eye images {image frames}” from “a high-resolution still camera {based on the output from one or more cameras}” and where the camera “may be attached to, or in communication with, a virtual assistant device”; White, ¶¶ [0049], [0058]); processing, at the client device, the image frames of the stream using at least one trained machine learning model (“the at least one machine-learning algorithm associated with the engagement system may become more familiar with user-specific non-verbal inputs” where “the machine-learning algorithm may utilize data captured from a high-resolution still camera and compare previous data of classified color eye images with the currently captured data of classified color eye images.”; White, ¶¶ [0049]) stored locally on the client device (“processed by a machine-learning algorithm … operating within the engagement system,” where the “engagement system may be deployed locally” on the virtual assistant device {stored locally on the client device}”; White, ¶¶ [0043], [0030]) to detect occurrence of: a gaze of a user that is directed toward the client device (a virtual assistant device can use “gaze tracking (where the eye tracking hardware is able to 
detect specific locations focused on by eye gaze, such as a virtual assistant icon on a user interface)”; White, ¶¶ [0033], [0034]); determining, at the client device, a distance of the user relative to the client device (“In addition to input data (e.g., eye gaze data, at least one attribute of the eye-gaze data, etc.), the virtual assistant device may also receive contextual data (or environmental topology data)” which can include “a distance between the user and the electronic device {distance of the user relative to the client device}”; White, ¶¶ [0036]), wherein determining the distance of the user relative to the client device is based on one or both of: one or more of the image frames, and additional sensor data from an additional sensor of the client device (“contextual data may be collected or retrieved” including “spatial topology data (e.g., placement of physical objects, location of walls and other obstructions, distances between objects, presence or absence of other humans or animals, spatial locations of other humans or animals, etc.)” where “a standalone virtual assistant device may be configured with various hardware (such as infrared, face recognition, eye-gaze or other hardware) that facilitates detection of... contextual data” as well as “an eye-gaze tracker) for detecting input data (e.g., eye-gaze data, attributes of eye-gaze data, contextual data, etc.) 
at a closer distance.”; White, ¶¶ [0023], [0033]); determining, at the client device and based on detecting the occurrence of the gaze of the user and based on the distance of the user relative to the client device (“For example, the device may receive one or more eye-gaze signals {determining at the client device} and locations associated with each of the one or more eye-gaze signals” where “If the locations of the eye-gaze signals {based on detecting the occurrence of the gaze} are within a specified boundary associated with the virtual assistant device {based on distance of the user relative to the client device}, the engagement system may determine that the user desires to engage with the virtual assistant device”; White, ¶¶ [0038]), to perform: certain processing of audio data detected by one or more microphones of the client device (“the engagement system may determine that the user desires to engage with the virtual assistant device” where engagement can be a conversation {certain processing of audio data}, and where the system can include an “audio interface 974... coupled to a microphone to receive audible input {one or more microphones of the client device}”; White, ¶¶ [0038], [0098])… and initiating, at the client device, the certain processing of the audio data responsive to determining to perform the certain processing of the audio data (“Engaging with the virtual assistant device may comprise initiating {initiating, at a client device} a new conversation with the device and/or maintaining a conversation with the device {certain processing of audio data responsive to determining to perform...}” where engaging is responsive to determining that the user desires to engage {determining to perform the certain processing of the audio data}.; White, ¶¶ [0040], [0038]). 
However, White fails to expressly recite receiving, at a client device, a stream of image frames that are based on output from one or more cameras of the client device, wherein the audio data includes buffered audio data that is detected prior to detecting the gaze and that is buffered, prior to detecting the gaze, in a temporary buffer local to the client device, and wherein prior to initiating the certain processing of the audio data, the buffered audio data is buffered without performing the certain processing on the buffered audio data.
Scanlon teaches systems and methods for non-verbal preconditioning of audio processing (Scanlon, ¶ [0018]). Regarding claim 1, Scanlon teaches receiving, at a client device, a stream of image frames that are based on output from one or more cameras of the client device (“Image processing may be performed on still images or on video images, streams or files, and terms such as “image” or “image processing” are intended to encompass both still and moving images.”; Scanlon, ¶¶ [0019]), wherein the audio data includes buffered audio data that is detected prior to detecting the gaze and that is buffered, prior to detecting the gaze, in a temporary buffer local to the client device (the system can include “record[ing] audio in a circular buffer” such that “the system will have a few seconds of past audio continually in the buffer after the prompt occurs. Any delay arising from a delay in face detection 54 or gaze detection 64 can be compensated for, by retrieving the audio from the buffer”; Scanlon, ¶¶ [0090], FIG. 4), and wherein prior to initiating the certain processing of the audio data, the buffered audio data is buffered without performing the certain processing on the buffered audio data (As indicated in FIG. 4, the system only retrieves from the buffer (element 68) after successful face detection (element 54) or gaze detection (element 64) where the system then uses “the retrieved audio to populate the start of the audio recording and the speech processing.”; Scanlon, ¶¶ [0090], FIG. 4).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the non-verbal engagement systems of White to incorporate the teachings of Scanlon to include receiving, at a client device, a stream of image frames that are based on output from one or more cameras of the client device, wherein the audio data includes buffered audio data that is detected prior to detecting the gaze and that is buffered, prior to detecting the gaze, in a temporary buffer local to the client device, and wherein prior to initiating the certain processing of the audio data, the buffered audio data is buffered without performing the certain processing on the buffered audio data. Intent-specific interaction determined based on visual input with a user allows for the benefits of speech recognition systems while avoiding user privacy concerns created by “the ‘always listening’ approach,” as recognized by Scanlon. (Scanlon, ¶ [0007]-[0008] and [0018]).
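For illustration only, the buffering arrangement discussed above with relation to Scanlon (audio held in a circular buffer and retrieved for speech processing only after gaze detection) can be sketched as follows. This is a minimal sketch under stated assumptions and is not code from Scanlon, White, or the present application: the class name `PreGazeAudioBuffer`, its methods, the string representation of audio chunks, and the stand-in `_process` step are all hypothetical.

```python
from collections import deque


class PreGazeAudioBuffer:
    """Hypothetical illustration: keep recent audio unprocessed until a gaze is detected."""

    def __init__(self, max_chunks):
        # A deque with maxlen behaves as a circular buffer: once full,
        # appending a new chunk silently discards the oldest one.
        self._ring = deque(maxlen=max_chunks)
        self.processed = []
        self.gaze_detected = False

    def on_audio(self, chunk):
        if self.gaze_detected:
            # After the gaze, incoming audio goes straight to processing.
            self.processed.append(self._process(chunk))
        else:
            # Before the gaze, audio is only buffered; no processing occurs.
            self._ring.append(chunk)

    def on_gaze(self):
        # On gaze detection, retrieve the buffered (pre-gaze) audio, oldest
        # first, and use it to populate the start of the processed stream.
        self.gaze_detected = True
        while self._ring:
            self.processed.append(self._process(self._ring.popleft()))

    @staticmethod
    def _process(chunk):
        # Stand-in for speech processing (an assumption for this sketch).
        return chunk.upper()


buf = PreGazeAudioBuffer(max_chunks=2)
for chunk in ["a", "b", "c"]:
    buf.on_audio(chunk)      # buffered only; "a" is overwritten by "c"
buf.on_gaze()                # buffered audio is processed only now
buf.on_audio("d")            # post-gaze audio is processed directly
print(buf.processed)
```

In this sketch, nothing is processed while audio accumulates before the gaze, which mirrors the claim language that the buffered audio data is buffered without performing the certain processing on it.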

Regarding claim 3, the rejection of claim 1 is incorporated. White discloses all of the elements of the current invention as stated above. White further discloses wherein initiating the certain processing of the audio data comprises initiating local automatic speech recognition of the audio data at the client device (“Once the virtual assistant is engaged, the user may proceed to interact with virtual assistant device 108,” where interaction includes a conversation with the device {local automatic speech recognition}; White, ¶¶ [0028]), and wherein prior to initiating the local automatic speech recognition, the local automatic speech recognition is not performed on the... audio data (The system initiates the conversation in response to engaging the system, as indicated by the non-verbal input. As such, prior to initiating, the system is not engaged and the local automatic speech recognition, which is responsive to the engagement, does not occur.; White, ¶¶ [0039]). However, White fails to expressly recite wherein the audio data is buffered audio data.
The relevance of Scanlon is described above with relation to claim 1. Regarding claim 3, Scanlon teaches wherein the audio data is buffered audio data (The system may “record audio in a circular buffer”; Scanlon, ¶¶ [0040]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the non-verbal engagement systems of White to incorporate the teachings of Scanlon to include wherein the audio data is buffered audio data. Intent-specific interaction determined based on visual input with a user allows for the benefits of speech recognition systems while avoiding user privacy concerns created by “the ‘always listening’ approach,” as recognized by Scanlon. (Scanlon, ¶ [0007]-[0008] and [0018]).


Regarding claim 4, the rejection of claim 1 is incorporated. White further discloses further comprising, prior to initiating the certain processing of the audio data: rendering at least one human perceptible cue via an output component of the client device (“personal computer 701A comprises a feed 702A of an example of a detection process of a user who is not engaged with a virtual assistant,” thus prior to determining engagement and entering the conversation {initiating the certain processing of the audio data}. “In some example aspects, the feed 702A {including at least one human perceptible cue, e.g., “the indicator box 704A”} may be displayed to the user {rendering... via an output component of the client device}.”; White, ¶¶ [0081]).

Regarding claim 5, the rejection of claim 4 is incorporated. White further discloses wherein the at least one human perceptible cue comprises a first visual cue rendered responsive to detecting the occurrence of the gaze of the user that is directed toward the client device (“The indicator box 704A {at least one human perceptible cue...} may represent a detected gaze {rendered responsive to detecting the occurrence of the gaze of the user}... for engaging the virtual assistant” where engaging the virtual assistant includes determining that “locations of the eye-gaze signals {detecting the occurrence of the gaze of the user} are within a specified boundary associated with the virtual assistant device {directed toward the client device}.”; White, ¶¶ [0081], [0038]).

Regarding claim 7, the rejection of claim 1 is incorporated. White further discloses further comprising: prior to initiating the certain processing of the audio data detected via one or more microphones of the client device (As part of the determination of engagement (where determining engagement occurs prior to the “initiating a new conversation” {certain processing of the audio data...}); White, ¶¶ [0040]): detecting, at the client device, co-occurrence of mouth movement of the user and the voice activity based on local processing of one or more of the image frames and at least part of the audio data (“the engagement system may receive the eye-gaze signals and head-pose locations, as well as the verbal output of the user. {voice activity}” where “the NLP algorithm results may indicate [engagement level of the user] to the engagement system” {...based on local processing of at least part of the audio data}. Further, “the engagement system may employ the gaze-lock engine 308 to determine where (e.g., toward what direction and/or at what location within the environment) the user is looking... [and] also employ the multimodal engine 310 to determine the shape of the user’s lips if the user is speaking {co-occurrence of mouth movement and voice activity}.”; White, ¶¶ [0039], [0078]); wherein initiating the certain processing of the audio data is further responsive to detecting one or both of the voice activity and the co-occurrence of the mouth movement of the user and the voice activity (“If the user speaks an attention word {detecting voice activity} while looking at or in the direction of the virtual assistant device {further responsive to...}, the combination between the non-verbal input and attention word may increase the confidence of the engagement system that the user desires to engage with the virtual assistant, and as a result, may prompt the virtual assistant to provide a response back to the user,” thus, beginning the conversation {initiating certain processing of the audio data}.; White, ¶¶ [0039]).

Regarding claim 8, the rejection of claim 7 is incorporated. White further discloses wherein detecting the occurrence of the gaze of the user occurs at a first time (“a user may be in front of a computer screen watching a video.” where “watching a video” is determined by eye gaze.; White, ¶¶ [0064]), wherein detecting the co-occurrence of the mouth movement of the user and the voice activity occurs at a second time that is subsequent to the first time (“During the video, the user may ask ‘What is that?’” which occurs at a second time, after the user began watching the video {subsequent to the first time}. “Although the question does not include a ‘wake-up’ word, the virtual assistant engagement system may receive the dialogue and promptly activate, responding to the user accordingly.”; White, ¶¶ [0064]), and further comprising: prior to initiating the certain processing of the audio data and prior to the second time: rendering a first human perceptible cue, via an output component of the client device (In this example, the user is in front of the screen and watching a video {after the first time}, but the virtual assistant is not engaged {prior to initiating the certain processing} and the user has not spoken yet {prior to the second time}. 
“The indicator box 704A {a first human perceptible cue}” as displayed on the display of personal computer 701A {an output component of the client device} “may represent a detected gaze and/or proximity of the user to an engagement box 706A, which represents an outer boundary (or threshold) for engaging the virtual assistant.” thus, the indicator box 704A {first human perceptible cue} is displayed prior to engaging the virtual assistant; White, ¶¶ [0064], [0081]), responsive to detecting the occurrence of the gaze of the user that is directed toward the one or more cameras of the client device (“The indicator box 704A may represent a detected gaze and/or proximity of the user to an engagement box 706A” where representing a detected gaze is responsive to detecting the gaze of the user, as directed toward the client device and components thereof {e.g., one or more cameras of the client device}.; White, ¶¶ [0081]); and prior to initiating the certain processing of the audio data and subsequent to rendering of the first human perceptible cue: rendering a second human perceptible cue, via the output component or an additional output component of the client device (In the same example, the user is in front of the screen and watching a video {after the first time} and the user has spoken {thus, after rendering the first human perceptible cue}, but before the engagement {prior to initiating the certain processing of audio data}. 
“When alignment is achieved between indicator box 704B and engagement box 706B, the virtual assistant search bar 708B may illuminate {rendering a second human perceptible cue}” where the search bar 708B is displayed from the virtual assistant {via the output component of the client device}; White, ¶¶ [0064], [0083]), responsive to detecting the co-occurrence of mouth movement of the user and the voice activity (The search bar illumination occurs responsive to the voice activity, received after and in combination with the detected gaze.; White, ¶¶ [0064], [0083]).

Regarding claim 9, the rejection of claim 8 is incorporated. White further discloses wherein the first human perceptible cue is a first visual cue rendered via a display of the client device (The indicator box 704A {first human perceptible cue} is rendered via the display of the client device, as indicated in FIG. 7B.; White, ¶¶ [0064], [0081], FIG. 7B), and wherein the second human perceptible cue is a second visual cue rendered via the display of the client device (The search bar illumination {second human perceptible cue} is rendered via the display of the client device, as indicated in FIG. 7B.; White, ¶¶ [0064], [0083], FIG. 7B).

Regarding claim 10, the rejection of claim 8 is incorporated. White further discloses further comprising: in response to initiating the certain processing of the audio data and subsequent to the second time: rendering a third human perceptible cue, via the output component or the additional output component of the client device (“the virtual assistant engagement system may receive the dialogue and promptly activate, responding to the user accordingly” which can be “via textual output on the screen in search box 408 {rendering a third human perceptible cue}, verbally through the speakers attached to personal computer 404, or a combination of both textual and verbal output.”; White, ¶¶ [0064], [0070]).

Regarding claim 13, the rejection of claim 1 is incorporated. White further discloses wherein determining, based on detecting the occurrence of the gaze of the user and based on the distance of the user relative to the client device, to perform the certain processing of the audio data comprises: determining to perform the certain processing based on the distance of the user satisfying a threshold distance (“when the engagement system detects that a user is in that location within the living room,” where the user being in the “expect engagement... location” indicates the distance of the user satisfying a threshold distance, “the confidence level of engagement (as described in operation 206 of FIG. 2) may be automatically increased (and/or the threshold required for engagement may be decreased)” {determine to perform certain processing}; White, ¶¶ [0062]).

Regarding claim 14, the rejection of claim 1 is incorporated. White further discloses wherein determining, based on detecting the occurrence of the gaze of the user and based on the distance of the user relative to the client device, to perform the certain processing of the audio data comprises: determining to perform the certain processing of the audio data based on a magnitude of the distance of the user (“The engagement system may... adapt to the environmental context” thus determining engagement {determine to perform the certain processing of the audio data} based on the environmental context, “For instance, environmental [context] may include... a distance between the user and the electronic device,” where “a virtual assistant device... may be configured to receive “far-field” input data... [or] to receive “near-field” input data.” As near field refers to a first magnitude of distance (e.g., “a user being within about one meter or less of an electronic device running a virtual assistant application”) and far field refers to greater distances between the user and the electronic device, configuration to receive one or the other is determination based on magnitude of the distance of the user.; White, ¶¶ [0033], [0036], [0069]) and based on a gaze confidence metric for the gaze of the user (“The confidence level of the engagement system may consider a fixation threshold in determining the level of confidence. A fixation threshold may be defined as a predetermined period of time required to activate the virtual assistant,” where period of time refers to “eye-gaze signal... fixated on the virtual assistant device for a certain period of time”; White, ¶¶ [0042]), the gaze confidence metric generated based on the processing of the image frames of the stream using the at least one trained machine learning model (“the evaluation operation 206 may reference at least one machine-learning algorithm {generated... 
using the at least one trained machine learning model} to determine a threshold confidence level {the gaze confidence metric} for determining whether a user intends to engage with a virtual assistant... [using] received eye-gaze data {generated based on the processing of image frames}”; White, ¶¶ [0043]).

Regarding claim 15, the rejection of claim 1 is incorporated. White further discloses further comprising: determining, based on processing of one or more of the image frames locally at the client device, that the user is a recognized user (“face recognition technology {...based on processing of one or more image frames...} may allow the virtual assistant engagement system to discern when a particular user {determining...that the user is a recognized user} desires to engage with the virtual assistant,” where the system may process the input “locally, remotely, or using a combination of both.”; White, ¶¶ [0078], [0026]); wherein determining to perform the certain processing of the audio data is further based on determining that the user is a recognized user (“ the virtual assistant engagement system may be receiving multiple different dialogues from various people within the room, but once the engagement system detects the face of the user (e.g., owner) of the virtual assistant device, the engagement system may focus on that user’s facial expressions in addition to any dialog received from the user.”; White, ¶¶ [0078]).

Regarding claim 16, the rejection of claim 1 is incorporated. White further discloses wherein the certain processing of the audio data comprises automatic speech recognition of the audio data to generate recognized speech (“verbal input may be additionally processed concurrently with the non-verbal input at process input operation 204. For example, the processing operation 204 may consist of applying at least one natural language processing (“NLP”) algorithm to the input data” where verbal input can be “maintaining a conversation with the device.”; White, ¶¶ [0039], [0040]), and further comprising: determining, based at least in part on the recognized speech, an assistant request measure that indicates a probability that the recognized speech is a request directed to the automated assistant (“ The verbal input may be additionally processed concurrently with the non-verbal input at process input operation 204.” where “the NLP results may be used in conjunction with the non-verbal input to determine whether the user desires to...[maintain] a conversation with the device,” and where “the combination between the non-verbal input and [verbal input] may increase the confidence [value] of the engagement system”; White, ¶¶ [0039], [0040]); and determining, based at least in part on the assistant request measure, whether to render, via the client device, a response to the recognized speech (“the combination between the non-verbal input and attention word may increase the confidence of the engagement system that the user desires to engage with the virtual assistant, and as a result, may prompt the virtual assistant to provide a response back to the user,” where “Engaging with the virtual assistant device may comprise initiating a new conversation with the device and/or maintaining a conversation with the device.”; White, ¶¶ [0039], [0040]).

Regarding claim 17, the rejection of claim 16 is incorporated. White further discloses wherein determining whether to render the response to the recognized speech is further based on one or multiple of: the distance of the user (“when the engagement system detects that a user is in that location within the living room,” where the user being in the “expect engagement... location” indicates the distance of the user satisfying a threshold distance, “the confidence level of engagement (as described in operation 206 of FIG. 2) may be automatically increased (and/or the threshold required for engagement may be decreased)” {determine to perform certain processing}; White, ¶¶ [0062]); whether the user is a recognized user, as determined based on facial recognition based on one or more of the image frames… (“face recognition technology {...based on facial recognition based on one or more image frames...} may allow the virtual assistant engagement system to discern when a particular user {determining...that the user is a recognized user} desires to engage with the virtual assistant,” where “the virtual assistant engagement system may be receiving multiple different dialogues from various people within the room, but once the engagement system detects the face {based on facial recognition} of the user (e.g., owner) {user is a recognized user} of the virtual assistant device, the engagement system may focus on that user’s facial expressions in addition to any dialog received from the user.” {based on one or more of the image frames}; White, ¶¶ [0078]) and/or based on speaker identification based on at least part of the audio data (The system determines “prior conversational history between the user and the virtual assistant device,” thus, speaker identification based on at least part of the audio data; White, ¶¶ [0039]); and gaze confidence metric for the gaze of the user (“The confidence level of the engagement system may consider a fixation threshold in determining the 
level of confidence. A fixation threshold may be defined as a predetermined period of time required to activate the virtual assistant,” where period of time refers to “eye-gaze signal... fixated on the virtual assistant device for a certain period of time”; White, ¶¶ [0042]), the gaze confidence metric generated based on the processing of the image frames of the stream using the at least one trained machine learning model (“the evaluation operation 206 may reference at least one machine-learning algorithm {generated... using the at least one trained machine learning model} to determine a threshold confidence level {the gaze confidence metric} for determining whether a user intends to engage with a virtual assistant... [using] received eye-gaze data {generated based on the processing of image frames}”; White, ¶¶ [0043]).

Claims 6 and 11 are rejected under 35 U.S.C. 103 as being unpatentable over White and Scanlon as applied to claim 1 above, and further in view of Teller (U.S. Pat. App. Pub. No. 2013/0304479, hereinafter Teller).

Regarding claim 6, the rejection of claim 1 is incorporated. White and Scanlon disclose all of the elements of the current invention as stated above. White further discloses wherein processing the image frames using the at least one trained machine learning model to detect occurrence of the gaze of the user that is directed toward the one or more cameras of the client device comprises: processing a sequence of the image frames using the at least one trained machine learning model (“ the virtual assistant device 108 may receive eye-gaze data {processing a sequence of image frames} from a user… [and] additional processing of the input may include... applying a machine-learning algorithm [to the input]”; White, ¶¶ [0028]). However, White fails to expressly recite processing a sequence of the image frames …to determine, for each of the image frames of the sequence, whether the gaze of the user is directed toward the client device and detecting occurrence of the gaze of the user that is directed toward the client device, based on a quantity of the image frames of the sequence for which the gaze of the user is determined to be directed toward the one or more cameras.
The relevance of Scanlon is described above with relation to claim 1. Regarding claim 6, Scanlon teaches processing a sequence of the image frames … (face detection and gaze detection can “be performed on still images or on video images, streams or files, and terms such as “image” or “image processing” are intended to encompass both still and moving images.”; Scanlon, ¶¶ [0019]) to determine, for each of the image frames of the sequence, whether the gaze of the user is directed toward the client device (prior to processing the audio, “the one or more additional verification steps may comprise a gaze direction detection step to verify that the user is looking in a predefined direction or range of directions.”; Scanlon, ¶¶ [0025]); and detecting occurrence of the gaze of the user that is directed toward the client device (In one example, “the gaze detection module 42 operates... to determine that a user is looking at the device screen”; Scanlon, ¶¶ [0080]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the non-verbal engagement systems of White to incorporate the teachings of Scanlon to include processing a sequence of the image frames …to determine, for each of the image frames of the sequence, whether the gaze of the user is directed toward the client device and detecting occurrence of the gaze of the user that is directed toward the client device. Intent-specific interaction determined based on visual input with a user allows for the benefits of speech recognition systems while avoiding user privacy concerns created by “the ‘always listening’ approach,” as recognized by Scanlon. (Scanlon, ¶ [0007]-[0008] and [0018]). However, White and Scanlon fail to expressly recite detecting occurrence of the gaze of the user that is directed toward the client device, based on a quantity of the image frames of the sequence for which the gaze of the user is determined to be directed toward the one or more cameras.
Teller teaches systems and methods for using visual cues in device operation management. (Teller, ¶ [0006]). Regarding claim 6, Teller teaches detecting occurrence of the gaze of the user that is directed toward the client device (the system can detect that "a user has sustained their gaze {detecting occurrence of the gaze of the user} in the direction of the gaze target {...that is directed toward the client device}"; Teller, ¶ [0042]); based on a quantity of the image frames of the sequence for which the gaze of the user is determined to be directed toward the one or more cameras (The system further includes detecting that “a user has sustained their gaze {detecting occurrence of the gaze of the user} in the direction of the gaze target for longer than the predetermined time period” where sustained gaze over all frames is a quantity, thus determining gaze is based on a quantity of the image frames of the sequence. Further, “the gaze target may be a location of a gaze tracking device... [such as] a camera” and “a camera may be placed on top of a television, and the gaze target may be the center of the television.” Thus, the gaze target can be both the client device and the one or more cameras {gaze of the user determined to be directed toward the one or more cameras}.; Teller, ¶¶ [0042], [0038]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the non-verbal engagement systems of White as modified by the non-verbal interaction including audio buffering of Scanlon to incorporate the teachings of Teller to include detecting occurrence of the gaze of the user that is directed toward the client device, based on a quantity of the image frames of the sequence for which the gaze of the user is determined to be directed toward the one or more cameras. “Software-based mechanisms” disclosed in Teller “leverage the visual detection capabilities of a computer camera” which improves “a user's overall computing experience.” (Teller, ¶ [0005]).

Regarding claim 11, the rejection of claim 1 is incorporated. White and Scanlon disclose all of the elements of the current invention as stated above. However, White and Scanlon fail to expressly recite wherein determining the distance of the user relative to the client device is based on one or more of the image frames.
The relevance of Teller is described above with relation to claim 6. Regarding claim 11, Teller teaches wherein determining the distance of the user relative to the client device is based on one or more of the image frames (“a parallax from multiple images {determining the distance… based on one or more of the image frames} and the movement of a user may be used to estimate a distance to the user [from the gaze target] {distance of the user relative to the client device}.”; Teller, ¶¶ [0039]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the non-verbal engagement systems of White as modified by the non-verbal interaction including audio buffering of Scanlon to incorporate the teachings of Teller to include wherein determining the distance of the user relative to the client device is based on one or more of the image frames. “Software-based mechanisms” disclosed in Teller “leverage the visual detection capabilities of a computer camera” which improves “a user's overall computing experience.” (Teller, ¶ [0005]).

Claims 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over White in view of Scanlon and Teller.

Regarding claim 18, White discloses A method that facilitates hot-word free interaction between a user and an automated assistant (“a method for non-verbally engaging a virtual assistant” as performed by a virtual assistant device; White, ¶¶ [0032]), the method implemented by one or more processors of the client device and comprising (The “embodiments of the disclosure may be practiced in... a circuit utilizing a microprocessor, or on a single chip containing electronic elements or microprocessors.”; White, ¶¶ [0090]): receiving... image frames that are based on output from one or more cameras of the client device (“the input data received and/or the processing results may be stored locally {receiving}” at the client device where input data includes “currently captured data of classified color eye images {image frames}” from “a high-resolution still camera {based on the output from one or more cameras}” and where the camera “may be attached to, or in communication with, a virtual assistant device”; White, ¶¶ [0049], [0058]); processing, at the client device, the image frames of the stream using at least one trained machine learning model stored locally on the client device to detect occurrence of: a gaze of a user that is directed toward the client device (“the at least one machine-learning algorithm associated with the engagement system may become more familiar with user-specific non-verbal inputs” where “the machine-learning algorithm may utilize data captured from a high-resolution still camera and compare previous data of classified color eye images with the currently captured data of classified color eye images.”; White, ¶¶ [0049]) to detect occurrence of: a gaze of a user that is directed toward the client device (a virtual assistant device can use “gaze tracking (where the eye tracking hardware is able to detect specific locations focused on by eye gaze, such as a virtual assistant icon on a user interface)”; White, ¶¶ [0033], [0034]); rendering, at a 
display of the client device, a first human perceptible visual cue responsive to detecting the occurrence of the gaze of the user that is directed toward the client device (“the indicator box 704A may track the head position and other spatial topological data, and the engagement box 706A may track the head-pose and eye-gaze of the user” where “When alignment is achieved between indicator box 704B and engagement box 706B {responsive to detecting the occurrence of the gaze of the user that is directed toward the client device}, the virtual assistant search bar 708B may illuminate {a first human perceptible visual cue}” as displayed on the display of personal computer 701A {rendering at the display of the client device}; White, ¶¶ [0064], [0081], FIG. 7B) wherein the first human perceptible visual cue is rendered without simultaneous rendering of a second human perceptible visual cue at the display of the client device (“After the indicator box 704B is co-located within engagement box 706B for a predetermined period of time, the virtual assistant may be activated, as may be evidenced by a textual and/or graphical change in the virtual assistant search box 708B {a second human perceptible visual cue}” where the transition from illumination alone at the search bar to the “graphical change” indicates the second human perceptible visual cue is not rendered simultaneously with the first human perceptible visual cue.; White, ¶¶ [0081], [0083], FIG. 7B); while rendering the first human perceptible visual cue without simultaneous rendering of the second human perceptible visual cue: detecting, at the client device, one or multiple of: voice activity based on local processing of at least part of audio data captured by one or more microphones of the client device (“During the video, the user may ask ‘What is that?’” which occurs at a second time, which is after the user has begun watching the video but during the video itself. 
Thus, the question occurs while detecting eye-gaze and while rendering the indicator box {while rendering the first human perceptible visual cue}. “Although the question does not include a “wake-up” word, the virtual assistant engagement system may receive the dialogue and promptly activate, responding to the user accordingly.”; White, ¶¶ [0064]); co-occurrence of mouth movement of the user and the voice activity based on local processing of one or more of the image frames and at least part of the audio data; and a gesture of the user based on local processing of one or more of the image frames (“the engagement system may receive the eye-gaze signals and head-pose locations, as well as the verbal output of the user. {voice activity}” where “the NLP algorithm results may indicate [engagement level of the user] to the engagement system” {...based on local processing of at least part of the audio data}. Further, “the engagement system may employ the gaze-lock engine 308 to determine where (e.g., toward what direction and/or at what location within the environment) the user is looking... 
[and] also employ the multimodal engine 310 to determine the shape of the user’s lips if the user is speaking {co-occurrence of mouth movement and voice activity}.”; White, ¶¶ [0039], [0078]); and a gesture of the user based on local processing of one or more of the image frames (Thus the system, to “respond accordingly” to the dialogue, the system “receives {determines} the user’s gesture of physically pointing to the screen” {based on the gesture of the user} and the video {content being rendered by the client device at a time of the gesture}; White, ¶¶ [0064]); responsive to continuing to detect occurrence of the gaze, and detecting one or multiple of the voice activity, the co-occurrence of the mouth movement of the user and the voice activity, and the gesture of the user:...rendering of a second human perceptible visual cue (In the same example, the user is in front of the screen and watching a video {after the first time} and the user has spoken {thus, after rendering the first human perceptible cue}, but before the engagement {prior to initiating the certain processing of audio data}. 
“When alignment is achieved between indicator box 704B and engagement box 706B, the virtual assistant search bar 708B may illuminate {rendering a second human perceptible cue}” where the search bar 708B is displayed from the virtual assistant {via the output component of the client device}, and where the search bar illumination occurs responsive to the voice activity, received after and in combination with the detected gaze.; White, ¶¶ [0064], [0083]); subsequent to rendering the second human perceptible visual cue: initiating, at the client device, certain additional processing of the audio data and/or one or more of the image frames (“After the indicator box 704B is co-located within engagement box 706B for a predetermined period of time {subsequent to rendering the human perceptible visual cue}, the virtual assistant may be activated {initiating certain additional processing of the audio data and/or one or more of the image frames}, as may be evidenced by a textual and/or graphical change in the virtual assistant search box 708B and/or a verbal output from the virtual assistant.”; White, ¶¶ [0083]). However, White fails to expressly recite receiving a stream of image frames that are based on output from one or more cameras of the client device and supplanting, at the display of the client device, rendering of the first human perceptible visual cue with rendering of a second human perceptible visual cue.
The relevance of Scanlon is described above with relation to claim 1. Regarding claim 18, Scanlon teaches receiving a stream of image frames that are based on output from one or more cameras of the client device (“Image processing may be performed on still images or on video images, streams or files, and terms such as “image” or “image processing” are intended to encompass both still and moving images.” where the images are received from “a visual input device such as camera 18” which is part of the client device; Scanlon, ¶¶ [0019], [0078]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the non-verbal engagement systems of White to incorporate the teachings of Scanlon to include receiving a stream of image frames that are based on output from one or more cameras of the client device. Intent-specific interaction determined based on visual input with a user allows for the benefits of speech recognition systems while avoiding user privacy concerns created by “the ‘always listening’ approach,” as recognized by Scanlon. (Scanlon, ¶ [0007]-[0008] and [0018]). However, White and Scanlon fail to expressly recite supplanting, at the display of the client device, rendering of the first human perceptible visual cue with rendering of a second human perceptible visual cue. 
The relevance of Teller is described above with relation to claim 6. Regarding claim 18, Teller teaches responsive to continuing to detect occurrence of the gaze, and detecting one or multiple of the voice activity, the co-occurrence of the mouth movement of the user and the voice activity, and the gesture of the user: supplanting, at the display of the client device, rendering of the first human perceptible visual cue with rendering of a second human perceptible visual cue (“FIGS. 7A-7B illustrate example conceptual illustrations 700A, 700B of feedback provided by an indicator component...the plurality of LEDs 704 may begin to blink when a gaze direction of a user is determined to be in the direction of the gaze target {a first human perceptible visual cue}. For instance, the plurality of LEDs 704 may blink in unison, randomly, or according to a predetermined sequential pattern. The plurality of LEDs 704 may continue to blink increasingly rapidly until the user has sustained their gaze in the direction of the gaze target for longer than a predetermined time period {supplanting... rendering of the first human perceptible visual cue with rendering of a second human perceptible visual cue}.”; Teller, ¶¶ [0068]-[0069]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the non-verbal engagement systems of White as modified by the non-verbal interaction including audio buffering of Scanlon to incorporate the teachings of Teller to include supplanting, at the display of the client device, rendering of the first human perceptible visual cue with rendering of a second human perceptible visual cue. “Software-based mechanisms” disclosed in Teller “leverage the visual detection capabilities of a computer camera” which improves “a user's overall computing experience.” (Teller, ¶ [0005]).

Regarding claim 19, the rejection of claim 18 is incorporated. White further discloses further comprising: responsive to initiating the certain additional processing of the audio data and/or one or more of the image frames: supplanting, at the display of the client device, rendering of the second human perceptible visual cue with rendering of a third human perceptible visual cue (“the virtual assistant engagement system may receive the dialogue and promptly activate, responding to the user accordingly” which can be “via textual output on the screen in search box 408 {rendering a third human perceptible cue}, verbally through the speakers attached to personal computer 404, or a combination of both textual and verbal output.”; White, ¶¶ [0064], [0070]) wherein the third human perceptible visual cue is not rendered simultaneously with the first human perceptible visual cue and is not rendered simultaneously with the second human perceptible visual cue (The textual output is responsive to engagement. As the system determines engagement after both illumination (which is based on determination of the gaze in the direction of the device) and after the graphical change (which is based on gaze over a predetermined period of time), the textual output {the third human perceptible visual cue} occurs after the illumination {the first human perceptible visual cue} and the graphical change {the second human perceptible visual cue}. Therefore, the third human perceptible visual cue is not rendered simultaneously with the first human perceptible visual cue or the second human perceptible visual cue.; White, ¶¶ [0064], [0081], [0083]).

Regarding claim 20, the rejection of claim 18 is incorporated. White further discloses wherein initiating the certain additional processing of the audio data and/or one or more of the image frames comprises: initiating transmission of the audio data and/or the image frames to a remote server associated with the automated assistant. (“the disclosed system may rely on... remote databases... to formulate an appropriate response. This may be accomplished by utilizing … remote databases stored on or associated with servers 118, 120, 122” where “the user input data and virtual assistant device 108 response data may be stored in a remote database (e.g., on servers 118, 120, 122).”; White, ¶¶ [0026], [0029]).

Claim 24 is rejected under 35 U.S.C. 103 as being unpatentable over White in view of Teller.

Regarding claim 24, White discloses A method that facilitates hot-word free and touch-free gesture interaction between a user and an automated assistant (“a method for non-verbally engaging a virtual assistant” as performed by a virtual assistant device; White, ¶¶ [0032]), the method implemented by one or more processors and comprising (The “embodiments of the disclosure may be practiced in... a circuit utilizing a microprocessor, or on a single chip containing electronic elements or microprocessors.”; White, ¶¶ [0090]) : receiving, at a client device,… image frames that are based on output from one or more cameras of the client device (“the input data received and/or the processing results may be stored locally {receiving}” at the client device where input data includes “currently captured data of classified color eye images {image frames}” from “a high-resolution still camera {based on the output from one or more cameras}” and where the camera “may be attached to, or in communication with, a virtual assistant device”; White, ¶¶ [0049], [0058]); processing, at the client device, the image frames of the stream using at least one trained machine learning model (“the at least one machine-learning algorithm associated with the engagement system may become more familiar with user-specific non-verbal inputs” where “the machine-learning algorithm may utilize data captured from a high-resolution still camera and compare previous data of classified color eye images with the currently captured data of classified color eye images.”; White, ¶¶ [0049]) stored locally on the client device (“processed by a machine-learning algorithm … operating within the engagement system,” where the “engagement system may be deployed locally” on the virtual assistant device {stored locally on the client device}”; White, ¶¶ [0043], [0030]) to detect occurrence of: a gaze of a user that is directed toward the client device (a virtual assistant device can use “gaze tracking (where the eye 
tracking hardware is able to detect specific locations focused on by eye gaze, such as a virtual assistant icon on a user interface)”; White, ¶¶ [0033], [0034]); determining, based on detecting the occurrence of the gaze of the user, to generate a response to a gesture of the user that is captured by one or more of the image frames of the stream (“a user may desire to initiate a conversation with a virtual assistant, and the input received from the user may be a series of eye-gaze signals and a hand gesture (e.g., wave).”; White, ¶¶ [0064]); generating the response to the gesture of the user, generating the response comprising: determining the gesture of the user based on processing of the one or more of the image frames of the stream (“In another example aspect, a user may be in front of a computer screen watching a video. The computer may be running a virtual assistant. During the video, the user may ask “What is that?” Although the question does not include a “wake-up” word, the virtual assistant engagement system may receive the dialogue and promptly activate, responding to the user accordingly. 
The system may not only receive the user’s verbal input, but the system may also receive {determines} the user’s physical gesture of pointing {the gesture of the user} to the screen and the screen contents”; White, ¶¶ [0064]), and generating the response based on the gesture of the user and based on content being rendered by the client device at a time of the gesture (Thus the system, to “respond accordingly” to the dialogue, the system “receives {determines} the user’s gesture of physically pointing to the screen” {based on the gesture of the user} and the video {content being rendered by the client device at a time of the gesture}; White, ¶¶ [0064]) wherein generating the response based on the gesture of the user and based on the content being rendered by the client device at the time of the gestures comprises: determining that the gesture is assigned to a plurality of responsive actions (“the machine-learning algorithms may...adapt to user-specific preferences, actions and/or gestures {determining that the gesture…}, to more accurately determine when a user desires to initiate and/or maintain interaction with a virtual assistant {is assigned to a plurality of responsive actions}” where pointing is one gesture assigned to a plurality of responsive actions, as indicated by the action taken by the system associated with pointing at the video.; White, ¶¶ [0005], [0064]); selecting, from the plurality of responsive actions, a single responsive action, (The system selects the single responsive action of “responding accordingly” to the dialogue, in light of the gesture; White, ¶¶ [0005], [0064]) wherein selecting the single responsive action is based on the content being rendered by the client device at the time of the gesture (The single responsive action of “responding accordingly” to the user’s exclamation of “what is that” alongside the user pointing at the screen, with a response based on the video playing on the screen {based on the content being rendered by the 
client device…} which occurs based on the gesture, the dialogue, and the contemporaneous portion of the video {at the time of the gesture}; White, ¶¶ [0064]); and generating the response to cause performance of the selected single responsive action (“The response determination engine 312 may consider the processed input data results in determining how best to respond to the input {generating the response to cause the performance of...}” to “respond accordingly” based on “user’s verbal input,... the user’s physical gesture of pointing to the screen and the screen contents (e.g., a series of screenshots may be captured and processed by the engagement system)” thus a selected single responsive action; White, ¶¶ [0064]); and effectuating the response at the client device (The system “receive[s] the dialogue and promptly activate[s], [and] respond[s] to the user accordingly,” thus effectuating the response at the client device; White, ¶¶ [0064]). However, White fails to expressly recite receiving, at a client device, a stream of image frames that are based on output from one or more cameras of the client device.
The relevance of Teller is described above with relation to claim 6. Regarding claim 24, Teller teaches receiving, at a client device, a stream of image frames that are based on output from one or more cameras of the client device (The system can process “a video sequence”, where a video sequence is a stream of sequential image frames, and where “the gaze tracking component 104 may be a...video camera configured to obtain images of one or more people facing the camera.”; Teller, ¶¶ [0057], [0027]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the non-verbal engagement systems of White to incorporate the teachings of Teller to include receiving, at a client device, a stream of image frames that are based on output from one or more cameras of the client device. “Software-based mechanisms” disclosed in Teller “leverage the visual detection capabilities of a computer camera” which improves “a user's overall computing experience.” (Teller, ¶ [0005]).

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
VanBlon et al. (U.S. Pat. App. Pub. No. 2017/0169818) discloses systems and methods for user focus activated voice recognition.

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Sean E. Serraguard whose telephone number is (313)446-6627. The examiner can normally be reached 07:00-17:00 M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel C. Washburn, can be reached at (571) 272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/Sean E Serraguard/ Patent Examiner, Art Unit 2657

/DANIEL C WASHBURN/ Supervisory Patent Examiner, Art Unit 2657