DETAILED ACTION
Notice of AIA  Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after allowance under Ex Parte Quayle, 25 USPQ 74, 453 O.G. 213 (Comm'r Pat. 1935).  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, prosecution in this application has been reopened pursuant to 37 CFR 1.114.  Applicant's IDS submission filed on 28 January 2022 has been entered.  Claims filed 22 October 2021 are under examination. 

Priority
Acknowledgment is made of applicant's claim for foreign priority based on an application filed in the European Patent Office on 04 May 2014. It is noted, however, that applicant has not filed a certified copy of the PCT/US2018/031164 application as required by 37 CFR 1.55.  

Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.  The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
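(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 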

Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. 
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in this Office Action.
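This application includes a claim limitation that does not use the word “means,” but is nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitation uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function, and the generic placeholder is not preceded by a structural modifier. Such claim limitation is: 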
an invocation gesture in claims 1, 16, and 17, and all respective dependent claims, is plainly interpreted as a gesture which invokes.  In light of the specification, an invocation gesture is interpreted as any user input which enables a system to detect another user input (see [0002]). 
Because this claim limitation is being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, the claim limitation is being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this limitation interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may:  (1) amend the claim limitation(s) to avoid interpretation under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation recites sufficient structure to perform the claimed function so as to avoid interpretation under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office Action:
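A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.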

Claims 1-12 and 15-18 are rejected under 35 USC § 103 as being unpatentable over Divakaran (Divakaran; Ajay et al., US 20170160813 A1) in view of Bradski (Bradski; Gary R. et al., US 20190094981 A1), and further in view of Funami (Funami, Atsushi, US 20180011543 A1). 
Regarding claim 1 (currently amended, as interpreted under 35 USC 112(f)), Divakaran discloses a method (Divakaran; see [0002]) 
implemented by one or more processors of a client device that facilitates touch-free interaction between one or more users and an automated assistant (Divakaran describes plural computing devices, such as smart phones and servers, as an operating platform for a virtual personal assistant; see [0049], [0071]; describing a client device 410 containing personal assistant client application 450 facilitating image or voice interaction, or touch-free interaction, with an automated assistant 450; see Fig. 4, [0073]-[0077]; one of ordinary skill in the art before the effective filing date would have inferred one or more processors from Divakaran’s disclosure cited above), 
the method comprising: receiving a stream of image frames that are based on output from one or more cameras of the client device (Divakaran describes image understanding engine 416 making a determination from a sequence of images, or a stream of image frames; see Fig. 5, [0095]); 
processing the image frames of the stream using at least one trained machine learning model stored locally on the client device to monitor for occurrence of both: an invocation gesture of a user captured by at least one of the image frames, and a gaze of the user that is directed toward the client device (Divakaran describes recognizing gestures occurring over multiple images as inputs; see Fig. 19, [0224]-[0225]), 
wherein processing the image frames of the stream using the at least one trained machine learning model to monitor for the occurrence of the invocation gesture comprises processing first resolution versions of the image frames without processing second resolution versions of the image frames (Divakaran describes recognizing gestures occurring over multiple images as inputs; see Fig. 19, [0224]-[0225]; discloses varying resolutions on input images; see [0254]; one of ordinary skill in the art before the effective filing date would have inferred capturing the data at one resolution level and processing the data at a different resolution level from Divakaran’s disclosure of a captured image, see Fig. 21, [0254], and subsequent processing sub-sampling the captured image for processing at a different resolution; see [0260]), 
and wherein processing the image frames of the stream using the at least one trained machine learning model to monitor for the occurrence of the gaze comprises processing the second resolution versions of the image frames (Divakaran describes a virtual personal assistant accepting gaze direction as an input; see [0244]; Divakaran describes determining a match with a captured image after processing the captured image at a different, second resolution; see [0260]-[0261]); 
detecting, based on the monitoring, occurrence of both: the invocation gesture, and the gaze (Divakaran describes a virtual personal assistant accepting gaze direction as an input; see [0244]; discloses recognizing gestures occurring over multiple images as inputs; see Fig. 19, [0224]-[0225]); 
and in response to detecting the occurrence of both the invocation gesture and the gaze: causing at least one function of the automated assistant to be activated (Divakaran describes a computer-driven device invoking a multi-modal, conversational virtual personal assistant which interprets the received audible and visual sensory inputs to determine user intention; see [0036]-[0037], [0039]). 
Divakaran differs from the instant invention in that Divakaran does not appear to explicitly disclose: processing image data in a gaze direction at higher resolution, as might be implied by the clauses “processing first resolution versions of the image frames” and “processing second resolution versions of the image frames”.
However, in an analogous field of endeavor, Bradski discloses a method (Bradski; see [0017]) which 
processes image data in a gaze direction at a higher resolution (Bradski describes using higher resolution scanning near a user’s gaze direction; see [0386]).
Before the effective filing date it would have been obvious to one of ordinary skill in the art to modify Divakaran’s method for a client device having at least one of a vision component, microphone, and processor executing memory stored operational instructions to receive image and speech inputs to a machine-learning trained automated assistant, with Bradski’s method which processes image data in a gaze direction at a higher resolution, especially when considering the motivation to modify Divakaran with Bradski arising from the stated desire to save computation resources by only creating really high resolution rendering based on where the person is gazing (Bradski; see [0384]).
In the interest of compact prosecution, this examination considers the narrower-than-claimed condition that a detection of an invocation gesture occurs at the same time as, immediately before, or triggers a detection of a user gaze at a specific client device, which would lead to a conclusion that Divakaran and Bradski differ from the instant invention in that Divakaran and Bradski do not disclose the condition that a detection of an invocation gesture occurs at the same time as, immediately before, or triggers a detection of a user gaze direction. 
However, in an analogous field of endeavor, Funami discloses a method (Funami; see [0002]) in which 
a detection of an invocation gesture occurs at the same time as, immediately before, or triggers a detection of a user gaze at a specific region (Funami describes multiple user inputs which are interpreted as a movement for acquiring an operation right, or as an invocation gesture immediately preceding or triggering a detection of a user gaze at a specific region; see [0315]-[0324]). 
Before the effective filing date it would have been obvious to one of ordinary skill in the art to modify Divakaran’s and Bradski’s method for a client device having at least one of a vision component, microphone, and processor executing memory stored operational instructions to receive image and speech inputs to a machine-learning trained automated assistant and processing image data for detecting a gaze direction at a higher resolution, with Funami’s method in which a detection of an invocation gesture either occurs immediately before or triggers a detection of a user gaze at a specific 
Regarding claim 2 (original, as interpreted under 35 USC 112(f)), Divakaran, Bradski, and Funami disclose the method of claim 1, wherein the at least one function of the automated assistant that is activated in response to detecting the occurrence of both the invocation gesture and the gaze comprises: 
transmitting of audio data, captured via one or more microphones of the client device, to a remote server associated with the automated assistant (Divakaran describes a user device 102 transmitting microphone-captured audio data to a virtual personal assistant; see [0050], [0072]). 
The motivation to combine presented prior applies equally here.
Regarding claim 3 (original, as interpreted under 35 USC 112(f)), Divakaran, Bradski, and Funami disclose the method of claim 1, wherein the at least one function that is activated in response to detecting the occurrence of both the invocation gesture and the gaze comprises: 
transmitting of additional image frames to a remote server associated with the automated assistant, the additional image frames based on output from one or more of the cameras and received after detecting the occurrence of both the invocation gesture and the gaze (Divakaran describes a virtual personal assistant receiving image input representing gesture inputs; see [0040], [0041]). 
The motivation to combine presented prior applies equally here.
Regarding claim 4 (original, as interpreted under 35 USC 112(f)), Divakaran, Bradski, and Funami disclose the method of claim 1, wherein the at least one function that is activated in response to detecting the occurrence of both the invocation gesture and the gaze comprises: 
processing of buffered audio data at the client device, the buffered audio data being stored in memory at the client device and being captured via one or more microphones of the client device, and the processing of the buffered audio data including one or both of: invocation phrase detection processing, and automatic speech recognition (Divakaran describes a user device 102 transmitting microphone-captured audio data to a virtual personal assistant; see [0050], [0072]; discloses user spoken phrases obtained, or captured, and stored as training data; see [0219]). 
The motivation to combine presented prior applies equally here.
Regarding claim 5 (original, as interpreted under 35 USC 112(f)), Divakaran, Bradski, and Funami disclose the method of claim 4, wherein the processing of the buffered audio data comprises 
the automatic speech recognition, and wherein the automatic speech recognition comprises voice-to-text processing (Divakaran describes speech, or voice, to text processing; see [0075], [0134]). 
The motivation to combine presented prior applies equally here.
Regarding claim 6 (original, as interpreted under 35 USC 112(f)), Divakaran, Bradski, and Funami disclose the method of claim 4, wherein the processing of the buffered audio data comprises 

and further comprising: in response to the invocation phrase detection processing detecting presence of an invocation phrase in the buffered audio data, performing one or both of: transmitting further audio data, captured via the one or more microphones of the client device, to a remote server associated with the automated assistant (Divakaran describes a user device 102 transmitting microphone-captured audio data to a virtual personal assistant; see [0050], [0072]; discloses user spoken phrases obtained, or captured, and stored as training data; see [0219]); 
and transmitting of additional image frames to a remote server associated with the automated assistant, the additional image frames based on output from one or more of the cameras and received after detecting the occurrence of both the invocation gesture and the gaze (Divakaran describes a virtual personal assistant receiving image input representing gesture inputs; see [0040], [0041]). 
The motivation to combine presented prior applies equally here.
Regarding claim 7 (original, as interpreted under 35 USC 112(f)), Divakaran, Bradski, and Funami disclose the method of claim 1, wherein processing the image frames of the stream using at least one trained machine learning model stored locally on the client device to monitor for occurrence of both the invocation gesture and the gaze comprises: 
using a first trained machine learning model in processing the first resolution version of the image frames to monitor for occurrence of the invocation gesture 
and using a second trained machine learning model in processing the second resolution version of the image frames to monitor for the gaze of the user that is directed toward the client device (Bradski describes using higher resolution scanning near a user’s gaze direction and a lower resolution outside the gaze direction; see [0384]; Divakaran describes determining a match with a captured image after processing the captured image at a different, second resolution; see [0260]-[0261]; it would have been obvious to one of ordinary skill in the art before the effective filing date to process image inputs at different resolutions depending on a need to save computational resources; see Bradski at [0384]). 
The motivation to combine presented prior applies equally here.
Regarding claim 8 (original, as interpreted under 35 USC 112(f)), Divakaran, Bradski, and Funami disclose the method of claim 7, 
wherein using the second trained machine learning model to monitor for the gaze of the user that is directed toward the client device occurs only in response to detecting occurrence of the invocation gesture using the first trained machine learning model (Bradski describes detecting user gaze directed at particular objects; see [1003]). 
The motivation to combine presented prior applies equally here.
Regarding claim 9 (original, as interpreted under 35 USC 112(f)), Divakaran, Bradski, and Funami disclose the method of claim 8, 

The motivation to combine presented prior applies equally here. 
Regarding claim 10 (original, as interpreted under 35 USC 112(f)), Divakaran, Bradski, and Funami disclose the method of claim 1, further comprising: 
receiving a stream of audio data frames that are based on output from one or more microphones of the client device  (Divakaran describes a user device 102 transmitting microphone-captured audio data to a virtual personal assistant; see [0050], [0072]); 
processing the audio data frames of the stream using at least one trained invocation phrase detection machine learning model stored locally on the client device 
detecting the occurrence of the spoken invocation phrase based on the monitoring for the occurrence of the spoken invocation phrase (Divakaran describes a user device 102 transmitting microphone-captured audio data to a virtual personal assistant; see [0050], [0072]); 
wherein causing the at least one function of the automated assistant to be activated is in response to detecting the occurrence of the spoken invocation phrase in temporal proximity to both the invocation gesture and the gaze (Divakaran describes a user device 102 transmitting microphone-captured audio data to a virtual personal assistant; see [0050], [0072]; discloses user spoken phrases obtained, or captured, and developing a model based on the detected spoken phrases; see [0219]). 
The motivation to combine presented prior applies equally here. 
Regarding claim 11 (original, as interpreted under 35 USC 112(f)), Divakaran, Bradski, and Funami disclose the method of claim 10, 
wherein the at least one function that is activated comprises one or both of: transmitting of additional audio data frames captured via the one or more microphones of the client device, to a remote server associated with the automated assistant; and transmitting of one or more additional image frames from one or more of the cameras, to the remote server associated with the automated assistant (Divakaran describes a virtual personal assistant receiving image input representing gesture inputs; see [0040], [0041]). 
The motivation to combine presented prior applies equally here. 
Regarding claim 12 (original, as interpreted under 35 USC 112(f)), Divakaran, Bradski, and Funami disclose the method of claim 1, wherein processing the image frames of the stream using at least one trained machine learning model stored locally on the client device to monitor for occurrence of both the invocation gesture and the gaze comprises: 
processing the image frames using a first trained machine learning model to predict a region of the image frames that includes a human face; and processing the region of the image frames using a second trained machine learning model trained to detect the gaze of the user (Divakaran describes a virtual personal assistant accepting gaze direction as an input; see [0244]; discloses recognizing gestures occurring over multiple images as inputs; see Fig. 19, [0224]-[0225]; discloses deriving classifiers over multiple input images by machine learning techniques; Fig. 19, [0230]; discloses facial recognition and analysis of input images; see [0167]). 
The motivation to combine presented prior applies equally here. 
Regarding claim 15 (original, as interpreted under 35 USC 112(f)), Divakaran, Bradski, and Funami disclose the method of claim 1, further comprising: 
detecting, based on a signal from a presence sensor, that a human is present in an environment of the client device (Funami describes an infrared camera 13 which detects a person, or a human; see [0105]); 
causing the one or more cameras to provide the stream of image frames in response to detecting that the human is present in the environment (Divakaran describes a camera capturing images of user input; see [0072]; Divakaran describes image understanding engine 416 making a determination from a sequence of images, or a stream of image frames; see Fig. 5, [0095]). 
The motivation to combine presented prior applies equally here. 
Regarding claim 16 (currently amended, as interpreted under 35 USC 112(f)), Divakaran discloses a client device (Divakaran, describing a client device 410; see Fig. 4, [0073]) comprising:
at least one vision component (Divakaran, describing a vision component, or image capture device 2002; see Fig. 20, [0240]); 
at least one microphone (Divakaran describes a user device 102 transmitting microphone-captured audio data to a virtual personal assistant; see [0050], [0072]); 
one or more processors (Divakaran describes plural computing devices, such as smart phones and servers, as an operating platform for a virtual personal assistant; see [0049], [0071]; one of ordinary skill in the art before the effective filing date would have inferred one or more processors from Divakaran’s above disclosure); 
memory operably coupled with the one or more processors, wherein the memory stores instructions that, in response to execution of the instructions by one or more of the processors, cause one or more of the processors to perform the following operations (Divakaran; see [0181], [0339]): 
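receiving a stream of vision data that is based on output from the vision component of the client device (Divakaran describes image understanding engine 416 making a determination from a sequence of images, or a stream of image frames; see Fig. 5, [0095]); 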
receiving a stream of audio data that is based on output from the microphone of the client device (Divakaran describes a user device 102 transmitting microphone-captured audio data to a virtual personal assistant; see [0050], [0072]; discloses user spoken phrases obtained, or captured, and stored as training data; see [0219]); 
processing the vision data using at least one trained machine learning model stored locally on the client device to monitor for occurrence of both: an invocation gesture of a user captured by the vision data, and a gaze of the user that is directed toward the client device (Divakaran describes a virtual personal assistant accepting gaze direction as an input; see [0244]; Divakaran describes recognizing gestures occurring over multiple images as inputs; see Fig. 19, [0224]-[0225];  Divakaran further describes deriving classifiers over multiple input images by machine learning techniques; Fig. 19, [0230]); 
detecting, based on the monitoring, occurrence of both: the invocation gesture, and the gaze (Divakaran describes recognizing gestures occurring over multiple images as inputs; see Fig. 19, [0224]-[0225]; Divakaran describes a virtual personal assistant accepting gaze direction as an input; see [0244]);
determining that the audio data corresponds to the user that provided the invocation gesture and the gaze (Divakaran describes multi-modal input detection including speech, gesture, and gaze; see [0039], [0072], [0224]-[0225], [0244]); 
and in response to detecting the occurrence of both the invocation gesture and the gaze, and contingent on determining that the audio data corresponds to the user 
Divakaran differs from the instant invention in that Divakaran does not explicitly disclose: a computing device containing one or more processors.
However, in an analogous field of endeavor, Bradski discloses a system containing computing devices (Bradski; see [0170]) containing 
one or more processors (Bradski, disclosing a computing device 11 containing one or more processors; see [0170]; disclosing a client device 2402 containing one or more processors; see [0526]).
Before the effective filing date it would have been obvious to one of ordinary skill in the art to modify Divakaran’s client device having at least one of a vision component, microphone, and processor executing memory stored operational instructions to receive image and speech inputs to a machine-learning trained automated assistant, with Bradski’s system containing computing devices containing one or more processors, especially when considering the motivation to modify Divakaran with Bradski arising from the stated desire to save computation resources by only creating really high resolution rendering based on where the person is gazing (Bradski; see [0384]).
Divakaran and Bradski differ from the instant invention in that Divakaran and Bradski do not disclose the condition that a detection of an invocation gesture occurs at the same time as, immediately before, or triggers a detection of a user gaze direction. 
However, in an analogous field of endeavor, Funami discloses a system (Funami describes a system 100; see Fig. 1A, [0042]) in which 
a detection of an invocation gesture occurs at the same time as, immediately before, or triggers a detection of a user gaze at a specific region (Funami describes multiple user inputs which are interpreted as a movement for acquiring an operation right, or as an invocation gesture immediately preceding or triggering a detection of a user gaze at a specific region; see [0315]-[0324]). 
Before the effective filing date it would have been obvious to one of ordinary skill in the art to modify Divakaran’s and Bradski’s client device having at least one of a vision component, microphone, and processor executing memory stored operational instructions to receive image and speech inputs to a machine-learning trained automated assistant and processing image data for detecting a gaze direction at a higher resolution, with Funami’s system in which a detection of an invocation gesture either occurs immediately before or triggers a detection of a user gaze at a specific region, especially when considering the motivation to modify Divakaran and Bradski with Funami arising from the stated desire to provide a system which detects a gesture 
Regarding claim 17 (original, as interpreted under 35 USC 112(f)), Divakaran, Bradski, and Funami disclose the client device of claim 16, wherein the operations further include:
receiving, in response to the transmitting, responsive content; rendering the responsive content via one or more user interface output devices of the client device (Divakaran describes a user device 102 transmitting microphone-captured audio data to a virtual personal assistant; see [0050], [0072]; discloses user spoken phrases obtained, or captured, and stored as training data; see [0219]).
The motivation to combine presented prior applies equally here.
Regarding claim 18 (original, as interpreted under 35 USC 112(f)), Divakaran discloses a system (Divakaran; see [0002]), comprising: 
at least one vision component (Divakaran, describing a vision component, or image capture device 2002; see Fig. 20, [0240]); 

one 

process the vision data using at least one trained machine learning model stored locally on the client device to monitor for occurrence of both: an invocation gesture of a user captured by the vision data, and a gaze of the user that is directed toward the client device (Divakaran describes deriving classifiers over multiple input images by machine learning techniques; Fig. 19, [0230]); 
detect, based on the monitoring, occurrence of both: the invocation gesture, and the gaze (Divakaran describes a virtual personal assistant accepting gaze direction as an input; see [0244]; discloses recognizing gestures occurring over multiple images as inputs; see Fig. 19, [0224]-[0225]; discloses deriving classifiers over multiple input images by machine learning techniques; Fig. 19, [0230]); 
and in response to detecting the occurrence of both the invocation gesture and the gaze: cause at least one function of the automated assistant to be activated (Divakaran describes a computer-driven device invoking a multi-modal, conversational virtual personal assistant which interprets the received audible and visual sensory inputs to determine user intention; see [0036]-[0037], [0039]). 
Divakaran differs from the instant invention in that Divakaran does not explicitly disclose: a computing device containing one or more processors. 
However, in an analogous field of endeavor, Bradski discloses a system containing computing devices (Bradski; see [0170]) containing 
one or more processors (Bradski, disclosing a computing device 11 containing one or more processors; see [0170]; disclosing a client device 2402 containing one or more processors; see [0526]).
Before the effective filing date it would have been obvious to one of ordinary skill in the art to modify Divakaran’s system having a vision component and processor executing memory stored operational instructions to receive image and speech inputs to a machine-learning trained automated assistant, with Bradski’s system containing computing devices containing one or more processors, especially when considering the motivation to modify Divakaran with Bradski arising from the stated desire to save computation resources by only creating really high resolution rendering based on where the person is gazing (Bradski; see [0384]). 
Divakaran and Bradski differ from the instant invention in that Divakaran and Bradski do not disclose a presence sensor detecting that a human is present.
However, in an analogous field of endeavor, Funami discloses a system (Funami describes a system 100; see Fig. 1A, [0042]) containing  
a presence sensor detecting that a human is present (Funami describes an infrared camera 13 which detects a person, or a human; see [0105]). 
Before the effective filing date it would have been obvious to one of ordinary skill in the art to modify Divakaran’s and Bradski’s system having at least one of a vision . 

Allowable Subject Matter
Claims 13-14 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims. 
The particularly significant, distinguishing structural and functional features are a determination that a region of the image frames corresponds to an electronic display, and ignoring the electronic display in monitoring for the occurrence of both an invocation gesture and a user gaze, when these structural and functional features are considered with all other structural and functional features in each of the respective claims. 

Conclusion
References found pertinent to the instant Office Action but not cited in rejection are collected in the attached PTO-892 form and described below: 

Li, Chian Chiu, US 20190294252 A1, describes a vision system having one or more processors executing memory stored operational instructions to receive image and speech inputs (see Fig. 5), but does not expressly describe a machine-learning trained automated assistant or artificial intelligence; 
Okubo, Masafumi, et al., US 20160373269 A1, describes a vision system having one or more processors executing memory stored operational instructions to receive image and speech inputs (see Fig. 1), but does not expressly describe a machine-learning trained automated assistant or artificial intelligence;  
Scheessele; Evan, US 20150033130 A1, describes a vision system having one or more processors executing memory stored operational instructions to receive image and speech inputs (see Fig. 2), but does not expressly describe a machine-learning trained automated assistant or artificial intelligence; 
Fotland; David Allen, et al., US 9832452 B1, describes a system using machine learning to provide facial recognition and speech recognition, in which the resolution of captured images is adjusted (see Detailed Description paragraphs 32 and 33). 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHAEL J EURICE whose telephone number is (571)270-5957. The examiner can normally be reached weekdays from about 6:00 AM to 2:30 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Benjamin C. Lee, can be reached at 571-272-2963. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Michael J Eurice/
Primary Examiner, Art Unit 2693