Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Note: In order to better show what is and is not taught by the references, Examiner shows some words underlined. Words that are underlined indicate teachings of the cited reference, and may not specifically be claimed.


Claims 1-6, 8-13, 16, 17, and 22-24 are rejected under 35 U.S.C. 103 as being unpatentable over Okubo et al. (US20160373269, hereinafter Okubo) in view of Cassidy et al. (US9263044, hereinafter Cassidy) and further in view of Rajendran (US20170026764, hereinafter Rajendran).

As to claims 1, 16, 17:
Okubo shows a method, corresponding client device, and corresponding system, implemented by one or more processors of a client device that facilitates touch-free interaction between one or more users and an automated assistant, the method comprising:
receiving a stream of image frames that are based on output from one or more cameras of the client device (¶ [0123]) (e.g., a camera device detects whether the line-of-sight of the user is directed toward itself);
processing the image frames of the stream to monitor for occurrence of both:
a gaze of a user that is directed toward the one or more cameras of the client device (¶ [0122]) (e.g., whether the line-of-sight of the user is directed toward itself or not), and
movement of a mouth of the user (¶ [0124]) (e.g., in addition to detection of the line-of-sight (gaze tracking), lip detection may be used for determination regarding whether or not the user will start talking; lip detection is detecting mouth motions or lip actions of the user from images taken by cameras);
detecting, based on the monitoring, occurrence of both:
the gaze of the user, and the movement of the mouth of the user (¶ [0124], [0093], [0120]) (e.g., in addition to detection of the line-of-sight (gaze tracking), lip detection may be used for determination regarding whether or not the user will start talking; lip detection is detecting mouth motions or lip actions of the user from images taken by cameras; components in the following embodiments which are not included in an independent Claim indicating the highest concept are described as being optional components; also, in all of the embodiments, the contents of each can be combined; it should be noted that the order of the processing of steps S101 and S103 is exemplary, and processing may be performed in the opposite order from these, and at least part may be performed in parallel);
and in response to detecting the occurrence of both the gaze of the user and the movement of the mouth of the user, adapting 
identifies the direction where the speech source (user) is, based on position information of the user, obtained by the camera or the like, and reduces ambient noise from the obtained sound using the direction of the speech source (¶ [0122])

Okubo fails to specifically show:
processing the image frames of the stream using at least one trained machine learning model stored locally on the client device;
adapting rendering of user interface output of the client device, wherein adapting rendering of user interface output of the client device comprises: reducing a volume of rendering of audible user interface output rendered by the client device, or halting the rendering of the audible user interface output rendered by the client device.
In the same field of invention, Cassidy teaches: noise reduction based on mouth area movement recognition. Cassidy further teaches: processing the image frames of the stream using at least one trained machine learning model stored locally (col. 4, l. 33-48; fig. 4, el. 412; col. 4, l. 15-18) (e.g., an image process component 308 analyzes movement of a mouth area of a user by communicating with a model library 312, which stores models of known or predefined mouth movements for oral communications such as speaking or singing).
Thus, it would have been obvious to one of ordinary skill in the art, having the teachings of Okubo and Cassidy before the effective filing date of the invention, to have combined the teachings of Cassidy with the method, corresponding client device, and corresponding system, as taught by Okubo. 
One would have been motivated to make such combination because a way to enable communication in environments that are not ideal for clear communications, such as a bar, would have been obtained and desired, as expressly taught by Cassidy (col. 1, l. 11-18).

In the same field of invention, Rajendran teaches: automatic car audio volume control. Rajendran further teaches:
detecting, based on monitoring, occurrence of:
the movement of the mouth of the user (¶ [0025]) (e.g., detecting lip movement with camera);
and in response to detecting the occurrence of the movement of the mouth of the user
adapting rendering of user interface output of the client device, wherein adapting rendering of user interface output of the client device comprises: reducing a volume of rendering of audible user interface output rendered by the client device, or halting the rendering of the audible user interface output rendered by the client device (¶ [0024]) (e.g., volume level of the car audio system is reduced so that the speech by the passenger may be more easily heard).
Thus, it would have been obvious to one of ordinary skill in the art, having the teachings of Okubo, Cassidy, and Rajendran before the effective filing date of the invention, to have combined the teachings of Rajendran with the method, corresponding client device, and corresponding system, as taught by Okubo and Cassidy.
One would have been motivated to make such combination because a way to automatically turn down the volume of an audio signal in response to the start of a conversation would have been obtained and desired, as expressly taught by Rajendran (¶ [0003]).

As to claim 2, Okubo further shows:
The method of claim 1, wherein adapting audio data processing by the client device is also performed in response to detecting the occurrence of both the gaze of the user and the movement of the mouth of the user (¶ [0122]) (e.g., identifies the direction where the speech source (user) is, based on position information of the user, obtained by the camera or the like, and reduces ambient noise from the obtained sound using the direction of the speech source).

As to claim 3, Cassidy further teaches:
The method of claim 1, wherein adapting rendering of user interface output of the client device comprises:
reducing the volume of rendering of the audible user interface output rendered by the client device (col. 5, l. 20-24) (e.g., a device may decrease or increase noise reduction, for example, the volume of the device).
One would have been motivated to make such combination because a way to enable communication in environments that are not ideal for clear communications, such as a bar, would have been obtained and desired, as expressly taught by Cassidy (col. 1, l. 11-18).

As to claim 4, Cassidy further teaches:
The method of claim 3, further comprising:
performing voice activity detection of audio data that temporally corresponds with the movement of the mouth of the user;
determining occurrence of voice activity based on the voice activity detection of the audio data that temporally corresponds to the mouth movement of the user;
wherein reducing the volume of the audible user interface output rendered by the client device is further in response to determining the occurrence of voice activity, and based on the occurrence of the voice activity being for the audio data that temporally corresponds to the mouth movement of the user (col. 5, l. 3-24; claim 21) (e.g., automatically adjusting noise reduction based on detecting that the mouth area movement is producing oral communication, and on a comparison of a user's sound quality to that of noise).
One would have been motivated to make such combination because a way to enable communication in environments that are not ideal for clear communications, such as a bar, would have been obtained and desired, as expressly taught by Cassidy (col. 1, l. 11-18).

As to claim 5, Cassidy further teaches:
The method of claim 1, wherein adapting rendering of user interface output of the client device comprises:
halting the rendering of audible user interface output rendered by the client device (col. 5, l. 20-24; claim 21; col. 6, l. 38-44) (e.g., a device may decrease or increase noise reduction, for example, the volume of the device may be cancelled).
One would have been motivated to make such combination because a way to enable communication in environments that are not ideal for clear communications, such as a bar, would have been obtained and desired, as expressly taught by Cassidy (col. 1, l. 11-18).

As to claim 6, Cassidy further teaches:
The method of claim 5, further comprising:
performing voice activity detection of audio data that temporally corresponds with the movement of the mouth of the user;
determining occurrence of voice activity based on the voice activity detection of the audio data that temporally corresponds to the mouth movement of the user;
wherein halting the rendering of the audible user interface output rendered by the client device is further in response to determining the occurrence of voice activity, and based on the occurrence of the voice activity being for the audio data that temporally corresponds to the mouth movement of the user (col. 5, l. 3-24; claim 21; col. 6, l. 38-44) (e.g., automatically cancelling noise based on detecting that the mouth area movement is producing oral communication, and on a comparison of a user's sound quality to that of noise; cancelling noise may reasonably mean halting the output from the speakers, as one of ordinary skill in the art would understand).
One would have been motivated to make such combination because a way to enable communication in environments that are not ideal for clear communications, such as a bar, would have been obtained and desired, as expressly taught by Cassidy (col. 1, l. 11-18).

As to claim 8, Okubo further shows:
The method of claim 1, wherein adapting audio data processing by the client device is performed in response to detecting the occurrence of both the gaze of the user and the movement of the mouth of the user (¶ [0122]) (e.g., identifies the direction where the speech source (user) is, based on position information of the user, obtained by the camera or the like, and reduces ambient noise from the obtained sound using the direction of the speech source).

As to claim 9, Okubo further shows:
The method of claim 2, wherein adapting the audio data processing by the client device comprises initiating the transmission of audio data, captured via one or more microphones of the client device, to a remote server associated with the automated assistant (¶ [0128]) (e.g., the voice conversation unit 2143 issues a sound collection start command to the sound collection device 2013, thereby acquiring sound data including the user speech contents, and transfers the acquired sound data to the voice conversation server 2100).

As to claim 10, Okubo further shows:
The method of claim 9, further comprising:
performing voice activity analysis of certain audio data that temporally corresponds with the movement of the mouth of the user, the certain audio data being included in the audio data or preceding the audio data;
and determining occurrence of voice activity based on the voice activity analysis of the certain audio data that temporally corresponds to the mouth movement of the user;
wherein initiating the transmission of audio data is further in response to determining the occurrence of voice activity, and based on the occurrence of the voice activity being for the audio data that temporally corresponds to the mouth movement of the user (¶ [0128], [0129]) (e.g., the voice conversation unit 2143 issues a sound collection start command to the sound collection device 2013, thereby acquiring sound data including the user speech contents, and transfers the acquired sound data to the voice conversation server 2100; The voice conversation server 2100 identifies the speech contents from the sound data by analyzing the sound data, and uses the conversation dictionary 2101 to identify a control command from the speech contents).

As to claim 11, Okubo further shows:
The method of claim 2, wherein adapting the audio data processing by the client device in response to detecting the occurrence of both the gaze of the user and the movement of the mouth of the user comprises:
determining a position of the user, relative to the client device, based on one or more of the image frames (¶ [0160]) (e.g., the home gateway 1102 may determine the position of the user from information obtained from cameras installed within the group 1100);
using the position of the user in processing of audio data captured via one or more microphones of the client device (¶ [0122]) (e.g., identifies the direction where the speech source (user) is, based on position information of the user, obtained by the camera or the like, and reduces ambient noise from the obtained sound using the direction of the speech source).

As to claim 12, Okubo further shows:
The method of claim 11, wherein using the position of the user in processing of audio data captured via one or more microphones of the client device comprises using the position in isolating portions of the audio data that correspond to a spoken utterance of the user (¶ [0122]) (e.g., identifies the direction where the speech source (user) is, based on position information of the user, obtained by the camera or the like, and reduces ambient noise from the obtained sound using the direction of the speech source).

As to claim 13, Okubo further shows:
The method of claim 11, wherein using the position of the user in processing of audio data captured via one or more microphones of the client device comprises using the position in removing background noise from the audio data (¶ [0122]) (e.g., identifies the direction where the speech source (user) is, based on position information of the user, obtained by the camera or the like, and reduces ambient noise from the obtained sound using the direction of the speech source).

As to claims 22-24, Okubo further shows:
wherein detecting the occurrence of both the gaze of the user and the movement of the mouth of the user comprises: detecting co-occurrence of the gaze of the user and the movement of the mouth of the user, or detecting occurrence of the gaze of the user and the movement of the mouth of the user within a threshold temporal proximity of one another (¶ [0120], [0124]) (e.g., it should be noted that the order of the processing of steps S101 (determining object device based on content of speech) and S103 (determining object device based on line of sight) is exemplary, and processing may be performed in the opposite order from these, and at least part may be performed in parallel; in addition to detection of the line-of-sight (gaze tracking), lip detection may be used for determination regarding whether or not the user will start talking; lip detection is detecting mouth motions or lip actions of the user from images taken by cameras).

Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over Okubo et al. (US20160373269, hereinafter Okubo) in view of Cassidy et al. (US9263044, hereinafter Cassidy) in view of Rajendran (US20170026764, hereinafter Rajendran), further in view of Hernandez-Abrego et al. (US9250703, hereinafter H).

As to claim 7:
Okubo, Cassidy, and Rajendran show a method substantially as claimed, as specified above.
Cassidy further teaches:
wherein adapting the audio data processing by the client device comprises initiating local automatic speech recognition at the client device, or initiating transmission of audio data, captured via one or more microphones of the client device, to a remote server associated with the automated assistant (col. 6, l. 44-47) (e.g., audio data processing may be transmitted over the network to a remote server).
Okubo, Cassidy, and Rajendran fail to specifically show: wherein adapting rendering of user interface output of the client device comprises rendering a human perceptible cue;
and wherein initiating the local automatic speech recognition or initiating the transmission of audio data to the remote server is further in response to detecting the gaze of the user continues to be directed toward the one or more cameras of the client device following the rendering of the cue.
In the same field of invention, H teaches: interface with gaze detection and voice input. H further teaches: wherein adapting rendering of user interface output of the client device comprises rendering a human perceptible cue (col. 5, l. 24-29), wherein adapting audio data processing by the client device is performed in response to detecting the occurrence of both the gaze of the user and the movement of the mouth of the user (fig. 7, el. 724; fig. 6c);
and wherein initiating the local automatic speech recognition or initiating the transmission of audio data to the remote server is further in response to detecting the gaze of the user continues to be directed toward the one or more cameras of the client device following the rendering of the cue (fig. 7, el. 724; fig. 6c).
Thus, it would have been obvious to one of ordinary skill in the art, having the teachings of Okubo, Cassidy, Rajendran, and H before the effective filing date of the invention, to have combined the teachings of H with the method as taught by Okubo, Cassidy, and Rajendran.
One would have been motivated to make such combination because a way to provide a less awkward and more convenient manner to provide voice activity detection and microphone array processing would have been obtained and desired, as expressly taught by H (col. 1, l. 30-43).
One would have been motivated to make such combination because a way to enable communication in environments that are not ideal for clear communications, such as a bar, would have been obtained and desired, as expressly taught by Cassidy (col. 1, l. 11-18).

Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Okubo et al. (US20160373269, hereinafter Okubo) in view of Cassidy et al. (US9263044, hereinafter Cassidy) in view of Rajendran (US20170026764, hereinafter Rajendran), further in view of Scherer et al. (US9691411, hereinafter Scherer).


As to claim 14:
Okubo, Cassidy, and Rajendran show a method substantially as claimed, as specified above.
Okubo, Cassidy, and Rajendran fail to specifically show: wherein processing the image frames of the stream using at least one trained machine learning model stored locally on the client device to monitor for occurrence of both the gaze of the user and the movement of the mouth of the user comprises: using a first trained machine learning model to monitor for occurrence of the gaze of the user; and using a second trained machine learning model to monitor for the movement of the mouth of the user.

In the same field of invention, Scherer teaches: assessing suicide risk of a patient based on non-verbal characteristics. Scherer further teaches: wherein processing image frames of a stream using at least one trained machine learning model stored locally on the client device to monitor for occurrence of both the gaze of the user and the movement of the mouth of the user comprises: using a first trained machine learning model to monitor for occurrence of the gaze of the user; and using a second trained machine learning model to monitor for the movement of the mouth of the user (col. 3, l. 1-14; col. 7, l. 32-40) (e.g., the machine learning algorithms can be trained by looking at utterances and interviews; voice features such as voice quality and visual features such as gaze may be used to train different machine learning algorithms).
Thus, it would have been obvious to one of ordinary skill in the art, having the teachings of Okubo, Cassidy, Rajendran, and Scherer before the effective filing date of the invention, to have combined the teachings of Scherer with the method as taught by Okubo, Cassidy, and Rajendran.
One would have been motivated to make such combination because a way to assess suicide risk would have been obtained and desired, as expressly taught by Scherer (col. 1, l. 16-19).


Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over Okubo et al. (US20160373269, hereinafter Okubo) in view of Cassidy et al. (US9263044, hereinafter Cassidy) in view of Rajendran (US20170026764, hereinafter Rajendran), further in view of Nicholson et al. (US20060192775, hereinafter Nicholson).

As to claim 15:
Okubo, Cassidy, and Rajendran show a method substantially as claimed, as specified above.
Okubo, Cassidy, and Rajendran fail to specifically show: detecting, based on a signal from a presence sensor, that a human is present in an environment of the client device; and causing the one or more cameras to provide the stream of image frames in response to detecting that the human is present in the environment.
In the same field of invention, Nicholson teaches: using detected visual cues to change computer system operating states. Nicholson further teaches: detecting, based on a signal from a presence sensor, that a human is present in an environment of the client device; and causing the one or more cameras to provide the stream of image frames in response to detecting that the human is present in the environment (abstract).
Thus, it would have been obvious to one of ordinary skill in the art, having the teachings of Okubo, Cassidy, Rajendran, and Nicholson before the effective filing date of the invention, to have combined the teachings of Nicholson with the method as taught by Okubo, Cassidy, and Rajendran.
One would have been motivated to make such combination because a way to reduce power consumed by a display when a user is not looking at a display, as detected by a webcam, would have been obtained and desired, as expressly taught by Nicholson (abstract).



It is noted that any citation to specific pages, columns, lines, or figures in the prior art references and any interpretation of the references should not be considered to be limiting in any way.  A reference is relevant for all it contains and may be relied upon for all that it would have reasonably suggested to one having ordinary skill in the art. In re Heck, 699 F.2d 1331, 1332-33, 216 USPQ 1038, 1039 (Fed. Cir. 1983) (quoting In re Lemelson, 397 F.2d 1006, 1009, 158 USPQ 275, 277 (CCPA 1968)).

Response to Arguments
Applicant’s arguments have been fully considered but are not persuasive. Examiner reiterates that references to specific columns, figures or lines should not be limiting in any way. The entire reference provides disclosure related to the claimed invention. 
1) Applicant argues:
Applicant's attorney submits that cited para. [0124] of Okubo fails to render obvious "detecting, based on the monitoring, occurrence of both:" "line-of-sight" (the alleged "gaze ...") and "lip detection" (the alleged "movement of the mouth").
Using both "lip detection" and "line-of-sight" is not described in para. [0124] of Okubo. Rather, one sentence describes that a "determination is made regarding whether or not the user will start talking, using detection of the line-of-sight ..." and another sentence separately describes "determining whether the user will start to talk based on the lip actions".
Examiner disagrees.
Using both "lip detection" and "line-of-sight" is indeed described in Okubo, partly in para. [0124]. Fig. 3 of Okubo clearly shows identifying an object device by using both line of sight (step S103) and user speech (step S101). As explained by Applicant, the trigger for analyzing user speech can be "line-of-sight" and/or "lip detection". However, as clearly explained in para. [0120] of Okubo, "line-of-sight" and "speech content" analysis (which may involve line-of-sight and/or lip detection triggers) may be performed in parallel to identify an object device.
Therefore, Okubo, Cassidy, and Rajendran continue to render obvious claims 1, 16, and 17, including "detecting, based on the monitoring, occurrence of both:" "line-of-sight" ("gaze ...") and "lip detection" ("movement of the mouth").
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
Arakawa et al. 		[U.S. 20180246569]
White et al. 		[U.S. 20190187787]
Nguyen et al. 		[U.S. 20200167597]

THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Jordany Núñez whose telephone number is (571)272-2753.  The examiner can normally be reached on M-F 8:30 AM - 5 PM.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Abdullah Kawsar, can be reached on (571)270-3264.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).

/JORDANY NUNEZ/
Primary Examiner, Art Unit 2171
10/24/2022