DETAILED ACTION
1.	This communication is in response to the Amendments and Arguments (RCE) filed on 3/4/2022. Claims 1-4, 6-8, 10-14, 16-18, 20 are pending and have been examined. Claims 5, 9, 15, 19 are cancelled.
Response to Amendments and Arguments
2.	Applicant's arguments with respect to claim rejections under 35 U.S.C. 103 have been fully considered, but they are not persuasive. In particular, the applicant argues that the references do not teach “determine an accuracy associated with the acquired image information based on the context information .. the first speech recognition result is weighted based on the determined accuracy associated with the acquired image information .. determine an accuracy associated with the received voice signal based on the context information .. the second speech recognition result is weighted based on the determined accuracy associated with the voice signal.” In response, the examiner respectfully disagrees.
DOCKET No. SAMS07-97055    APPLICATION NO. 16/714,386    PATENT
Note that BASU teaches: [Hierarchical Template Matching] “Thus, based on the confidence measure <read on accuracy>, the probability module 30 decides which probability, i.e., the probability from the visual information path or the probability from the audio information path, to rely on more. This determination may be represented in the following manner: w1·vp + w2·ap. It is to be understood that vp represents a probability associated with the visual information, ap represents a probability associated with the corresponding audio information, and w1 and w2 represent respective weights. Thus, based on the confidence measure 32, the module 30 assigns appropriate weights to the probabilities. For instance, if the surrounding environmental noise level is particularly high <read on the audio context information>, i.e., resulting in a lower acoustic confidence measure, there is more of a chance that the probabilities generated by the acoustic decoding path contain errors. Thus, the module 30 assigns a lower weight for w2 than for w1, placing more reliance on the decoded information from the visual path. However, if the noise level is low and thus the acoustic confidence measure is relatively higher, the module may set w2 higher than w1. Alternatively, a visual confidence measure <read on the corresponding visual context information>, may be used. It is to be appreciated that the first joint use of the visual information and audio information in module 30 is referred to as decision or score fusion.”
Claim Rejections - 35 USC § 103
3.	Claims 1-4, 6-7, 10-14, 16-17, 20 are rejected under 35 U.S.C. 103 as being unpatentable over Kim, et al. (US 20110071830; hereinafter KIM) in view of Takayanagi, et al. (US 20170309275; hereinafter TAKAYANAGI), and further in view of Basu, et al. (US 6594629; hereinafter BASU).
As per claim 1, KIM (Title: Combined lip reading and voice recognition multimodal interface system) discloses “An electronic device comprising: a camera; a microphone; a display; a memory; and a processor (KIM, [0009], camera; [0049], microphone; [0010], application service screen; [0037], memory; [0036], processor) configured to:    
 receive an input for activating an intelligent agent service from a user while at least one application is executed (KIM, [Abstract], a combined lip reading and voice recognition multimodal interface system, which can issue a navigation operation instruction <read on activating an intelligent agent service/application> only by voice and lip movements), [ identify context information ] of the electronic device (KIM, [0009], receive an instruction in an environment where a voice recognizer does not work due to noise <read on context>. Also see TAKAYANAGI and BASU), 
control to acquire image information of the user through the camera [ based on the identified context information ] including information on the executed at least one application, [determine an accuracy associated with the acquired image information based on the context information ], detect movement of a user's lips included in the acquired image information to obtain a first speech recognition result, wherein [ the first speech recognition result is weighted based on the determined accuracy associated with the acquired image information ] (KIM, [0009], a lip reading system that effectively detects lips from a face image through a camera, suitably tracks lip movements, and suitably recognizes a voice instruction based on feature values of the lips, and then suitably combines the lip reading system with an audio-based voice recognition system such that lip reading using a camera image can suitably receive an instruction in an environment where a voice recognizer does not work due to noise <read on ‘information on the executed at least one application’>), 
control to receive a voice signal through the microphone, determine an accuracy associated with the received voice signal [ based on the context information ], analyze the voice signal to obtain a second speech recognition result, wherein [ the second speech recognition result is weighted based on the determined accuracy associated with the voice signal ] (KIM, [0012], a voice recognition unit that suitably recognizes voice from the input audio signal and calculates an estimated recognition accuracy),
[ recognize a combination speech based on the first speech recognition result and the second speech recognition result ], and perform a function of the executed at least one application corresponding to the recognized combination speech (KIM, [0012], a voice recognition and lip reading recognition result combining unit that suitably outputs the voice .”
KIM does not explicitly disclose “identify context information .. based on the identified context information ..” However, this feature is taught by TAKAYANAGI (Title: Method and apparatus for recognizing speech by lip reading). 
In the same field of endeavor, TAKAYANAGI teaches: [0119] “The signal to noise data and signal to brightness data can be obtained by the audio input device 314 and the video input device 316 together with the controller 306” and [0126] “one rule associated with the variable text conversion value can be a signal to noise ratio between the audio signal and a background noise is below a predetermined threshold. In this case, the controller can be configured to disable the audio input device.”
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of TAKAYANAGI in the system taught by KIM to determine the signal-to-noise or signal-to-brightness ratio to be used as context information for speech recognition mode selection.
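For illustration only, the threshold rules quoted from TAKAYANAGI can be sketched as follows. This is a minimal sketch and not the implementation of any cited reference: the names `Context` and `select_inputs`, the units, and the default threshold values are all hypothetical. The brightness rule mirrors the passage of TAKAYANAGI [0127] quoted in the rejections of claims 6 and 7.

```python
from dataclasses import dataclass


@dataclass
class Context:
    """Hypothetical container for the measured context information."""
    snr: float         # signal-to-noise ratio of the audio input (assumed unit: dB)
    brightness: float  # signal-to-brightness ratio of the video input (assumed scale: 0..1)


def select_inputs(ctx: Context,
                  snr_threshold: float = 10.0,
                  brightness_threshold: float = 0.3) -> tuple[bool, bool]:
    """Return (use_audio, use_video) for the given context.

    An input is disabled when its ratio is below the predetermined
    threshold. The threshold defaults are assumptions chosen only so
    that the example runs; the references do not state values.
    """
    use_audio = ctx.snr >= snr_threshold
    use_video = ctx.brightness >= brightness_threshold
    return use_audio, use_video


noisy_but_bright = select_inputs(Context(snr=3.0, brightness=0.8))
quiet_but_dark = select_inputs(Context(snr=25.0, brightness=0.1))
```

Under these assumed thresholds, a noisy but well-lit scene leaves only the video path (lip reading) enabled, and a quiet but dark scene leaves only the audio path enabled.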
KIM in view of TAKAYANAGI does not explicitly disclose “determine an accuracy associated with the acquired image information based on the context information .. the first speech recognition result is weighted based on the determined accuracy associated with the acquired image information .. (determine an accuracy associated with the received voice signal) based on the context information .. the second speech recognition result is weighted based on the determined accuracy associated with the voice signal .. recognize a combination speech based on the first speech recognition result and the second speech recognition result.” However, this feature is taught by BASU (Title: Methods and apparatus for audio-visual speech detection and recognition).
In the same field of endeavor, BASU teaches: [Hierarchical Template Matching] “Thus, based on the confidence measure <read on accuracy>, the probability module 30 decides which probability, i.e., the probability from the visual information path or the probability from the audio information path, to rely on more. This determination may be represented in the following manner: w1·vp + w2·ap. It is to be understood that vp represents a probability associated with the visual information, ap represents a probability associated with the corresponding audio information, and w1 and w2 represent respective weights. Thus, based on the confidence measure 32, the module 30 assigns appropriate weights to the probabilities. For instance, if the surrounding environmental noise level is particularly high <read on the audio context information>, i.e., resulting in a lower acoustic confidence measure, there is more of a chance that the probabilities generated by the acoustic decoding path contain errors. Thus, the module 30 assigns a lower weight for w2 than for w1, placing more reliance on the decoded information from the visual path. However, if the noise level is low and thus the acoustic confidence measure is relatively higher, the module may set w2 higher than w1. Alternatively, a visual confidence measure <read on the corresponding visual context information>, may be used. It is to be appreciated that the first joint use of the visual information and audio information in module 30 is referred to as decision or score fusion.”
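For illustration only, the weighted combination quoted from BASU (a visual probability vp and an audio probability ap combined with weights w1 and w2) can be sketched as follows. This is a minimal sketch under stated assumptions: the quoted passage does not say how the weights are derived from the confidence measure, so the linear mapping in `fusion_weights` (w2 equal to the acoustic confidence clamped to [0, 1], and w1 = 1 - w2) is hypothetical, as are all function names and the sample probabilities.

```python
def fusion_weights(acoustic_confidence: float) -> tuple[float, float]:
    """Map an acoustic confidence measure to (w1, w2).

    Assumption: w2 is the confidence clamped to [0, 1] and w1 = 1 - w2,
    so a noisy environment (low confidence) shifts reliance to the
    visual path. This mapping is illustrative, not taken from BASU.
    """
    w2 = min(max(acoustic_confidence, 0.0), 1.0)
    w1 = 1.0 - w2
    return w1, w2


def fuse(visual_probs: dict[str, float],
         audio_probs: dict[str, float],
         acoustic_confidence: float) -> dict[str, float]:
    """Score each candidate utterance as w1*vp + w2*ap."""
    w1, w2 = fusion_weights(acoustic_confidence)
    candidates = set(visual_probs) | set(audio_probs)
    return {
        c: w1 * visual_probs.get(c, 0.0) + w2 * audio_probs.get(c, 0.0)
        for c in candidates
    }


def recognize(visual_probs: dict[str, float],
              audio_probs: dict[str, float],
              acoustic_confidence: float) -> str:
    """Return the candidate with the highest fused score."""
    scores = fuse(visual_probs, audio_probs, acoustic_confidence)
    return max(scores, key=scores.get)


# Hypothetical per-path probabilities for two candidate utterances.
visual = {"go home": 0.7, "go on": 0.3}
audio = {"go home": 0.2, "go on": 0.8}

noisy_result = recognize(visual, audio, acoustic_confidence=0.2)
quiet_result = recognize(visual, audio, acoustic_confidence=0.9)
```

With these sample values, the low-confidence (noisy) case follows the visual path and the high-confidence (quiet) case follows the audio path, matching the behavior the quoted passage describes for w1 and w2.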

As per Claim 2 (dependent on claim 1), KIM in view of TAKAYANAGI and BASU further discloses “acquire noise around the electronic device through the microphone; and store information on the acquired noise around the electronic device as the context information (TAKAYANAGI, [0119], The signal to noise data and signal to brightness data can be obtained by the audio input device 314 and the video input device 316 together with the controller 306; KIM, [0049], microphone; [0037], memory <read on a ready mechanism to store any information>).” 
As per Claim 3 (dependent on claim 2), KIM in view of TAKAYANAGI and BASU further discloses “activate the camera based on the noise around the electronic device being higher than or equal to a preset value (TAKAYANAGI, [0126], one rule associated with the variable text conversion value can be a signal to noise ratio between the audio signal and a background noise is below a predetermined threshold. In this case, the controller can be configured to disable the audio input device; KIM, [0009], a combined lip reading and voice recognition multimodal interface system, which implements a lip reading system that effectively detects lips from a face image through a camera, suitably tracks lip movements, and suitably recognizes a voice instruction based on feature values of the lips, and then suitably combines the lip reading system with an audio-based voice recognition system such that lip reading using a camera image <read .”  
As per Claim 4 (dependent on claim 1), KIM in view of TAKAYANAGI and BASU further discloses “store at least one of a type or an execution state of the at least one application being executed as the context information; and activate the camera based on the at least one application being executed reproducing music or a video (KIM, [Abstract], issue a navigation operation instruction .. allowing a driver to look ahead during a navigation operation <read on the execution state of the navigation application as context>; [0037], memory <read on information storage>; [0010], an application service screen of a navigation system as an interactive system based on a suitable scenario; [0074], a locating screen, a routing screen, an actual road guide screen, etc. <read on reproducing video>; [0009], a lip reading system that effectively detects lips from a face image through a camera <read on the associated activation under any condition per system design choice>).”    
As per Claim 6 (dependent on claim 1), KIM in view of TAKAYANAGI and BASU further discloses “based on brightness of the acquired image information being equal to or lower than a preset value: recognize a voice recognition-based speech corresponding to the voice signal except for the user’s movement of the lips; and perform a function of the at least one executed application corresponding to the recognized voice recognition-based speech (TAKAYANAGI, [0127], one rule associated with the variable text conversion value can be a signal to brightness ratio is below a predetermined threshold. In this case, the controller can be configured to disable the video input device; KIM, [Abstract], a combined lip reading and voice recognition multimodal interface .”  
As per Claim 7 (dependent on claim 1), KIM in view of TAKAYANAGI and BASU further discloses “based on brightness of the acquired image information being lower than a preset value, display a user interface indicating failure of recognition of the movement of the user’s lips to the user through the display (TAKAYANAGI, [0127], one rule associated with the variable text conversion value can be a signal to brightness ratio is below a predetermined threshold. In this case, the controller can be configured to disable the video input device <read on failure of lips movement recognition, which can be broadly interpreted>; [0118], use the first dictation when video input device detects no lip movement; KIM, [0010], application service screen <read on a ready mechanism to display any information. Also see CUTLER below>).”
As per Claim 10 (dependent on claim 1), KIM in view of TAKAYANAGI and BASU further discloses “based on the intelligent agent service being activated, provide at least one piece of information on the at least one application (KIM, [Abstract], issue a navigation operation instruction only by voice and lip movements, thus allowing a driver to look ahead during a navigation operation <read on an intelligent agent service being activated> and reducing vehicle accidents related to navigation operations <read on the associated application information such as route guidance> during driving).”
Claims 11-14, 16-17, 20 (similar in scope to claims 1-4, 6-7, 10) are rejected under the same rationale as applied above for claims 1-4, 6-7, 10. 
4.	Claims 8, 18 are rejected under 35 U.S.C. 103 as being unpatentable over KIM in view of TAKAYANAGI and BASU, and further in view of Cutler, et al. (US 20040267521; hereinafter CUTLER).
As per Claim 8 (dependent on claim 1), KIM in view of TAKAYANAGI and BASU further discloses “based on a plurality of lips being detected based on the acquired image information: [identify the voice signal ] and movements of the plurality of lips; and [ display a user interface for distinguishing lips corresponding to the user from other lips through the display ] (KIM, [0009], a lip reading system that effectively detects lips from a face image through a camera, suitably tracks lip movements, and suitably recognizes a voice instruction based on feature values of the lips; [0010], application service screen).”
KIM in view of TAKAYANAGI and BASU does not explicitly disclose “identify the voice signal and movements of the plurality of lips .. display a user interface for distinguishing lips corresponding to the user from other lips through the display ..” However, this feature is taught by CUTLER (Title: System and method for audio/video speaker detection). 
In the same field of endeavor, CUTLER teaches: [Abstract] “The audio and video are inputted into a time-delay neural network that processes the data to determine which target is speaking. The neural network processing is based upon a correlation to detected mouth movement from the video data and audio sounds detected by the microphone,” [0015] “simultaneous speakers and background noise can be handled by first using a microphone array to beam form on each face detected and then evaluating the TDNN 191 or other type of display device is also connected to the system bus 121 via an interface, such as a video interface 190.”
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of CUTLER in the system taught by KIM, TAKAYANAGI and BASU to correlate received voice to one particular speaker from a plurality of users.
Claim 18 (similar in scope to claim 8) is rejected under the same rationale as applied above for claim 8. 
Conclusion 
5.	Any inquiry concerning this communication or earlier communications from the examiner should be directed to FENG-TZER TZENG whose telephone number is (571)272-4609. The examiner can normally be reached on M-F (8:00-5:30). The fax phone number for the organization where this application or proceeding is assigned is 571-273-4609.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre-Louis Desir (SPE) can be reached on (571)272-7799.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would 

/FENG-TZER TZENG/	3/9/2022

Primary Examiner, Art Unit 2659