DETAILED ACTION
This Final action is in response to an amendment filed 3/23/2021.  Currently, claims 1-7 are pending.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Interpretation
The phrase “the user and the device ... in a relative direct view state” was interpreted as the user looking directly at the device, per the Specification par. 23.
Claim Objections
Claims 1-5 are objected to because of the following informalities:  
Claims 1-5: 
Claim 1 last line in pg. 3 and first three lines in pg. 4
Claim 2 lines 3-6 [preamble]
Claim 3 lines 5-7
Claim 4 lines 3-6 [preamble] 
Claim 5 lines 3-6 [preamble] 
recite “performing, by the control device, the operation corresponding to the current behavior and the intention of the user according to the preset corresponding relationship between the current behavior and the intention of the user and the operation”, which appears to have been left in by mistake. Because this limitation was deleted from the recognizing step, it was not given patentable weight.  Appropriate correction is required.
Claim 2 last 4 lines in the step “when the time” recite “and performing, by the control device, the operation corresponding to the current behavior and the intention of the user according to the preset corresponding relationship between the current behavior and the intention of 

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2, 4-5 are rejected under 35 U.S.C. 103 as being unpatentable over Yamada in US 2015/0331490 (hereinafter Yamada) in view of Goodman et al. in US 2014/0176662 (hereinafter Goodman) and Kusaka et al. in US 2005/0001024 (hereinafter Kusaka).

Regarding claim 1, Yamada discloses a human-computer interaction method (Yamada’s Figs. 3, 20) based on a direct view (Yamada’s Figs. 7, 19), comprising steps of: acquiring direct view image data (Yamada’s Fig. 7a and par. 73-80: image frame when direct line of sight at start of voice section, and Figs. 19-20 par. 253-257) collected by a camera (Yamada’s Figs. 1-2 and par. 42, 73) when a user and a device are in a relative direct view state (Yamada’s Fig. 7 and par. 80: example (a) and Fig. 19); collecting current image data (Yamada’s Fig. 7a and par. 73-80: consecutive image frames, e.g. direct line of sight image frame at end of voice section) of the user (Yamada’s Fig. 8 per par. 96) by the camera 
Yamada fails to explicitly disclose the collecting of current image data of the user to be in real time, explicitly comparing the collected current image data with the direct view image data or the step of orienting a sound acquisition device towards the source position to be prior to the step of determining the user and the device being in the relative direct view state. Yamada also fails to disclose calling, by a control device, a video phone according to user identity and expression, or the visual recognition technology and the speech technology comprising voiceprint recognition, age recognition, card recognition, pupil recognition or iris recognition.
However, Yamada does disclose that the method occurs based on the time of speaking to enter a command to control a TV (Yamada’s Figs. 1, 4 and par. 43), that the target voice source is determined based on whether the user is viewing a specific region for the whole voice utterance (Yamada’s Figs. 19-20 and par. 240), and that the microphone array is directed to the user based on visual signals (Yamada’s par. 376). Thus, it would have been obvious to one of ordinary skill in the art that the collection of current image data would be in real time, to obtain the predictable result of entering commands as the TV is being watched (Yamada’s par. 43); that the collected current image data and the direct view image data are compared, in order to obtain the predictable result of determining that the user is viewing the specific area the whole time of utterance (Yamada’s par. 240); and that the step of orienting the microphone towards the source position would be prior to the step of determining the user and the device being in the relative direct view state, in order to obtain the predictable result of orienting the microphone based on data already collected (at steps S502 to S504 of Yamada’s Fig. 20) before the voice utterance recognition (at last step of Yamada’s Figs. 3, 6 and par. 43) to have high accuracy even with a high level of noise (Yamada’s par. 69).

Still, Yamada in view of Goodman fail to disclose the visual recognition technology and the speech technology comprising voiceprint recognition, age recognition, card recognition, pupil recognition or iris recognition. However, in the related field of user identification, Kusaka discloses identifying a user by recognizing voiceprint (Kusaka’s par. 263), age (Kusaka’s par. 269, 273), card (Kusaka’s par. 157), pupil (Kusaka’s par. 263) and iris (Kusaka’s par. 272). Therefore, it would also have been obvious to one of ordinary skill in the art to include these technologies in Yamada in view of Goodman’s visual recognition and speech technology, in order to obtain the benefit of user identification by using a card (Kusaka’s par. 157), voiceprint and pupils (Kusaka’s par. 263), irises (Kusaka’s par. 272) and age (Kusaka’s par. 269), given that Yamada already discloses identifying the user (Yamada’s par. 297). By doing such combination, Yamada in view of Goodman and Kusaka disclose:
A human-computer interaction method (Yamada’s Figs. 3, 20) based on a direct view (Yamada’s Figs. 7, 19), comprising steps of: 
acquiring direct view image data (Yamada’s Fig. 7a and par. 73-80: image frame when direct line of sight at start of voice section, and Figs. 19-20 par. 253-257) collected by an image acquisition device (112f: maps to camera per instant spec. par. 23)(Yamada’s Figs. 1-2 and par. 
collecting current image data (Yamada’s Fig. 7a and par. 73-80: consecutively captured image frames, e.g. image frame when direct line of sight at end of voice section and Figs. 19-20 par. 253-257) of the user in real time (Yamada’s par. 73: image frames consecutively captured and obvious for TV watching and changing channel of Yamada’s par. 43) by the image acquisition device (Yamada’s Figs. 1-2 and par. 42, 73), and comparing the collected current image data with the direct view image data (Yamada’s par. 264-265: obvious to determine user has been viewing the specific position from start to end);
determining the user and the device being in the relative direct view state (Yamada’s par. 264-265) when the collected current image data is consistent with the direct view image data (Yamada’s par. 264-265: user is viewing specific condition at start and end); and 
recognizing behavior (Yamada’s Figs. 7, 19 and par. 73-80, 253-257: whether user is looking at a specific area) and intention of the user (Yamada’s par. 3, 43: utterance that results in processing) by a visual recognition technology and a speech recognition technology (Yamada’s Fig. 3) of a computer (Yamada’s Figs. 1-3 and par. 323) when the user and the device are in the relative direct view state (Yamada’s Fig. 7 and par. 80: example (a) and Fig. 19), 
calling (Goodman’s Figs. 6, 8), by a control device, a video phone (Goodman’s par. 22 and Fig. 11: video call initiated by the user) according to user identity (Goodman’s Fig. 11 and par. 22: initiated by user, par. 29, 56, 79: user ID for video call) and expression (Goodman’s Figs. 6, 8, 12 and par. 22: facial expression in video stream),
wherein the visual recognition technology and the speech recognition technology (Yamada’s Fig. 3) of the computer comprises face recognition (Yamada’s Fig. 3 and par. 73), speech recognition (Yamada’s Fig. 3 and par. 68: voice extraction and recognition), semantic understanding (Yamada’s Fig. 3 and par. 68: expression according to dictionary), gesture 
wherein when the collected current image data is consistent with the direct view image data (Yamada’s par. 264-265: user is viewing specific condition at start and end), prior to the step of determining the user and the device being in the relative direct view state (Yamada’s par. 264-265: Yes out of step S508 in Fig. 20), the method further comprises: 
locating a face position of the user (Yamada’s Fig. 20 and par. 82-83, 250 step S503: lip region) as a sound source position (Yamada’s par. 250) when the user is detected (Yamada’s par. 43, when a user gives an utterance then the process takes place); and 
orienting a sound acquisition device (112f: maps to microphone per instant spec. par. 51)(Yamada’s par. 42) towards the sound source position (Yamada’s par. 360, 376: microphone array directed at user [sound source position]. Obvious before step S508 to orient microphones for voice recognition per Yamada’s par. 43, 69); 
wherein the step of recognizing the behavior and the intention of the user by the visual recognition technology and the speech recognition technology of the computer comprises: 
collecting user sound data by the sound acquisition device (Yamada’s par. 42-44: voice collected by microphone); when the collected user sound data carries a speech operation instruction (Yamada’s par. 3, 43: utterance), 
extracting the speech operation instruction when the collected user sound data carries the speech operation instruction (Yamada’s par. 43-44), and performing, by the control device (Yamada’s Figs. 1-3 and par. 323), an operation corresponding to the speech operation instruction (Yamada’s par. 3, 43). 

Regarding claim 2, Yamada in view of Goodman and Kusaka disclose wherein the step of recognizing the behavior and the intention of the user by the visual recognition technology and the speech recognition technology of the computer comprises: 
counting time that the user and the device are in the relative direct view state (Yamada’s par. 240, counting time is required for conditions 1 and 3); 
when the time that the user and the device are in the relative direct view state is greater than a preset time (Yamada’s par. 240: whole utterance or 2 seconds), recognizing the behavior (Yamada’s Figs. 7, 19 and par. 73-80, 253-257: whether user is looking at a specific area) and the intention of the user (Yamada’s par. 3, 43: utterance that results in processing) by the visual recognition technology and the speech recognition technology of the computer (Yamada’s Fig. 3), and performing, by the control device, an operation corresponding to the current behavior (Yamada’s Figs. 7, 19 and par. 73-80, 253-257: user is looking at a specific area) and the intention of the user (Yamada’s par. 3, 43: utterance that results in processing) according to a preset corresponding relationship between the current behavior and the intention of the user and the operation (Yamada’s Figs. 19-20 and par. 3, 43, 264-266 looking at a specific area, uttering a command and changing the channel). 
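
For illustration only, the dwell-time condition mapped above (the direct view state sustained for longer than a preset time before recognition is acted on) can be sketched as follows. This is a minimal sketch and is neither Yamada's implementation nor the claimed method: the class name `DwellGate`, the parameter `preset_seconds`, and the per-frame boolean `in_direct_view` are hypothetical, and the 2-second default only mirrors the example cited from Yamada’s par. 240.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DwellGate:
    """Tracks how long the user has continuously been in the direct view state."""

    preset_seconds: float = 2.0          # hypothetical preset time
    _started_at: Optional[float] = None  # time the current direct-view run began

    def update(self, in_direct_view: bool, timestamp: float) -> bool:
        """Feed one frame; return True once the dwell time exceeds the preset time."""
        if not in_direct_view:
            # Looking away resets the count.
            self._started_at = None
            return False
        if self._started_at is None:
            self._started_at = timestamp
        return (timestamp - self._started_at) > self.preset_seconds
```

Under these assumptions, a caller would feed each frame's direct-view decision and timestamp to `update` and run recognition only on frames for which it returns True.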

Regarding claim 4, Yamada in view of Goodman and Kusaka disclose wherein the step of recognizing the behavior and the intention of the user by the visual recognition technology and the speech recognition technology of the computer comprises: 
performing the speech recognition (Yamada’s par. 42-43) and the lip recognition to the user (Yamada’s par. 82-83); when a speech recognition result is consistent with a lip recognition result (Yamada’s Fig. 21 and par. 290: lip motion information coincides with voice source direction), responding, by the control device, to the speech operation of the user (Yamada’s par. 43). 

Regarding claim 5, Yamada in view of Goodman and Kusaka disclose wherein the step of recognizing the behavior and the intention of the user by the visual recognition technology and the speech recognition technology of the computer comprises: 
performing the speech recognition (Yamada’s Fig. 3) and the semantic understanding (Yamada’s par. 43) of the user; 
when a speech recognition result (Yamada’s Fig. 20: voice recognition at steps S501-S504) and a semantic understanding result (Yamada’s par. 43) are consistent with a current scene of the device (Yamada’s Fig. 19 and par. 48, 240: scene of camera is user looking at specific area), responding, by the control device, to the speech operation of the user (Yamada’s par. 143). 

Claims 3 and 6 are rejected under 35 U.S.C. 103 as being unpatentable over Yamada in view of Goodman and Kusaka as applied above, in further view of Benea et al. in US 8,965,170 (hereinafter Benea).

Regarding claim 3, Yamada in view of Goodman and Kusaka fail to explicitly disclose wherein when the time that the user and the device are in the relative direct view state is greater than the preset time (Yamada’s par. 240: whole utterance or 2 seconds), after the step of recognizing the behavior and the intention of the user by the visual recognition technology and the speech recognition technology of the computer, the method further comprises: finding preset video image data matching with user identity, and displaying, by the control device, the found video image data. However, in the same field of endeavor of automatic control of a device based on facial recognition, Benea discloses using a user identifier to play preset content (Benea’s Figs. 5, 7, col. 2 lines 63-67 and col. 9 lines 17-63). Therefore, it would have been obvious to one of ordinary skill in the art to use Benea’s teachings in Yamada in view of Goodman and Kusaka’s method, in order to obtain the benefit of transition of content among 
wherein when the time that the user and the device are in the relative direct view state is greater than the preset time (Yamada’s par. 240: whole utterance or 2 seconds), after the step of recognizing the behavior and the intention of the user by the visual recognition technology and the speech recognition technology of the computer (Yamada’s Figs. 1-3 and par. 323. This is equivalent to Benea’s Fig. 5 and col. 7 lines 47-48), the method further comprises: 
finding preset video image data (Benea’s Fig. 7 and col. 9 lines 28-35) matching with user identity (Benea’s col. 9 lines 28-57), and displaying, by the control device, the found video image data (Benea’s Fig. 7 and col. 9 lines 49-63).

Regarding claim 6, Yamada in view of Goodman and Kusaka fail to disclose wherein when the collected current image data is consistent with the direct view image data, after the step of determining the user and the device being in the relative direct view state, the method further comprises: receiving an operation instruction inputted by the user, the operation instruction comprising a non-direct view state operation instruction and a direct view state operation instruction; responding to the non-direct view state operation instruction inputted by the user when detecting the user being no longer in the direct view state; and responding to the direct view state operation instruction inputted by the user when detecting the user being in the direct view state again. However, in the same field of endeavor of automatic control of a device based on facial recognition, Benea discloses receiving an operation instruction inputted by the user (Benea’s Fig. 5 and col. 7 lines 39-40, 63-64: power-on 1st device or 2nd device), the operation instruction comprising a non-direct view state operation instruction (Benea’s Fig. 5 and col. 7 lines 63-64: the power-on of the 2nd device is not in the direct view state of the 1st device when devices are in different rooms such as in Figs. 2) and a direct view state operation instruction (Benea’s Fig. 5 and col. 7 lines 39-63: power on 1st device); responding to the non-
wherein when the collected current image data is consistent with the direct view image data (Yamada’s par. 264-265: user is viewing specific condition at start and end), after the step of determining the user and the device being in the relative direct view state (Yamada’s Fig. 7 and par. 80: example (a) and Fig. 19), the method further comprises: 
receiving an operation instruction inputted by the user (Benea’s Fig. 5 and col. 7 lines 39-40, 63-64: power-on 1st device or 2nd device), the operation instruction comprising a non-direct view state operation instruction (Benea’s Fig. 5 and col. 7 lines 63-64: the power-on of the 2nd device is not in the direct view state of the 1st device when devices are in different rooms such as in Figs. 2) and a direct view state operation instruction (Benea’s Fig. 5 and col. 7 lines 39-40: power on 1st device which is in direct view of Yamada’s Figs. 7 and 19 to enter a command); 
responding to the non-direct view state operation instruction inputted by the user when detecting the user being no longer in the direct view state (Benea’s transition content to Fig. 4B from Fig. 4A and Fig. 5 steps 510-512 per col. 7 lines 63-67); and 
responding to the direct view state operation instruction (Benea’s Fig. 5 and col. 7 lines 43-55: viewing content when face recognized, equivalent to Yamada’s Figs. 7 and 19) inputted . 

Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over Yamada in view of Goodman and Kusaka as applied above, in further view of Hinde et al. in US 2002/0105575 (hereinafter Hinde).
Regarding claim 7, Yamada in view of Goodman and Kusaka disclose wherein after the step of acquiring current image data of the user in real time by the image acquisition device (Yamada’s Fig. 7a and par. 73-80: image frame when direct line of sight at start of voice section, and Figs. 19-20 par. 253-257. After this, there are other iterations of the method), the method further comprises: 
acquiring the image data when the user and the device are in the direct view state (Yamada’s Fig. 7a and par. 73-80: image frame when direct line of sight at start of voice section, and Figs. 19-20 par. 253-257); 
comparing the image data when the user and the device are in the direct view state with the collected current image data (Yamada’s par. 264-265: obvious to determine user has been viewing the specific position from start to end).
Yamada in view of Goodman and Kusaka fail to disclose when the image data when the user and the device are in the direct view state is consistent with the collected current image data, activating the visual recognition technology and the speech recognition technology of the computer, or a preset operation comprising recording and playing a video. However, in the same field of endeavor of voice control over apparatus, Hinde discloses activating a speech recognition unit after collecting image data (Hinde’s par. 24, 26). Therefore, it would have been obvious to one of ordinary skill in the art at the time of filing to use Hinde’s teaching in Yamada in view of Goodman and Kusaka’s invention, in order to obtain the benefit of enabling only one device at a time responsive to voice control (Hinde’s par. 28). By doing such combination, Yamada in view of Goodman, Kusaka and Hinde disclose:
.

Response to Arguments
Applicant's arguments filed 3/23/2021 have been fully considered but they are not persuasive. On the Remarks pgs. 8-9, Applicant argues that Yamada fails to disclose “comparing the collected current image data with the direct view image data” because Yamada’s determination as to whether the user is viewing the specific position is performed based on the face estimation information or the line-of-sight direction estimation information and not on the comparing as claimed. The Office respectfully disagrees. As explained in the rejection above, in order to determine that the user is looking at a specific area for a whole voice utterance (Yamada’s par. 264-265 and Figs. 19-20: start and end), it would have been obvious for the system to compare each captured image frame (Yamada’s Fig. 7a and par. 73-80: consecutively captured image frames, e.g. at least at the end and start of voice utterance) to determine that the line of sight is in the specific area (Yamada’s Fig. 7a and par. 73-80: image frame when direct line of sight at specific area). In other words, the direct view image data can be the image captured when the line of sight is direct at the start of the voice utterance, and the current image data can be the image captured when the line of sight is direct at the end of the voice utterance.
With respect to the added limitation of calling a video phone according to user identity and expression, please see newly cited reference to Goodman.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
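
For illustration only, the timing rule stated in the preceding paragraph can be sketched as a small date calculation. This is a minimal sketch under stated assumptions, not a docketing tool: the function names `add_months` and `shortened_period_end` are hypothetical, the dates in the usage note are invented, and the sketch ignores weekend and federal holiday adjustments and any extension of time.

```python
import calendar
from datetime import date
from typing import Optional


def add_months(d: date, months: int) -> date:
    """Return d shifted by whole months, clamping to the last day of the month."""
    idx = d.month - 1 + months
    year = d.year + idx // 12
    month = idx % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def shortened_period_end(
    mailing: date,
    first_reply: Optional[date] = None,
    advisory_mailed: Optional[date] = None,
) -> date:
    """End of the shortened statutory period as described in the paragraph above."""
    three_months = add_months(mailing, 3)
    six_months = add_months(mailing, 6)
    end = three_months
    # If the first reply came within two months and the advisory action was
    # mailed only after the three-month date, the period runs to the advisory
    # action's mailing date.
    if (
        first_reply is not None
        and advisory_mailed is not None
        and first_reply <= add_months(mailing, 2)
        and advisory_mailed > three_months
    ):
        end = advisory_mailed
    # In no event later than six months from the mailing date.
    return min(end, six_months)
```

With a hypothetical mailing date of `date(2021, 6, 1)`, a first reply on `date(2021, 7, 15)`, and an advisory action mailed on `date(2021, 9, 20)`, the function returns the advisory action's mailing date; with no reply it returns the three-month date.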
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Liliana Cerullo whose telephone number is (571)270-5882.  The examiner can normally be reached on 8AM to 3PM MT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Amr Awad can be reached on 571-272-7764.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR 






/LILIANA CERULLO/Primary Examiner, Art Unit 2621