DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
In response to the office action of 10/13/2021, the applicant submitted an amendment, filed 1/10/2022, amending claims 1-3, 5, 7, and 10-16 and cancelling claim 4, while traversing the prior art rejections. Applicant’s arguments have been fully considered but are moot in view of the new grounds of rejection, which further rely on Fouillade et al. (US 2012/0316676) and were necessitated by the latest amendments.
Response to Arguments
In what follows, applicant’s arguments and comments are addressed in the order presented; each argument is summarized in a paragraph and followed by one or more paragraphs of response.
Page 13 ¶’s 1-2 provide a broad overview of the latest amendments and their support, as well as of the last office action, without presenting any arguments.
Page 13 ¶ 3 discusses the previous 112(b) rejections of claims 13 and 15.
Due to the latest amendments the said rejections are withdrawn.
Page 14 ¶ 1 discusses the previous 101 rejection of claim 16.
Due to the latest amendments the said rejection is now withdrawn.
From the 2nd ¶ on page 14 to the end of the 3rd ¶ on page 15, it is argued that the primary reference Nakadai et al. fails to teach the features added by the latest amendments.
Since a new reference is relied upon for those amendments, the applicant is respectfully directed to the present office action for further details.
Sections B and A on pages 15-16 assert that dependent claims are allowable by dependence on their presumed allowed parent claims.
Since applicants have not argued the merits of these dependent claims, but assert patentability solely through their dependence on the allegedly patentable parent claims, they stand or fall with said parent claims and hence no further response to applicant’s arguments is necessary.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-3, 5-11, and 13-16 are rejected under 35 U.S.C. 103 as being unpatentable over Nakadai et al. (US 2004/0104702), and further in view of Fouillade et al. (US 2012/0316676).

Regarding claim 1, Nakadai et al. do teach an information processing apparatus, comprising:
a voice processing unit configured to execute a voice recognition process based on a user utterance (Title: “ROBOT” (an information processing apparatus) “AUDIOVISUAL SYSTEM”; Abstract lines 8+ teach an “audition module” (a voice processing unit of the “ROBOT”) which receives “sound” using “microphones”, which according to ¶ 0156 sentence 1 is sent to “speech (voice) recognition circuit” (for voice recognition process of any utterance received from the “microphones”)),
wherein
the voice processing unit includes: a sound source direction/voice section designation unit configured to designate a sound source direction and a voice section of the user utterance (¶ 0047: “audition module, the vision module” “cooperate” “to allow determining the direction of each speaker on the basis of the directional information of locating the sound source from the auditory event and locating the speaker from the visual event” (“audition module” “vision module” (the voice processing unit) includes a sound source direction/voice section as it can “determin[e]” (designate) sound source direction of voice of a “speaker” (user) utterance); Abstract line 11: “auditory event” (voice section) is “extract[ed]” using the “identifi[ed]” “sound source”; ¶ 0037: “audition module” “to identify the individual speakers as” “sound sources” “and then extracts their own auditory events” (to also determine “speakers” (user) voice sections, e.g. recognizing their utterance, such as Fig. 10 depicting a human robot dialog interaction)),
wherein the designation is based on a line of sight of its user, who has executed the user utterance, is at a specified area (¶ 0144 sentence before last: “face locator 33 determines the direction” (determining user’s line of sight) “in which the identified face lies”; ¶ 0252 last 7 lines: “the association module upon locating the source of the auditory event and locating the face of the visual event” (e.g. upon determining the “participant” (user responsible for the utterance) looking toward the “robot” (a predetermined specified area)) “determines the direction in which the speaker is present and forms an auditory and a visual stream and an associated stream therefor” (executes a designation process); e.g. according to ¶ 0253 lines 4+: “directing the robot to face opposite to the object speaker”);
and
a voice recognition unit configured to execute a voice recognition process that corresponds to voice data in the sound source direction and the voice section (¶ 0193 last 7 lines: “the speech recognition circuit 55” (a voice recognition unit) “capable of recognizing a speech” (directed at “speech” (voice section)) for the processes including interaction between the “robot” and a “participant”, e.g. “robot” “ask[s]” “Good afternoon, Mr. XXX?” (¶ 0194) and “Participant P answers” “My name is XXX” (¶ 0197), where all the input by the “participant” is collected by “microphone” (part of the “Audition module” (sound source direction/voice section designation unit) as disclosed in the Abstract lines 8+));
an image processing unit configured to:
accept a camera-captured image, and identify, based on the camera captured image, the user included in the camera captured image (¶ 0083: “vision module” (an image processing unit) “on the basis of an image taken by a camera” (accepts an input of a camera’s captured image) “is allowed to identify by face” (to identify the user on a basis of the camera image) “each such speaker”); and
a display information generation unit that displays a character image corresponding to the identified user, wherein the character image is displayed in the specified area (the said “robot” comprises of a “display” (¶ 0054 page 5 last 5 lines) which according to ¶ 0057: “the said display preferably includes a visual display for displaying” “an image” (a character image) “of an extracted face”(corresponding to the “participant” (user) who was identified) on the “robot” (specified area) “display”)).
Nakadai et al. do not specifically disclose:
The character image is different from the camera captured image.
Fouillade et al. do teach:
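The character image is different from the camera captured image (¶ 0030 sentence 4: “When the robot 102 subsequently recognizes the user 106 (through face or voice recognition) or is otherwise interacting the user 106, the robot 102 can display the avatar selected by the user” (based on “face” and “voice” “recognition” (e.g. by capturing image as well as voice of the user), the robot displays an “avatar” (a character different from the “face” (the camera captured image)) “on the display screen of the robot 102” (while interacting with the user); note that step “504” and “506” teach “CAMERA TO CAPTURE” “IMAGES” of “AN INDIVIDUAL”).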
It would have therefore been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the “voice” and “face” “recognition” and “display” functions of the “robot” of Fouillade et al. into the respective ones of the “robot” of Nakadai et al., since doing so would enable the combined systems and their associated methods to perform in combination as they do separately, and would further enable Nakadai et al. to use the “avatar” image on the robot so that the robot can “act as a proxy of the user” as disclosed in Fouillade et al. ¶ 0030 last sentence.

Regarding claim 2, Nakadai et al. do teach the information processing apparatus according to claim 1, wherein the voice recognition unit is further configured to execute the voice recognition process based on the line of sight of the, user who has executed the user utterance, at the specified area (¶ 0193 lines 5+: “robot 10 keeps looking toward the participant” “P” (e.g. as the “participant” (user) looks at the “robot” (specified area)) “association module 60 gives an input to the speech recognition” “Then”(under that condition) “the speech recognition” [is] “capable of recognizing a 

Regarding claim 3, Nakadai et al. do teach the information processing apparatus according to claim 1, wherein the
image processing unit is further configured to determine whether the line of sight of the user is at the specified area based on the camera captured image (Abstract last 8 lines: “upon” “locating the face for the visual event” (upon accepting input of a camera captured image) “thereby determining” (it determines) “the direction” (e.g. where the “face” line of sight is looking at (e.g. the “robot” (the specified area)); note the “visual event” is associated with “image taken by a camera” (Abstract lines 13-15) , and “motor control module” “directs the robot to face opposite to the object speaker” (¶ 0045 sentence 1); i.e., one “direction” in the “robot” “speaker” interaction is attributed to the 

Regarding claim 5, Nakadai et al. do teach the information processing apparatus according to claim 1, wherein the display information generation unit is further configured to alter the character image displayed in the specified area, based on the line of sight of user at the specified area (¶ 0145 lines 5+: “detected faces” (images) “change” (are altered) “in their size, direction” (based on e.g., a user’s “face” “direction” (line of sight) or as his “face” is turned away from the “robot” e.g., its image “size” is “changed” (altered)); note also according to ¶ 0193 lines 6+: “even if the participant P moves” “vision module 30 is allowed to continued imaging the participant” (i.e., as “P” moves his “face” away from the robot’s camera clearly his “image” (the user corresponding image) will be altered from when he directly “faces” the “robot” (is looking at the specified area)).

Regarding claim 6, Nakadai et al. do teach the information processing apparatus according to claim 1, wherein the specified area
includes a character image area included in an output image of the information processing apparatus (the said “robot” comprises of a “display” (¶ 0054 page 5 last 5 lines) which according to ¶ 0057 lines 2+: “the said display includes a visual display for 

Regarding claim 7, Nakadai et al. do teach the information processing apparatus according to claim 6, wherein the display information generation unit is further configured to display the character image in the character image area  (the said “robot” comprises of a “display” (¶ 0054 page 5 last 5 lines) which according to ¶ 0057 lines 2+: “the said display includes a visual display for displaying” “an image of an extracted face” (i.e., the “robot” (specified area) includes a character image area which displays the “face” (a character image of) each “participant” or “speaker” (user) on the “display” (the character image area)).

Regarding claim 8, Nakadai et al. do teach the information processing apparatus according to claim 1, wherein the specified area
includes an image area of an output image of the information processing apparatus (Abstract last 5 lines: “The system” (i.e., the “robot” (specified area)) “includes a display” (includes an image area) “for displaying at least a portion of auditory” (i.e., for displaying an output image which in the “robot” “participant” example is associated with that of the “participant” “face” (¶ 0057 lines 2+))).


Regarding claim 9, Nakadai et al. do teach the information processing apparatus, wherein the specified area
includes an apparatus area of the information processing apparatus (Abstract last 5 lines: “The system” (i.e., the “robot” (specified area)) “includes a display” (includes an apparatus area) “for displaying at least a portion of auditory” (i.e., for displaying an output image which in the “robot” “participant” example is associated with that of the “participant” “face” (¶ 0057 lines 2+))).

Regarding claim 10, Nakadai et al. do teach the information processing apparatus according to claim 1, wherein the sound source direction/voice section designation unit is further configured to: accept inputs of two types of detection results (“audition module” “vision module” (the voice processing unit of the information processing apparatus) “cooperate” for “determining the direction of each such speaker” (¶ 0047 lines 1-5) comprises of),
Wherein
The two types of detection results include first detection results for the sound source direction and the voice section, and the second detection results for the sound source direction and the voice section (Abstract lines 8+: “the audio module (20)” (first detection means) can “locate sound sources” (“determin[es] direction of each such speaker” (¶ 0047 line 4)); Abstract lines 13+: “The vision module” (second means of 
The first detection results for the sound source direction and the voice section are based on an input voice (“audition module” which according to the Abstract line 8+: “the audition module (20)” (the first means of detection) “in response to sound signals from microphones” (detects input voice and thus voice section) “extracts pitches therefrom” “separate” “locate sound sources” (to finally using those data associated with the “audition module” for “determining the direction of each such speaker” (for detecting sound source direction and  voice section (¶ 0047 line 4))), and
The second detection results for the sound source direction and the voice section are based on the camera captured image, and designate the sound source direction and the voice section of the user utterance based on the two types of detection results (“vision module” (second means of detection) according to the Abstract lines 13+: “The vision module (30) on the basis of an image taken by a camera” (detects input image based on camera) “identifies by face, and locate, each such speaker” (to use the said data associated with the “vision module” for “determining the direction of each speaker” (for detecting sound source direction  (¶ 0047 line 4)); ¶ 0252 last sentence:   “the association module upon locating the sound source of the auditory event and locating the face of the visual event” “determines the direction” (the sound source 

Regarding claim 11, Nakadai et al. do teach the information processing apparatus according to claim 10, wherein the first detection results for the sound source direction and the voice section based on the input voice include information from an analysis result for a voice signal acquired by a microphone array (Fig. 3 shows “microphone set 16”(a microphone array installed in the robot head); ¶ 0191 lines 4+: “the microphone set 16” (the microphone array that receives input voice and voice section) “picking up the voice of the participant P, and the audition module 20 forming an auditory event 28” (inputs are analyzed by the voice processing unit to determine the “auditory event” (voice section)) “that identifies the direction” (and to determine or detect the sound source direction) “of the voice (sound) source” ).

Regarding claim 13, Nakadai et al.’s main embodiment does teach an information processing system comprising a user terminal and a data processing server (¶ 0016: “In this robot visuoauditory system” (an information processing system comprising) “according to the present invention, preferably the said association module is made a server” (a server) “and each of the said audition, vision and motor control modules are made a client” (and a user terminal) “connected to the said server”), wherein

A voice input unit configured to input a user utterance (Abstract lines 8+: “the audition module” (a voice input unit) “in response to sound signals from microphones” (that inputs a user utterance) “extracts pitches therefrom”);
And an image input unit configured to input a camera captured image (Abstract lines 13+: “The vision module” (an image input unit) “on the basis of an image taken by a camera” (to input a user image by a camera) “identifies by face”),
a voice processing unit configured to execute a voice recognition process based on the user utterance received from the user terminal (Title: “ROBOT” (an information processing apparatus) “AUDIOVISUAL SYSTEM”; Abstract lines 8+ teach an “audition module” (a voice processing unit of the “ROBOT”) which receives “sound” using “microphones”, which according to ¶ 0156 sentence 1 is sent to “speech (voice) recognition circuit” (for voice recognition process of any utterance received from the “microphones”)),
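wherein the voice processing unit includes: a sound source direction/voice section designation unit configured to designate a sound source direction and a voice section of the user utterance (¶ 0047: “audition module, the vision module” “cooperate” “to allow determining the direction of each speaker on the basis of the directional information of locating the sound source from the auditory event and locating the speaker from the visual event” (“audition module” “vision module” (the voice processing unit) includes a sound source direction/voice section as it can “determin[e]” (designate) sound source direction of voice of a “speaker” (user) utterance); Abstract line 11: “auditory event” (voice section) is “extract[ed]” using the “identifi[ed]” “sound source”; ¶ 0037: “audition module” “to identify the individual speakers as” “sound sources” “and then extracts their own auditory events” (to also determine “speakers” (user) voice sections, e.g. recognizing their utterance, such as Fig. 10 depicting a human robot dialog interaction)),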
wherein the designation is based on a line of sight of its user, who has executed the user utterance, is at a specified area (¶ 0144 sentence before last: “face locator 33 determines the direction” (determining user’s line of sight) “in which the identified face lies”; ¶ 0252 last 7 lines: “the association module upon locating the source of the auditory event and locating the face of the visual event” (e.g. upon determining the “participant” (user responsible for the utterance) looking toward the “robot” (a predetermined specified area)) “determines the direction in which the speaker is present and forms an auditory and a visual stream and an associated stream therefor” (executes a designation process); e.g. according to ¶ 0253 lines 4+: “directing the robot to face opposite to the object speaker”);
and
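a voice recognition unit configured to execute a voice recognition process that corresponds to voice data in the sound source direction and the voice section (¶ 0193 last 7 lines: “the speech recognition circuit 55” (a voice recognition unit) “capable of recognizing a speech” (directed at “speech” (voice section)) for the processes including interaction between the “robot” and a “participant”, e.g. “robot” “ask[s]” “Good afternoon, Mr. XXX?” (¶ 0194) and “Participant P answers” “My name is XXX” (¶ 0197), where all the input by the “participant” is collected by “microphone” (part of the “Audition module” (sound source direction/voice section designation unit) as disclosed in the Abstract lines 8+));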
an image processing unit configured to:
accept a camera-captured image, and identify, based on the camera captured image, the user included in the camera captured image (¶ 0083: “vision module” (an image processing unit) “on the basis of an image taken by a camera” (accepts an input of a camera’s captured image) “is allowed to identify by face” (to identify the user on a basis of the camera image) “each such speaker”); and
a display information generation unit that displays a character image corresponding to the identified user, wherein the character image is displayed in the specified area (the said “robot” comprises of a “display” (¶ 0054 page 5 last 5 lines) which according to ¶ 0057: “the said display preferably includes a visual display for displaying” “an image” (a character image) “of an extracted face”(corresponding to the “participant” (user) who was identified) on the “robot” (specified area) “display”)).

Nakadai et al.’s alternative embodiment does teach some voice processing functions carried out by its server (¶ 0123 sentence 1: “Here, the association module 60 (block 5 in FIG. 9) is made up of a server of a client-server system” where the “association module” according to ¶ 0193 sentence 4: “gives an input to the speech recognition circuit 55” (is responsible for some voice processing functions)).
It would have therefore been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate all voice processing functions, including those associated with “the speech recognition circuit 55” (voice recognition), into the “server” of Nakadai et al.’s main embodiment, so as to “allow their events to be processed rapidly in real time” as suggested in Nakadai et al. ¶ 0029 last sentence.
Nakadai et al. do not specifically disclose:
The character image is different from the camera captured image.
Fouillade et al. do teach:
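The character image is different from the camera captured image (¶ 0030 sentence 4: “When the robot 102 subsequently recognizes the user 106 (through face or voice recognition) or is otherwise interacting the user 106, the robot 102 can display the avatar selected by the user” (based on “face” and “voice” “recognition” (e.g. by capturing image as well as voice of the user), the robot displays an “avatar” (a character different from the “face” (the camera captured image)) “on the display screen of the robot 102” (while interacting with the user); note that step “504” and “506” teach “CAMERA TO CAPTURE” “IMAGES” of “AN INDIVIDUAL”).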
It would have therefore been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the “voice” and “face” “recognition” and “display” functions of the “robot” of Fouillade et al. into the respective ones of the “robot” of Nakadai et al., since doing so would enable the combined systems and their associated methods to perform in combination as they do separately, and would further enable Nakadai et al. to use the “avatar” image on the robot so that the robot can “act as a proxy of the user” as disclosed in Fouillade et al. ¶ 0030 last sentence.

Regarding claim 14, Nakadai et al. do teach an information processing method comprising:
an information processing apparatus (Title: “ROBOT” (an information processing apparatus) “AUDIOVISUAL SYSTEM”; Abstract lines 8+ teach an “audition module” (a voice processing unit of the “ROBOT”) which receives “sound” using “microphones”, which according to ¶ 0156 sentence 1 is sent to “speech (voice) recognition circuit” (for voice recognition process of any utterance received from the “microphones”)):

wherein the designation is based on a line of sight of its user, who has executed the user utterance, is at a specified area (¶ 0144 sentence before last: “face locator 33 determines the direction” (determining user’s line of sight) “in which the identified face lies”; ¶ 0252 last 7 lines: “the association module upon locating the source of the auditory event and locating the face of the visual event” (e.g. upon determining the “participant” (user responsible for the utterance) looking toward the “robot” (a predetermined specified area)) “determines the direction in which the speaker is present and forms an auditory and a visual stream and an associated stream therefor” (executes a designation process); e.g. according to ¶ 0253 lines 4+: “directing the robot to face opposite to the object speaker”);
executing, by a voice recognition unit, a voice recognition process that corresponds to voice data in the sound source direction and the voice section (¶ 0193 last 7 lines: “the speech recognition circuit 55” (a voice recognition unit) “capable of recognizing a speech” (directed at “speech” (voice section)) for the processes including interaction between the “robot” and a “participant”, e.g. “robot” “ask[s]” “Good afternoon, Mr. XXX?” (¶ 0194) and “Participant P answers” “My name is XXX” (¶ 0197), where all the input by the “participant” is collected by “microphone” (part of the “Audition module” (sound source direction/voice section designation unit) as disclosed in the Abstract lines 8+));
identifying, based on a camera captured image, the user included in the camera captured image (¶ 0083: “vision module” (an image processing unit) “on the basis of an image taken by a camera” (accepts an input of a camera’s captured image) “is allowed to identify by face” (to identify the user on a basis of the camera image) “each such speaker”); and
displaying a character image corresponding to the identified user, wherein the character image is displayed in the specified area (the said “robot” comprises of a “display” (¶ 0054 page 5 last 5 lines) which according to ¶ 0057: “the said display preferably includes a visual display for displaying” “an image” (a character image) “of an extracted face” (corresponding to the “participant” (user) who was identified) on the “robot” (specified area) “display”).
Nakadai et al. do not specifically disclose:
The character image is different from the camera captured image.
Fouillade et al. do teach:
The character image is different from the camera captured image (¶ 0030 sentence 4: “When the robot 102 subsequently recognizes the user 106 (through face or voice recognition) or is otherwise interacting the user 106, the robot 102 can display the avatar selected by the user” (based on “face” and “voice” “recognition” (e.g. by capturing image as well as voice of the user), the robot displays an “avatar” (a character different from the “face” (the camera captured image)) “on the display screen of the robot 102” (while interacting with the user); note that step “504” and “506” teach “CAMERA TO CAPTURE” “IMAGES” of “AN INDIVIDUAL”).
It would have therefore been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the “voice” and “face” “recognition” and “display” functions of the “robot” of Fouillade et al. into the respective ones of the “robot” of Nakadai et al., since doing so would enable the combined systems and their associated methods to perform in combination as they do separately, and would further enable Nakadai et al. to use the “avatar” image on the robot so that the robot can “act as a proxy of the user” as disclosed in Fouillade et al. ¶ 0030 last sentence.

Regarding claim 15, Nakadai et al.’s main embodiment does teach an information processing method, comprising:
 an information processing system that includes a user terminal and a data processing server (¶ 0016: “In this robot visuoauditory system” (an information processing system comprising) “according to the present invention, preferably the said association module is made a server” (a server) “and each of the said audition, vision and motor control modules are made a client” (and a user terminal) “connected to the said server”), 
Executing, in the user terminal:
inputting a user utterance (Abstract lines 8+: “the audition module” (a voice input unit) “in response to sound signals from microphones” (that inputs a user utterance) “extracts pitches therefrom”);
And inputting a camera captured image of a user (Abstract lines 13+: “The vision module” (an image input unit) “on the basis of an image taken by a camera” (to input a user image captured by a camera) “identifies by face”),
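executing, by a sound source direction/voice section designation unit, a process of designating a sound source direction and a voice section of the user utterance (¶ 0047: “audition module, the vision module” “cooperate” “to allow determining the direction of each speaker on the basis of the directional information of locating the sound source from the auditory event and locating the speaker from the visual event” (“audition module” “vision module” (the voice processing unit) includes a sound source direction/voice section as it can “determin[e]” (designate) sound source direction of voice of a “speaker” (user) utterance); Abstract line 11: “auditory event” (voice section) is “extract[ed]” using the “identifi[ed]” “sound source”; ¶ 0037: “audition module” “to identify the individual speakers as” “sound sources” “and then extracts their own auditory events” (to also determine “speakers” (user) voice sections, e.g. recognizing their utterance, such as Fig. 10 depicting a human robot dialog interaction)),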
wherein the designation is based on a line of sight of its user, who has executed the user utterance, is at a specified area (¶ 0144 sentence before last: “face locator 33 determines the direction” (determining user’s line of sight) “in which the identified face lies”; ¶ 0252 last 7 lines: “the association module upon locating the source of the auditory event and locating the face of the visual event” (e.g. upon determining the “participant” (user responsible for the utterance) looking toward the “robot” (a predetermined specified area)) “determines the direction in which the speaker is present and forms an auditory and a visual stream and an associated stream therefor” (executes a designation process); e.g. according to ¶ 0253 lines 4+: “directing the robot to face opposite to the object speaker”);


executing a voice recognition process that corresponds to voice data in the sound source direction and the voice section (¶ 0193 last 7 lines: “the speech recognition circuit 55” (a voice recognition unit) “capable of recognizing a speech” (directed at “speech” (voice section)) for the processes including interaction between the “robot” and a “participant”, e.g. “robot” “ask[s]” “Good afternoon, Mr. XXX?” (¶ 0194) and “Participant P answers” “My name is XXX” (¶ 0197), where all the input by the “participant” is collected by “microphone” (part of the “Audition module” (sound source direction/voice section designation unit) as disclosed in the Abstract lines 8+));
identifying, based on a camera captured image, the user included in the camera captured image (¶ 0083: “vision module” (an image processing unit) “on the basis of an image taken by a camera” (accepts an input of a camera’s captured image) “is allowed to identify by face” (to identify the user on a basis of the camera image) “each such speaker”); and
displaying a character image corresponding to the identified user, wherein the character image is displayed in the specified area (the said “robot” comprises of a “display” (¶ 0054 page 5 last 5 lines) which according to ¶ 0057: “the said display preferably includes a visual display for displaying” “an image” (a character image) “of an extracted face”(corresponding to the “participant” (user) who was identified) on the “robot” (specified area) “display”)).

Nakadai et al.’s alternative embodiment does teach some voice processing functions executed by its server (¶ 0123 sentence 1: “Here, the association module 60 (block 5 in FIG. 9) is made up of a server of a client-server system” where the “association module” according to ¶ 0193 sentence 4: “gives an input to the speech recognition circuit 55” (is responsible for some voice processing functions)).
It would have therefore been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate all voice processing functions, including those associated with “the speech recognition circuit 55” (voice recognition), into the “server” of Nakadai et al.’s main embodiment, so as to “allow their events to be processed rapidly in real time” as suggested in Nakadai et al. ¶ 0029 last sentence.
Nakadai et al. do not specifically disclose:
The character image is different from the camera captured image.
Fouillade et al. do teach:
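The character image is different from the camera captured image (¶ 0030 sentence 4: “When the robot 102 subsequently recognizes the user 106 (through face or voice recognition) or is otherwise interacting the user 106, the robot 102 can display the avatar selected by the user” (based on “face” and “voice” “recognition” (e.g. by capturing image as well as voice of the user), the robot displays an “avatar” (a character different from the “face” (the camera captured image)) “on the display screen of the robot 102” (while interacting with the user); note that step “504” and “506” teach “CAMERA TO CAPTURE” “IMAGES” of “AN INDIVIDUAL”).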
It would have therefore been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the “voice” and “face” “recognition” and “display” functions of the “robot” of Fouillade et al. into the respective ones of the “robot” of Nakadai et al., since doing so would enable the combined systems and their associated methods to perform in combination as they do separately, and would further enable Nakadai et al. to use the “avatar” image on the robot so that the robot can “act as a proxy of the user” as disclosed in Fouillade et al. ¶ 0030 last sentence.

Regarding claim 16, Nakadai et al. do teach a non-transitory computer-readable medium having stored thereon, computer executable instructions, which when executed by a computer, cause the computer to execute operations (Title: “ROBOT” (an information processing apparatus) “AUDIOVISUAL SYSTEM”; Abstract lines 8+ teach an “audition module” (a voice processing unit of the “ROBOT”) which receives “sound” using “microphones”, which according to ¶ 0156 sentence 1 is sent to “speech (voice) recognition circuit” (for voice recognition process of any utterance received from the “microphones” (operations to be executed); according to ¶ 0123 last sentence teach 
The operations comprising:
designating a sound source direction and a voice section of the user utterance (¶ 0047: “audition module, the vision module” “cooperate” “to allow determining the direction of each speaker on the basis of the directional information of locating the sound source from the auditory event and locating the speaker from the visual event” (“audition module” “vision module” (the voice processing unit) includes a sound source direction/voice section as it can “determin[e]” (designate) sound source direction of voice of a “speaker” (user) utterance); Abstract line 11: “auditory event” (voice section) is “extract[ed]” using the “identifi[ed]” “sound source”; ¶ 0037: “audition module” “to identify the individual speakers as” “sound sources” “and then extracts their own auditory events” (to also determine “speakers” (user) voice sections, e.g. recognizing their utterance, such as Fig. 10 depicting a human robot dialog interaction)); and
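wherein the designation is based on a line of sight of a user, who has executed the user utterance, is at a specified area (¶ 0144 sentence before last: “face locator 33 determines the direction” (determining user’s line of sight) “in which the identified face lies”; ¶ 0252 last 7 lines: “the association module upon locating the source of the auditory event and locating the face of the visual event” (e.g. upon determining the “participant” (user responsible for the utterance) looking toward the “robot” (a predetermined specified area)) “determines the direction in which the speaker is present and forms an auditory and a visual stream and an associated stream therefor” (executes a designation process); e.g. according to ¶ 0253 lines 4+: “directing the robot to face opposite to the object speaker”);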
executing a voice recognition process that corresponds to voice data in the sound source direction and the voice section (¶ 0193 last 7 lines: “the speech recognition circuit 55” (a voice recognition unit) “capable of recognizing a speech” (directed at “speech” (voice section)) for the processes including interaction between the “robot” and a “participant”, e.g. “robot” “ask[s]” “Good afternoon, Mr. XXX?” (¶ 0194) and “Participant P answers” “My name is XXX” (¶ 0197), where all the input by the “participant” is collected by “microphone” (part of the “Audition module” (sound source direction/voice section designation unit) as disclosed in the Abstract lines 8+)),
identifying, based on a camera captured image, the user included in the camera captured image (¶ 0083: “vision module” (an image processing unit) “on the basis of an image taken by a camera” (accepts an input of a camera’s captured image) “is allowed to identify by face” (to identify the user on a basis of the camera image) “each such speaker”); and
displaying a character image corresponding to the identified user, wherein the character image is displayed in the specified area (the said “robot” comprises of a 
Nakadai et al. do not specifically disclose:
The character image is different from the camera captured image.
Fouillade et al. do teach:
The character image is different from the camera captured image (¶ 0030 sentence 4: “When the robot 102 subsequently recognizes the user 106 (through face or voice recognition) or is otherwise interacting the user 106, the robot 102 can display the avatar selected by the user” (based on “face” and “voice” “recognition” (e.g. by capturing image as well as voice of the user), the robot displays an “avatar” (a character different from the “face” (the camera captured image)) “on the display screen of the robot 102” (while interacting with the user)); note that steps “504” and “506” teach “CAMERA TO CAPTURE” “IMAGES” of “AN INDIVIDUAL”).
It would have therefore been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the “voice” and “face” “recognition” and “display” functions of the “robot” of Fouillade et al. into the respective ones of the “robot” of Nakadai et al., since doing so would enable the combined systems and their associated methods to perform in combination as they do separately and to .

Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Nakadai et al. in view of Fouillade et al., and further in view of James et al. (US 2016/0167648).
Regarding claim 12, Nakadai et al. do teach the information processing apparatus according to claim 10, wherein the second detection results for the sound source direction and the voice section based on the input image include information from an analysis result for a face direction (¶ 0144 last 4 lines: “the face locator” (part of the “vision module” (part of the sound source direction and voice section unit)) “determines the direction” (determines sound source direction) “in which the identified face lies” (based on face direction); Abstract line 11: “auditory event” (voice section) is “extract[ed]” using the “identifi[ed]” “sound source”).
Nakadai et al. in view of Fouillade et al. do not specifically disclose using:
and a lip motion of the user included in the camera-captured image.
James et al. do teach:

It would have therefore been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the “lip movement” of James et al., which aids in “identify[ing]” “words” in parallel to analysis of “audial data” “to identify the words” as taught in James et al. ¶ 0058, into the “audition” and “vision” “modules” of Nakadai et al. in view of Fouillade et al., since doing so would enable the combined systems and their associated methods to perform in combination as they do separately and would further enable Nakadai et al. in view of Fouillade et al. to resolve “the ambiguities which the audition and vision of the robot individually possess” “thereby rising the so-called robustness of the system” in achieving superior speech recognition using both audial as well as visual means, as required in Nakadai et al. ¶ 0071.

Conclusion

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

Any inquiry concerning this communication or earlier communications from the examiner should be directed to FARZAD KAZEMINEZHAD whose telephone number is (571)270-5860. The examiner can normally be reached 10:30 am to 11:30 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, DANIEL C WASHBURN, can be reached at (571)272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.






/Farzad Kazeminezhad/
Art Unit 2657
March 5th 2022.