DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

	Response to Amendment
1. This Office action is in response to communications filed 5/20/2021. Claims 1-3, 6-9, 11-15, and 17-19 are amended. Claim 4 is original. Claims 5, 10, and 16 are canceled.

Response to Arguments
Applicant’s arguments with respect to claim(s) 1-4, 6-9, 11-15, 17-19 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

2. 	Claims 1-4, 6-9, 11-13, 15, and 17-19 are rejected under 35 U.S.C. 103 as being unpatentable over U.S. Patent Application 2016/0098622, Ramachandrula, in view of U.S. Patent Application 2009/0154896, Matono.

Regarding Claim 1, Ramachandrula discloses A data processing apparatus (Fig. 1; [0010], System 100 includes computing device 102 and camera 104; Fig. 2: computer system 202; [0013]) comprising:
 	a processor (Fig. 2: processor 204; [0015]),
 	wherein the processor performs processing of ([0016], processing logic that interprets and executes instructions):
 	acquiring image data at a time of image capturing (Fig. 5; Fig. 3: capture audiovisual of a user 302; [0024]);
 	acquiring plural pieces of audio data in a situation at the time of image capturing (Figs. 4 and 5. Examiner notes that applicant’s definition of “a situation” includes noise (i.e. speaking));
 	extracting a specific piece of audio data corresponding to the position of the subject identified as the audio source from the acquired plural pieces of audio data (Figs. 1, 4, 5; [0011], An audio-video input may include user 106 speaking in the vicinity of camera 104 (for instance, while facing camera 104) for some duration. Examiner notes that the camera can be used to capture the lip movements associated with the user speaking into the system 100. Examiner equates the lip movements, which are captured when the user speaks into the camera/microphone, to the position of the subject identified as the audio source.); and
 	controlling such that the specific piece of audio data and the subject are associated with each other (Examiner notes that the code words and lip vector images are associated with the user; Fig. 1; [0011], [0030], Computing device 102 includes microphone 108 for capturing an audio input of a user. In an implementation, user 106 provides an audio-video (audiovisual) input (for example, for the purpose of authentication) to computing device 102. An audio-video input may include user 106 speaking in the vicinity of camera 104 (for instance, while facing camera 104) for some duration).
However, Ramachandrula does not explicitly disclose analyzing the image data so as to identify a position of a subject, the subject being included in the image data and being an audio source in the image data.
 	Matono teaches analyzing the image data so as to identify a position of a subject, the subject being included in the image data and being an audio source in the image data ([0016], extracting a sound of the specific subject from an audio signal and adjusting the extracted sound on the basis of the detected location).
 	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the computer system taught in Ramachandrula with speaker detection that identifies the on-screen location at which a person is recorded, as taught in Matono, for the purpose of improving user authentication from a predetermined distance.

3. 	Regarding Claim 2, Ramachandrula discloses The data processing apparatus according to claim 1, wherein the processor analyzes the plural pieces of audio data so as to acquire audio features thereof ([0008], analyze human body characteristics, see Fig. 3: 306 evaluate speech vector quantization corresponding to lip vector sequence (e.g. frames) with code words (e.g. speech); Fig. 4: speech signals 404, 406 associated with the user’s lip images in Fig. 5, [0033]), and analyzes the image data based on the audio features so as to identify a position of the subject, the subject being included in the image data, being the audio source in the image data and having the audio features (Fig. 4; [0033], Fig. 4, a speech signal 404 is extracted from an audiovisual 402 of an individual 400. Next speech vectors 406 are extracted from speech signal 404. Later, vector quantization is used to compress speech feature vectors 406 by representing a vector with a codevector. Subsequently, an index 408 identifying the codevector is used to represent the original speech signal).

4. 	Regarding Claim 3, Ramachandrula discloses The data processing apparatus according to claim 1, wherein the processor analyzes a subject in the image data so as to identify an appearance feature of the subject, the subject being the audio source in the image data, and analyzes the plural pieces of audio data based on the appearance feature of the subject identified as the audio source so as to extract a specific piece of audio data corresponding to the subject having the appearance feature from the plural pieces of audio data and control such that the specific piece of audio data and the subject are associated with each other (Examiner notes that the code words and lip vector images are associated with the user; Fig. 1; [0011], Computing device 102 includes microphone 108 for capturing an audio input of a user. In an implementation, user 106 provides an audio-video (audiovisual) input (for example, for the purpose of authentication) to computing device 102. An audio-video input may include user 106 speaking in the vicinity of camera 104). 

5. 	Regarding Claim 4, Ramachandrula discloses The data processing apparatus according to claim 1, wherein the processor identifies the subject that is the audio source by analyzing motions of subjects in the image data (Fig. 5, [0007], FIG. 5 illustrates lip shapes of different individuals, see Fig. 3: 306 evaluate speech vector quantization corresponding to lip vector sequence (e.g. frames) with code words (e.g. speech)).

6. 	Regarding Claim 6, Ramachandrula discloses The data processing apparatus according to claim 1, further comprising:
 	a display section which displays the image data (Fig. 1: 102; [0011], Computing device 102 may be a desktop computer, notebook computer, tablet computer, mobile phone, personal digital assistant (PDA), smart phone, server computer, and the like; Fig. 2: display device 210 and Fig. 5: lip images),
 	wherein the processor displays (Fig. 3: 304 process the audio visual), on the display section (Fig. 2: 220), image data (Fig. 5: lip images) including the subject identified as the audio source (sound from individuals), and controls such that the extracted specific piece of audio data and the displayed subject are associated with each other (Examiner notes that the code words and lip vector images are associated with the user; Fig. 1; [0011], Computing device 102 includes microphone 108 for capturing an audio input of a user; [0033], a speech signal 404 is extracted from an audiovisual 402 of an individual 400. Next, speech feature vectors 406 are extracted from speech signal 404).

7. 	Regarding Claim 7, Ramachandrula discloses The data processing apparatus according to claim 6, wherein the processor clips an image of an area of the subject, the area including the position of the subject identified as the audio source from the acquired image data ([0037], FIG. 5 illustrates lip shapes of three different individuals. Each row in FIG. 5 corresponds to different frames of a common video recorded for each individual. In other words, these frames represent lip shapes of individuals when they are mouthing or speaking same words.), and
 	wherein the processor (i) displays (Fig. 3: 304 process the audio visual), on the display section (Fig. 2: display device), the image of the area of the subject acquired by the area being clipped ([0037], FIG. 5 illustrates lip shapes of three different individuals, according to an example. Each row in FIG. 5 corresponds to different frames of a common video recorded for each individual. In other words, these frames represent lip shapes of individuals when they are mouthing or speaking same words.), 
 	(ii) extracts, from the plural pieces of audio data, a specific piece of audio data corresponding to the subject included in the image of the area of the subject as the audio source ([0037], Lip shapes may be extracted and their features measured by using Active shape models (ASM) of individuals when they are mouthing or speaking same words), and 
 	(iii) controls such that the specific piece of audio data and the subject are associated with each other ([0037], Lip shapes may be extracted and their features measured by using Active shape models (ASM) of individuals when they are mouthing or speaking same words. [0040], A video database of a plurality of individuals (speaking while facing the camera) is created to train the universal codebook for speech signals and universal codebook for lip shapes).

8. 	Regarding Claim 8, Ramachandrula discloses The data processing apparatus according to claim 7, wherein the processor clips an image of an area of a subject, the area including the position of the subject arbitrarily specified as the audio source from the image data displayed on the display section (Examiner notes that the camera 104 shown in Fig. 1 can be switched to different users to capture their lip movements and voice input), and
 	wherein the processor (i) displays (Fig. 3: 304 process the audio visual), on the display section (Fig. 2: 220; Fig. 5), the image of the area of the subject acquired by the area being clipped ([0037], FIG. 5 illustrates lip shapes of three different individuals, according to an example. Each row in FIG. 5 corresponds to different frames of a common video recorded for each individual. In other words, these frames represent lip shapes of individuals when they are mouthing or speaking same words.), (ii) extracts, from the plural pieces of audio data, a specific piece of audio data corresponding to the subject included in the image of the area of the subject as the audio source (Figs. 4 and 5; [0033], A universal codebook for speech signals is created by extracting, from a speech signal repository of a plurality of speakers, feature vectors on each sequential frame of speech), and (iii) controls such that the specific piece of audio data and the subject are associated with each other ([0037], Lip shapes may be extracted and their features measured by using Active shape models (ASM) of individuals when they are mouthing or speaking same words. [0040], A video database of a plurality of individuals (speaking while facing the camera) is created to train the universal codebook for speech signals and universal codebook for lip shapes).

9. 	Regarding Claim 9, Ramachandrula discloses The data processing apparatus according to claim 1, wherein the processor analyzes the plural pieces of audio data so as to extract pieces of audio data for respective audio sources ([0033], A universal codebook for speech signals is created by extracting, from a speech signal repository of a plurality of speakers, feature vectors on each sequential frame of speech), and
 	wherein the processor selects a specific piece of audio data corresponding to the position of the subject identified as the audio source from the extracted pieces of audio data for the respective audio sources and controls such that the specific piece of audio data and the subject are associated with each other ([0049], The individuals may be asked to read out some text, wherein the selected text may include all the phonemes in sufficient numbers so that all possible lip shapes which an individual can make could be covered).

10. 	Regarding Claim 11, Ramachandrula discloses The data processing apparatus according to claim 1 (Fig. 1; [0010], System 100 includes computing device 102 and camera 104; Fig. 2: computer system 202; [0013]), further comprising:
 	an audio output section (Fig. 1: user speaking into microphone 108; Examiner equates the output to the user’s voice) which outputs the specific piece of audio data extracted from the plural pieces of audio data ([0033], A universal codebook for speech signals is created by extracting, from a speech signal repository of a plurality of speakers, feature vectors on each sequential frame of speech).
However, Ramachandrula does not explicitly disclose wherein the processor controls an output status of the specific audio data outputted from the audio output section in accordance with a position of the subject identified as the audio source in image of the image data. 
 	Matono discloses wherein the processor controls an output status of the specific audio data outputted from the audio output section in accordance with a position of the subject identified as the audio source in image of the image data (Fig. 7: microphone 706, sound recognition processor and object detector 703; [0055]-[0056], Fig. 3: location of speaker, see [0031]).
 	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the computer system taught in Ramachandrula with speaker detection that identifies the on-screen location at which a person is recorded, as taught in Matono, for the purpose of improving user authentication from a predetermined distance.

11. 	Regarding Claim 12, Ramachandrula in view of Matono discloses The data processing apparatus according to claim 11.
 	Ramachandrula discloses wherein the audio output section includes a plurality of loudspeakers arranged at different positions ([0040], A video database of a plurality of individuals (speaking while facing the camera) is created to train the universal codebook for speech signals and universal codebook for lip shapes).
 	However, Ramachandrula does not explicitly disclose wherein the processor controls, for each loudspeaker, a sound volume of the specific piece of audio data outputted from the audio output section in accordance with the position of the subject identified as the audio source.
 	Matono discloses wherein the processor controls, for each loudspeaker, a sound volume of the specific piece of audio data outputted from the audio output section in accordance with the position of the subject identified as the audio source (Fig. 7: microphone 706, sound recognition processor and object detector 703; [0055]-[0056], Fig. 3: location of speaker, see [0031]).
 	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the computer system taught in Ramachandrula with speaker detection that identifies the on-screen location at which a person is recorded, as taught in Matono, for the purpose of improving user authentication from a predetermined distance.

12. 	Regarding Claim 13, Ramachandrula in view of Matono discloses The data processing apparatus according to claim 11.
 	Ramachandrula discloses wherein the processor controls the output status of the specific piece of audio data outputted from the audio output section (see Fig. 3: 306, code words (e.g. speech)) so as to follow based on a movement of the position of the subject identified as the audio source (Examiner notes that the camera 104 shown in Fig. 1 can be switched to different locations, or the users can move to different positions, to capture their lip movements and voice input).


13. 	Regarding Claim 15, Ramachandrula in view of Matono discloses The data processing apparatus according to claim 11.
 	Ramachandrula discloses wherein the processor combines the specific piece of audio data corresponding to the position of the subject identified as the audio source from among the plural pieces of audio data with other pieces of audio data and controls the audio output section to output resultant data (Examiner notes that when the user is speaking into the microphone, background noise will also be output into the system). 

14. 	Regarding Claim 17, Ramachandrula discloses The data processing apparatus according to claim 1, wherein the processor controls such that the position of the subject identified as the audio source and the specific piece of audio data corresponding to the subject are associated with each other (Examiner notes that the code words and lip vector images are associated with the user; Fig. 1; [0011], Computing device 102 includes microphone 108 for capturing an audio input of a user. In an implementation, user 106 provides an audio-video (audiovisual) input (for example, for the purpose of authentication) to computing device 102. An audio-video input may include user 106 speaking in the vicinity of camera 104), and creates a file for managing image data including the subject and the specific piece of audio data corresponding to the subject (Fig. 2: 206 memory; see [0018]).

15. 	Claim 18 is a method claim corresponding to apparatus claim 1 and is rejected on the same grounds as claim 1. 

16. 	Claim 19 is a computer-readable medium (CRM) claim corresponding to apparatus claim 1 and is rejected on the same grounds as claim 1.


17. 	Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Ramachandrula in view of Matono as applied to claim 11 above, and further in view of U.S. Patent Application 2016/0313973, Yajima.

18. 	Regarding Claim 14, Ramachandrula in view of Matono discloses The data processing apparatus according to claim 11.
Ramachandrula discloses wherein the processor controls the audio output section (e.g. user speaking) to output only the specific piece of audio data corresponding to the position of the subject identified as the audio source from among the plural pieces of audio data (see Fig. 3: 306, code words (e.g. speech)).
 	However, Ramachandrula in view of Matono does not explicitly disclose wherein the processor prevents output of other pieces of audio data from among the plural pieces of audio data.
 	Yajima teaches preventing output of other pieces of audio data from among the plural pieces of audio data (Fig. 5: S37, filtering out background noise from voice; [0148]).
 	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the microphone as taught in Ramachandrula in view of Matono with a control section filtering out background noise as taught in Yajima for the purpose of improving the audibility of the target sound (Yajima, [0148]).

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to OMER KHALID whose telephone number is (571)270-5997.  The examiner can normally be reached on Monday-Friday 9am-7pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, John Miller can be reached on (571) 272-7353.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.


/OMER KHALID/Examiner, Art Unit 2422
/JOHN W MILLER/Supervisory Patent Examiner, Art Unit 2422