Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claims 1-20 are pending.
Drawings as filed 07/05/2022 are accepted.
Response to arguments/amendments
Independent claims 1, 8 and 15 are now amended to state that the identification of the target voice is based on a mask, which is generated according to at least the received audio data and the extracted video feature data. As such, a new search and further consideration are necessitated.
Argument(s) presented in the Remarks dated 07/05/2022 are directed to newly introduced amendments and thus are moot in view of the new ground of rejection discussed below.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claim(s) 1-3, 5-10, 12-17, 19 and 20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Wexler et al. (US 2022/0021985) in view of Wang et al. (CN 11035159) - 2019.
As to claim 1:
 Wexler discloses a method of separating a target voice from among a plurality of speakers (Abstract), the method comprising:

receiving video data associated with the plurality of speakers; (¶0405, collect video data for all speakers, and in this particular illustrated example, both of the speakers 3302 and 3303)

receiving audio data associated with each of the one or more speakers; (¶0405, collect audio data for all speakers, and in this particular illustrated example, both of the speakers 3302 and 3303)

extracting video feature data from the received video data; (See ¶0231, 0259, analyzing images/videos of the individual(s) to detect lip movement (feature data) of each individual; see further disclosure in ¶0246, 0261, 0405)

identifying the target voice from among the plurality of speakers based on the received audio data and the extracted video feature data. (See at least ¶0405, 0246, 0261, based on analysis of lip movements in combination with the audio signal, the system can identify the individual voice of each of the plurality of speakers)

Wexler, however, is silent on the particular detail of the identification as claimed, namely the identification of the target voice specifically as based on a mask generated according to the received audio data and the extracted video feature data.

However, Wang, in the same field of speaker identification technology, does address such details. Namely, in the text describing steps S1011 through S1013 on pages 3 and 4, Wang discloses a process to identify a target voice/person from a video stream, in which a time-frequency mask is generated based on the extracted voice signal and a video segment that includes the lips of the target character (person), and the mask is then used to extract the speaker associated with the speech.

It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention that the identification of the target voice and speaker in Wexler would include the generation of a time-frequency mask using the input multimedia data as described in Wang. Time-frequency (T-F) masking is a known technique for separating target speech from a mixture of other voices and noise. In view of Wexler’s express demand for prioritizing certain voices and/or attenuating audio signals of bystanders or background, one of ordinary skill would have recognized the advantage of T-F masking in achieving that goal.
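
For illustration only, and not as a characterization of how Wang or Wexler implements the technique, the general idea of T-F masking can be sketched as follows. The sketch assumes toy magnitude spectrograms in which the mixture is the simple sum of target and interference magnitudes, and it computes an oracle ratio mask from the known components. The function names ratio_mask and apply_mask and all numeric values are hypothetical. In Wang, the mask is instead generated from the voice signal and the lip video segment.

```python
# Illustrative sketch of time-frequency (T-F) masking on toy data.
# Assumption: magnitudes add linearly in the mixture; real systems estimate
# the mask with a model rather than from known target/interference parts.

def ratio_mask(target_mag, interference_mag, eps=1e-8):
    """Return a soft mask in [0, 1] for every (time frame, frequency bin)."""
    return [
        [t / (t + n + eps) for t, n in zip(t_row, n_row)]
        for t_row, n_row in zip(target_mag, interference_mag)
    ]

def apply_mask(mask, mixture_mag):
    """Multiply the mixture spectrogram element-wise by the mask."""
    return [
        [m * x for m, x in zip(m_row, x_row)]
        for m_row, x_row in zip(mask, mixture_mag)
    ]

# Rows are time frames, columns are frequency bins (hypothetical values).
target = [[4.0, 0.0], [1.0, 3.0]]
interference = [[0.0, 2.0], [1.0, 1.0]]
mixture = [
    [t + n for t, n in zip(t_row, n_row)]
    for t_row, n_row in zip(target, interference)
]

mask = ratio_mask(target, interference)
estimate = apply_mask(mask, mixture)
```

Bins dominated by the target receive a mask value near 1 and are kept, while bins dominated by interference receive a value near 0 and are attenuated.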


As to claim 8:
Wexler discloses a computer system for separating a target voice from among a plurality of speakers, the computer system comprising: one or more computer-readable non-transitory storage media configured to store computer program code; and one or more computer processors configured to access said computer program code and operate as instructed by said computer program code, (See ¶0033, 0642, a system having a processor and memory containing code) said computer program code including:
 first receiving code configured to cause the one or more computer processors to receive video data associated with the plurality of speakers (¶0405, collect video data for all speakers, and in this particular illustrated example, both of the speakers 3302 and 3303);
 second receiving code configured to cause the one or more computer processors to receive audio data associated with each of the one or more speakers; (¶0405, collect audio data for all speakers, and in this particular illustrated example, both of the speakers 3302 and 3303)
 extracting code configured to cause the one or more computer processors to extract video feature data from the received video data; (See ¶0231, 0259, analyzing images/videos of the individual(s) to detect lip movement (feature data) of each individual; see further disclosure in ¶0246, 0261, 0405)
 identifying code configured to cause the one or more computer processors to identify the target voice from among the plurality of speakers based on the received audio data and the extracted video feature data. (See at least ¶0405, 0246, 0261, based on analysis of lip movements in combination with the audio signal, the system can identify the individual voice of each of the plurality of speakers)
Wexler, however, is silent on the particular detail of the identification as claimed, namely the identification of the target voice specifically as based on a mask generated according to the received audio data and the extracted video feature data.

However, Wang, in the same field of speaker identification technology, does address such details. Namely, in the text describing steps S1011 through S1013 on pages 3 and 4, Wang discloses a process to identify a target voice/person from a video stream, in which a time-frequency mask is generated based on the extracted voice signal and a video segment that includes the lips of the target character (person), and the mask is then used to extract the speaker associated with the speech.

It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention that the identification of the target voice and speaker in Wexler would include the generation of a time-frequency mask using the input multimedia data as described in Wang. Time-frequency (T-F) masking is a known technique for separating target speech from a mixture of other voices and noise. In view of Wexler’s express demand for prioritizing certain voices and/or attenuating audio signals of bystanders or background, one of ordinary skill would have recognized the advantage of T-F masking in achieving that goal.

Claim 15 is directed to a non-transitory CRM having stored thereon a computer program for separating a target voice from among a plurality of speakers, (See ¶0033, 0642, a system having a processor and memory containing code) the computer program configured to cause one or more computer processors to perform method steps similar to those of claim 1, and is rejected by the same reasoning.


As to claim 2:
Wexler in view of Wang discloses all limitations of claim 1, wherein the extracted video feature data comprises direction data corresponding to the one or more users. (See Wexler, ¶0227, 0234, image-based analysis to determine visual direction corresponding to a user/speaker)
Claim 9 is directed to a system having a CRM with limitation(s) directed to subject matter similar to claim 2 and is rejected by the same reasoning.
Claim 16 is directed to a CRM with limitation(s) directed to subject matter similar to claim 2 and is rejected by the same reasoning.


As to claim 3:
Wexler in view of Wang discloses all limitations of claim 1, wherein the extracted video feature data comprises lip movement data corresponding to each of the one or more speakers. (See Wexler, ¶0405, analysis of lip movements of each of the plurality of speakers)

Claim 10 is directed to a system having a CRM with limitation(s) directed to subject matter similar to claim 3 and is rejected by the same reasoning.
Claim 17 is directed to a CRM with limitation(s) directed to subject matter similar to claim 3 and is rejected by the same reasoning.

As to claim 5:
Wexler in view of Wang discloses all limitations of claim 1, wherein the audio data comprises an enrollment utterance associated with each of the one or more speakers. (See Wexler, ¶0524, identifying from the audio signals, for each individual speaker, a respective audioprint (i.e. enrollment utterance) unique to each person for identification purposes)
Claim 12 is directed to a system having a CRM with limitation(s) directed to subject matter similar to claim 5 and is rejected by the same reasoning.
Claim 19 is directed to a CRM with limitation(s) directed to subject matter similar to claim 5 and is rejected by the same reasoning.


As to claim 7:
Wexler in view of Wang discloses all limitations of claim 1, wherein the video feature data is extracted using a convolutional neural network. (See Wexler, ¶0201, 0261, 0376, various video feature data being obtained using a CNN)

Claim 14 is directed to a system having a CRM with limitation(s) directed to subject matter similar to claim 7 and is rejected by the same reasoning.


As to claim 6:
Wexler in view of Wang discloses all limitations of claim 1, wherein identifying the target voice comprises generating the mask as a time-frequency mask for the target speaker. (See Wang, steps S1011 through S1013 on pages 3 and 4, disclosing a time-frequency mask for the target speaker)
Claim 13 is directed to a system having a CRM with limitation(s) directed to subject matter similar to claim 6 and is rejected by the same reasoning.
Claim 20 is directed to a CRM with limitation(s) directed to subject matter similar to claim 6 and is rejected by the same reasoning.





Claim(s) 4, 11 and 18 is/are rejected under 35 U.S.C. 103 as being unpatentable over Wexler et al. (US 2022/0021985) in view of Wang et al. (CN 11035159) – 2019, and in further view of Tagra (US 2021/0385212).

As to claim 4:
Wexler in view of Wang discloses all limitations of claim 3, wherein the lip movement data comprises images corresponding to the mouths of each of the one or more speakers. (See Wexler, ¶0405, analysis of lip movements of each of the plurality of speakers)

However, Wexler/Wang does not explicitly mention the video image associated with the lips (mouth) being “cropped”. Cropping an image to a particular feature (the lips in this instance) is nonetheless standard practice in image analysis, so as to focus on the feature being analyzed.

Tagra, in a related field of endeavor, discloses a similar lip movement analysis procedure in which video frames (i.e. images) containing lips are cropped (See Tagra, ¶0034).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to incorporate cropping into Wexler’s procedure for speech/speaker analysis. This is advantageous because cropping is a known technique for localizing images to the region of interest. As such, removing unwanted visual data would help reduce processing time/effort at the system’s end.

Claim 11 is directed to a system having a CRM with limitation(s) directed to subject matter similar to claim 4 and is rejected by the same reasoning.
Claim 18 is directed to a CRM with limitation(s) directed to subject matter similar to claim 4 and is rejected by the same reasoning.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Maeng et al. (US 2021/0327447) - Disclosed are an electronic device performing an audio zoom based on speaker detection using lip reading and a method for controlling the electronic device. According to an embodiment, the electronic device detects a direction of a sound source while recording a video and determines a speaker's direction via facial recognition and mouth shape recognition in the sound source direction. Microphone beamforming may be performed based on the speaker's direction. Thus, the accuracy of audio zoom may be enhanced.

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 


Any inquiry concerning this communication or earlier communications from the examiner should be directed to QUAN M HUA whose telephone number is (571)270-7232. The examiner can normally be reached 10:30-6:30.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Anthony Addy, can be reached at 571-272-7795. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/QUAN M HUA/Primary Examiner, Art Unit 2645