Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
1.	This action is responsive to AFCP remarks filed 5/6/22.
Response to Amendment
2.	Independent claims 1, 10, and 21-22 have been amended.
Response to Arguments
3.	Applicant's arguments have been considered and are persuasive.
Allowable Subject Matter
4.	Claims 1-4 and 6-22 are allowed.
5.	The following is an examiner’s statement of reasons for allowance: the claims are allowed as they further teach:
A method comprising: 
obtaining, for each frame in a stream of frames from a video in which faces of one or more speakers have been detected, a respective per-frame face embedding of the face of each speaker; 
processing, for each speaker, the per-frame face embeddings of the face of the speaker using a video convolutional neural network to generate visual features for the face of the speaker; 
obtaining a spectrogram of an audio soundtrack for the video; 
processing the spectrogram using an audio convolutional neural network to generate an audio embedding for the audio soundtrack; 
combining the visual features for the one or more speakers and the audio embedding for the audio soundtrack to generate an audio-visual embedding for the video, wherein the audio-visual embedding represents both audio features of the audio soundtrack and visual features of the respective faces of the one or more speakers; 
processing, using a masking neural network, the audio-visual embedding for the video that represents both the audio features of the audio soundtrack and the visual features of the respective faces of the one or more speakers to generate a respective spectrogram mask for each of the one or more speakers, wherein each spectrogram mask describes time-frequency relationships between clean speech for a respective speaker and background interference in a spectrogram of the audio soundtrack; and 
determining, from the respective spectrogram masks and the corresponding audio soundtrack, a respective isolated speech spectrogram for each speaker that isolates the speech of the speaker in the video, 
wherein the video convolutional neural network, the audio convolutional neural network, and the masking neural network are trained end-to-end such that network parameters of the video convolutional neural network, the audio convolutional neural network, and the masking neural network are updated jointly on a same set of training samples.
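
For illustration only, the combining and mask-application steps recited above can be sketched as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: the claim does not specify how features are combined or how masks are applied, so concatenation along the feature axis and element-wise multiplication are assumed here, and the function names `fuse_audio_visual` and `apply_masks` are hypothetical.

```python
import numpy as np

def fuse_audio_visual(visual_feats, audio_emb):
    """Combine per-speaker visual features with the audio embedding.

    ASSUMPTION: "combining" is modeled as concatenation along the
    feature axis; the claim does not mandate this choice.
    visual_feats: (num_speakers, T, Dv), audio_emb: (T, Da)
    returns:      (T, num_speakers * Dv + Da)
    """
    num_speakers, T, Dv = visual_feats.shape
    # Put time first, then flatten the speakers into the feature axis.
    per_time = np.transpose(visual_feats, (1, 0, 2)).reshape(T, num_speakers * Dv)
    return np.concatenate([per_time, audio_emb], axis=-1)

def apply_masks(masks, mixture_spec):
    """Derive one isolated spectrogram per speaker.

    ASSUMPTION: each mask is multiplied element-wise with the
    spectrogram of the mixed soundtrack.
    masks: (num_speakers, T, F), mixture_spec: (T, F)
    returns: (num_speakers, T, F)
    """
    return masks * mixture_spec[np.newaxis, :, :]

# Toy usage with made-up shapes: 2 speakers, 4 time steps, 6 frequency bins.
visual = np.stack([np.full((4, 3), 1.0), np.full((4, 3), 2.0)])
audio = np.zeros((4, 5))
fused = fuse_audio_visual(visual, audio)

mixture = np.full((4, 6), 2.0)
masks = np.stack([np.full((4, 6), 0.25), np.full((4, 6), 0.75)])
isolated = apply_masks(masks, mixture)
```

In this toy case the two masks sum to one at every time-frequency bin, so the isolated spectrograms sum back to the mixture. In the claimed method the masks would instead be produced by the masking neural network from the audio-visual embedding.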

Regarding claim 1, Marcheret teaches a method (abstract: methods, computing devices, systems, computer-readable media) comprising: 
obtaining, for each frame in a stream of frames from a video in which faces of one or more speakers have been detected, a respective per-frame face embedding of the face of each speaker (fig 3 video; fig 5 510 visual features, 515 frames; para 43: frames; 45: face detection; 46); 
processing, for each speaker, the per-frame face embeddings of the face of the speaker using a video convolutional neural network to generate visual features for the face of the speaker (fig 1 135; fig 3, fig 5 520; 46 visual neural network; 47 convolutional network; 51 vector, extracted visual features); 
obtaining a spectrogram of an audio soundtrack for the video (fig 3; fig 5 525; 44 extract audio features, extract MFCCs, LPCs, LPCCs; 58 determine audio features for frame which characterize spectral content of speech – frequency representation of speech); 
processing the spectrogram using an audio convolutional neural network to generate an audio embedding for the audio soundtrack (fig 1, 3, 5; para 44 audio neural network…audio feature vector; 47 convolutional network; 59); 
combining the visual features for the one or more speakers and the audio embedding for the audio soundtrack to generate an audio-visual embedding for the video, wherein the audio-visual embedding represents both audio features of the audio soundtrack and visual features of the respective faces of the one or more speakers (fig 3; 5 535 combine output of visual neural network and audio neural network; para 52; 53: combine audio features and video features to generate combined feature vector; 85); 
processing, using a [masking] neural network, the audio-visual embedding for the video that represents both the audio features of the audio soundtrack and the visual features of the respective faces of the one or more speakers to generate a respective spectrogram [mask] for each of the one or more speakers, wherein each spectrogram [mask] describes time-frequency relationships between clean speech for a respective speaker and background interference in a spectrogram of the audio soundtrack (fig 3, 5; 44 classifier trained to generate predictions regarding speech status of subject, trained using known corpus of audio samples with associated speech statuses and may employ models associating audio features to speech statuses; 54 results from neural networks used by prediction engine; 70; 85-86: provide combined output to third neural network…to generate prediction regarding speech status of the subject – appears to be stored user profile information for a particular speaker and used to classify/identify input); and 
determining, from the respective spectrogram [masks] and the corresponding audio soundtrack, a respective isolated speech spectrogram for each speaker that isolates the speech of the speaker in the video (fig 3, 5; para 3: speaker detection; 5: detect a currently speaking subject, speaker diarization, speech separation; 10: determining whether the subject is speaking, liveness; 39: subject, such as a user).
	
	Yu (2017/0178666) teaches processing an acoustic signal comprising speech from multiple speakers to trace an individual speaker’s speech (abstract) and outputting a different mask for each speaker in the audio; the mask can be used to generate an isolated audio signal for an individual speaker (para 38):
[0015] The technology described herein uses a multiple-output layer RNN to process an acoustic signal comprising speech from multiple speakers to trace an individual speaker's speech. The multiple-output layer RNN has multiple output layers, each of which is meant to trace one speaker (or noise) and represents the mask for that speaker (or noise). 
	
	However, the closest references of record do not specifically teach:
processing, using a masking neural network, the audio-visual embedding for the video that represents both the audio features of the audio soundtrack and the visual features of the respective faces of the one or more speakers to generate a respective spectrogram mask for each of the one or more speakers, wherein each spectrogram mask describes time-frequency relationships between clean speech for a respective speaker and background interference in a spectrogram of the audio soundtrack; and 
determining, from the respective spectrogram masks and the corresponding audio soundtrack, a respective isolated speech spectrogram for each speaker that isolates the speech of the speaker in the video, 
wherein the video convolutional neural network, the audio convolutional neural network, and the masking neural network are trained end-to-end such that network parameters of the video convolutional neural network, the audio convolutional neural network, and the masking neural network are updated jointly on a same set of training samples.
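
For illustration only, the joint-update limitation recited above can be sketched with a toy model. This is a minimal sketch under heavy assumptions and not the applicant's training procedure: each of the three networks is replaced by a single scalar weight, the loss is a plain squared error, and the function name `joint_step` is hypothetical. It shows only the structural point that a single loss computed on the same training samples yields gradients for, and updates, all three parameter sets in one step.

```python
import numpy as np

def joint_step(params, x_video, x_audio, target, lr=0.1):
    """One gradient-descent step on a single shared loss.

    ASSUMPTION: toy stand-ins, one scalar weight per "network":
      v   = w_video * x_video      (video branch)
      a   = w_audio * x_audio      (audio branch)
      out = w_mask * (v + a)       (masking branch on the fused signal)
    """
    w_v, w_a, w_m = params["video"], params["audio"], params["mask"]
    v = w_v * x_video
    a = w_a * x_audio
    out = w_m * (v + a)

    err = out - target
    loss = np.mean(err ** 2)

    # Back-propagate the one loss through all three branches (chain rule).
    d_out = 2.0 * err / err.size
    grads = {
        "video": np.sum(d_out * w_m * x_video),
        "audio": np.sum(d_out * w_m * x_audio),
        "mask": np.sum(d_out * (v + a)),
    }
    # All three parameter sets are updated together, from the same samples.
    new_params = {name: params[name] - lr * grads[name] for name in params}
    return new_params, loss

# Toy usage with made-up training samples.
params0 = {"video": 0.5, "audio": 0.5, "mask": 0.5}
x_video = np.array([1.0, 2.0])
x_audio = np.array([0.5, 1.0])
target = np.array([1.0, 2.0])

params1, loss0 = joint_step(params0, x_video, x_audio, target)
params2, loss1 = joint_step(params1, x_video, x_audio, target)
```

By contrast, a pipeline in which the audio and visual networks are trained separately and then frozen would update only the final network's parameters in such a step.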


Therefore, the closest art of record does not teach or make obvious the limitations of the claim.

The additional independent claims are allowed for rationale and reasoning similar to that of claim 1.
The dependent claims are allowed as they further limit the parent claims.

6.	Any comments considered necessary by applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee.  Such submissions should be clearly labeled “Comments on Statement of Reasons for Allowance.”
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHAUN A ROBERTS whose telephone number is (571)270-7541.  The examiner can normally be reached Monday-Friday 9-5 EST.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Flanders, can be reached at 571-272-7516.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/SHAUN ROBERTS/
Primary Examiner, Art Unit 2655