Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
1.	This action is responsive to Application No. 16/761,707.  All claims have been examined and are currently pending.
	Claim 22 recites non-transitory computer-readable storage media and the claim is therefore 101 compliant.
Information Disclosure Statement
2.	The information disclosure statement (IDS) submitted is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.

Claim Rejections - 35 USC § 103
3.	In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  

4.	The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
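A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.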

5.	Claims 1-5, 9-10, 15-18, 21-22 are rejected under 35 U.S.C. 103 as being unpatentable over Marcheret et al (2017/0061966) in view of Yu (2017/0178666).

Regarding claim 1 Marcheret teaches A method (abstract: methods, computing devices, systems, computer-readable media) comprising: 
obtaining, for each frame in a stream of frames from a video in which faces of one or more speakers have been detected, a respective per-frame face embedding of the face of each speaker (fig 3 video; fig 5 510 visual features, 515 frames; para 43: frames; 45: face detection; 46); 
processing, for each speaker, the per-frame face embeddings of the face of the speaker using a video convolutional neural network to generate visual features for the face of the speaker (fig 1 135; fig 3, fig 5 520; 46 visual neural network; 47 convolutional network; 51 vector, extracted visual features); 
obtaining a spectrogram of an audio soundtrack for the video (fig 3; fig 5 525; 44 extract audio features, extract MFCCs, LPCs, LPCCs; 58 determine audio features for frame which characterize spectral content of speech – frequency representation of speech); 
processing the spectrogram using an audio convolutional neural network to generate an audio embedding for the audio soundtrack (fig 1, 3, 5; para 44 audio neural network…audio feature vector; 47 convolutional network; 59); 
combining the visual features for the one or more speakers and the audio embedding for the audio soundtrack to generate an audio-visual embedding for the video (fig 3; fig 5 535 combine output of visual neural network and audio neural network; para 52; 53: combine audio features and video features to generate combined feature vector; 85); 
determining, from the audio-visual embedding for the video, a respective spectrogram [mask] for each of the one or more speakers, wherein each spectrogram [mask] describes time- frequency relationships between clean speech for a respective speaker and background interference in a spectrogram of the audio soundtrack (fig 3, 5; 44 classifier trained to generate predictions regarding speech status of subject, trained using known corpus of audio samples with associated speech statuses and may employ models associating audio features to speech statuses; 54 results from neural networks used by prediction engine; 70; 85-86: provide combined output to third neural network…to generate prediction regarding speech status of the subject  – appears to be stored user profile information for a particular speaker and used to classify/identify input); and 
determining, from the respective spectrogram [masks] and the corresponding audio soundtrack, a respective isolated speech spectrogram for each speaker that isolates the speech of the speaker in the video (fig 3, 5
3: speaker detection
5 detect a currently speaking subject, speaker diarization, speech separation
10: determining whether the subject is speaking, liveness
39: subject, such as a user
Where prediction engine and speech status determine specific portions of audio (and video) that correspond to a specific speaker).  

	Marcheret does not specifically teach where Yu more clearly teaches spectrogram mask for a speaker (abstract; 15; 20; 29; 38).
	Yu (2017/0178666) teaches process an acoustic signal comprising speech from multiple speakers to trace an individual speaker’s speech (abstract) and output a different mask for each speaker in the audio; the mask can be used to generate an isolated audio signal for an individual speaker (38)
	It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate Yu for an improved system to better determine and isolate speech of a particular speaker.
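
For illustration only, the masking step recited in claim 1 and discussed in Yu (38) can be sketched as an element-wise multiplication of a mixture spectrogram by one mask per speaker. The function name, the nested-list layout, and the example values below are assumptions made for this sketch; they are not taken from the application, Marcheret, or Yu.

```python
# Illustrative sketch only: names and data layout are assumptions, not taken
# from the application, Marcheret, or Yu.

def apply_masks(mixture, masks):
    """Multiply a mixture magnitude spectrogram by each speaker's mask.

    mixture: list of time frames, each a list of per-frequency magnitudes.
    masks:   dict mapping a speaker label to a mask of the same shape,
             with values typically between 0 and 1.
    Returns a dict mapping each speaker label to an isolated spectrogram.
    """
    isolated = {}
    for speaker, mask in masks.items():
        if len(mask) != len(mixture):
            raise ValueError("mask and mixture must have the same number of frames")
        isolated[speaker] = [
            [m * x for m, x in zip(mask_frame, mix_frame)]
            for mask_frame, mix_frame in zip(mask, mixture)
        ]
    return isolated


# Hypothetical 2-frame, 2-bin mixture with two speakers.
mixture = [[2.0, 4.0], [6.0, 8.0]]
masks = {
    "speaker_a": [[1.0, 0.5], [0.0, 0.25]],
    "speaker_b": [[0.0, 0.5], [1.0, 0.75]],
}
isolated = apply_masks(mixture, masks)
```

In this toy example the two masks sum to one in every time-frequency bin, so the isolated spectrograms add back up to the mixture; that property is a choice of the example, not a requirement of the claim.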

Regarding claim 2 Marcheret and Yu teach The method of claim 1, further comprising: 
generating, from the respective isolated speech spectrogram for a particular one of the one or more speakers, an isolated speech signal for the particular speaker (Yu 20; 38 generate an isolated audio signal for an individual speaker).
Rejected for similar rationale and reasoning as claim 1  

Regarding claim 3 Marcheret teaches The method of claim 1, wherein obtaining the respective per- frame face embeddings comprises: 
(fig 3 video, 5; para 43 frames, 45-46; 53; 60 – obtaining video and frames); 
detecting, in each frame of the stream of frames, a respective face of each of one or more speakers (45 face detection); and 
generating, for each frame, a respective per-frame face embedding for each of the detected faces (fig 3, 5; para: 43; 45-47 image data and visual features for face for frames of video).  

Regarding claim 4 Marcheret teaches The method of claim 1, wherein combining the visual features for the one or more speakers and the audio embedding for the audio soundtrack to generate an audio-visual embedding for the video comprises: 
concatenating the visual features for the one or more speakers and the audio embedding for the audio soundtrack to generate the audio-visual embedding for the video (fig 3, 5; para 52; 53: combine audio features and video features to generate combined feature vector; 85).  
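
As a minimal sketch of the concatenation recited in claim 4, and assuming (hypothetically) that the visual features and the audio embedding are already aligned to the same time steps, the fused vector at each step may simply be the speakers' visual vectors followed by the audio vector. The names and dimensions below are illustrative assumptions only.

```python
# Illustrative sketch only: the per-time-step layout is an assumption.

def concatenate_av(visual_features, audio_embedding):
    """Concatenate per-speaker visual features with the audio embedding.

    visual_features: list with one entry per speaker; each entry is a list
                     of per-time-step feature vectors.
    audio_embedding: list of per-time-step audio vectors.
    Returns one fused (audio-visual) vector per time step.
    """
    fused = []
    for t, audio_vec in enumerate(audio_embedding):
        step = []
        for speaker_feats in visual_features:
            step.extend(speaker_feats[t])   # visual part, speaker by speaker
        step.extend(audio_vec)              # audio part appended last
        fused.append(step)
    return fused


visual = [
    [[1, 2], [3, 4]],   # speaker 0, two time steps
    [[5, 6], [7, 8]],   # speaker 1, two time steps
]
audio = [[9, 10, 11], [12, 13, 14]]
fused = concatenate_av(visual, audio)
```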

Regarding claim 5 Marcheret and Yu teach The method of claim 1, wherein determining from the audio-visual embedding for the video, a respective mask for each of the one or more speakers comprises: 
processing the audio-visual embedding for the video using a masking neural network, wherein the masking neural network is configured to process the audio-visual embedding for the video to generate a respective spectrogram mask for each of the one or more speakers (Marcheret: 44; 86 may provide the combined output to a third neural network for processing…to generate a prediction regarding the speech status of the subject);
and where Yu teaches The technology described herein uses a multiple-output layer RNN to process an acoustic signal comprising speech from multiple speakers to trace an individual speaker's speech. The multiple-output layer RNN has multiple output layers, each of which is meant to trace one speaker (or noise) and represent the mask for that speaker (or noise) (abstract).   
Rejected for similar rationale and reasoning as claim 1 where It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate Yu (and have the third network of Marcheret further incorporate speaker mask information to have a masking neural network) for an improved system to better determine and isolate speech of a particular speaker.
  
Regarding claim 9 Marcheret teaches The method of claim 1, further comprising: 
for each of one or more of the speakers, processing the isolated speech spectrogram for the speaker or data derived from the isolated speech spectrogram for the speaker using an automatic speech recognition (ASR) model to generate a transcription of the speech of the speaker in the video
(3: automatic speech recognition and speaker detection
5 detect a currently speaking subject, speaker diarization, speech separation
10: determining whether the subject is speaking, liveness
12: prediction regarding the speech status of the subject may be used in enhanced automated speech recognition
39: subject, such as a user – speech recognition of user speech and present as text).  


Regarding claim 10 Marcheret teaches A method of training a video convolutional neural network, an audio convolutional neural network, and a [masking] neural network (abstract: method; fig 1 audio neural network, visual neural network, AV fused neural network; 44: network may be trained; 47 convolutional network), the method comprising: 
obtaining training data comprising a plurality of training examples, each training example comprising (i) a respective training video and (ii) a ground truth isolated speech spectrogram of the speech of each of one or more speakers in the respective training video (44: trained using known corpus of audio samples with associated speech statuses; 51: visual neural network comprise classifier trained to generate predictions); and 
training the video convolutional neural network, the audio convolutional neural network, and the [masking] neural network on the training data, wherein the [masking] neural network is configured to generate a respective spectrogram [mask] for each speaker in an audio soundtrack, and wherein each spectrogram [mask] describes time-frequency relationships between clean speech for a respective speaker and background interference in a spectrogram of the audio soundtrack
44 classifier trained to generate predictions regarding speech status of subject, trained using known corpus of audio samples with associated speech statuses and may employ models associating audio features to speech statuses;
51: visual neural network 135 may comprise a software classifier trained to generate predictions regarding the speech status of a subject based on one or more input visual features. In particular, according to some aspects described herein, visual neural network 135 may be adapted to generate a prediction regarding the speech status of the subject based on a vector and/or set of vectors of scattering coefficients associated with the visual features
86 may provide the combined output to a third neural network for processing…to generate a prediction regarding the speech status of the subject).  
	Marcheret already teaches multiple neural networks and training (which includes audio visual standard/base examples), allowing for classification and identification of current input.  Marcheret does not specifically teach the mask limitations.
Yu (2017/0178666) teaches process an acoustic signal comprising speech from multiple speakers to trace an individual speaker’s speech (abstract) and output a different mask for each speaker in the audio; the mask can be used to generate an isolated audio signal for an individual speaker (38).  Yu teaches training a mask for each speaker (37-38).
	It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate Yu and training with the mask information for an improved system to better determine and isolate speech of a particular speaker.
	Further rejected for similar rationale and reasoning as claim 1 and 5


Regarding claim 15 Marcheret teaches The method of claim 1, wherein the video convolutional neural network comprises a set of weights, and wherein processing for each speaker, the per-frame face embeddings of the face of the speaker using the video convolutional neural network to generate visual features for the face of the speaker comprises: 
processing for each speaker, the per-frame face embeddings of the face of the speaker using the video convolutional neural network to generate visual features for the face of the speaker using the set of weights (42: network; layers; nodes, weights).  

Regarding claim 16 Yu teaches The method of claim 1, wherein the audio soundtrack of the video further comprises background noise, the method further comprising: 
determining, from the audio-visual embedding for the video, a background noise spectrogram mask for the background noise (abstract: represent the mask for that speaker (or noise); 3 audio signals include background noise and human speakers; 42; 52: speaker-specific output layers can be designed to capture background noise; 76:output layer associated with background noise). 
It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate Yu for an improved system to better determine and isolate speech of a particular speaker.  The reference already teaches a mask (representative information) for a speaker and it would have been obvious to further incorporate 
 
Regarding claim 17 Yu teaches The method of claim 16, wherein determining, from the respective masks and the corresponding audio soundtrack, a respective isolated speech signal for each speaker that isolates the speech of the speaker in the video comprises:  
masking the background noise of the corresponding audio soundtrack with the background noise spectrogram mask (abstract; 3; 12; 38 – allowing for masking of background noise).  
	Rejected for similar rationale and reasoning as claim 16

Regarding claim 18 Yu teaches The method of claim 1, wherein the respective spectrogram masks for each of the one or more speakers is a complex ideal ratio mask, the complex ideal ratio mask having a separately estimated real component and imaginary component (Yu 29 complex ideal ratio mask; 49).  
It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate Yu and the complex ideal ratio mask for improved signal separation.
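
For context, a complex ratio mask with separately handled real and imaginary components is commonly defined, for a single time-frequency bin, as the complex quotient of the clean value and the mixture value. The sketch below follows that general definition; it is an assumption-laden illustration with hypothetical function names and is not asserted to be the computation used in Yu or in the application.

```python
# Illustrative sketch only: a textbook-style complex ratio mask for one
# time-frequency bin. Function names are hypothetical.

def complex_ratio_mask(mixture_bin, clean_bin):
    """Return (real, imag) components of clean_bin / mixture_bin."""
    y_r, y_i = mixture_bin.real, mixture_bin.imag
    s_r, s_i = clean_bin.real, clean_bin.imag
    power = y_r * y_r + y_i * y_i
    if power == 0.0:
        return 0.0, 0.0          # assumption: a silent bin gets a zero mask
    m_real = (y_r * s_r + y_i * s_i) / power
    m_imag = (y_r * s_i - y_i * s_r) / power
    return m_real, m_imag


def apply_complex_mask(mixture_bin, m_real, m_imag):
    """Complex multiplication of the mask with the mixture bin."""
    return complex(m_real, m_imag) * mixture_bin


# Hypothetical bin: mixture 1+1j, clean speech 1+0j.
m_real, m_imag = complex_ratio_mask(1 + 1j, 1 + 0j)
estimate = apply_complex_mask(1 + 1j, m_real, m_imag)
```

Applying the mask to the mixture bin recovers the clean value in this example, which is what distinguishes a complex mask from a magnitude-only mask: it can correct phase as well as magnitude.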


Regarding claim 21 Marcheret and Yu teach A system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers, cause the one or more computers to perform operations comprising: 
obtaining, for each frame in a stream of frames from a video in which faces of one or more speakers have been detected, a respective per-frame face embedding of the face of each speaker; 
processing, for each speaker, the per-frame face embeddings of the face of the speaker using a video convolutional neural network to generate visual features for the face of the speaker;
obtaining a spectrogram of an audio soundtrack for the video; 
processing the spectrogram using an audio convolutional neural network to generate an audio embedding for the audio soundtrack; 
combining the visual features for the one or more speakers and the audio embedding for the audio soundtrack to generate an audio-visual embedding for the video; 
determining, from the audio-visual embedding for the video, a respective spectrogram mask for each of the one or more speakers, wherein each spectrogram mask describes time- frequency relationships between clean speech for a respective speaker and background interference in a spectrogram of the audio soundtrack; and 
determining, from the respective spectrogram masks and the corresponding audio soundtrack, a respective isolated speech spectrogram for each speaker that isolates the speech of the speaker in the video.  
Claim recites limitations similar to claim 1 and is rejected for similar rationale and reasoning 

Regarding claim 22 Marcheret and Yu teach One or more non-transitory computer-readable storage media storing instructions that when executed by one or more computers, cause the one or more computers to perform operations comprising: 
obtaining, for each frame in a stream of frames from a video in which faces of one or more speakers have been detected, a respective per-frame face embedding of the face of each speaker; 
processing, for each speaker, the per-frame face embeddings of the face of the speaker using a video convolutional neural network to generate visual features for the face of the speaker; 
obtaining a spectrogram of an audio soundtrack for the video; 
processing the spectrogram using an audio convolutional neural network to generate an audio embedding for the audio soundtrack; 
combining the visual features for the one or more speakers and the audio embedding for the audio soundtrack to generate an audio-visual embedding for the video; 
determining, from the audio-visual embedding for the video, a respective spectrogram mask for each of the one or more speakers, wherein each spectrogram mask describes time- frequency relationships between clean speech for a respective speaker and background interference in a spectrogram of the audio soundtrack; and 
determining, from the respective spectrogram masks and the corresponding audio soundtrack, a respective isolated speech spectrogram for each speaker that isolates the speech of the speaker in the video.
Claim recites limitations similar to claim 1 and is rejected for similar rationale and reasoning 



6.	Claims 6-8 are rejected under 35 U.S.C. 103 as being unpatentable over Marcheret et al (2017/0061966) in view of Yu (2017/0178666) in further view of Droppo et al (2018/0254040).

Regarding claim 6 Marcheret teaches neural network and layers (42 network, layer, node, weight) but does not specifically teach where Droppo teaches The method of claim 5, wherein the masking neural network includes one or more long short-term memory (LSTM) layers followed by one or more other neural network layers (47 bidirectional long short term memory).  
It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate Droppo for more efficient network processing and predictions. 

Regarding claim 7 Marcheret teaches The method of claim 6, wherein the one or more other neural network layers include one or more fully connected layers (42).  
	Rejected for similar rationale and reasoning as claim 6

Regarding claim 8 Droppo teaches The method of claim 6, wherein the one or more LSTM layers are bidirectional LSTM layers (47).  
Rejected for similar rationale and reasoning as claim 6


7.	Claims 11 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Marcheret et al (2017/0061966) in view of Yu (2017/0178666) in further view of Stork et al (5,621,858).

Regarding claim 11 Marcheret teaches The method of claim 10, wherein obtaining the training data comprises, for each of the training examples: 
obtaining, for each of one or more speakers, a clean video of the speaker speaking and a corresponding clean audio soundtrack of speech of the speaker from the clean video (44; 51 – speech and video of speakers for training); 
generating, from at least the clean video and corresponding clean audio soundtrack of the one or more speakers, a mixed video and a mixed audio soundtrack (44; 51; 52; 85-86 – fused AV neural network which has AV components and therefore trained on AV samples); and 
generating the training example by associating the mixed video with, for each of the one or more speakers, a spectrogram corresponding to the respective clean audio soundtrack of the speech of the speaker (44; 51 generating training sample using input information). 
Marcheret teaches training for the audio and video portions, which would include acquiring audio and video examples to be compared against for input classification.  Marcheret, however, does not go into further detail regarding obtaining the video, where Stork teaches acoustic and visual information was collected from four male subjects…resulting in tokens, each token was converted into visual, acoustic, and full acoustic and video vectors suitable for use in classification (col 15 l. 5-15).
It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate Stork to allow for proper training for improved prediction of Marcheret and presenting a reasonable expectation of success.
The references would therefore allow for the teaching of 
generating the training example by associating the mixed video with, for each of the one or more speakers, a spectrogram corresponding to the respective clean audio soundtrack of the speech of the speaker.

Regarding claim 13 Marcheret, Yu, and Stork teach The method of claim 11, wherein each training example comprises (i) a respective training video of a plurality of speakers and (ii) a respective ground truth isolated speech spectrogram of the speech of each of the plurality of speakers in the respective training video; and 
wherein generating the mixed video with the mixed audio soundtrack comprises mixing the training video for the plurality of speakers and mixing the respective clean audio soundtracks of the plurality of speakers.  
Rejected for similar rationale and reasoning as claim 11 where Stork teaches acoustic and visual information from multiple speakers


8.	Claims 12 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Marcheret et al (2017/0061966) in view of Yu (2017/0178666) in further view of Stork et al (5,621,858) in further view of Burges et al (2004/0260550).

Regarding claim 12 Marcheret, Yu, and Stork teach The method of claim 11, wherein each training example comprises (i) a respective training video of a single speaker and (ii) a ground truth isolated speech spectrogram of the speech of the single speaker in the respective training video, and [wherein] generating the mixed video and the mixed audio soundtrack 
but do not specifically teach, where Burges teaches, augmenting the clean audio soundtrack of the speech of the single speaker with noise (102: For training, audio data from each of the S speakers is used (along with extra examples with added noise)).
It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate Burges for improved training and subsequent classification.


Regarding claim 14 Marcheret, Yu, Stork, and Burges teach The method of claim 13, wherein generating the mixed video and the mixed audio soundtrack comprises augmenting the mixed audio soundtracks of the speech of the plurality of speakers with noise.  
Rejected for similar rationale and reasoning as claim 12
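
Purely as an illustration of the audio side of the training-data limitations addressed for claims 11-14, a mixed soundtrack can be formed by summing clean per-speaker tracks sample by sample and then adding scaled noise, with the clean tracks retained as targets. The helper names, the fixed noise samples, and the gain value are assumptions for this sketch and do not come from the application, Stork, or Burges; the mixed video is not modeled here.

```python
# Illustrative sketch only: helper names, sample values, and the gain are
# assumptions. Only the audio portion of a training example is modeled.

def mix_tracks(tracks):
    """Sum clean tracks sample by sample, truncating to the shortest track."""
    length = min(len(t) for t in tracks)
    return [sum(t[i] for t in tracks) for i in range(length)]


def add_noise(track, noise, gain):
    """Augment a track with noise scaled by gain."""
    return [x + gain * n for x, n in zip(track, noise)]


clean_a = [0.5, -0.5, 0.25]        # hypothetical clean speech, speaker A
clean_b = [0.25, 0.25, -0.25]      # hypothetical clean speech, speaker B
noise = [1.0, -1.0, 0.5]           # stand-in for a recorded noise clip

mixed = mix_tracks([clean_a, clean_b])
noisy_mixed = add_noise(mixed, noise, gain=0.5)

# The clean tracks are kept as the per-speaker targets for the mixed input.
training_example = {
    "mixed_audio": noisy_mixed,
    "targets": {"speaker_a": clean_a, "speaker_b": clean_b},
}
```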

9.	Claims 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Marcheret et al (2017/0061966) in view of Yu (2017/0178666) in further view of McCord et al (2018/0082679)
	
Regarding claim 19 Marcheret does not specifically teach where McCord teaches  The method of claim 1, wherein the audio convolutional neural network is an audio dilated convolutional neural network (abstract dilated convolutional neural network; 42; 65).  
It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate dilated convolutional neural network for improved network efficiency.

Regarding claim 20 Marcheret does not specifically teach where McCord teaches the method of claim 1, wherein the video convolutional neural network is a video dilated convolutional neural network.  
Rejected for similar rationale and reasoning as claim 19

Conclusion
10.	The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: See PTO-892.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHAUN A ROBERTS whose telephone number is (571)270-7541.  The examiner can normally be reached Monday-Friday 9-5 EST.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Flanders can be reached on 571-272-7516.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/SHAUN ROBERTS/
Primary Examiner, Art Unit 2655