DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

This Office Action is in response to correspondence filed 10 October 2019 in reference to application 16/598,172.  Claims 1-20 are pending and have been examined.

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1 and 8-12 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Le Roux et al. (US PAP 2019/0318754).

Consider claim 1, Le Roux teaches a method implemented by one or more processors (abstract, figure 1B), the method comprising: 
generating a speaker embedding for a human speaker (0060, 0068, forming embeddings for each of the speakers), wherein generating the speaker embedding for the human speaker comprises: 

generating the speaker embedding based on one or more instances of output each generated based on processing a respective of the one or more instances of speaker audio data using the trained speaker embedding model (0068, generating embeddings for each of the speakers); 
receiving audio data that captures one or more utterances of the human speaker and that also captures one or more additional sounds that are not from the human speaker (0043, 0060, 0073, input audio of mixture or sources, including target person); 
generating a refined version of the audio data, wherein the refined version of the audio data isolates the one or more utterances of the human speaker from the one or more additional sounds that are not from the human speaker (0046-47, 0060, 0076 generating refined separate signals for each source), and wherein generating the refined version of the audio data comprises: 
processing the audio data using a frequency transformation to generate an audio spectrogram, wherein the audio spectrogram is a frequency domain representation of the audio data (0060, STFT of input, short-time Fourier transform); 
processing the audio spectrogram and the speaker embedding using a trained voice filter model to generate a predicted mask, wherein the predicted mask isolates the one or more utterances of the human speaker from the one or 
generating a masked spectrogram by processing the audio spectrogram using the predicted mask, wherein the masked spectrogram captures the one or more utterances of the human speaker and not the one or more additional sounds (0046-47, 0060, 0076 using masks and spectrogram estimates to generate estimated target spectrograms); and  34Attorney Docket No. ZS202-19697 
generating the refined version of the audio data by processing the masked spectrogram using an inverse of the frequency transformation (0060, iSTFT signal reconstruction back to time domain.).
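
For illustration only, the sequence of steps mapped above for claim 1 (frequency transformation, predicted mask, masked spectrogram, inverse transformation) can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Le Roux or from the application: frames are non-overlapping with a rectangular window, the mask is applied by element-wise multiplication, and `predict_mask` is a hypothetical stand-in for the trained voice filter model.

```python
import cmath
import math

def dft(frame):
    """Naive discrete Fourier transform of one frame (O(N^2), illustration only)."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse of dft(); keeps the real part of each reconstructed sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]

def stft(samples, frame_len):
    """Assumption: non-overlapping frames, rectangular window, trailing remainder dropped."""
    usable = len(samples) - len(samples) % frame_len
    return [dft(samples[i:i + frame_len]) for i in range(0, usable, frame_len)]

def apply_mask(spectrogram, mask):
    """Assumption: the mask is applied as an element-wise product."""
    return [[m * s for m, s in zip(mask_row, spec_row)]
            for mask_row, spec_row in zip(mask, spectrogram)]

def istft(spectrogram):
    """Invert each frame and concatenate the frames back into a time-domain signal."""
    out = []
    for frame in spectrogram:
        out.extend(idft(frame))
    return out

def refine(samples, frame_len, predict_mask):
    """predict_mask is a hypothetical callable standing in for the trained model."""
    spectrogram = stft(samples, frame_len)
    mask = predict_mask(spectrogram)
    return istft(apply_mask(spectrogram, mask))
```

In this sketch an all-ones mask returns the input unchanged, while a mask that keeps only the frequency bins occupied by the target removes the other component.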

Consider claim 8, Le Roux teaches the method of claim 1, wherein the frequency transformation is a Fourier transform, and wherein the inverse of the frequency transformation is an inverse Fourier transform (0060, STFT (short-time Fourier transform) and iSTFT (inverse short-time Fourier transform)).

Consider claim 9, Le Roux teaches the method of claim 1, wherein the trained speaker embedding model is a recurrent neural network model (0060, BLSTM, which is a type of recurrent architecture).

Consider claim 10, Le Roux teaches the method of claim 1, wherein generating a masked spectrogram by processing the audio spectrogram using the predicted mask comprises: convolving the predicted mask with the audio spectrogram to generate the 

Consider claim 11, Le Roux teaches the method of claim 1, wherein the one or more additional sounds of the audio data that are not from the human speaker captures one or more utterances of an additional human speaker that is not the human speaker (0060, 0068, separating each of the speakers, multiple speakers), and further comprising: 
generating an additional speaker embedding for the additional human speaker (0060, 0068, forming embeddings for each of the speakers, multiple speakers), wherein generating the additional speaker embedding comprises: 
processing one or more instances of additional speaker audio data corresponding to the additional speaker using the trained speaker embedding model (0068, clustering into time frequency components dominated by same speaker), and 
generating the additional speaker embedding based on one or more instances of additional output each generated based on processing a respective of the one or more instances of additional speaker audio data using the trained speaker embedding model (0068, generating embeddings for each of the speakers); 
generating an additional refined version of the audio data, wherein the additional refined version of the audio data isolates the one or more utterances of the additional speaker from the one or more utterances of the human speaker and from the one or 
processing the audio spectrogram and the additional speaker embedding using the trained voice filter model to generate an additional predicted mask, wherein the additional predicted mask isolates the one or more utterances of the additional human speaker from the one or more utterances of the human speaker and the one or more additional sounds in the audio spectrogram (0046-47, 0060, 0076 using embedding to generate predicted spectrograms and masks for the target); 
generating an additional masked spectrogram by processing the audio spectrogram using the additional predicted mask, wherein the additional masked spectrogram captures the one or more utterances of the additional human speaker and not the one or more utterances of the human speaker and not the one or more additional sounds (0046-47, 0060, 0076 using masks and spectrogram estimates to generate estimated target spectrograms); and 
generating the additional refined version of the audio data by processing the additional masked spectrogram using the inverse of the frequency transformation (0046-47, 0060, 0076 using masks and spectrogram estimates to generate estimated target spectrograms).

Consider claim 12, Le Roux teaches the method of claim 1, wherein the audio data is captured via one or more microphones of a client device (0055, microphones).

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 3 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Le Roux in view of Gorodetski et al. (US PAP 2016/0217792).

Consider claim 3, Le Roux teaches the method of claim 1, further comprising: 
processing the refined version of the audio data using the trained speaker embedding model to generate refined output (0060, iSTFT signal reconstruction back to time domain).
Le Roux does not specifically teach determining whether the human speaker spoke the refined version of the audio data by comparing the refined output with the speaker embedding for the human speaker.
In the same field of speaker embeddings, Gorodetski teaches determining whether the human speaker spoke the refined version of the audio data by comparing 
Therefore it would have been obvious to one of ordinary skill in the art at the time of effective filing to use embeddings for identification as taught by Gorodetski in the system of Le Roux in order to allow for more efficient source signal separation (Gorodetski 0008-10).
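
For illustration only, the comparison recited in claim 3 (comparing the refined output with the speaker embedding) can be sketched as a similarity test between two vectors. This is a minimal sketch under stated assumptions: cosine similarity is one possible comparison and is not stated in the references, and the `threshold` value of 0.7 is hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat an all-zero vector as matching nothing.
    return dot / (norm_a * norm_b)

def same_speaker(refined_output, speaker_embedding, threshold=0.7):
    """Hypothetical decision rule: similarity at or above the threshold is a match."""
    return cosine_similarity(refined_output, speaker_embedding) >= threshold
```

Because cosine similarity ignores vector magnitude, the decision in this sketch depends only on the direction of the two embeddings.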

Consider claim 13, Le Roux teaches the method of claim 12, wherein the one or more instances of the speaker audio data used in generating the speaker embedding comprise an instance that is based on the audio data (0068, generating embeddings for each of the speakers.. which is based on audio), but does not specifically teach: 
identifying the instance based on the instance being from an initial occurrence of voice activity detection in the audio data.
In the same field of speaker embeddings, Gorodetski teaches identifying the instance based on the instance being from an initial occurrence of voice activity detection in the audio data (0040, VAD used to determine speech segments for source separation).
Therefore it would have been obvious to one of ordinary skill in the art at the time of effective filing to use VAD as taught by Gorodetski in the system of Le Roux in order to allow for more efficient source signal separation (Gorodetski 0008-10).

Claims 4-7 are rejected under 35 U.S.C. 103 as being unpatentable over Le Roux and Gorodetski as applied to claim 3 above, and further in view of Wolff et al. (US PAP 2015/0046157).

Consider claim 4, Le Roux and Gorodetski teach the method of claim 3 but do not specifically teach in response to determining the human speaker spoke the refined version of the audio data, performing one or more actions that are based on the refined version of the audio data.
In the same field of speech separation, Wolff teaches in response to determining the human speaker spoke the refined version of the audio data, performing one or more actions that are based on the refined version of the audio data (0020,0027, commands from selected user only, to control a video game console or television for example).
Therefore it would have been obvious to one of ordinary skill in the art at the time of effective filing to respond to audio commands of a select user in a group of users as taught by Wolff in the system of Le Roux and Gorodetski in order to more accurately provide speech recognition device control (Wolff 0002-04).

Consider claim 5, Wolff teaches the method of claim 4, wherein performing one or more actions that are based on the refined version of the audio data comprises: 
generating responsive content that is customized for the human speaker and that is based on the refined version of the audio data (0020,0024-27, dialogs with a single user when in selective mode); and 


Consider claim 6, Le Roux and Gorodetski teach the method of claim 3 but do not specifically teach in response to determining the human speaker did not speak the refined version of the audio data, performing one or more actions that are based on the audio data.
In the same field of speech separation, Wolff teaches in response to determining the human speaker did not speak the refined version of the audio data, performing one or more actions that are based on the audio data (0023, broad speech mode accepting speech from anybody in the room for control).
Therefore it would have been obvious to one of ordinary skill in the art at the time of effective filing to respond to audio commands of a group of users as taught by Wolff in the system of Le Roux and Gorodetski in order to more accurately provide speech recognition device control (Wolff 0002-04).

Consider claim 7, Wolff teaches the method of claim 6, wherein performing one or more actions that are based on the refined version of the audio data comprises: 
generating responsive content that is customized for the human speaker and that is based on the refined version of the audio data (0020,0024-27, dialogs with a single user when in selective mode); and 


Allowable Subject Matter
Claims 2 and 14-18 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.  The following is a statement of reasons for the indication of allowable subject matter:  

Consider claim 2, Le Roux teaches the method of claim 1, wherein the trained voice filter model comprises a recurrent neural network portion (0060, BLSTM, which is a type of recurrent architecture).
However the prior art of record does not specifically teach the limitations of “wherein the trained voice filter model comprises a convolutional neural network portion, a recurrent neural network portion, and a fully connected feed-forward neural network portion, and wherein processing the audio spectrogram and the speaker embedding using a trained voice filter model to generate a predicted mask comprises: processing the audio spectrogram using the convolutional neural network portion of the trained voice filter model to generate convolutional output; processing the speaker embedding and the convolutional output using the recurrent neural network portion of the trained voice filter model to generate recurrent output; and processing the recurrent output 

Consider claim 14, Le Roux teaches the method of claim 1, wherein the sequence of audio data is captured via one or more microphones of a client device (0055, microphones).  However the prior art of record does not teach or fairly suggest the limitations of “wherein generating the speaker embedding for the human speaker occurs prior to the sequence of audio data being captured via the one or more microphones of the client device” when combined with each and every other limitation of the claim, the base claim and intervening claims.  Therefore claim 14 contains allowable subject matter.

Claims 15-18 depend on and further limit claim 14 and therefore contain allowable subject matter as well.

Claims 19 and 20 are allowed.  The following is an examiner’s statement of reasons for allowance: 

Consider claim 19, Le Roux teaches a method of training a machine learning model to generate refined versions of audio data that isolate any utterances of a target 
identifying an instance of audio data that includes spoken input from only a first human speaker (0068, clustering into time frequency components dominated by same speaker); 
generating a speaker embedding for the first human speaker (0060, 0068, forming embeddings for each of the speakers); 
identifying an additional instance of audio data that lacks any spoken input from the first human speaker, and that includes spoken input from at least one additional human speaker (0068, clustering into time frequency components dominated by same speaker, for each speaker); 
generating a mixed instance of audio data that combines the instance of audio data and the additional instance of audio data (0073, input mixture); 
processing the mixed instance of audio data and the speaker embedding using the machine learning model by:
 processing the mixed instance of audio data using a frequency transformation to generate a mixed audio spectrogram, wherein the mixed audio spectrogram is a frequency domain representation of the mixed audio data (0073 spectrogram of mixed signal); 
processing the mixed audio spectrogram using the predicted mask to generate a masked spectrogram (0073-76, prediction of masks); 

generating a loss based on comparison of the predicted audio spectrogram and the masked spectrogram (0076, error calculation); and 
updating one or more weights of the machine learning model based on the loss (0076, loss function used to update neural network models).
However the prior art of record does not specifically teach
“processing the mixed audio data spectrogram using a convolutional neural network portion of the machine learning model to generate convolutional output; 
processing the convolutional output and the speaker embedding using a recurrent neural network portion of the machine learning model to generate recurrent output; 
processing the recurrent output using a fully connected feed-forward neural network portion of the machine learning model to generate a predicted mask;” when combined with each and every other limitation of the claim.  Therefore claim 19 is allowable.
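
For illustration only, two of the training steps mapped above for claim 19 (generating a mixed instance of audio data, and generating a loss by comparing two spectrograms) can be sketched as follows. This is a minimal sketch under stated assumptions: sample-wise addition for mixing and a mean squared error over magnitudes for the loss are illustrative choices, not taken from Le Roux or from the application, and both function names are hypothetical.

```python
def mix(clean, interference):
    """Assumption: mixing is sample-wise addition, truncated to the shorter input."""
    n = min(len(clean), len(interference))
    return [clean[i] + interference[i] for i in range(n)]

def spectrogram_mse(predicted, reference):
    """Mean squared error between the magnitudes of two equally shaped spectrograms."""
    total = 0.0
    count = 0
    for predicted_row, reference_row in zip(predicted, reference):
        for p, r in zip(predicted_row, reference_row):
            total += (abs(p) - abs(r)) ** 2  # abs() yields the magnitude of a complex bin.
            count += 1
    return total / count if count else 0.0
```

A loss of zero in this sketch means the masked spectrogram matches the reference in magnitude at every time-frequency bin; the weight update itself is outside the scope of the sketch.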

Consider claim 20, Le Roux teaches a method implemented by one or more processors (abstract, figure 1B), the method comprising: 

in response to invoking the automated assistant client:
 performing certain processing of initial spoken input received via one or more microphones of the client device (0055, microphones, inputting of signals); 
identifying a speaker embedding for the human speaker that provided the spoken input (0060, 0068, forming embeddings for each of the speakers); 
generating a refined version of the audio data that isolates any of the audio data that is from the human speaker (0046-47, 0060, 0076 generating refined separate signals for each source), wherein generating the refined version of the audio data comprises: 
processing the audio data using a frequency transformation to generate an audio spectrogram, wherein the audio spectrogram is a frequency domain representation of the audio data (0060, STFT of input, short-time Fourier transform); 
processing the audio spectrogram and the speaker embedding using a trained voice filter model to generate a predicted mask, (0046-47, 0060, 0076 using embedding to generate predicted spectrograms and masks for the target); 
generating a masked spectrogram by processing the audio spectrogram using the predicted mask, wherein the masked spectrogram captures the one or 
generating the refined version of the audio data by processing the masked spectrogram using an inverse of the frequency transformation (0060, iSTFT signal reconstruction back to time domain).
Le Roux does not specifically teach 
generating a responsive action based on the certain processing of the initial spoken input; 
causing performance of the responsive action; 
determining that a continued listening mode is activated for the automated assistant client device; in response to the continued listening mode being activated: 
automatically monitoring for additional spoken input after causing performance of at least part of the responsive action; 
receiving audio data during the automatically monitoring; 
determining whether the audio data includes any additional spoken input that is from the same human speaker that provided the initial spoken input.
In the same field of speech separation, Wolff teaches 
generating a responsive action based on the certain processing of the initial spoken input (0020,0024-27, dialogs with a single user when in selective mode, dialogs require rendering output); 
causing performance of the responsive action (0020,0024-27, dialogs with a single user when in selective mode, dialogs require rendering output); 

automatically monitoring for additional spoken input after causing performance of at least part of the responsive action (0024-27, listening for additional speech to continue dialog); 
receiving audio data during the automatically monitoring (0024-27, listening for additional speech to continue dialog); 
determining whether the audio data includes any additional spoken input that is from the same human speaker that provided the initial spoken input (0024-27, listening for additional speech to continue dialog in selective mode from same person).
However the prior art of record does not teach or fairly suggest the limitations of “determining whether the audio data includes the any additional spoken input that is from the same human based on whether any portions of the refined version of the audio data correspond to at least a threshold level of audio” when combined with each and every other limitation of the claim.  Therefore claim 20 is allowable.
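
For illustration only, the allowable limitation of claim 20 (whether any portions of the refined version of the audio data correspond to at least a threshold level of audio) can be sketched as a frame-wise level check. This is a minimal sketch under stated assumptions: root-mean-square level is one possible measure of "level of audio", and the `frame_len` and `threshold` values are hypothetical and not taken from the application.

```python
import math

def rms(samples):
    """Root-mean-square level of a block of samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def has_speech_from_speaker(refined, frame_len=4, threshold=0.1):
    """True if any frame of the refined audio reaches the hypothetical threshold."""
    for start in range(0, len(refined), frame_len):
        if rms(refined[start:start + frame_len]) >= threshold:
            return True
    return False
```

Because the refined audio is expected to retain only the target speaker, a near-silent refined signal in this sketch indicates that the additional input did not come from that speaker.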

Any comments considered necessary by applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee.  Such submissions should be clearly labeled “Comments on Statement of Reasons for Allowance.”

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DOUGLAS C GODBOLD whose telephone number is (571)270-1451.  The examiner can normally be reached on 7:30-12 Monday and Friday, 7:30-6 Tuesday-Thursday.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Richemond Dorvil can be reached on (571) 272-7602.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.


DOUGLAS GODBOLD
Examiner
Art Unit 2658



/DOUGLAS GODBOLD/Primary Examiner, Art Unit 2658