DETAILED ACTION
Introduction
This office action is in response to Applicant’s submission filed on 5/1/2020. Claims 1-22 are pending in the application and have been examined.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement

The information disclosure statement (IDS) submitted on 5/1/2020 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-7, 10-18, and 21-22 are rejected under 35 U.S.C. 103 as being unpatentable over O. Cetin and E. Shriberg, "Speaker Overlaps and ASR Errors in Meetings: Effects Before, During, and After the Overlap," 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, 2006, pp. I-I, in view of Yoshioka, T., Erdogan, H., Chen, Z., Xiao, X., & Alleva, F. (2018). Recognizing overlapped speech in meetings: A multichannel separation approach using neural networks. arXiv preprint arXiv:1810.03655.
	Regarding claim 1, Cetin teaches a method of training a speech recognition model with a loss function, the method comprising: receiving, at data processing hardware, a training example comprising an audio signal including a first segment corresponding to audio spoken by a first speaker, a second segment corresponding to audio spoken by a second speaker, and an overlapping region where the first segment overlaps the second segment, the overlapping region comprising a known start time and a known end time (see Cetin, Fig. 1. Illustration of experiment conditions.
When A is taken as the foreground speaker, B and C are background speakers. For the cross-talk condition, full original audio from B and C are added to A. For the background-noise condition, B and C are added only in the cases in which they do not contain any speech (for example, during the overlap marked DURING, B is not added to A, and only C is added). The regions marked BEFORE and AFTER in A are nonoverlaps). However, Cetin does not teach for each of the first speaker and the second speaker, generating, by the data processing hardware, a respective masked audio embedding based on the training sample; determining, by the data processing hardware, whether the first speaker was speaking: prior to the known start time of the overlapping region; or after the known end time of the overlapping region; when the first speaker was speaking prior to the known start time of the overlapping region, applying, by the data processing hardware, to the respective masked audio embedding for the first speaker, a first masking loss after the known end time; and when the first speaker was speaking after the known end time of the overlapping region, applying, by the data processing hardware, to the respective masked audio embedding for the first speaker, the first masking loss before the known start time. However, Yoshioka
teaches for each of the first speaker and the second speaker, generating, by the data processing hardware, a respective masked audio embedding based on the training sample (see Yoshioka, pg. 3039, sect. 2.2 and Fig. 2, which teach generating spectral masks for each of the speakers); determining, by the data processing hardware, whether the first speaker was speaking: prior to the known start time of the overlapping region or after the known end time of the overlapping region (see Yoshioka, pg. 3039, sect. 2.2.2, which teaches how each signal can be a single utterance or a mixture of two utterances with different length levels, and reverberations, corrupted by background noise; Yoshioka, pg. 3040, sect. 2.2.3 and sect. 2.3.1, which teach that, after the permutation alignment processing, the masks for the nonoverlapping frames of the current window are used, and that another output channel is added to the separation network so that the noise masks can also be obtained; interpreted as first speaker or second speaker and overlap); when the first speaker was speaking prior to the known start time of the overlapping region, applying, by the data processing hardware, to the respective masked audio embedding for the first speaker, a first masking loss after the known
end time (see Yoshioka, Fig. 3, sect. 2.2.2 and sect. 2.3.1; speech mask 1 in Fig. 3 + noise mask to compute the mask embedding loss); and when the first speaker was speaking after the known end time of the overlapping region, applying, by the data processing hardware, to the respective masked audio embedding for the first speaker, the first masking loss before the known start time (see Yoshioka, Fig. 3, sect. 2.2.2 and sect. 2.3.1; speech mask 2 in Fig. 3 + noise mask to compute the mask embedding loss).
	Cetin and Yoshioka are considered to be analogous to the claimed invention because they relate to ASR systems to be able to recognize overlapped speech. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Cetin on analyzing and processing locations of overlapping speech and then using the unmixing transducer and ASR processing teachings of Yoshioka to improve the computational ability of transcribing individual utterances that may or may not be overlapping in the multi-talker settings (see Yoshioka, pg. 3038, sect. 1).
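For illustration only, the conditional placement of the first masking loss recited in claim 1 can be sketched as follows. This sketch is not taken from Cetin, Yoshioka, or the application as filed. The function names, the frame-index convention (overlap_end is exclusive), and the mean-squared form of the loss are all assumptions made for the example.

```python
from typing import Sequence


def masking_loss_frames(num_frames: int, overlap_start: int, overlap_end: int,
                        spoke_before_overlap: bool) -> range:
    """Frames where the masking loss is applied to one speaker's masked embedding.

    Assumed convention: frames are 0-indexed and overlap_end is exclusive.
    """
    if not 0 <= overlap_start <= overlap_end <= num_frames:
        raise ValueError("overlap must lie inside the utterance")
    if spoke_before_overlap:
        # Speaker talked before the overlap: penalize frames after its end.
        return range(overlap_end, num_frames)
    # Speaker talked after the overlap: penalize frames before its start.
    return range(0, overlap_start)


def masking_loss(masked_embedding: Sequence[Sequence[float]],
                 frames: range) -> float:
    """Mean squared value over the selected frames (hypothetical loss form)."""
    total, count = 0.0, 0
    for t in frames:
        for value in masked_embedding[t]:
            total += value * value
            count += 1
    return total / count if count else 0.0
```

Under these assumptions, a speaker who spoke only before the overlap is penalized for any energy that the masked embedding retains after the overlap ends, and vice versa.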
	Regarding claim 2 Cetin and Yoshioka teach the method of claim 1. Cetin further teaches when the first speaker was speaking prior to the known start time of the overlapping region, the first speaker was not speaking after the known end time of the overlapping region (see Cetin, sect. 2.3 discusses creating the set of data based on adding specific channels with a speaker and background noise in a time-synchronous fashion after weighting a factor to adjust cross-talk severity; foreground speaker with single speaker overlap is interpreted as being extended by a person skilled in the art for a first speaker before known time of overlapping); and when the first speaker was speaking after the known end time of the overlapping region, the first speaker was not speaking prior to the known start time of the overlapping region (see Cetin, sect. 2.3 discusses creating the set of data based on adding specific channels with a speaker and background noise in a time-synchronous fashion after weighting a factor to adjust cross-talk severity; foreground speaker with single speaker overlap is interpreted as being extended by a person skilled in the art as first speaker after known time of overlapping).
Regarding claim 3, Cetin and Yoshioka teach the method of claim 1. Yoshioka further teaches when the first speaker was speaking prior to the known start time of the overlapping region, applying, by the data processing hardware, to the respective masked audio embedding for the second speaker, a second masking loss prior to the known start time of the overlapping region (see Yoshioka, pg. 3040, sect. 2.3 and equation 5, which teach using a mask-based beamforming approach to compute the output signals, with the noise mask calculated to overcome the interference; the masking approach is described in Yoshioka, pg. 3039, sect. 2.2; in Fig. 3, the speaker 1 mask and noise mask are interpreted as first speaker was speaking prior to the known start time of the overlapping region and a second masking loss prior to the known start time of the overlapping region, respectively); and when the first speaker was speaking after the known end time of the overlapping region, applying, by the data processing hardware, to the respective masked audio embedding for the second speaker, the second masking loss after the known end time of the overlapping region (see Yoshioka, pg. 3040, sect. 2.3 and equation 5, which teach using a mask-based beamforming approach to compute the output signals, with the noise mask calculated to overcome the interference; the masking approach is described in Yoshioka, pg. 3039, sect. 2.2; in Fig. 3, the speaker 2 mask and noise mask are interpreted as first speaker was speaking after the known start time of the overlapping region and a second masking loss after the known start time of the overlapping region, respectively).
Regarding claim 4, Cetin and Yoshioka teach the method of claim 3. Yoshioka further teaches for each of the respective masked audio embeddings generated for the first speaker and the second speaker: computing, by the data processing hardware, a respective average speaker embedding for the respective one of the first speaker or the second speaker inside the overlapping region (see Yoshioka, pg. 3040, sect. 2.3.1, which teaches computation of the interference spatial covariance matrix, obtaining the noise masks, and the squared error in noise estimation; the noise estimation is interpreted as the average speaker embedding inside the overlapping region); and computing, by the data processing hardware, a respective average speaker embedding for the respective one of the first speaker or the second speaker outside the overlapping region (see Yoshioka, pg. 3040, sect. 2.3, which teaches how to obtain the output signal using the beamforming approach; this is interpreted as the average speaker embedding outside the overlapping region); determining, by the data processing hardware, an embedding loss based on a function of the average speaker embedding computed for the respective masked audio embedding for the first speaker inside the overlapping region, the average speaker embedding computed for the respective masked audio embedding for the second speaker inside the overlapping region, the average speaker embedding computed for the respective masked audio embedding for the first speaker outside the overlapping region, and the average speaking embedding computed for the respective masked audio embedding for the second speaker outside the overlapping region (see Yoshioka, pg. 3040, sect. 2.3.1, which teaches that the loss function is defined as the sum of the PIT loss and the squared error in noise estimation; the PIT loss is interpreted as the embedding loss of the speaker and the mean squared error is interpreted as the overlapping region loss); and applying, by the data processing hardware, the embedding loss to each of: the respective masked audio embedding generated for the first speaker to enforce that an entirety of the respective masked audio embedding generated for the first speaker corresponds to only audio spoken by the first speaker (see Yoshioka, pg. 3040, sect. 2.1-2.2, which describe obtaining the output signal yi,t,f based on the input signal and mask, and the unmixed signal si,t,f; the loss of the difference of yi,t,f and si,t,f is interpreted as the embedding loss of the speaker corresponding to the audio spoken by the first speaker); and the respective masked audio embedding generated for the second speaker to enforce that an entirety of the respective masked audio embedding generated for the second speaker corresponds to only audio spoken by the second speaker (see Yoshioka, pg. 3040, sect. 2.1-2.2, which describe obtaining the output signal yi,t,f based on the input signal and mask, and the unmixed signal si,t,f; the loss of the difference of yi,t,f and si,t,f is interpreted as the embedding loss of the speaker corresponding to the audio spoken by the second speaker).
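Cetin and Yoshioka are considered to be analogous to the claimed invention because they relate to ASR systems to be able to recognize overlapped speech. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Cetin on analyzing and processing locations of overlapping speech and then using the unmixing transducer and ASR processing teachings of Yoshioka to improve the computational ability of transcribing individual utterances that may or may not be overlapping in the multi-talker settings (see Yoshioka, pg. 3038, sect. 1).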
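As an illustrative aid, the per-speaker averages recited in claim 4 (one average inside the overlapping region and one outside it) might be computed as in the sketch below. The names, the list-based representation, and the exclusive overlap_end convention are assumptions. The claim leaves the function that combines the four averages into an embedding loss open, so the sketch stops at the averages.

```python
from typing import List, Tuple

Embedding = List[float]


def mean_embedding(frames: List[Embedding]) -> Embedding:
    """Element-wise mean of a non-empty list of per-frame embeddings."""
    if not frames:
        raise ValueError("cannot average an empty region")
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / len(frames) for d in range(dim)]


def inside_outside_means(masked: List[Embedding], overlap_start: int,
                         overlap_end: int) -> Tuple[Embedding, Embedding]:
    """Average one speaker's masked embedding inside and outside the overlap.

    Assumed convention: overlap_end is exclusive.
    """
    inside = masked[overlap_start:overlap_end]
    outside = masked[:overlap_start] + masked[overlap_end:]
    return mean_embedding(inside), mean_embedding(outside)
```

Calling inside_outside_means once per speaker yields the four averages on which the claimed embedding loss is based.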
Regarding claim 5, Cetin and Yoshioka teach the method of claim 1. Yoshioka further teaches generating the respective masked audio embedding occurs at each frame of the audio signal for the training example (see Yoshioka, pg. 3039, sect. 2.2 and Fig. 2, which teach how the spectral mask is obtained from the unmixing transducer for each frame of the audio signal; sect. 2.2.2 further describes the training set).
Regarding claim 6, Cetin and Yoshioka teach the method of claim 1. Cetin further teaches the audio signal comprises a monophonic audio signal (see Cetin, sect. 2.1-2.3, which discuss how the speech from individual headset microphones is used for the data set; the speech from individual headset microphones is interpreted as a monophonic audio signal).
	Regarding claim 7, Cetin and Yoshioka teach the method of claim 1. Yoshioka further teaches wherein the training example comprises simulated training data (see Yoshioka, pg. 3040, sect. 2.4 and pg. 3041, sect. 3.2 & Table 1 teaches how the training data for the proposed transducer is compared without the processing; interpreted as simulated training data).
Regarding claim 10, Cetin and Yoshioka teach the method of claim 1. Yoshioka further teaches the speech recognition model comprises an audio encoder configured to, during inference: generate per frame audio embeddings from a monophonic audio stream comprising speech spoken by two or more different speakers (see Yoshioka, pg. 3039, sect. 2, which describes that audio features are extracted to form embeddings of utterances from different speakers); and communicate each frame audio embedding to a masking model, the masking model trained to generate, for each frame audio embedding, a respective masked audio embedding (see Yoshioka, pg. 3039, sect. 2.2, which describes the process of generating the spectral mask for each output signal yi,t,f).
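A minimal sketch of per-frame masking, assuming the simplest reading in which each speaker's mask is multiplied element-wise with every frame embedding, is given below for illustration. The names and data layout are hypothetical. The sketch does not reproduce the mask-based beamforming of Yoshioka or the masking model of the application.

```python
from typing import List

Frame = List[float]


def apply_masks(frames: List[Frame],
                masks: List[List[Frame]]) -> List[List[Frame]]:
    """Return one masked copy of the frames per speaker.

    masks[i][t] is the assumed mask of speaker i for frame t, with the same
    dimensionality as frames[t].
    """
    outputs = []
    for speaker_masks in masks:
        if len(speaker_masks) != len(frames):
            raise ValueError("one mask per frame is required")
        outputs.append([
            [m * x for m, x in zip(mask, frame)]
            for mask, frame in zip(speaker_masks, frames)
        ])
    return outputs
```

Under this reading, the output for speaker i at frame t and element f is the product of the mask value and the input value, which matches the mask-times-input relationship the rejection attributes to yi,t,f.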
Regarding claim 11, Cetin and Yoshioka teach the method of claim 1. Cetin further teaches a first ground truth transcript corresponding to the audio spoken by the first speaker (see Cetin sect. 2 and sect. 3 describes clean condition and no overlap, single speaker overlap and 2 speakers overlap which is interpreted as first speaker ground truth and second speaker ground truth respectively); and a second ground truth transcript corresponding to the audio spoken by the second speaker (see Cetin sect. 2 and sect. 3 describes clean condition and no overlap, single speaker overlap and 2 speakers overlap which is interpreted as first speaker ground truth and second speaker ground truth respectively).
	Regarding claim 12, Cetin teaches a system for training a speech recognition model with a loss function, the system comprising: data processing hardware; and memory hardware in communication with the data processing hardware, the memory hardware storing instructions that when executed on the data processing hardware cause the data processing hardware to perform operations comprising (see Cetin, system described in sect. 2.2): receiving a training example comprising an audio signal including a first segment corresponding to audio spoken by a first speaker, a second segment corresponding to audio spoken by a second speaker, and an overlapping region where the first segment overlaps the second segment, the overlapping region comprising a known start time and a known end time (see Cetin, Fig. 1. Illustration of experiment conditions. 
When A is taken as the foreground speaker, B and C are background speakers. For the cross-talk condition, full original audio from B and C are added to A. For the background-noise condition, B and C are added only in the cases in which they do not contain any speech (for example, during the overlap marked DURING, B is not added to A, and only C is added). The regions marked BEFORE and AFTER in A are nonoverlaps). However, Cetin does not teach for each of the first speaker and the second speaker, generating a respective masked audio embedding based on the training sample; determining, by the data processing hardware, whether the first speaker was speaking: prior to the known start time of the overlapping region; or after the known end time of the overlapping region; when the first speaker was speaking prior to the known start time of the overlapping region, applying, to the respective masked audio embedding for the first speaker, a first masking loss after the known end time; and when the first speaker was speaking after the known end time of the overlapping region, applying, to the respective masked audio embedding for the first speaker, the first masking loss before the known start time. However, Yoshioka teaches for each of the first speaker and the second speaker, generating a respective
masked audio embedding based on the training sample (see Yoshioka, pg. 3039, sect. 2.2 and Fig. 2, which teach generating spectral masks for each of the speakers); determining whether the first speaker was speaking: prior to the known start time of the overlapping region or after the known end time of the overlapping region (see Yoshioka, pg. 3039, sect. 2.2.2, which teaches how each signal can be a single utterance or a mixture of two utterances with different length levels, and reverberations, corrupted by background noise; Yoshioka, pg. 3040, sect. 2.2.3 and sect. 2.3.1, which teach that, after the permutation alignment processing, the masks for the nonoverlapping frames of the current window are used, and that another output channel is added to the separation network so that the noise masks can also be obtained; interpreted as first speaker or second speaker and overlap); when the first speaker was speaking prior to the known start time of the overlapping region, applying, to the respective masked audio embedding for the first speaker, a first masking loss after
the known end time (see Yoshioka, Fig. 3, sect. 2.2.2 and sect. 2.3.1; speech mask 1 in Fig. 3 + noise mask to compute the mask embedding loss); and when the first speaker was speaking after the known end time of the overlapping region, applying, to the respective masked audio embedding for the first speaker, the first masking loss before the known start time (see Yoshioka, Fig. 3, sect. 2.2.2 and sect. 2.3.1; speech mask 2 in Fig. 3 + noise mask to compute the mask embedding loss).
	Cetin and Yoshioka are considered to be analogous to the claimed invention because they relate to ASR systems to be able to recognize overlapped speech. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Cetin on analyzing and processing locations of overlapping speech and then using the unmixing transducer and ASR processing teachings of Yoshioka to improve the computational ability of transcribing individual utterances that may or may not be overlapping in the multi-talker settings (see Yoshioka, pg. 3038, sect. 1).
	Regarding claim 13 Cetin and Yoshioka teach the system of claim 12. Cetin further teaches when the first speaker was speaking prior to the known start time of the overlapping region, the first speaker was not speaking after the known end time of the overlapping region (see Cetin, sect. 2.3 discusses creating the set of data based on adding specific channels with a speaker and background noise in a time-synchronous fashion after weighting a factor to adjust cross-talk severity; foreground speaker with single speaker overlap is interpreted as being extended by a person skilled in the art for a first speaker before known time of overlapping); and when the first speaker was speaking after the known end time of the overlapping region, the first speaker was not speaking prior to the known start time of the overlapping region (see Cetin, sect. 2.3 discusses creating the set of data based on adding specific channels with a speaker and background noise in a time-synchronous fashion after weighting a factor to adjust cross-talk severity; foreground speaker with single speaker overlap is interpreted as being extended by a person skilled in the art as first speaker after known time of overlapping).
Regarding claim 14, Cetin and Yoshioka teach the system of claim 12. Yoshioka further teaches when the first speaker was speaking prior to the known start time of the overlapping region, applying, to the respective masked audio embedding for the second speaker, a second masking loss prior to the known start time of the overlapping region (see Yoshioka, pg. 3040, sect. 2.3 and equation 5, which teach using a mask-based beamforming approach to compute the output signals, with the noise mask calculated to overcome the interference; the masking approach is described in Yoshioka, pg. 3039, sect. 2.2; in Fig. 3, the speaker 1 mask and noise mask are interpreted as first speaker was speaking prior to the known start time of the overlapping region and a second masking loss prior to the known start time of the overlapping region, respectively); and when the first speaker was speaking after the known end time of the overlapping region, applying, to the respective masked audio embedding for the second speaker, the second masking loss after the known end time of the overlapping region (see Yoshioka, pg. 3040, sect. 2.3 and equation 5, which teach using a mask-based beamforming approach to compute the output signals, with the noise mask calculated to overcome the interference; the masking approach is described in Yoshioka, pg. 3039, sect. 2.2; in Fig. 3, the speaker 2 mask and noise mask are interpreted as first speaker was speaking after the known start time of the overlapping region and a second masking loss after the known start time of the overlapping region, respectively).
Regarding claim 15, Cetin and Yoshioka teach the system of claim 14. Yoshioka further teaches for each of the respective masked audio embeddings generated for the first speaker and the second speaker: computing a respective average speaker embedding for the respective one of the first speaker or the second speaker inside the overlapping region (see Yoshioka, pg. 3040, sect. 2.3.1, which teaches computation of the interference spatial covariance matrix, obtaining the noise masks, and the squared error in noise estimation; the noise estimation is interpreted as the average speaker embedding inside the overlapping region); and computing a respective average speaker embedding for the respective one of the first speaker or the second speaker outside the overlapping region (see Yoshioka, pg. 3040, sect. 2.3, which teaches how to obtain the output signal using the beamforming approach; this is interpreted as the average speaker embedding outside the overlapping region); determining an embedding loss based on a function of the average speaker embedding computed for the respective masked audio embedding for the first speaker inside the overlapping region, the average speaker embedding computed for the respective masked audio embedding for the second speaker inside the overlapping region, the average speaker embedding computed for the respective masked audio embedding for the first speaker outside the overlapping region, and the average speaking embedding computed for the respective masked audio embedding for the second speaker outside the overlapping region (see Yoshioka, pg. 3040, sect. 2.3.1, which teaches that the loss function is defined as the sum of the PIT loss and the squared error in noise estimation; the PIT loss is interpreted as the embedding loss of the speaker and the mean squared error is interpreted as the overlapping region loss); and applying the embedding loss to each of: the respective masked audio embedding generated for the first speaker to enforce that an entirety of the respective masked audio embedding generated for the first speaker corresponds to only audio spoken by the first speaker (see Yoshioka, pg. 3040, sect. 2.1-2.2, which describe obtaining the output signal yi,t,f based on the input signal and mask, and the unmixed signal si,t,f; the loss of the difference of yi,t,f and si,t,f is interpreted as the embedding loss of the speaker corresponding to the audio spoken by the first speaker); and the respective masked audio embedding generated for the second speaker to enforce that an entirety of the respective masked audio embedding generated for the second speaker corresponds to only audio spoken by the second speaker (see Yoshioka, pg. 3040, sect. 2.1-2.2, which describe obtaining the output signal yi,t,f based on the input signal and mask, and the unmixed signal si,t,f; the loss of the difference of yi,t,f and si,t,f is interpreted as the embedding loss of the speaker corresponding to the audio spoken by the second speaker).
Cetin and Yoshioka are considered to be analogous to the claimed invention because they relate to ASR systems to be able to recognize overlapped speech. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Cetin on analyzing and processing locations of overlapping speech and then using the unmixing transducer and ASR processing teachings of Yoshioka to improve the computational ability of transcribing individual utterances that may or may not be overlapping in the multi-talker settings (see Yoshioka, pg. 3038, sect. 1).
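For reference, the loss that the rejection attributes to Yoshioka, sect. 2.3.1 (the sum of a permutation-invariant training (PIT) loss and a squared error in noise estimation) can be sketched as follows. The sketch uses the generic textbook form of PIT. The function names, the use of mean squared error for both terms, and the list-based signals are assumptions, not details taken from the reference.

```python
from itertools import permutations
from typing import Sequence

Signal = Sequence[float]


def mse(a: Signal, b: Signal) -> float:
    """Mean squared error between two equal-length signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def pit_loss(estimates: Sequence[Signal], references: Sequence[Signal]) -> float:
    """Smallest total MSE over all assignments of estimates to references."""
    n = len(references)
    return min(
        sum(mse(estimates[perm[i]], references[i]) for i in range(n))
        for perm in permutations(range(n))
    )


def total_loss(estimates: Sequence[Signal], references: Sequence[Signal],
               noise_estimate: Signal, noise_reference: Signal) -> float:
    """PIT loss plus the (assumed mean) squared error of the noise estimate."""
    return pit_loss(estimates, references) + mse(noise_estimate, noise_reference)
```

Because the minimum is taken over output-to-speaker assignments, the PIT term does not depend on which output channel carries which speaker.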
Regarding claim 16, Cetin and Yoshioka teach the system of claim 12. Yoshioka further teaches generating the respective masked audio embedding occurs at each frame of the audio signal for the training example (see Yoshioka, pg. 3039, sect. 2.2 and Fig. 2 teaches how the spectral mask is obtained from the unmixing transducer for each frame of audio signal, sect. 2.2.2 further describes the training set ).
Regarding claim 17, Cetin and Yoshioka teach the system of claim 12. Cetin further teaches the audio signal comprises a monophonic audio signal (see Cetin, sect. 2.1-2.3, which discuss how the speech from individual headset microphones is used for the data set; the speech from individual headset microphones is interpreted as a monophonic audio signal).
	Regarding claim 18, Cetin and Yoshioka teach the system of claim 12. Yoshioka further teaches wherein the training example comprises simulated training data (see Yoshioka, pg. 3040, sect. 2.4 and pg. 3041, sect. 3.2 & Table 1 teaches how the training data for the proposed transducer is compared without the processing; interpreted as simulated training data).
Regarding claim 21, Cetin and Yoshioka teach the system of claim 12. Yoshioka further teaches the speech recognition model comprises an audio encoder configured to, during inference: generate per frame audio embeddings from a monophonic audio stream comprising speech spoken by two or more different speakers (see Yoshioka, pg. 3039, sect. 2, which describes that audio features are extracted to form embeddings of utterances from different speakers); and communicate each frame audio embedding to a masking model, the masking model trained to generate, for each frame audio embedding, a respective masked audio embedding (see Yoshioka, pg. 3039, sect. 2.2, which describes the process of generating the spectral mask for each output signal yi,t,f).
Regarding claim 22, Cetin and Yoshioka teach the system of claim 12. Cetin further teaches a first ground truth transcript corresponding to the audio spoken by the first speaker (see Cetin sect. 2 and sect. 3 describes clean condition and no overlap, single speaker overlap and 2 speakers overlap which is interpreted as first speaker ground truth and second speaker ground truth respectively); and a second ground truth transcript corresponding to the audio spoken by the second speaker (see Cetin sect. 2 and sect. 3 describes clean condition and no overlap, single speaker overlap and 2 speakers overlap which is interpreted as first speaker ground truth and second speaker ground truth respectively).
Claims 8 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over O. Cetin and E. Shriberg, "Speaker Overlaps and ASR Errors in Meetings: Effects Before, During, and After the Overlap," 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, 2006, pp. I-I in view of Yoshioka, T., Erdogan, H., Chen, Z., Xiao, X., & Alleva, F. (2018). Recognizing overlapped speech in meetings: A multichannel separation approach using neural networks. arXiv preprint arXiv:1810.03655 further in view of Li, J., Zhao, R., Hu, H., & Gong, Y. (2019, December). Improving RNN transducer modeling for end-to-end speech recognition. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) (pp. 114-121). IEEE.
Regarding claim 8, Cetin and Yoshioka teach the method of claim 1. However, Cetin and Yoshioka fail to teach wherein the speech recognition model comprises a recurrent neural network transducer (RNN-T) architecture. Li teaches wherein the speech recognition model comprises a recurrent neural network transducer (RNN-T) architecture (see Li, sect. 2 and Fig. 1, which teach an RNN-T model that consists of encoder, prediction, and joint networks).
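Cetin, Yoshioka and Li are considered to be analogous to the claimed invention because they relate to end-to-end ASR solutions. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Cetin and Yoshioka on processing locations of overlapping speech using the unmixing transducer with the RNN-T training teachings of Li to reduce the memory consumption of RNN-T training and RNN-T model structure (see Li, pg. 1, sect. 1).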
Regarding claim 19, Cetin and Yoshioka teach the system of claim 12. However, Cetin and Yoshioka fail to teach wherein the speech recognition model comprises a recurrent neural network transducer (RNN-T) architecture. Li teaches wherein the speech recognition model comprises a recurrent neural network transducer (RNN-T) architecture (see Li, sect. 2 and Fig. 1, which teach an RNN-T model that consists of encoder, prediction, and joint networks).
Cetin, Yoshioka and Li are considered to be analogous to the claimed invention because they relate to end-to-end ASR solutions. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Cetin and Yoshioka on processing locations of overlapping speech using the unmixing transducer with the RNN-T training teachings of Li to reduce the memory consumption of RNN-T training and RNN-T model structure (see Li, pg. 1, sect. 1).
Claims 9 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over O. Cetin and E. Shriberg, "Speaker Overlaps and ASR Errors in Meetings: Effects Before, During, and After the Overlap," 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, 2006, pp. I-I, in view of Yoshioka, T., Erdogan, H., Chen, Z., Xiao, X., & Alleva, F. (2018). Recognizing overlapped speech in meetings: A multichannel separation approach using neural networks. arXiv preprint arXiv:1810.03655, further in view of Li, J., Zhao, R., Hu, H., & Gong, Y. (2019, December). Improving RNN transducer modeling for end-to-end speech recognition. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) (pp. 114-121). IEEE, further in view of Chen et al., US Patent Application Publication 2019/0318757.
Regarding claim 9, Cetin, Yoshioka and Li teach the method of claim 8. Cetin, Yoshioka and Li fail to teach a first decoder configured to receive, as input, the respective masked audio embedding generated for the first speaker and to generate, as output, a first transcription associated with the first speaker, the first transcription transcribing the first segment of the audio signal that corresponds to the audio spoken by the first speaker; and a second decoder configured to receive, as input, the respective masked audio embedding generated for the second speaker and to generate, as output, a second transcription associated with the second speaker, the second transcription transcribing the second segment of the audio signal that corresponds to the audio spoken by the second speaker. However, Chen teaches a first decoder configured to receive, as input, the respective masked audio embedding generated for the first speaker and to generate, as output, a first transcription associated with the first speaker, the first transcription transcribing the first segment of the audio signal that corresponds to the audio spoken by the first speaker (see Chen, [0068] As noted above, one potential application of the speech separation techniques discussed herein is to provide speaker-specific transcripts for multi-party conversations. FIG. 8 shows a speech separation and transcription processing flow 800 that can be employed to obtain speaker-specific transcripts; the speech recognition, 708(1) is interpreted as the masked audio embedding generated for the first speaker); and a second decoder configured to receive, as input, the respective masked audio embedding generated for the second speaker and to generate, as output, a second transcription associated with the second speaker, the second transcription transcribing the second segment of the audio signal that corresponds to the audio spoken by the second speaker (see Chen, [0068] As noted above, one potential application of the speech separation techniques discussed herein is to provide speaker-specific transcripts for multi-party conversations. FIG. 8 shows a speech separation and transcription processing flow 800 that can be employed to obtain speaker-specific transcripts; the speech recognition, 708(2) is interpreted as the masked audio embedding generated for the second speaker).
Cetin, Yoshioka, Li and Chen are considered to be analogous to the claimed invention because they relate to end-to-end ASR solutions. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Cetin, Yoshioka and Li on processing locations of overlapping speech using the unmixing transducer with the speech separation of mixed signal teachings of Chen to improve an automated speech separation system that was robust in difficult scenarios (see Chen, [0001]).
Regarding claim 20, Cetin, Yoshioka and Li teach the system of claim 19. Cetin, Yoshioka and Li fail to teach a first decoder configured to receive, as input, the respective masked audio embedding generated for the first speaker and to generate, as output, a first transcription associated with the first speaker, the first transcription transcribing the first segment of the audio signal that corresponds to the audio spoken by the first speaker; and a second decoder configured to receive, as input, the respective masked audio embedding generated for the second speaker and to generate, as output, a second transcription associated with the second speaker, the second transcription transcribing the second segment of the audio signal that corresponds to the audio spoken by the second speaker. However, Chen teaches a first decoder configured to receive, as input, the respective masked audio embedding generated for the first speaker and to generate, as output, a first transcription associated with the first speaker, the first transcription transcribing the first segment of the audio signal that corresponds to the audio spoken by the first speaker (see Chen, [0068] As noted above, one potential application of the speech separation techniques discussed herein is to provide speaker-specific transcripts for multi-party conversations. FIG. 8 shows a speech separation and transcription processing flow 800 that can be employed to obtain speaker-specific transcripts; the speech recognition, 708(1) is interpreted as the masked audio embedding generated for the first speaker); and a second decoder configured to receive, as input, the respective masked audio embedding generated for the second speaker and to generate, as output, a second transcription associated with the second speaker, the second transcription transcribing the second segment of the audio signal that corresponds to the audio spoken by the second speaker (see Chen, [0068] As noted above, one potential application of the speech separation techniques discussed herein is to provide speaker-specific transcripts for multi-party conversations. FIG. 8 shows a speech separation and transcription processing flow 800 that can be employed to obtain speaker-specific transcripts; the speech recognition, 708(2) is interpreted as the masked audio embedding generated for the second speaker).
Cetin, Yoshioka, Li and Chen are considered to be analogous to the claimed invention because they relate to end-to-end ASR solutions. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Cetin, Yoshioka and Li on processing locations of overlapping speech using the unmixing transducer with the speech separation of mixed signal teachings of Chen to improve an automated speech separation system that was robust in difficult scenarios (see Chen, [0001]).

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Chen, Z., Luo, Y., & Mesgarani, N. (2017, March). Deep attractor network for single-microphone speaker separation. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 246-250) teaches end-to-end training for single-channel speech separation by creating attractor points in a high-dimensional embedding space.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to NANDINI SUBRAMANI whose telephone number is (571)272-3916. The examiner can normally be reached Monday - Friday, 2:00 pm - 5:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh M Mehta, can be reached at (571)272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/NANDINI SUBRAMANI/Examiner, Art Unit 2656
/EDGAR X GUERRA-ERAZO/Primary Examiner, Art Unit 2656