DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
Applicant’s arguments and amendments in the Amendment filed August 24, 2022 (herein “Amendment”) with respect to the objection to claim 12, and therefore claims 13-19 which depend therefrom, have been fully considered and are persuasive.  The objection to claim 12 and claims 13-19 has been withdrawn.
Applicant’s arguments and amendments in the Amendment, with respect to the rejection(s) of claim(s) 1, 12 and 20 under 35 U.S.C. 103 have been fully considered and are persuasive.  Therefore, the rejection has been withdrawn.  However, upon further consideration, a new ground(s) of rejection is made in view of Tahernezhaadi et al., US 2003/0117967 A1.

Claim Objections
Claims 1, 12 and 20, and therefore claims 2-11 and 13-19 which depend therefrom, are objected to because of the following informalities: claims 1, 12 and 20 all recite “to determined energy level,” but should recite “to the determined energy level.”  Appropriate correction is required.




Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-2, 4-10, and 12-20 are rejected under 35 U.S.C. 103 as being unpatentable over Wung et al. (US 2019/0172476 A1, herein “Wung”) in view of Yoshioka et al., “Recognizing Overlapped Speech in Meetings: A Multichannel Separation Approach Using Neural Networks,” Interspeech 2018, September 6, 2018, Hyderabad, India (herein “Yoshioka NPL”), further in view of Tahernezhaadi et al. (US 2003/0117967 A1, herein “Tahernezhaadi”).
Regarding claim 1, Wung teaches a computer-implemented method comprising (Wung paras. 3-4, operations of a system for speech signal enhancement): 
receiving, by a computing device (Wung fig. 6, para. 20, components shown in the figures implemented as one or more processors, where multi-mic signals are shown as being received into an acoustic echo canceller AEC) that has an associated microphone and loudspeaker (Wung fig. 6, paras. 22, 24, microphones integrated into a loudspeaker cabinet along with loudspeaker drivers producing media content playback (which is a loudspeaker unit)), first audio data of a user utterance, the first audio data being generated using the microphone (Wung figs. 4 and 6, paras. 57, 40-41 and 22, acoustic echo canceller (AEC) block 7 receiving the microphone signals which includes a talker’s voice or speech (user utterance)); 
while receiving the first audio data of the user utterance (Wung para. 56, the present acoustic condition is evaluated (thus while receiving the audio signals from the microphones as it is the “present” condition), where para. 22 teaches the microphones receive a talker’s voice/speech (the first audio data), and see also para. 37 discussing the output of the system being the processing of the current frame of the signal picked up by the microphones), determining, by the computing device (Wung fig. 6, paras. 20, 56, components shown in the figures implemented as one or more processors, the selector 11 determines the present acoustic condition), an energy level of second audio data being outputted by the loudspeaker of the computing device (Wung para. 56, the present acoustic condition being a reference signal estimate strength (energy level)); 
based on the determination, selecting, by the computing device (Wung fig. 6, paras. 20, 56, components shown in the figures implemented as one or more processors, where selector 11 selects one of the DNNs depending on (based on) a determination about the present acoustic condition, which is the strength (energy level) of the reference signal (second audio data)), a model from among (i) a first model that is configured to reduce noise in audio data that includes speech from one speaker and (ii) a second model that is configured to reduce noise in the audio data (Wung paras. 4-6 and 55, the DNN has two configurations (corresponding to two models, as the DNN is disclosed as modelling the desired and undesired signal characteristics in the multi-channel speech pickup) which are selectable to calculate an environment relative “SPP” (speech presence probability) value applied to a filter to enhance speech (of at least one speaker) by tracking desired and undesired (noise) signal components and then suppressing (reducing) the undesired components); 
providing, by the computing device (Wung fig. 6, paras. 20, 57, 49, components shown in the figures implemented as one or more processors, DNN 3 has input features from the microphone signals), the first audio data as an input to the selected model (Wung paras. 49 and 55, input to DNN 3 are the input features from the microphone signals, where the DNN will have a selected configuration between DNN NS or DNN RES (input to the selected model)); 
receiving, by the computing device and from the selected model (Wung fig. 6, paras. 20, 33, 57-58, components shown in the figures implemented as one or more processors, the output of DNN 3 is received by multichannel filter 2 where it is filtered and output), processed first audio data (Wung fig. 6, paras. 20, 35, the SPP from the DNN is used by the multi-channel filter 2 to output an enhanced speech signal (processed first audio data)); and 
providing, for output by the computing device, the processed first audio data (Wung fig. 6, para. 47, the output of the multichannel filter is transformed back to the time domain and output as enhanced speech).
While Wung teaches that its DNNs are trained using speech (see Wung paras. 5 and 31), Wung does not explicitly teach a first model that is trained using first audio data samples that each encode speech from one speaker, and a second model that is configured to reduce noise in audio data that includes speech from more than one speaker and that is trained using second audio data samples that each encode speech from either one speaker or two speakers.
Wung further does not explicitly teach comparing an audio energy threshold to determined energy level, determining, based on the comparison of the audio energy to the determined energy level, whether a double-talk situation exists; based on the determination of whether a double-talk situation exists, selecting.
Yoshioka NPL teaches a first model that is trained using first audio data samples that each encode speech from one speaker, and a second model that is configured to reduce noise in audio data that includes speech from more than one speaker and that is trained using second audio data samples that each encode speech from either one speaker or two speakers (Yoshioka NPL pages 3040-3041, right column, figure 3, a Speech-Speech-Noise (SSN) model is trained with training samples from speech data (thus encoding speech) having one or two speakers, where the model is saved after every epoch, thus a model is generated at least every epoch, yielding multiple (first and second) trained models, and where the second model reduces noise in audio from multiple speakers).
Tahernezhaadi teaches comparing an audio energy threshold to determined energy level (Tahernezhaadi paras. 4 and 6, conventional echo cancellers use doubletalk detection by comparing a reflected far-end reference signal to a double-talk threshold by way of a difference between the reference signal and a desired signal), determining, based on the comparison of the audio energy to the determined energy level, whether a double-talk situation exists (Tahernezhaadi paras. 4 and 6, the doubletalk detection determines that double talk is occurring (both far-end speech and near-end speech are occurring) when the difference is less than the double-talk threshold); based on the determination of whether a double-talk situation exists, selecting (Tahernezhaadi para. 6, echo cancellation without a center clipping operation is selected when double-talk is determined).
Therefore, taking the teachings of Wung and Yoshioka NPL together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the operations of the speech enhancement system disclosed in Wung to include training using the training samples of Yoshioka NPL at least because doing so would provide a speech processing system capable of handling speech overlaps (see Yoshioka NPL section 4, page 3041).
Further, taking the teaching of Wung and Tahernezhaadi together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wung’s selection of a DNN for processing a speech signal with the selection being based on determining a double-talk scenario as disclosed by Tahernezhaadi at least because doing so would prevent a very large error signal from being output by echo cancelling processing on the doubletalk input signal (Tahernezhaadi para. 6).
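For illustration only, the double-talk determination characterized above (a level difference between a reference signal and a desired signal, compared against a double-talk threshold) can be sketched as follows. This is a minimal sketch under stated assumptions and is not code from Wung or Tahernezhaadi: the function names `energy_db` and `is_double_talk`, the use of mean-square energy in decibels, and the 6 dB default threshold are all hypothetical choices made for the example.

```python
import math


def energy_db(frame):
    """Mean-square energy of one audio frame in decibels (floored to avoid log(0))."""
    mean_square = sum(sample * sample for sample in frame) / len(frame)
    return 10.0 * math.log10(max(mean_square, 1e-12))


def is_double_talk(reference_frame, desired_frame, threshold_db=6.0):
    """Return True when the reference-minus-desired level difference is below the threshold.

    Assumption: a small difference means the desired (microphone-side) signal
    carries more than attenuated echo, i.e. near-end speech is also present.
    The 6 dB default is a hypothetical value chosen for this sketch.
    """
    difference_db = energy_db(reference_frame) - energy_db(desired_frame)
    return difference_db < threshold_db


# Far-end reference playing; the desired signal holds only a weak echo.
echo_only = is_double_talk([0.5] * 160, [0.05] * 160)
# Same reference, but the desired signal is nearly as strong (near-end talker active).
with_near_end = is_double_talk([0.5] * 160, [0.4] * 160)
```

In this sketch the echo-only frame yields a large level difference and is not flagged, while the frame with a strong desired signal falls under the threshold and is flagged as double-talk.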
Regarding claims 2 and 13, Wung does not explicitly teach the limitations of claims 2 or 13. Yoshioka NPL teaches comprising: receiving, by the computing device, audio data of a first utterance spoken by a first speaker and audio data of a second utterance spoken by a second speaker (Yoshioka NPL page 3040, fig. 3, speech data used for training, including a two-speaker case where there are two source signals); generating, by the computing device, combined audio data by combining the audio data of the first utterance and the audio data of the second utterance (Yoshioka NPL page 3040, fig. 3, source signals are mixed together in the two-speaker case); generating, by the computing device, noisy audio data by combining the combined audio data with noise (Yoshioka NPL page 3040, the mixed together signals are corrupted by additive noise); and training, by the computing device and using machine learning, the second model using the combined audio data and the noisy audio data (Yoshioka NPL fig. 3, pages 3039-3040, the generated training samples result from the mixed source signals and additive noise, and are used to train the SSN model).
Therefore, taking the teachings of Wung and Yoshioka NPL together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the operations of the speech enhancement system disclosed in Wung to include training using the training samples of Yoshioka NPL at least because doing so would provide a speech processing system capable of handling speech overlaps (see Yoshioka NPL section 4, page 3041).
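For illustration only, the training-sample generation recited in claims 2 and 13 (combining two utterances, then combining the result with noise) can be sketched as follows. This is a minimal sketch under stated assumptions and is not the procedure of Yoshioka NPL: the function names, the Gaussian noise source, the `noise_scale` parameter, and the zero-padding of the shorter utterance are hypothetical choices made for the example.

```python
import random


def mix(first_utterance, second_utterance):
    """Sum two utterances sample by sample, zero-padding the shorter one."""
    length = max(len(first_utterance), len(second_utterance))
    padded_first = first_utterance + [0.0] * (length - len(first_utterance))
    padded_second = second_utterance + [0.0] * (length - len(second_utterance))
    return [a + b for a, b in zip(padded_first, padded_second)]


def add_noise(clean, noise_scale, rng):
    """Corrupt a signal with additive Gaussian noise (an assumed noise model)."""
    return [sample + rng.gauss(0.0, noise_scale) for sample in clean]


def make_training_pair(first_utterance, second_utterance, noise_scale=0.01, seed=0):
    """Return (noisy_audio, combined_audio): a model input and its clean target."""
    rng = random.Random(seed)
    combined = mix(first_utterance, second_utterance)
    noisy = add_noise(combined, noise_scale, rng)
    return noisy, combined


noisy_audio, combined_audio = make_training_pair([0.25, 0.5, 0.75], [0.5, 0.25])
```

Here `combined_audio` plays the role of the claimed combined audio data and `noisy_audio` the role of the claimed noisy audio data; a one-speaker sample could be produced in this sketch by passing an empty list as the second utterance.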
Regarding claims 4 and 14, Wung teaches comprising: before providing the first audio data as an input to the selected model, providing, by the computing device, the first audio data as an input to an echo canceller that is configured to reduce echo in the first audio data (Wung fig. 6, para. 41, acoustic echo cancellation performed on the microphone signals (first audio data) in an AEC block 7 that is a processing block appearing before the DNN 3).
Therefore, taking the teachings of Wung and Yoshioka NPL together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the operations of the speech enhancement system disclosed in Wung to include training using the training samples of Yoshioka NPL at least because doing so would provide a speech processing system capable of handling speech overlaps (see Yoshioka NPL section 4, page 3041).
Regarding claims 5 and 15, Wung teaches comprising: receiving, by the computing device, audio data of an utterance spoken by a speaker (Wung paras. 31-33, DNN trained in a supervised manner, where the DNN is provided input features extracted from the multi-channel speech pickup including speech); generating, by the computing device, noisy audio data by combining the audio data of the utterance with noise (Wung paras. 31-32, the DNN is trained using data of speech and background noise from the multi-channel speech pickup and extracting features therefrom); and training, by the computing device and using machine learning, the first model using the audio data of the utterance and the noisy audio data (Wung paras. 31-33, DNN is trained from the input features extracted from the multi-channel speech pickup in certain selected acoustic conditions including speech and background noise).
Regarding claims 6 and 16, Wung does not teach the limitations of claims 6 or 16. Yoshioka NPL teaches wherein the second model is trained using second audio data samples that each encode speech from either two simultaneous speakers or one speaker (Yoshioka NPL page 3040, section 2.4, training samples used to train the SSN model using a one or two speaker speech signal, and then generating (encoding) a training sample).
Regarding claims 7 and 17, Wung teaches comprising: comparing, by the computing device, the energy level of the second audio data to a threshold energy level (Wung fig. 6, paras. 56, 41, the playback reference signal (second audio data) is evaluated against a given threshold); and 
based on comparing the energy level of the second audio data to the threshold energy level, determining, by the computing device, that the energy level of the audio data does not satisfy the threshold energy level, wherein selecting the model comprises selecting the second model based on determining that the energy level of the second audio data does not satisfy the threshold energy level (Wung para. 56, if the strength of the reference signal is below (does not satisfy) a given threshold, then the DNN NS (second model) is selected).
Regarding claims 8 and 18, Wung teaches comprising: comparing, by the computing device, the energy level of the second audio data to a threshold energy level (Wung fig. 6, paras. 56, 41, the playback reference signal (second audio data) is evaluated against a given threshold); and 
based on comparing the energy level of the second audio data to the threshold energy level, determining, by the computing device, that the energy level of the audio data satisfies the threshold energy level, wherein selecting the model comprises selecting the first model based on determining that the energy level of the second audio data satisfies the threshold energy level (Wung para. 56, if the strength of the reference signal is above (satisfies) a given threshold, then the DNN RES (first model) is selected).
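For illustration only, the threshold-based selection mapped above for claims 7-8 and 17-18 (reference strength above the threshold selects the first model, below selects the second) can be sketched as follows. This is a minimal sketch under stated assumptions and is not code from Wung: the function names, the mean-square strength measure, the treatment of a strength exactly equal to the threshold as satisfying it, and the string placeholders standing in for the two models are all hypothetical.

```python
def reference_strength(reference_frame):
    """Mean-square strength of the loudspeaker reference frame (assumed measure)."""
    return sum(sample * sample for sample in reference_frame) / len(reference_frame)


def select_model(reference_frame, threshold, first_model, second_model):
    """Pick the first model when the reference strength satisfies the threshold.

    Assumption: equality counts as satisfying the threshold; the Office action
    only characterizes the "above" and "below" cases.
    """
    if reference_strength(reference_frame) >= threshold:
        return first_model
    return second_model


# Placeholder labels stand in for the two trained models.
loud_playback = select_model([0.5, -0.5, 0.5, -0.5], 0.01, "first_model", "second_model")
silent_playback = select_model([0.0, 0.0, 0.0, 0.0], 0.01, "first_model", "second_model")
```

In this sketch, loud loudspeaker playback routes the microphone audio to the first model and silent playback routes it to the second model.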
Regarding claims 9 and 19, Wung teaches wherein the microphone of the computing device is configured to detect audio output by the loudspeaker of the computing device (Wung para. 22, the multi-channel audio pickup (signals detected from the microphone) has mixed therein a talker’s voice or speech and sounds being output by the loudspeaker unit/media content playback).
Regarding claim 10, Wung teaches wherein the computing device is communicating with another computing device during an audio conference (Wung paras. 20 and 22, speech enhancement system as part of an audio system that captures speech of a talker in a room as a near end talker, and sounds of a far-end talker in the media content playback during a telephony session (audio conference)).
Regarding claim 12, Wung teaches a computing device comprising: one or more processors (Wung fig. 6, para. 20, components shown in the figures implemented as one or more processors); and one or more storage devices storing instructions that are operable, when executed by the one or more processors, to cause the computing device to perform (Wung para. 20, the processors execute instructions stored in solid state memory) the operations comprising (Wung paras. 3-4, operations of a system for speech signal enhancement):
receiving, by the computing device (Wung fig. 6, para. 20, components shown in the figures implemented as one or more processors, where multi-mic signals are shown as being received into an acoustic echo canceller AEC), first audio data of a user utterance, the first audio data being generated using a microphone associated with the computing device (Wung figs. 4 and 6, paras. 57, 40-41 and 22, acoustic echo canceller (AEC) block 7 receiving the microphone signals which includes a talker’s voice or speech (user utterance), where paras. 20, 22, and 24, teach microphones integrated into the loudspeaker cabinet of the media content playback device which also houses the processor); 
while receiving the first audio data of the user utterance (Wung para. 56, the present acoustic condition is evaluated (thus while receiving the audio signals from the microphones as it is the “present” condition), where para. 22 teaches the microphones receive a talker’s voice/speech (the first audio data), and see also para. 37 discussing the output of the system being the processing of the current frame of the signal picked up by the microphones), determining, by the computing device (Wung fig. 6, paras. 20, 56, components shown in the figures implemented as one or more processors, the selector 11 determines the present acoustic condition), an energy level of second audio data being outputted by a loudspeaker associated with the computing device (Wung para. 56, the present acoustic condition being a reference signal estimate strength (energy level), where paras. 20, 22 and 24 teach the loudspeaker also in the media content playback device that generates the reference signal as media content playback); 
based on the determination, selecting, by the computing device (Wung fig. 6, paras. 20, 56, components shown in the figures implemented as one or more processors, where selector 11 selects one of the DNNs depending on (based on) a determination about the present acoustic condition, which is the strength (energy level) of the reference signal (second audio data)), a model from among (i) a first model that is configured to reduce noise in audio data that includes speech from one speaker and (ii) a second model that is configured to reduce noise in the audio data (Wung paras. 4-6 and 55, the DNN has two configurations (corresponding to two models, as the DNN is disclosed as modelling the desired and undesired signal characteristics in the multi-channel speech pickup) which are selectable to calculate an environment relative “SPP” (speech presence probability) value applied to a filter to enhance speech (of at least one speaker) by tracking desired and undesired (noise) signal components and then suppressing (reducing) the undesired components); 
providing, by the computing device (Wung fig. 6, paras. 20, 57, 49, components shown in the figures implemented as one or more processors, DNN 3 has input features from the microphone signals), the first audio data as an input to the selected model (Wung paras. 49 and 55, input to DNN 3 are the input features from the microphone signals, where the DNN will have a selected configuration between DNN NS or DNN RES (input to the selected model)); 
receiving, by the computing device and from the selected model (Wung fig. 6, paras. 20, 33, 57-58, components shown in the figures implemented as one or more processors, the output of DNN 3 is received by multichannel filter 2 where it is filtered and output), processed first audio data (Wung fig. 6, paras. 20, 35, the SPP from the DNN is used by the multi-channel filter 2 to output an enhanced speech signal (processed first audio data)); and 
providing, for output by the computing device, the processed first audio data (Wung fig. 6, para. 47, the output of the multichannel filter is transformed back to the time domain and output as enhanced speech).
While Wung teaches that its DNNs are trained using speech (see Wung paras. 5 and 31), Wung does not explicitly teach a first model that is trained using first audio data samples that each encode speech from one speaker, and a second model that is configured to reduce noise in audio data that includes speech from more than one speaker and that is trained using second audio data samples that each encode speech from either one speaker or two speakers.
Wung further does not explicitly teach comparing an audio energy threshold to determined energy level, determining, based on the comparison of the audio energy to the determined energy level, whether a double-talk situation exists; based on the determination of whether a double-talk situation exists, selecting.
Yoshioka NPL teaches a first model that is trained using first audio data samples that each encode speech from one speaker, and a second model that is configured to reduce noise in audio data that includes speech from more than one speaker and that is trained using second audio data samples that each encode speech from either one speaker or two speakers (Yoshioka NPL pages 3040-3041, right column, figure 3, a Speech-Speech-Noise (SSN) model is trained with training samples from speech data (thus encoding speech) having one or two speakers, where the model is saved after every epoch, thus a model is generated at least every epoch, yielding multiple (first and second) trained models, and where the second model reduces noise in audio from multiple speakers).
Tahernezhaadi teaches comparing an audio energy threshold to determined energy level (Tahernezhaadi paras. 4 and 6, conventional echo cancellers use doubletalk detection by comparing a reflected far-end reference signal to a double-talk threshold by way of a difference between the reference signal and a desired signal), determining, based on the comparison of the audio energy to the determined energy level, whether a double-talk situation exists (Tahernezhaadi paras. 4 and 6, the doubletalk detection determines that double talk is occurring (both far-end speech and near-end speech are occurring) when the difference is less than the double-talk threshold); based on the determination of whether a double-talk situation exists, selecting (Tahernezhaadi para. 6, echo cancellation without a center clipping operation is selected when double-talk is determined).
Therefore, taking the teachings of Wung and Yoshioka NPL together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the operations of the speech enhancement system disclosed in Wung to include training using the training samples of Yoshioka NPL at least because doing so would provide a speech processing system capable of handling speech overlaps (see Yoshioka NPL section 4, page 3041).
Further, taking the teaching of Wung and Tahernezhaadi together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wung’s selection of a DNN for processing a speech signal with the selection being based on determining a double-talk scenario as disclosed by Tahernezhaadi at least because doing so would prevent a very large error signal from being output by echo cancelling processing on the doubletalk input signal (Tahernezhaadi para. 6).
Regarding claim 20, Wung teaches one or more non-transitory computer-readable media storing software comprising instructions executable by one or more processors of a computing device which, upon such execution (Wung fig. 6, para. 20, components shown in the figures implemented as one or more processors that execute instructions stored in solid state memory), cause the computing device to perform the operations comprising (Wung paras. 3-4, operations of a system for speech signal enhancement):
receiving, by the computing device (Wung fig. 6, para. 20, components shown in the figures implemented as one or more processors, where multi-mic signals are shown as being received into an acoustic echo canceller AEC), first audio data of a user utterance, the first audio data being generated using a microphone associated with the computing device (Wung figs. 4 and 6, paras. 57, 40-41 and 22, acoustic echo canceller (AEC) block 7 receiving the microphone signals which includes a talker’s voice or speech (user utterance), where paras. 20, 22, and 24, teach microphones integrated into the loudspeaker cabinet of the media content playback device which also houses the processor); 
while receiving the first audio data of the user utterance (Wung para. 56, the present acoustic condition is evaluated (thus while receiving the audio signals from the microphones as it is the “present” condition), where para. 22 teaches the microphones receive a talker’s voice/speech (the first audio data), and see also para. 37 discussing the output of the system being the processing of the current frame of the signal picked up by the microphones), determining, by the computing device (Wung fig. 6, paras. 20, 56, components shown in the figures implemented as one or more processors, the selector 11 determines the present acoustic condition), an energy level of second audio data being outputted by a loudspeaker associated with the computing device (Wung para. 56, the present acoustic condition being a reference signal estimate strength (energy level), where paras. 20, 22 and 24 teach the loudspeaker also in the media content playback device that generates the reference signal as media content playback); 
based on the determination, selecting, by the computing device (Wung fig. 6, paras. 20, 56, components shown in the figures implemented as one or more processors, where selector 11 selects one of the DNNs depending on (based on) a determination about the present acoustic condition, which is the strength (energy level) of the reference signal (second audio data)), a model from among (i) a first model that is configured to reduce noise in audio data that includes speech from one speaker and (ii) a second model that is configured to reduce noise in the audio data (Wung paras. 4-6 and 55, the DNN has two configurations (corresponding to two models, as the DNN is disclosed as modelling the desired and undesired signal characteristics in the multi-channel speech pickup) which are selectable to calculate an environment relative “SPP” (speech presence probability) value applied to a filter to enhance speech (of at least one speaker) by tracking desired and undesired (noise) signal components and then suppressing (reducing) the undesired components); 
providing, by the computing device (Wung fig. 6, paras. 20, 57, 49, components shown in the figures implemented as one or more processors, DNN 3 has input features from the microphone signals), the first audio data as an input to the selected model (Wung paras. 49 and 55, input to DNN 3 are the input features from the microphone signals, where the DNN will have a selected configuration between DNN NS or DNN RES (input to the selected model)); 
receiving, by the computing device and from the selected model (Wung fig. 6, paras. 20, 33, 57-58, components shown in the figures implemented as one or more processors, the output of DNN 3 is received by multichannel filter 2 where it is filtered and output), processed first audio data (Wung fig. 6, paras. 20, 35, the SPP from the DNN is used by the multi-channel filter 2 to output an enhanced speech signal (processed first audio data)); and 
providing, for output by the computing device, the processed first audio data (Wung fig. 6, para. 47, the output of the multichannel filter is transformed back to the time domain and output as enhanced speech).
While Wung teaches that its DNNs are trained using speech (see Wung paras. 5 and 31), Wung does not explicitly teach a first model that is trained using first audio data samples that each encode speech from one speaker, and a second model that is configured to reduce noise in audio data that includes speech from more than one speaker and that is trained using second audio data samples that each encode speech from either one speaker or two speakers.
Wung further does not explicitly teach comparing an audio energy threshold to determined energy level, determining, based on the comparison of the audio energy to the determined energy level, whether a double-talk situation exists; based on the determination of whether a double-talk situation exists, selecting.
Yoshioka NPL teaches a first model that is trained using first audio data samples that each encode speech from one speaker, and a second model that is configured to reduce noise in audio data that includes speech from more than one speaker and that is trained using second audio data samples that each encode speech from either one speaker or two speakers (Yoshioka NPL pages 3040-3041, right column, figure 3, a Speech-Speech-Noise (SSN) model is trained with training samples from speech data (thus encoding speech) having one or two speakers, where the model is saved after every epoch, thus a model is generated at least every epoch, yielding multiple (first and second) trained models, and where the second model reduces noise in audio from multiple speakers).
Tahernezhaadi teaches comparing an audio energy threshold to determined energy level (Tahernezhaadi paras. 4 and 6, conventional echo cancellers use doubletalk detection by comparing a reflected far-end reference signal to a double-talk threshold by way of a difference between the reference signal and a desired signal), determining, based on the comparison of the audio energy to the determined energy level, whether a double-talk situation exists (Tahernezhaadi paras. 4 and 6, the doubletalk detection determines that double talk is occurring (both far-end speech and near-end speech are occurring) when the difference is less than the double-talk threshold); based on the determination of whether a double-talk situation exists, selecting (Tahernezhaadi para. 6, echo cancellation without a center clipping operation is selected when double-talk is determined).
Therefore, taking the teachings of Wung and Yoshioka NPL together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the operations of the speech enhancement system disclosed in Wung to include training using the training samples of Yoshioka NPL at least because doing so would provide a speech processing system capable of handling speech overlaps (see Yoshioka NPL section 4, page 3041).
Further, taking the teaching of Wung and Tahernezhaadi together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wung’s selection of a DNN for processing a speech signal with the selection being based on determining a double-talk scenario as disclosed by Tahernezhaadi at least because doing so would prevent a very large error signal from being output by echo cancelling processing on the doubletalk input signal (Tahernezhaadi para. 6).
Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over Wung in view of Yoshioka NPL in view of Tahernezhaadi, as set forth above regarding claim 2 from which claim 3 depends, further in view of Yoshioka et al., "Multi-Microphone Neural Speech Separation for Far-Field Multi-Talker Speech Recognition," 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5739-5743, doi: 10.1109/ICASSP.2018.8462081 (herein “Yoshioka2 NPL”).
Regarding claim 3, Wung does not explicitly teach the limitations of claim 3. 
Yoshioka NPL teaches wherein combining the audio data of the first utterance and the audio data of the second utterance (Yoshioka NPL page 3040, section 2.4, the start and end times of the utterances used for a two-speaker case in mixing the samples are described in reference 15, which is the Yoshioka2 NPL reference).
Yoshioka2 NPL teaches that the combining comprises overlapping the audio data of the first utterance and the audio data of the second utterance in the time domain and summing the audio data of the first utterance and the audio data of the second utterance (Yoshioka2 NPL page 5741, fig. 3, test sets generated including full and partial overlap of speech signals from speaker 1 and speaker 2, the speaker signals are mixed (summing)).
Therefore, taking the teachings of Wung and Yoshioka NPL together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the operations of the speech enhancement system disclosed in Wung to include training using the training samples of Yoshioka NPL at least because doing so would provide a speech processing system capable of handling speech overlaps (see Yoshioka NPL section 4, page 3041).
Still further, taking the teachings of Wung and Yoshioka2 NPL together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the operations of the speech enhancement system disclosed in Wung to include the training/test data disclosed in Yoshioka2 NPL at least because doing so would allow for a system to learn how to determine whether input consists of multiple speakers, and achieve good recognition performance (see Yoshioka2 NPL page 5742, right column).
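For illustration only, overlapping two utterances in the time domain and summing them can be sketched as follows. This is a minimal sketch under stated assumptions, not the procedure of Yoshioka NPL or Yoshioka2 NPL: the function name, the representation of audio as lists of float samples, and the `offset` parameter (the sample index at which the second utterance starts) are hypothetical.

```python
def mix_utterances(first, second, offset):
    """Overlap `second` onto `first` starting at sample index `offset`,
    summing the two signals sample by sample where they coincide."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    length = max(len(first), offset + len(second))
    mixed = [0.0] * length
    for i, sample in enumerate(first):
        mixed[i] += sample
    for i, sample in enumerate(second):
        mixed[offset + i] += sample
    return mixed


# Partial overlap: the second utterance starts at the last sample of the first.
partial = mix_utterances([1.0, 1.0, 1.0], [0.5, 0.5], 2)
# Full overlap: the second utterance lies entirely within the first.
full = mix_utterances([1.0, 1.0, 1.0], [0.5, 0.5], 0)
```

An offset of zero with a shorter second utterance gives full overlap, while a later offset gives partial overlap, corresponding to the two overlap conditions noted above.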

Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Wung in view of Yoshioka NPL in view of Tahernezhaadi, as set forth above regarding claim 1 from which claim 11 depends, further in view of Oyman et al. (US 10,148,868 B2, herein “Oyman”).
Regarding claim 11, Wung teaches wherein the computing device is communicating with another computing device during a conference (Wung paras. 20 and 22, speech enhancement system as part of an audio system that captures speech of a talker in a room as a near end talker, and sounds of a far-end talker in the media content playback during a telephony session (conference)).
Wung does not explicitly teach that the conference is a video conference, just that it is a telephony session with sound.
Oyman teaches a video conference (Oyman col. 4, line 55 – col. 5, line 13, multimedia telephony services including a two-way video conferencing).
Therefore, taking the teachings of Wung and Oyman together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the telephony system disclosed in Wung to include the video conferencing disclosed in Oyman at least because video applications such as Skype and Google Hangout are extensively used on mobile devices in daily life and have high consumer demand (see Oyman col. 1, lines 16-30), and therefore the modification would have been the use of a known technique to improve similar devices (methods, or products) in the same way. See MPEP 2143(I)(C).

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
Lloyd et al., US 2012/0259631 A1, directed towards a speech recognition system that associates a model with a user and determines whether background audio in an input audio signal is below a defined threshold.
Chen et al., US 2014/0278397 A1, directed towards speech processing of audio signals in a communication session that includes both near-end and far-end speakers, and that suppresses noise.
Heigold et al., US 9,976,374 B2, directed towards training a neural network for speaker verification that trains and uses multiple speaker models that are selected based on a generated speaker representation.
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHELLE M KOETH whose telephone number is (571)272-5908. The examiner can normally be reached Monday-Friday, 09:30-18:30 EDT/EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached at 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

MICHELLE M. KOETH
Primary Examiner
Art Unit 2656



/MICHELLE M KOETH/Primary Examiner, Art Unit 2656