DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments

Applicant’s arguments, see pages 9-11, filed 05/09/2022, with respect to the rejection(s) of claim(s) 1-4, 7, 11-14, and 17 under 35 U.S.C. 103 have been fully considered but they are not persuasive.  Applicant argues on page 11, “Mitchell does not teach a model that is trained to determine audio characteristics for each frequency range in a particular frequency range. Rather, Mitchell focuses on a single frequency range for the entire audio frame”.  Examiner respectfully disagrees with these arguments. Mitchell teaches the use of multiple sound classes (ex. dog barking, female speaking, baby crying, loud crash, noisy room, etc.) for a given audio frame (Mitchell col 2, lines 1-3, 37-50 and col 10, lines 37-39). As known by one of ordinary skill in the art, each sound class corresponds to a specific frequency range (ex. baby crying – 300 to 600 Hz, female voice – 165 to 255 Hz, etc.), therefore, there are multiple frequency ranges corresponding to the multiple sound classes associated with a given audio frame.  As a result, Mitchell teaches, for each frame, the use of a set of sound classes/events (that correspond to multiple frequency ranges) to produce a wide variability of energy levels that provide input into a trained machine learning model to recognize (determine) and output specific energy levels/audio data corresponding to speech or non-speech (Mitchell col 6, lines 16-29; col 2, lines 1-3, 10-13, 46-50; col 18, lines 37-42, 52-58; and col 10, lines 37-39).
Applicant’s arguments, see page 13, filed 05/09/2022, with respect to the rejection(s) of claim(s) 5-6, 8, 15-16 and 18 under 35 U.S.C. 103 have been fully considered but they are not persuasive.  Examiner respectfully disagrees with applicant’s argument that, because of the features of independent claims 1 and 11, Wu does not cure the deficiencies of Matheja, in view of Mitchell, for claims 5-6, 8, 15-16 and 18. Wu discloses the state machine features not explicitly disclosed by Matheja, in view of Mitchell, for claims 5-6, 8, 15-16 and 18.
Applicant’s arguments, see pages 13-14, filed 05/09/2022, with respect to the rejection(s) of claim(s) 9-10 and 19-20 under 35 U.S.C. 103 have been fully considered but they are not persuasive.  Examiner respectfully disagrees with applicant’s argument that, because of the features of independent claims 1 and 11, Alvarez does not cure the deficiencies of Matheja, in view of Mitchell, for claims 9-10 and 19-20. Alvarez discloses the audio and noise segmentation features not explicitly disclosed by Matheja, in view of Mitchell, for claims 9-10 and 19-20.
Applicant’s arguments, see pages 12-13, filed 05/09/2022, with respect to the rejection(s) of claim(s) 21-23 under 35 U.S.C. 103 have been fully considered but they are not persuasive.  Examiner respectfully disagrees with applicant’s argument that Mitchell does not cure the deficiencies of Matheja for claims 21-23. Mitchell discloses the time-domain trained speech audio data features not explicitly disclosed by Matheja for claims 21-23.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-4, 7, 11-14, 17 and 21-23 are rejected under 35 U.S.C. 103 as being unpatentable over Matheja et al., (US 2016/0261951 A1) (hereinafter Matheja), in view of Mitchell et al., (US 10,878,840 B1) (hereinafter Mitchell).

Regarding Claim 1: Matheja discloses a method comprising:
receiving, by a processing device through a plurality of channels, audio data, wherein the audio data of each channel corresponds to a plurality of frequency ranges (Matheja discloses receiving sound information (audio data) channels from a plurality of microphone signals, which correspond to frequencies in the frequency sub-band domain (Matheja ¶0005 and 0038));
determining, based on at least one of the speech audio energy level or the noise energy level for each of the plurality of frequency ranges, a speech signal with removed noise for each channel associated with the audio data (Matheja discloses, after receiving speech signals from the plurality of microphones, estimating the peak levels for each channel via the Automatic Gain Control (AGC) module (Fig. 3) and removing noise from each channel using the estimated noise power via the Noise Reduction (NR) module (Fig. 4) to produce preprocessed speech signals. Fig. 2 also includes Voice Activity Detection (VAD) /Speaker Activity Detection (SAD) to contribute to the calculation of the target values necessary to determine the dominant speaker and calculate values for adjusting the AGC and the maximum attenuation of the NR module (Matheja ¶0037, 0042-0043 and 0046));
for each channel, determining one or more statistical values associated with an energy level of a channel’s speech signal with the removed noise (As detailed in the Specification, the peak measurement can be defined as a statistical value associated to the energy level of a channel.  Matheja discloses, in peak level estimation module, the estimation (determination) of a peak level (statistical value) for m-th microphone signal (channel) within the Automatic Gain Control (AGC) modules (Matheja ¶0042));
determining a strongest channel, wherein the strongest channel has highest one or more statistical values associated with an energy level of a speech signal of a respective channel (Matheja discloses specifying (determining) the strongest (dominant) channel as the reference speech level by observing the reference/target peak level of Automatic Gain Control (AGC) modules (Matheja ¶0044-0045));
determining that the one or more statistical values associated with the energy level of the speech signal of the strongest channel satisfy a threshold condition (Matheja discloses the dominant channel must be active for a predetermined (threshold) amount of time to control the target values necessary to regulate the background noise (Matheja ¶0038));
comparing one or more statistical values associated with an energy level of a speech signal of each channel other than the strongest channel with the corresponding one or more statistical values associated with the energy level of the speech signal of the strongest channel (Matheja discloses utilizing the dominant channel as the reference speech level to assess whether to conduct Automatic Gain Control (AGC) and noise attenuation to ensure all channels are adapted to similar levels (Matheja ¶0045 and 0027));
depending on the comparing, determining whether to update a gain value for a respective channel based on the one or more statistical values associated with the energy level of the respective channel (Matheja discloses achieving equivalent background noise characteristics for each channel by utilizing the reference channel (dominant speaker) to conduct Automatic Gain Control (AGC) and noise attenuation techniques to adapt the speech signal power levels to approximately the same power for all channels (Matheja ¶0027, 0041-0045)).
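
For orientation only, the channel-comparison limitations mapped above (per-channel statistical value, strongest channel, threshold condition, comparison, and gain-update decision) can be sketched as follows. This is a minimal illustration and is not taken from Matheja or from the application; the function names, the choice of peak absolute amplitude as the statistical value, and the numeric threshold and tolerance are all assumptions made for the example.

```python
from typing import List, Sequence


def peak_level(signal: Sequence[float]) -> float:
    """Assumed statistical value: peak absolute amplitude of a denoised channel."""
    return max((abs(x) for x in signal), default=0.0)


def channels_to_update(
    channels: Sequence[Sequence[float]],
    threshold: float = 0.1,   # hypothetical threshold for the strongest channel
    tolerance: float = 0.05,  # hypothetical allowed gap to the strongest channel
) -> List[int]:
    """Return indices of channels whose gain would be considered for an update."""
    peaks = [peak_level(c) for c in channels]
    if not peaks:
        return []
    # Strongest channel: the one with the highest statistical value.
    strongest = max(range(len(peaks)), key=lambda i: peaks[i])
    # Threshold condition on the strongest channel; no update if it is not met.
    if peaks[strongest] < threshold:
        return []
    # Compare every other channel against the strongest channel.
    return [
        i for i, p in enumerate(peaks)
        if i != strongest and peaks[strongest] - p > tolerance
    ]
```

In this sketch a channel is flagged only when its level falls short of the strongest channel by more than the assumed tolerance; other decision rules would satisfy the same claim language.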

Matheja does not explicitly disclose:
determining, for each of the plurality of frequency ranges for each channel, at least one of a speech audio energy level or a noise energy level by providing audio data corresponding to each frequency range as input to a model that is trained to determine at least one of a speech audio energy level of given audio data or a noise energy level of the given audio data in the corresponding frequency range of the plurality of frequency ranges;

However, in an analogous art, Mitchell discloses: 
determining, for each of the plurality of frequency ranges for each channel, at least one of a speech audio energy level or a noise energy level by providing audio data corresponding to each frequency range as input to a model that is trained to determine at least one of a speech audio energy level of given audio data or a noise energy level of the given audio data in the corresponding frequency range of the plurality of frequency ranges (Mitchell teaches, for each frame, the use of a set of sound classes/events (that correspond to multiple frequency ranges) to produce a wide variability of energy levels that provide input into a trained machine learning model to recognize (determine) and output specific energy levels/audio data corresponding to speech or non-speech (Mitchell col 6, lines 16-29; col 2, lines 1-3, 10-13, 46-50; col 18, lines 37-42, 52-58; and col 10, lines 37-39)).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to incorporate the disclosed teaching of Mitchell to the method of Matheja because this would improve the sound recognition ability of the system by utilizing previously captured audio data (from multiple sound classes of Mitchell coupled with the multiple frequency ranges of Matheja) that has been processed and trained within a machine learning model or neural network (Mitchell col 6, lines 42-46; col 2, lines 46-50).

Regarding Claim 2: Matheja, in view of Mitchell, further discloses the method of claim 1, wherein determining the speech signal with removed noise for each channel comprises: for each of the plurality of frequency ranges of a channel, calculating a denoised signal based on at least one of the speech audio energy level or the noise energy level for a corresponding frequency range; and combining calculated denoised signals that each correspond to one of the plurality of frequency ranges of the channel (View Matheja ¶0037, 0052 and 0058).
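
As an illustration of the two steps recited in claim 2 (calculating a denoised signal per frequency range, then combining the per-range results), a minimal sketch follows. The Wiener-style gain speech / (speech + noise) is an assumed technique chosen for the example and is not stated in Matheja, Mitchell, or the claims; the function and parameter names are likewise hypothetical.

```python
from typing import List, Sequence


def denoise_and_combine(
    bands: Sequence[Sequence[float]],   # one time-aligned signal per frequency range
    speech_energy: Sequence[float],     # estimated speech energy per range
    noise_energy: Sequence[float],      # estimated noise energy per range
) -> List[float]:
    """Scale each band by an assumed Wiener-style gain and sum the bands."""
    if not bands:
        return []
    combined = [0.0] * len(bands[0])
    for band, s, v in zip(bands, speech_energy, noise_energy):
        total = s + v
        # Assumed gain: fraction of the band's energy attributed to speech.
        gain = s / total if total > 0 else 0.0
        for i, x in enumerate(band):
            combined[i] += gain * x
    return combined
```

A band whose energy is attributed entirely to noise contributes nothing to the combined signal, while a band with equal speech and noise energy is attenuated by half.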

Regarding Claim 3: Matheja, in view of Mitchell, further discloses the method of claim 1, wherein the threshold condition requires that the one or more statistical values associated with the energy level of the strongest channel be above a respective threshold value for a threshold period of time (View Matheja ¶0038-0040).

Regarding Claim 4: Matheja, in view of Mitchell, further discloses the method of claim 1, wherein determining whether to update the gain value for the respective channel comprises: determining whether the one or more statistical values associated with the energy level of the respective channel have been within a predefined range from a corresponding one or more statistical values associated with the energy level of the strongest channel for a period of time (View Matheja ¶0038-0040).

Regarding Claim 7: Matheja, in view of Mitchell, further discloses the method of claim 1, wherein the plurality of frequency ranges is limited to a predefined set of frequencies (View Matheja ¶0006, 0038, 0047, 0074 and 0092).

Regarding Claim 11: Matheja discloses a system comprising:
	a memory (Matheja ¶0121-0122);
	a processing device communicably coupled to the memory, the processing device to (Matheja ¶0121-0122);
receive, through a plurality of channels, audio data, wherein the audio data of each channel corresponds to a plurality of frequency ranges (Matheja discloses receiving sound information (audio data) channels from a plurality of microphone signals, which correspond to frequencies in the frequency sub-band domain. (Matheja ¶0005 and 0038));
determine, based on at least one of the speech audio energy level or the noise energy level for each of the plurality of frequency ranges, a speech signal with removed noise for each channel associated with the audio data (Matheja discloses, after receiving speech signals from the plurality of microphones, estimating the peak levels for each channel via the Automatic Gain Control (AGC) module (Fig. 3) and removing noise from each channel using the estimated noise power via the Noise Reduction (NR) module (Fig. 4) to produce preprocessed speech signals. Fig. 2 also includes Voice Activity Detection (VAD) /Speaker Activity Detection (SAD) to contribute to the calculation of the target values necessary to determine the dominant speaker and calculate values for adjusting the AGC and the maximum attenuation of the NR module (Matheja ¶0037, 0042-0043 and 0046));
for each channel, determine one or more statistical values associated with an energy level of a channel's speech signal with the removed noise (As detailed in the Specification, the peak measurement can be defined as a statistical value associated to the energy level of a channel.  Matheja discloses, in peak level estimation module, the estimation (determination) of a peak level (statistical value) for m-th microphone signal (channel) within the Automatic Gain Control (AGC) modules (Matheja ¶0042));
determine a strongest channel, wherein the strongest channel has highest one or more statistical values associated with an energy level of a speech signal of a respective channel (Matheja discloses specifying (determining) the strongest (dominant) channel as the reference speech level by observing the reference/target peak level of Automatic Gain Control (AGC) modules (Matheja ¶0044-0045));
determine that the one or more statistical values associated with the energy level of the speech signal of the strongest channel satisfy a threshold condition (Matheja discloses the dominant channel must be active for a predetermined (threshold) amount of time to control the target values necessary to regulate the background noise (Matheja ¶0038));
compare one or more statistical values associated with an energy level of a speech signal of each channel other than the strongest channel with the corresponding one or more statistical values associated with the energy level of the speech signal of the strongest channel (Matheja discloses utilizing the dominant channel as the reference speech level to assess whether to conduct Automatic Gain Control (AGC) and noise attenuation to ensure all channels are adapted to similar levels (Matheja ¶0045 and 0027));
depending on the comparing, determine whether to update a gain value for a respective channel based on the one or more statistical values associated with the energy level of the respective channel (Matheja discloses achieving equivalent background noise characteristics for each channel by utilizing the reference channel (dominant speaker) to conduct Automatic Gain Control (AGC) and noise attenuation techniques to adapt the speech signal power levels to approximately the same power for all channels (Matheja ¶0027, 0041-0045)).

Matheja does not explicitly disclose:
determine, for each of the plurality of frequency ranges for each channel, at least one of a speech audio energy level or a noise energy level by providing audio data corresponding to each frequency range as input to a model that is trained to determine at least one of a speech audio energy level of given audio data or a noise energy level of the given audio data in the corresponding frequency range of the plurality of frequency ranges.

However, in an analogous art, Mitchell discloses: 
determine, for each of the plurality of frequency ranges for each channel, at least one of a speech audio energy level or a noise energy level by providing audio data corresponding to each frequency range as input to a model that is trained to determine at least one of a speech audio energy level of given audio data or a noise energy level of the given audio data in the corresponding frequency range of the plurality of frequency ranges (Mitchell teaches, for each frame, the use of a set of sound classes/events (that correspond to multiple frequency ranges) to produce a wide variability of energy levels that provide input into a trained machine learning model to recognize (determine) and output specific energy levels/audio data corresponding to speech or non-speech (Mitchell col 6, lines 16-29; col 2, lines 1-3, 10-13, 46-50; col 18, lines 37-42, 52-58; and col 10, lines 37-39)).

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to incorporate the disclosed teaching of Mitchell to the method of Matheja because this would improve the sound recognition ability of the system by utilizing previously captured audio data (from multiple sound classes of Mitchell coupled with the multiple frequencies of Matheja) that has been processed and trained within a machine learning model or neural network (Mitchell col 6, lines 42-46; col 2, lines 46-50).

Regarding Claim 12: Matheja, in view of Mitchell, further discloses the system of claim 11, wherein to determine the speech signal with removed noise for each channel, the processing device is further to: for each of the plurality of frequency ranges of a channel, calculate a denoised signal based on at least one of the speech audio energy level or the noise energy level for a corresponding frequency range; and combine calculated denoised signals that each correspond to one of the plurality of frequency ranges of the channel (View Matheja ¶0037, 0052 and 0058).

Regarding Claim 13: Matheja, in view of Mitchell, further discloses the system of claim 11, wherein the threshold condition requires that the one or more statistical values associated with the energy level of the strongest channel be above a respective threshold value for a threshold period of time (View Matheja ¶0038-0040).

Regarding Claim 14: Matheja, in view of Mitchell, further discloses the system of claim 11, wherein to determine whether to update the gain value for the respective channel, the processing device is further to: determine whether the one or more statistical values associated with the energy level of the respective channel have been within a predefined range from a corresponding one or more statistical values associated with the energy level of the strongest channel for a period of time (View Matheja ¶0038-0040).

Regarding Claim 17: Matheja, in view of Mitchell, further discloses the system of claim 11, wherein the plurality of frequency ranges is limited to a predefined set of frequencies (View Matheja ¶0006, 0038, 0047, 0074 and 0092).

Regarding Claim 21: Matheja discloses a non-transitory machine-readable storage medium comprising
instructions that cause a processing device (Matheja ¶0121) to:
receive, through a plurality of channels, audio data, wherein the audio data of each channel corresponds to a plurality of time-related portions (Matheja discloses receiving sound information (audio data) channels from a plurality of microphone broadband signals in the time domain (Matheja ¶0005-0006 and 0038));
determine, for each of the plurality of time-related portions for each channel, a speech signal with removed noise by providing audio data corresponding to each time-related portion as input to determine a speech signal with removed noise of given audio data  (Matheja discloses, after receiving speech signals from the plurality of microphones, estimating the peak levels for each channel via the Automatic Gain Control (AGC) module (Fig. 3) and removing noise from each channel using the estimated noise power via the Noise Reduction (NR) module (Fig. 4) to produce preprocessed speech signals. Fig. 2 also includes Voice Activity Detection (VAD) /Speaker Activity Detection (SAD) to contribute to the calculation of the target values necessary to determine the dominant speaker and calculate values for adjusting the AGC and the maximum attenuation of the NR module (Matheja ¶0037, 0042-0043, 0046 and 0006));
for each channel, determine one or more statistical values associated with an energy level of a channel's speech signal with the removed noise (As detailed in the Specification, the peak measurement can be defined as a statistical value associated to the energy level of a channel.  Matheja discloses, in peak level estimation module, the estimation (determination) of a peak level (statistical value) for m-th microphone signal (channel) within the Automatic Gain Control (AGC) modules (Matheja ¶0042));
determine a strongest channel, wherein the strongest channel has highest one or more statistical values associated with an energy level of a speech signal of a respective channel (Matheja discloses specifying (determining) the strongest (dominant) channel as the reference speech level by observing the reference/target peak level of Automatic Gain Control (AGC) modules (Matheja ¶0044-0045));
determine that the one or more statistical values associated with the energy level of the speech signal of the strongest channel satisfy a threshold condition (Matheja discloses the dominant channel must be active for a predetermined (threshold) amount of time to control the target values necessary to regulate the background noise (Matheja ¶0038));
compare one or more statistical values associated with an energy level of a speech signal of each channel other than the strongest channel with the corresponding one or more statistical values associated with the energy level of the speech signal of the strongest channel (Matheja discloses utilizing the dominant channel as the reference speech level to assess whether to conduct Automatic Gain Control (AGC) and noise attenuation to ensure all channels are adapted to similar levels (Matheja ¶0045 and 0027));
depending on the comparing, determine whether to update a gain value for a respective channel based on the one or more statistical values associated with the energy level of the respective channel (Matheja discloses achieving equivalent background noise characteristics for each channel by utilizing the reference channel (dominant speaker) to conduct Automatic Gain Control (AGC) and noise attenuation techniques to adapt the speech signal power levels to approximately the same power for all channels (Matheja ¶0027, 0041-0045)).

Matheja does not explicitly disclose:
determine, for each of the plurality of time-related portions for each channel, a speech signal with removed noise by providing audio data corresponding to each time-related portion as input to a model that is trained to determine a speech signal with removed noise of given audio data.

However, in an analogous art, Mitchell discloses: 
determine, for each of the plurality of time-related portions for each channel, a speech signal by providing audio data corresponding to each time-related portion as input to a model that is trained to determine a speech signal of given audio (Mitchell teaches, for each frame, the use of a set of sound classes/events (corresponding time samples/frames of the time domain) to produce a wide variability of energy levels that provide input into a trained machine learning model to recognize (determine) and output specific energy levels/audio data corresponding to speech or non-speech (Mitchell col 6, lines 16-29; col 2, lines 1-3, 10-23, 28-32, 46-50; col 18, lines 37-42, 52-58; and col 10, lines 37-39)).

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to incorporate the disclosed teaching of Mitchell to the method of Matheja because this would improve the sound recognition ability of the system by utilizing previously captured speech audio data with noise removed from Matheja, coupled with Mitchell’s multiple sound classes in the time domain, to process and train the speech signals within a machine learning model or neural network (Mitchell col 6, lines 42-46; col 2, lines 28-32, 46-50).

Regarding Claim 22: Matheja, in view of Mitchell, further discloses the non-transitory machine-readable storage medium of claim 21, wherein to determine the speech signal with removed noise for each channel, the processing device is further to: for each of the plurality of time-related portions of a channel, calculate a denoised speech signal for a corresponding time-related portion (Matheja ¶0037-0038 and 0006); and combine calculated denoised speech signals that each correspond to one of the plurality of time-related portions of the channel (View Matheja ¶0052, 0058 and 0006).

Regarding Claim 23: Matheja, in view of Mitchell, further discloses the non-transitory machine-readable storage medium of claim 21, wherein the threshold condition requires that the one or more statistical values associated with the energy level of the strongest channel be above a respective threshold value for a threshold period of time (View Matheja ¶0038-0040).


Claims 5-6, 8, 15-16 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Matheja et al., (US 2016/0261951 A1) (hereinafter Matheja), in view of Mitchell et al., (US 10,878,840 B1) (hereinafter Mitchell), in further view of Wu et al., (US 11,164,592 B1) (hereinafter Wu).

Regarding Claim 5: Matheja, in view of Mitchell, discloses the method of claim 1, comprising:
based on the speech audio energy level and the noise energy level, updating a state of a state machine that includes a speech state, a noise state and an uncertain state (Mitchell teaches utilizing sound recognition to calculate scores for sound classes related to speech, non-speech/non-verbal (noise), and uncertain events/scenes (states) based on the energy level of the audio data (Mitchell col 3, lines 65-67; col 18, lines 37-39; col 7, lines 61-63; col 11, lines 13-16 and col 6, lines 56-65)).

Matheja, in view of Mitchell, does not explicitly disclose:
based on the speech audio energy level and the noise energy level, updating a state of a state machine that includes a silence state.

However, in an analogous art, Wu discloses: 
	based on the speech audio energy level and the noise energy level, updating a state of a state machine that includes a silence state (Wu teaches updating the state of the Voice Activity Detection (VAD) based on whether the energy levels of the audio data/frames are determined to be speech, silence, noise, and/or non-speech (Wu col 7, lines 20-29; col 3, lines 53-55; and col 4, lines 24-26)).

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to incorporate the disclosed teaching of Wu to the method of Matheja, in view of Mitchell, because this would improve the various states of the current Voice Activity Detection (VAD) abilities by identifying the gaps of silence found within the audio data, which would improve the audio quality and aid in maintaining the dynamic range for voice audio (Wu col 3, lines 37-47).  Implementing such improvements would also allow the device to determine and implement a silence gain for the gap of silence in the audio frame, thus improving the accuracy of the speech estimates, the Automatic Gain Control (AGC) responsiveness and the user experience (Wu col 17, lines 61-67 and col 3, lines 49-51).
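
To make the four-state limitation of claim 5 concrete, the following minimal sketch selects a speech, noise, silence, or uncertain state from a speech audio energy level and a noise energy level. It is illustrative only and does not reproduce the logic of Mitchell, Wu, or the application; the `classify_state` function, the silence floor, and the dominance margin are hypothetical values chosen for the example.

```python
from enum import Enum


class State(Enum):
    SPEECH = "speech"
    NOISE = "noise"
    SILENCE = "silence"
    UNCERTAIN = "uncertain"


def classify_state(
    speech_energy: float,
    noise_energy: float,
    silence_floor: float = 1e-4,  # hypothetical: below this, a level counts as silent
    margin: float = 2.0,          # hypothetical: dominance ratio needed to decide
) -> State:
    """Pick a state from the two energy levels using assumed thresholds."""
    if speech_energy < silence_floor and noise_energy < silence_floor:
        return State.SILENCE
    if speech_energy >= margin * noise_energy:
        return State.SPEECH
    if noise_energy >= margin * speech_energy:
        return State.NOISE
    # Neither level clearly dominates.
    return State.UNCERTAIN
```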

Regarding Claim 6: Matheja, in view of Mitchell and further in view of Wu, further discloses the method of claim 5, further comprising: updating the gain value for the respective channel, wherein updating the gain value for the respective channel further comprises: determining whether the state of the state machine is speech state for a threshold amount of time (Wu Fig. 11E, Module 1184; col 2, lines 53-54; col 7, lines 20-29); responsive to determining that the state of the state machine is speech state for the threshold amount of time, updating the gain value by no more than a first number of decibels per second (Wu col 4, lines 14-19 and col 13, lines 34-41); determining whether the state of the state machine is uncertain state for the threshold amount of time (Mitchell col 6, lines 56-65); and responsive to determining that the state of the state machine is uncertain state for the threshold amount of time, updating the gain value by no more than a second number of decibels per second (Wu col 4, lines 14-19 and col 13, lines 34-41).  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to incorporate the disclosed teaching of Wu to the method of Matheja, in view of Mitchell, because this would improve the accuracy of the loudness estimate and therefore a responsiveness of the automatic gain control, resulting in an improved user experience (Wu col 2, lines 54-57).
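
The rate-limited gain update recited in claim 6 (no more than a first number of decibels per second in the speech state, and no more than a second number in the uncertain state, once the state has persisted for a threshold amount of time) can be sketched as below. This is a hedged illustration rather than Wu's or the applicant's implementation; the state labels as strings, the rate limits of 3 dB/s and 1 dB/s, and the 0.5 s hold time are assumptions.

```python
from typing import Optional

# Hypothetical per-state limits, in decibels per second.
MAX_DB_PER_SECOND = {"speech": 3.0, "uncertain": 1.0}


def limited_gain_update(
    current_db: float,
    target_db: float,
    state: str,
    time_in_state_s: float,
    dt_s: float,
    hold_s: float = 0.5,  # hypothetical threshold amount of time
) -> float:
    """Move the gain toward the target, capped by the state's dB/s limit."""
    rate: Optional[float] = MAX_DB_PER_SECOND.get(state)
    # No update for other states, or before the state has persisted long enough.
    if rate is None or time_in_state_s < hold_s:
        return current_db
    max_step = rate * dt_s
    step = max(-max_step, min(max_step, target_db - current_db))
    return current_db + step
```

For example, with a 0.5 s update interval the sketch moves the gain by at most 1.5 dB in the speech state and 0.5 dB in the uncertain state, and leaves it unchanged in the noise and silence states.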

Regarding Claim 8: Matheja, in view of Mitchell and further in view of Wu, further discloses the method of claim 6, wherein updating the gain value comprises: ensuring that the updated gain value does not exceed a gain value threshold (Wu col 20, lines 48-64, col 16, lines 39-44 and col 16, lines 52-55).  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to incorporate the disclosed teaching of Wu to the method of Matheja, in view of Mitchell, because this would allow the system to limit the gain and prevent periods of excessive loudness and signal distortion (Wu col 16, lines 39-44).

Regarding Claim 15: Matheja, in view of Mitchell, discloses the system of claim 11, wherein the processing device is further to:
based on the speech audio energy level and the noise energy level, update a state of a state machine that includes a speech state, a noise state, and an uncertain state (Mitchell teaches utilizing sound recognition to calculate scores for sound classes related to speech, non-speech/non-verbal (noise), and uncertain events/scenes (states) based on the energy level of the audio data (Mitchell col 3, lines 65-67; col 18, lines 37-39; col 7, lines 61-63; col 11, lines 13-16 and col 6, lines 56-65)).

Matheja, in view of Mitchell, does not explicitly disclose:
based on the speech audio energy level and the noise energy level, update a state of a state machine that includes a silence state.

However, in an analogous art, Wu discloses: 
	based on the speech audio energy level and the noise energy level, update a state of a state machine that includes a silence state (Wu teaches updating the state of the Voice Activity Detection (VAD) based on whether the energy levels of the audio data/frames are determined to be speech, silence, noise, and/or non-speech (Wu col 7, lines 20-29; col 3, lines 53-55, col 4, lines 24-26)).

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to incorporate the disclosed teaching of Wu to the method of Matheja, in view of Mitchell, because this would improve the various states of the current Voice Activity Detection (VAD) abilities by identifying the gaps of silence found within the audio data, which would improve the audio quality and aid in maintaining the dynamic range for voice audio (Wu col 3, lines 37-47).  Implementing such improvements would also allow the device to determine and implement a silence gain for the gap of silence in the audio frame, thus improving the accuracy of the speech estimates, the Automatic Gain Control (AGC) responsiveness and the user experience (Wu col 17, lines 61-67 and col 3, lines 49-51).

Regarding Claim 16: Matheja, in view of Mitchell and further in view of Wu, further discloses the system of claim 15, wherein the processing device is further to: update the gain value for the respective channel, wherein to update the gain value for the respective channel, the processing device is further to: determine whether the state of the state machine is speech state for a threshold amount of time (Wu Fig. 11E, Module 1184; col 2, lines 53-54; col 7, lines 20-29); responsive to determining that the state of the state machine is speech state for the threshold amount of time, update the gain value by no more than a first number of decibels per second (Wu col 4, lines 14-19 and col 13, lines 34-41); determine whether the state of the state machine is uncertain state for the threshold amount of time (Mitchell col 6, lines 56-65); and responsive to determining that the state of the state machine is uncertain state for the threshold amount of time, update the gain value by no more than a second number of decibels per second (Wu col 4, lines 14-19 and col 13, lines 34-41).  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to incorporate the disclosed teaching of Wu to the method of Matheja, in view of Mitchell, because this would improve the accuracy of the loudness estimate and therefore a responsiveness of the automatic gain control, resulting in an improved user experience (Wu col 2, lines 54-57).

Regarding Claim 18: Matheja, in view of Mitchell and further in view of Wu, further discloses the system of claim 16, wherein to update the gain value, the processing device is further to: ensure that the updated gain value does not exceed a gain value threshold (Wu col 20, lines 48-64, col 16, lines 39-44 and col 16, lines 52-55).  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to incorporate the disclosed teaching of Wu to the method of Matheja, in view of Mitchell, because this would allow the system to limit the gain and prevent periods of excessive loudness and signal distortion (Wu col 16, lines 39-44).

Claims 9-10 and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Matheja et al., (US 2016/0261951 A1) (hereinafter Matheja), in view of Mitchell et al., (US 10,878,840 B1) (hereinafter Mitchell) and further in view of Alvarez et al., (US 2016/0099007 A1) (hereinafter Alvarez).

Regarding Claim 9: Matheja, in view of Mitchell, discloses the method of claim 1. However, Matheja, in view of Mitchell, failed to explicitly disclose the claimed:
receiving speech audio segments and noise segment;
determining a noise energy level of each noise segment and a speech energy level of each speech audio segment;
generating noisy speech audio segments by combining each noise segment and each speech audio segment;
training, using machine learning, the model using the noise energy level of each noise segment, a speech audio energy level of each speech audio segment, and the noisy speech audio segments.

However, in an analogous art, Alvarez discloses: 
receiving speech audio segments and noise segment (Alvarez teaches receiving a stream of audio data that is segmented into a plurality of segments that do and do not include speech (Alvarez ¶0005 and 0006));
determining a noise energy level of each noise segment and a speech energy level of each speech audio segment (Alvarez discloses determining, via a plurality of audio segments that includes speech or noise only audio segments, the intensity levels (energy levels transfer rate) of each audio segment by observing the peak signal levels (Alvarez ¶0011 and 0012));
generating noisy speech audio segments by combining each noise segment and each speech audio segment (Alvarez teaches generating noisy audio data that is comprised of speech utterances, background speech (noisy speech) and other forms of noise (e.g. music and car noises) (Alvarez ¶0066 and 0067));
training, using machine learning, the model using the noise energy level of each noise segment, a speech audio energy level of each speech audio segment, and the noisy speech audio segments (Alvarez teaches training with noisy data that is comprised of speech utterances, background speech (noisy speech) and other forms of noise (e.g. music and car noises) (Alvarez ¶0066 and 0067)).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to incorporate the disclosed teaching of Alvarez to the method of Matheja, in view of Mitchell, because this would employ the use of Automatic Gain Control (AGC) and Deep Neural Network (DNN) to improve the system in the presence of noisy inputs when training audio segments (Alvarez ¶0004).

Regarding Claim 10: Matheja, in view of Mitchell and in further view of Alvarez, further discloses the method of claim 9, wherein combining each noise segment and each speech audio segment comprises overlapping each noise segment and each audio segment in a time domain and summing each noise segment and each audio segment (Mitchell col 17, lines 14-19 and col 2, lines 28-32 and Alvarez ¶0005-0006 and 0066-0067).  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to incorporate the disclosed teaching of Alvarez to the method of Matheja, in view of Mitchell, because this would allow the system to combine the noisy and speech segments of Alvarez by overlapping multiple audio samples/frames by way of Mitchell, thus improving the training ability of the system when receiving noisy inputs (Alvarez ¶0004).
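
For illustration only, the combining step recited in claims 9 and 10 (overlapping a noise segment and a speech audio segment in the time domain, summing them, and keeping each segment's energy level for training) might be sketched as follows. This is not code from Alvarez, Mitchell, or the application. The function names energy and mix, the mean-square definition of energy, the truncation to the shorter segment, and the sample values are assumptions made for the example.

```python
# Hypothetical sketch: build one noisy-speech training example by
# time-domain overlap and sample-wise summation.
def energy(segment):
    """Mean squared amplitude of a segment (assumed energy measure)."""
    return sum(x * x for x in segment) / len(segment)


def mix(speech, noise):
    """Align both segments at sample 0 and sum them sample by sample."""
    n = min(len(speech), len(noise))  # assumption: truncate to the shorter one
    return [speech[i] + noise[i] for i in range(n)]


speech = [0.5, -0.5, 0.5, -0.5]   # made-up speech samples
noise = [0.1, 0.1, -0.1, -0.1]    # made-up noise samples
noisy = mix(speech, noise)
# One training example: the noisy input plus the two energy levels as labels.
example = (noisy, energy(speech), energy(noise))
```

A model trained on such examples would see the summed signal as input and the separately computed speech and noise energy levels as targets.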

Regarding Claim 19: Matheja, in view of Mitchell, discloses the system of claim 11. However, Matheja, in view of Mitchell, failed to explicitly disclose the claimed:
receive speech audio segments and noise segments;
determine a noise energy level of each noise segment and a speech energy level of each speech audio segment;
generate noisy speech audio segments by combining each noise segment and each speech audio segment;
train, using machine learning, the model using the noise energy level of each noise segment, a speech audio energy level of each speech audio segment, and the noisy speech audio segments.

However, in an analogous art, Alvarez discloses: 
receive speech audio segments and noise segments (Alvarez teaches receiving a stream of audio data that is segmented into a plurality of segments that do and do not include speech (Alvarez ¶0005 and 0006));
determine a noise energy level of each noise segment and a speech energy level of each speech audio segment (Alvarez discloses determining, via a plurality of audio segments that includes speech or noise only audio segments, the intensity levels (energy levels transfer rate) of each audio segment by observing the peak signal levels (Alvarez ¶0011 and 0012));
generate noisy speech audio segments by combining each noise segment and each speech audio segment (Alvarez teaches generating noisy audio data that is comprised of speech utterances, background speech (noisy speech) and other forms of noise (e.g. music and car noises) (Alvarez ¶0066 and 0067));
train, using machine learning, the model using the noise energy level of each noise segment, a speech audio energy level of each speech audio segment, and the noisy speech audio segments (Alvarez teaches training with noisy data that is comprised of speech utterances, background speech (noisy speech) and other forms of noise (e.g. music and car noises) (Alvarez ¶0066 and 0067)).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to incorporate the disclosed teaching of Alvarez to the method of Matheja, in view of Mitchell, because this would employ the use of Automatic Gain Control (AGC) and Deep Neural Network (DNN) to improve the system in the presence of noisy inputs when training audio segments (Alvarez ¶0004).

Regarding Claim 20: Matheja, in view of Mitchell and in further view of Alvarez, further discloses the system of claim 19, wherein combining each noise segment and each speech audio segment comprises overlapping each noise segment and each audio segment in a time domain and summing each noise segment and each audio segment (Mitchell col 17, lines 14-19 and col 2, lines 28-32 and Alvarez ¶0005-0006 and 0066-0067).  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to incorporate the disclosed teaching of Alvarez to the method of Matheja, in view of Mitchell, because this would allow the system to combine the noisy and speech segments of Alvarez by overlapping multiple audio samples/frames by way of Mitchell, thus improving the training ability of the system when receiving noisy inputs (Alvarez ¶0004).

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Please see attached form PTO-892.
Dickins et al. (US 10,511,718 B2) discloses, in a teleconferencing setting, receiving audio signal data from a plurality of uplink data streams (channels), with corresponding frequencies, to recognize speech and noise signals (including the strongest signal) based on energy levels to provide updates to the output gain of each signal.
Zhou et al. (US 10,728,656 B1) teaches receiving audio data from a plurality of input channels and signals to identify what is voice data versus background sound by conducting Automatic Gain Control (AGC) techniques and further training the data with an acoustic network model/artificial neural network.
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action with respect to Claims 21-23.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DERRICK SCOTT JEFFERIES whose telephone number is (571)272-0923. The examiner can normally be reached 7:30a-4:30p.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Richemond Dorvil, can be reached on (571) 272-7602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/DERRICK SCOTT JEFFERIES/Examiner, Art Unit 2658

/VIJAY B CHAWAN/Primary Examiner, Art Unit 2658