DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2, 4-5, 9-11, 13, and 16-17 are rejected under 35 U.S.C. 103 as being unpatentable over Wung et al., US 2019/0172476 A1 (hereafter Wung), in view of Matsuo, US 2015/0066487 A1.
Regarding claim 1, Wung discloses a deep learning driven multi-channel filtering method for speech enhancement (see Wung, abstract and ¶ 0003).  Herein, Wung teaches a “speech enhancement method comprising steps of: receiving an audio input” (see Wung, ¶ 0024 and 0060, figure 1, and figures 4 and 8, units 2, 6, and 7, where the audio input is received via a microphone array), “converting the audio input into a plurality of successive digital audio signals, at least comprising a first digital audio signal and a second digital audio signal immediately after the first digital audio signal, each of the digital audio signals corresponding to an audio frame of the audio input” (see Wung, ¶ 0027 and 0029 and figures 4 and 8, unit “STFT block”, where the successive digital audio signals are converted via the Short Time Fourier Transform); and “sequentially processing the digital audio signals to generate a plurality of corresponding estimated audio signals, comprising steps of: processing the first digital audio signal to reduce stationary noise in the first digital audio signal according to a stationary noise suppression model 
Next, Wung teaches that the sequentially processing comprises “a first neural network generating a first voice activity detection signal according to the first digital audio signal” (see Wung, ¶ 0030-0034 and figures 1, 4, and 8, unit 3, where a DNN generates the first VAD signal); “a second neural network generating a first noise suppression signal by reducing non-stationary noise in the first artifact signal according to the first digital audio signal, the first artifact signal and the first voice activity detection signal” (see Wung, ¶ 0060-0061 and figure 8, unit 13, where a second DNN further reduces the noise in the first artifact signal with an ideal ratio mask (IRM)); and “optimizing the stationary noise suppression model according to the first voice activity detection signal, and processing the second digital audio signal according to the optimized stationary noise suppression model to reduce the stationary noise in the second digital audio signal to generate a second artifact signal” (see Wung, ¶ 0035-0039, figures 2 and 3, and figure 8, unit 2), because the multi-channel filter updates its noise reduction model, such as the parametric multi-channel Wiener filter (PMWF), which produces an estimated clean speech signal based on the ratio of estimated covariance matrices (e.g., Rvv and Ryy).
Last, Wung teaches the step of “outputting the estimated audio signals”, where the ISTFT block outputs the enhanced speech (see Wung, ¶ 0047 and figures 4 and 8).
However, Wung does not appear to teach the feature of “generating a first one of the estimated audio signals by combining the first noise suppression signal and the phase signal of the first digital audio signal”.  Additionally, Wung teaches the STFT for converting the input audio into frames of digital audio, but does not appear to teach digital audio signals “comprising a magnitude signal and a phase signal”, and does not mention frames “partially overlapping each other”.
Matsuo discloses a voice processing method to reduce noise components in a voice signal (see Matsuo, abstract and ¶ 0003).  Specifically, Matsuo teaches the step of “converting the audio input into 
“A speech enhancement method comprising steps of: 
receiving an audio input;” (see Wung, ¶ 0024 and 0060, figure 1, and figures 4 and 8, units 2, 6, and 7);

“converting the audio input into a plurality of successive digital audio signals, at least comprising a first digital audio signal and a second digital audio signal immediately after the first digital audio signal, each of the digital audio signals corresponding to an audio frame of the audio input and comprising a magnitude signal and a phase signal, a first audio frame corresponding to the first digital audio signal and a second audio frame corresponding to the second digital audio signal partially overlapping each other;” (see Wung, ¶ 0027-0029, 0042, 0044, and 0060, and figures 4 and 8, units 6-7 and “STFT block”, in view of Matsuo, ¶ 0021 and 0026-0027, and figure 2, unit 10)
 
“sequentially processing the digital audio signals to generate a plurality of corresponding estimated audio signals, comprising steps of: processing the first digital audio signal to reduce stationary noise in the first digital audio signal according to a stationary noise suppression model to generate a first artifact signal;” (see Wung, ¶ 0035 and 0060, and figures 1, 2, 4, and 8, unit 2);

“a first neural network generating a first voice activity detection signal according to the first digital audio signal;” (see Wung, ¶ 0030-0034 and 0060, and figures 1, 4, and 8, unit 3);

“a second neural network generating a first noise suppression signal by reducing non-stationary noise in the first artifact signal according to the first digital audio signal, the first artifact signal and the first voice activity detection signal;” (see Wung, ¶ 0047-0048 and 0060-0061, and figure 8, unit 13);



“optimizing the stationary noise suppression model according to the first voice activity detection signal, and processing the second digital audio signal according to the optimized stationary noise suppression model to reduce the stationary noise in the second digital audio signal to generate a second artifact signal; and” (see Wung, ¶ 0035-0039, figures 2 and 3, and figure 8, unit 2); and 

“outputting the estimated audio signals” (see Wung, ¶ 0047 and figures 4 and 8, unit “ISTFT block”).
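
For illustration only, and not as a characterization of Wung, Matsuo, or the application, the following minimal sketch shows what the claim language about partially overlapping frames, a magnitude signal, a phase signal, and recombination with the original phase can look like in practice. The function names, the frame length of 8, the hop of 4, the test sinusoid, and the uniform gain of 0.5 standing in for a noise suppression result are all assumptions chosen for the example.

```python
import cmath
import math


def frame_signal(samples, frame_len, hop):
    """Split samples into frames; hop < frame_len makes adjacent frames overlap."""
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop)]


def dft(frame):
    """Naive discrete Fourier transform of one frame (O(N^2), for clarity only)."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def to_magnitude_phase(spectrum):
    """Separate each complex bin into a magnitude and a phase angle."""
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]


def recombine(magnitude, phase):
    """Rebuild complex bins from a modified magnitude and the original phase."""
    return [m * cmath.exp(1j * p) for m, p in zip(magnitude, phase)]


# Hypothetical input: 16 samples of a sinusoid; frame length 8, hop 4 (50% overlap).
samples = [math.sin(2 * math.pi * 2 * t / 8) for t in range(16)]
frames = frame_signal(samples, frame_len=8, hop=4)

spectrum = dft(frames[0])
magnitude, phase = to_magnitude_phase(spectrum)

# Assumed stand-in for a noise-suppressed magnitude: a uniform gain of 0.5.
suppressed = [0.5 * m for m in magnitude]
estimate = recombine(suppressed, phase)
```

With a hop smaller than the frame length, the second half of one frame is the first half of the next, which is the sense of "partially overlapping" used in the sketch.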


Regarding claim 2, see the preceding rejection with respect to claim 1 above.  The combination makes obvious the “speech enhancement method according to claim 1, wherein the step of processing the first digital audio signal to reduce the stationary noise in the first digital audio signal according to the stationary noise suppression model to generate the first artifact signal comprises steps of: receiving a spectral noise suppression gain as the stationary noise suppression model; and adjusting the first digital audio signal according to the spectral noise suppression gain to generate the first artifact signal” because Wung teaches a parametric multi-channel Wiener filter (PMWF) that produces an estimated clean speech signal based on the ratio of estimated covariance matrices (e.g., Rvv and Ryy) (see Wung, ¶ 0035 and figure 3).
Regarding claim 4, see the preceding rejection with respect to claim 1 above.  The combination makes obvious the “speech enhancement method according to claim 1, wherein the first voice activity detection signal generated by the first neural network has a value restricted from 0 to 1” (see Wung, ¶ 0033).
Regarding claim 5, see the preceding rejection with respect to claim 1 above.  The combination makes obvious the “speech enhancement method according to claim 1, wherein the first voice activity detection signal generated by the first neural network has a value restricted by a hyperbolic tangent function from −1 to 1 or a linear function with minimum to maximum normalization” because Wung 
Regarding claim 9, see the preceding rejection with respect to claim 1.  The combination of Wung and Matsuo makes obvious the method of claim 1 and likewise makes obvious a system with these features.  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Wung with the teachings of Matsuo for the purpose of correcting signal discontinuities at the output of a frame-by-frame signal processing method (see Matsuo, ¶ 0004 and 0007).
Specifically, the combination makes obvious:
“A speech enhancement system receiving an audio input through a sound collecting device, the speech enhancement system comprising: 
a pre-processor configured to receive the audio input and convert the audio input into a plurality of successive digital audio signals, the successive digital audio signals at least comprising a first digital audio signal and a second digital audio signal immediately after the first digital audio signal, each of the digital audio signals corresponding to an audio frame of the audio input and comprising a magnitude signal and a phase signal, a first audio frame corresponding to the first digital audio signal and a second audio frame corresponding to the second digital audio signal partially overlapping each other;” (see Wung, ¶ 0024, 0027-0029, 0042, 0044, and 0060, and figures 4 and 8, units 6 and 7 and “STFT block”, in view of Matsuo, ¶ 0021 and 0026-0027, and figure 2, unit 10);
 
“a first-stage noise suppression device electrically coupled to the pre-processor, configured to process the first digital audio signal to reduce stationary noise in the first digital audio signal according to a stationary noise suppression condition to generate a first artifact signal;” (see Wung, ¶ 0035 and 0060, and figures 1, 2, 4, and 8, unit 2);

“a second-stage noise suppression device electrically coupled to the first-stage noise suppression device, configured to generate a first voice activity detection signal according to the first digital audio signal, and generate a first noise suppression signal by reducing non-stationary noise in the first artifact signal according to the first digital audio signal, the first artifact signal and the first voice activity detection signal; and” (see Wung, ¶ 0030-0039 and 0060-0061, figures 1 and 4, units 2 and 3, figures 2-3, and figure 8, units 2, 3, and 13);

“a reconstruction device electrically coupled to the second-stage noise suppression device and the pre-processor, configured to generate an estimated audio signal by combining the first noise suppression signal and the phase signal of the first digital audio signal, wherein the first-stage noise suppression device further optimizes the stationary noise suppression model according to the first voice activity detection signal, and processes the second digital audio signal according to the optimized stationary noise suppression model to reduce the stationary noise in the second digital audio signal to 


Regarding claim 10, see the preceding rejection with respect to claim 9 above.  The combination makes obvious the “speech enhancement system according to claim 9, wherein the second-stage noise suppression device is a many-to-many recurrent neural network” (see Wung, ¶ 0030).
Regarding claim 11, see the preceding rejection with respect to claim 9 above.  The combination makes obvious the “speech enhancement system according to claim 9, wherein the first-stage noise suppression device receives a spectral noise suppression gain as the stationary noise suppression model, and adjusts the first digital audio signal according to the spectral noise suppression gain to generate the first artifact signal” because Wung teaches a parametric multi-channel Wiener filter (PMWF) that produces an estimated clean speech signal based on the ratio of estimated covariance matrices (e.g., Rvv and Ryy) (see Wung, ¶ 0035 and figure 3).
Regarding claim 13, see the preceding rejection with respect to claim 9 above.  The combination makes obvious the “speech enhancement system according to claim 9, wherein the second-stage noise suppression device comprises: 
a first recurrent neural network configured to generate the first voice activity detection signal according to the first digital audio signal; and” (see Wung, ¶ 0030-0034 and figures 1, 4, and 8, unit 3); and

“a second recurrent neural network configure to generate the first noise suppression signal by reducing non-stationary noise in the first artifact signal according to the first digital audio signal, the first artifact signal and the first voice activity detection signal.” (see Wung, ¶ 0060-0061 and figure 8, unit 13).

Regarding claim 16, see the preceding rejection with respect to claim 13 above.  The combination makes obvious the “speech enhancement system according to claim 13, wherein the first recurrent neural network further comprises an activation function circuit restricting a value of the first voice activity detection signal from 0 to 1” (see Wung, ¶ 0033).
Regarding claim 17, see the preceding rejection with respect to claim 13 above.  The combination makes obvious the “speech enhancement system according to claim 13, wherein the first recurrent neural network further comprises an activation function circuit using a hyperbolic tangent function to restrict a value of the first voice activity detection signal from −1 to 1 or a linear function with minimum to maximum normalization” because Wung teaches the VAD signal as the DNN SPP value that varies from 0 to 1, and therefore makes obvious the linear function features (see Wung, ¶ 0033).
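
For illustration only, the three value-restricting alternatives recited in claims 4-5 and 16-17 can be compared in a short sketch. It is not taken from Wung or from the application. The logistic function is an assumed example of a mapping onto the range 0 to 1, and the function names and the list of raw scores are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function; for moderate inputs the output lies between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh_activation(x):
    """Hyperbolic tangent; the output lies between -1 and 1."""
    return math.tanh(x)


def min_max_normalize(values):
    """Linear rescaling so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Degenerate case: all values equal, so there is no range to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical pre-activation outputs of a VAD network.
raw_scores = [-3.0, -0.5, 0.0, 2.0, 5.0]

sigmoid_out = [sigmoid(x) for x in raw_scores]
tanh_out = [tanh_activation(x) for x in raw_scores]
linear_out = min_max_normalize(raw_scores)
```

The sketch shows only that a linear rescaling and a squashing function both yield a bounded value; whether one makes the other obvious is a legal question that the code does not address.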

Claims 6-8 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Wung and Matsuo as applied to claims 1 and 13 above, and further in view of Vickers, US 2016/0093313 A1.
Regarding claim 6, see the preceding rejection with respect to claim 1 above.  The combination of Wung and Matsuo makes obvious the speech enhancement method according to claim 1, but the combination does not appear to make obvious the feature of first estimated values.
Vickers discloses a neural network voice activity detection (VAD) method using running range normalization (see Vickers, abstract).  Herein, Vickers discloses that traditional feature normalization, such as mean-variance normalization, produces misleading VAD results (see Vickers, ¶ 0006-0007).  In order to improve the VAD results, Vickers teaches a running range normalization of the VAD features, where the normalized VAD features are input to a neural network to generate a VAD estimate (see Vickers, ¶ 0039-0041 and figure 1, steps 102-110).  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combination of Wung and Matsuo with the teachings of Vickers to improve the VAD results (see Vickers, ¶ 0007 and 0037).  Therefore, the combination of Wung, Matsuo, and Vickers makes obvious:
“The speech enhancement method according to claim 1, wherein the step of generating the first voice activity detection signal by the first neural network according to the first digital audio signal further comprises steps of: 


“the first neural network processing the input sections corresponding to the different time points to generate a plurality of estimated values, the estimated values comprising a plurality of first estimated values corresponding to the first digital audio signals of the input sections; and” (see Wung, ¶ 0033 and 0042 and figure 4, units 3 and 4, in view of Vickers, ¶ 0040-0041 and figure 1, step 110);

“generating the first voice activity detection signal according to the first estimated values.” (see Vickers, ¶ 0041 and figure 1, step 112).


Regarding claim 7, see the preceding rejection with respect to claim 6 above.  The combination makes obvious the “speech enhancement method according to claim 6, wherein the step of generating the first voice activity detection signal according to the first estimated values further comprises steps of: 
receiving the first estimated values; and calculating an average value of the first estimated values to obtain the first voice activity detection signal” because Vickers teaches a process of smoothing the VAD estimate output of the neural network to obtain the post-processed VAD estimate, and the smoothing operation makes obvious averaging (see Vickers, ¶ 0041).
Regarding claim 8, see the preceding rejection with respect to claim 6 above.  The combination makes obvious the “speech enhancement method according to claim 6, wherein the step of generating the first voice activity detection signal according to the first estimated values further comprises steps of: receiving the first estimated values; and comparing the first estimated values with a second threshold value to determine the first voice activity detection signal based on majority rule” because Vickers makes obvious post-processing the VAD estimate output of the neural network with a “Neural Net VAD threshold” (see Vickers, ¶ 0041 and figure 1, step 112). 
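
For illustration only, the two post-processing alternatives recited in claims 7 and 8 (averaging the first estimated values, and thresholding them followed by a majority rule) can be sketched as below. The sketch does not reproduce Vickers' smoothing or threshold steps. The function names, the example estimates, and the threshold of 0.5 are hypothetical.

```python
def average_vad(estimates):
    """Claim 7 style: the VAD signal is the mean of the estimated values."""
    return sum(estimates) / len(estimates)


def majority_vad(estimates, threshold):
    """Claim 8 style: compare each value with a threshold, then take a majority vote."""
    votes = sum(1 for e in estimates if e > threshold)
    # Speech (1) is declared only if more than half of the values exceed the threshold.
    return 1 if votes > len(estimates) / 2 else 0


# Hypothetical estimates for the same frame, taken from input sections
# at different time points.
first_estimated_values = [0.9, 0.7, 0.2, 0.8, 0.6]

vad_by_average = average_vad(first_estimated_values)
vad_by_majority = majority_vad(first_estimated_values, threshold=0.5)
```

In this example four of the five values exceed the assumed threshold, so the majority rule yields 1, while the average stays a continuous value near 0.64.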
Regarding claim 14, see the preceding rejection with respect to claim 13 above.  The combination of Wung and Matsuo makes obvious the system of claim 13, but the combination does not appear to make obvious the feature of first estimated values.

Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Wung, Matsuo, and Vickers as applied to claim 14 above, and further in view of well-known prior art.
Regarding claim 15, see the preceding rejection with respect to claim 14 above.  The combination of Wung, Matsuo, and Vickers makes obvious the system of claim 14, but does not appear to explicitly teach the features with respect to shift registers.
The examiner takes Official Notice that it is well-known in the prior art to provide cascade-connected shift registers that provide the input sections of a neural network, where the neural network processes the input sections corresponding to the different time points.  It would have been obvious to one of ordinary skill in the art at the time of the effective filing date to modify the combination of Wung, Matsuo, and Vickers with the well-known prior art to provide the input to a neural network that .

Allowable Subject Matter
Claims 3 and 12 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Kaskari et al., US 2018/0182411 A1;
Tashev et al., US 2019/0318755 A1; 
Sivaraman et al., US 2019/0385630 A1; and 
Lee et al., US 2020/0211580 A1.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Daniel R Sellers whose telephone number is (571)272-7528. The examiner can normally be reached Mon - Fri 10:00-4:00.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Fan S Tsang, can be reached at (571)272-7547. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/Daniel R Sellers/               Examiner, Art Unit 2653