Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 01/22/2020 is being considered by the examiner.
Drawings
The drawing submitted on 09/11/2019 is being considered by the examiner.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1 and 4 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Yong et al., “A Regression Approach to Speech Enhancement Based on Deep Neural Networks” (IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 23, NO. 1, JANUARY 2015).

Regarding Claim 1, Yong et al. teach: A computer-implemented system for recognizing and processing speech, comprising: an input processor configured to receive an input waveform and extract spectral features from the input waveform to form an initial spectrum (Page 2, Column 2, II System Overview, Fig. 1, “A DNN is adopted as the mapping function from noisy to clean speech features. In the training stage, a DNN-based regression model was trained using the log-power spectral features from pairs of noisy and clean speech data. Therefore, short-time Fourier analysis is first applied to the input signal, computing the discrete Fourier transform (DFT) of each overlapping windowed frame. Then the log-power spectra are calculated.”); a deep neural network trained to detect speech in the presence of at least one of noise or distortion, the deep neural network configured to receive the extracted spectral features and output speech detection probabilities indicating the presence of speech in the extracted spectral features (Page 2, Col 1, “Recently in [32], we have proposed a regression DNN based speech enhancement framework via training a deep and wide neural network architecture using a large collection of heterogeneous training data with four noise types. In traditional speech enhancement techniques, the noise estimate is usually updated by averaging the noisy speech power spectrum using time and frequency dependent smoothing factors, which are adjusted based on the estimated speech presence probability in individual frequency bins (e.g., [8], [33]).”); and an output processor configured to modify the initial spectrum based on the speech detection probabilities indicating the presence of speech in each extracted spectral feature and output an enhanced waveform (Page 2, Column 2, II System Overview, Fig. 1, “In the enhancement stage, the noisy speech features are processed by the well-trained DNN model to predict the clean speech features. After we obtain the estimated log-power spectral features of clean speech, $\hat{X}^{l}(d)$, the reconstructed spectrum $\hat{X}^{f}(d)$ is given by: $\hat{X}^{f}(d) = \exp\{\hat{X}^{l}(d)/2\}\,\exp\{j\angle Y^{f}(d)\}$ (1), where $\angle Y^{f}(d)$ denotes dth dimension phase of the noisy speech.”).
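
For illustration only, the reconstruction in equation (1) of the quoted passage can be sketched as below. This is a minimal sketch under stated assumptions: the function name reconstruct_spectrum and the three-bin sample values are hypothetical and do not appear in Yong et al.; only the relationship between the estimated log-power spectrum, the noisy phase, and the reconstructed spectrum is taken from the quotation.

```python
import cmath
import math


def reconstruct_spectrum(est_log_power, noisy_phase):
    """Combine estimated log-power values with the noisy phase, bin by bin.

    Implements X^f(d) = exp{X^l(d) / 2} * exp{j * angle(Y^f(d))} for each
    dimension d. The function name is hypothetical (not from the reference).
    """
    if len(est_log_power) != len(noisy_phase):
        raise ValueError("log-power and phase vectors must have equal length")
    return [
        math.exp(lp / 2.0) * cmath.exp(1j * ph)
        for lp, ph in zip(est_log_power, noisy_phase)
    ]


# Hypothetical values for three frequency bins of a single frame.
log_power = [math.log(4.0), math.log(1.0), math.log(0.25)]
phase = [0.0, math.pi / 2, math.pi]
spectrum = reconstruct_spectrum(log_power, phase)
```

Halving the log-power value before exponentiating converts power back to magnitude, and the phase is carried over unchanged from the noisy input.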

Regarding Claim 4, Yong et al. teach: The system of claim 1, wherein the deep neural network was trained using speech conversations created using speech data containing at least one of noise or distortion (Page 2, Column 2, II System Overview, Fig. 1, “A block diagram of the proposed speech enhancement framework is illustrated in Fig. 1. A DNN is adopted as the mapping function from noisy to clean speech features. In the training stage, a DNN-based regression model was trained using the log-power spectral features from pairs of noisy and clean speech data. Therefore, short-time Fourier analysis is first applied to the input signal, computing the discrete Fourier transform (DFT) of each overlapping windowed frame. Then the log-power spectra are calculated.”).
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 2-3, 6-8, 10, and 23-24 are rejected under 35 U.S.C. 103 as being unpatentable over Yong et al. in view of DeLiang et al., “Learning Spectral Mapping for Speech Dereverberation and Denoising” (IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 23, NO. 6, JUNE 2015).

Regarding Claim 2, Yong et al. teach: The system of claim 1, wherein the deep neural network is configured to process the extracted spectral features to identify frame-specific and frequency band-specific speech presence in the initial spectrum; and wherein the output processor is configured to modify the initial spectrum on a per-frame and per-frequency band basis (See rejection of Claim 1 and Page 2, Col 1, Fig. 1, “Recently in [32], we have proposed a regression DNN based speech enhancement framework via training a deep and wide neural network architecture using a large collection of heterogeneous training data with four noise types. In traditional speech enhancement techniques, the noise estimate is usually updated by averaging the noisy speech power spectrum using time and frequency dependent smoothing factors, which are adjusted based on the estimated speech presence probability in individual frequency bins (e.g., [8], [33]).” Page 2, Column 2, II System Overview, Fig. 1).
Yong et al. do not explicitly teach “wherein the input processor is configured to extract spectral features from the input waveform in the time and frequency domain to form the initial spectrum.”
DeLiang et al. teach “wherein the input processor is configured to extract spectral features from the input waveform in the time and frequency domain to form the initial spectrum” (Page 1, Col 1, Fig. 1, “Deep neural networks (DNNs) have shown strong learning capacity [8]. A stacked denoising autoencoder (SDA) [37] is a deep learning method, and it can be trained to reconstruct the raw clean data from the noisy data, where hidden layer activations are used as learned features. Although SDAs were proposed to improve generalization, the main idea behind SDAs motivated us to utilize DNNs to learn the mapping from the corrupted data to clean data. A recent study [39] used DNNs to denoise acoustic features in each time-frequency unit for speech separation. In addition, Xu et al. [42] proposed a regression based DNN method for speech enhancement.” Page 2, Col 1, “A. Spectral Features: We first extract features for spectral mapping. Given a time domain input signal s(t), we use the short time Fourier transform (STFT) to extract features. We first divide the input signal into 20-ms time frames with 10-ms frame shift, and then apply fast Fourier transform (FFT) to compute log spectral magnitudes in each time frame. For a 16 kHz signal, we use 320-point FFT and therefore the number of frequency bins is 161. We denote the log magnitude in the kth frequency and the mth frame as $X(m, k)$. Therefore, in the spectrogram domain, each frame can be represented as a vector $\mathbf{x}(m)$: $\mathbf{x}(m) = [X(m, 1), X(m, 2), \ldots, X(m, 161)]^{T}$ (1)
In order to incorporate temporal dynamics, we include the spectral features of neighboring frames into a feature vector. Therefore, the input feature vector for the DNN feature mapping is: $\tilde{\mathbf{x}}(m) = [\mathbf{x}(m - d), \ldots, \mathbf{x}(m), \ldots, \mathbf{x}(m + d)]^{T}$ (2)
where d denotes the number of neighboring frames on each side and is set to 5 in this study. So the dimensionality of the input is $161 \times 11 = 1771$. The desired output of the neural network is the spectrogram of clean speech in the current frame m, denoted by a 161-dimensional feature vector $\mathbf{y}(m)$, whose elements correspond to the log magnitude in each frequency bin at the mth frame.”)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Yong et al. to include the teaching of DeLiang et al. in order to reconstruct raw clean data from the noisy data by denoising acoustic features in each time-frequency unit for speech separation by a regression-based DNN method for speech enhancement.
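
For illustration only, the context-window input vector described in the DeLiang et al. passage quoted for Claim 2 (161 bins per frame, d = 5 neighboring frames on each side) can be sketched as below. This is a minimal sketch under stated assumptions: the function name stack_context, the placeholder frame values, and the handling of edge frames by repeating the first or last frame are all hypothetical and are not stated in the reference.

```python
def stack_context(frames, m, d=5):
    """Concatenate frames m-d .. m+d into a single input vector.

    `frames` is a list of per-frame feature lists (e.g., 161 log magnitudes).
    Edge frames are padded by repeating the first or last frame; that choice
    is an assumption, not something stated in the quoted passage.
    """
    last = len(frames) - 1
    stacked = []
    for i in range(m - d, m + d + 1):
        stacked.extend(frames[min(max(i, 0), last)])
    return stacked


# Twenty hypothetical frames of 161 bins each (placeholder values only).
frames = [[float(t)] * 161 for t in range(20)]
x_tilde = stack_context(frames, m=10, d=5)
print(len(x_tilde))  # 161 * 11 = 1771
```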

Regarding Claim 3: The system of claim 1, wherein the deep neural network is configured to predict frame-specific and frequency-band specific speech activity for each extracted spectral feature (See rejection of Claim 2).

Regarding Claim 6: The system of claim 4, wherein the deep neural network was trained to detect speech in the presence of noise; and wherein the used speech conversations were created by mixing the speech data with a noise signal created with at least one of a background noise data, a music data, or a non-stationary noise data (See rejection of Claim 2; see also DeLiang et al., Abstract; Page 2, “B. DNN Based Spectral Mapping”; and Page 4, “B. Dereverberation”).

Regarding Claim 7: The system of claim 6, wherein the deep neural network was trained to detect speech in the presence of both noise and distortion, wherein the distortion includes reverberation, and wherein the deep neural network was trained using speech conversations created using speech data modified by room impulses responses (See rejection of Claim 6).

Regarding Claims 8 and 24: The system of claim 7, wherein each speech conversation was created using a noise signal mixed with a reverberant speech signal to match a target signal-to-noise ratio, and wherein the reverberant speech signal was created by applying a room impulse response to the speech data to match a target reverberation time (See rejection of Claim 6).
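
For illustration only, the claim language reciting a noise signal mixed with a speech signal "to match a target signal-to-noise ratio" can be sketched as below. This is a minimal sketch under stated assumptions: the function name mix_at_snr, the power-ratio definition of SNR, and the sample values are hypothetical and are not drawn from the application or the cited references. The sketch covers only the mixing step, not the application of a room impulse response.

```python
import math


def mix_at_snr(speech, noise, target_snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals the target.

    Returns the mixture and the gain applied to the noise. Assumes both
    signals have equal length and that the noise is not all zeros.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have equal length")
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    if p_noise == 0.0:
        raise ValueError("noise signal has zero power")
    # SNR_dB = 10 * log10(p_speech / (gain^2 * p_noise)), solved for gain.
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (target_snr_db / 10.0)))
    mixture = [s + gain * n for s, n in zip(speech, noise)]
    return mixture, gain


# Hypothetical four-sample signals, mixed at a 10 dB target.
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
mixture, gain = mix_at_snr(speech, noise, target_snr_db=10.0)
```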

Regarding Claim 10: The system of claim 1, wherein the input is configured to extract short-time spectral features from a time-varying initial spectrum (See rejection of Claim 2).

Regarding Claim 23: A method for training a neural network for detecting the presence of speech comprising: constructing a multi-layer deep neural network configured to process extracted spectral features from an initial spectrum on a per-frame and per-frequency band basis to identify frame-specific and frequency band-specific spectral features that correspond to speech; and training the deep neural network using speech conversations created using speech data containing at least one of noise or distortion, wherein the speech conversations containing noise were created by mixing the speech data with a noise signal created with at least one of a background noise data, a music data, or a non-stationary noise data, wherein the speech conversations containing distortion were created by modifying the speech data using room impulses responses, and wherein the speech data was created using a combination of clean speech data and silence data with a gain to simulate a recording distance (See rejection of Claim 2; see also DeLiang et al., Abstract; Page 2, “B. DNN Based Spectral Mapping”; and Page 4, “B. Dereverberation”).

Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Yong et al. in view of Arun et al., “On Training Targets for Supervised Speech Separation” (IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 22, NO. 12, DECEMBER 2014).

Regarding Claim 5, Yong et al. do not teach: The system of claim 4, wherein the speech data was created using a combination of clean speech data and silence data.
Arun et al. teach: the speech data was created using a combination of clean speech data and silence data (“Recently, we have formulated monaural speech separation as a supervised learning problem, which is a data driven approach. In the simplest form, acoustic features are extracted from noisy mixtures to train a supervised learning algorithm, e.g. a deep neural network (DNN) [36]. In many previous studies (e.g. [10], [16], [17]), the training target (or the learning signal) is set to the ideal binary mask (IBM), which is a binary mask constructed from premixed speech and noise signals (see Section III-A for definition).”).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Yong et al. to include the teaching of Arun et al. above in order to reconstruct raw clean data from the noisy data by denoising acoustic features in each time-frequency unit for speech separation by a regression-based DNN method for speech enhancement.

Claims 14-15 are rejected under 35 U.S.C. 103 as being unpatentable over Yong et al. in view of Israel, “Noise Spectrum Estimation in Adverse Environments: Improved Minima Controlled Recursive Averaging” (IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 11, NO. 5, SEPTEMBER 2003).

Regarding Claim 14, Yong et al. do not teach: The system of claim 1, further comprising: a noise estimator configured to receive the speech detection probabilities from the deep neural network and the input waveform and output a noise variance estimate on a per-frame and per-band basis, and wherein the output processor is configured to modify the input waveform based on the noise variance.
Israel teaches: a noise estimator configured to receive the speech detection probabilities from the deep neural network and the input waveform and output a noise variance estimate on a per-frame and per-band basis, and wherein the output processor is configured to modify the input waveform based on the noise variance ( Page 1, Col 2, Recently, we introduced a noise estimation approach, namely minima controlled recursive averaging (MCRA) [3], [4], that combines the robustness of the minimum tracking with the simplicity of the recursive averaging. The noise estimate is obtained by averaging past spectral power values, using a smoothing parameter that is adjusted by the speech presence probability in subbands. The speech presence probability is controlled by the minima values of a smoothed periodogram. The recursive averaging is carried out without a hard distinction between speech absence and presence, thus continuously updating the noise estimate even during weak speech activity. Additionally, the smoothing of the noisy periodogram is carried out in both time and frequency, which takes into account the strong correlation of speech presence in neighboring frequency bins of consecutive frames. Page 2, Col 1, In Section III, we introduce an estimator for the a priori speech absence probability. The estimator is controlled by the minima values of a smoothed periodogram of the noisy signal. In Section IV, we combine the time-varying recursive averaging with the minima-controlled estimation of the a priori speech absence probability, and present the IMCRA algorithm.
Page 5, Col 1, IV. Implementation of the Algorithm: In this section, we combine the time-varying recursive averaging with the minima-controlled estimation of the a priori speech absence probability, and present the IMCRA noise estimation algorithm. Page 8, Col 2, VI Conclusion: Recursive averaging is a commonly used procedure for estimating the noise power spectrum during sections which do not contain speech. However, rather than employing a voice activity detector and restricting the update of the noise estimator to periods of speech absence, we adapt the smoothing parameter in time and frequency according to the speech presence probability. The noise estimate is thereby continuously updated even during weak speech activity. We have proposed an estimator for the a priori speech absence probability that is controlled by the minima values of a smoothed periodogram of the noisy measurement. It combines conditions on both the instantaneous and local measured power, and provides a soft transition between speech absence and presence. This prevents an occasional increase in the noise estimate during speech activity. Furthermore, carrying out the smoothing and minimum tracking in two iterations allows larger smoothing windows and smaller minimum search windows, while reliably tracking the minima even during strong speech activity. This yields a reduced variance of the minima values and shorter delay when responding to a rising noise power, which eventually improves the tracking capability of the noise estimator. We have shown that in nonstationary noise environments and under low SNR conditions, the IMCRA approach is extremely effective. In particular, it obtains a lower estimation error, and when integrated into a speech enhancement system achieves improved speech quality and lower residual noise.)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Yong et al. to include the teaching of Israel above in order to obtain a noise estimate by averaging past spectral power values, using a time-varying, frequency-dependent smoothing parameter that is adjusted by the signal presence probability.
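
For illustration only, the recursive averaging described in the Israel passage, in which the smoothing parameter is adjusted by the speech presence probability, can be sketched as below. This is a minimal sketch under stated assumptions: the function name update_noise_estimate, the base smoothing value alpha_d = 0.85, and the sample values are hypothetical. The sketch shows one common form of probability-controlled smoothing and omits the minima-controlled estimation of the speech absence probability.

```python
def update_noise_estimate(noise_psd, noisy_power, speech_prob, alpha_d=0.85):
    """One recursive-averaging step per frequency band for a single frame.

    The effective smoothing parameter moves toward 1 as the speech presence
    probability rises, so the estimate is held during likely speech and
    updated toward the observed power during likely speech absence.
    alpha_d = 0.85 is an assumed illustrative value.
    """
    updated = []
    for prev, power, prob in zip(noise_psd, noisy_power, speech_prob):
        alpha = alpha_d + (1.0 - alpha_d) * prob
        updated.append(alpha * prev + (1.0 - alpha) * power)
    return updated


# Two hypothetical bands: speech certainly present, then certainly absent.
previous = [1.0, 1.0]
observed = [4.0, 4.0]
probability = [1.0, 0.0]
estimate = update_noise_estimate(previous, observed, probability)
```

With a probability of 1 the previous estimate is retained, and with a probability of 0 the update reduces to fixed-rate averaging at alpha_d. Calling the function once per frame performs the estimation recursively in time.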

Regarding Claim 15: The system of claim 14, wherein the noise estimator is configured to perform noise estimation recursively in time (See rejection of claim 14).
Allowable Subject Matter
Claims 9, 11-13, 16-22 and 25 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Visser et al. (US 2016/0284346 A1) teach: Disclosed is a feature extraction and classification methodology wherein audio data is gathered in a target environment under varying conditions. From this collected data, corresponding features are extracted, labeled with appropriate filters (e.g., audio event descriptions), and used for training deep neural networks (DNNs) to extract underlying target audio events from unlabeled training data. Once trained, these DNNs are used to predict underlying events in noisy audio to extract therefrom features that enable the separation of the underlying audio events from the noisy components thereof.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MOHAMMAD K ISLAM whose telephone number is (571)270-5878. The examiner can normally be reached on Monday-Friday, EST (IFP).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached on 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/MOHAMMAD K ISLAM/Primary Examiner, Art Unit 2656