Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
EXAMINER’S AMENDMENT
An examiner’s amendment to the record appears below. Should the changes and/or additions be unacceptable to applicant, an amendment may be filed as provided by 37 CFR 1.312. To ensure consideration of such an amendment, it MUST be submitted no later than the payment of the issue fee.
Authorization for this examiner’s amendment was given in an interview with Mr. Andrew Rejent on 08/26/2021.
The application has been amended as follows: 
1. 	(Currently Amended) A computer-implemented system for recognizing and processing speech, comprising:	
an input processor configured to receive an input waveform and extract spectral features from the input waveform to form an initial spectrum;
a deep neural network trained to detect speech in the presence of both noise and reverberation, the deep neural network configured to receive the extracted spectral features and output speech detection probabilities indicating the presence of speech in the extracted spectral features; and
an output processor configured to modify the initial spectrum based on the speech detection probabilities indicating the presence of speech in each extracted spectral feature and output an enhanced waveform,
wherein the deep neural network was trained using speech conversations created by mixing a reverberant speech signal with a noise signal to match a target signal-to-noise ratio,
wherein the noise signal was created with at least one of a background noise data, a music data, or a non-stationary noise data,
wherein the reverberant speech signal was created by applying a room impulse response to the speech data to match a target reverberation time, and
wherein binary masks were used as output targets during training of the deep neural network and each binary mask was only active if energy due to clean speech data and reverberant speech data were both dominant with respect to the noise signal.

2. (Original) The system of claim 1,
wherein the input processor is configured to extract spectral features from the input waveform in the time and frequency domain to form the initial spectrum;
wherein the deep neural network is configured to process the extracted spectral features to identify frame-specific and frequency band-specific speech presence in the initial spectrum; and
wherein the output processor is configured to modify the initial spectrum on a per-frame and per-frequency band basis.

3. (Original) The system of claim 1, wherein the deep neural network is configured to predict frame-specific and frequency-band specific speech activity for each extracted spectral feature.

4. (Canceled)

5. (Currently Amended) The system of claim 1, wherein the speech data was created using a combination of clean speech data and silence data.

6. (Canceled)
7. (Canceled)
8. (Canceled)
9. (Canceled)

10. (Original) The system of claim 1, wherein the input is configured to extract short-time spectral features from a time-varying initial spectrum.

11. (Original) The system of claim 1, wherein the speech detection probabilities comprises a mask of frequency band-specific and frame-specific posterior probabilities.

12. (Original) The system of claim 1, comprising a filter configured to apply a passband to the input signal, the passband having a frequency range corresponding to speech.

13. (Original) The system of claim 12, wherein the filter is configured to apply cepstral mean subtraction on at least one cepstral coefficient of the input signal.

14. (Original) The system of claim 1, further comprising:
	a noise estimator configured to receive the speech detection probabilities from the deep neural network and the input waveform and output a noise variance estimate on a per-frame and per-band basis, and wherein the output processor is configured to modify the input waveform based on the noise variance. 

15. (Original) The system of claim 14, wherein the noise estimator is configured to perform noise estimation recursively in time.

16. (Original) The system of claim 14,
wherein the noise estimator is configured to process the initial spectrum as noise during inactive speech as determined by the output from the deep neural network and output the noise variance estimate of the inactive speech, and
wherein the noise estimator is configured to output an attenuated version of a previous noise estimate during active speech.

17. (Original) The system of claim 14, wherein the noise estimator is configured to combine the noise estimates for inactive speech and active speech together in a soft-decision manner using the probabilities of speech from the output of the deep neural network.

18. (Original) The system of claim 14, further comprising a signal-to-noise ratio estimator configured to:
receive the initial spectrum and the noise variance estimate; and 
calculate an a posteriori signal-to-noise ratio (SNR) of the initial spectrum and the noise variance estimate.

19. (Original) The system of claim 18, wherein the signal-to-noise ratio estimator is further configured to:
receive the speech detection probabilities from the deep neural network; and 
estimate an a priori signal-to-noise ratio (SNR) of an underlying clean speech signal of the initial spectrum based on the speech detection probabilities.

20. (Original) The system of claim 19, wherein one or both of the a posteriori SNRs and the a priori SNRs are calculated on a per-frame and per-frequency band basis.

21. (Original) The system of claim 14, further comprising a gain estimator configured to:
receive the initial spectrum and the noise variance estimate; and
calculate a gain mask for removing the estimated noise from the initial spectrum, 
wherein the output processor is configured to modify the initial spectrum based on the gain mask.

22. (Original) The system of claim 14, wherein the noise estimator is configured to calculate SNRs on a per-frame and per-frequency band basis, the system further comprising a gain estimator configured to:
receive the initial spectrum, the noise variance estimate, and the SNRs; and
calculate a gain mask for each frame-specific and band-specific component in the initial spectrum based on a respective SNR such that a strength of each gain mask corresponds to the value of the corresponding SNR,
wherein the output processor is configured to modify the initial spectrum based on the gain masks.

23. (Currently Amended) A method for training a neural network for detecting the presence of speech comprising:
constructing a multi-layer deep neural network configured to process extracted spectral features from an initial spectrum on a per-frame and per-frequency band basis to identify frame-specific and frequency band-specific spectral features that correspond to speech;
training the deep neural network using speech conversations created using speech data containing and reverberation; and
training the deep neural network using binary masks as output targets, 
wherein the speech conversations were created by mixing a reverberant speech signal with a noise signal to match a target signal-to-noise ratio,
wherein the noise signal was created with at least one of a background noise data, a music data, or a non-stationary noise data,
wherein the reverberant speech signal was created by applying a room impulse response to the speech data to match a target reverberation time,
wherein the speech data was created using a combination of clean speech data and silence data with a gain to simulate a recording distance, and
wherein the output mask was only active if energy due to clean speech data and reverberant speech data were both dominant with respect to the noise signal.

24. (Canceled)
25. (Canceled)
26. (Canceled)
27. (Canceled)
28. (Canceled)
29. (Canceled)
30. (Canceled)
31. (Canceled)
32. (Canceled)
33. (Canceled)
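
For illustration only, the noise-estimation, SNR, and gain limitations recited in the dependent claims above (recursive noise estimation, soft-decision combination using speech probabilities, a posteriori SNR, and a gain mask) might be sketched as follows. This is a minimal sketch and not the applicant's implementation: the function names, the `smoothing` and `attenuation` constants, and the Wiener-style gain rule are all assumptions chosen for the example, since the claims do not fix any of them.

```python
def update_noise(prev_noise, power, p_speech, smoothing=0.9, attenuation=0.98):
    """Soft-decision recursive noise variance update for one frequency band.

    prev_noise: noise variance estimate from the previous frame
    power:      |Y|^2 of the current frame in this band
    p_speech:   speech presence probability from the network, in [0, 1]
    smoothing, attenuation: hypothetical constants, not taken from the claims
    """
    # Inactive-speech hypothesis: treat the observed spectrum as noise
    # and fold it into a recursive average.
    inactive = smoothing * prev_noise + (1.0 - smoothing) * power
    # Active-speech hypothesis: keep an attenuated copy of the previous estimate.
    active = attenuation * prev_noise
    # Combine both hypotheses, weighted by the speech probability.
    return p_speech * active + (1.0 - p_speech) * inactive


def a_posteriori_snr(power, noise_var, floor=1e-12):
    """Ratio of observed power to estimated noise variance."""
    return power / max(noise_var, floor)


def wiener_gain(gamma):
    """Wiener-style gain (an assumed rule) from the a posteriori SNR.

    The a priori SNR is approximated as max(gamma - 1, 0).
    """
    xi = max(gamma - 1.0, 0.0)
    return xi / (1.0 + xi)


def process_frame(prev_noise, frame_power, frame_prob):
    """Apply the steps above independently to every band of one frame."""
    noise = [update_noise(n, p, q)
             for n, p, q in zip(prev_noise, frame_power, frame_prob)]
    gains = [wiener_gain(a_posteriori_snr(p, n))
             for p, n in zip(frame_power, noise)]
    return noise, gains
```

Calling `process_frame` once per frame, and feeding the returned noise estimate back in as `prev_noise`, gives the per-frame and per-band behavior; a higher SNR in a band yields a gain closer to 1, so that band is attenuated less.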

REASONS FOR ALLOWANCE
Allowable Subject Matter
Claims 1-3, 5, and 10-23 are allowed.
The following is a statement of reasons for the indication of allowable subject matter: The prior art of record, alone or in combination, fails to teach, for claim 1, “wherein the deep neural network was trained using speech conversations created by mixing a reverberant speech signal with a noise signal to match a target signal-to-noise ratio, wherein the noise signal was created with at least one of a background noise data, a music data, or a non-stationary noise data, wherein the reverberant speech signal was created by applying a room impulse response to the speech data to match a target reverberation time, and wherein binary masks were used as output targets during training of the deep neural network and each binary mask was only active if energy due to clean speech data and reverberant speech data were both dominant with respect to the noise signal.”; and, for claim 23, “wherein the speech conversations were created by mixing a reverberant speech signal with a noise signal to match a target signal-to-noise ratio, wherein the noise signal was created with at least one of a background noise data, a music data, or a non-stationary noise data, wherein the reverberant speech signal was created by applying a room impulse response to the speech data to match a target reverberation time, wherein the speech data was created using a combination of clean speech data and silence data with a gain to simulate a recording distance, and wherein the output mask was only active if energy due to clean speech data and reverberant speech data were both dominant with respect to the noise signal.”
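
For illustration only, the training-data limitations quoted above (applying a room impulse response, mixing with noise to a target signal-to-noise ratio, and deriving a binary mask target) might be sketched as follows. This is a minimal sketch under stated assumptions, not the applicant's method: the function names are hypothetical, "dominant" is read as a simple energy comparison against the noise, and selecting an impulse response that meets a target reverberation time is left outside the example.

```python
import math


def apply_rir(speech, rir):
    """Convolve speech samples with a room impulse response."""
    out = [0.0] * (len(speech) + len(rir) - 1)
    for i, s in enumerate(speech):
        for j, h in enumerate(rir):
            out[i + j] += s * h
    return out


def power(x):
    """Mean squared amplitude of a signal."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(reverb_speech, noise, target_snr_db):
    """Scale the noise so the speech-to-noise power ratio hits the target, then mix.

    Returns (mixture, scaled_noise). Signals are assumed to be equally long.
    """
    gain = math.sqrt(power(reverb_speech)
                     / (power(noise) * 10.0 ** (target_snr_db / 10.0)))
    scaled = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(reverb_speech, scaled)]
    return mixture, scaled


def binary_mask(clean_energy, reverb_energy, noise_energy):
    """Per-frame, per-band target mask.

    A cell is 1 only when BOTH the clean and the reverberant speech energy
    exceed the noise energy in that cell (assumed reading of "dominant").
    Each argument is a list of frames, each frame a list of band energies.
    """
    return [[1 if (c > n and r > n) else 0
             for c, r, n in zip(c_row, r_row, n_row)]
            for c_row, r_row, n_row in zip(clean_energy, reverb_energy, noise_energy)]
```

In this sketch the mask stays inactive in any cell where either the clean or the reverberant speech energy fails to exceed the noise, which mirrors the "both dominant" condition in the quoted limitation.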


Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. The prior art of record, Visser et al. (US 2016/0284346 A1), teaches (Abstract): Disclosed is a feature extraction and classification methodology wherein audio data is gathered in a target environment under varying conditions. From this collected data, corresponding features are extracted, labeled with appropriate filters (e.g., audio event descriptions), and used for training deep neural networks (DNNs) to extract underlying target audio events from unlabeled training data. Once trained, these DNNs are used to predict underlying events in noisy audio to extract therefrom features that enable the separation of the underlying audio events from the noisy components thereof.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MOHAMMAD K ISLAM, whose telephone number is (571) 270-5878. The examiner can normally be reached on Monday-Friday, EST (IFP).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached at 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




/MOHAMMAD K ISLAM/
Primary Examiner, Art Unit 2656