DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 1 and 10 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Claim 1 recites “a target speech” in the limitation “generate a spectral-temporal mask to discriminate a target speech from noise and interference speech,” and this recitation is unclear. Claim 1 also recites “… thereby producing an equalized target speech signal”. It is not clear whether “a target speech” and “an equalized target speech signal” are the same or different.

Claim 10 recites “a target speech” in the limitation “estimating a spectral mask using the filtered multichannel audio input signal and the speech determination to discriminate a target speech from noise and interference speech,” and this recitation is unclear. Claim 10 also recites “… thereby producing an equalized target speech signal”. It is not clear whether “a target speech” and “an equalized target speech signal” are the same or different.
Claim 10 also recites “a spectral mask” in the limitation “estimating a spectral mask using the filtered …,” and this recitation is unclear. Paragraph 0014 of the specification recites “a spectral-temporal mask”. It is not clear whether “a spectral mask” and “a spectral-temporal mask” are the same or different.

Claim 11 recites “a multichannel audio input signal” in the limitation “wherein receiving a multichannel audio input signal comprises …”. There is insufficient antecedent basis for this limitation in the claim.

Claim 13 recites “identify the speech in the frame” in the limitation “processing the frame of the multichannel audio input signal through a neural network trained to identify the speech in the frame”. There is insufficient antecedent basis for this limitation in the claim.

Claim 16 recites “the multichannel audio signal” in the limitation “a selected channel of the multichannel audio signal”. There is insufficient antecedent basis for this limitation in the claim.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
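A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.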

Claims 1-3, 5-12 and 14-18 are rejected under 35 U.S.C. 103 as being unpatentable over Shin et al. (US PGPUB #2012/0130713) in view of Kaskari et al. (US PGPUB #2018/0182411) further in view of Dickins et al. (US PGPUB #2014/0126745).

Regarding Claim 1, Shin discloses (title; Figs. 1-4, 7A-11B, and 19-26) a system comprising:
a first voice activity detector (Shin ¶0077 discloses by combining voice activity measures that are based on different features of the signal [e.g., proximity, direction of arrival, onset/offset, SNR], a fairly good frame-by-frame VAD can be obtained) operable to detect speech in a frame of a multichannel audio input signal (Shin ¶0069 discloses an audio channel with more than one channel. ¶0083-¶0089 and ¶0103 discloses based on information from a first plurality of frames of the audio signal, task T100 [Fig. 8B] calculates a series of values of a first voice activity measure. Based on information from a second plurality of frames of the audio signal, task T200 calculates a series of values of a second voice activity measure that is different from the first voice activity measure. Based on the series of values of the first voice activity measure, task T300 calculates a boundary value of the first voice activity measure) and output a speech determination (Shin ¶0083-¶0089 and ¶0103 discloses based on the series of values of the first voice activity measure, the series of values of the second voice activity measure, and the calculated boundary value of the first voice activity measure, task T400 produces a series of combined voice activity decisions);
an adaptive filter operable to receive the multichannel audio input signal and the speech determination (Shin ¶0054 discloses a spatially selective filtering operation includes ;
a mask estimator (Shin Fig. 1: T60 masking) operable to receive the equalized target speech signal and the speech determination (Shin Fig. 1: T20 VAD input into T60; Fig. 2: T130 gain application [including frequency smoothing]) and generate a spectral-temporal mask to discriminate a target speech from noise and interference speech (Shin ¶0077 Fig. 21B: T120B temporal smoothing of gain factor; Fig. 2: processed speech output; Fig. 19: TF mask noise reference NRTF); and
a second voice activity detector operable to detect voice in a frame of the speech discriminated signal (Shin ¶0107 discloses a time-frequency mask-based noise reference NRTF can be calculated by multiplying the inverse of the TF VAD [Fig. 1: T60; T70] with the input signal).
Shin may not explicitly disclose a constrained minimum variance adaptive filter operable to receive the multichannel audio input signal and the speech determination and minimize a signal variance at the output of the filter, thereby producing an equalized target speech signal.
However, Kaskari (abstract; Figs. 1-2) teaches a constrained minimum variance adaptive filter operable to receive the multichannel audio input signal and the speech determination and minimize a signal variance at the output of the filter (Kaskari ¶0042 discloses prediction filter estimator 140 can implement a fast-converging, adaptive online [e.g., real-time] prediction filter estimation. A voice activity detector [VAD] 145 can be used to provide control in noisy environments over the prediction filter estimator 140 based on input to .
Shin and Kaskari are analogous art as they pertain to voice activity detection. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the voice activity detection system (as taught by Shin) to use a recursive least squares method to estimate the optimum prediction filter adaptively in real time (as taught by Kaskari, ¶0048), since a frequency-dependent adaptive step size is employed to speed up the convergence of the LMS filter process, such that the process arrives at its solution in fewer computational steps compared to a conventional LMS filter (Kaskari, ¶0006).
And Dickins teaches a constrained minimum variance adaptive filter (Dickins Fig. 1: VAD 125, VAD 129 receives inputs 106 and 110, adaptive filter updater 127) operable to receive the multichannel audio input signal and the speech determination (Dickins Fig. 1: mics 1-P and beam-formed signals 110 from spectral banding output 109; ¶0039 discloses to adaptively determine the filter coefficients, a noise estimator determines an estimate of the banded spectral amplitude metric of the noise. A voice-activity detector [VAD] uses the banded spectral amplitude metric of the noise, an estimate of the banded spectral amplitude metric of the mixed-down signal determined by a signal spectral estimator, and previously predicted echo spectral content to ascertain whether there is voice or not. ¶0368 discloses the processing of post-processing step 225 and of post-processor 1025 is controlled by a classification of the input signals, e.g., as being voice or not as determined by a VAD) and minimize a signal variance at the output of the filter, thereby producing an equalized target speech signal  (Dickins ¶0060 discloses carrying out .
Shin, Kaskari, and Dickins are analogous art as they pertain to voice activity detection. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Shin in view of Kaskari in light of the teachings of Dickins for perceptual-domain-based dynamic equalization processing (as taught by Dickins, ¶0375) to improve the quality of sound signals from microphones (Dickins, ¶0010).

Regarding Claim 2, Shin in view of Kaskari and Dickins discloses the system of claim 1, further comprising
an audio input sensor array including a plurality of microphones, each microphone generating a channel of the multichannel audio input signal (Shin ¶0132 discloses a portable audio sensing device, for example, that can be constructed to include such an array [R100 of two or more microphones] and to be used with such a VAD strategy for audio recording and/or voice communications applications include a telephone handset [e.g., a cellular telephone handset], etc.; Fig. 1: input mic channels).

Regarding Claim 3, Shin in view of Kaskari and Dickins discloses the system of claim 2, further comprising
a sub-band analysis module operable to decompose each of the channels into a plurality of frequency sub-bands (Shin Fig. 1: T10 - FFT including band-splitting).

Regarding Claim 5, Shin in view of Kaskari and Dickins discloses the system (¶0054: adaptive filters; ¶0065: VAD is used to indicate the presence or absence of human speech in segments of an audio signal) of claim 1, but may not explicitly disclose wherein the constrained minimum variance adaptive filter is operable to minimize the output variance when the speech determination indicates the absence of speech in the frame.
However, Dickins teaches wherein the constrained minimum variance adaptive filter is operable to minimize the output variance when the speech determination indicates the absence of speech in the frame (Dickins ¶0039 discloses to adaptively determine the filter coefficients, a noise estimator determines an estimate of the banded spectral amplitude metric of the noise. A voice-activity detector [VAD] uses the banded spectral amplitude metric of the noise, an estimate of the banded spectral amplitude metric of the mixed-down signal determined by a signal spectral estimator, and previously predicted echo spectral content to ascertain whether there is voice or not).
Shin, Kaskari, and Dickins are analogous art as they pertain to voice activity detection. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Shin in view of Kaskari in light of the teachings of Dickins for perceptual-domain-based dynamic equalization processing (as taught by Dickins, ¶0375) to improve the quality of sound signals from microphones (Dickins, ¶0010).

Regarding Claim 6, Shin in view of Kaskari and Dickins discloses the system of claim 1, but may not explicitly disclose wherein the constrained minimum variance adaptive filter comprises a normalized least mean square process.
However, Kaskari teaches wherein the constrained minimum variance adaptive filter comprises a normalized least mean square process (Kaskari ¶0013 discloses the prediction filter is further operable to use a least mean squares [LMS] process to estimate the prediction filter at each frame independently for each frequency bin. The system can also include an adaptive step-size estimator that improves a convergence rate of LMS compared to using a fixed step-size estimator. The system can also include a voice activity detector to control the update of the prediction filter).
Shin and Kaskari are analogous art as they pertain to voice activity detection. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the voice activity detection system (as taught by Shin) to use a recursive least squares method to estimate the optimum prediction filter adaptively in real time (as taught by Kaskari, ¶0048), since a frequency-dependent adaptive step size is employed to speed up the convergence of the LMS filter process, such that the process arrives at its solution in fewer computational steps compared to a conventional LMS filter (Kaskari, ¶0006).
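
For illustration only, the normalized least mean square process recited in claim 6, with the filter update gated by a speech determination, may be sketched as follows. This is a generic textbook NLMS update and is not code taken from Kaskari, Shin, or the present application. The function name nlms_step, the step size mu, the regularizer eps, and the boolean update flag (standing in for a VAD decision) are all assumptions made for the sketch.

```python
import numpy as np

def nlms_step(w, x, d, mu=0.5, eps=1e-8, update=True):
    """One generic NLMS iteration (hypothetical helper, not from the cited art).

    w: current filter weights, x: input vector, d: desired sample.
    update: assumed stand-in for a VAD decision that gates adaptation.
    """
    y = np.vdot(w, x)                 # filter output w^H x
    e = d - y                         # a priori error
    if update:
        # Step is normalized by the input energy; eps avoids division by zero.
        w = w + mu * x * np.conj(e) / (np.vdot(x, x).real + eps)
    return w, y, e

def identify(w_true, n_samples=2000, seed=0):
    """Adapt from zero weights toward a known filter using noiseless data."""
    rng = np.random.default_rng(seed)
    w = np.zeros_like(w_true)
    for _ in range(n_samples):
        x = rng.standard_normal(w_true.shape[0])
        w, _, _ = nlms_step(w, x, np.dot(w_true, x))
    return w
```

Passing update=False freezes the weights for that frame, which is one simple way a speech determination could control when the filter adapts.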

Regarding Claim 7, Shin in view of Kaskari and Dickins discloses the system of claim 1,
wherein the mask estimator is further operable to generate a reference feature signal for each sub-band and frame of a selected channel of the multichannel audio input signal (Shin Fig. 1: T60: masking; Fig. 19: TF mask noise reference NRTF).

Regarding Claim 8, Shin in view of Kaskari and Dickins discloses the system of claim 1,
wherein the second voice activity detector includes a single-channel power-based voice activity detector that is applied to each signal to produce a target speech mask (Shin ¶0112 discloses reducing noise in a multichannel audio signal that includes calculating a plurality of gain factors, each based on a power ratio between two channels of the multichannel signal in a corresponding frequency component during clean speech; and applying each of the calculated gain factors to the corresponding frequency component of at least one channel of the multichannel signal. Each of the gain factors can be based on a power ratio between two channels of the multichannel signal in a corresponding frequency component during noisy speech.  ¶0118 discloses we propose an alternative gain function that is based on the assumptions that the ratio of the clean speech power in the primary and secondary microphones in each band would be the same and that the noise is diffused. This method does not directly estimate noise power, but only deals with the power ratio between two microphones of the input signal and that of the clean speech. ¶0120 discloses the test statistic for TF proximity VAD is 20 log |Y1[k]|-20 log |Y2[k]|, or 10 log g[k], which can be measured. We assume that the noise is uncorrelated with the signals, and use the principle that the power of the sum of two uncorrelated signals is equal in general to the sum of the powers. Fig. 2: T110: TF VAD/gain difference based suppression; T120: VAD-based residual noise suppression).
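
For illustration only, the TF proximity VAD test statistic quoted above from Shin ¶0120, 20 log |Y1[k]| - 20 log |Y2[k]|, may be computed per frequency bin as sketched below. The sketch assumes that "log" denotes the base-10 logarithm (a decibel level difference). The function names, the small constant eps, and the 6 dB decision threshold are hypothetical and are not taken from Shin.

```python
import numpy as np

def proximity_statistic_db(Y1, Y2, eps=1e-12):
    """Per-bin level difference 20*log10|Y1[k]| - 20*log10|Y2[k]| in dB.

    Y1, Y2: spectra of the primary and secondary channels for one frame.
    eps is an assumed guard against taking the log of zero.
    """
    mag1 = np.abs(np.asarray(Y1)) + eps
    mag2 = np.abs(np.asarray(Y2)) + eps
    return 20.0 * np.log10(mag1) - 20.0 * np.log10(mag2)

def tf_vad(Y1, Y2, threshold_db=6.0):
    """Hypothetical per-bin decision: True where the level difference exceeds the threshold."""
    return proximity_statistic_db(Y1, Y2) > threshold_db
```

A bin in which the primary channel has twice the magnitude of the secondary channel yields a statistic of about 6.02 dB, and a bin with equal magnitudes yields 0 dB.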

Regarding Claim 9, Shin in view of Kaskari and Dickins discloses the system of claim 1,
wherein the system comprises a speaker, a tablet, a mobile phone, and/or a laptop computer (Shin ¶0132 discloses a portable audio sensing device, for example, that can be constructed to include such an array [R100 of two or more microphones] and to be used with such a VAD strategy for audio recording and/or voice communications applications include a telephone handset [e.g., a cellular telephone handset]; a wired or wireless headset [e.g., a Bluetooth headset]; a handheld audio and/or video recorder; a personal media player configured to record audio and/or video content; a personal digital assistant [PDA] or other handheld computing device; and a notebook computer, laptop computer, netbook computer, tablet computer, or other portable computing device, etc.).

Claims 10-12 and 14-18 are rejected for the same reasons as set forth in Claims 1-3 and 5-9.

Claims 4 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Shin et al. (US PGPUB #2012/0130713) in view of Kaskari et al. (US PGPUB #2018/0182411) further in view of Dickins et al. (US PGPUB #2014/0126745) and Vickers (US PGPUB #2016/0093313).

Regarding Claim 4, Shin in view of Kaskari and Dickins discloses the system of claim 1, but may not explicitly disclose wherein the first voice activity detector comprises a neural network trained to identify speech in the frame of the multichannel audio input signal.
However, Vickers (title) teaches wherein the first voice activity detector comprises a neural network trained to identify speech in the frame of the multichannel audio input signal (Vickers ¶0040 discloses the normalized VAD features can then be used [e.g., by a neural network, etc.] to determine whether or not the audio signal includes a voice signal. This process can be repeated to continuously update the voice activity detector .
Shin, Kaskari, Dickins, and Vickers are analogous art as they pertain to voice activity detection. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Shin in view of Kaskari and Dickins in light of the teachings of Vickers to use a neural network to continuously update voice activity detection using running estimates (as taught by Vickers, ¶0040) for allowing improvements in VAD and feature normalization (Vickers, ¶0007).

Claim 13 is rejected for the same reasons as set forth in Claim 4.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to YOGESHKUMAR G PATEL whose telephone number is (571)272-3957.  The examiner can normally be reached on 7:30 AM-4 PM PST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




/YOGESHKUMAR PATEL/Primary Examiner, Art Unit 2651