DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claim 10 is rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention. Claim 10 recites the limitation “VoIP application”. The abbreviation “VoIP” is not defined in the claim, and its meaning is therefore unclear. Application ¶0034 discloses “Voice-over-IP (VoIP) applications”.

Claim 12 is rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention. Claim 12 recites the limitation “generating a corresponding the multichannel audio input signal”, which is not clear.

Claims 1 and 9 recite the limitation “the multi-stream target-speech detection generator” in “wherein the multi-stream target-speech detection generator”. There is insufficient antecedent basis for this limitation in the claim. For the purposes of this written opinion, as best understood, "the multi-stream target-speech detection generator" has been interpreted to read "the multi-stream target-speech detector generator".

Claim 1 recites the limitation “the enhanced target streams” in “to determine a plurality of weights associated with the enhanced target streams”. There is insufficient antecedent basis for this limitation in the claim. For the purposes of this written opinion, as best understood, "the enhanced target streams" has been interpreted to read "each of the enhanced target streams".

Claim 1 recites the limitation “the enhanced target streams” in “to apply the plurality of weights to the enhanced target streams to generate a combined enhanced output signal”. There is insufficient antecedent basis for this limitation in the claim. For the purposes of this written opinion, as best understood, "the enhanced target streams" has been interpreted to read "each of the enhanced target streams".

Claim 5 recites the limitation “the target-speech detector engines” in “wherein the target-speech detector engines comprise”. There is insufficient antecedent basis for this limitation in the claim. For the purposes of this written opinion, as best understood, "the target-speech detector engines" has been interpreted to read "each of the target-speech detector engines".

Claim 6 recites the limitation “each target speech detector engine” in “wherein each target speech detector engine is configured to”. There is insufficient antecedent basis for this limitation in the claim. For the purposes of this written opinion, as best understood, "each target speech detector engine" has been interpreted to read "each of the target-speech detector engines".

Claim 10 recites the limitation “VoIP” in “VoIP if the target-speech is detected”. There is insufficient antecedent basis for this limitation in the claim. For the purposes of this written opinion, as best understood, "VoIP if the target-speech is detected" has been interpreted to read "the VoIP application if the target-speech is detected".

Claim 11 recites the limitation “the stream” in “detecting a target-speech in the stream using a multi-stream target-speech detector generator”. There is insufficient antecedent basis for this limitation in the claim. For the purposes of this written opinion, as best understood, "detecting a target-speech in the stream" has been interpreted to read "detecting a target-speech in each of the enhanced target streams".

Claim 11 recites the limitation “the enhanced target streams” in “applying the calculated weights to the enhanced target streams”. There is insufficient antecedent basis for this limitation in the claim. For the purposes of this written opinion, as best understood, "the enhanced target streams" has been interpreted to read "each of the enhanced target streams".





Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

This application includes one or more claim limitations that use the word “means” or “step” but are nonetheless not being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph because the claim limitation(s) recite(s) sufficient structure, materials, or acts to entirely perform the recited function. Such claim limitation(s) is/are:
Claim 3: a plurality of speech enhancement modules (See Fig. 2: 202; ¶0026)
Claim 3: each speech enhancement module (See Fig. 2: 202; ¶0026)
Claim 4: wherein the plurality of speech enhancement modules (See Fig. 2: 202; ¶0026)

Because this/these claim limitation(s) is/are not being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, it/they is/are not being interpreted to cover only the corresponding structure, material, or acts described in the specification as performing the claimed function, and equivalents thereof.
If applicant intends to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to remove the structure, materials, or acts that perform the claimed function; or (2) present a sufficient showing that the claim limitation(s) does/do not recite sufficient structure, materials, or acts to perform the claimed function.

Claim Objections
Claim 6 is objected to because of the following informalities: Claim 6 is missing a period. Appropriate correction is required.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Wung et al. (US PGPUB #2018/0350379) in view of Sundaram et al. (US Patent #9734822).

Regarding Claim 1, Wung discloses a system (Figs. 1, 10) comprising:
a target speech enhancement engine (Wung Figs. 1, 10, ¶0003 discloses a digital speech enhancement system that performs specific chain of digital signal processing operation upon a multi-channel sound pick-up, to result in an enhanced speech signal) configured to analyze a multichannel audio input signal (Wung Fig. 1: microphone array 2) and generate a plurality of enhanced target streams (Wung ¶0056 discloses the digital ;
a multi-stream target-speech detector generator comprising a plurality of target-speech detector engines (Wung ¶0035 discloses the source signals produced by the BSS 15 and the single pickup beam produced at the output of the residual echo suppressor 10 are provided to a speech stream selector 11, where the latter can analyze these input signals [e.g., based on their individual signal to noise ratio, SNR] and select one of them as containing an ASR voice trigger phrase, or as being the one most suitable for input to the ASR 12) each configured to determine a confidence of quality and/or presence of a specific target-speech in the stream (Wung ¶0035 discloses the selector 11 can assign a score to each of the audio streams at its input, by for example a deep neural network that has been previously trained to detect a trigger phrase, e.g., "Hey Hal." Each score quantifies the likelihood of the presence of the trigger phrase in its respective stream, the stream with the highest score is selected and passed to the ASR engine),
wherein the multi-stream target-speech detection generator is configured to determine a plurality of weights associated with the enhanced target streams (Wung ¶0036-¶0045 discloses the output signal vector [from de-reverb processor 5] can be computed as a product of a conversion factor r[n] and a priori error zeta[n] that is computed based on a difference between a new instance of the input signal vector y[n] and the concatenation xL[n] weighted by an old instance of a multi-channel linear prediction, MCLP, filter coefficient matrix G[n]); and
a fusion subsystem configured to apply the plurality of weights to the enhanced target streams to generate a combined enhanced output signal (Wung ¶0035 discloses the speech .
Wung may not explicitly disclose wherein the multi-stream target-speech detection generator is configured to determine a plurality of weights associated with the enhanced target streams; and a fusion subsystem configured to apply the plurality of weights to the enhanced target streams to generate a combined enhanced output signal.
However, Sundaram (Figs. 1A-7) teaches wherein the multi-stream target-speech detection generator is configured to determine a plurality of weights associated with the enhanced target streams (Sundaram col. 7 lines 53-65 discloses a beamformer 114 includes filter blocks 240, 242, and 244 and summation module 250. Generally, the filter blocks 240, 242, and 244 receive input signals from the sensor array 220, apply filters [such as weights, delays, or both] to the received input signals, and generate weighted, delayed input signals as output. For example, the first filter block 240 can apply a first filter weight and delay to the first received discrete-time digital input signal x1(k), the second filter block 242 can apply a second filter weight and delay to the second received discrete-time digital input signal x2(k), and the Nth filter block 244 can apply an Nth filter weight and delay to the Nth received discrete-time digital input signal xN(k), Fig. 2: weight and delay blocks 240, 242, 244); and
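a fusion subsystem configured to apply the plurality of weights to the enhanced target streams to generate a combined enhanced output signal (Sundaram col. 8 lines 3-15 discloses summation module 250 can determine a beamformed signal y(k) based at least in part on the weighted, delayed input signals y1(k), y2(k), and yN(k), Figs. 1A-2 and 5-7).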
Wung and Sundaram are analogous art as they pertain to multichannel speech signal enhancement. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the speech signal enhancement system (as taught by Wung) to generate weighted input signals for further processing (as taught by Sundaram, col. 7 lines 53-65) in order to overcome the challenges that occur in the process of beam selection (Sundaram, col. 2 lines 44-63).

Regarding Claim 2, Wung in view of Sundaram discloses the system of claim 1, further comprising
an audio sensor array (Wung Fig. 1: microphone array 2) configured to sense human speech (Wung Fig. 1: hello hal, can you hear me) and environmental noise (Wung Fig. 1: noise sources such as TV, Vacuum cleaner, appliances, multiple talkers in the surrounding; ¶0029) and generate a corresponding the multichannel audio input signal (Wung ¶0030 discloses the signal processing chain begins with an acoustic echo canceler 4 that receives a number M>=2 microphone signals, from M microphones 2).

Regarding Claim 3, Wung in view of Sundaram discloses the system of claim 1, wherein the target speech enhancement engine comprises
a plurality of speech enhancement modules (Wung Fig. 1: echo cancellation 4, de-reverberation 5, noise reduction 7, beam-forming 8, residual echo suppression 10, ,
each speech enhancement module configured to analyze the multichannel audio input signal and output one of the enhanced target streams (Wung ¶0035 discloses the source signals produced by the BSS 15 and the single pickup beam produced at the output of the residual echo suppressor 10 are provided to a speech stream selector 11, where the latter can analyze these input signals [e.g., based on their individual signal to noise ratio, SNR] and select one of them as containing an ASR voice trigger phrase, or as being the one most suitable for input to the ASR 12).

Regarding Claim 4, Wung in view of Sundaram discloses the system of claim 3, wherein the plurality of speech enhancement modules comprise
an adaptive spatial filtering algorithm, a beamforming algorithm (Wung ¶0031 discloses de-reverb processor 5 facilitates beamforming. ¶0032 discloses the resulting M noise reduced signals are then provided to a beamforming processor 8 that produces a single pickup beam signal from the M noise reduced signals; Fig. 1),
a blind source separation algorithm (Wung ¶0031 discloses de-reverb processor 5 facilitates blind source separation. ¶0034 discloses the system can also include a blind source separation processor [BSS 15] that produces a number of source signals [M or fewer] from the M de-reverberated signals, separating the mixed signals in the multi-channel pickup into distinct source signals; Fig. 1),
a single channel enhancement algorithm (Wung ¶0033 discloses a previously trained deep neural network can be used to further enhance the audio stream at the output of the beamforming processor by suppressing the residual echo. ¶0035 discloses , and/or
a neural network (Wung ¶0033 discloses a previously trained deep neural network can be used to further enhance the audio stream at the output of the beamforming processor by suppressing the residual echo).

Regarding Claim 5, Wung in view of Sundaram discloses the system of claim 1,
wherein the target-speech detector engines comprise Gaussian Mixture Models, Hidden Markov Models, and/or a neural network (Wung ¶0033 discloses a previously trained deep neural network can be used to further enhance the audio stream at the output of the beamforming processor by suppressing the residual echo).

Regarding Claim 6, Wung in view of Sundaram discloses the system of claim 1,
wherein each target speech detector engine is configured to produce a posterior weight correlated to a confidence that an input audio stream includes the specific target speech (Wung ¶0036-¶0045 discloses the output signal vector [from de-reverb processor 5] can be computed as a product of a conversion factor r[n] and a priori error zeta[n] that is computed based on a difference between a new instance of the input signal vector y[n] and the concatenation xL[n] weighted by an old instance of a multi-channel linear prediction, MCLP, filter coefficient matrix G[n]. ¶0035 discloses the selector 11 can assign a score to each of the audio streams at its input, by for example a deep neural network that has been previously trained to detect a trigger phrase, e.g., "Hey Hal").

Regarding Claim 7, Wung in view of Sundaram discloses the system of claim 6,
wherein each target-speech detector engine is configured to produce a higher posterior with clean speech (Wung ¶0035 discloses each score quantifies the likelihood of the presence of the trigger phrase in its respective stream, the stream with the highest score is selected and passed to the ASR engine).

Regarding Claim 8, Wung in view of Sundaram discloses the system of claim 1, but may not explicitly disclose wherein the enhanced output signal is a weighted sum of the enhanced target streams.
However, Sundaram (Figs. 1A-7) teaches wherein the enhanced output signal is a weighted sum of the enhanced target streams (Sundaram col. 7 lines 53-65 discloses a beamformer 114 includes filter blocks 240, 242, and 244 and summation module 250. Generally, the filter blocks 240, 242, and 244 receive input signals from the sensor array 220, apply filters [such as weights, delays, or both] to the received input signals, and generate weighted, delayed input signals as output. For example, the first filter block 240 can apply a first filter weight and delay to the first received discrete-time digital input signal x1(k), the second filter block 242 can apply a second filter weight and delay to the second received discrete-time digital input signal x2(k), and the Nth filter block 244 can apply an Nth filter weight and delay to the Nth received discrete-time digital input signal xN(k). col. 8 lines 3-15 discloses summation module 250 can determine a beamformed signal y(k) based at least in part on the weighted, delayed input signals y1(k), y2(k), and yN(k), Figs. 1A-2 and 5-7).
Wung and Sundaram are analogous art as they pertain to multichannel speech signal enhancement. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the speech signal enhancement system (as taught by Wung) to generate weighted input signals for further processing (as taught by Sundaram, col. 7 lines 53-65) in order to overcome the challenges that occur in the process of beam selection (Sundaram, col. 2 lines 44-63).
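
For illustration only, the weighted-sum combination recited in Claim 8 may be sketched as follows. This is a minimal sketch and is not code from Wung, Sundaram, or the application. The function name `fuse_streams`, the normalization of the weights so that they sum to one, and the sample values are all assumptions made for the example; the delays applied by Sundaram's filter blocks are omitted.

```python
def fuse_streams(streams, weights):
    """Return the sample-wise weighted sum of equal-length streams.

    `streams` is a list of sample lists (one per enhanced target stream)
    and `weights` holds one non-negative weight per stream. Normalizing
    the weights is an assumption of this sketch, not a claim requirement.
    """
    if not streams or len(streams) != len(weights):
        raise ValueError("need exactly one weight per stream")
    length = len(streams[0])
    if any(len(s) != length for s in streams):
        raise ValueError("all streams must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    normalized = [w / total for w in weights]
    # Combined output sample k = sum over streams of weight_i * stream_i[k].
    return [
        sum(w * s[k] for w, s in zip(normalized, streams))
        for k in range(length)
    ]


if __name__ == "__main__":
    # Two hypothetical two-sample streams with equal weights.
    print(fuse_streams([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0]))
```

With equal weights the sketch reduces to a plain average of the streams; a weight of zero removes a stream from the combined output entirely.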

Regarding Claim 9, Wung in view of Sundaram discloses the system of claim 1,
wherein the multi-stream target-speech detection generator is further configured to determine a combined probability of detecting a specific target speech in the streams (Wung ¶0007; Fig. 10: block 44), and
wherein the target-speech is detected if the combined probability exceeds a detection threshold (Wung ¶0035 discloses the selector 11 can assign a score to each of the audio streams at its input, by for example a deep neural network that has been .

Regarding Claim 10, Wung in view of Sundaram discloses the system of claim 9, further comprising
an automatic speech recognition engine (Wung ¶0069 discloses the speech based acoustic model for speech recognition [while also improving speech recognition performance especially in a far field condition where the talker is in a far field of the microphones], Fig. 1) or
a VoIP application (Wung ¶0035 discloses VoIP [a voice over Internet protocol telephony network], Fig. 1: block 13), and
wherein the enhanced output signal is forwarded to the automatic speech recognition engine (Wung ¶0029 discloses the components of the audio system depicted in Fig. 1 have been carefully selected and implemented as digital signal processing components  or
VoIP (Wung ¶0035 discloses the selected stream is then prepared [e.g., encoded, packetized], by the communication block 13, for uplink into a communications network, e.g., a voice over Internet protocol telephony network) if the target-speech is detected (Wung ¶0035 discloses the speech stream selector 11 makes its decision based on criteria that are more suitable for providing an uplink voice communications signal [to an uplink voice communication block 13], e.g., looking for the stream that has the greatest speech intelligibility metric, Figs. 1, 10).
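
For illustration only, the threshold test of Claim 9 and the conditional forwarding of Claim 10 may be sketched as follows. This is a minimal sketch under stated assumptions and is not code from Wung, Sundaram, or the application. Neither the claims as quoted here nor the cited passages specify how the per-stream confidences are combined; the weighted average used below, the function name `detect_target_speech`, and the default threshold of 0.5 are all hypothetical.

```python
def detect_target_speech(posteriors, weights, threshold=0.5):
    """Combine per-stream posteriors and compare against a threshold.

    Returns (combined, detected). The weighted average is an assumed
    combination rule chosen only to make the example concrete.
    """
    if not posteriors or len(posteriors) != len(weights):
        raise ValueError("need exactly one weight per posterior")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    combined = sum(w * p for w, p in zip(weights, posteriors)) / total
    # The output would be forwarded downstream only when detected is True.
    return combined, combined > threshold


if __name__ == "__main__":
    # Hypothetical posteriors from two detector engines, equal weights.
    print(detect_target_speech([0.9, 0.7], [1.0, 1.0]))
```

In this sketch a caller would pass the combined enhanced output signal to the speech recognition engine or VoIP application only when the second returned value is true.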

Claims 11-20 are rejected for the same reasons as set forth in Claims 1-10.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to YOGESHKUMAR G PATEL whose telephone number is (571)272-3957.  The examiner can normally be reached on 7:30 AM-4 PM PST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




/YOGESHKUMAR PATEL/Primary Examiner, Art Unit 2651