Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
1.	This action is responsive to Application No. 17/209,621, filed 3/23/2021.  All claims are currently pending and have been examined.
Information Disclosure Statement
2.	The information disclosure statement (IDS) submitted is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.
Claim Interpretation 35 U.S.C. 112(f)
3.	Claim limitations of claims 29-30 recite “means for” language and have been interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, because they use a generic placeholder coupled with functional language without reciting sufficient structure to achieve the function.  Furthermore, the generic placeholder is not preceded by a structural modifier.
Since the claim limitation(s) invokes 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, claim(s) 29-30 have been interpreted to cover the corresponding structure described in the specification that achieves the claimed function, and equivalents thereof.  
A review of the specification shows that the following appears to be the corresponding structure described in the specification for the 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph limitation: figure 1.  
If applicant wishes to provide further explanation or dispute the examiner’s interpretation of the corresponding structure, applicant must identify the corresponding structure with reference to the specification by page and line number, and to the drawing, if any, by reference characters in response to this Office action. 
If applicant does not intend to have the claim limitation(s) treated under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may amend the claim(s) so that it/they will clearly not invoke 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, or present a sufficient showing that the claim recites/recite sufficient structure, material, or acts for performing the claimed function to preclude application of 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph.
For more information, see MPEP § 2173 et seq. and Supplementary Examination Guidelines for Determining Compliance With 35 U.S.C. 112 and for Treatment of Related Issues in Patent Applications, 76 FR 7162, 7167 (Feb. 9, 2011).

Claim Rejections - 35 USC § 102
4.	In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  

5.	The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


6.	Claims 1, 12, 14-17, 20-22, 24-26, 29-30 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Kupryjanow et al (2020/0184987).

Regarding claim 1 Kupryjanow et al (2020/0184987) teaches A device to perform speech enhancement (abstract: apparatus for reducing noise; [0001]: noise reduced to improve speech), the device comprising: 
one or more processors (fig 5, 6) configured to: 
obtain input spectral data based on an input signal, the input signal representing sound that includes speech (17: audio input; speech; 26: noisy speech X; feature vector, spectral coefficients; 39); and 
process, using a multi-encoder transformer, the input spectral data and context data to generate output spectral data that represents a speech enhanced version of the input signal (18: acoustic event detector can detect specific disturbing noises; recognize the type of the disturbance and thus provide a context awareness for the system to be used by the noise reduction model; 20: each of the classifiers may be neural networks; 23: acoustic event detector can detect type of acoustic environment; 25: noise suppressor 104 reduces disturbing components in the audio input … speech enhancement; noise suppressor can use auto-encoders to perform neural network based speech enhancement; 28: audio output when a disturbance is detected by acoustic event detector is the enhanced signal in which the disturbing sound is suppressed; 34: first, second type of disturbance; abstract; 63: context aware noise reducer; noise suppressor may be a neural network).

Regarding claim 12 Kupryjanow teaches The device of claim 1, further comprising a speaker recognition engine configured to generate speaker extraction data based on the input signal, and wherein the context data includes the speaker extraction data (30: speaker identification or speaker diarization).  

Regarding claim 14 Kupryjanow teaches The device of claim 1, further comprising a noise analysis engine configured to generate noise type data based on the input signal, and wherein the context data includes the noise type data (18 noises).  

Regarding claim 15 Kupryjanow teaches The device of claim 1, further comprising: 
a microphone coupled to the one or more processors and configured to generate the input signal (17: microphone); and 
a spectral analyzer configured to generate the input spectral data (39 audio input, input spectrum).  

Regarding claim 16 Kupryjanow teaches The device of claim 1, further comprising a waveform generator configured to process the output spectral data to generate an output waveform corresponding to an enhanced version of the speech (28 output, enhanced signal).  


Regarding claim 17 Kupryjanow teaches A method of speech enhancement, the method comprising: 
obtaining input spectral data based on an input signal, the input signal representing sound that includes speech; and 
processing, using a multi-encoder transformer, the input spectral data and context data to generate output spectral data that represents a speech enhanced version of the input signal.  
Claim 17 recites limitations similar to claim 1 and is rejected for similar rationale and reasoning.

Claim 20 recites limitations similar to claim 15 and is rejected for similar rationale and reasoning  

Regarding claim 21 Kupryjanow teaches the method of claim 17, further comprising generating text based on the input signal, wherein the context data includes the text (64: automatic speech recognizer).  

Claim 22 recites limitations similar to claim 12 and is rejected for similar rationale and reasoning  

Claim 24 recites limitations similar to claim 14 and is rejected for similar rationale and reasoning  
Claim 25 recites limitations similar to claim 16 and is rejected for similar rationale and reasoning  
 

Regarding claim 26 Kupryjanow teaches A non-transitory computer-readable medium storing instructions that, when executed by one or more processors, cause the one or more processors to: QUALCOMM Ref. No. 209039 - 45 – 
obtain input spectral data based on an input signal, the input signal representing sound that includes speech; and 
process, using a multi-encoder transformer, the input spectral data and context data to generate output spectral data that represents a speech enhanced version of the input signal.  
Claim 26 recites limitations similar to claim 1 and is rejected for similar rationale and reasoning.


Regarding claim 29 Kupryjanow teaches An apparatus comprising: 
means for obtaining input spectral data based on an input signal, the input signal representing sound that includes speech; and 
means for processing, using a multi-encoder transformer, the input spectral data and context data to generate output spectral data that represents a speech enhanced version of the input signal.  
Claim 29 recites limitations similar to claim 1 and is rejected for similar rationale and reasoning.

Regarding claim 30 Kupryjanow teaches The apparatus of claim 29, wherein the means for obtaining and the means for processing are integrated into at least one of a virtual assistant, a home appliance, a smart device, an internet of things (IoT) device, a communication device, a headset, a vehicle, a computer, a display device, a television, a gaming console, a music player, a radio, a video player, an entertainment unit, a personal media player, a digital video player, a camera, or a navigation device (11; 30: system may be used in various applications; 64).

Claim Rejections - 35 USC § 103
7.	The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

8.	Claims 2-4, 8-9, 18-19, 27-28 are rejected under 35 U.S.C. 103 as being unpatentable over Kupryjanow et al (2020/0184987) in view of Celikyilmaz et al (2019/0287012).


	Regarding claim 2 Kupryjanow teaches The device of claim 1, wherein the multi-encoder transformer includes: 
a multi-encoder that includes: 
a first encoder, a second encoder, and a decoder (25 auto encoders, encoder-decoder; neural networks 20; 63)
but does not specifically teach the attention networks, which Celikyilmaz teaches:
a first encoder that includes a first attention network (29: encoder agent, attention network); 
at least a second encoder that includes a second attention network (29); and 
a decoder that includes a decoder attention network (abstract: encoder-decoder neural network … may employ multiple encoder agents to encode multiple respective input sequences; outputs of the encoder agents may be fed into the decoder, which may use an attention mechanism;
25: distributes the task of encoding the input across multiple collaborating encoder agents (herein also simply “agents”), each in charge of a different portion of the input;  Once the agents complete encoding, they deliver their information to a decoder with contextual agent attention. Contextual agent attention enables the decoder to integrate information from multiple agents smoothly at each decoding step;
26: a plurality of multi-layer encoder agents 104, 105, 106, each taking a portion of the input as an input sequence and generating a corresponding encoded sequence as its encoder output.).  
It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate the attention networks of Celikyilmaz into the encoders and decoder of Kupryjanow to improve the encoders, decoder, and neural networks, thereby providing improved and more efficient noise suppression and speech enhancement.
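For illustration only, the arrangement discussed for claim 2 (two encoders, each with an attention network, feeding a decoder that attends over both encoder outputs) can be sketched as follows. The sketch is an assumption-laden aid to reading the mapping, not an implementation taken from Kupryjanow, Celikyilmaz, or the application: the function names `attend`, `encode`, and `decode_step` are hypothetical, the encoders are reduced to a single self-attention pass, and averaging is an assumed way of combining the two attention results.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, memory):
    """Scaled dot-product attention of one query vector over a list of vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * m for q, m in zip(query, vec)) / scale for vec in memory]
    weights = softmax(scores)
    dim = len(memory[0])
    return [sum(w * vec[d] for w, vec in zip(weights, memory)) for d in range(dim)]


def encode(sequence):
    """Hypothetical encoder: each frame attends over its own sequence."""
    return [attend(frame, sequence) for frame in sequence]


def decode_step(query, first_encoded, second_encoded):
    """Hypothetical decoder step: attend over each encoder output, then average."""
    from_first = attend(query, first_encoded)
    from_second = attend(query, second_encoded)
    return [(a + b) / 2.0 for a, b in zip(from_first, from_second)]


# Toy inputs (assumed values): spectral frames and context frames of equal width.
spectral_frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context_frames = [[0.5, 0.5], [0.2, 0.8]]

first_encoded = encode(spectral_frames)
second_encoded = encode(context_frames)
output = decode_step([1.0, 0.0], first_encoded, second_encoded)
```

Because every attention result is a weighted average of its inputs, each component of `output` stays within the range of the toy input values.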

Regarding claim 3 Kupryjanow teaches The device of claim 2, wherein the one or more processors are configured to: 
provide the input spectral data to the first encoder to generate first encoded data (25: auto-encoders to perform neural network based noise suppression; The network may be fed with audio signal chunks and the encoder- decoder layers transform a signal to a higher dimension; 26: noisy speech x; 39); 
obtain the context data based on one or more data sources (18: acoustic event detector 106 can detect specific disturbing noises, referred to herein as disturbances, present in the background of the audio input; 19; 23; 63 context aware noise reducer); 
provide the context data to at least the [second] encoder to generate [second encoded] data (25-27;
27: noise suppressor 104 may be a neural net trained to infer a disturbance time-frequency mask. For example, the value 1 in the disturbance time-frequency mask may indicate a component of the disturbing sound. Training a neural network to infer disturbance time-frequency masks is different than just inverting the speech TFM because the disturbances are often foreground sounds which overlap with speech but are not identical to the acoustic background. In the example of a disturbance of a baby crying, the disturbance TFM may indicate which time-frequency components belong to the baby cry sound. The noise suppressor 104 may then selectively attenuate these components.); and 
provide, to the decoder [attention network], the first [encoded] data and the second [encoded] data to generate output spectral data that corresponds to a speech enhanced version of the input spectral data (25 decoder; 28 audio output when a disturbance is detected is the enhanced signal)
where Celikyilmaz teaches the attention networks (see the rejection of claim 2, whose rationale and reasoning apply here as well).
Kupryjanow teaches multiple encoders, use of the encoders with the input and context data, and a decoder for output. Kupryjanow appears to suggest, but does not specifically teach, the following limitations, which Celikyilmaz teaches:
provide the context data to at least the second encoder to generate second encoded data; and 
provide, to the decoder attention network, the first encoded data and the second encoded data to generate output
(25: distributes the task of encoding the input across multiple collaborating encoder agents (herein also simply “agents”), each in charge of a different portion of the input;  Once the agents complete encoding, they deliver their information to a decoder with contextual agent attention. Contextual agent attention enables the decoder to integrate information from multiple agents smoothly at each decoding step;
26: a plurality of multi-layer encoder agents 104, 105, 106, each taking a portion of the input as an input sequence and generating a corresponding encoded sequence as its encoder output.)
It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate the specific steps of the additional encoders and decoders of Celikyilmaz into Kupryjanow for improved noise suppression and overall speech enhancement.
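The disturbance time-frequency mask quoted from Kupryjanow paragraph 27 in the mapping of claim 3 (a value of 1 marks a component of the disturbing sound, and the noise suppressor selectively attenuates those components) can be illustrated with a minimal sketch. The function name `attenuate_disturbance`, the attenuation factor `gain=0.1`, and the toy magnitudes are assumptions made for this illustration; the reference does not specify them.

```python
def attenuate_disturbance(spectrogram, disturbance_mask, gain=0.1):
    """Scale magnitudes flagged by the mask (value 1) by `gain`; keep the rest.

    `spectrogram` and `disturbance_mask` are lists of rows with matching shapes,
    each row holding the frequency bins of one time frame.
    """
    return [
        [mag * gain if flag == 1 else mag for mag, flag in zip(row, mask_row)]
        for row, mask_row in zip(spectrogram, disturbance_mask)
    ]


# Toy magnitude spectrogram (2 time frames x 3 frequency bins, assumed values).
noisy = [[0.9, 0.2, 0.4], [0.1, 0.8, 0.3]]
# Hypothetical inferred mask: 1 marks bins attributed to the disturbance.
mask = [[1, 0, 0], [0, 1, 0]]

enhanced = attenuate_disturbance(noisy, mask)
```

Only the two flagged bins are reduced; the unflagged bins pass through unchanged.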


Regarding claim 4 Kupryjanow teaches The device of claim 3, wherein the one or more data sources includes at least one of the input signal or image data (17; 26; 39).  

Regarding claim 8 Kupryjanow teaches The device of claim 2, the first encoder including a Mel filter bank configured to filter the input spectral data (20: MFCCs).  

Regarding claim 9 Kupryjanow teaches The device of claim 2, further comprising an automatic speech recognition engine configured to generate text based on the input signal, wherein the context data includes the text (64: automatic speech recognizer).  

Claim 18 recites limitations similar to claim 3 and is rejected for similar rationale and reasoning  
Claim 19 recites limitations similar to claim 3/4 and is rejected for similar rationale and reasoning  

Claim 27 recites limitations similar to claim 3 and is rejected for similar rationale and reasoning  
Claim 28 recites limitations similar to claim 3/4 and is rejected for similar rationale and reasoning  


9.	Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Kupryjanow et al (2020/0184987) in view of Celikyilmaz et al (2019/0287012) in further view of Jackson (2019/0114489). 

Regarding claim 5, Kupryjanow does not specifically teach, but Jackson teaches, the device of claim 4, further comprising a camera configured to generate the image data (abstract; 4-5 images from cameras).  
It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate the camera-based image identification of Jackson for an improved system that allows better classification of disturbances (noise sounds), thereby improving noise suppression and speech enhancement.


10.	Claims 6, 11 are rejected under 35 U.S.C. 103 as being unpatentable over Kupryjanow et al (2020/0184987) in view of Celikyilmaz et al (2019/0287012) in further view of McCann (2019/0251168).

Regarding claim 6, Kupryjanow does not specifically teach, but Celikyilmaz teaches, the device of claim 3, wherein the decoder attention network comprises: 
a first [multi-head] attention network configured to process the first encoded data (abstract; 25-26; 29); 
a second [multi-head] attention network configured to process the second encoded data (abstract; 25-26; 29); and 
a combiner configured to combine outputs of the first [multi-head] attention network and the second [multi-head] attention network (abstract; 25: Once the agents complete encoding, they deliver their information to a decoder with contextual agent attention. Contextual agent attention enables the decoder to integrate information from multiple agents smoothly at each decoding step; 26; 29).  
Claim 6 is rejected for rationale and reasoning similar to those of claims 2 and 3: it would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate the specific steps of the additional encoders and decoders of Celikyilmaz into Kupryjanow for improved noise suppression and overall speech enhancement.

Celikyilmaz does not specifically teach, but McCann teaches, a multi-head attention network (claim 6: decoder multi-head attention network).
It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate multi-head attention networks to improve the efficiency of the attention network and of the encoder/decoder/transformer.


Regarding claim 11, Kupryjanow does not specifically teach, but Celikyilmaz teaches, the device of claim 2, wherein: 
the first encoder comprises: 
a first layer including the first attention network, wherein the first attention network corresponds to a first [multi-head] attention network (29); and 
a second layer including a first feed forward network (29), and 
the second encoder comprises: 
a first layer including the second attention network, wherein the second attention network corresponds to a second [multi-head] attention network (29); and 
a second layer including a second feed forward network (29);
The rationale and reasoning are similar to those of claims 2 and 3.
Celikyilmaz does not specifically teach, but McCann teaches, a multi-head attention network.
The rationale and reasoning are similar to those of claim 6: it would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate multi-head attention networks to improve the efficiency of the attention network and of the encoder/decoder/transformer.



11.	Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over Kupryjanow et al (2020/0184987) in view of Celikyilmaz et al (2019/0287012) in further view of McCann (2019/0251168) in further view of Nicolson (Nicolson, Aaron, and Kuldip K. Paliwal. "Masked multi-head self-attention for causal speech enhancement." Speech Communication 125 (2020): 80-96.).

Regarding claim 7, Kupryjanow does not specifically teach, but McCann teaches, the device of claim 2, wherein the decoder further comprises: 
a [masked] multi-head attention network coupled to an input of the decoder attention network (38 decoder, multi-head attention network); and 
a decoder feed forward network coupled to an output of the decoder attention network (38 feed forward network).
It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate multi-head attention networks to improve the efficiency of the attention network and of the encoder/decoder/transformer.
McCann does not specifically teach, but Nicolson teaches, a masked multi-head attention network (title; abstract).
It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate masked multi-head attention networks to improve the efficiency of the attention network and of the encoder/decoder/transformer and to improve real-time processing.


12.	Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Kupryjanow et al (2020/0184987) in view of Celikyilmaz et al (2019/0287012) in further view of Arik et al (2018/0247636).

Regarding claim 10, Kupryjanow does not specifically teach, but Arik teaches, the device of claim 9, wherein the second encoder includes a grapheme-to-phoneme convertor configured to process the text (45 grapheme-to-phoneme model based on encoder-decoder).  
It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate a grapheme-to-phoneme convertor for improved speech and text analysis.



13.	Claims 13, 23 are rejected under 35 U.S.C. 103 as being unpatentable over Kupryjanow et al (2020/0184987) in view of Park et al (2020/0074988).

Regarding claim 13, Kupryjanow does not specifically teach, but Park teaches, the device of claim 1, further comprising an emotion recognition engine configured to generate emotion data based on the input signal, and wherein the context data includes the emotion data (543 emotion classification operation on speech data).  
It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate Park for an improved system to further identify, classify, and enhance speech.

Claim 23 recites limitations similar to claim 13 and is rejected for similar rationale and reasoning  


Conclusion
14.	The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: See PTO-892.

Sung et al (2020/0075038)
[0215] The determiner 1855 may determine an optimal parameter for speech quality enhancement based on a result of the objective speech quality measurement and a result of the subjective speech quality measurement. According to another embodiment of the present disclosure, context information about a call may be used additionally to determine an optimal parameter. The determiner may be implemented by using a machine learning method.

Hijazi et al (2019/0392852)
Ramprashad (2018/0366138)
Wolff et al (10,157,611)
Kristjansson (2017/0092268)

Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHAUN A ROBERTS whose telephone number is (571)270-7541.  The examiner can normally be reached Monday-Friday 9-5 EST.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Flanders, can be reached at 571-272-7516.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/SHAUN ROBERTS/
Primary Examiner, Art Unit 2655