DETAILED ACTION
Introduction
This office action is in response to Applicant’s submission filed on 03/31/2020.  Claims 1-14 are pending in the application. As such, Claims 1-14 have been examined. 

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority
Receipt is acknowledged of certified copies of papers required by 37 CFR 1.55.

Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.  The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. 
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitations use a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. Such claim limitations are: “a signal acquisition module (101), configured for …”, “a signal framing module (102), configured for …”, “a signal enhancing module (103), configured for …”, and “a signal output module (104), configured for …” in claim 9; and “signal framing module (102) … configured for …” in claim 10.
Because these claim limitations are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, they are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have these limitations interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitations to avoid them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitations recite sufficient structure to perform the claimed function so as to avoid them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1-4 and 9-12 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by S. R. Park and J. Lee, ‘A Fully Convolutional Neural Network for Speech Enhancement’, arXiv, 2016 (Year: 2016) (hereinafter “Park”).

Regarding Claim 1, Park teaches a voice signal enhancing method, comprising: acquiring a voice signal in the present scene (Park Section 1 Paragraphs 1-3 - "Denoising speech signals has been a long standing problem. Decades of works showed feasible solutions which estimated the noise model and used it to recover noise-deducted speech [1, 2, 3, 4, 5]. Nonetheless, estimating the model for a babble noise, which is encountered when a crowd of people are talking, is still a challenging task. The presence of babble noise, however, degrades hearing intelligibility of human speech greatly. When babble noise dominates over speech, aforementioned methods often times will fail to find the correct noise model [6]. If so, the noise-reduction will render distortion in speech, which creates discomforts to the users of hearing aids [7]. Here, instead of explicitly modeling the babble noise, we focus on learning a ‘mapping’ between noisy speech spectra and clean speech spectra, inspired by recent works on speech enhancement using neural networks [8, 9, 10, 11]. However, the model size of Neural Networks easily exceeds several hundreds of megabytes, limiting its applicability for an embedded system." These paragraphs show that Park operates on "speech signals" in order to eliminate "babble noise". This is synonymous with applying a signal enhancing method to a present voice signal.);
 dividing the voice signal into frames according to a preset time interval to generate multiple frame signals (Park Section 2 Paragraph 3 and Section 4.1 Paragraph 2 - In section 2, it is seen that the voice signals can be divided into preset time intervals of "100ms speech segments" and that the "feature transformation" (generating multiple frame signals) is done with "a 256-point Short Time Fourier Transform (32ms hamming window) with a window shift of 64-point (8ms)".);
feeding the multiple frame signals into a trained neural network according to a preset step size, and performing convolution operations on the multiple frame signals through skip-connected convolutional layers to obtain multiple enhanced frame signals (Park Section 3.3 and Fig. 3 – Fig. 3 shows that multiple frame signals are fed into the trained neural network according to a preset step size, as seen by the 8 STFT vectors. Section 3.3 states that those 8 STFT vectors pass through convolution operations with "skip connections".);
and superimposing each enhanced frame signal according to the time domain of each enhanced frame signal to obtain the enhanced voice signal (Park Section 4.1 Paragraphs 2-3 - The "8 consecutive noisy STFT magnitude vectors" are considered in the time domain and are "standardized to have zero mean and unit variance". Later, at reconstruction, "inverse STFT [was used] to recover human speech". This implies the claimed superimposition, and the vectors being superimposed are still considered in the time domain in view of the vector size.).
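
For illustration only, the framing and superposition steps discussed for Claim 1 can be sketched as follows. This is a minimal sketch and is not code from Park or from the application: the function names split_frames and overlap_add are hypothetical, the 256-sample frame and 64-sample shift follow the STFT parameters quoted from Park, and the 8 kHz sampling rate follows Park Section 4.1. No analysis window or gain normalization is applied, so the sketch shows only how frames are cut at a fixed step and then summed back at their time offsets.

```python
SAMPLE_RATE_HZ = 8000  # assumption taken from the Park passage on down-sampling


def split_frames(signal, frame_len=256, hop=64):
    """Cut a signal into overlapping frames; a trailing partial frame is dropped."""
    last_start = len(signal) - frame_len
    return [signal[s:s + frame_len] for s in range(0, last_start + 1, hop)]


def overlap_add(frames, hop=64):
    """Sum frames back together at the time offsets they were cut from."""
    if not frames:
        return []
    frame_len = len(frames[0])
    out = [0.0] * (hop * (len(frames) - 1) + frame_len)
    for k, frame in enumerate(frames):
        for n, value in enumerate(frame):
            out[k * hop + n] += value
    return out


# Frame and shift durations implied by the quoted parameters.
print(1000 * 256 / SAMPLE_RATE_HZ, "ms frame")   # 32.0 ms frame
print(1000 * 64 / SAMPLE_RATE_HZ, "ms shift")    # 8.0 ms shift

frames = split_frames([1.0] * 704)
print(len(frames))  # 8 frames fit in 704 samples
rebuilt = overlap_add(frames)
```

Because each interior sample is covered by four overlapping frames in this configuration, a practical system would also apply a window and normalize the sum; that step is omitted here.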

Regarding Claim 2, Park teaches all of the limitations of Claim 1. Park also teaches that said dividing the voice signal into frames according to a preset time interval to generate multiple frame signals specifically comprises: dividing the voice signal into frames according to a preset time interval (Park Section 2 Paragraph 3 and Section 4.1 Paragraph 2 - In section 2, it is seen that the voice signals can be divided into preset time intervals of "100ms speech segments" and that the "feature transformation" (generating multiple frame signals) is done with "a 256-point Short Time Fourier Transform (32ms hamming window) with a window shift of 64-point (8ms)". There is a multitude of voice signals.),
applying a Hanning window on the framed voice signals (Park Section 2 Paragraph 3 and Section 4.1 Paragraph 2 - In section 2, it is seen that the voice signals can be divided into preset time intervals of "100ms speech segments" and that the "feature transformation" (generating multiple frame signals) is done with "a 256-point Short Time Fourier Transform (32ms hamming window) with a window shift of 64-point (8ms)".),
and then implementing a DFT on them in order to generate multiple frame signals (Park Section 2 Paragraph 3 and Section 4.1 Paragraph 2 - In section 2, it is seen that the voice signals can be divided into preset time intervals of "100ms speech segments" and that the "feature transformation" (generating multiple frame signals) is done with "a 256-point Short Time Fourier Transform (32ms hamming window) with a window shift of 64-point (8ms)".).
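
For illustration only, the windowing and DFT steps discussed for Claim 2 can be sketched as follows. This is a minimal sketch under stated assumptions, not code from Park or from the application: the names cosine_window and dft_magnitudes are hypothetical, the symmetric window convention (denominator length - 1) is an assumption, and the coefficient 0.54 gives the Hamming window named in the quoted Park passage while 0.5 gives the Hanning window named in the claim language. The 129 retained bins correspond to the reduction of a 256-point STFT vector described in Park Section 4.1.

```python
import cmath
import math

HAMMING_A0 = 0.54  # window named in the quoted Park passage
HANN_A0 = 0.5      # "Hanning" window named in the claim language


def cosine_window(length, a0):
    """Raised-cosine window: a0 - (1 - a0) * cos(2*pi*n / (length - 1))."""
    return [a0 - (1.0 - a0) * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def dft_magnitudes(frame):
    """Magnitudes of the non-redundant DFT bins (0 .. N/2) of a real frame."""
    n_points = len(frame)
    mags = []
    for k in range(n_points // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * n / n_points)
                  for n, x in enumerate(frame))
        mags.append(abs(acc))
    return mags


# A test tone that completes exactly 16 cycles in one 256-sample frame.
frame = [math.cos(2.0 * math.pi * 16 * n / 256) for n in range(256)]
window = cosine_window(256, HAMMING_A0)
spectrum = dft_magnitudes([w * x for w, x in zip(window, frame)])
print(len(spectrum))                  # 129 bins for a 256-point frame
print(spectrum.index(max(spectrum)))  # 16, the bin of the test tone
```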

Regarding Claim 3, Park teaches all the limitations of Claim 1. Park also teaches that the training method of the neural network is: acquiring multiple noise signals and multiple clear signals without noises (Park Section 4.1 Paragraph 1 - The experiment was conducted on the TIMIT database [19] and 27 different types of noise clips were collected from freely available online resource [20]. The noise are mostly babble, but includes different types of noise like instrumental sounds. Both data in the training set (4620 utterances) and the testing set (200 utterances) were added with one of 27 noise clips at 0dB SNR. After all feature transformation steps were completed, 20% of the training features were assigned as the validation set.);
mixing the multiple noise signals and multiple clear signals one by one according to randomly generated mixing coefficients to obtain multiple noise-bearing signals (Park Section 4.1 Paragraph 2 - The audio signals were down sampled to 8kHz, and the silent frames were removed from the signal. The spectral vectors were computed using a 256-point Short Time Fourier Transform (32ms hamming window) with a window shift of 64-point (8ms). The frequency resolution was 31.25 Hz (=4kHz/128) per each frequency bin. 256-point STFT magnitude vectors were reduced to 129-point by removing the symmetric half. For FNN/RNN, the input feature consisted of a noisy STFT magnitude vector (size: 129×1, duration: 32ms). For CNN, the input feature consisted of 8 consecutive noisy STFT magnitude vectors (size: 129 × 8, duration: 100ms). Both input features were standardized to have zero mean and unit variance.);
wherein a noise signal is mixed with a clear signal to form a noise-bearing signal (Park Section 4.1 Paragraph 2 - The audio signals were down sampled to 8kHz, and the silent frames were removed from the signal. The spectral vectors were computed using a 256-point Short Time Fourier Transform (32ms hamming window) with a window shift of 64-point (8ms). The frequency resolution was 31.25 Hz (=4kHz/128) per each frequency bin. 256-point STFT magnitude vectors were reduced to 129-point by removing the symmetric half. For FNN/RNN, the input feature consisted of a noisy STFT magnitude vector (size: 129×1, duration: 32ms). For CNN, the input feature consisted of 8 consecutive noisy STFT magnitude vectors (size: 129 × 8, duration: 100ms). Both input features were standardized to have zero mean and unit variance.);
and feeding the multiple noise-bearing signals sequentially into the neural network for signal enhancement to generate multiple corresponding denoised signals and adjusting the neural network according to the least square error between the denoised signals and the corresponding clear signals thereof (Park Section 4.2 - Convolution layers were trained from scratch, with the aid of batch normalization layer [17] added after each convolution layer. All networks were trained using back propagation with gradient descent optimization using stochastic optimization (which uses least square error) with a mini-batch size of 64.).
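
For illustration only, the mixing and error computation discussed for Claim 3 can be sketched as follows. This is a minimal sketch under stated assumptions, not code from Park or from the application: the names power, mix_at_snr and mean_squared_error are hypothetical, the synthetic sine and uniform noise merely stand in for speech and noise clips, and the 0 dB target follows the quoted Park passage. In this sketch the noise gain plays the role of a mixing coefficient and is derived from the target SNR rather than drawn at random.

```python
import math
import random


def power(signal):
    """Mean power of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(clean, noise, snr_db):
    """Scale the noise so the clean-to-noise power ratio equals snr_db, then add."""
    gain = math.sqrt(power(clean) / (power(noise) * 10.0 ** (snr_db / 10.0)))
    return [c + gain * n for c, n in zip(clean, noise)]


def mean_squared_error(estimate, target):
    """Average squared difference, the quantity a least-squares training loss minimizes."""
    return sum((e - t) ** 2 for e, t in zip(estimate, target)) / len(target)


rng = random.Random(0)  # fixed seed so the example is repeatable
clean = [math.sin(2.0 * math.pi * 5 * n / 200) for n in range(200)]
noise = [rng.uniform(-1.0, 1.0) for _ in range(200)]
noisy = mix_at_snr(clean, noise, snr_db=0.0)

# At 0 dB the added noise has the same power as the clean signal, so the
# error of the unprocessed noisy signal equals the clean signal power.
baseline_error = mean_squared_error(noisy, clean)
```

A trained network would be expected to produce a denoised output whose error against the clean signal falls below baseline_error.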

Regarding Claim 4, Park teaches all of the limitations of Claim 3. Park also teaches that said feeding the multiple noise-bearing signals sequentially into the neural network for signal enhancement to generate multiple corresponding denoised signals and adjusting the neural network according to the least square error between the denoised signals and the corresponding clear signals thereof specifically comprises: feeding the noise-bearing signal into the neural network (Park Section 4.2 - Convolution layers were trained from scratch, with the aid of batch normalization layer [17] added after each convolution layer. All networks were trained using back propagation with gradient descent optimization using stochastic optimization (which uses least square error) with a mini-batch size of 64.),
adjusting the neural network according to the least square error between a denoised signal generated by the signal enhancement of the noise-bearing signal through the neural network and the corresponding clear signal (Park Section 4.2 - Convolution layers were trained from scratch, with the aid of batch normalization layer [17] added after each convolution layer. All networks were trained using back propagation with gradient descent optimization using stochastic optimization (which uses least square error) with a mini-batch size of 64.),
continuing to adjust the neural network according to the least square error between a denoised signal generated by the signal enhancement of next noise-bearing signal through the neural network and the corresponding clear signal (Park Section 4.2 - Convolution layers were trained from scratch, with the aid of batch normalization layer [17] added after each convolution layer. All networks were trained using back propagation with gradient descent optimization using stochastic optimization (which relies on least square error) with a mini-batch size of 64.),
and until the least square error obtained by using different noise-bearing signals is unchanged, terminating the training of the neural network (Park Section 4.2 - When the validation loss didn’t decrease for more than 4 epochs, learning rate was decreased to lr/2, lr/3, lr/4, subsequently. This is understood to mean that the training eventually stopped when the validation loss (considered here to be the least square error relied upon in the stochastic optimization discussed above) became unchanged. The training was repeated once more for FNN and RNN with l2 regularization (λ = 10^(−5)) which slightly improved the performance.).
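
For illustration only, the plateau behavior discussed for Claim 4 can be sketched as follows. This is a minimal sketch under stated assumptions, not code from Park or from the application: the name plateau_schedule and its parameters are hypothetical, the reduction sequence lr, lr/2, lr/3, lr/4 and the threshold of more than 4 epochs without a decrease follow the quoted Park passage, and ending training once the lr/4 stage also plateaus is an assumption of this sketch that mirrors the reading given above.

```python
def plateau_schedule(val_losses, base_lr=1.0, patience=4, max_divisor=4):
    """Return the learning rate used at each epoch that is actually run.

    The rate drops from base_lr/d to base_lr/(d + 1) when the validation loss
    has not decreased for more than `patience` epochs. Training ends (the list
    stops growing) when that happens at base_lr/max_divisor. The stopping rule
    is an assumption of this sketch.
    """
    best = float("inf")
    stale = 0          # consecutive epochs without a new best validation loss
    divisor = 1
    rates = []
    for loss in val_losses:
        rates.append(base_lr / divisor)
        if loss < best:
            best = loss
            stale = 0
        else:
            stale += 1
        if stale > patience:
            if divisor >= max_divisor:
                break  # loss is unchanged even at the smallest rate
            divisor += 1
            stale = 0
    return rates


# Two improving epochs followed by a flat validation loss.
rates = plateau_schedule([1.0, 0.9] + [0.9] * 6)
print(rates)  # [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.5]
```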

Regarding Claim 9, Park teaches a voice signal enhancement device, comprising: a signal acquisition module (101), configured for acquiring a voice signal at the present scene (Park Section 1 Paragraphs 1-3 - "Denoising speech signals has been a long standing problem. Decades of works showed feasible solutions which estimated the noise model and used it to recover noise-deducted speech [1, 2, 3, 4, 5]. Nonetheless, estimating the model for a babble noise, which is encountered when a crowd of people are talking, is still a challenging task. The presence of babble noise, however, degrades hearing intelligibility of human speech greatly. When babble noise dominates over speech, aforementioned methods often times will fail to find the correct noise model [6]. If so, the noise-reduction will render distortion in speech, which creates discomforts to the users of hearing aids [7]. Here, instead of explicitly modeling the babble noise, we focus on learning a ‘mapping’ between noisy speech spectra and clean speech spectra, inspired by recent works on speech enhancement using neural networks [8, 9, 10, 11]. However, the model size of Neural Networks easily exceeds several hundreds of megabytes, limiting its applicability for an embedded system." These paragraphs show that Park operates on "speech signals" in order to eliminate "babble noise". This is synonymous with applying a signal enhancing method to a present voice signal.);
a signal framing module (102), configured for framing the voice signal according to a preset time interval to generate multiple frame signals (Park Section 2 Paragraph 3 and Section 4.1 Paragraph 2 - In section 2, it is seen that the voice signals can be divided into preset time intervals of "100ms speech segments" and that the "feature transformation" (generating multiple frame signals) is done with "a 256-point Short Time Fourier Transform (32ms hamming window) with a window shift of 64-point (8ms)".);
a signal enhancing module (103), configured for feeding the multiple frame signals into a trained neural network based on a preset step size and implementing convolution operations on the multiple frame signals through skip-connected convolutional layers to obtain multiple enhanced frame signals (Park Section 3.3 and Fig. 3 – Fig. 3 shows that multiple frame signals are fed into the trained neural network according to a preset step size, as seen by the 8 STFT vectors. Section 3.3 states that those 8 STFT vectors pass through convolution operations with "skip connections".);
and a signal output module (104), configured for superimposing each enhanced frame signal according to a time domain of each enhanced frame signal to obtain the enhanced voice signal (Park Section 4.1 Paragraphs 2-3 - The "8 consecutive noisy STFT magnitude vectors" are considered in the time domain and are "standardized to have zero mean and unit variance". Later, at reconstruction, "inverse STFT [was used] to recover human speech". This implies the claimed superimposition, and the vectors being superimposed are still considered in the time domain in view of the vector size.).

Regarding Claim 10, Park teaches all the limitations of Claim 9. Park also teaches that the signal framing module (102) is specifically configured for: dividing the voice signal into frames according to a preset time interval (Park Section 2 Paragraph 3 and Section 4.1 Paragraph 2 - In section 2, it is seen that the voice signals can be divided into preset time intervals of "100ms speech segments" and that the "feature transformation" (generating multiple frame signals) is done with "a 256-point Short Time Fourier Transform (32ms hamming window) with a window shift of 64-point (8ms)". There is a multitude of voice signals.),
applying a Hanning window on the framed voice signals (Park Section 2 Paragraph 3 and Section 4.1 Paragraph 2 - In section 2, it is seen that the voice signals can be divided into preset time intervals of "100ms speech segments" and that the "feature transformation" (generating multiple frame signals) is done with "a 256-point Short Time Fourier Transform (32ms hamming window) with a window shift of 64-point (8ms)".),
and then implementing a DFT on them in order to generate multiple frame signals (Park Section 2 Paragraph 3 and Section 4.1 Paragraph 2 - In section 2, it is seen that the voice signals can be divided into preset time intervals of "100ms speech segments" and that the "feature transformation" (generating multiple frame signals) is done with "a 256-point Short Time Fourier Transform (32ms hamming window) with a window shift of 64-point (8ms)".).

Regarding Claim 11, Park teaches all of the limitations of Claim 9. Park also teaches that the training method of the neural network is: acquiring multiple noise signals and multiple clear signals without noises (Park Section 4.1 Paragraph 1 - The experiment was conducted on the TIMIT database [19] and 27 different types of noise clips were collected from freely available online resource [20]. The noise is mostly babble, but includes different types of noise like instrumental sounds (These could be considered as the clear signals without noise). Both data in the training set (4620 utterances) and the testing set (200 utterances) were added with one of 27 noise clips at 0dB SNR. After all feature transformation steps were completed, 20% of the training features were assigned as the validation set.);
mixing the multiple noise signals and multiple clear signals one by one according to randomly generated mixing coefficients to obtain multiple noise-bearing signals (Park Section 4.1 Paragraph 2 - The audio signals were down sampled to 8kHz, and the silent frames were removed from the signal. The spectral vectors were computed using a 256-point Short Time Fourier Transform (32ms hamming window) with a window shift of 64-point (8ms). The frequency resolution was 31.25 Hz (=4kHz/128) per each frequency bin. 256-point STFT magnitude vectors were reduced to 129-point by removing the symmetric half. For FNN/RNN, the input feature consisted of a noisy STFT magnitude vector (size: 129×1, duration: 32ms). For CNN, the input feature consisted of 8 consecutive noisy STFT magnitude vectors (size: 129 × 8, duration: 100ms). Both input features were standardized to have zero mean and unit variance.);
wherein a noise signal is mixed with a clear signal to form a noise-bearing signal (Park Section 4.1 Paragraph 2 - The audio signals were down sampled to 8kHz, and the silent frames were removed from the signal. The spectral vectors were computed using a 256-point Short Time Fourier Transform (32ms hamming window) with a window shift of 64-point (8ms). The frequency resolution was 31.25 Hz (=4kHz/128) per each frequency bin. 256-point STFT magnitude vectors were reduced to 129-point by removing the symmetric half. For FNN/RNN, the input feature consisted of a noisy STFT magnitude vector (size: 129×1, duration: 32ms). For CNN, the input feature consisted of 8 consecutive noisy STFT magnitude vectors (size: 129 × 8, duration: 100ms). Both input features were standardized to have zero mean and unit variance.);
and feeding the multiple noise-bearing signals sequentially into the neural network for signal enhancement to generate multiple corresponding denoised signals and adjusting the neural network according to the least square error between the denoised signals and the corresponding clear signals thereof (Park Section 4.2 - Convolution layers were trained from scratch, with the aid of batch normalization layer [17] added after each convolution layer. All networks were trained using back propagation with gradient descent optimization using stochastic optimization (least square error) with a mini-batch size of 64.).

Regarding Claim 12, Park teaches all of the limitations of claim 11. Park also teaches that said feeding the multiple noise-bearing signals sequentially into the neural network for signal enhancement to generate multiple corresponding denoised signals and adjusting the neural network according to the least square error between the denoised signals and the corresponding clear signals thereof specifically comprises: feeding the noise-bearing signal into the neural network (Park Section 4.2 - Convolution layers were trained from scratch, with the aid of batch normalization layer [17] added after each convolution layer. All networks were trained using back propagation with gradient descent optimization using stochastic optimization (which relies on least square error) with a mini-batch size of 64.),
adjusting the neural network according to the least square error between a denoised signal generated by the signal enhancement of the noise-bearing signal through the neural network and the corresponding clear signal (Park Section 4.2 - Convolution layers were trained from scratch, with the aid of batch normalization layer [17] added after each convolution layer. All networks were trained using back propagation with gradient descent optimization using stochastic optimization (which relies on least square error) with a mini-batch size of 64.),
continuing to adjust the neural network according to the least square error between a denoised signal generated by the signal enhancement of next noise-bearing signal through the neural network and the corresponding clear signal (Park Section 4.2 - Convolution layers were trained from scratch, with the aid of batch normalization layer [17] added after each convolution layer. All networks were trained using back propagation with gradient descent optimization using stochastic optimization (which relies on least square error) with a mini-batch size of 64.),
and until the least square error obtained by using different noise-bearing signals is unchanged, terminating the training of the neural network (Park Section 4.2 - When the validation loss didn’t decrease for more than 4 epochs, learning rate was decreased to lr/2, lr/3, lr/4, subsequently. This is understood to mean that the training eventually stopped when the validation loss (considered here to be the least square error relied upon in the stochastic optimization discussed above) became unchanged. The training was repeated once more for FNN and RNN with l2 regularization (λ = 10^(−5)) which slightly improved the performance.).

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 5-8, 13, and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Park in view of O. Ernst, S. E. Chazan, S. Gannot, and J. Goldberger, ‘Speech Dereverberation Using Fully Convolutional Networks’, arXiv, 2018 (hereinafter “Ernst”).

Regarding Claim 5, Park teaches all the limitations of claim 1. Ernst further teaches the voice signal enhancing method according to claim 1, wherein the neural network comprises N successive convolutional layers (Ernst Section III.A Paragraph 2 - the skip connections directly concatenate feature maps from layer i in the encoder to layer N − i in the decoder, where N is number of layers.);
every two symmetric convolutional layers with the N/2th convolutional layer as an axis of symmetry are skip connected to each other (Ernst Section III.A Paragraphs 2-3 and Figure 2 - Following [19], our network details are as follows. Let CBL_{l,s} denote a Convolution-BatchNorm-Leaky-ReLU layer with slope=0.2, where l is number of filters and s × s is the filter size. CL_{l,s} and CBR_{l,s} have the same architecture but without BatchNorm, or with a non-leaky ReLU, respectively. With same notation, let DCDR_{l,s} denote the DeConvolution-BatchNorm-Dropout-ReLU with dropout of 50% (N/2), and let DCR_{l,s} denote the DeConvolution-BatchNorm-ReLU. DCT_{l,s} denote DeConvolution-tanh. The U-net architecture which demonstrates the skip connections is illustrated in Fig. 2.);
wherein N is an even number (Ernst Section III.A Paragraph 2 - the skip connections directly concatenate feature maps from layer i in the encoder to layer N − i in the decoder, where N is number of layers. The number of layers in Ernst is not limited to odd or even. Ernst allows for an even number of layers.).
Park and Ernst are both considered to be analogous to the claimed invention because both relate to utilizing convolutional neural networks for voice signal enhancement. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the convolutional architecture of Park in accordance with the teachings of Ernst in order to improve the efficiency of removing noise from a signal (Ernst Section IV.C - It is evident that for the far microphone (where reverberation conditions are harsher), U-Net with asymmetric filters exhibits better LLR and FWSegSNR objective measures than the other methods. FWSegSNR objective measure, regardless of the room type. For the near microphone, the regular U-Net and asymmetric U-net outperformed the other methods in most of the rooms for the CD, LLR and FWSegSNR objective measure, whereas the differences between the regular and asymmetric U-Net were negligible.).
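
For illustration only, the layer pairing quoted from Ernst (layer i connected to layer N − i) can be enumerated as follows. This is a minimal sketch under stated assumptions, not code from Ernst, Park, or the application: the name skip_pairs is hypothetical, and layers are assumed to be numbered from 1 to N. Under that numbering every pair is centered on layer N/2, which has no distinct partner and therefore acts as the axis of symmetry.

```python
def skip_pairs(n_layers):
    """List the (i, N - i) layer pairs joined by skip connections for even N."""
    if n_layers % 2 != 0:
        raise ValueError("this sketch assumes an even number of layers")
    # Layer N/2 would pair with itself, so it is the axis and is left out.
    return [(i, n_layers - i) for i in range(1, n_layers // 2)]


pairs = skip_pairs(8)
print(pairs)  # [(1, 7), (2, 6), (3, 5)]
```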

Regarding Claim 6, Park teaches all the limitations in claim 2. Ernst further teaches that the voice signal enhancing method according to claim 1, wherein the neural network comprises N successive convolutional layers (Ernst Section III.A Paragraph 2 -  the skip connections directly concatenate feature maps from layer i in the encoder to layer N − i in the decoder, where N is number of layers.);
every two symmetric convolutional layers with the N/2th convolutional layer as an axis of symmetry are skip connected to each other (Ernst Section III.A Paragraphs 2-3 and Figure 2 - Following [19], our network details are as follows. Let CBLl,s denote a Convolution-BatchNorm-Leaky-ReLU layer with slope=0.2, where l is number of filters and s × s is the filter size. CLl,s and CBRl,s have the same architecture but without BatchNorm, or with a non-leaky ReLU, respectively. With same notation, let DCDRl,s denote the DeConvolutionBatchNorm-Dropout-ReLU with dropout of 50% (N/2), and let DCRl,s denote the DeConvolution-BatchNorm-ReLU. DCTl,s denote DeConvolution-tanh. The U-net architecture which demonstrates the skip connections is illustrated in Fig. 2.);
wherein N is an even number (Ernst Section III.A Paragraph 2 - the skip connections directly concatenate feature maps from layer i in the encoder to layer N − i in the decoder, where N is number of layers. The number of layers in Ernst is not limited to an odd or an even number; Ernst therefore allows for an even number of layers.).
Park and Ernst are both considered to be analogous to the claimed invention because both relate to utilizing convolutional neural networks for voice signal enhancement. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Park by setting up the convolutional architecture as taught by Ernst in order to improve the efficiency of removing noise from a signal (Ernst Section IV.C - It is evident that for the far microphone (where reverberation conditions are harsher), U-Net with asymmetric filters exhibits better LLR and FWSegSNR objective measures than the other methods. FWSegSNR objective measure, regardless of the room type. For the near microphone, the regular U-Net and asymmetric U-net outperformed the other methods in most of the rooms for the CD, LLR and FWSegSNR objective measure, whereas the differences between the regular and asymmetric U-Net were negligible.).
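
For illustration only, the layer pairing relied upon above (feature maps from layer i concatenated to layer N − i, with N an even number) can be sketched in Python. The function name skip_pairs and the 1-based layer numbering are assumptions of this sketch; neither appears in Ernst or in the claims, and the sketch is not offered as the implementation of either.

```python
def skip_pairs(n_layers):
    """Return (i, N - i) layer pairs that are mirrored about layer N/2.

    Assumptions of this sketch: layers are numbered 1..N and N is even,
    so layer i is paired with layer N - i for i = 1 .. N/2 - 1.
    """
    if n_layers <= 0 or n_layers % 2 != 0:
        raise ValueError("n_layers must be a positive even number")
    half = n_layers // 2  # index of the layer acting as the axis of symmetry
    return [(i, n_layers - i) for i in range(1, half)]


if __name__ == "__main__":
    # With N = 8, layer 4 is the axis; layers 1-3 pair with layers 7-5.
    print(skip_pairs(8))
```

Under this numbering, the two indices of every pair average to N/2, which is what it means for the pair to be symmetric about the N/2th layer; the N/2th layer and the Nth layer have no partner.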

Regarding Claim 7, Park teaches all the limitations of claim 3. Ernst further teaches the voice signal enhancing method according to claim 3, wherein the neural network comprises N successive convolutional layers (Ernst Section III.A Paragraph 2 - the skip connections directly concatenate feature maps from layer i in the encoder to layer N − i in the decoder, where N is number of layers.);
every two symmetric convolutional layers with the N/2th convolutional layer as an axis of symmetry are skip connected to each other (Ernst Section III.A Paragraphs 2-3 and Figure 2 - Following [19], our network details are as follows. Let CBLl,s denote a Convolution-BatchNorm-Leaky-ReLU layer with slope=0.2, where l is number of filters and s × s is the filter size. CLl,s and CBRl,s have the same architecture but without BatchNorm, or with a non-leaky ReLU, respectively. With same notation, let DCDRl,s denote the DeConvolution-BatchNorm-Dropout-ReLU with dropout of 50% (N/2), and let DCRl,s denote the DeConvolution-BatchNorm-ReLU. DCTl,s denote DeConvolution-tanh. The U-net architecture which demonstrates the skip connections is illustrated in Fig. 2.);
wherein N is an even number (Ernst Section III.A Paragraph 2 - the skip connections directly concatenate feature maps from layer i in the encoder to layer N − i in the decoder, where N is number of layers. The number of layers in Ernst is not limited to an odd or an even number; Ernst therefore allows for an even number of layers.).
Park and Ernst are both considered to be analogous to the claimed invention because both relate to utilizing convolutional neural networks for voice signal enhancement. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Park by setting up the convolutional architecture as taught by Ernst in order to improve the efficiency of removing noise from a signal (Ernst Section IV.C - It is evident that for the far microphone (where reverberation conditions are harsher), U-Net with asymmetric filters exhibits better LLR and FWSegSNR objective measures than the other methods. FWSegSNR objective measure, regardless of the room type. For the near microphone, the regular U-Net and asymmetric U-net outperformed the other methods in most of the rooms for the CD, LLR and FWSegSNR objective measure, whereas the differences between the regular and asymmetric U-Net were negligible.).

Regarding Claim 8, Park teaches all the limitations of claim 4. Ernst further teaches the voice signal enhancing method according to claim 4, wherein the neural network comprises N successive convolutional layers (Ernst Section III.A Paragraph 2 - the skip connections directly concatenate feature maps from layer i in the encoder to layer N − i in the decoder, where N is number of layers.);
every two symmetric convolutional layers with the N/2th convolutional layer as an axis of symmetry are skip connected to each other (Ernst Section III.A Paragraphs 2-3 and Figure 2 - Following [19], our network details are as follows. Let CBLl,s denote a Convolution-BatchNorm-Leaky-ReLU layer with slope=0.2, where l is number of filters and s × s is the filter size. CLl,s and CBRl,s have the same architecture but without BatchNorm, or with a non-leaky ReLU, respectively. With same notation, let DCDRl,s denote the DeConvolution-BatchNorm-Dropout-ReLU with dropout of 50% (N/2), and let DCRl,s denote the DeConvolution-BatchNorm-ReLU. DCTl,s denote DeConvolution-tanh. The U-net architecture which demonstrates the skip connections is illustrated in Fig. 2.);
wherein N is an even number (Ernst Section III.A Paragraph 2 - the skip connections directly concatenate feature maps from layer i in the encoder to layer N − i in the decoder, where N is number of layers. The number of layers in Ernst is not limited to an odd or an even number; Ernst therefore allows for an even number of layers.).
Park and Ernst are both considered to be analogous to the claimed invention because both relate to utilizing convolutional neural networks for voice signal enhancement. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Park by setting up the convolutional architecture as taught by Ernst in order to improve the efficiency of removing noise from a signal (Ernst Section IV.C - It is evident that for the far microphone (where reverberation conditions are harsher), U-Net with asymmetric filters exhibits better LLR and FWSegSNR objective measures than the other methods. FWSegSNR objective measure, regardless of the room type. For the near microphone, the regular U-Net and asymmetric U-net outperformed the other methods in most of the rooms for the CD, LLR and FWSegSNR objective measure, whereas the differences between the regular and asymmetric U-Net were negligible.).

Regarding Claim 13, Park teaches all the limitations of claim 9. Ernst further teaches the voice signal enhancement device according to claim 9, wherein the neural network comprises N successive convolutional layers (Ernst Section III.A Paragraph 2 - the skip connections directly concatenate feature maps from layer i in the encoder to layer N − i in the decoder, where N is number of layers.);
every two symmetric convolutional layers with the N/2th convolutional layer as an axis of symmetry are skip connected to each other (Ernst Section III.A Paragraphs 2-3 and Figure 2 - Following [19], our network details are as follows. Let CBLl,s denote a Convolution-BatchNorm-Leaky-ReLU layer with slope=0.2, where l is number of filters and s × s is the filter size. CLl,s and CBRl,s have the same architecture but without BatchNorm, or with a non-leaky ReLU, respectively. With same notation, let DCDRl,s denote the DeConvolution-BatchNorm-Dropout-ReLU with dropout of 50% (N/2), and let DCRl,s denote the DeConvolution-BatchNorm-ReLU. DCTl,s denote DeConvolution-tanh. The U-net architecture which demonstrates the skip connections is illustrated in Fig. 2.).
Park and Ernst are both considered to be analogous to the claimed invention because both relate to utilizing convolutional neural networks for voice signal enhancement. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Park by setting up the convolutional architecture as taught by Ernst in order to improve the efficiency of removing noise from a signal (Ernst Section IV.C - It is evident that for the far microphone (where reverberation conditions are harsher), U-Net with asymmetric filters exhibits better LLR and FWSegSNR objective measures than the other methods. FWSegSNR objective measure, regardless of the room type. For the near microphone, the regular U-Net and asymmetric U-net outperformed the other methods in most of the rooms for the CD, LLR and FWSegSNR objective measure, whereas the differences between the regular and asymmetric U-Net were negligible.).

Regarding Claim 14, Park teaches all the limitations of claim 10. Ernst further teaches the voice signal enhancement device according to claim 10, wherein the neural network comprises N successive convolutional layers (Ernst Section III.A Paragraph 2 - the skip connections directly concatenate feature maps from layer i in the encoder to layer N − i in the decoder, where N is number of layers.);
every two symmetric convolutional layers with the N/2th convolutional layer as an axis of symmetry are skip connected to each other (Ernst Section III.A Paragraphs 2-3 and Figure 2 - Following [19], our network details are as follows. Let CBLl,s denote a Convolution-BatchNorm-Leaky-ReLU layer with slope=0.2, where l is number of filters and s × s is the filter size. CLl,s and CBRl,s have the same architecture but without BatchNorm, or with a non-leaky ReLU, respectively. With same notation, let DCDRl,s denote the DeConvolution-BatchNorm-Dropout-ReLU with dropout of 50% (N/2), and let DCRl,s denote the DeConvolution-BatchNorm-ReLU. DCTl,s denote DeConvolution-tanh. The U-net architecture which demonstrates the skip connections is illustrated in Fig. 2.).
Park and Ernst are both considered to be analogous to the claimed invention because both relate to utilizing convolutional neural networks for voice signal enhancement. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Park by setting up the convolutional architecture as taught by Ernst in order to improve the efficiency of removing noise from a signal (Ernst Section IV.C - It is evident that for the far microphone (where reverberation conditions are harsher), U-Net with asymmetric filters exhibits better LLR and FWSegSNR objective measures than the other methods. FWSegSNR objective measure, regardless of the room type. For the near microphone, the regular U-Net and asymmetric U-net outperformed the other methods in most of the rooms for the CD, LLR and FWSegSNR objective measure, whereas the differences between the regular and asymmetric U-Net were negligible.).

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: Scalart et al. (US 20040064307 A1), Marro et al. (US 20070255535 A1), and Kremer et al. (US 20040078199 A1).
Scalart et al. (US 20040064307 A1) discloses an invention that “concerns a method which consists, when analyzing an input signal in the frequency domain, in determining a noise level estimator and a useful signal level estimator in an input signal frame, thereby enabling to calculate the transfer function of a first noise-reducing filter, carrying out a second pass to fine-tune the useful signal level estimator, by combining the signal spectrum and the first filter transfer function, then to calculate the transfer function of a second noise-reducing filter on the basis of the fine-tuned useful signal level estimator and the noise level estimator. Said second noise-reducing filter is then used to reduce the noise level in the frame.” (Scalart – Abstract).
Marro et al. (US 20070255535 A1) discloses an invention that “relates to a method of processing a noisy sound signal and to a device for implementing said method.” (Marro – Abstract).
Kremer et al. (US 20040078199 A1) discloses an “apparatus and a method for speech enhancement, the method includes the steps of: (i) receiving a noisy input signal; (ii) determining whether a likelihood of an existence of a speech signal in the noisy input signal exceeds a first threshold; (iii) generating an estimated noise signal, if the likelihood is below the first threshold; (iv) generating an estimated speech signal by parametric subtraction, if the likelihood exceeds a threshold; and (v) determining a relationship between the estimated noise signal and the estimated speech signal and modifying the estimated speech signal in response to the determination.” (Kremer – Abstract).
Please see the additional references in form PTO-892 for more details.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to UTHEJ KUNAMNENI, whose telephone number is (571) 272-5428. The examiner can normally be reached M-F 8:00-5:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached at (571) 272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/UTHEJ KUNAMNENI/               Examiner, Art Unit 2656
/EDGAR X GUERRA-ERAZO/              Primary Examiner, Art Unit 2656