DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 112
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.

The following is a quotation of the first paragraph of pre-AIA 35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.

Claims 1 to 10 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA), first paragraph, as failing to comply with the written description requirement. The claims contain subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA 35 U.S.C. 112, the inventors, at the time the application was filed, had possession of the claimed invention.
Independent claims 1 and 7 to 9 are amended to set forth a limitation directed to obtaining “one frequency that is identified from the plurality of frequencies”, which represents new matter.  Similarly, new dependent claim 10 sets forth a limitation of “based on the obtained one frequency”, and dependent claims 3 to 4 are amended to 
“The frequency of each node used in the generation of a time-series signal need not be a single frequency corresponding to the peak of the absolute value of the weights.  Alternatively, for example, for each node, a plurality of absolute values can be identified within a predetermined range from the peak, and the frequencies of a plurality of codes on the input side that correspond to a plurality of absolute values can be obtained. . . .  For each node, a plurality of signals defined according to a plurality of obtained frequencies and according to the amplitudes and the phases of the nodes corresponding to the obtained frequencies can be used in synthesizing the time-series signals.”  

Applicants’ closest support for “a single frequency”, then, is actually directed to a contrary embodiment that obtains “a plurality of frequencies”.   
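For illustration only, and forming no part of Applicants’ disclosure, the distinction drawn above between a single frequency at the peak of the absolute value of the weights and a plurality of frequencies within a predetermined range from the peak can be sketched as follows. The function names, the example weights and frequencies, and the reading of “predetermined range” as a margin on the absolute value are all assumptions made for this sketch.

```python
def peak_frequency(weights, freqs):
    """Single frequency whose input-side weight has the largest absolute value."""
    k = max(range(len(weights)), key=lambda i: abs(weights[i]))
    return freqs[k]


def frequencies_near_peak(weights, freqs, margin):
    """All frequencies whose weight magnitude lies within `margin` of the peak.

    Assumption: "within a predetermined range from the peak" is read here as
    a margin on the absolute value, not on the frequency index.
    """
    peak = max(abs(w) for w in weights)
    return [f for w, f in zip(weights, freqs) if peak - abs(w) <= margin]


# Hypothetical weights of one node, one per input-side frequency (in Hz).
weights = [0.2 + 0.1j, -1.5j, 1.2 + 0j, 0.3 + 0j]
freqs = [100.0, 200.0, 300.0, 400.0]

single = peak_frequency(weights, freqs)               # one frequency per node
plural = frequencies_near_peak(weights, freqs, 0.5)   # several frequencies per node
```

Under these assumed values, the first function yields one frequency for the node, while the second yields more than one, which is the contrast at issue in the written description rejection.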



Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1 to 3 and 7 to 10 are rejected under 35 U.S.C. 103 as being unpatentable over Kim et al. (U.S. Patent Publication 2017/0353789) in view of Korjani (U.S. Patent No. 10,614,827).
Concerning independent claims 1 and 7 to 9, Kim et al. discloses a method, system, apparatus, and computer-readable storage medium for sound source estimation using neural networks, comprising:
“one or more processors configured to:” – embodiments may be implemented as one or more computing devices, where device 20 includes a central processor 24 (¶[0067]: Figure 8);
“convert an acoustic signal and output amplitude and phase at a plurality of frequencies” – a plurality of microphones are responsive to a sound event generated by a sound source (¶[0002]); a plurality of features are extracted from an auralized signal, each of the features including a magnitude (“amplitude”) and a phase (“phase”) of a corresponding one of a plurality of auralized signals (¶[0002] - ¶[0003]); a feature extracted from an auralized signal may include a log magnitude and phase of a signal, where a log magnitude and phase of the auralized signal [Mc, θc] in the c-th channel may be defined as Mc = 20 log |Xc(ωk,n)| and θc = arg[Xc(ωk,n)], where Xc(ωk,n) corresponds to the n-th frame and k-th frequency bin (“at a plurality of frequencies”) of the spectrogram of the signal in the c-th channel (¶[0029] - ¶[0030]); Figure 6 illustrates that input to a neural network includes an array of a plurality of frequency bins and a plurality of microphone channels; here, a magnitude is equivalent to an “amplitude”, and a magnitude and a phase are obtained for a plurality of frequency bins, where each frequency component is represented by an index k;
“for each of a plurality of nodes of a hidden layer included in a neural network that treats the amplitude and the phase as input, obtain one frequency that is identified from the plurality of frequencies based on a plurality of weights used in arithmetic operation of the node” – a neural network architecture may be constructed by a sufficient number of layers and nodes within each layer so that it can model characteristics of the multi-microphone array with sufficient accuracy when trained with auralized multi-channel sound signals; Figure 5 illustrates a neural network architecture in which features extracted from auralized multi-channel signals are provided as inputs (¶[0046]: Figure 5); Figure 6 illustrates a specific embodiment of a neural network having four neural network layers, namely, Layer 1, Layer 2, Layer 3, and Layer 4 for processing sound events or features extracted from auralized sound signals; a first layer, that is, Layer 1, includes microphone channels representing audio inputs from multiple microphones and frequency bins for each of the microphone channels as illustrated in two-dimensional coordinate 602 (¶[0048]: Figure 6); by providing a large number of units per frequency bin in Layer 2, more degrees of freedom may be provided for modeling the virtual beamformer responses across U = ΣiWiXi + V, where Wi is a complex-valued weight connecting complex-valued inputs and V is a complex-valued threshold value for an activation function fR(x) (¶[0059] - ¶[0060]); here, an activation function fR(x) at a node of a neural network is a function of net result U and weights W (“based on a plurality of weights used in arithmetic operation of the node”); broadly, an activation function is “an arithmetic operation”; Layers 2 and 3 are “hidden layers” (“a hidden layer”) of a neural network, Layer 1 is an input layer, and Layer 4 may be an output layer; Layer 3 is a hidden layer that processes magnitude and phase for each frequency bin to output Spectrogram n and Spectrogram n + L at nodes 630 to 640 of Layer 4, where these spectrograms include a plurality of frequency components (“obtain one frequency . . . based on a plurality of weights”);
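For illustration only, and forming no part of the disclosure of Kim et al., the feature extraction mapped above (a log magnitude Mc = 20 log |Xc(ωk,n)| and a phase θc = arg[Xc(ωk,n)] for each frequency bin k of one frame) can be sketched as follows. The naive transform, the base-10 logarithm, the small floor `eps`, and all function names are assumptions made for this sketch.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform of one frame of real samples."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n)
    ]


def log_magnitude_and_phase(frame, eps=1e-12):
    """Return one (M, theta) pair per frequency bin k of the frame.

    M = 20 * log10(|X|) and theta = arg(X); `eps` is an assumed floor that
    keeps the logarithm finite for empty bins.
    """
    features = []
    for spectrum_value in dft(frame):
        magnitude_db = 20.0 * math.log10(max(abs(spectrum_value), eps))
        phase = cmath.phase(spectrum_value)
        features.append((magnitude_db, phase))
    return features


# Hypothetical frame: a cosine that falls exactly on frequency bin k = 1.
frame = [math.cos(2 * math.pi * t / 8) for t in range(8)]
features = log_magnitude_and_phase(frame)
```

In this sketch each bin k carries its own magnitude and phase, so the bin index k is the only frequency label attached to each pair.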

Concerning independent claims 1 and 7 to 9, Kim et al. arguably discloses all of the limitations of these independent claims.  That is, Kim et al. processes magnitude and phase information for a plurality of frequency components with a neural network to output an auralized signal estimating a sound source.  Broadly, this auralized signal estimating a sound source can be construed as “an acoustic signal” (“generating an acoustic signal based on the plurality of obtained frequencies and based on amplitude and phase”) as an acoustic signal is eventually output with the proper spatial rendering.  That is, Kim et al. discloses Layer 4 of a neural network outputs spectrograms at nodes 630, 632, 634, 636, 638, and 640.  (¶[0055]: Figure 6)  Layer 4 is an output layer that Kim et al., they are taught by Korjani.
Concerning independent claims 1 and 7 to 9, Korjani teaches a system and method for speech enhancement using dynamic noise profile estimation, where a neural network is trained for various types of noise from user speech and then the noise is subtracted from the speech data leaving only the speech free of noise.  (Abstract)  A deep neural network (DNN) is trained by link weights between input layer 122, hidden layers 124, and output layer 126.  (Column 2, Lines 38 to 48: Figure 1)  Link weights and thresholds of a deep neural network are adjusted and optimized to output a noise profile that best matches training noise provided in noisy speech training data.  A short-term Fourier Transform (STFT) is taken, and a time-frequency domain representation is given by Yk(l) = Sk(l) + Nk(l), where k is a frequency index, l is a segment index, and Yk(l), Sk(l), and Nk(l) are spectra of the noisy speech, clean speech, and noise, respectively.  The spectral coefficients can be written in terms of their amplitude and phase, Y = ReiϕY, S = Aeiϕs, N = DeiϕN.  (Column 4, Lines 1 to 19: Equations (1) to (3): Figure 1)  Figure 3 illustrates an algorithm for filtering noise from noisy speech data, where noisy speech is received, coefficients are input to a deep neural network, noise is dynamically filtered from noisy speech, and a waveform for reconstructed speech is played.  (Column 4, Lines 25 to 46: Figure 3)  Korjani, then, similarly teaches converting an acoustic signal into an amplitude and a phase at a plurality of frequencies, inputting these amplitudes and phases at a plurality of frequencies into nodes of a hidden layer of a neural network, and generating an acoustic signal based on a plurality of frequencies from Korjani outputs clean speech at a plurality of frequencies that is obtained by a speech-enhancing noise filter that is implemented by a trained neural network that uses amplitude and phase information.  
An objective is to provide a robust technique to accurately represent and filter noise from speech data that operates independent of the amount of training speech data or language.  (Column 1, Lines 36 to 39)  It would have been obvious to one having ordinary skill in the art to generate an acoustic signal using a neural network that is based on a plurality of frequencies and amplitude and phase as taught by Korjani in sound source enhancement of Kim et al. for a purpose of accurately representing and filtering noise from speech data.

Concerning claim 2, Kim et al. discloses that at least one layer of a neural network may be required to process complex numbers, e.g., Layer 2 (“a layer for inputting and outputting complex numbers”); a complex number may be in the form of a real component and an imaginary component, or a magnitude and a phase; each unit or node may receive complex inputs and produce a complex output; Wi is a complex-valued weight connecting complex-valued inputs, and V is a complex-valued threshold signal, where a net result U is converted into real and imaginary components as they are passed through an activation function fR(x) to obtain an output fout.  (¶[0059] - ¶[0060])  A neural network that includes nodes for processing complex numbers and complex weights is “a complex-valued neural network that includes a layer for inputting and outputting complex numbers.”  Similarly, Korjani teaches that spectral coefficients can be written in terms of their amplitude and phase, Y = ReiϕY, S = Aeiϕs, N = DeiϕN.  These are complex-valued coefficients, as indicated by the imaginary unit i in the exponent, and these complex-valued coefficients are processed by a deep neural network.
Concerning claim 3, Kim et al. discloses taking an “absolute value of a plurality of weights”, where one embodiment implements a complex-input-complex-output unit and makes the complex output real by simply taking the magnitude of the complex output: fout = |fR(R(U)) + fR(I(U))|; additionally, one embodiment applies an activation function on the absolute value of the complex sum: fout = fR(|U|) (¶[0062] - ¶[0063]).  Here, U = ΣiWiXi + V, where Wi is a complex-valued weight connecting complex-valued inputs and V is a complex-valued threshold value for an activation function fR(x) (¶[0059] - ¶[0060]).  That is, fout = fR(|U|) = fR(|ΣiWiXi + V |), which is equivalent to taking an “absolute value of a plurality of weights”, where an activation function is an “arithmetic operation of the node”.  These weights are then used by “a plurality of nodes present in a hidden layer” of a neural network of Figure 6, and are applied to “obtain . . . frequency” for spectrograms at nodes 630 to 640 of Layers 2 and 3. 
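For illustration only, the complex-input unit characterized above, with net result U = ΣiWiXi + V and real-valued output fout = fR(|U|), can be sketched as follows. The choice of tanh for fR, the function name, and the example values are assumptions made for this sketch and are not taken from Kim et al.

```python
import math


def complex_unit_output(weights, inputs, threshold, f_r=math.tanh):
    """Real output of one complex-input unit: f_out = f_R(|sum_i W_i X_i + V|).

    `f_r` stands in for the activation function f_R; tanh is an assumption.
    """
    net = sum(w * x for w, x in zip(weights, inputs)) + threshold
    return f_r(abs(net))


# Hypothetical complex-valued weights W_i, inputs X_i, and threshold V.
weights = [1 + 1j, 0.5 - 0.5j]
inputs = [1j, 2 + 0j]
threshold = 3 + 4j

out = complex_unit_output(weights, inputs, threshold)
```

With these assumed values the weighted sum cancels, so U equals the threshold 3 + 4i, the absolute value is 5, and the output is tanh(5).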
Concerning claim 10, Kim et al. discloses “generate, for each of the plurality of nodes, a signal based on the obtained one frequency, corresponding amplitude, and corresponding phase” – a feature extracted from an auralized signal may include a log magnitude and a phase of the signal, where a log magnitude and phase of an auralized signal [Mc, θc] in the c-th channel may be defined as Mc = 20 log |Xc(ωk,n)| and θc = arg[Xc(ωk,n)], where Xc(ωk,n) corresponds to the n-th frame and k-th frequency bin of the spectrogram of the signal in the c-th channel (“a signal based on the obtained one frequency, corresponding amplitude, and corresponding phase”) (¶[0029] - ¶[0030]); magnitude Mc is a “corresponding amplitude”, and phase θc is a “corresponding phase”; for frequency-domain based features, both magnitude and phase information from each of the sound channels may be provided to the neural network (¶[0059]: Figure 6); each node of the neural network, then, ‘generates a signal’ based on a k-th frequency bin, magnitude Mc, and phase θc.  Korjani teaches “generate the acoustic signal by synthesizing a plurality of signals generated for the plurality of nodes” – link weights and thresholds of a deep neural network (DNN) are adjusted and optimized 240 to output a noise profile; when properly trained, a noise filter estimates noise n(i) and recovers clean speech s(i) from the noisy speech x(i) for each frequency index k, and amplitude and phase; once noise is isolated and removed from all the frames, the frames of the clean speech are assembled and reconstructed 350 into a complete waveform; this waveform with filtered speech may be transmitted to a user and played 360 (column 4, lines 1 to 46: Figures 1 and 3).  Similarly, Kim et al. discloses that sound signals from stationary or moving sound sources may be auralized to generate auralized multi-channel sound signals.  (¶[0018])  Implicitly, Kim et al. discloses that output provided from nodes of a neural network is combined to “generate the acoustic signal by synthesizing a plurality of signals generated by the plurality of nodes.”

Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Kim et al. (U.S. Patent Publication 2017/0353789) in view of Korjani (U.S. Patent No. 10,614,827) as applied to claims 1 to 3 above, and further in view of Sainath et al. (U.S. Patent Publication 2017/0092265).
Kim et al. discloses that nodes 630 to 640 of Layer 4 receive a spectrogram “to obtain one frequency that is set for such node in the previous layer corresponding to the weight”, and taking an absolute value of the weight by fout = fR(|U|).  (¶[0055] and ¶[0062] - ¶[0063]: Figure 6)  However, Kim et al. does not disclose obtaining “a frequency . . . at which the absolute value of the node becomes highest.”  Generally, this appears to be equivalent to what is conventionally known in the art of neural networks as ‘max pooling’, which is a common component of convolutional neural networks.  That is, max pooling takes the highest value of a plurality of values that are output from nodes at a given layer of a neural network and uses only that highest value as an input to the next layer.  
Specifically, Sainath et al. teaches a neural network for speech recognition, where an output of spectral filtering is generated as a pooled output.  Pooling a spatial filtered output in time to generate the pooled output may include non-overlapping max pooling the spatial filtered output along the frequency axis.  The spectral filtered output may include generating the spectral filtered output by applying a rectified non-linearity to the pooled output.  (¶[0007])  Spectral filtering convolution layer 104 applies a pooling function and a rectified non-linearity function using layer 106 to spatial filtered output to generate a spectral filtered output 108.  (¶[0023]: Figure 1)  Spectral filtering convolutional layer 104 includes a pooling and non-linearity layer 106 that pools the output, e.g., to discard short-time phase information.  (¶[0028]: Figure 1)  The frequency convolutional layer 110 may use pooling, e.g., non-overlapping max pooling along the frequency axis.  (¶[0032]: Figure 1)  Here, max pooling of frequency is “to obtain one frequency that is set for such a node in previous layer corresponding to weight at which i.e., a frequency peak.  An objective is to prevent degradation of speech recognition performance due to reverberation or additive noise.  (¶[0002])  It would have been obvious to one having ordinary skill in the art to provide max pooling of frequency as taught by Sainath et al. in a speech-enhanced noise filter of Korjani for a purpose of preventing degradation of speech recognition performance due to additive noise.
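For illustration only, non-overlapping max pooling along a frequency axis, as the term is characterized above, can be sketched as follows. The function name, the handling of a trailing partial window (dropped here), and the example values are assumptions made for this sketch and are not taken from Sainath et al.

```python
def max_pool_frequency(values, pool_size):
    """Non-overlapping max pooling over a list ordered along the frequency axis.

    Each window of `pool_size` adjacent values is replaced by its maximum.
    Assumption: a trailing window shorter than `pool_size` is dropped.
    """
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    last_start = len(values) - pool_size
    return [
        max(values[start:start + pool_size])
        for start in range(0, last_start + 1, pool_size)
    ]


# Hypothetical filter outputs for six adjacent frequency positions.
filter_outputs = [0.1, 0.7, 0.3, 0.9, 0.2, 0.4]
pooled = max_pool_frequency(filter_outputs, 3)
```

Note that the pooled output in this sketch retains only the maximum value of each window and not the frequency position at which that maximum occurred.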

Claims 5 to 6 are rejected under 35 U.S.C. 103 as being unpatentable over Kim et al. (U.S. Patent Publication 2017/0353789) in view of Korjani (U.S. Patent No. 10,614,827) as applied to claim 1 above, and further in view of Fukuda (U.S. Patent Publication 2018/0053087).
Arguably, Kim et al. and Korjani disclose and teach the limitations of these claims.  Specifically, Kim et al. discloses a sound classifier that compares speech to a speech database of phonemically and lexically transcribed speech as set forth by claim 5, and Korjani teaches a speech-enhancing noise filter as set forth by claim 6.  That is, Kim et al. discloses that a “neural network is an acoustic model” that receives a “feature of an acoustic signal is input” and “each processing unit representing either a phoneme, . . . or a word is output.”  (¶[0017] and ¶[0025])  Similarly, Korjani’s speech-enhancing noise filter is implemented by a neural network that is equivalent to “a denoising autoencoder” as the neural network is trained so that noise is eliminated from output.  (Abstract)  Kim et al. does not expressly disclose that a “posterior probability” is used to Fukuda.  
Here, Fukuda teaches training of front-end and back-end neural networks, where a front-end NN may be configured to estimate clean frequency filter bank features from noisy input features.  (Abstract)  A neural network is used based on an acoustic front-end as a denoising autoencoder.  (¶[0002])  The front-end neural network may be used for a denoising autoencoder (“a denoising autoencoder”), and is a denoising front-end.  The back-end neural network may be used for acoustic modeling, and for identifying phonemes corresponding to the input speech, where the back-end NN may be referred to as an acoustic model backend.  (¶[0018] - ¶[0019])  The output layer of a back-end NN is composed of a plurality of units, where the output layer of the back-end NN outputs posterior probability of each context-dependent phoneme.  (¶[0045]: Figure 2)  Features are processed by a DNN-based feature mapping with a denoising effect.  Joint training is performed by training a combined NN to estimate posterior probabilities of context dependent phonemes corresponding to input noisy features (“posterior probability of each processing unit representing at least either a phoneme . . .”).  (¶[0108]: Figure 7)  An objective is to use a neural network to provide joint training of front-end and back-end neural networks to improve robustness of an acoustic model to various noise conditions.  (¶[0001] - ¶[0002])  It would have been obvious to one having ordinary skill in the art to use neural networks of Kim et al. and Korjani to determine posterior probabilities of phonemes and to provide denoising as a denoising autoencoder as taught by Fukuda for a purpose of improving robustness of an acoustic model by joint training.

Response to Arguments
Applicants’ arguments filed 26 April 2021 have been fully considered but they are not persuasive.
Applicants’ amendments overcome the objection to the title, and the amended title is being entered.
Applicants’ amendments overcome the rejection of independent claim 9 under 35 U.S.C. §101.
Applicants amend independent claims 1 and 7 to 9 to include a new limitation directed to obtaining “one frequency that is identified from the plurality of frequencies”.  Applicants add new dependent claim 10.  Generally, Applicants’ argument is that the new limitation of “obtain one frequency that is identified from the plurality of frequencies based on a plurality of weights used in arithmetic operation of the node” is not disclosed or taught by Kim et al. (U.S. Patent Publication 2017/0353789) or Korjani (U.S. Patent No. 10,614,827).  Applicants cite ¶[0062] of Kim et al. as disclosing that each node of a neural network receives ‘complex input’.  Specifically, Applicants allege that Kim et al. discloses that each node of Layer 2 of a neural network corresponds to multiple inputs because each of nodes 606, 608, and 610 receives a first input from a frequency bin of Spectrogram n and a second input from a frequency bin of Spectrogram n + L.  Applicants conclude that Kim et al. is silent on a limitation of “obtain one frequency that is identified from a plurality of frequencies based on a plurality of weights used in arithmetic operation of the node”.  This argument is not persuasive.
i.e., is not a single frequency.
Secondly, Applicants’ argument directed to the rejection of the independent claims as being obvious under 35 U.S.C. §103 over Kim et al. (U.S. Patent Publication 2017/0353789) in view of Korjani (U.S. Patent No. 10,614,827) is not persuasive under a variety of rationales.  Mainly, it may be true that Figure 6 of Kim et al. shows a solid line from a frequency bin of Spectrogram n and a dashed line from a frequency bin of Spectrogram n + L, but it is maintained that these lines represent the same frequency bin from different time slices n and n + L.  That is, the solid and dashed lines represent the same frequency bin, k, at different frames, n.  Kim et al. describes the advantage of this at ¶[0067] - ¶[0068], where it is disclosed that it may be desirable Kim et al., so that this reference is not limited to only an embodiment using frames n and n + L illustrated in Figure 6.  Here, one skilled in the art could understand that a corresponding simpler embodiment is contemplated by Kim et al., using only a single Spectrogram n, and not Spectrogram n + L, so that one does not have to consider the dashed lines of Figure 6, which is only a preferred embodiment.  Moreover, Applicants’ claim language may expressly set forth obtaining “one frequency”, but the independent claims are drafted with “comprising” language, which does not exclude additional frequency components.  Conventionally, claim language is construed so that any recitation of only a single element is not interpreted to exclude a plurality of these elements.  Even if a limitation of “one frequency” were not new matter under 35 U.S.C. §112(a), Applicants’ embodiments do not contemplate using information from past and future frames.  Applicants’ embodiments, then, do not exclude an additional feature of using information about past and future frames due to the preambular claim language of “comprising”.  
Given all of this evidence, it is maintained that a limitation of obtaining “one frequency that is identified from the plurality of frequencies” is obvious under 35 U.S.C. §103 over Kim et al., at least because one frequency bin of a plurality of frequency bins is applied to each node of Layer 2.    
Thirdly, Applicants emphasize the language of ¶[0062] of Kim et al., as directed to receiving ‘complex inputs’, as if this is enough to distinguish over their claim Kim et al. at least discloses that a single frequency of a frequency bin is input to each node of Layer 2, and that each node then processes an input using weights and arithmetic operations characteristic of a neural network.
Applicants’ arguments, then, are not persuasive.  New grounds of rejection are applied under 35 U.S.C. §112(a).  Any new grounds of rejection are necessitated by amendment.  Accordingly, this rejection is properly FINAL.

Conclusion
The prior art made of record and not relied upon is considered pertinent to Applicants’ disclosure.
Ramprashad is similar prior art directed to a neural network producing clean speech involving spectral peaks.  (¶[0032] - ¶[0039])  
Applicants’ amendment necessitated the new grounds of rejection presented in this Office Action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP §706.07(a).  Applicants are reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  

Any inquiry concerning this communication or earlier communications from the examiner should be directed to MARTIN LERNER whose telephone number is (571) 272-7608.  The examiner can normally be reached on Monday-Thursday 8:30 AM-6:00 PM.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel Washburn, can be reached on (571) 272-5551.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair.  Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). 






/MARTIN LERNER/
Primary Examiner, Art Unit 2657
May 3, 2021