DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1, 13 and 20 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Variani et al., (US 11,062,725 B2, herein “Variani”).
Regarding claim 1, Variani teaches a speech recognition method, comprising (Variani abstract, fig. 7, method of a speech recognition system, col. 24, ll. 63-64, fig. 7 depicting the process for speech recognition using neural networks): 
obtaining a first voice signal collected from a first microphone in a microphone array and a second voice signal collected from a second microphone in the microphone array, the second microphone being different from the first microphone (Variani col. 24, l. 63 – col. 25, l. 9, and col. 16, lines 25-35, process performed by the computing system 420, which includes two or more microphones located at different spatial positions (thus different microphones) receiving respective raw audio signals of user utterance/speech (first and second voice signals));
extracting enhanced features associated with the first voice signal and the second voice signal through a neural network (Variani col. 25, ll. 10-51, first and second data representing the first and second raw audio signals are filtered and processed through layers of a neural network to output spectral filtered output (extracted features), which col. 18, ll. 31-32 teaches the neural network module is for enhancement and acoustic modeling (thus enhanced output), and col. 21, ll. 6-64 detail the filtering and layer processing of the neural network, resulting in a frame-level feature vector $Z_f^p[l]$ as given by equation 9); and
obtaining a speech recognition result based on the enhanced features (Variani col. 25, ll. 50-53, from the spectral filtered output and further processing thereof in additional layers of the neural network, sub-word units in both the first and second raw audio signals are predicted).
Regarding claim 13, Variani teaches an electronic device, comprising (Variani fig. 4, col. 16, ll. 6-7 and 19-22, system for speech recognition performed by an individual computer system, such as the one computer shown, 420 in fig. 4): 
one or more processors (Variani figs. 4 and 8, col. 26, ll. 56-59, col. 27, ll. 4-5 and 60-62, computing device used to implement the system disclosed including a processor); and 
Variani figs. 4 and 8, col. 26, ll. 56-59, col. 27, ll. 4-8, 29-40 and 60-62, computing device used to implement the system disclosed including a storage device as a computer readable medium including a computer program product that contains instructions that when executed perform one or more methods as disclosed) is caused to implement a speech recognition method, the method comprising (Variani abstract, fig. 7, method of a speech recognition system, col. 24, ll. 63-64, fig. 7 depicting the process for speech recognition using neural networks): 
obtaining a first voice signal collected from a first microphone in a microphone array and a second voice signal collected from a second microphone in the microphone array, the second microphone being different from the first microphone (Variani col. 24, l. 63 – col. 25, l. 9, and col. 16, lines 25-35, process performed by the computing system 420, which includes two or more microphones located at different spatial positions (thus different microphones) receiving respective raw audio signals of user utterance/speech (first and second voice signals)); 
extracting enhanced features associated with the first voice signal and the second voice signal through a neural network (Variani col. 25, ll. 10-51, first and second data representing the first and second raw audio signals are filtered and processed through layers of a neural network to output spectral filtered output (extracted features), which col. 18, ll. 31-32 teaches the neural network module is for enhancement and acoustic modeling (thus enhanced output), and col. 21, ll. 6-64 detail the filtering and layer processing of the neural network, resulting in a frame-level feature vector $Z_f^p[l]$ as given by equation 9); and
obtaining a speech recognition result based on the enhanced features (Variani col. 25, ll. 50-53, from the spectral filtered output and further processing thereof in additional layers of the neural network, sub-word units in both the first and second raw audio signals are predicted).
Regarding claim 20, Variani teaches a computer-readable storage medium having a computer program stored thereon, wherein when the program is executed by a processor (Variani figs. 4 and 8, col. 26, ll. 56-59, col. 27, ll. 4-8, 29-40 and 60-62, a computer readable medium including a computer program product that contains instructions that when executed perform one or more methods as disclosed), a speech recognition method is implemented, the method comprising (Variani abstract, fig. 7, method of a speech recognition system, col. 24, ll. 63-64, fig. 7 depicting the process for speech recognition using neural networks): 
obtaining a first voice signal collected from a first microphone in a microphone array and a second voice signal collected from a second microphone in the microphone array, the second microphone being different from the first microphone (Variani col. 24, l. 63 – col. 25, l. 9, and col. 16, lines 25-35, process performed by the computing system 420, which includes two or more microphones located at different spatial positions (thus different microphones) receiving respective raw audio signals of user utterance/speech (first and second voice signals)); 
extracting enhanced features associated with the first voice signal and the second voice signal through a neural network (Variani col. 25, ll. 10-51, first and second data representing the first and second raw audio signals are filtered and processed through layers of a neural network to output spectral filtered output (extracted features), which col. 18, ll. 31-32 teaches the neural network module is for enhancement and acoustic modeling (thus enhanced output), and col. 21, ll. 6-64 detail the filtering and layer processing of the neural network, resulting in a frame-level feature vector $Z_f^p[l]$ as given by equation 9); and
obtaining a speech recognition result based on the enhanced features (Variani col. 25, ll. 50-53, from the spectral filtered output and further processing thereof in additional layers of the neural network, sub-word units in both the first and second raw audio signals are predicted).

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 2 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Variani, as set forth above regarding claim 1 from which claim 2 depends, and as set forth above regarding claim 13 from which claim 14 depends, further in view of Ouyang et al., "A Fully Convolutional Neural Network for Complex Spectrogram Processing in Speech Enhancement," ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 5756-5760, doi: 10.1109/ICASSP.2019.8683423 (herein “Ouyang”).
Regarding claims 2 and 14, Variani teaches wherein extracting the enhanced features associated with the first voice signal and the second voice signal through the neural network comprises: performing complex Fourier transform on the first voice signal and the second voice signal, respectively, to obtain a transformed first voice signal and a transformed second voice signal (Variani col. 21, ll. 6-13, for each input channel (thus including the first and second voice signals) an M-point Fast Fourier Transform is performed resulting in complex frequency domain data $X_c[l]$ per channel (transformed first and second voice signals));
Variani col. 21, ll. 13-39, spectral filtering layer performs a convolution on the complex valued frequency bands (complex convolution), then a complex linear projection (complex linear transformation) is performed which performs max pooling in the frequency domain, where col. 17, ll. 45-46 teaches the neural network is a convolutional neural network, which would be complex by way of its processing of complex values); and
converting the complex features into enhanced features in real number (Variani col. 21, ll. 37-49, including equation 7, frame-level feature vector $Z_f^p[l]$ is the absolute value of the output of the spectral convolution $W_f^p[l,k]$, where the absolute value of a complex value is a real value).
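By way of illustration only, the sequence mapped above (a complex Fourier transform per channel, a complex linear projection, and an absolute value yielding real-valued features) can be sketched as follows. This is a minimal sketch and not Variani's disclosed model: the function name, the FFT size, the number of filters, and the randomly chosen projection weights are all hypothetical.

```python
import numpy as np

def real_features_from_channels(signals, weights, n_fft=512):
    """Illustrative only: complex FFT -> complex linear projection -> magnitude.

    signals: real array of shape (channels, samples), one row per microphone.
    weights: complex array of shape (filters, n_fft // 2 + 1) (hypothetical values).
    """
    # Complex Fourier transform of each channel (one complex spectrum per microphone).
    spectra = np.fft.rfft(signals, n=n_fft, axis=-1)  # (channels, n_fft // 2 + 1)
    # Complex linear projection: weighted sum over frequency bins for each filter.
    projected = spectra @ weights.T  # (channels, filters), still complex
    # The absolute value of a complex number is real, so the features are real-valued.
    return np.abs(projected)

rng = np.random.default_rng(0)
two_channels = rng.standard_normal((2, 512))  # stand-in for two microphone signals
clp_weights = rng.standard_normal((4, 257)) + 1j * rng.standard_normal((4, 257))
features = real_features_from_channels(two_channels, clp_weights)
print(features.shape, features.dtype)
```

The final magnitude step is what converts the complex intermediate result into real numbers; everything before it operates on complex values.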
Variani does not explicitly teach performing complex offset.
Ouyang teaches performing complex offset (Ouyang page 5757 sections 2.2-2.3, CNN layer including a dilation performed with complex values (see real / imaginary spectrogram input in fig. 3), the dilation realizing an offset (dilation factor) of 1, 2 and 4 as shown in fig. 1).
Therefore, taking the teachings of Variani and Ouyang together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the neural network processing disclosed in Variani to include the dilation performed in Ouyang at least because doing so would allow for keeping the size of a filter small while still having a receptive field large enough.
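The rationale above turns on the relationship between dilation and receptive field. As a rough illustration, assuming stacked one-dimensional convolution layers with stride 1 and a hypothetical kernel size of 3 (a value not taken from Ouyang), the receptive field for the dilation factors 1, 2 and 4 can be compared with that of undilated layers:

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of stacked stride-1 convolutions with the given dilation factors."""
    field = 1
    for dilation in dilations:
        # Each layer widens the field by (kernel_size - 1) * dilation samples.
        field += (kernel_size - 1) * dilation
    return field

dilated = receptive_field(3, [1, 2, 4])    # dilation factors 1, 2 and 4
undilated = receptive_field(3, [1, 1, 1])  # same filter size, no dilation
print(dilated, undilated)  # 15 7
```

Under these assumptions the filter size is unchanged while the receptive field roughly doubles, which is the trade-off the rationale relies on.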
Claims 3-4 and 15-16 are rejected under 35 U.S.C. 103 as being unpatentable over Variani in view of Ouyang, as set forth above regarding claim 2 from which claim 3 depends, and as set forth above regarding claim 14 from which claim 15 depends, further in view of Shao et al., (US 2021/0020175 A1, herein “Shao”).
Regarding claims 3 and 15, Variani teaches wherein obtaining the speech recognition result comprises: determining, based on the enhanced features and output corresponding to the first voice signal and the second voice signal (Variani col. 25, ll. 50-53, from the spectral filtered output and further processing thereof in additional layers of the neural network, sub-word units in both the first and second raw audio signals are predicted), but does not explicitly teach the remainder of the limitations of claim 3. 
Shao teaches a character output corresponding to the voice signal through a streaming multi-layer truncated attention model (Shao fig. 1, paras. 25-27, features extracted from a voice signal are input to a decoder using a streaming multi-layer truncated attention model, and a recognition result comprised of text/characters is output therefrom).
Therefore, taking the teachings of Variani and Shao together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the neural network processing disclosed in Variani to include the character output through a streaming multi-layer truncated attention model as disclosed in Shao at least because doing so would provide accurate voice to text decoding with streaming audio (Shao paras. 24-25).
Regarding claims 4 and 16, Variani teaches wherein obtaining the speech recognition result further comprises: compressing the enhanced features based on a predetermined size (Variani col. 21, ll. 55-62, output of spatial convolution layer (enhanced features) has a power compression applied and linearly projected down to an F (predetermined size) dimensional space). Variani does not teach providing the features compressed to the streaming multi-layer truncated attention model.
Shao teaches providing the features compressed to the streaming multi-layer truncated attention model (Shao fig. 1, paras. 25-27, features extracted from a voice signal are input to a decoder using a streaming multi-layer truncated attention model).
Therefore, taking the teachings of Variani and Shao together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the neural network processing disclosed in Variani to include the character output through a streaming multi-layer truncated attention model as disclosed in Shao at least because doing so would provide accurate voice to text decoding with streaming audio (Shao paras. 24-25).
Claims 3-4, and 15-16 are also rejected under 35 U.S.C. 103 as being unpatentable over Variani in view of Ouyang, as set forth above regarding claim 2 from which claim 3 depends, and as set forth above regarding claim 14 from which claim 15 depends, further in view of Wang et al., "Stream Attention-based Multi-array End-to-end Speech Recognition," ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 7105-7109, doi: 10.1109/ICASSP.2019.8682650 (herein “Wang”).
Regarding claims 3 and 15, Variani teaches wherein obtaining the speech recognition result comprises: determining, based on the enhanced features and output corresponding to the first voice signal and the second voice signal (Variani col. 25, ll. 50-53, from the spectral filtered output and further processing thereof in additional layers of the neural network, sub-word units in both the first and second raw audio signals are predicted), but does not explicitly teach the remainder of the limitations of claim 3. 
Wang teaches a character output corresponding to the voice signal through a streaming multi-layer truncated attention model (Wang page 7106-7107 section 3, most probable letter sequence C given a speech input X is given by the multi-stream architecture shown in fig. 1, which processes streams (streaming) with multiple attention layers (shown as Attention 1, Attention 2) and using CTCs 1 and 2, where the decoding of the CTC network follows equations 3 and 4 on page 7106, including an “arg max” (maxima values/peak) of the CTC).
Therefore, taking the teachings of Variani and Wang together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the neural network processing disclosed in Variani to include the multi-stream architecture as disclosed in Wang at least because doing so would provide improvement in (i.e. lower) word error rate over conventional strategies (Wang Abstract).
Regarding claims 4 and 16, Variani teaches wherein obtaining the speech recognition result further comprises: compressing the enhanced features based on a predetermined size (Variani col. 21, ll. 55-62, output of spatial convolution layer (enhanced features) has a power compression applied and linearly projected down to an F (predetermined size) dimensional space). Variani does not teach providing the features compressed to the streaming multi-layer truncated attention model.
Wang teaches providing the features compressed to the streaming multi-layer truncated attention model (Wang page 7106-7107 section 3, most probable letter sequence C given a speech input X is given by the multi-stream architecture shown in fig. 1, which processes streams (streaming) with multiple attention layers (shown as Attention 1, Attention 2) and using CTCs 1 and 2, where the decoding of the CTC network follows equations 3 and 4 on page 7106, including an “arg max” (maxima values/peak) of the CTC).
Therefore, taking the teachings of Variani and Wang together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the neural network processing disclosed in Variani to include the multi-stream architecture as disclosed in Wang at least because doing so would provide improvement in (i.e. lower) word error rate over conventional strategies (Wang Abstract).
Claims 5 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Variani in view of Ouyang, as set forth above regarding claim 2 from which claim 5 depends, and as set forth above regarding claim 14 from which claim 17 depends, further in view of Sivasankaran et al., "Keyword-based speaker localization: Localizing a target speaker in a multi-speaker environment," Interspeech 2018 - 19th Annual Conference of the International Speech Communication Association, Sep 2018, Hyderabad, India. ⟨hal-01817519⟩ (herein “Sivasankaran”) in view of Jung et al., (US 2021/0372595 A1, herein “Jung”).
Regarding claims 5 and 17, while Variani col. 23, l. 66 – col. 24, l. 3 discloses that beam patterns of the disclosed CLP model show a magnitude response as a function of direction of arrival, thus at least suggesting relationships between the neural network processing of Variani and direction of arrival, Variani does not explicitly disclose the limitations of claims 5 and 17. 
Sivasankaran teaches further comprising determining a direction of a target sound source associated with the first voice signal and the second voice signal based on the enhanced features (Sivasankaran sections 2 and 4, and fig. 2, based on CSIPD, Speech magnitude spectrum and target identifier features, they are input to a CNN to determine a direction of arrival of a keyword spoken by a target, in consideration of audio signals received at two different microphones). 
Jung teaches turning on a reminder light associated with the direction determined (Jung fig. 8B, para. 77, detecting the direction of a voice sound and displaying a light in the direction of the sound).
Therefore, taking the teachings of Variani and Sivasankaran together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the neural network processing disclosed in Variani to include the direction of arrival of a keyword spoken by a target as disclosed in Sivasankaran at least because doing so would make a speech recognition application more efficient as the recognition can be restricted to only processing speech produced by the identified target via knowing the direction of arrival of the target (see Sivasankaran section 1).
Further, taking the teachings of Variani and Jung together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the neural network processing disclosed in Variani to include the displaying a light in the direction of detected sound as disclosed in Jung at least because doing so would provide a way to signal to the user in an aesthetic sense (see Jung para. 5).
Claims 6 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Variani in view of Ouyang, as set forth above regarding claim 2 from which claim 6 depends, and as set forth above regarding claim 14 from which claim 18 depends, further in view of Sivasankaran.
Regarding claims 6 and 18, Variani teaches further comprising: initiating a character recognition process (Variani col. 25, ll. 50-59, the spectral filtered output is processed to predict (thus also initiating) sub-word (including characters) units encoded in the first and second audio signal, and to perform an action using the predicted sub-word units).
Variani does not teach determining, based on the enhanced features, whether the first voice signal and the second voice signal involve a wakeup word; and in response to determining that the first voice signal and the second voice signal involve the wakeup word.
Sivasankaran teaches determining, based on the enhanced features, whether the first voice signal and the second voice signal involve a wakeup word (Sivasankaran section 3.4, feature vectors from the input signals (first and second voice signals) are input to a CNN to identify a target that speaks a keyword (wakeup word)); and in response to determining that the first voice signal and the second voice signal involve the wakeup word (Sivasankaran section 3.4, fig. 2, and Introduction section 1, the target identifier is used to determine (thus in response to) the direction of arrival for the keyword speech, the DOA then can be used for speech recognition).
Therefore, taking the teachings of Variani and Sivasankaran together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the neural network processing disclosed in Variani to include the target identification as disclosed in Sivasankaran at least because doing so would make a speech recognition application more efficient as the recognition can be restricted to only processing speech produced by the identified target via knowing the direction of arrival of the target (see Sivasankaran section 1).
Claims 7 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Variani.
Regarding claims 7 and 19, Variani teaches wherein extracting the enhanced features associated with the first voice signal and the second voice signal through the neural network comprises: obtaining a third voice signal collected from a third microphone in the microphone array (Variani col. 24, l. 63 – col. 25, l. 9, and col. 16, lines 25-35, process performed by the computing system 420, which includes two or more microphones (thus also a third) located at different spatial positions receiving respective raw audio signals of user utterance/speech); and 
extracting enhanced features associated with the first voice signal, the second voice signal and the third voice signal through the neural network (Variani col. 16, ll. 27-31, col. 25, ll. 10-51, data respectively representing multi channel audio signals, the channels being more than two (thus a third voice signal) are filtered and processed through layers of a neural network to output spectral filtered output (extracted features), which col. 18, ll. 31-32 teaches the neural network module is for enhancement and acoustic modeling (thus enhanced output), and col. 21, ll. 6-64 detail the filtering and layer processing of the neural network, resulting in a frame-level feature vector $Z_f^p[l]$ as given by equation 9).
While Variani discloses its system to be processing multiple channels, and having two or more microphones at different spatial positions, Variani does not explicitly disclose a multi-channel configuration with three channels. However, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the multi-channel processing disclosed in Variani to be three channels and thus obtaining a third voice signal, and extracting features therefrom, at least because doing so would have been a mere duplication of parts – that is, duplicating the signal processing disclosed for channels 1 and 2 also for channel 3. See MPEP 2144.04(VI)(B).
Claims 8-9 are rejected under 35 U.S.C. 103 as being unpatentable over Variani, as set forth above regarding claim 1 from which claim 8 depends, further in view of Xue et al., (US 2019/0237065 A1, herein “Xue”) further in view of Wang.
Regarding claim 8, Variani teaches training an integrated speech enhancement and recognition model by using the multi-channel far-field voice signals (Variani col. 22, ll. 39-59, and col. 18, ll. 31-42, training data used for the multichannel enhancement and acoustic model (integrated speech enhancement and recognition model) is from a two-channel microphone array of speakers' utterances, the speakers distanced from the microphones by 1 to 4 meters (thus far-field)), but does not teach the remainder of the limitations of claim 8.
Xue teaches further comprising: obtaining a same number of multi-channel far-field voice signals as microphones in the microphone array, the multi-channel far-field voice signals at least comprising a first far-field voice signal and a second far-field voice signal (Xue fig. 1, para. 34, far field audio data generated from audio data recorded through a microphone array, each microphone providing its own signal y1(t) through y4(t) (thus including at least first and second far-field voice signals)).
Variani does not teach that its speech recognition neural network is “end-to-end.”
Wang teaches end-to-end (Wang page 7106, section 3, end-to-end speech recognition system with microphone arrays).
Therefore, taking the teachings of Variani and Xue together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the neural network processing disclosed in Variani to include the training data being from a microphone array with the same number of far-field voice signals as microphones in the array as disclosed in Xue at least because doing so would achieve a closer simulation of the expected audio data to be processed in the trained model in the actual setting, thus reducing overall cost of capturing far field audio data and improving the accuracy of far field audio models trained (Xue para. 36).
Further, taking the teachings of Variani and Wang together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the neural network processing disclosed in Variani to include the end-to-end architecture as disclosed in Wang at least because doing so would provide improvement in (i.e. lower) word error rate over conventional strategies (Wang Abstract).
Regarding claim 9, Variani does not explicitly teach the limitations of claim 9. Xue teaches wherein obtaining the same number of multi-channel far-field voice signals as the microphones in the microphone array comprise: simulating, based on near-field voice signals, the multi-channel far-field voice signals in real time through a random noise addition (Xue paras. 21, 33-35, far field audio data is simulated to include isotropic noise which can include wind or the sound of vehicles on the road (both random type noises)).
Therefore, taking the teachings of Variani and Xue together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the neural network processing disclosed in Variani to include the simulation of far-field voice signals as disclosed in Xue at least because doing so would achieve a closer simulation of the expected audio data to be processed in the trained model in the actual setting, thus reducing overall cost of capturing far field audio data and improving the accuracy of far field audio models trained (Xue para. 36).
Claims 10-12 are rejected under 35 U.S.C. 103 as being unpatentable over Variani in view of Xue in view of Wang, as set forth above regarding claim 9 from which claim 10 depends, further in view of Ko et al., "A study on data augmentation of reverberant speech for robust speech recognition," 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 5220-5224, doi: 10.1109/ICASSP.2017.7953152 (herein “Ko”).
Regarding claim 10, Variani does not teach the limitations of claim 10. Ko teaches wherein simulating the multi-channel far-field voice signals in real time through the random noise addition comprises: randomly setting simulation parameters (Ko page 5221, section 2, a number of room impulse responses (RIR) are randomly generated for integration into simulation of far-field speech as a type of distortion/noise): configurations of a room (Ko page 5221, section 2, room parameters are sampled to generate the RIRs), a position of the microphone array in the room, a position of a target sound source in the room, and a position of a noise source in the room (Ko page 5221, section 2, RIRs sampled from speaker (target sound source) and receiver (microphone) position, and where the point source noises are labelled and applied to the model by classification of foreground or background indicating a position in the room), the configurations of the room comprising a length, a width, and a height of the room, and a wall reflection coefficient (Ko page 5221, section 2, room parameters are sampled to generate the RIRs, the parameters including width, length and height, and absorption coefficients which pertain to the wall (see footnote 1), where absorption is the inverse of reflection and as such does indicate a reflection (wall reflection coefficient)).
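For illustration only, the random setting of simulation parameters discussed above can be sketched as follows. This is a minimal sketch and is not taken from Ko, Variani, or the application: the class name RoomSimParams, the function names, the sampling ranges, and the margin value are all hypothetical. The conversion from an absorption coefficient to a reflection coefficient assumes the common convention that absorption is an energy quantity, so that reflection = sqrt(1 - absorption); Ko may use a different convention.

```python
import random
from dataclasses import dataclass


@dataclass
class RoomSimParams:
    """Hypothetical container for one randomly drawn simulation setup."""
    length: float          # room length in meters
    width: float           # room width in meters
    height: float          # room height in meters
    reflection: float      # wall reflection coefficient, between 0 and 1
    mic_array_pos: tuple   # (x, y, z) of the microphone array
    target_pos: tuple      # (x, y, z) of the target sound source
    noise_pos: tuple       # (x, y, z) of the noise source


def sample_position(rng, dims, margin=0.2):
    """Draw a point inside the room, at least `margin` meters from each wall."""
    return tuple(rng.uniform(margin, d - margin) for d in dims)


def sample_room_params(rng):
    """Randomly set room configuration and positions (all ranges are assumed)."""
    dims = (
        rng.uniform(3.0, 10.0),  # length
        rng.uniform(3.0, 8.0),   # width
        rng.uniform(2.5, 4.0),   # height
    )
    absorption = rng.uniform(0.2, 0.8)
    # Assumed convention: energy absorption -> pressure reflection coefficient.
    reflection = (1.0 - absorption) ** 0.5
    return RoomSimParams(
        length=dims[0],
        width=dims[1],
        height=dims[2],
        reflection=reflection,
        mic_array_pos=sample_position(rng, dims),
        target_pos=sample_position(rng, dims),
        noise_pos=sample_position(rng, dims),
    )


params = sample_room_params(random.Random(0))
```

Passing a seeded random.Random instance keeps each draw reproducible while still allowing a fresh configuration per training utterance.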

Regarding claim 11, Variani does not teach the limitations of claim 11. Ko teaches wherein simulating the multi-channel far-field voice signals in real time through the random noise addition comprises: generating, based on the simulation parameters, a first group of impulse responses for the near-field voice signals (Ko page 5221, algorithm 1, for each recording of speech x(t) (near-field voice signals) an impulse response is sampled against the probability distribution of RIRs given the room (based on the room simulation parameters)) and a second group of impulse responses for noise signals randomly selected (Ko page 5221, algorithm 1, for a point noise source (noise signals), an impulse response is sampled against the probability distribution of RIRs given the room, where the point source noise is sampled according to a probability distribution of different point-source noises (noise signals randomly selected)).
Therefore, taking the teachings of Variani and Ko together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the neural network processing disclosed in Variani to include the RIR simulations and algorithm 1 processing as disclosed in Ko at least because doing so would provide a way to robustly train an acoustic model without costly training data acquisition (Ko Abstract and Introduction).
Regarding claim 12, Variani does not teach the limitations of claim 12. Ko teaches wherein simulating the multi-channel far-field voice signals in real time through the random noise addition comprises: generating the multi-channel far-field voice signals based on the near-field voice signals, the first group of impulse responses, the noise signals, the second group of impulse responses, and a signal-to-noise ratio (Ko page 5221, section 2, algorithm 1 detailing the simulation of far-field speech, including sampling point-source noise and randomly selecting an offset thereto; sampling impulse responses for the input speech database samples (first group of impulse responses); sampling an isotropic noise along with the point-source noises (the noise signals); sampling an impulse response for the point noise source against the probability distribution of RIRs given the room (second group of impulse responses); and sampling an SNR (signal-to-noise ratio) from a probability distribution of SNRs).
Therefore, taking the teachings of Variani and Ko together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the neural network processing disclosed in Variani to include the RIR simulations and algorithm 1 processing as disclosed in Ko at least because doing so would provide a way to robustly train an acoustic model without costly training data acquisition (Ko Abstract and Introduction).
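For illustration only, the combination of near-field voice signals, two groups of impulse responses, noise signals, and a signal-to-noise ratio discussed regarding claim 12 can be sketched as follows. This is a minimal sketch under stated assumptions and is not a reproduction of Ko's algorithm 1 or of the claimed method: the function names are hypothetical, one impulse response per microphone channel is assumed in each group, the noise gain is computed independently for each channel, and the noise is assumed to have non-zero power.

```python
import math


def convolve(signal, impulse_response):
    """Direct-form linear convolution of two sample lists."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def power(samples):
    """Mean squared amplitude of a sample list."""
    return sum(v * v for v in samples) / len(samples)


def simulate_far_field(near_field, speech_rirs, noise, noise_rirs, snr_db):
    """Return one simulated far-field signal per microphone channel.

    speech_rirs: first group of impulse responses (one per channel, assumed)
    noise_rirs:  second group of impulse responses (one per channel, assumed)
    """
    channels = []
    for h_speech, h_noise in zip(speech_rirs, noise_rirs):
        rev_speech = convolve(near_field, h_speech)
        rev_noise = convolve(noise, h_noise)
        n = min(len(rev_speech), len(rev_noise))
        rev_speech, rev_noise = rev_speech[:n], rev_noise[:n]
        # Scale the noise so the speech-to-noise power ratio equals snr_db.
        target_ratio = 10.0 ** (snr_db / 10.0)
        gain = math.sqrt(power(rev_speech) / (power(rev_noise) * target_ratio))
        channels.append([s + gain * v for s, v in zip(rev_speech, rev_noise)])
    return channels


far_field = simulate_far_field(
    near_field=[1.0, -1.0, 0.5, 0.25],
    speech_rirs=[[1.0, 0.5], [0.8, 0.3]],
    noise=[0.3, -0.2, 0.1, 0.4],
    noise_rirs=[[0.6, 0.2], [0.5, 0.1]],
    snr_db=10.0,
)
```

Computing the gain per channel is a simplifying choice; it sets the same SNR on every channel but does not preserve relative noise levels across microphones. Deriving a single gain from a reference channel would be an alternative.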


Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
Allen et al., "Image method for efficiently simulating small‐room acoustics," The Journal of the Acoustical Society of America 65, 943 (1979). Allen is directed towards modeling the effects of reverberation in a room, and details the parameters and various mathematical relationships between these parameters to define a room reverberation model.
Choi et al., "Phase-aware Speech Enhancement With Deep Complex U-Net," March 7, 2019, arXiv:1903.03107v1 [cs.SD]. Choi is directed towards a complex convolutional neural network that processes speech for speech enhancement and includes a complex convolution operation.
Li et al., US 2021/0383795 A1, directed towards a voice recognition apparatus and method that acquires voice data and can process the acquired voice data using a far-field voice recognition model.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHELLE M KOETH whose telephone number is (571)272-5908. The examiner can normally be reached Monday-Friday, 09:30-18:30 EDT/EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached on 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197.

MICHELLE M. KOETH
Primary Examiner
Art Unit 2656



/MICHELLE M KOETH/Primary Examiner, Art Unit 2656