DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Request for Continued Examination
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 12/14/2020 has been entered. 

Response to Amendments and Arguments
Regarding the obviousness-type double patenting rejection, applicant stated (Remarks, page 7) that applicant will file a terminal disclaimer if the double patenting rejection is the only remaining issue. 

During an interview conducted on 12/30/2020, applicant’s representative (Grant Griffith, Reg. 72,777) agreed to further distinguish the claims over the cited Sainath reference by adding a limitation requiring a stride parameter greater than one in the spatial filtering convolutional layer. The examiner believes that the scope of amended 
 
Regarding the rejection under 35 U.S.C. §103, applicant amended independent claims 2, 20 and 21 by adding several new limitations. Applicant argued (Remarks, pages 7-9) that the amendment overcame the rejection under §103. 

After reviewing the cited references, the examiner contacted applicant’s representative (Grant Griffith, Reg. 72,777). The examiner explained that the cited primary reference (Sainath, “Factored spatial and spectral multichannel raw waveform CLDNNs”, IEEE, 2016) was co-authored by a number of the inventors of the instant application. The neural network based speech recognition system shown in Fig. 1 of Sainath is the same as the system shown in Fig. 1 of the instant application. 

The examiner suggested adding a feature that reflects an improvement over the inventors’ previously published work.  Mr. Griffith sent a proposed amendment based on the suggestion. The examiner believed that the proposed amendment would be sufficient to distinguish the claims over the cited references. Mr. Griffith authorized the examiner to enter the proposed amendment. The rejection under 35 U.S.C. §103 has been withdrawn.  

Examiner’s Amendment
An examiner’s amendment to the record appears below. Should the changes and/or additions be unacceptable to applicant, an amendment may be filed as provided by 37 CFR 1.312. To ensure consideration of such an amendment, it MUST be submitted no later than the payment of the issue fee.

Authorization for this examiner’s amendment was given in a telephone interview with Mr. Grant Griffith (Reg. 72,777) on 12/30/2020.

Please replace all prior versions and listings of claims in the application with the listing of claims below:
 
1.	(Canceled)

2.	(Examiner Amendment) A method comprising:
receiving, at data processing hardware, a multi-channel audio input comprising a first audio signal and a second audio signal occurring during a same period of time;
obtaining, by the data processing hardware, a first time-domain representation of the first audio signal and a second time-domain representation of the second audio signal;
generating, by the data processing hardware, using a spatial filtering convolutional layer of a neural network configured to perform spatial filtering, a corresponding spatial filtered output for each of multiple spatial directions by processing the first time-domain representation of the first audio signal and the second time-domain representation of the second audio signal, the spatial filtering convolutional layer using a stride parameter being set to an integer greater than one;
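converting, by the data processing hardware, the corresponding spatial filtered output generated for each of the multiple spatial directions into corresponding frequency-domain data; and
processing, by the data processing hardware, using one or more additional neural network layers of the neural network, the corresponding frequency-domain data converted from the spatial filtered output generated for each of the multiple spatial directions to predict speech content encoded in the first audio signal and the second audio signal.
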
3.	(Previously Presented) The method of claim 2, wherein converting the corresponding spatial filtered output generated for each of the multiple spatial directions into corresponding frequency-domain data comprises computing a discrete Fourier transform for the corresponding spatial filtered output generated for each of the multiple spatial directions.

4.	(Previously Presented) The method of claim 3, wherein computing the discrete Fourier transform for the corresponding spatial filtered output generated for each of the multiple spatial directions comprises computing a fast Fourier transform for the corresponding spatial filtered output generated for each of the multiple spatial directions.

5.	(Previously Presented) The method of claim 2, wherein the neural network is part of a speech recognition model.

6.	(Previously Presented) The method of claim 2, wherein the neural network is part of an acoustic model configured to indicate probabilities of sub-word units.

7.	(Previously Presented) The method of claim 2, wherein the one or more additional neural network layers comprise one or more deep neural network layers that provide output to one or more long short-term memory layers.

8.	(Previously Presented) The method of claim 2, wherein the corresponding spatial filtered output generated for each of the multiple spatial directions comprises a single channel of time-domain data.

9.	(Previously Presented) The method of claim 2, wherein at least one additional neural network layer of the one or more additional neural network layers is configured to perform feature extraction.

10.	(Previously Presented) The method of claim 9, wherein the at least one additional neural network layer of the one or more additional neural network layers that is configured to perform feature extraction is also configured to apply a transformation to the corresponding frequency-domain data converted from the spatial filtered output generated for each of the multiple spatial directions.

11.	(Previously Presented) The method of claim 10, wherein the transformation is a linear transformation.

12.	(Previously Presented) The method of claim 10, wherein the transformation is a projection.

13.	(Previously Presented) The method of claim 10, wherein the transformation is a complex linear projection.

14.	(Previously Presented) The method of claim 10, wherein the transformation is a linear projection of energy.

15.	(Previously Presented) The method of claim 2, wherein the neural network comprises:
the spatial filtering convolutional layer;
at least one feature extraction neural network layer configured to determine frequency-based characteristics of the corresponding frequency-domain data converted from the spatially filtered output generated for each of the multiple spatial directions; and
one or more neural network layers configured to receive output of the at least one feature extraction neural network layer and determine speech content using one or more recurrent neural network layers and one or more deep neural network layers.

16.	(Previously Presented) The method of claim 2, further comprising: 
detecting, by the data processing hardware, the first audio signal and the second audio signal using multiple microphones of a computing device, 


17.	(Previously Presented) The method of claim 2, further comprising: 
detecting, by the data processing hardware, the first audio signal and the second audio signal using multiple microphones of a computing device;
wherein the neural network is stored or implemented on the computing device.
	
18.	(Previously Presented) The method of claim 2, wherein processing the corresponding frequency-domain data converted from the spatial filtered output generated for each of the multiple spatial directions comprises identifying a voice command indicated by the first audio signal and the second audio signal.

19.	(Previously Presented) The method of claim 2, wherein the spatial filtering convolutional layer and the one or more additional layers have been jointly trained during training of the neural network.

20.	(Examiner Amendment) A system comprising:
one or more computing devices; and
one or more computer-readable media storing instructions that, when executed by the one or more computing devices, cause the one or more computing devices to perform operations comprising:
receiving a multi-channel audio input comprising a first audio signal and a second audio signal occurring during a same period of time;
obtaining a first time-domain representation of the first audio signal and a second time-domain representation of the second audio signal;
generating, using a spatial filtering convolutional layer of a neural network configured to perform spatial filtering, a corresponding spatial filtered output for each of multiple spatial directions by processing the first time-domain representation of the first audio signal and the second time-domain representation of the second audio signal, the spatial filtering convolutional layer using a stride parameter being set to an integer greater than one;
converting the corresponding spatial filtered output generated for each of the multiple spatial directions into corresponding frequency-domain data; and

processing, using one or more additional neural network layers of the neural network, the corresponding frequency-domain data converted from the spatial filtered output generated for each of the multiple spatial directions to predict speech content encoded in the first audio signal and the second audio signal.

21.	(Examiner Amendment) One or more non-transitory computer-readable media storing instructions that, when executed by one or more computing devices, cause the one or more computing devices to perform operations comprising:
receiving a multi-channel audio input comprising a first audio signal and a second audio signal occurring during a same period of time;
obtaining a first time-domain representation of the first audio signal and a second time-domain representation of the second audio signal;
generating, using a spatial filtering convolutional layer of a neural network configured to perform spatial filtering, a corresponding spatial filtered output for each of multiple spatial directions by processing the first time-domain representation of the first audio signal and the second time-domain representation of the second audio signal, the spatial filtering convolutional layer using a stride parameter being set to an integer greater than one;
converting the corresponding spatial filtered output generated for each of the multiple spatial directions into corresponding frequency-domain data; and
processing, using one or more additional neural network layers of the neural network, the corresponding frequency-domain data converted from the spatial filtered output generated for each of the multiple spatial directions to predict speech content encoded in the first audio signal and the second audio signal.

Allowable Subject Matter
Claims 2-21 are allowed. 

The following is an examiner’s statement of reasons for allowance:


The claimed invention defined by independent claims 2, 20 and 21 is based on the description in the specification ([0096], [0120]) and the drawings (Fig. 6, which is an improvement to the system shown in Fig. 1). The claimed system uses time-domain convolution for spatial filtering and frequency-domain convolution for spectral filtering. The claimed invention also requires a stride parameter greater than one in the spatial filtering layer. This feature is described in the specification ([0096], [0110], [0120], [0124]-[0125]: using a stride size greater than one reduces computation cost).
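
For illustration only, the effect of the stride parameter can be sketched in a few lines of Python. The sketch is not taken from the specification or from Sainath. It assumes a simple filter-and-sum form of spatial filtering over two channels, with no padding, and all names and values (strided_spatial_filter, dft, the toy signals and 4-tap filters) are hypothetical. It shows only that a stride of 2 evaluates the filter at every second sample position, so fewer outputs are computed and passed on to the frequency-domain conversion.

```python
import cmath


def strided_spatial_filter(x1, x2, h1, h2, stride):
    """Filter-and-sum two time-domain channels for one look direction.

    Hypothetical helper: applies one FIR filter per channel (as a
    cross-correlation, the usual form in convolutional layers) and sums the
    results. Outputs are computed only at start positions 0, stride,
    2*stride, ..., with no padding.
    """
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    n_taps = len(h1)
    out = []
    for start in range(0, len(x1) - n_taps + 1, stride):
        acc = 0.0
        for k in range(n_taps):
            acc += h1[k] * x1[start + k] + h2[k] * x2[start + k]
        out.append(acc)
    return out


def dft(signal):
    """Naive discrete Fourier transform; an FFT computes the same values."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
        for f in range(n)
    ]


# Assumed toy input: 16 samples per channel, 4-tap filters, 2 look directions.
x1 = [float(t % 4) for t in range(16)]
x2 = [float((t + 1) % 4) for t in range(16)]
filters = [
    ([0.25] * 4, [0.25] * 4),   # direction 0: sums the two channels
    ([0.25] * 4, [-0.25] * 4),  # direction 1: subtracts the two channels
]

outputs_stride1 = [strided_spatial_filter(x1, x2, h1, h2, 1) for h1, h2 in filters]
outputs_stride2 = [strided_spatial_filter(x1, x2, h1, h2, 2) for h1, h2 in filters]

# One frequency-domain vector per look direction, from the stride-2 outputs.
spectra = [dft(y) for y in outputs_stride2]

# 13 output samples per direction at stride 1, 7 at stride 2.
print(len(outputs_stride1[0]), len(outputs_stride2[0]))
```

With an input of T samples, an N-tap filter and stride s, this sketch produces floor((T - N) / s) + 1 output samples per direction, so the number of multiply-accumulate operations in the spatial filtering step falls roughly in proportion to s.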

Although the Sainath reference discloses a neural network based speech recognition system, Sainath uses a stride size equal to one (page 5076, section 2.2) and does not disclose using a stride size greater than one. Therefore, Sainath fails to disclose the following limitation recited in each of the independent claims:

“the spatial filtering convolutional layer using a stride parameter being set to an integer greater than one”

As a whole, the prior art of record, either alone or in combination, does not teach or suggest the above limitation.  Therefore, the prior art of record fails to anticipate or render obvious the claimed invention.

Any comments considered necessary by applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee.  Such submissions should be clearly labeled “Comments on Statement of Reasons for Allowance.”

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Jialong He, whose telephone number is (571) 270-5359.  The examiner can normally be reached on Monday – Friday, 8:00AM – 4:30PM, EST.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Desir, can be reached on (571) 272-7799.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.




/JIALONG HE/Primary Examiner, Art Unit 2659