DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment
In response to the Office Action mailed August 20, 2020, applicant submitted an amendment filed on December 15, 2020, in which the applicant amended the claims and requested reconsideration.

Response to Arguments
Applicants argue that the prior art cited fails to teach the claims as amended.  Applicants’ arguments are persuasive, but are moot in view of new grounds of rejection.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.



Claims 1-3, 5-6 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Hwang et al. (PGPUB 2016/0086602), hereinafter referenced as Hwang, in view of Bisani et al. (PGPUB 2015/0095026), hereinafter referenced as Bisani.
Regarding claim 1, Hwang discloses a speech recognition system, comprising: 
a plurality of microphones to receive acoustic signals including speech signals (plurality of microphones; paragraphs 0053, 0123); 
an input interface to generate multichannel inputs from the acoustic signals (multiple channel; paragraphs 0063-0070); 
one or more storages to store neural networks (paragraphs 0061-0068) including a multichannel speech recognition network (voice recognition; paragraph 0156), wherein the multichannel speech recognition network comprises: 
an encoder-decoder network trained to transform the enhanced speech data into a text (STT; paragraph 0140); 
one or more processors, using the multichannel speech recognition network in association with the one or more storages, to generate the text from the multichannel inputs (STT; paragraph 0140); and 
an output interface to render the text (output; abstract with paragraphs 0072-0077), but does not specifically teach the claimed beamformer network.
Bisani discloses a system comprising:
a beamformer network trained to generate an enhanced speech, wherein the beamformer network is optimized for best accuracy of the text output of the encoder-decoder network (paragraphs 0017-0018, 0032, 0060-0069), to provide multi-channel enhancements.
Therefore, it would have been obvious to one of ordinary skill in the art to modify the system of Hwang with the beamformer network of Bisani, to improve performance.
Regarding claim 2, Hwang discloses a system wherein the mask estimation networks include a first mask network and a second mask network, wherein the first mask network is trained to generate speech masks for the multichannel inputs (multiple channel; paragraphs 0063-0070) and the second mask network is trained to generate noise masks for the multichannel inputs (mask noise; paragraph 0132). 
Regarding claim 3, Hwang discloses a system wherein the first and second mask networks are integrated with the beamformer network (beamforming; paragraphs 0062-0070, 0089). 
Regarding claim 5, Hwang discloses a system wherein the beamformer network uses frequency-domain datasets (paragraphs 0024-0026, 0056-0057, 0062-0070, 0129, 0145-0146). 
Regarding claim 6, Hwang discloses a system wherein the multichannel speech recognition network includes a first feature extractor to extract signal features from the multichannel inputs based on short-term Fourier-transformation algorithm (paragraphs 0057, 0128 and 0145). 
Regarding claim 14, Hwang discloses a system wherein the neural network is trained in end-to-end fashion to reduce an error between a recognition of the noisy multi-channel speech signal (mask noise; paragraph 0132) and a ground truth text (STT; paragraph 0140) corresponding to the noisy multi-channel speech signal (multiple channel; paragraphs 0063-0070). 

Claim 1 is alternatively rejected under 35 U.S.C. 103 as being unpatentable over Hwang et al. (PGPUB 2016/0086602), hereinafter referenced as Hwang, in view of Ayrapetian et al. (PGPUB 2017/0178662), hereinafter referenced as Ayrapetian.

Regarding claim 1, Hwang discloses a speech recognition system, comprising: 
a plurality of microphones to receive acoustic signals including speech signals (plurality of microphones; paragraphs 0053, 0123); 
an input interface to generate multichannel inputs from the acoustic signals (multiple channel; paragraphs 0063-0070); 
one or more storages to store neural networks (paragraphs 0061-0068) including a multichannel speech recognition network (voice recognition; paragraph 0156), wherein the multichannel speech recognition network comprises: 
an encoder-decoder network trained to transform the enhanced speech data into a text (STT; paragraph 0140); 
one or more processors, using the multichannel speech recognition network in association with the one or more storages, to generate the text from the multichannel inputs (STT; paragraph 0140); and 
an output interface to render the text (output; abstract with paragraphs 0072-0077), but does not specifically teach the claimed beamformer network.
Ayrapetian discloses a system comprising:
a beamformer network trained to generate an enhanced speech, wherein the beamformer network is optimized for best accuracy of the text output of the encoder-decoder network.
Therefore, it would have been obvious to one of ordinary skill in the art to modify the system of Hwang with the beamformer network of Ayrapetian, to improve performance.


Claims 4 and 10 are rejected under 35 U.S.C. 103 as being unpatentable over Hwang in view of Bisani and in further view of Chorowski et al. (End-to-end continuous speech recognition using attention-based RNN: First results), hereinafter referenced as Chorowski.

Regarding claim 4, Hwang and Bisani disclose a system as described above, but do not specifically teach wherein the encoder-decoder network is an attention-based encoder-decoder network. 
Chorowski discloses a system wherein the encoder-decoder network is an attention-based encoder-decoder network (abstract and introduction, sections 1.2, 2 and 3.2), to provide good recognition performance.
Therefore, it would have been obvious to one of ordinary skill in the art to modify the system as described above, to ease training.
Regarding claim 10, it is interpreted and rejected for similar reasons as set forth above.  In addition, Chorowski discloses a system wherein the mask estimation networks are bi-directional long-short term memory recurrent neural networks (abstract and introduction, sections 1.2, 2 and 3.2). 

Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over Hwang in view of Bisani and in further view of Viriani et al. (USPN 10,224,058), hereinafter referenced as Viriani.

Regarding claim 7, Hwang and Bisani disclose a system as described above, but do not specifically teach a system wherein the first feature extractor uses log Mel filterbank coefficients for the signal features. 
Viriani discloses a system wherein the first feature extractor uses log Mel filterbank coefficients for the signal features (column 21, lines 30-67), to improve accuracy.
Therefore, it would have been obvious to one of ordinary skill in the art to modify the system as described above, to improve the learning process and performance.

Claims 9 and 11-13 are rejected under 35 U.S.C. 103 as being unpatentable over Hwang in view of Bisani and in further view of Kim et al. (PGPUB 2014/0337021), hereinafter referenced as Kim.

Regarding claim 9, Hwang and Bisani disclose a system as described above, but do not specifically teach a system wherein the beamformer network uses speech power spectral density (PSD) matrices. 
Kim discloses a system wherein the beamformer network uses speech power spectral density (PSD) matrices (paragraph 0030), to increase speech intelligibility.

Regarding claim 11, it is interpreted and rejected for similar reasons as set forth above.  In addition, Kim discloses a system wherein the multichannel speech recognition network further comprises a first feature extractor connected to the mask estimation networks, wherein the first feature extractor is a differentiable function (paragraphs 0030-0034, 0041-0052, 0122). 
Regarding claim 12, it is interpreted and rejected for similar reasons as set forth above.  In addition, Kim discloses a system wherein the differentiable function is a bark function of a magnitude of the channel signal (paragraphs 0037, 0088). 
Regarding claim 13, it is interpreted and rejected for similar reasons as set forth above.  In addition, Kim discloses a system wherein the input interface is an array of microphones, and wherein the output interface includes a display device (paragraph 0127). 

Claim 16 is rejected under 35 U.S.C. 103 as being unpatentable over Hwang in view of Bisani and in further view of Saric et al. (USPN 9,215,527), hereinafter referenced as Saric.

Regarding claim 16, it is interpreted and rejected for similar reasons as set forth above; however, Hwang and Bisani do not specifically teach wherein the multichannel speech recognition network further comprises mask estimation networks configured to generate time-frequency masks from the multichannel inputs.

Therefore, it would have been obvious to one of ordinary skill in the art to modify the system as described above, to improve speech separation.

Claims 17-18 are rejected under 35 U.S.C. 103 as being unpatentable over Hwang in view of Bisani and in further view of Sundaram (USPN 9,972,339).

Regarding claim 17, it is interpreted and rejected for similar reasons as set forth above; however, Hwang and Bisani do not specifically teach wherein the beamformer network automatically selects a reference channel (or microphone) vector that is estimated using an attention mechanism of a neural network. 
Sundaram discloses a system wherein the beamformer network automatically selects a reference channel (or microphone) vector that is estimated using an attention mechanism of a neural network (column 13, line 50 – column 14, line 3 with column 18, line 41 – column 19, line 12), to select desired audio.
Therefore, it would have been obvious to one of ordinary skill in the art to modify the system as described above, to provide a neural network that concentrates on particular data. 
Regarding claim 18, it is interpreted and rejected for similar reasons as set forth above.  In addition, Sundaram discloses a system wherein the beamformer network and .

Allowable Subject Matter
Claim 15 is allowed.
	The following is a statement of reasons for allowance:
Independent claim 15 recites a medium for multichannel end-to-end speech recognition.  Prior art of record discloses a similar medium, but fails to teach the claims in combination with receiving multi-channel speech signals from an input interface; performing the speech recognition using a multichannel speech recognition neural network including a beamformer network trained to determine first microphone data sets the multi-channel signal into a single-channel signal, wherein the beamformer network and the encoder-decoder network are jointly optimized, and a recognition sub-network trained to recognize text from speech features of the single-channel signal, wherein the enhancement sub-network and the recognition sub-network are jointly trained; and providing the recognized texts to an output interface.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.  This information has been detailed in the PTO 892 attached (Notice of References Cited).

THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to JAKIEDA R JACKSON whose telephone number is (571)272-7619.  The examiner can normally be reached on Mon - Fri 6:30a-2:30p.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel Washburn, can be reached at 571.272.5551.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.




/JAKIEDA R JACKSON/Primary Examiner, Art Unit 2657