DETAILED ACTION
Introduction
1.	This office action is in response to Applicant’s submission filed on 3/23/2021.   Claims 1-48 are pending in the application and have been examined.

Notice of Pre-AIA or AIA Status
2.	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Drawings
3.	The drawings filed on 3/23/2021 have been accepted and considered by the Examiner.

Information Disclosure Statement
4.	The information disclosure statements (IDSs) submitted on March 23, 2021, August 25, 2021, and November 19, 2021 are in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statements are being considered by the examiner.

Claim Rejections - 35 USC § 102
5.	In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.


6.	Claims 1-5, 8, 9, 12-22, 24-28, 31, 32, 35-45, 47, and 48 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by US Pat. No. 10,224,058 (Variani et al., hereinafter “Var”) (cited in IDS dated 11/19/2021).
With regard to Claim 1, Var describes:
“A method for recognizing a voice, the method comprising:
inputting a target voice into a pre-trained voice recognition model to obtain an initial text output by at least one recognition network in the voice recognition model, (Column 2, lines 1-9 describes that sub-word units (text) are determined based on raw audio data)
the recognition network comprising a plurality of preset types of processing layers, (Column 2, lines 1-9 describes that a neural network including multiple layers is used) and
at least one type of processing layer of the recognition network being obtained by training based on a voice sample in a preset direction interval; (Column 6, lines 16-20 describe that the training includes training a filter for a direction in space) and
determining a voice recognition result of the target voice, based on the initial text.” (Column 25, lines 51-56 describes that a device may issue a command based on the sub-word units (text) determined.)
With regard to Claim 2, Var describes “in response to the voice recognition model being a first voice recognition model, the at least one recognition network is one recognition network.”  Column 2, lines 1-9 describes that a neural network is used as a recognition network.
With regard to Claim 3, Var describes “wherein the first voice recognition model further comprises a Fourier transform network; (Column 2, lines 58-65 describes that the model includes a Fourier transform network)
the inputting a target voice into a pre-trained voice recognition model to obtain an initial text output by at least one recognition network in the voice recognition model, (Column 2, lines 1-9 describes that sub-word units (text) are determined based on raw audio data) comprises:
inputting the target voice into a pre-trained first voice recognition model, (Column 2, lines 1-9 describes that sub-word units (text) are determined based on raw audio data) and
performing Fourier transform on the target voice using the Fourier transform network to obtain a transformed voice; (Column 2, lines 58-65 describes that the model performs a Fourier transform on the input voice)
predicting a text corresponding to the transformed voice using the one recognition network to obtain the initial text; (Column 2, lines 1-9 describes that sub-word units (text) are determined based on raw audio data) and
the determining a voice recognition result of the target voice, based on the initial text, comprises:
determining the initial text as the voice recognition result of the target voice.” (Column 25, lines 51-56 describes that a device may issue a command based on the sub-word units (text) determined.)
With regard to Claim 4, Var describes “wherein in response to the voice recognition model being a second voice recognition model, the at least one recognition network is a plurality of recognition networks; and the plurality of recognition networks respectively correspond to a plurality of preset direction intervals.” Column 6, lines 16-20 describe that the training includes training a filter for each of a plurality of directions in space.
With regard to Claim 5, Var describes “wherein the second voice recognition model further comprises a Fourier transform network; (Column 2, lines 58-65 describes that the model includes a Fourier transform network)
the inputting a target voice into a pre-trained voice recognition model to obtain an initial text output by at least one recognition network in the voice recognition model, (Column 2, lines 1-9 describes that sub-word units (text) are determined based on raw audio data) comprises:
inputting the target voice into a pre-trained second voice recognition model, and performing Fourier transform on the target voice using the Fourier transform network to obtain a transformed voice; (Column 2, lines 58-65 describes that the model performs a Fourier transform on the input voice) and
inputting the transformed voice into each recognition network of the plurality of recognition networks to obtain the initial text output by each recognition network.” (Claim 1 of Var describes that the sub-word units are determined for each of a plurality of directions.)
With regard to Claim 8, Var describes “wherein in response to the voice recognition model being a third voice recognition model, the recognition network comprises an omnidirectional network and a plurality of directional networks, (Column 6, lines 9-13 describes that the neural network combines multiple inputs into a single channel (cited as “an omnidirectional network”).  Column 6, lines 16-20 describe that the training includes training a filter for each of a plurality of directions in space.)
any one of the directional networks and the omnidirectional network comprise a plurality of preset types of processing layers; (Column 2, lines 1-9 describes that a neural network including multiple layers is used) and
the plurality of directional networks respectively correspond to a plurality of preset direction intervals.” (Column 6, lines 16-20 describe that the training includes training a filter for each of a plurality of directions in space.)
With regard to Claim 9, Var describes “wherein the third voice recognition model further comprises a Fourier transform network; (Column 2, lines 58-65 describes that the model includes a Fourier transform network)
the inputting a target voice into a pre-trained voice recognition model to obtain an initial text output by at least one recognition network in the voice recognition model, (Column 2, lines 1-9 describes that sub-word units (text) are determined based on raw audio data) comprises:
inputting the target voice into a pre-trained third voice recognition model, and performing Fourier transform on the target voice using the Fourier transform network to obtain a transformed voice; (Column 2, lines 58-65 describes that the model performs a Fourier transform on the input voice)
inputting the transformed voice into the omnidirectional network to obtain a voice feature output by the omnidirectional network; (Column 6, lines 9-13 describes that the neural network combines multiple inputs into a single channel (cited as “an omnidirectional network”).  Column 6, lines 16-20 describe that the training includes training a filter for each of a plurality of directions in space.) and
inputting the voice feature into each directional network of the plurality of directional networks to obtain the initial text output by the each directional network.” (Column 6, lines 39-65 describe that multiple input channels C are merged into a single input, which is then input into P directional filters.)
With regard to Claim 12, Var describes:
“A method for training a voice recognition model, the method comprising:
acquiring a training sample, a voice sample for training in the training sample comprising a voice sample in a preset direction interval; (Column 11, lines 53 and 54 describe that the training process receives an audio sample as input)
inputting the voice sample for training into a to-be-trained voice recognition model to obtain an initial text output by at least one recognition network in the voice recognition model, (Column 13, lines 17-27 describes that the training process can include predicting sub-word units (text))
the recognition network comprising a plurality of preset types of processing layers; (Column 2, lines 1-9 describes that a neural network including multiple layers is used)  and
training the voice recognition model to obtain a trained voice recognition model, based on the initial text.” (Column 13, lines 17-27 describes that the training process can include predicting sub-word units (text))
With regard to Claim 13, Var describes “in response to the voice recognition model being a first voice recognition model, the at least one recognition network is one recognition network.”  Column 2, lines 1-9 describes that a neural network is used as a recognition network.
With regard to Claim 14, Var describes “wherein the first voice recognition model further comprises a Fourier transform network;
the acquiring a training sample, comprises:
acquiring a first training sample, wherein the first training sample comprises a first voice sample in one preset direction interval; (Column 11, lines 53 and 54 describe that the training process receives an audio sample as input) and
the inputting the voice sample for training into the to-be trained voice recognition model to obtain an initial text output by at least one recognition network in the voice recognition model, (Column 13, lines 17-27 describes that the training process can include predicting sub-word units (text)) comprises:
inputting the first voice sample into the first voice recognition model, and performing Fourier transform on the first voice sample using the Fourier transform network to obtain a transformed voice; (Column 2, lines 58-65 describes that the model performs a Fourier transform on the input voice)  and
inputting the transformed sample into the one recognition network to obtain an initial text predicting a text corresponding to the first voice sample.” (Column 13, lines 17-27 describes that the training process can include predicting sub-word units (text))
With regard to Claim 15, Var describes “wherein the training the voice recognition model to obtain a trained voice recognition model, based on the initial text, comprises: 
determining a loss value of the voice recognition result using the initial text as a voice recognition result, (Column 10, lines 15-20 describes that the cross entropy may be used in training (“cross entropy” is cited as “a loss value”). Column 13, lines 17-27 describes that the training process can include predicting sub-word units (text)) and
performing back propagation in the first voice recognition model using the loss value to update a parameter in the first voice recognition model to obtain a trained first voice recognition model.” (Column 9, lines 17-20 and Equation 2 show that the back propagation in the training is done with the cross entropy loss (the p(s | v_t) term in Equation 2))
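For illustration only, the relationship between a cross entropy loss value and a back propagation update, as recited in Claim 15, can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Var's Equation 2 or from the application: the function names, the three-class example, the learning rate, and the simplification of treating the output logits themselves as the updated parameter are all hypothetical.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution over sub-word units."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Loss value: negative log probability assigned to the correct sub-word unit."""
    return -math.log(softmax(logits)[target])

def grad_wrt_logits(logits, target):
    """Gradient of the cross entropy loss with respect to each logit (softmax minus one-hot)."""
    probs = softmax(logits)
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]

# Hypothetical example: three candidate sub-word units, unit 0 is the correct label.
logits = [2.0, 0.5, -1.0]
target = 0
learning_rate = 0.5  # assumed value for illustration

loss_before = cross_entropy(logits, target)
grad = grad_wrt_logits(logits, target)

# One gradient-descent update; here the logits stand in for the model parameter.
updated = [z - learning_rate * g for z, g in zip(logits, grad)]
loss_after = cross_entropy(updated, target)
```

In this sketch a single update lowers the loss value for the correct sub-word unit, which is the general behavior a loss-driven parameter update is expected to have.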
With regard to Claim 16, Var describes “wherein in response to the voice recognition model being a second voice recognition model, the at least one recognition network is a plurality of recognition networks; and the plurality of recognition networks respectively correspond to a plurality of preset direction intervals.” (Column 6, lines 16-20 describe that the training includes training a filter for each of a plurality of directions in space.)
With regard to Claim 17, Var describes “the second voice recognition model further comprises a Fourier transform network; (Column 2, lines 58-65 describes that the model performs a Fourier transform on the input voice)
the acquiring a training sample, comprises:
acquiring a plurality of second training samples, wherein each of the second training samples comprises a second voice sample in one of the plurality of direction intervals, and in a plurality of second voice samples comprised in the plurality of second training samples, a second voice sample in each of the direction intervals is comprised; (Column 9, lines 20-26 describes that first and second training samples can be obtained.  The second training samples will all be in some particular direction.)
and the inputting the voice sample for training into the to-be trained voice recognition model to obtain an initial text output by at least one recognition network in the voice recognition model, comprises:
inputting the second voice sample into the Fourier transform network to obtain a transformed sample; (Column 2, lines 58-65 describes that the model performs a Fourier transform on the input voice)
for each recognition network of the plurality of recognition networks, inputting the transformed sample into the each recognition network, to obtain an initial text predicting the second voice sample and being output by the each recognition network, in response to the second voice sample being in a direction interval corresponding to the each recognition network.” (Column 9, lines 26-32 describes that the samples are processed by the network. Column 13, lines 17-27 describes that the training process can include predicting sub-word units (text))
With regard to Claim 18, Var describes “wherein the training the voice recognition model to obtain a trained voice recognition model, based on the initial text, comprises: 
determining, for the initial text corresponding to each recognition network, a loss value of the initial text, (Column 10, lines 15-20 describes that the cross entropy may be used in training (“cross entropy” is cited as “a loss value”). Column 13, lines 17-27 describes that the training process can include predicting sub-word units (text)) and
performing back propagation in the recognition network using the loss value to update a parameter in the recognition network; (Column 9, lines 17-20 and Equation 2 show that the back propagation in the training is done with the cross entropy loss (the p(s | v_t) term in Equation 2)) and
using a second voice recognition model comprising the plurality of recognition networks with updated parameters as a trained second voice recognition model.” (Column 6, lines 16-20 describe that the training includes training a filter for each of a plurality of directions in space.)
With regard to Claim 19, Var describes “in response to the voice recognition model being a third voice recognition model, the recognition network comprises an omnidirectional network and a plurality of directional networks, (Column 6, lines 9-13 describes that the neural network combines multiple inputs into a single channel (cited as “an omnidirectional network”).  Column 6, lines 16-20 describe that the training includes training a filter for each of a plurality of directions in space.)
any one of the directional networks and the omnidirectional network comprise a plurality of preset types of processing layers; (Column 2, lines 1-9 describes that a neural network including multiple layers is used) and
the plurality of directional networks respectively correspond to a plurality of preset direction intervals.” (Column 6, lines 16-20 describe that the training includes training a filter for each of a plurality of directions in space.)
With regard to Claim 20, Var describes “wherein a network structure of the third voice recognition model for training comprises a voice directional layer; (Column 6, lines 16-20 describe that the training includes training a filter for each of a plurality of directions in space.)
the acquiring a training sample, comprises:
acquiring a training sample of a third voice sample included in the plurality of direction intervals; (Column 9, lines 20-26 describes that first and second training samples can be obtained.  The second training samples will all be in some particular direction.) and
the inputting the voice sample for training into the to-be-trained voice recognition model to obtain an initial text output by at least one recognition network in the voice recognition model, comprises:
inputting the third voice sample into the Fourier transform network to obtain a transformed third sample, (Column 2, lines 58-65 describes that the model performs a Fourier transform on the input voice)
wherein the third voice sample comprises a sub-voice in at least one direction interval; (Column 9, lines 20-26 describes that noise may be included in some samples (cited as “a sub-voice”))
inputting the transformed third sample into the omnidirectional network to obtain a voice feature output by the omnidirectional network; (Column 9, lines 38-50 describes that features are determined by the network)
determining, using the voice directional layer, in the voice feature, a sub-voice feature corresponding to a sub-voice in any one of the plurality of direction intervals, and using a directional network corresponding to the any one of the directional intervals as a to-be-input directional network of the sub-voice feature; (Column 9, lines 38-50 describes that features are determined by the network. Column 6, lines 16-20 describe that the training includes training a filter for each of a plurality of directions in space.) and
inputting the sub-voice feature into the to-be-input directional network, to obtain an initial text predicting the third voice sample and being output by the to-be-input directional network.” (Column 9, lines 26-32 describes that the samples are processed by the network. Column 13, lines 17-27 describes that the training process can include predicting sub-word units (text))
With regard to Claim 21, Var describes “wherein the training the voice recognition model to obtain a trained voice recognition model, based on the initial text, comprises: 
determining, for the initial text corresponding to each directional network, a loss value of the initial text, (Column 10, lines 15-20 describes that the cross entropy may be used in training (“cross entropy” is cited as “a loss value”). Column 13, lines 17-27 describes that the training process can include predicting sub-word units (text)) and
performing back propagation in the third voice recognition model based on the loss value to update a parameter in the third voice recognition model.” (Column 9, lines 17-20 and Equation 2 show that the back propagation in the training is done with the cross entropy loss (the p(s | v_t) term in Equation 2))
With regard to Claim 22, Var describes “the third voice recognition model further comprises a direction interval determination module; (Column 18, lines 27-30 describes that a target direction is determined by the deep neural network.)
the performing back propagation in the third voice recognition model based on the loss value to update a parameter in the third voice recognition model, comprises:
for each directional network, performing back propagation in the each directional network, by using the loss value obtained using the each directional network, to obtain a back propagation result; (Column 9, lines 17-20 and Equation 2 show that the back propagation in the training is done with the cross entropy loss (the p(s | v_t) term in Equation 2))
combining back propagation results corresponding to the plurality of directional networks using the direction interval determination module, to obtain a propagation result set; (Column 9, lines 17-20 and Equation 2 show that the back propagation in the training is done with the cross entropy loss (the p(s | v_t) term in Equation 2) and the features.  The cross entropy loss and the features together are cited as “a propagation result set.”) and
performing back propagation in the omnidirectional network to update a parameter of the omnidirectional network and parameters of the plurality of directional networks, by using the combined propagation result set.” (Column 9, lines 17-20 and Equation 2 show that the back propagation in the training is done with the cross entropy loss (the p(s | v_t) term in Equation 2) and the features.  The cross entropy loss and the features together are cited as “a propagation result set.”)
With respect to Claims 24-28, 31, and 32, apparatus Claim 24 and method Claim 1 are related as an apparatus programmed to perform the same method, with each claimed apparatus function corresponding to each claimed method step. Further, column 28, lines 15-17 of Var describes processor 852 that executes a program stored in memory 864.  Accordingly, Claims 24-28, 31, and 32 are similarly rejected under the same rationale as applied above with respect to Claims 1-5, 8, and 9.
With respect to Claims 35-45, apparatus Claim 35 and method Claim 12 are related as an apparatus programmed to perform the same method, with each claimed apparatus function corresponding to each claimed method step. Further, column 28, lines 15-17 of Var describes processor 852 that executes a program stored in memory 864.  Accordingly, Claims 35-45 are similarly rejected under the same rationale as applied above with respect to Claims 12-22.
With regard to Claim 47, Var describes “A non-transitory computer readable storage medium, storing a computer program thereon, the program, when executed by a processor, implements the method according to claim 1.”  Column 28, lines 15-17 describes processor 852 that executes a program stored in memory 864.  
With regard to Claim 48, Var describes “A non-transitory computer readable storage medium, storing a computer program thereon, the program, when executed by a processor, implements the method according to claim 12.”  Column 28, lines 15-17 describes processor 852 that executes a program stored in memory 864.

Claim Rejections - 35 USC § 103
7.	In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


8.	Claims 6, 7, 10, 11, 23, 29, 30, 33, 34, and 46 are rejected under 35 U.S.C. 103 as being unpatentable over Var in view of US Pat. App. Pub. No. 20190095430 (Smus et al., hereinafter “Smus”) and US Pat. No. 10,930,299 (Lu et al., hereinafter “Lu”).
With regard to Claim 6, Var does not explicitly describe the subject matter of this claim.
However, Smus describes: 
“after inputting a target voice into a pre-trained voice recognition model, the method further comprises:
obtaining a confidence of each initial text output by each of the recognition networks.”
Paragraph 26 of Smus describes that the device determines a probability that the text predicted is correct, which is cited as “a confidence.”
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the probabilities as described by Smus into the system of Var to compensate for possible recognition errors, as described at paragraph 26 of Smus.
Var in view of Smus does not explicitly describe:
“wherein the second voice recognition model further comprises a direction interval determination module;
the determining a voice recognition result of the target voice, based on the initial text, comprises:
determining a probability that the target voice has a sub-voice in each of the direction intervals corresponding to the plurality of recognition networks respectively, using the direction interval determination module;
weighting, for each initial text, the confidence of the each initial text output by the plurality of recognition networks, by using the probability corresponding to each recognition network as a weight of the confidence of the each initial text output by the each recognition network; and
using an initial text corresponding to a largest weighting result as the voice recognition result.”
However, Lu describes:
“wherein the second voice recognition model further comprises a direction interval determination module; (Source direction determination unit 803, Figure 8)
the determining a voice recognition result of the target voice, based on the initial text, comprises:
determining a probability that the target voice has a sub-voice in each of the direction intervals corresponding to the plurality of recognition networks respectively, using the direction interval determination module; (Column 7, lines 1-15 describes that the weights are determined based on how close a sample is to each direction.  The weight is “a probability.”)
weighting, for each initial text, the confidence of the each initial text output by the plurality of recognition networks, by using the probability corresponding to each recognition network as a weight of the confidence of the each initial text output by the each recognition network; (Column 7, lines 1-15 describes that the weights are applied to the output of each directional channel.) and
using an initial text corresponding to a largest weighting result as the voice recognition result.” (Column 7, lines 16-27 describes that the weights are iterated to determine a final directional result.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the direction determination algorithm as described by Lu into the system of Var in view of Smus to determine a direction of a source voice, as described at column 7, lines 25-27 of Lu.
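For illustration only, the weighting and selection steps recited in Claim 6 can be sketched as follows. This is a minimal sketch under stated assumptions, not an implementation disclosed by Var, Smus, or Lu: the function name, the example texts, and the numeric confidences and direction probabilities are hypothetical, and the weighting is assumed to be a simple product of confidence and probability.

```python
def select_recognition_result(initial_texts, confidences, direction_probs):
    """Pick the initial text whose probability-weighted confidence is largest.

    Each position corresponds to one recognition network and its direction interval.
    """
    if not initial_texts:
        raise ValueError("at least one recognition network output is required")
    if not (len(initial_texts) == len(confidences) == len(direction_probs)):
        raise ValueError("expected one text, confidence, and probability per network")

    # Weight each confidence by the probability that the target voice has a
    # sub-voice in the direction interval of the corresponding network.
    weighted = [c * p for c, p in zip(confidences, direction_probs)]

    # The text with the largest weighting result is the recognition result.
    best = max(range(len(weighted)), key=weighted.__getitem__)
    return initial_texts[best], weighted[best]

# Hypothetical outputs of three recognition networks.
texts = ["turn on the light", "turn on the night", "burn on the light"]
confidences = [0.80, 0.90, 0.60]        # assumed per-text confidences
direction_probs = [0.70, 0.20, 0.10]    # assumed per-interval probabilities

result, score = select_recognition_result(texts, confidences, direction_probs)
```

Note that in this example the second network reports the highest raw confidence, yet the first network's text is selected because its direction interval is the most probable source of the target voice.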
With regard to Claim 7, Var describes “the second voice recognition model further comprises a Fourier transform network; (Column 2, lines 58-65 describes that the model performs a Fourier transform on the input voice)
the inputting a target voice into a pre-trained voice recognition model to obtain an initial text output by at least one recognition network in the voice recognition model, (Column 2, lines 1-9 describes that sub-word units (text) are determined based on raw audio data) comprises:
inputting the target voice into the pre-trained second voice recognition model, and performing Fourier transform on the target voice using the Fourier transform network to obtain the transformed voice; (Column 2, lines 58-65 describes that the model performs a Fourier transform on the input voice) and
the deep neural network is configured to predict a direction interval of arrival of voice.” (Column 18, lines 27-30 describes that a target direction is determined by the deep neural network.)
However, Var does not explicitly describe:
“the determining a probability that the target voice has a sub-voice in each of direction intervals corresponding to the plurality of recognition networks respectively, using the direction interval determination module, comprises:
inputting the transformed voice into the direction interval determination module, and determining, by the direction interval determination module, the probability that the target voice has the sub-voice in each of the direction intervals corresponding to the plurality of recognition networks respectively, using a preset direction determination technology; 
wherein, the preset direction determination technology comprises an arrival direction estimation algorithm or a pre-trained deep neural network.”
However, Lu describes:
“the determining a probability that the target voice has a sub-voice in each of direction intervals corresponding to the plurality of recognition networks respectively, using the direction interval determination module, comprises:
inputting the transformed voice into the direction interval determination module, and determining, by the direction interval determination module, the probability that the target voice has the sub-voice in each of the direction intervals corresponding to the plurality of recognition networks respectively, using a preset direction determination technology; (Column 7, lines 1-15 describes that the weights are determined based on how close a sample is to each direction.  The weight is “a probability.”  The sound sample is the input and the algorithm is a preset direction determination technology.)
wherein, the preset direction determination technology comprises an arrival direction estimation algorithm or a pre-trained deep neural network.” (Column 7, lines 16-27 describes that the weights are iterated to determine a final directional result.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the direction determination algorithm as described by Lu into the system of Var in view of Smus to determine a direction of a source voice, as described at column 7, lines 25-27 of Lu.
With regard to Claim 10, Var does not explicitly describe the subject matter of this claim.
However, Smus describes: 
“after inputting the target voice into the pre-trained voice recognition model, the method further comprises:
obtaining a confidence of each initial text output by each of the directional networks.”
Paragraph 26 of Smus describes that the device determines a probability that the text predicted is correct, which is cited as “a confidence.”
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the probabilities as described by Smus into the system of Var to compensate for possible recognition errors, as described at paragraph 26 of Smus.
Var in view of Smus does not explicitly describe:
“the third voice recognition model further comprises a direction interval determination module;
the determining a voice recognition result of the target voice, based on the initial text, comprises:
determining a probability that the target voice has a sub-voice in each of the direction intervals corresponding to the plurality of directional networks respectively, using the direction interval determination module;
weighting, for each initial text, the confidence of the each initial text output by the plurality of directional networks, by using the probability corresponding to each directional network as a weight of the confidence of the each initial text output by the each directional network; and
using an initial text corresponding to a largest weighting result as the voice recognition result.”
However, Lu describes:
“the third voice recognition model further comprises a direction interval determination module; (Source direction determination unit 803, Figure 8)
the determining a voice recognition result of the target voice, based on the initial text, comprises:
determining a probability that the target voice has a sub-voice in each of the direction intervals corresponding to the plurality of directional networks respectively, using the direction interval determination module; (Column 7, lines 1-15 describes that the weights are determined based on how close a sample is to each direction.  The weight is “a probability.”)
weighting, for each initial text, the confidence of the each initial text output by the plurality of directional networks, by using the probability corresponding to each directional network as a weight of the confidence of the each initial text output by the each directional network; (Column 7, lines 1-15 describes that the weights are applied to the output of each directional channel.) and
using an initial text corresponding to a largest weighting result as the voice recognition result.” (Column 7, lines 16-27 describes that the weights are iterated to determine a final directional result.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the direction determination algorithm described by Lu into the system of Var in view of Smus in order to determine a direction of a source voice, as described at column 7, lines 25-27 of Lu.
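For illustration only, the weighting and selection steps recited in Claim 10 can be sketched as follows. This sketch is not taken from Var, Smus, or Lu. The function name, the example texts, and the numeric values are hypothetical, and summing the weighted confidences of identical initial texts is an assumption about how "weighting, for each initial text" may be carried out.

```python
def select_recognition_result(candidates, direction_probs):
    """Pick the initial text with the largest weighted confidence.

    candidates: list of (initial_text, confidence) pairs, one pair per
        directional network (hypothetical representation).
    direction_probs: probability that the target voice has a sub-voice in
        the direction interval of each directional network, in the same
        order as `candidates`.
    """
    if len(candidates) != len(direction_probs):
        raise ValueError("one direction probability is needed per directional network")

    weighted = {}
    for (text, confidence), prob in zip(candidates, direction_probs):
        # The direction probability acts as the weight of that network's confidence.
        # Assumption: identical texts from different networks are accumulated.
        weighted[text] = weighted.get(text, 0.0) + prob * confidence

    # The text with the largest weighting result is the recognition result.
    best_text = max(weighted, key=weighted.get)
    return best_text, weighted


# Hypothetical outputs of three directional networks.
candidates = [
    ("turn on the light", 0.9),
    ("turn on the night", 0.6),
    ("turn on the light", 0.5),
]
direction_probs = [0.2, 0.7, 0.1]
best_text, weighted = select_recognition_result(candidates, direction_probs)
print(best_text)
```

In this hypothetical input, the second network reports a lower confidence than the first, but its direction interval carries most of the probability, so its text receives the largest weighting result.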
With regard to Claim 11, Var describes “the third voice recognition model further comprises the Fourier transform network; (Column 2, lines 58-65 describes that the model performs a Fourier transform on the input voice)
the inputting a target voice into a pre-trained voice recognition model to obtain an initial text output by at least one recognition network in the voice recognition model, comprises: (Column 2, lines 1-9 describes that sub-word units (text) are determined based on raw audio data)
inputting the target voice into the pre-trained third voice recognition model, and performing Fourier transform on the target voice using the Fourier transform network to obtain the transformed voice; and (Column 2, lines 58-65 describes that the model performs a Fourier transform on the input voice)
inputting the transformed voice into the omnidirectional network to obtain a processed voice feature output by a complex linear transformation layer of the omnidirectional network; (Column 6, lines 9-13 describes that the neural network combines multiple inputs into a single channel (cited as “an omnidirectional network”).  Equation 1 shows the complex linear transformation performed by the layer.)
the deep neural network is configured to predict a direction interval of arrival of voice.” (Column 18, lines 27-30 describes that a target direction is determined by the deep neural network.)
Var does not explicitly describe:
“the determining a probability that the target voice has a sub-voice in each of the direction intervals corresponding to the plurality of directional networks respectively, using the direction interval determination module, comprises:
inputting the processed voice feature into the direction interval determination module, and determining by the direction interval determination module the probability that the target voice has the sub-voice in each of the direction intervals corresponding to the plurality of directional networks respectively, using a preset direction determination technology;
wherein, the preset direction determination technology comprises an arrival direction estimation algorithm or a pre-trained deep neural network.”
However, Lu describes:
“the determining a probability that the target voice has a sub-voice in each of the direction intervals corresponding to the plurality of directional networks respectively, using the direction interval determination module, comprises:
inputting the processed voice feature into the direction interval determination module, and determining by the direction interval determination module the probability that the target voice has the sub-voice in each of the direction intervals corresponding to the plurality of directional networks respectively, using a preset direction determination technology; (Column 7, lines 1-15 describes that the weights are determined based on how close a sample is to each direction.  The weight is “a probability.”  The sound sample is the input and the algorithm is a preset direction determination technology.)
wherein, the preset direction determination technology comprises an arrival direction estimation algorithm or a pre-trained deep neural network.” (Column 7, lines 16-27 describes that the weights are iterated to determine a final directional result.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the direction determination algorithm described by Lu into the system of Var in view of Smus in order to determine a direction of a source voice, as described at column 7, lines 25-27 of Lu.
With regard to Claim 23, Var describes:
“determining a loss value of the voice recognition result using an initial text [[corresponding to a largest weighting result]] as the voice recognition result, (Column 9, lines 17-20 and Equation 2 describe a cross entropy loss (the p(s | vt) term in Equation 2)) and
performing back propagation in the third voice recognition model using the loss value to update a parameter in the third voice recognition model to obtain a trained third voice recognition model.” (Column 9, lines 17-20 and Equation 2 show that the back propagation in the training is done with the cross entropy loss (the p(s | vt) term in Equation 2))
Var does not explicitly describe that a predicted result corresponds to a largest weighting result, or:
“the third voice recognition model further comprises a direction interval determination module; and
after inputting the sub-voice feature into the to-be-input directional network, the method further comprises:
obtaining a confidence of each initial text output by each of the directional networks; and
the method further comprises:
determining a probability that the third voice sample has a sub-voice in each of the direction intervals corresponding to the plurality of directional networks respectively, using the direction interval determination module;
weighting the confidence of each initial text output by the plurality of directional networks, by using the probability corresponding to each directional network as a weight of the confidence of the each initial text output by the each directional network.”
However, Smus describes: 
“after inputting the sub-voice feature into the to-be-input directional network, the method further comprises:
obtaining a confidence of each initial text output by each of the directional networks.”
Paragraph 26 of Smus describes that the device determines a probability that the text predicted is correct, which is cited as “a confidence.”
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the probabilities described by Smus into the system of Var in order to compensate for possible recognition errors, as described at paragraph 26 of Smus.
Var in view of Smus does not explicitly describe that a predicted result corresponds to a largest weighting result, or:
“the third voice recognition model further comprises a direction interval determination module; and
the method further comprises:
determining a probability that the third voice sample has a sub-voice in each of the direction intervals corresponding to the plurality of directional networks respectively, using the direction interval determination module;
weighting the confidence of each initial text output by the plurality of directional networks, by using the probability corresponding to each directional network as a weight of the confidence of the each initial text output by the each directional network.”
However, Lu describes that a predicted result corresponds to a largest weighting result, (Column 7, lines 16-27 describes that the weights are iterated to determine a final directional result.) and:
“the third voice recognition model further comprises a direction interval determination module; (Source direction determination unit 803, Figure 8) and
the method further comprises:
determining a probability that the third voice sample has a sub-voice in each of the direction intervals corresponding to the plurality of directional networks respectively, using the direction interval determination module; (Column 7, lines 1-15 describes that the weights are determined based on how close a sample is to each direction.  The weight is “a probability.”)
weighting the confidence of each initial text output by the plurality of directional networks, by using the probability corresponding to each directional network as a weight of the confidence of the each initial text output by the each directional network.” (Column 7, lines 1-15 describes that the weights are applied to the output of each directional channel.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the direction determination algorithm described by Lu into the system of Var in view of Smus in order to determine a direction of a source voice, as described at column 7, lines 25-27 of Lu.
With respect to Claims 29, 30, 33, and 34, apparatus Claim 24 and method Claim 1 are related as an apparatus programmed to perform the same method, with each claimed apparatus function corresponding to each claimed method step. Further, column 28, lines 15-17 of Var describes processor 852 that executes a program stored in memory 864. Accordingly, Claims 29, 30, 33, and 34 are rejected under the same rationale as applied above with respect to Claims 6, 7, 10, and 11.
With respect to Claim 46, apparatus Claim 35 and method Claim 12 are related as an apparatus programmed to perform the same method, with each claimed apparatus function corresponding to each claimed method step. Further, column 28, lines 15-17 of Var describes processor 852 that executes a program stored in memory 864. Accordingly, Claim 46 is rejected under the same rationale as applied above with respect to Claim 23.

Conclusion
9.	The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
US Pat. App. Pub. No. 20190164552 (Lim et al.) also describes using neural networks to determine an input sound direction.
10.	Any inquiry concerning this communication or earlier communications from the examiner should be directed to EDWARD TRACY whose telephone number is (571) 272-8332. The examiner can normally be reached Monday-Friday, 9 AM-5 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached at 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/EDWARD TRACY JR./           Examiner, Art Unit 2656        

/BHAVESH M MEHTA/           Supervisory Patent Examiner, Art Unit 2656