DETAILED ACTION
This communication is in response to the application filed on 17 February, 2021.  Claims 1-20 are pending and have been examined.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority
Receipt is acknowledged of certified copies of papers required by 37 CFR 1.55.

Information Disclosure Statement
The information disclosure statements (IDS) submitted on 17 February, 2021 and 15 October, 2021 are in compliance with the provisions of 37 CFR 1.97.  Accordingly, the IDS are being considered by the examiner.

Claim Objections
Claims 2, 11, and 19 are objected to because of the following informality:
“acquired by the each of the microphones” in line 8 should be changed to “acquired by each of the microphones” in order to establish proper antecedent basis for the claim.

Appropriate correction is required.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 3-4, 12-13, and 19-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Claims 3, 12, and 19, lines 13-14, recite the limitation “in response to determining the each of the microphones in the microphone array is the target microphone.”  It is unclear to what “determining the each of the microphones […] is the target” precisely refers.  Is each microphone a separate target microphone?  Is the group of microphones together one target microphone?  If so, how is a group of microphones a singular target microphone?  Is either of these options what Applicant is claiming?
Claims 4, 13, and 20, lines 11-12, 10-11, and 11-12 respectively, recite the limitation “in response to determining each of the plurality of enhancement directions is selected as the target enhancement direction.”  It is unclear to what “determining each of the plurality of […] directions is selected as the target […] direction” precisely refers.  Is each of the plurality of directions selected as a target direction?  If so, then how is that then a singular target direction?  Is the word “each” supposed to read “which”?  Is either of these options what Applicant is claiming?
In order to overcome these rejections, claims 3-4, 12-13, and 19-20 should be reworded so as to clarify what is being claimed in each claim’s respective limitation above.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claims 1-2, 6-7, 10-11, and 15-18 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Kim et al. (US 2018/0336892; hereafter Kim).
Regarding claim 1,
Kim teaches: 
an audio data processing method, performed by an audio data processing device, the method comprising (see Kim ¶ 5, 42, 54: example methods are disclosed herein; system is implemented on one or more standalone data processing apparatus; audio circuitry receives audio data and converts the audio data to an electrical signal):
obtaining multi-path audio data in an environmental space (see Kim ¶ 248, 268: electronic device samples an audio signal at each of a plurality of microphones of the electronic device to obtain a plurality of audio signals [multipath audio data]; above-described techniques can be used to select microphones to sample audio signals (e.g., as the user moves around the room) [an environmental space]), 
obtaining a speech data set based on the multi-path audio data (see Kim ¶ 54, 249: audio circuitry converts the electrical signal to audio data and transmits the audio data to peripherals interface for processing; electronic device samples via multiple microphones a plurality of audio signals [multipath audio data], which include a natural-language speech input from user), 
and separately generating, in a plurality of enhancement directions, enhanced speech information corresponding to the speech data set (see Kim ¶ 54, 249, 250, 251: audio circuitry converts the electrical signal to audio data and transmits the audio data to peripherals interface for processing [speech data set]; electronic device samples via multiple microphones a plurality of audio signals which include a natural-language speech input from user [speech information]; plurality of audio streams includes one or more audio beams, multiple audio signals to obtain a single audio beam (e.g., via beamforming techniques [enhancement]) of the plurality of audio beams, at least one audio beam of the plurality of audio beams is obtained using source separation techniques; each of six microphones samples an audio signal and processes the six audio signals to obtain six audio beams [plurality of enhancement directions; enhanced speech information corresponding to the speech data set]);
matching a speech hidden feature in the enhanced speech information with a target matching word (see Kim ¶ 214, 253: speech pre-processor extracts representative features [hidden features] from the speech input; electronic device determines based on the plurality of audio streams (e.g., audio beams) [enhanced speech information] whether any of the plurality of audio signals corresponds to a spoken trigger (e.g., “Hey Siri”) [target matching word]),
and determining an enhancement direction corresponding to the enhanced speech information having a highest degree of matching with the target matching word as a target audio direction (see Kim ¶ 253, 254, 255: each of the plurality of audio streams is associated with directional information, and the electronic device determines whether any of the plurality of audio signals corresponds to a spoken trigger based on the directional information [corresponding to the enhanced speech information]; electronic device determines whether any of the plurality of audio signals corresponds to a spoken trigger [determining an enhancement direction, as a target audio direction] based on one or more confidence scores (or trigger scores) [having highest degree of matching with the target matching word]; [¶ 254 and 255 further demonstrate a process whereby multiple audio beams are assigned confidence scores corresponding to the confidence level that the beam contains a spoken trigger and then the beam/data with the highest score is selected]);
obtaining speech spectrum features in the enhanced speech information, and obtaining, from the speech spectrum features, a speech spectrum feature in the target audio direction (see Kim ¶ 214, 252: STT processing module 730 includes one or more ASR systems [that] include a front-end speech pre-processor [that] extract[s] spectral features that characterize the speech input [obtaining speech spectrum features], ASR system includes speech recognition models, speech recognition models and speech recognition engines are used to process the extracted representative features of the front-end speech pre-processor [a speech spectrum feature]; electronic device modifies at least one audio beam of the plurality of audio beams for better speech recognition result(s) in a subsequent speech recognition analysis [in the target audio direction], after adjusting the audio beam the electronic device passes the plurality of audio beams to a speech recognizer [730] for speech recognition [in the enhanced speech information]);
performing speech authentication on the speech hidden feature and the speech spectrum feature in the target audio direction based on the target matching word (see Kim ¶ 214, 252, 257: speech pre-processor extracts representative features [speech hidden features] from the speech input, STT processing module 730 includes one or more ASR systems [that] include a front-end speech pre-processor [that] extract[s] spectral features that characterize the speech input [obtaining speech spectrum features], ASR system includes speech recognition models, speech recognition models and speech recognition engines are used to process the extracted representative features of the front-end speech pre-processor [a speech spectrum feature]; electronic device modifies at least one audio beam of the plurality of audio beams for better speech recognition result(s) in a subsequent speech recognition analysis [in the target audio direction]; electronic device determines whether audio signals correspond to a spoken trigger [target matching word] by performing speech recognition on one or more of the audio beams [performing speech authentication]),
to obtain a target authentication result (see Kim ¶ 257, 258: the electronic device can perform speech recognition analysis on one or more of the audio [beams] to obtain one or more speech recognition results; thereafter, the electronic device can determine whether any of the plurality of audio signals corresponds to a spoken trigger [obtain a target authentication result]), 
the target authentication result being used for representing a probability of existence of the target matching word in the target audio direction for controlling a terminal (see Kim ¶ 45, 247, 254, 257, 262: functions of a digital assistant are implemented as a standalone application installed on a user device; process is performed using a client device/user device (e.g., a mobile phone and a smart watch) [a terminal]; electronic device determines whether any of the plurality of audio signals corresponds to a spoken trigger based on one or more confidence scores (or trigger scores), trigger score indicates a confidence level that a trigger phrase is included in an audio stream [probability of existence of the target matching word]; electronic device can obtain information indicative of a direction associated with the uttered word [in the target audio direction]; in accordance with a determination that the plurality of audio signals corresponds to the spoken trigger, the electronic device initiates a session of the digital assistant [controlling a terminal]).

Regarding claim 2,
Kim teaches:
obtaining a microphone array corresponding to the environmental space in which the terminal is located, the microphone array including a plurality of microphones and an array structure of the plurality of microphones (see Kim ¶ 247, 248, 268, 270: process is performed using a client device/user device (e.g., a mobile phone and a smart watch) [a terminal]; electronic device samples an audio signal at each of a plurality of microphones [obtaining a microphone array, the microphone array including a plurality of microphones] of the electronic device to obtain a plurality of audio signals; above-described techniques can be used to select microphones to sample audio signals (e.g., as the user moves around the room) [corresponding to the environmental space in which the terminal is located]; electronic device can receive information including location of the microphone(s) [an array structure of the plurality of microphones]);
acquiring an audio signal in the environmental space based on the array structure of the plurality of the microphones, the audio signal including at least one speech signal (see Kim ¶ 247, 248, 249, 268, 270: process is performed using a client device/user device (e.g., a mobile phone and a smart watch); electronic device samples an audio signal [acquiring an audio signal] at each of a plurality of microphones of the electronic device to obtain a plurality of audio signals; plurality of audio signals include a natural-language speech input from user [the audio signal including at least one speech signal]; above-described techniques can be used to select microphones to sample audio signals (e.g., as the user moves around the room) [in the environmental space]; electronic device can receive information including location of the microphone(s) [based on the array structure of the plurality of the microphones]); and
separately determining the at least one speech signal acquired by the each of the microphones as one-path audio data corresponding to the each of the microphones, the one-path audio data being the at least one speech signal acquired by one microphone (see Kim ¶ 54, 249, 251: audio circuitry converts the electrical signal to audio data and transmits the audio data to peripherals interface for processing; electronic device samples via multiple microphones a plurality of audio signals which include a natural-language speech input from user [the at least one speech signal]; each of six microphones samples an audio signal and processes the six audio signals to obtain six audio beams [separately determining the at least one signal acquired by the each of the microphones as one-path audio data corresponding to the each of the microphones, the one-path audio data acquired by one microphone]).

Regarding claim 6,
Kim teaches:
obtaining a speech hidden feature in enhanced speech information in each enhancement direction based on a first wake-up detection model (see Kim ¶ 39, 214, 253: processing modules utilize data and models to process speech input and determine the user's intent based on natural language input [based on a first model]; speech pre-processor extracts representative features [obtaining a speech hidden feature] from the speech input; electronic device determines based on the plurality of audio streams (e.g., audio beams) [in enhanced speech information in each enhancement direction] whether any of the plurality of audio signals corresponds to a spoken trigger (e.g., “Hey Siri”) [wake-up detection]),
one speech hidden feature being a feature obtained after feature extraction is performed by the first wake-up detection model on a speech spectrum feature of one piece of enhanced speech information (see Kim ¶ 214, 258: front-end speech pre-processor extracts representative features from the speech input [one speech hidden feature being a feature obtained after feature extraction is performed], speech pre-processor extract[s] spectral features that characterize the speech input [on a speech spectrum feature of one piece of enhanced speech information], each ASR system includes one or more speech recognition models, speech recognition models used to process the extracted representative features to produce recognition results, recognition result is passed to natural language processing module for intent deduction; electronic device can determine whether any of the plurality of audio signals corresponds to a spoken trigger based on one or more recognized words or one or more intents [by the first wake-up detection model]);
performing speech recognition on each speech hidden feature based on the target matching word, to obtain a speech recognition result corresponding to the first wake-up detection model (see Kim ¶ 214, 258: speech recognition models used to process the extracted representative features to produce recognition results [to obtain a speech recognition result corresponding to the first wake-up detection model]; electronic device can determine whether any of the plurality of audio signals corresponds to a spoken trigger based on one or more recognized words [target matching word; wake-up detection model]), 
the speech recognition result comprising a degree of matching between the speech hidden feature corresponding to the each enhancement direction and the target matching word (see Kim ¶ 214, 254, 258: speech recognition models used to process the extracted representative features [speech hidden feature] to produce recognition results, STT processing module produces recognition results containing a text string, each candidate text representation [speech recognition result] is associated with a speech recognition confidence score [a degree of matching], based on the speech recognition confidence scores, STT processing module ranks the candidate text representations; electronic device determines whether any of the plurality of audio signals [the each enhancement direction] corresponds to a spoken trigger [target matching word] based on one or more confidence scores [comprising a degree of matching between the speech hidden feature corresponding to the each enhancement direction and the target matching word]; electronic device can determine whether any of the plurality of audio signals corresponds to a spoken trigger based on one or more recognized words [target matching word; wake-up detection model]); and
determining, based on the speech recognition result, the enhancement direction corresponding to the enhanced speech information having the highest degree of matching with the target matching word as the target audio direction (see Kim ¶ 255, 261: [trigger score assigned for each of two audio beams (one score per beam)], based on the trigger scores the electronic device determines whether any of the plurality of audio signals corresponds to the spoken trigger [based on the speech recognition result], electronic device determines the audio signals include a spoken trigger because the highest trigger score exceeds a predetermined threshold value [having the highest degree of matching with the target matching word]; electronic device can select the one or more candidate audio streams based on respective trigger scores associated with the one or more candidate audio streams, if electronic device determines that a first candidate audio stream includes a spoken trigger electronic device may provide the first candidate audio stream [target audio direction] and not the second candidate audio stream [determining the enhancement direction corresponding to the enhanced speech information]).

Regarding claim 7,
Kim teaches:
obtaining, based on the first wake-up detection model, a degree of matching between the each speech hidden feature and a plurality of wake-up features in the first wake-up detection model (see Kim ¶ 39, 214, 232, 259: processing modules utilize data and models to process speech input and determine the user's intent based on natural language input; front-end speech pre-processor extracts representative features [hidden features] from the speech input, speech recognition models are used to process the extracted representative features to produce text recognition results, each text representation is associated with a speech recognition confidence score [scores are based on the representative/hidden features], processing module ranks the text representations and provides the highest ranked text representations to natural language processing module for intent deduction based on the speech recognition confidence scores; natural language processing module is configured to receive a text representation to determine intent confidence scores [obtaining a degree of matching between the each speech hidden feature], natural language processing module can select one or more actionable intents based on the determined intent confidence scores [plurality of wake-up features (based on the broadest reasonable interpretation of the plain meaning of the term, Examiner interprets “wake-up feature” to be an actionable intent)]; electronic device obtains an audio beam and determines that the audio beam does not include a trigger because the audio does not correspond to an actionable intent [based on the first wake-up detection model; in the first wake-up detection model]); and
associating the degree of matching obtained by the first wake-up detection model with the target matching word corresponding to the plurality of wake-up features in the first wake-up detection model, to obtain the speech recognition result corresponding to the first wake-up detection model (see Kim ¶ 232, 253, 259: natural language processing module is configured to receive a text representation to determine intent confidence score [corresponding to the plurality of wake-up features]; electronic device determines based on the plurality of audio streams (e.g., audio beams) whether any of the plurality of audio signals corresponds to a spoken trigger (e.g., “Hey Siri”) [to obtain the speech recognition result corresponding to the first wake-up detection model]; electronic device obtains an audio beam and determines that the audio beam does not include a trigger because the audio does not correspond to an actionable intent [associating the degree of matching obtained by the first wake-up detection model with the target matching word; in the first wake-up detection model]).

Regarding claim 10,
Kim teaches: 
an audio data processing apparatus, comprising: a memory storing computer program instructions; and a processor coupled to the memory and configured to execute the computer program instructions and perform (see Kim ¶ 5, 9, 42, 54: electronic device having one or more processors, memory; one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions; system is implemented on one or more standalone data processing apparatus; audio circuitry receives audio data and converts the audio data to an electrical signal):
obtaining multi-path audio data in an environmental space (see Kim ¶ 248, 268: electronic device samples an audio signal at each of a plurality of microphones of the electronic device to obtain a plurality of audio signals [multipath audio data]; above-described techniques can be used to select microphones to sample audio signals (e.g., as the user moves around the room) [an environmental space]), 
obtaining a speech data set based on the multi-path audio data (see Kim ¶ 54, 249: audio circuitry converts the electrical signal to audio data and transmits the audio data to peripherals interface for processing; electronic device samples via multiple microphones a plurality of audio signals [multipath audio data], which include a natural-language speech input from user), 
and separately generating, in a plurality of enhancement directions, enhanced speech information corresponding to the speech data set (see Kim ¶ 54, 249, 250, 251: audio circuitry converts the electrical signal to audio data and transmits the audio data to peripherals interface for processing [speech data set]; electronic device samples via multiple microphones a plurality of audio signals which include a natural-language speech input from user [speech information]; plurality of audio streams includes one or more audio beams, multiple audio signals to obtain a single audio beam (e.g., via beamforming techniques [enhancement]) of the plurality of audio beams, at least one audio beam of the plurality of audio beams is obtained using source separation techniques; each of six microphones samples an audio signal and processes the six audio signals to obtain six audio beams [plurality of enhancement directions; enhanced speech information corresponding to the speech data set]);
matching a speech hidden feature in the enhanced speech information with a target matching word (see Kim ¶ 214, 253: speech pre-processor extracts representative features [hidden features] from the speech input; electronic device determines based on the plurality of audio streams (e.g., audio beams) [enhanced speech information] whether any of the plurality of audio signals corresponds to a spoken trigger (e.g., “Hey Siri”) [target matching word]),
and determining an enhancement direction corresponding to the enhanced speech information having a highest degree of matching with the target matching word as a target audio direction (see Kim ¶ 253, 254, 255: each of the plurality of audio streams is associated with directional information, and the electronic device determines whether any of the plurality of audio signals corresponds to a spoken trigger based on the directional information [corresponding to the enhanced speech information]; electronic device determines whether any of the plurality of audio signals corresponds to a spoken trigger [determining an enhancement direction, as a target audio direction] based on one or more confidence scores (or trigger scores) [having highest degree of matching with the target matching word]; [¶ 254 and 255 further demonstrate a process whereby multiple audio beams are assigned confidence scores corresponding to the confidence level that the beam contains a spoken trigger and then the beam/data with the highest score is selected]);
obtaining speech spectrum features in the enhanced speech information, and obtaining, from the speech spectrum features, a speech spectrum feature in the target audio direction (see Kim ¶ 214, 252: STT processing module 730 includes one or more ASR systems [that] include a front-end speech pre-processor [that] extract[s] spectral features that characterize the speech input [obtaining speech spectrum features], ASR system includes speech recognition models, speech recognition models and speech recognition engines are used to process the extracted representative features of the front-end speech pre-processor [a speech spectrum feature]; electronic device modifies at least one audio beam of the plurality of audio beams for better speech recognition result(s) in a subsequent speech recognition analysis [in the target audio direction], after adjusting the audio beam the electronic device passes the plurality of audio beams to a speech recognizer [730] for speech recognition [in the enhanced speech information]);
performing speech authentication on the speech hidden feature and the speech spectrum feature in the target audio direction based on the target matching word (see Kim ¶ 214, 252, 257: speech pre-processor extracts representative features [speech hidden features] from the speech input, STT processing module 730 includes one or more ASR systems [that] include a front-end speech pre-processor [that] extract[s] spectral features that characterize the speech input [obtaining speech spectrum features], ASR system includes speech recognition models, speech recognition models and speech recognition engines are used to process the extracted representative features of the front-end speech pre-processor [a speech spectrum feature]; electronic device modifies at least one audio beam of the plurality of audio beams for better speech recognition result(s) in a subsequent speech recognition analysis [in the target audio direction]; electronic device determines whether audio signals correspond to a spoken trigger [target matching word] by performing speech recognition on one or more of the audio beams [performing speech authentication]),
to obtain a target authentication result (see Kim ¶ 257, 258: the electronic device can perform speech recognition analysis on one or more of the audio [beams] to obtain one or more speech recognition results; thereafter, the electronic device can determine whether any of the plurality of audio signals corresponds to a spoken trigger [obtain a target authentication result]), 
the target authentication result being used for representing a probability of existence of the target matching word in the target audio direction for controlling a terminal (see Kim ¶ 45, 247, 254, 257, 262: functions of a digital assistant are implemented as a standalone application installed on a user device; process is performed using a client device/user device (e.g., a mobile phone and a smart watch) [a terminal]; electronic device determines whether any of the plurality of audio signals corresponds to a spoken trigger based on one or more confidence scores (or trigger scores), trigger score indicates a confidence level that a trigger phrase is included in an audio stream [probability of existence of the target matching word]; electronic device can obtain information indicative of a direction associated with the uttered word [in the target audio direction]; in accordance with a determination that the plurality of audio signals corresponds to the spoken trigger, the electronic device initiates a session of the digital assistant [controlling a terminal]).

Regarding claim 11,
Kim teaches:
obtaining a microphone array corresponding to the environmental space in which the terminal is located, the microphone array including a plurality of microphones and an array structure of the plurality of microphones (see Kim ¶ 247, 248, 268, 270: process is performed using a client device/user device (e.g., a mobile phone and a smart watch) [a terminal]; electronic device samples an audio signal at each of a plurality of microphones [obtaining a microphone array, the microphone array including a plurality of microphones] of the electronic device to obtain a plurality of audio signals; above-described techniques can be used to select microphones to sample audio signals (e.g., as the user moves around the room) [corresponding to the environmental space in which the terminal is located]; electronic device can receive information including location of the microphone(s) [an array structure of the plurality of microphones]);
acquiring an audio signal in the environmental space based on the array structure of the plurality of the microphones, the audio signal including at least one speech signal (see Kim ¶ 247, 248, 249, 268, 270: process is performed using a client device/user device (e.g., a mobile phone and a smart watch); electronic device samples an audio signal [acquiring an audio signal] at each of a plurality of microphones of the electronic device to obtain a plurality of audio signals; plurality of audio signals include a natural-language speech input from user [the audio signal including at least one speech signal]; above-described techniques can be used to select microphones to sample audio signals (e.g., as the user moves around the room) [in the environmental space]; electronic device can receive information including location of the microphone(s) [based on the array structure of the plurality of the microphones]); and
separately determining the at least one speech signal acquired by the each of the microphones as one-path audio data corresponding to the each of the microphones, the one-path audio data being the at least one speech signal acquired by one microphone (see Kim ¶ 54, 249, 251: audio circuitry converts the electrical signal to audio data and transmits the audio data to peripherals interface for processing; electronic device samples via multiple microphones a plurality of audio signals which include a natural-language speech input from user [the at least one speech signal]; each of six microphones samples an audio signal and processes the six audio signals to obtain six audio beams [separately determining the at least one signal acquired by the each of the microphones as one-path audio data corresponding to the each of the microphones, the one-path audio data acquired by one microphone]).

Regarding claim 15,
Kim teaches:
obtaining a speech hidden feature in enhanced speech information in each enhancement direction based on a first wake-up detection model (see Kim ¶ 39, 214, 253: processing modules utilize data and models to process speech input and determine the user's intent based on natural language input [based on a first model]; speech pre-processor extracts representative features [obtaining a speech hidden feature] from the speech input; electronic device determines based on the plurality of audio streams (e.g., audio beams) [in enhanced speech information in each enhancement direction] whether any of the plurality of audio signals corresponds to a spoken trigger (e.g., “Hey Siri”) [wake-up detection]),
one speech hidden feature being a feature obtained after feature extraction is performed by the first wake-up detection model on a speech spectrum feature of one piece of enhanced speech information (see Kim ¶ 214, 258: front-end speech pre-processor extracts representative features from the speech input [one speech hidden feature being a feature obtained after feature extraction is performed], speech pre-processor extract[s] spectral features that characterize the speech input [on a speech spectrum feature of one piece of enhanced speech information], each ASR system includes one or more speech recognition models, speech recognition models used to process the extracted representative features to produce recognition results, recognition result is passed to natural language processing module for intent deduction; electronic device can determine whether any of the plurality of audio signals corresponds to a spoken trigger based on one or more recognized words or one or more intents [by the first wake-up detection model]);
performing speech recognition on each speech hidden feature based on the target matching word, to obtain a speech recognition result corresponding to the first wake-up detection model (see Kim ¶ 214, 258: speech recognition models used to process the extracted representative features to produce recognition results [to obtain a speech recognition result corresponding to the first wake-up detection model]; electronic device can determine whether any of the plurality of audio signals corresponds to a spoken trigger based on one or more recognized words [target matching word; wake-up detection model]), 
the speech recognition result comprising a degree of matching between the speech hidden feature corresponding to the each enhancement direction and the target matching word (see Kim ¶ 214, 254, 258: speech recognition models used to process the extracted representative features [speech hidden feature] to produce recognition results, STT processing module produces recognition results containing a text string, each candidate text representation [speech recognition result] is associated with a speech recognition confidence score [a degree of matching], based on the speech recognition confidence scores, STT processing module ranks the candidate text representations; electronic device determines whether any of the plurality of audio signals [the each enhancement direction] corresponds to a spoken trigger [target matching word] based on one or more confidence scores [comprising a degree of matching between the speech hidden feature corresponding to the each enhancement direction and the target matching word]; electronic device can determine whether any of the plurality of audio signals corresponds to a spoken trigger based on one or more recognized words [target matching word; wake-up detection model]); and
determining, based on the speech recognition result, the enhancement direction corresponding to the enhanced speech information having the highest degree of matching with the target matching word as the target audio direction (see Kim ¶ 255, 261: [trigger score assigned for each of two audio beams (one score per beam)], based on the trigger scores the electronic device determines whether any of the plurality of audio signals corresponds to the spoken trigger [based on the speech recognition result], electronic device determines the audio signals include a spoken trigger because the highest trigger score exceeds a predetermined threshold value [having the highest degree of matching with the target matching word]; electronic device can select the one or more candidate audio streams based on respective trigger scores associated with the one or more candidate audio streams, if electronic device determines that a first candidate audio stream includes a spoken trigger electronic device may provide the first candidate audio stream [target audio direction] and not the second candidate audio stream [determining the enhancement direction corresponding to the enhanced speech information]).
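
The selection step mapped above (one trigger score per audio beam, with the highest-scoring beam chosen when its score exceeds a predetermined threshold) can be sketched as follows. This is a minimal illustration only and is not code from Kim: the function name `select_target_direction`, the example scores, and the threshold value are all hypothetical.

```python
from typing import Optional, Sequence


def select_target_direction(trigger_scores: Sequence[float],
                            threshold: float) -> Optional[int]:
    """Return the index of the beam with the highest trigger score.

    Hypothetical helper: returns None when there are no beams or when
    the best score does not exceed the threshold (no trigger detected).
    """
    if not trigger_scores:
        return None
    # Index of the beam whose score is largest (first one wins on ties).
    best = max(range(len(trigger_scores)), key=lambda i: trigger_scores[i])
    return best if trigger_scores[best] > threshold else None


# Two beams, one score per beam; beam 1 is selected as the target direction.
target = select_target_direction([0.2, 0.9], threshold=0.5)
```

Under these assumptions, the returned index plays the role of the target audio direction, and a `None` result corresponds to no beam containing the spoken trigger.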

Regarding claim 16,
Kim teaches:
obtaining, based on the first wake-up detection model, a degree of matching between the each speech hidden feature and a plurality of wake-up features in the first wake-up detection model (see Kim ¶ 39, 214, 232, 259: processing modules utilize data and models to process speech input and determine the user's intent based on natural language input; front-end speech pre-processor extracts representative features [hidden features] from the speech input, speech recognition models are used to process the extracted representative features to produce text recognition results, each text representation is associated with a speech recognition confidence score [scores are based on the representative/hidden features], processing module ranks the text representations [and provides the] highest ranked text representations to natural language processing module for intent deduction based on the speech recognition confidence scores; natural language processing module is configured to receive a text representation to determine intent confidence scores [obtaining a degree of matching between the each speech hidden feature], natural language processing module can select one or more actionable intents based on the determined intent confidence scores [plurality of wake-up features (based on the broadest reasonable interpretation of the plain meaning of the term, Examiner interprets “wake-up feature” to be an actionable intent)]; electronic device obtains an audio beam and determines that the audio beam does not include a trigger because the audio does not correspond to an actionable intent [based on the first wake-up detection model; in the first wake-up detection model]); and
associating the degree of matching obtained by the first wake-up detection model with the target matching word corresponding to the plurality of wake-up features in the first wake-up detection model, to obtain the speech recognition result corresponding to the first wake-up detection model (see Kim ¶ 232, 253, 259: natural language processing module is configured to receive a text representation to determine intent confidence score [corresponding to the plurality of wake-up features]; electronic device determines based on the plurality of audio streams (e.g., audio beams) whether any of the plurality of audio signals corresponds to a spoken trigger (e.g., “Hey Siri”) [to obtain the speech recognition result corresponding to the first wake-up detection model]; electronic device obtains an audio beam and determines that the audio beam does not include a trigger because the audio does not correspond to an actionable intent [associating the degree of matching obtained by the first wake-up detection model with the target matching word; in the first wake-up detection model]).

Regarding claim 17,
Kim teaches: 
A non-transitory electronic-readable storage medium storing computer program instructions executable by at least one processor to perform (see Kim ¶ 7: example non-transitory computer-readable storage medium stores one or more programs, one or more programs comprise instructions, which when executed by one or more processors of an electronic device):
obtaining multi-path audio data in an environmental space (see Kim ¶ 248, 268: electronic device samples an audio signal at each of a plurality of microphones of the electronic device to obtain a plurality of audio signals [multipath audio data]; above-described techniques can be used to select microphones to sample audio signals (e.g., as the user moves around the room) [an environmental space]), 
obtaining a speech data set based on the multi-path audio data (see Kim ¶ 54, 249: audio circuitry converts the electrical signal to audio data and transmits the audio data to peripherals interface for processing; electronic device samples via multiple microphones a plurality of audio signals [multipath audio data], which include a natural-language speech input from user), 
and separately generating, in a plurality of enhancement directions, enhanced speech information corresponding to the speech data set (see Kim ¶ 54, 249, 250, 251: audio circuitry converts the electrical signal to audio data and transmits the audio data to peripherals interface for processing [speech data set]; electronic device samples via multiple microphones a plurality of audio signals which include a natural-language speech input from user [speech information]; plurality of audio streams includes one or more audio beams, multiple audio signals to obtain a single audio beam (e.g., via beamforming techniques [enhancement]) of the plurality of audio beams, at least one audio beam of the plurality of audio beams is obtained using source separation techniques; each of six microphones samples an audio signal and processes the six audio signals to obtain six audio beams [plurality of enhancement directions; enhanced speech information corresponding to the speech data set]);
matching a speech hidden feature in the enhanced speech information with a target matching word (see Kim ¶ 214, 253: speech pre-processor extracts representative features [hidden features] from the speech input; electronic device determines based on the plurality of audio streams (e.g., audio beams) [enhanced speech information] whether any of the plurality of audio signals corresponds to a spoken trigger (e.g., “Hey Siri”) [target matching word]),
and determining an enhancement direction corresponding to the enhanced speech information having a highest degree of matching with the target matching word as a target audio direction (see Kim ¶ 253, 254, 255: each of the plurality of audio streams is associated with directional information, and the electronic device determines whether any of the plurality of audio signals corresponds to a spoken trigger based on the directional information [corresponding to the enhanced speech information]; electronic device determines whether any of the plurality of audio signals corresponds to a spoken trigger [determining an enhancement direction, as a target audio direction] based on one or more confidence scores (or trigger scores) [having highest degree of matching with the target matching word]; [¶ 254 and 255 further demonstrate a process whereby multiple audio beams are assigned confidence scores corresponding to the confidence level that the beam contains a spoken trigger and then the beam/data with the highest score is selected]);
obtaining speech spectrum features in the enhanced speech information, and obtaining, from the speech spectrum features, a speech spectrum feature in the target audio direction (see Kim ¶ 214, 252: STT processing module 730 includes one or more ASR systems [that] include a front-end speech pre-processor [that] extract[s] spectral features that characterize the speech input [obtaining speech spectrum features], ASR system includes speech recognition models, speech recognition models and speech recognition engines are used to process the extracted representative features of the front-end speech pre-processor [a speech spectrum feature]; electronic device modifies at least one audio beam of the plurality of audio beams for better speech recognition result(s) in a subsequent speech recognition analysis [in the target audio direction], after adjusting the audio beam the electronic device passes the plurality of audio beams to a speech recognizer [730] for speech recognition [in the enhanced speech information]);
performing speech authentication on the speech hidden feature and the speech spectrum feature in the target audio direction based on the target matching word (see Kim ¶ 214, 252, 257: speech pre-processor extracts representative features [speech hidden features] from the speech input, STT processing module 730 includes one or more ASR systems [that] include a front-end speech pre-processor [that] extract[s] spectral features that characterize the speech input [obtaining speech spectrum features], ASR system includes speech recognition models, speech recognition models and speech recognition engines are used to process the extracted representative features of the front-end speech pre-processor [a speech spectrum feature]; electronic device modifies at least one audio beam of the plurality of audio beams for better speech recognition result(s) in a subsequent speech recognition analysis [in the target audio direction]; electronic device determines whether audio signals correspond to a spoken trigger [target matching word] by performing speech recognition on one or more of the audio beams [performing speech authentication]),
to obtain a target authentication result (see Kim ¶ 257, 258: the electronic device can perform speech recognition analysis on one or more of the audio [beams] to obtain one or more speech recognition results; thereafter, the electronic device can determine whether any of the plurality of audio signals corresponds to a spoken trigger [obtain a target authentication result]), 
the target authentication result being used for representing a probability of existence of the target matching word in the target audio direction for controlling a terminal (see Kim ¶ 45, 247, 254, 257, 262: functions of a digital assistant are implemented as a standalone application installed on a user device; process is performed using a client device/user device (e.g., a mobile phone and a smart watch) [a terminal]; electronic device determines whether any of the plurality of audio signals corresponds to a spoken trigger based on one or more confidence scores (or trigger scores), trigger score indicates a confidence level that a trigger phrase is included in an audio stream [probability of existence of the target matching word]; electronic device can obtain information indicative of a direction associated with the uttered word [in the target audio direction]; in accordance with a determination that the plurality of audio signals corresponds to the spoken trigger, the electronic device initiates a session of the digital assistant [controlling a terminal]).

Regarding claim 18,
Kim teaches:
obtaining a microphone array corresponding to the environmental space in which the terminal is located, the microphone array including a plurality of microphones and an array structure of the plurality of microphones (see Kim ¶ 247, 248, 268, 270: process is performed using a client device/user device (e.g., a mobile phone and a smart watch) [a terminal]; electronic device samples an audio signal at each of a plurality of microphones [obtaining a microphone array, the microphone array including a plurality of microphones] of the electronic device to obtain a plurality of audio signals; above-described techniques can be used to select microphones to sample audio signals (e.g., as the user moves around the room) [corresponding to the environmental space in which the terminal is located]; electronic device can receive information including location of the microphone(s) [an array structure of the plurality of microphones]);
acquiring an audio signal in the environmental space based on the array structure of the plurality of the microphones, the audio signal including at least one speech signal (see Kim ¶ 247, 248, 249, 268, 270: process is performed using a client device/user device (e.g., a mobile phone and a smart watch); electronic device samples an audio signal [acquiring an audio signal] at each of a plurality of microphones of the electronic device to obtain a plurality of audio signals; plurality of audio signals include a natural-language speech input from user [the audio signal including at least one speech signal]; above-described techniques can be used to select microphones to sample audio signals (e.g., as the user moves around the room) [in the environmental space]; electronic device can receive information including location of the microphone(s) [based on the array structure of the plurality of the microphones]); and
separately determining the at least one speech signal acquired by the each of the microphones as one-path audio data corresponding to the each of the microphones, the one-path audio data being the at least one speech signal acquired by one microphone (see Kim ¶ 54, 249, 251: audio circuitry converts the electrical signal to audio data and transmits the audio data to peripherals interface for processing; electronic device samples via multiple microphones a plurality of audio signals which include a natural-language speech input from user [the at least one speech signal]; each of six microphones samples an audio signal and processes the six audio signals to obtain six audio beams [separately determining the at least one signal acquired by the each of the microphones as one-path audio data corresponding to the each of the microphones, the one-path audio data acquired by one microphone]).

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 3-5, 12-14, and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Kim et al. (US 2018/0336892; hereafter Kim) in view of Pandya et al. (US 2017/0278512; hereafter Pandya).
Regarding claim 3, Kim teaches all the limitations of claim 2.
Kim further teaches:
obtaining to-be-enhanced speech data separately corresponding to each path of audio data, in response to determining the each of the microphones in the microphone array is the target microphone (see Kim ¶ 251, 252, 253: electronic device has six microphones [microphone array], each of which samples an audio signal, and processes the six audio signals to obtain six audio beams; electronic device modifies at least one audio beam of the plurality of audio beams for better speech recognition results [determining the each of the microphones in the microphone array is the target microphone], electronic device further adjusts the audio beam to minimize echo; determining whether any of the plurality of audio signals corresponds to the spoken trigger comprises determining whether each of the plurality of audio streams includes the spoken trigger [obtaining to-be-enhanced speech data separately corresponding to each path of audio data, in response to determining the each of the microphones is the target microphone], each of the plurality of audio streams is associated with directional information, electronic device determines whether the plurality of audio signals corresponds to a spoken trigger based on the directional information); and
separately adding each piece of to-be-enhanced speech data to the speech data set (see Kim ¶ 260: electronic device passes data corresponding to the set [speech data set] of candidate audio [beams] to a software module on the electronic device).
Kim does not teach:
a first speech signal and a second speech signal, the first speech signal being a sound signal that is transmitted by a user and acquired by the microphone array, the second speech signal being a sound signal that is transmitted by the terminal and acquired by the microphone array; and the obtaining a speech data set based on the multi-path audio data comprises: 
obtaining a target microphone from the microphones of the microphone array, and using audio data that comprises the first speech signal and the second speech signal and that corresponds to the target microphone as target audio data; reducing the second speech signal in the target audio data by using an echo canceler, and determining the target audio data from which the second speech signal is reduced as to-be-enhanced speech data;
Pandya discloses:
a first speech signal and a second speech signal, the first speech signal being a sound signal that is transmitted by a user and acquired by the microphone array, the second speech signal being a sound signal that is transmitted by the terminal and acquired by the microphone array (see Pandya ¶ 42, 43, 49: capturing module include[s] at least two microphones; echo cancellation module removes playback audio captured by the microphones [acquired by the microphone array]; Fig. 3 illustrates the processes performed by the echo cancellation module, dual talk detection and removal involves approximation of user speech from the mixture of machine sound (playback sound) [second speech signal being a sound signal that is transmitted by the terminal] and user speech [first speech signal being a sound signal that is transmitted by a user]); and
obtaining a target microphone from the microphones of the microphone array (see Pandya ¶ 42, 43, 50: capturing module include[s] at least two microphones; echo cancellation module removes playback audio captured by the microphones [from the microphones of the microphone array]; referring to Fig 3, a microphone and a speaker are situated in the same system enclosed by a room which may generate echo, audio signal received by the mic [obtaining a target microphone]), 
and using audio data that comprises the first speech signal and the second speech signal and that corresponds to the target microphone as target audio data (see Pandya ¶ 42, 43, 49, 50: capturing module [has] analog to digital converters which converts analog electrical signals into digital signals [audio data]; echo cancellation module would receive audio signals from the capturing module [using audio data]; dual talk detection and removal involves approximation of user speech from the mixture of machine sound (playback sound) [second speech signal] and user speech [first speech signal]; audio signal received by the microphone [target microphone as target audio data] represented by y(m) = x(m) + yf(m) where y(m) is the audio signal received by the mic, x(m) is the user's sound, and yf(m) is the sound of the speaker);
reducing the second speech signal in the target audio data by using an echo canceler, and determining the target audio data from which the second speech signal is reduced as to-be-enhanced speech data (see Pandya ¶ 43, 50: echo cancellation module would receive audio signals from the capturing module [using an echo canceler]; FIG. 3 illustrates processes performed by the echo cancellation module, audio signal received by the microphone [target audio data] represented by y(m) = x(m) + yf(m) where y(m) is the audio signal received by the mic, x(m) is the user's sound, and yf(m) is the sound of the speaker [second speech signal], yf(m) is the sound to be removed [reducing the second speech signal], y(m) would be subtracted by ŷ(m) which is generated by the acoustic feedback synthesizer which receives parameters from the adaptation algorithm, result of the subtraction would be an approximation of x(m) which is then fed into the adaptation algorithm as well as the acoustic feedback synthesizer for the generation of ŷ(m) [determining the target audio data from which the second speech signal is reduced as to-be-enhanced speech data]);
Kim and Pandya are considered to be analogous because they are from the field of keyword detection.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention(s) to have modified Kim to incorporate the disclosure of Pandya in order to enhance input audio by removing audio playback from the input audio (see Pandya ¶ 43: echo cancellation used to enhance [audio] signals and remove playback audio captured by microphones).
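
The subtraction cited from Pandya ¶ 50, in which the microphone signal y(m) = x(m) + yf(m) has a synthesized estimate ŷ(m) of the speaker sound subtracted from it to approximate the user's sound x(m), can be sketched as follows. This is an illustration under stated assumptions and is not Pandya's implementation: a normalized least-mean-squares (NLMS) update is assumed here for the adaptation algorithm, which the passages cited above do not identify, and the function name `nlms_echo_cancel` and the parameters `taps`, `mu`, and `eps` are hypothetical.

```python
def nlms_echo_cancel(mic, ref, taps=4, mu=0.5, eps=1e-8):
    """Approximate x(m) by subtracting an adaptive echo estimate from y(m).

    mic: samples y(m) = x(m) + yf(m) received by the microphone.
    ref: playback samples sent to the speaker (source of the echo yf(m)).
    NLMS is an assumed choice of adaptation algorithm for this sketch.
    """
    w = [0.0] * taps      # adaptive filter weights (echo-path estimate)
    buf = [0.0] * taps    # most recent playback samples, newest first
    out = []
    for y_m, r_m in zip(mic, ref):
        buf = [r_m] + buf[:-1]
        y_hat = sum(wi * bi for wi, bi in zip(w, buf))  # synthesized echo
        e = y_m - y_hat                                 # approximation of x(m)
        norm = sum(b * b for b in buf) + eps
        # The subtraction result is fed back to adapt the weights.
        w = [wi + mu * e * bi / norm for wi, bi in zip(w, buf)]
        out.append(e)
    return out
```

In this sketch the returned samples play the role of the to-be-enhanced speech data, that is, the target audio data from which the second speech signal has been reduced.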

Regarding claim 4, Kim in view of Pandya teaches all the limitations of claim 3.
Kim further teaches:
selecting any one of the plurality of enhancement directions of a beamformer as a target enhancement direction (see Kim ¶ 250: plurality of audio streams includes one or more audio beams, multiple audio signals to obtain a single audio beam (e.g., via beamforming techniques) of the plurality of audio beams, at least one audio beam of the plurality of audio beams is obtained using source separation techniques), 
enhancing the first speech signal in the speech data set based on the beamformer (see Kim ¶ 252: electronic device modifies one audio beam of the plurality of audio beams for better speech recognition results);
and using the enhanced first speech signal as directional enhanced data in the target enhancement direction (see Kim ¶ 262: if the electronic device detects a spoken trigger in a particular audio stream (associated with directional information), the electronic device can select a speaker facing the direction associated with the particular audio stream);
filtering out environmental noise carried in the directional enhanced data based on a noise canceler and a reverb canceler, and determining the directional enhanced data from which the environmental noise is filtered out as the enhanced speech information corresponding to the speech data set (see Kim ¶ 250, 252: when there are a number of active audio sources (e.g., the user and one or more competing speakers in physical proximity to the electronic device), the electronic device is able to steer a spatial null in the direction of the interfering sources [filtering out environmental noise carried in the directional enhanced data based on a noise canceler] so as to obtain an independent representation of the audio from the source of interest [determining the directional enhanced data from which the environmental noise is filtered out as the enhanced speech information corresponding to the speech data set]; electronic device further adjusts the audio beam to minimize reverberation [reverb canceler]); and
in response to determining each of the plurality of enhancement directions is selected as the target enhancement direction, obtaining the enhanced speech information of the speech data set in the enhancement directions (see Kim ¶ 54, 250: audio circuitry converts the electrical signal to audio data and transmits the audio data to peripherals interface for processing [speech data set]; multiple audio signals are processed to obtain an audio stream of the plurality of audio streams [determining each of the plurality of enhancement directions is selected as the target enhancement direction], when there are a number of active audio sources (e.g., the user and one or more competing speakers in physical proximity to the electronic device), the electronic device is able to steer a spatial null in the direction of the interfering sources so as to obtain an independent representation of the audio from the source of interest [obtaining the enhanced speech information of the speech data set in the enhancement directions]). 
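
The per-direction enhancement mapped above for claim 4 can be sketched as follows, for illustration only. Kim is cited only for "beamforming techniques" generally, so the delay-and-sum beamformer below is an assumed example rather than Kim's method, and the function names `delay_and_sum` and `enhance_all_directions`, the integer sample delays, and the direction labels are hypothetical.

```python
def delay_and_sum(channels, delays):
    """Align each microphone channel by its steering delay and average.

    channels: equal-length lists of samples, one list per microphone.
    delays: per-channel integer delays (in samples) for one direction.
    """
    n = len(channels[0])
    out = []
    for m in range(n):
        acc = 0.0
        for ch, d in zip(channels, delays):
            k = m + d                      # advance the later-arriving channel
            acc += ch[k] if 0 <= k < n else 0.0
        out.append(acc / len(channels))
    return out


def enhance_all_directions(channels, delay_table):
    """Treat each direction in turn as the target and return one beam each."""
    return {name: delay_and_sum(channels, delays)
            for name, delays in delay_table.items()}


# A pulse reaches microphone 1 one sample after microphone 0.
mics = [[0.0, 1.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0, 0.0]]
beams = enhance_all_directions(mics, {"toward_source": [0, 1],
                                      "broadside": [0, 0]})
```

With these assumed inputs, the beam steered toward the source adds the pulse coherently, while the broadside beam spreads it over two samples at half amplitude.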

Regarding claim 5, Kim in view of Pandya teaches all the limitations of claim 4.
Kim further teaches:
a sound sub-signal transmitted by a first user and a sound sub-signal transmitted by a second user (see Kim ¶ 249: plurality of audio signals can include a spoken trigger uttered by a user [and] interfering signals including but not limited to: speech from competing speakers (e.g., people other than the user in physical proximity to the electronic device)), 
the first user being a user in the target enhancement direction, and the second user being a user in one of the plurality of enhancement directions except the target enhancement direction (see Kim ¶ 266: if the electronic device detects the spoken trigger from a particular audio beam associated with a direction toward the user [first user], the electronic device can select one or more microphones corresponding to the audio beam to sample subsequent audio signals [in the target enhancement direction], a microphone corresponds to the audio beam if it is associated with (e.g., is configured to face) the direction of the audio beam and/or if the microphone has sampled an audio signal that formed the audio beam, electronic device may suppress the recognition and interpretation of any audio signal associated with a speaker different from the user [second user being a user in one of the plurality of enhancement directions except the target enhancement direction]); and
enhancing, based on the beamformer, the sound sub-signal transmitted by the first user in the speech data set, and inhibiting, in the target enhancement direction, interference data generated by the sound sub-signal transmitted by the second user, to output the enhanced first speech signal (see Kim ¶ 250, 252:  device processes multiple audio signals to obtain a single audio beam, via beamforming techniques [based on the beamformer], of the plurality of audio beams; when there are a number of active audio sources (e.g., the user and one or more competing speakers in physical proximity to the electronic device), the electronic device is able to steer a spatial null in the direction of the interfering sources [inhibiting, in the target enhancement direction, interference data generated by the sound sub-signal transmitted by the second user] so as to obtain an independent representation of the audio from the source of interest [enhancing the sound sub-signal transmitted by the first user in the speech data set; to output the enhanced first speech signal]); and
using the enhanced first speech signal as the directional enhanced data in the target enhancement direction (see Kim ¶ 252: when there are a number of active audio sources (e.g., the user and one or more competing speakers in physical proximity to the electronic device), the electronic device is able to obtain an independent representation of the audio from the source of interest).

Regarding claim 12, Kim teaches all the limitations of claim 11.
Kim further teaches:
obtaining to-be-enhanced speech data separately corresponding to each path of audio data, in response to determining the each of the microphones in the microphone array is the target microphone (see Kim ¶ 251, 252, 253: electronic device has six microphones [microphone array], each of which samples an audio signal, and processes the six audio signals to obtain six audio beams; electronic device modifies at least one audio beam of the plurality of audio beams for better speech recognition results [determining the each of the microphones in the microphone array is the target microphone], electronic device further adjusts the audio beam to minimize echo; determining whether any of the plurality of audio signals corresponds to the spoken trigger comprises determining whether each of the plurality of audio streams includes the spoken trigger [obtaining to-be-enhanced speech data separately corresponding to each path of audio data, in response to determining the each of the microphones is the target microphone], each of the plurality of audio streams is associated with directional information, electronic device determines whether the plurality of audio signals corresponds to a spoken trigger based on the directional information); and
separately adding each piece of to-be-enhanced speech data to the speech data set (see Kim ¶ 260: electronic device passes data corresponding to the set [speech data set] of candidate audio [beams] to a software module on the electronic device).
Kim does not teach:
a first speech signal and a second speech signal, the first speech signal being a sound signal that is transmitted by a user and acquired by the microphone array, the second speech signal being a sound signal that is transmitted by the terminal and acquired by the microphone array; and the obtaining a speech data set based on the multi-path audio data comprises: 
obtaining a target microphone from the microphones of the microphone array, and using audio data that comprises the first speech signal and the second speech signal and that corresponds to the target microphone as target audio data; reducing the second speech signal in the target audio data by using an echo canceler, and determining the target audio data from which the second speech signal is reduced as to-be-enhanced speech data;
Pandya discloses:
a first speech signal and a second speech signal, the first speech signal being a sound signal that is transmitted by a user and acquired by the microphone array, the second speech signal being a sound signal that is transmitted by the terminal and acquired by the microphone array (see Pandya ¶ 42, 43, 49: capturing module include[s] at least two microphones; echo cancellation module removes playback audio captured by the microphones [acquired by the microphone array]; Fig. 3 illustrates the processes performed by the echo cancellation module, dual talk detection and removal involves approximation of user speech from the mixture of machine sound (playback sound) [second speech signal being a sound signal that is transmitted by the terminal] and user speech [first speech signal being a sound signal that is transmitted by a user]); and
obtaining a target microphone from the microphones of the microphone array (see Pandya ¶ 42, 43, 50: capturing module include[s] at least two microphones; echo cancellation module removes playback audio captured by the microphones [from the microphones of the microphone array]; referring to Fig. 3, a microphone and a speaker are situated in the same system enclosed by a room which may generate echo, audio signal received by the mic [obtaining a target microphone]),
and using audio data that comprises the first speech signal and the second speech signal and that corresponds to the target microphone as target audio data (see Pandya ¶ 42, 43, 49, 50: capturing module [has] analog to digital converters which converts analog electrical signals into digital signals [audio data]; echo cancellation module would receive audio signals from the capturing module [using audio data]; dual talk detection and removal involves approximation of user speech from the mixture of machine sound (playback sound) [second speech signal] and user speech [first speech signal]; audio signal received by the microphone [target microphone as target audio data] represented by y(m) = x(m) + yf(m) where y(m) is the audio signal received by the mic, x(m) is the user's sound, and yf(m) is the sound of the speaker);
reducing the second speech signal in the target audio data by using an echo canceler, and determining the target audio data from which the second speech signal is reduced as to-be-enhanced speech data (see Pandya ¶ 43, 50: echo cancellation module would receive audio signals from the capturing module [using an echo canceler]; FIG. 3 illustrates processes performed by the echo cancellation module, audio signal received by the microphone [target audio data] represented by y(m) = x(m) + yf(m) where y(m) is the audio signal received by the mic, x(m) is the user's sound, and yf(m) is the sound of the speaker [second speech signal], yf(m) is the sound to be removed [reducing the second speech signal], y(m) would be subtracted by ŷ(m) which is generated by the acoustic feedback synthesizer which receives parameters from the adaptation algorithm, result of the subtraction would be an approximation of x(m) which is then fed into the adaptation algorithm as well as the acoustic feedback synthesizer for the generation of ŷ(m) [determining the target audio data from which the second speech signal is reduced as to-be-enhanced speech data]);
Kim and Pandya are considered to be analogous because they are from the field of keyword detection.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention(s) to have modified Kim to incorporate the disclosure of Pandya in order to enhance input audio by removing audio playback from the input audio (see Pandya ¶ 43: echo cancellation used to enhance [audio] signals and remove playback audio captured by microphones).
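
For illustration only, the subtraction described in Pandya ¶ 50, where y(m) = x(m) + yf(m) and a synthesized ŷ(m) is subtracted from y(m) to approximate x(m), can be sketched as follows. The sketch assumes a normalized least-mean-squares (NLMS) update as the adaptation algorithm; the cited passage as summarized above does not name a particular algorithm, and the function name, tap count, and step size are hypothetical.

```python
def nlms_echo_cancel(mic, playback, taps=8, mu=0.5, eps=1e-8):
    """Return an approximation of x(m) from mic samples y(m) and the playback reference."""
    w = [0.0] * taps      # adaptive estimate of the speaker-to-microphone echo path
    buf = [0.0] * taps    # most recent playback samples, newest first
    out = []
    for y, p in zip(mic, playback):
        buf = [p] + buf[:-1]
        # Synthesized echo: plays the role of the acoustic feedback synthesizer output.
        y_hat = sum(wi * bi for wi, bi in zip(w, buf))
        # Subtraction result: approximation of the user's sound x(m).
        e = y - y_hat
        # NLMS update (assumed): the subtraction result is fed back to adapt the filter.
        norm = sum(b * b for b in buf) + eps
        w = [wi + mu * e * bi / norm for wi, bi in zip(w, buf)]
        out.append(e)
    return out
```

The returned sequence corresponds to the target audio data from which the second speech signal has been reduced, i.e., the to-be-enhanced speech data.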

Regarding claim 13, Kim in view of Pandya teaches all the limitations of claim 12.
Kim further teaches:
selecting any one of the plurality of enhancement directions of a beamformer as a target enhancement direction (see Kim ¶ 250: plurality of audio streams includes one or more audio beams, multiple audio signals to obtain a single audio beam (e.g., via beamforming techniques) of the plurality of audio beams, at least one audio beam of the plurality of audio beams is obtained using source separation techniques), 
enhancing the first speech signal in the speech data set based on the beamformer (see Kim ¶ 252: electronic device modifies one audio beam of the plurality of audio beams for better speech recognition results);
and using the enhanced first speech signal as directional enhanced data in the target enhancement direction (see Kim ¶ 262: if the electronic device detects a spoken trigger in a particular audio stream (associated with directional information), the electronic device can select a speaker facing the direction associated with the particular audio stream);
filtering out environmental noise carried in the directional enhanced data based on a noise canceler and a reverb canceler, and determining the directional enhanced data from which the environmental noise is filtered out as the enhanced speech information corresponding to the speech data set (see Kim ¶ 250, 252: when there are a number of active audio sources (e.g., the user and one or more competing speakers in physical proximity to the electronic device), the electronic device is able to steer a spatial null in the direction of the interfering sources [filtering out environmental noise carried in the directional enhanced data based on a noise canceler] so as to obtain an independent representation of the audio from the source of interest [determining the directional enhanced data from which the environmental noise is filtered out as the enhanced speech information corresponding to the speech data set]; electronic device further adjusts the audio beam to minimize reverberation [reverb canceler]); and
in response to determining each of the plurality of enhancement directions is selected as the target enhancement direction, obtaining the enhanced speech information of the speech data set in the enhancement directions (see Kim ¶ 54, 250: audio circuitry converts the electrical signal to audio data and transmits the audio data to peripherals interface for processing [speech data set]; multiple audio signals are processed to obtain an audio stream of the plurality of audio streams [determining each of the plurality of enhancement directions is selected as the target enhancement direction], when there are a number of active audio sources (e.g., the user and one or more competing speakers in physical proximity to the electronic device), the electronic device is able to steer a spatial null in the direction of the interfering sources so as to obtain an independent representation of the audio from the source of interest [obtaining the enhanced speech information of the speech data set in the enhancement directions]). 

Regarding claim 14, Kim in view of Pandya teaches all the limitations of claim 13.
Kim further teaches:
a sound sub-signal transmitted by a first user and a sound sub-signal transmitted by a second user (see Kim ¶ 249: plurality of audio signals can include a spoken trigger uttered by a user [and] interfering signals including but not limited to: speech from competing speakers (e.g., people other than the user in physical proximity to the electronic device)),
the first user being a user in the target enhancement direction, and the second user being a user in one of the plurality of enhancement directions except the target enhancement direction (see Kim ¶ 266: if the electronic device detects the spoken trigger from a particular audio beam associated with a direction toward the user [first user], the electronic device can select one or more microphones corresponding to the audio beam to sample subsequent audio signals [in the target enhancement direction], a microphone corresponds to the audio beam if it is associated with (e.g., is configured to face) the direction of the audio beam and/or if the microphone has sampled an audio signal that formed the audio beam, electronic device may suppress the recognition and interpretation of any audio signal associated with a speaker different from the user [second user being a user in one of the plurality of enhancement directions except the target enhancement direction]); and
enhancing, based on the beamformer, the sound sub-signal transmitted by the first user in the first speech signal, and inhibit, in the target enhancement direction, interference data generated by the sound sub-signal transmitted by the second user, to output the enhanced first speech signal (see Kim ¶ 250, 252:  device processes multiple audio signals to obtain a single audio beam, via beamforming techniques [based on the beamformer], of the plurality of audio beams; when there are a number of active audio sources (e.g., the user and one or more competing speakers in physical proximity to the electronic device), the electronic device is able to steer a spatial null in the direction of the interfering sources [inhibiting, in the target enhancement direction, interference data generated by the sound sub-signal transmitted by the second user] so as to obtain an independent representation of the audio from the source of interest [enhancing the sound sub-signal transmitted by the first user in the speech data set; to output the enhanced first speech signal]); and
using the enhanced first speech signal as the directional enhanced data in the target enhancement direction (see Kim ¶ 252: when there are a number of active audio sources (e.g., the user and one or more competing speakers in physical proximity to the electronic device), the electronic device is able to obtain an independent representation of the audio from the source of interest).

Regarding claim 19, Kim teaches all the limitations of claim 18.
Kim further teaches:
obtaining to-be-enhanced speech data separately corresponding to each path of audio data, in response to determining the each of the microphones in the microphone array is the target microphone (see Kim ¶ 251, 252, 253: electronic device has six microphones [microphone array], each of which samples an audio signal, and processes the six audio signals to obtain six audio beams; electronic device modifies at least one audio beam of the plurality of audio beams for better speech recognition results [determining the each of the microphones in the microphone array is the target microphone], electronic device further adjusts the audio beam to minimize echo; determining whether any of the plurality of audio signals corresponds to the spoken trigger comprises determining whether each of the plurality of audio streams includes the spoken trigger [obtaining to-be-enhanced speech data separately corresponding to each path of audio data, in response to determining the each of the microphones is the target microphone], each of the plurality of audio streams is associated with directional information, electronic device determines whether the plurality of audio signals corresponds to a spoken trigger based on the directional information); and
separately adding each piece of to-be-enhanced speech data to the speech data set (see Kim ¶ 260: electronic device passes data corresponding to the set [speech data set] of candidate audio [beams] to a software module on the electronic device).
Kim does not teach:
a first speech signal and a second speech signal, the first speech signal being a sound signal that is transmitted by a user and acquired by the microphone array, the second speech signal being a sound signal that is transmitted by the terminal and acquired by the microphone array; and the obtaining a speech data set based on the multi-path audio data comprises: 
obtaining a target microphone from the microphones of the microphone array, and using audio data that comprises the first speech signal and the second speech signal and that corresponds to the target microphone as target audio data; reducing the second speech signal in the target audio data by using an echo canceler, and determining the target audio data from which the second speech signal is reduced as to-be-enhanced speech data;
Pandya discloses:
a first speech signal and a second speech signal, the first speech signal being a sound signal that is transmitted by a user and acquired by the microphone array, the second speech signal being a sound signal that is transmitted by the terminal and acquired by the microphone array (see Pandya ¶ 42, 43, 49: capturing module include[s] at least two microphones; echo cancellation module removes playback audio captured by the microphones [acquired by the microphone array]; Fig. 3 illustrates the processes performed by the echo cancellation module, dual talk detection and removal involves approximation of user speech from the mixture of machine sound (playback sound) [second speech signal being a sound signal that is transmitted by the terminal] and user speech [first speech signal being a sound signal that is transmitted by a user]); and
obtaining a target microphone from the microphones of the microphone array (see Pandya ¶ 42, 43, 50: capturing module include[s] at least two microphones; echo cancellation module removes playback audio captured by the microphones [from the microphones of the microphone array]; referring to Fig. 3, a microphone and a speaker are situated in the same system enclosed by a room which may generate echo, audio signal received by the mic [obtaining a target microphone]),
and using audio data that comprises the first speech signal and the second speech signal and that corresponds to the target microphone as target audio data (see Pandya ¶ 42, 43, 49, 50: capturing module [has] analog to digital converters which converts analog electrical signals into digital signals [audio data]; echo cancellation module would receive audio signals from the capturing module [using audio data]; dual talk detection and removal involves approximation of user speech from the mixture of machine sound (playback sound) [second speech signal] and user speech [first speech signal]; audio signal received by the microphone [target microphone as target audio data] represented by y(m) = x(m) + yf(m) where y(m) is the audio signal received by the mic, x(m) is the user's sound, and yf(m) is the sound of the speaker);
reducing the second speech signal in the target audio data by using an echo canceler, and determining the target audio data from which the second speech signal is reduced as to-be-enhanced speech data (see Pandya ¶ 43, 50: echo cancellation module would receive audio signals from the capturing module [using an echo canceler]; FIG. 3 illustrates processes performed by the echo cancellation module, audio signal received by the microphone [target audio data] represented by y(m) = x(m) + yf(m) where y(m) is the audio signal received by the mic, x(m) is the user's sound, and yf(m) is the sound of the speaker [second speech signal], yf(m) is the sound to be removed [reducing the second speech signal], y(m) would be subtracted by ŷ(m) which is generated by the acoustic feedback synthesizer which receives parameters from the adaptation algorithm, result of the subtraction would be an approximation of x(m) which is then fed into the adaptation algorithm as well as the acoustic feedback synthesizer for the generation of ŷ(m) [determining the target audio data from which the second speech signal is reduced as to-be-enhanced speech data]);
Kim and Pandya are considered to be analogous because they are from the field of keyword detection.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention(s) to have modified Kim to incorporate the disclosure of Pandya in order to enhance input audio by removing audio playback from the input audio (see Pandya ¶ 43: echo cancellation used to enhance [audio] signals and remove playback audio captured by microphones).

Regarding claim 20, Kim in view of Pandya teaches all the limitations of claim 19.
Kim further teaches:
selecting any one of the plurality of enhancement directions of a beamformer as a target enhancement direction (see Kim ¶ 250: plurality of audio streams includes one or more audio beams, multiple audio signals to obtain a single audio beam (e.g., via beamforming techniques) of the plurality of audio beams, at least one audio beam of the plurality of audio beams is obtained using source separation techniques), 
enhancing the first speech signal in the speech data set based on the beamformer (see Kim ¶ 252: electronic device modifies one audio beam of the plurality of audio beams for better speech recognition results);
and using the enhanced first speech signal as directional enhanced data in the target enhancement direction (see Kim ¶ 262: if the electronic device detects a spoken trigger in a particular audio stream (associated with directional information), the electronic device can select a speaker facing the direction associated with the particular audio stream);
filtering out environmental noise carried in the directional enhanced data based on a noise canceler and a reverb canceler, and determining the directional enhanced data from which the environmental noise is filtered out as the enhanced speech information corresponding to the speech data set (see Kim ¶ 250, 252: when there are a number of active audio sources (e.g., the user and one or more competing speakers in physical proximity to the electronic device), the electronic device is able to steer a spatial null in the direction of the interfering sources [filtering out environmental noise carried in the directional enhanced data based on a noise canceler] so as to obtain an independent representation of the audio from the source of interest [determining the directional enhanced data from which the environmental noise is filtered out as the enhanced speech information corresponding to the speech data set]; electronic device further adjusts the audio beam to minimize reverberation [reverb canceler]); and
in response to determining each of the plurality of enhancement directions is selected as the target enhancement direction, obtaining the enhanced speech information of the speech data set in the enhancement directions (see Kim ¶ 54, 250: audio circuitry converts the electrical signal to audio data and transmits the audio data to peripherals interface for processing [speech data set]; multiple audio signals are processed to obtain an audio stream of the plurality of audio streams [determining each of the plurality of enhancement directions is selected as the target enhancement direction], when there are a number of active audio sources (e.g., the user and one or more competing speakers in physical proximity to the electronic device), the electronic device is able to steer a spatial null in the direction of the interfering sources so as to obtain an independent representation of the audio from the source of interest [obtaining the enhanced speech information of the speech data set in the enhancement directions]). 

Claims 8-9 are rejected under 35 U.S.C. 103 as being unpatentable over Kim et al. (US 2018/0336892; hereafter Kim) in view of Fujimura (US 2018/0268809).
Regarding claim 8, Kim teaches all the limitations of claim 7.
Kim further teaches:
obtaining the speech hidden feature in the target audio direction from the first wake-up detection model (see Kim ¶ 214, 254: speech pre-processor [first wake-up detection model] extracts representative features [speech hidden feature] from the speech input; electronic device determines whether any of the plurality of audio signals corresponds to a spoken trigger [in the target audio direction]).
Kim does not teach:
wherein the speech spectrum feature in the enhanced speech information is extracted by a second wake-up detection model; and splicing the speech spectrum feature and the speech hidden feature in the target audio direction, to obtain a spliced vector feature; inputting the spliced vector feature into the second wake-up detection model, outputting a degree of matching between the spliced vector feature and a target wake-up feature in the second wake-up detection model, and generating the target authentication result according to the degree of matching outputted by the second wake-up detection model; and waking up the terminal in response to determining the degree of matching in the target authentication result is greater than or equal to a matching threshold corresponding to the target matching word.
Fujimura discloses:
wherein the speech spectrum feature in the enhanced speech information is extracted by a second wake-up detection model (see Fujimura ¶ 61, 124: voice acquisition module extracts MFCC feature [speech spectrum feature] from the samples [enhanced speech information]; voice keyword detection program [wake-up detection] of the embodiment includes a voice acquisition module [a second wake-up detection model]); and
splicing the speech spectrum feature and the speech hidden feature in the target audio direction, to obtain a spliced vector feature (see Fujimura ¶ 36, 59, 61: application detects a target keyword voice [in the target audio direction] from a voice waveform and causes a device to operate in accordance with the keyword; voice acquisition module generates a feature vector; voice acquisition module extracts MFCC feature [speech spectrum feature] from the samples, module buffers the MFCC features for [a designated number of] frames and outputs thirty-six dimensional feature obtained by concatenating the MFCC features [splicing] for the [designated number of] frames as a feature at a time of a central frame in the [designated number of] frames [to obtain a spliced vector feature], the extracted feature is not limited to MFCC [e.g.] the RSTA-PLP feature [speech hidden feature], the features may be combined [splicing the speech spectrum feature and the speech hidden feature in the target audio direction]);
inputting the spliced vector feature into the second wake-up detection model (see Fujimura ¶ 59, 61: voice acquisition module [second wake-up detection model] generates a feature vector; voice acquisition module extracts MFCC feature from the samples, module buffers the MFCC features for [a designated number of] frames and outputs thirty-six dimensional feature obtained by concatenating the MFCC features [splicing; inputting the spliced vector feature]),
outputting a degree of matching between the spliced vector feature and a target wake-up feature in the second wake-up detection model (see Fujimura ¶ 62, 68: keyword score calculation module receives a voice feature generated by the voice acquisition module [spliced vector feature] and calculates a keyword/sub-keyword score [outputting a degree of matching between the spliced vector feature and a target wake-up feature in the second wake-up detection model]; keyword/sub-keyword model can be modeled by phonological representation units [target wake-up feature]), 
and generating the target authentication result according to the degree of matching outputted by the second wake-up detection model (see Fujimura ¶ 69: keyword detection module compares a keyword/sub-keyword score with a set threshold score [according to the degree of matching outputted by the second wake-up detection model] and determines whether there is a keyword or a sub-keyword [generating the target authentication result] having a score exceeding the threshold score); and
waking up the terminal in response to determining the degree of matching in the target authentication result is greater than or equal to a matching threshold corresponding to the target matching word (see Fujimura ¶ 24, 69: server detects a keyword from the voice data received from the client by using the keyword detection function and transmits the keyword to the client via the network 3, as a result the client can start a specific operation corresponding to the detected keyword [waking up the terminal in response to]; keyword detection module compares a keyword/sub-keyword score with a set threshold score and determines whether there is a keyword or a sub-keyword having a score exceeding the threshold score [determining the degree of matching in the target authentication result is greater than or equal to a matching threshold corresponding to the target matching word]).
Kim and Fujimura are considered to be analogous because they are from the field of keyword detection.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention(s) to have modified Kim to incorporate the disclosure of Fujimura in order to quickly and correctly detect a wake-word in a speech audio input (see Fujimura ¶ 39: as a result, a keyword can be quickly and correctly detected from a voice).
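
For illustration only, the frame buffering and concatenation relied upon in Fujimura ¶ 61 can be sketched as follows. The cited passage gives the thirty-six dimensional result but the frame count is elided above; the sketch assumes twelve coefficients per frame over three frames purely to arrive at thirty-six dimensions, and the function names are hypothetical.

```python
from collections import deque


def splice_frames(frames, context=3):
    """Yield one concatenated vector per full window of `context` consecutive frames."""
    window = deque(maxlen=context)   # buffer of the most recent frames
    for frame in frames:
        window.append(list(frame))
        if len(window) == context:
            # Concatenate the buffered per-frame features into a single vector.
            yield [value for buffered in window for value in buffered]


def combine_features(spectrum_feature, other_feature):
    """Join two feature vectors end to end (one simple way features may be combined)."""
    return list(spectrum_feature) + list(other_feature)


# Hypothetical usage: five frames of 12 coefficients each (assumed sizes).
frames = [[float(i)] * 12 for i in range(5)]
spliced = list(splice_frames(frames, context=3))
```

With these assumed sizes each spliced vector has 36 entries, and a new vector is produced for every frame once the buffer is full.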

Regarding claim 9, Kim in view of Fujimura teaches all the limitations of claim 8.
Kim further teaches:
performing the operation of obtaining multi-path audio data in an environmental space (see Kim ¶ 248, 268: electronic device samples an audio signal at each of a plurality of microphones of the electronic device to obtain a plurality of audio signals [multi-path audio data]; above-described techniques can be used to select microphones to sample audio signals (e.g., as the user moves around the room) [an environmental space]).
Kim does not teach:
determining that authentication fails in response to determining the degree of matching in the target authentication result is less than the matching threshold corresponding to the target matching word.
Fujimura discloses:
determining that authentication fails in response to determining the degree of matching in the target authentication result is less than the matching threshold corresponding to the target matching word (see Fujimura ¶ 69, 71: keyword detection module compares a keyword/sub-keyword score with a set threshold score and determines whether there is a keyword [corresponding to the target matching word] or a sub-keyword having a score exceeding the threshold score; when a keyword exceeding the threshold is not detected [determining that authentication fails in response to determining the degree of matching in the target authentication result is less than the matching threshold], the process for detecting a keyword from voice data is continued).
Kim and Fujimura are considered to be analogous because they are from the field of keyword detection.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention(s) to have modified Kim to incorporate the disclosure of Fujimura in order to quickly and correctly detect a wake-word in a speech audio input (see Fujimura ¶ 39: as a result, a keyword can be quickly and correctly detected from a voice).
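
For illustration only, the threshold comparison mapped above for claims 8 and 9 can be sketched as follows. The sketch uses the claimed comparison (greater than or equal to the matching threshold), whereas Fujimura ¶ 69 is summarized above as a score exceeding the threshold; the function names and the example scores are hypothetical.

```python
def authenticate(score, threshold):
    """Return True when the degree of matching meets the matching threshold."""
    return score >= threshold


def detect(scores, threshold):
    """Scan successive scores; return the index of the first passing one, else None.

    A failing score does not end the scan: detection simply continues with the
    next score, mirroring the continued keyword detection in Fujimura ¶ 71.
    """
    for index, score in enumerate(scores):
        if authenticate(score, threshold):
            return index   # authentication succeeds: wake the terminal here
    return None            # every score was below the threshold


# Hypothetical usage with made-up scores and a made-up threshold of 0.8.
first_hit = detect([0.1, 0.4, 0.8, 0.9], threshold=0.8)
no_hit = detect([0.1, 0.2], threshold=0.8)
```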


Conclusion
Any inquiry concerning this communication or earlier communications from Examiner should be directed to AARON G. ZELLER whose telephone number is (571) 272-5765.  Examiner can normally be reached Monday - Thursday 10 AM - 7:30 PM and every other Friday 10:00 AM - 6:30 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool.  To schedule an interview, Applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach Examiner by telephone are unsuccessful, Examiner’s supervisor, Pierre-Louis Desir, can be reached at (571) 272-7799.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center.  Unpublished application information in Patent Center is available to registered users.  To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov.  Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format.  For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).  If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/AARON G ZELLER/
Examiner, Art Unit 2659
30 June 2022

/PIERRE LOUIS DESIR/
Supervisory Patent Examiner, Art Unit 2659