Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 03/28/2019 is being considered by the examiner.
Drawings
The drawing submitted on 03/28/2019 is being considered by the examiner.
Response to Amendment
Claims 1-5, 7-12, 14-19, and 21-22 are currently pending. Claims 1, 10, 15, and 22 are independent claims. Claims 1, 10, 12, 15, and 22 have been amended, and claims 6, 13, and 20 have been cancelled.
Response to Arguments
Applicant’s arguments with respect to claim(s) 1, 10, 15 and 22 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 1, 7, 9-10, 14-15 and 21-22 are rejected under 35 U.S.C. 103 as being unpatentable over Koishida et al. (US 2018/0233142 A1) in view of Hatfield et al. (US 2016/0322045 A1).

Regarding Claims 1 and 15, Koishida et al. teach: An acoustic environment aware method for selecting a high quality audio stream during multi-stream speech recognition, comprising: receiving by a processor a plurality of audio streams ([0062] Accordingly, the entity tracker 100 may use a variety of audio processing techniques to more confidently identify a particular active participant who is engaged in a conversation with other people and/or with the intelligent assistant computer 20. As an example, the entity tracker 100 may implement a voice activity detection (VAD) engine that may distinguish human voices from environmental noise, and identify the presence or absence of human speech. [0066] In this example, entity tracker 100 receives two speech fragments 400A and 400B. Speech fragment 400A includes recorded speech of a person 1, and speech fragment 400B includes recorded speech of a person 2.); determining in a first pass (voice activity detection (VAD) engine) whether at least one audio stream of the audio streams includes a voice trigger ([0062] As an example, the entity tracker 100 may implement a voice activity detection (VAD) engine that may distinguish human voices from environmental noise, and identify the presence or absence of human speech. [0077] The speech may include any suitable utterance that can be recognized and used to trigger the performance of a computing device action by the intelligent assistant. In some scenarios, the speech may include a keyword directing the intelligent assistant to analyze the speech spoken by the first user.); in response to determining that the at least one audio stream includes a voice trigger, for each audio stream of the plurality of audio streams: generating a voice trigger score (keyword confidence value) associated with the audio stream ([0063] General-purpose VAD engines may be used for the purpose of classifying a particular segment of audio as including either speech or non-speech, with a corresponding confidence value. 
[0078] As specific examples, score determination may include evaluating one or more of (1) the amplitude of recorded speech, (2) the signal-to-noise ratio (SNR) of recorded speech, (3) a keyword confidence value indicating a likelihood that the recorded speech includes a keyword or keyword phrase, and (4) a user identification confidence value indicating a likelihood that the user is a particular person--e.g., that the user's identity is a known identity.); calculating an acoustic environment measurement (SNR) associated with the audio stream ([0078] As specific examples, score determination may include evaluating one or more of (1) the amplitude of recorded speech, (2) the signal-to-noise ratio (SNR) of recorded speech, (3) a keyword confidence value indicating a likelihood that the recorded speech includes a keyword or keyword phrase, and (4) a user identification confidence value indicating a likelihood that the user is a particular person--e.g., that the user's identity is a known identity. [0080] The SNR may be calculated for the recorded speech by comparing a signal level of a user's voice to a level of background noise.); and calculating a combined score based on the voice trigger score (keyword/keyword phrase confidence score) associated with the audio stream and the acoustic environment measurement (SNR) associated with the audio stream ([0081] In some examples, a selection score may be determined by combining the four metrics described above (amplitude, SNR, keyword/keyword phrase confidence, user ID confidence) into a single selection score, such as by averaging the metrics.); and outputting a preferred audio stream of the plurality of audio streams having a highest combined score ([0123] Device selector 174 may be configured to implement at least a portion of selection module 80 (FIG. 7) and method 500 (FIGS. 8A-8B). 
For example, device selector 174 may receive audio data streams from multiple intelligent assistants located in an environment, determine selection scores for each assistant, identify the assistant that produced the highest score, and cause transmission of an instruction to the highest-scoring assistant to respond to a requesting user in the environment. In other examples, the intelligent assistants may determine respective selection scores and transmit the scores to the remote services 170, which may identify the highest-scoring assistant and transmit an instruction to that assistant causing its response to the requesting user.).
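For illustration only, the selection-score mapping relied on above (averaging the four metrics per [0081] and identifying the highest-scoring stream per [0123]) can be sketched as follows. This is not code from Koishida et al. or from the application. The names StreamMetrics, selection_score, and select_stream are hypothetical, and the sketch assumes each metric has already been normalized to a common 0-to-1 range, which the cited paragraphs do not specify.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass
class StreamMetrics:
    """Per-stream metrics named in [0078]; normalization to 0..1 is an assumption."""
    stream_id: str
    amplitude: float
    snr: float
    keyword_confidence: float
    user_id_confidence: float


def selection_score(m: StreamMetrics) -> float:
    # Combine the four metrics into a single score by averaging ([0081]).
    return mean([m.amplitude, m.snr, m.keyword_confidence, m.user_id_confidence])


def select_stream(streams: Sequence[StreamMetrics]) -> StreamMetrics:
    # Identify the stream that produced the highest selection score ([0123]).
    return max(streams, key=selection_score)


if __name__ == "__main__":
    a = StreamMetrics("A", 0.5, 0.4, 0.9, 0.8)
    b = StreamMetrics("B", 0.6, 0.7, 0.8, 0.9)
    print(select_stream([a, b]).stream_id)
```

Averaging is only one of the combinations mentioned in the cited paragraph; a weighted combination would fit the same structure.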
Koishida et al., however, do not teach: determining in a second pass whether the voice trigger is present in the selected stream; and in response to determining in the second pass that the voice trigger is present in the selected stream, outputting the selected stream as a preferred audio stream.

Hatfield et al. teach: determining in a first pass (first trigger detection block 70) whether at least one audio stream of the audio streams includes a voice trigger; determining in a second pass (Second trigger detection block 72) whether the voice trigger is present in the selected stream; and in response to determining in the second pass that the voice trigger is present in the selected stream outputting the selected stream as a preferred audio stream ([0120] As discussed with reference to FIG. 2, input data might be sent continually to the buffer 38, the first trigger detection block 70, and the second trigger detection block 72, or an activity detection block might be provided, such that data is sent to or accepted by or processed by the buffer 38, the first trigger detection block 70, and the second trigger detection block 72, only when it is determined that the input signal contains some minimal signal activity. [0121] The first trigger detection block 70 detects whether or not the received signal contains data representing a spoken trigger phrase, using relatively loose detection criteria, meaning that the first trigger detection block 70 has a very high probability of recognizing the trigger phrase in the data, but with a correspondingly higher risk of a false positive (that is detecting the presence of a trigger phrase that was not in fact spoken). The second trigger detection block 72 also detects whether or not the received signal contains data representing a spoken trigger phrase, but using relatively tight detection criteria, meaning that the second trigger detection block 70 has a lower risk of producing a false positive detection. The first trigger detection block may be less complex than the second trigger detection block, and may therefore consume less power and/or be less computationally intensive when active. The second trigger detection block may be activated only after the first trigger detection block has detected a likely trigger phrase. 
[0143] As shown in FIG. 9, the output of the speech enhancement block 48 is supplied to the input of the second trigger detection block 72. Thus, in step 164 of the process shown in FIG. 11, the second trigger detection block 72 performs a trigger detection process on the output TP* of the speech enhancement block 48 resulting from the data TP read out from storage in the buffer 38, with the speech enhancement block 48 using the frozen, or only slowly converging, coefficients. [0144] The second trigger detection block 72 may be configured so that it detects the presence of data representing a specified trigger phrase in the data that it receives, or may be configured so that it detects the presence of data representing a specified trigger phrase, when spoken by a particular speaker. [0145] In this embodiment, the second trigger detection block 72 benefits from the fact that it is acting on an input signal TP* that has passed through the speech enhancement block 48, and therefore has reduced noise levels. The reduced noise levels may also make it feasible to provide a more reliable speaker recognition function in this block, to verify not only the presence of the defined trigger phrase but also to verify the identity of the person speaking it.)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Koishida et al. to include the teaching of Hatfield et al. above in order to provide a more reliable speaker recognition function that verifies not only the presence of the defined trigger phrase but also the identity of the person speaking it.
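For illustration only, the two-stage arrangement quoted from Hatfield et al. [0121] (a first detector with relatively loose criteria, and a second detector with relatively tight criteria that is activated only after the first detects a likely trigger phrase) can be sketched as follows. The function name two_pass_trigger, the scorer callables, and the threshold values are hypothetical and are not taken from the reference.

```python
from typing import Callable, Sequence

# A scorer maps audio samples to a trigger-phrase confidence in 0..1 (assumed).
Scorer = Callable[[Sequence[float]], float]


def two_pass_trigger(
    audio: Sequence[float],
    first_pass: Scorer,
    second_pass: Scorer,
    loose_threshold: float = 0.3,   # assumed value: permissive first pass
    tight_threshold: float = 0.8,   # assumed value: strict second pass
) -> bool:
    """Return True only if both passes detect the trigger phrase."""
    # First pass: inexpensive check with loose criteria (more false positives).
    if first_pass(audio) < loose_threshold:
        return False  # the second pass is never activated
    # Second pass: runs only after a likely detection, with tight criteria.
    return second_pass(audio) >= tight_threshold
```

Because the second scorer is called only when the first pass succeeds, the more computationally intensive detector stays idle for most input, which corresponds to the power consideration noted in the quoted paragraph.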

Regarding Claims 7 and 21, Koishida et al. teach: The method of claim 1, wherein the acoustic environment measurement comprises at least one of a signal to noise ratio, a direct to reverberant ratio, an audio signal level, or a direction of arrival of the voice trigger ([0063] General-purpose VAD engines may be used for the purpose of classifying a particular segment of audio as including either speech or non-speech, with a corresponding confidence value. [0078] As specific examples, score determination may include evaluating one or more of (1) the amplitude of recorded speech, (2) the signal-to-noise ratio (SNR) of recorded speech, (3) a keyword confidence value indicating a likelihood that the recorded speech includes a keyword or keyword phrase, and (4) a user identification confidence value indicating a likelihood that the user is a particular person--e.g., that the user's identity is a known identity.).

Regarding Claim 9, Koishida et al. teach: The method of claim 1, further comprising determining whether the voice trigger was spoken by a desired speaker ([0062] Accordingly, the entity tracker 100 may use a variety of audio processing techniques to more confidently identify a particular active participant who is engaged in a conversation with other people and/or with the intelligent assistant computer 20. As an example, the entity tracker 100 may implement a voice activity detection (VAD) engine that may distinguish human voices from environmental noise, and identify the presence or absence of human speech. [0063] General-purpose VAD engines may be used for the purpose of classifying a particular segment of audio as including either speech or non-speech, with a corresponding confidence value. An entity tracker 100 also may utilize a speaker recognition engine to match a particular audio segment with a particular person.).

Regarding Claim 10, Koishida et al. teach: An acoustic environment aware method for selecting a high quality audio stream during multi-stream speech recognition, comprising: receiving, by a first pass voice trigger detector(VAD or entity tracker 100), a plurality of audio streams([0062] Accordingly, the entity tracker 100 may use a variety of audio processing techniques to more confidently identify a particular active participant who is engaged in a conversation with other people and/or with the intelligent assistant computer 20. As an example, the entity tracker 100 may implement a voice activity detection (VAD) engine that may distinguish human voices from environmental noise, and identify the presence or absence of human speech. [0066] In this example, entity tracker 100 receives two speech fragments 400A and 400B. Speech fragment 400A includes recorded speech of a person 1, and speech fragment 400B includes recorded speech of a person 2.); determining, by the first pass voice trigger detector, whether at least one of the audio streams includes a voice trigger ([0062] As an example, the entity tracker 100 may implement a voice activity detection (VAD) engine that may distinguish human voices from environmental noise, and identify the presence or absence of human speech. [0077] The speech may include any suitable utterance that can be recognized and used to trigger the performance of a computing device action by the intelligent assistant. In some scenarios, the speech may include a keyword directing the intelligent assistant to analyze the speech spoken by the first user.); in response to determining that at least one of the audio streams includes a determined voice trigger: generating a voice trigger score (keyword confidence value) ([0063] General-purpose VAD engines may be used for the purpose of classifying a particular segment of audio as including either speech or non-speech, with a corresponding confidence value. 
[0078] As specific examples, score determination may include evaluating one or more of (1) the amplitude of recorded speech, (2) the signal-to-noise ratio (SNR) of recorded speech, (3) a keyword confidence value indicating a likelihood that the recorded speech includes a keyword or keyword phrase, and (4) a user identification confidence value indicating a likelihood that the user is a particular person--e.g., that the user's identity is a known identity.); calculating a signal to noise ratio by utilizing the determined voice trigger as an anchor([0078] As specific examples, score determination may include evaluating one or more of (1) the amplitude of recorded speech, (2) the signal-to-noise ratio (SNR) of recorded speech, (3) a keyword confidence value indicating a likelihood that the recorded speech includes a keyword or keyword phrase, and (4) a user identification confidence value indicating a likelihood that the user is a particular person--e.g., that the user's identity is a known identity. [0080] The SNR may be calculated for the recorded speech by comparing a signal level of a user's voice to a level of background noise.); for each audio stream of the plurality of audio streams, calculating a combined score based on the voice trigger score associated with the audio stream and the signal to noise ratio associated with the audio stream([0081] In some examples, a selection score may be determined by combining the four metrics described above (amplitude, SNR, keyword/keyword phrase confidence, user ID confidence) into a single selection score, such as by averaging the metrics. ); and selecting an audio stream with highest combined score; and outputting the selected audio stream([0123] Device selector 174 may be configured to implement at least a portion of selection module 80 (FIG. 7) and method 500 (FIGS. 8A-8B). 
For example, device selector 174 may receive audio data streams from multiple intelligent assistants located in an environment, determine selection scores for each assistant, identify the assistant that produced the highest score, and cause transmission of an instruction to the highest-scoring assistant to respond to a requesting user in the environment. In other examples, the intelligent assistants may determine respective selection scores and transmit the scores to the remote services 170, which may identify the highest-scoring assistant and transmit an instruction to that assistant causing its response to the requesting user.).
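For illustration only, the SNR comparison described in the cited [0080] (a signal level of the user's voice compared to a level of background noise) can be sketched as follows. Treating the samples that precede the trigger as the noise estimate and the trigger segment as the voice estimate is an assumption made for this sketch; neither Koishida et al. nor the claim language quoted above specifies that particular computation. The names mean_power and snr_db are hypothetical.

```python
import math
from typing import Sequence


def mean_power(samples: Sequence[float]) -> float:
    # Average of squared sample values.
    return sum(s * s for s in samples) / len(samples)


def snr_db(samples: Sequence[float], trigger_start: int, trigger_end: int) -> float:
    """SNR in dB, using the trigger segment as the voice level (assumption)."""
    noise = samples[:trigger_start]             # assumed: pre-trigger samples are noise
    voice = samples[trigger_start:trigger_end]  # assumed: trigger segment is voice
    if not noise or not voice:
        raise ValueError("need non-empty noise and voice segments")
    noise_power = mean_power(noise)
    voice_power = mean_power(voice)
    if noise_power == 0.0:
        return math.inf
    if voice_power == 0.0:
        return -math.inf
    return 10.0 * math.log10(voice_power / noise_power)
```

With a pre-trigger segment at amplitude 0.1 and a trigger segment at amplitude 1.0, the power ratio is 100, which corresponds to about 20 dB.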
Koishida et al., however, do not teach: determining in a second pass whether the voice trigger is present in the selected stream; and in response to determining in the second pass that the voice trigger is present, outputting the selected stream as a preferred audio stream for speech recognition analysis.

Hatfield et al. teach: determining in a first pass (first trigger detection block 70) whether at least one audio stream of the audio streams includes a voice trigger; determining in a second pass (Second trigger detection block 72) whether the voice trigger is present in the selected stream; and in response to determining in the second pass that the voice trigger is present, outputting the selected stream as a preferred audio stream for speech recognition analysis ([0120] As discussed with reference to FIG. 2, input data might be sent continually to the buffer 38, the first trigger detection block 70, and the second trigger detection block 72, or an activity detection block might be provided, such that data is sent to or accepted by or processed by the buffer 38, the first trigger detection block 70, and the second trigger detection block 72, only when it is determined that the input signal contains some minimal signal activity. [0121] The first trigger detection block 70 detects whether or not the received signal contains data representing a spoken trigger phrase, using relatively loose detection criteria, meaning that the first trigger detection block 70 has a very high probability of recognizing the trigger phrase in the data, but with a correspondingly higher risk of a false positive (that is detecting the presence of a trigger phrase that was not in fact spoken). The second trigger detection block 72 also detects whether or not the received signal contains data representing a spoken trigger phrase, but using relatively tight detection criteria, meaning that the second trigger detection block 70 has a lower risk of producing a false positive detection. The first trigger detection block may be less complex than the second trigger detection block, and may therefore consume less power and/or be less computationally intensive when active. 
The second trigger detection block may be activated only after the first trigger detection block has detected a likely trigger phrase. [0143] As shown in FIG. 9, the output of the speech enhancement block 48 is supplied to the input of the second trigger detection block 72. Thus, in step 164 of the process shown in FIG. 11, the second trigger detection block 72 performs a trigger detection process on the output TP* of the speech enhancement block 48 resulting from the data TP read out from storage in the buffer 38, with the speech enhancement block 48 using the frozen, or only slowly converging, coefficients. [0144] The second trigger detection block 72 may be configured so that it detects the presence of data representing a specified trigger phrase in the data that it receives, or may be configured so that it detects the presence of data representing a specified trigger phrase, when spoken by a particular speaker. [0145] In this embodiment, the second trigger detection block 72 benefits from the fact that it is acting on an input signal TP* that has passed through the speech enhancement block 48, and therefore has reduced noise levels. The reduced noise levels may also make it feasible to provide a more reliable speaker recognition function in this block, to verify not only the presence of the defined trigger phrase but also to verify the identity of the person speaking it.)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Koishida et al. to include the teaching of Hatfield et al. above in order to provide a more reliable speaker recognition function that verifies not only the presence of the defined trigger phrase but also the identity of the person speaking it.


Regarding Claim 14, Koishida et al. teach: The method of claim 10, wherein the selected audio stream includes a payload, wherein the payload (“Move my game from the den to this TV and pause it”) is speech that comes after the voice trigger (“Hey computer”) ([0018] As a particular example, FIG. 1 shows a first user 8 in living room 2 providing natural language input to request the transfer of an instance of a computer game from one computing device to another. Using this data, the computing device may automatically transfer the instance of the computer game to the other device.).

Regarding Claim 22: An acoustic environment aware system for selecting a high quality audio stream during multi-stream speech recognition, comprising: a processor; and memory having stored therein instructions that when executed by the processor receives, by a first pass voice trigger detector, a plurality of audio streams; determines by the first pass voice trigger detector whether at least one of the audio streams includes a voice trigger; in response to determining that at least one of the audio streams includes a voice trigger: generates a voice trigger score for each audio stream of the plurality of audio streams; calculates a signal to noise ratio for each audio stream by utilizing the voice trigger for the audio stream as an anchor; for each audio stream of the plurality of audio streams, calculates a combined score based on the voice trigger score associated with the audio stream and the signal to noise ratio associated with the audio stream; and selects one of the plurality of audio streams that has a highest combined score and then determines in a second pass whether the voice trigger is present in the selected stream; and in response to determining in the second pass that the voice trigger is present in the selected stream outputs the selected stream as a preferred audio stream. (See the rejection of either Claim 1 or Claim 10 above.)

Claims 2-5 and 16-19 (and, as an alternate rejection, claims 7 and 21) are rejected under 35 U.S.C. 103 as being unpatentable over Koishida et al. in view of Hatfield et al. and further in view of Ramprasad et al. (US 2018/0033447 A1).

Regarding Claims 2 and 16, Koishida et al. teach: The method of claim 1, wherein the plurality of audio streams include an at least one beamformed audio stream ([0070] By allowing the entity tracker 100 to recognize irrelevant background noise, the ability of the entity tracker to recognize relevant human speech and other sounds may be improved. In some implementations, positional knowledge of a sound source may be used to focus listening from a directional microphone array.).
Koishida et al. do not teach: The method of claim 1, wherein the plurality of audio streams include an at least one beamformed audio stream and an at least one blind source separation audio stream.
Hatfield et al. teach: wherein the plurality of audio streams include an at least one beamformed audio stream ([0123] In this illustrated embodiment, the speech enhancement block 48 takes the form of a beamformer, which receives data from multiple microphone sources (which may advantageously be at least somewhat directional, and located on the host device such that they detect sounds from different directions), and generates an output signal in the form of a selection and/or combination of the input signals. The output signal may for example be obtained from the input signals by applying different weightings and phasings to the input signals. Thus, in moderately noisy environments, the output signal can emphasise the signal from one or more microphone that is directed generally towards the speaker, and can suppress the signal from one or more microphone that is directed towards a source of background noise, in order to produce an output signal that has a higher signal to noise ratio than would be achievable using any single one of the microphones alone. [0124] In the case of an enhancement block in the form of a beamformer, the training or adaptation configures the directionality of the beamformer for example. By training the algorithm using audio data from multiple microphones, it is possible to identify speech sources and to configure the beamformer's filters such that they enhance audio content from the direction of the loudest speech source and attenuate audio from other sources. [0132] As discussed above, the enhancement block 48 may be a beamformer in this example, and so the process of adaptation involves selecting the weightings and phasings applied to the multiple microphone signals, in order to generate an output signal that has a higher signal to noise ratio.).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Koishida et al. to include the teaching of Hatfield et al. above in order to generate an output signal that has a higher signal to noise ratio.
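For illustration only, the beamformer operation quoted from Hatfield et al. [0123] (an output signal obtained by applying different weightings and phasings to the input signals) can be sketched as a weighted delay-and-sum. Modeling the phasings as whole-sample delays is a simplifying assumption for this sketch, and the function name weighted_delay_sum is hypothetical; the reference does not disclose this particular implementation.

```python
from typing import List, Sequence


def weighted_delay_sum(
    channels: Sequence[Sequence[float]],
    weights: Sequence[float],
    delays: Sequence[int],
) -> List[float]:
    """Combine microphone channels using per-channel weights and integer delays."""
    n = min(len(ch) for ch in channels)
    out: List[float] = []
    for t in range(n):
        acc = 0.0
        for ch, w, d in zip(channels, weights, delays):
            idx = t - d  # delay channel by d samples (assumed integer phasing)
            if 0 <= idx < len(ch):
                acc += w * ch[idx]
        out.append(acc)
    return out


if __name__ == "__main__":
    mic1 = [1.0, 2.0, 3.0, 4.0]
    mic2 = [0.0, 1.0, 2.0, 3.0]  # same source arriving one sample later
    # Delaying mic1 by one sample aligns it with mic2 before averaging.
    print(weighted_delay_sum([mic1, mic2], [0.5, 0.5], [1, 0]))
```

When the delays align the source across channels, the source adds coherently while uncorrelated noise does not, which is the mechanism behind the higher signal to noise ratio described in the quoted paragraph.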
Koishida et al. in view of Hatfield et al. do not teach: the plurality of audio streams include an at least one blind source separation audio stream.
Ramprasad et al. teach: the plurality of audio streams include an at least one beamformed audio stream and an at least one blind source separation audio stream ([0008] An embodiment herein aims to address the problem of how to adaptively or dynamically, e.g., during in-the-field use of a mobile phone that can be in a changing ambient environment, analyze available microphone signals that generate a plurality of acoustic beams to determine an appropriate pair or group of beams, such that at least one pair shows both good voice separation and good noise matching. [0009] In this embodiment, a second subset of microphones is assigned to produce a beam to pick up the ambient noise, and the acoustic pick up beam defined by the signals available from this subset of microphones is considered to be the " noise beam". In other embodiments, the audio system may use audio-based blind source separation and estimation, or a camera, to locate a primary talker and/or any noise sources in the environment and to correlate this information with audio signals in order to determine which microphones should be used to generate a voice beam and which microphones should be used to generate a noise beam. [0010] In one embodiment, possible pairs of noise beams and voice beams that may be produced by the microphone signals are tested based on the positions of the microphones, the locations of the local voice and the ambient noise and the directions of the local voice and the ambient noise to determine which beam pairs maintain thresholds for voice-separation and noise-matching. For example, thresholds are defined to maintain sufficiently large voice separation and noise-matching and two or more acoustic pickup beams are selected for input to a noise suppressor based on satisfaction of the thresholds. 
To determine whether there is sufficient noise-matching between two acoustic pick up beams, in one embodiment, instantaneous and average ratios are obtained over a time interval between a strength of a noise component in one beam and a strength of a noise component in another beam. ).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Koishida et al. in view of Hatfield et al. to include the teaching of Ramprasad et al. above in order to determine which beam pairs maintain thresholds for voice-separation and noise-matching.

Regarding Claims 3 and 18, Koishida et al. teach: wherein the at least one beamformed audio stream are generated by processing signals from a microphone array ([0070]).
Koishida et al. do not teach: The method of claim 2, wherein the at least one beamformed audio stream and the at least one blind source separation audio stream are generated by processing signals from a microphone array.
Hatfield et al. teach:  wherein the at least one beamformed audio stream are generated by processing signals from a microphone array ([0132] As discussed above, the enhancement block 48 may be a beamformer in this example, and so the process of adaptation involves selecting the weightings and phasings applied to the multiple microphone signals, in order to generate an output signal that has a higher signal to noise ratio. [0156] Thus, as in FIG. 9, signals from multiple microphones 18, 20 are sent to a buffer 38. There is also a first trigger detection block 70, which detects whether or not data it receives represents a predetermined trigger phrase. [0163] At intermediate noise levels, speech recognition will work, and so at least the first trigger detection block may be active (while a second trigger detector may be active or may be activated in response to the first trigger detection events). Moreover, the speech enhancement is likely to improve the operation of the downstream speech recognition, and so the enhancement block can be brought into a state where it is enablable in response to trigger phrase detection events for example receiving signals from multiple microphones in the case of a beamformer.).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Koishida et al. to include the teaching of Hatfield et al. above in order to generate an output signal that has a higher signal to noise ratio.
Koishida et al. in view of Hatfield et al. do not teach: the at least one blind source separation audio stream are generated by processing signals from a microphone array.
Ramprasad et al. teach: the at least one beamformed audio stream and the at least one blind source separation audio stream are generated by processing signals from a microphone array ([0007] Such a group of microphones often constitutes a microphone array or a microphone cluster. For example, on a mobile phone, a cluster may be localized on one part of the phone, e.g. the bottom. A cluster may include some microphones from the bottom and some microphones from the top. [0008] An embodiment herein aims to address the problem of how to adaptively or dynamically, e.g., during in-the-field use of a mobile phone that can be in a changing ambient environment, analyze available microphone signals that generate a plurality of acoustic beams to determine an appropriate pair or group of beams, such that at least one pair shows both good voice separation and good noise matching. [0010] In one embodiment, possible pairs of noise beams and voice beams that may be produced by the microphone signals are tested based on the positions of the microphones, the locations of the local voice and the ambient noise and the directions of the local voice and the ambient noise to determine which beam pairs maintain thresholds for voice-separation and noise-matching. For example, thresholds are defined to maintain sufficiently large voice separation and noise-matching and two or more acoustic pickup beams are selected for input to a noise suppressor based on satisfaction of the thresholds. To determine whether there is sufficient noise-matching between two acoustic pick up beams, in one embodiment, instantaneous and average ratios are obtained over a time interval between a strength of a noise component in one beam and a strength of a noise component in another beam. ).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Koishida et al. in view of Hatfield et al. to include the teaching of Ramprasad et al. above in order to determine which beam pairs maintain thresholds for voice-separation and noise-matching.
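For illustration only, the threshold test described in the cited passage of Ramprasad et al. ([0010]) can be sketched as follows. This is a minimal sketch and not the reference's implementation: the function name, the per-beam power estimates, and both threshold values are hypothetical, and a single power estimate per beam stands in for the instantaneous and average ratios the reference obtains over a time interval.

```python
import math


def db(ratio):
    """Convert a power ratio to decibels."""
    return 10.0 * math.log10(ratio)


def select_beam_pair(beams, min_voice_sep_db=6.0, max_noise_mismatch_db=3.0):
    """Pick a (voice_beam, noise_beam) pair that satisfies both thresholds.

    beams maps a beam name to (voice_power, noise_power); these estimates and
    the default thresholds are assumed values chosen for illustration.
    Returns None when no pair satisfies both thresholds.
    """
    best_pair, best_sep = None, None
    for v_name, (v_voice, v_noise) in beams.items():
        for n_name, (n_voice, n_noise) in beams.items():
            if v_name == n_name:
                continue
            # Voice separation: how much stronger the voice is in the voice beam.
            voice_sep = db(v_voice / n_voice)
            # Noise matching: how close the noise strengths of the two beams are.
            noise_mismatch = abs(db(v_noise / n_noise))
            if voice_sep >= min_voice_sep_db and noise_mismatch <= max_noise_mismatch_db:
                if best_pair is None or voice_sep > best_sep:
                    best_pair, best_sep = (v_name, n_name), voice_sep
    return best_pair


beams = {"front": (10.0, 1.0), "back": (1.0, 1.2), "side": (8.0, 5.0)}
selected = select_beam_pair(beams)
```

In this example only the front/back pair passes both tests: front/side fails the voice-separation threshold, and side/back fails the noise-matching threshold.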

Regarding Claims 4 and 17, Ramprasad et al. teach:  The method of claim 2, wherein the blind source separation audio stream comprises a plurality of blind source separation audio streams at least one of which contains target speech from a user ([0012] If a pair of a voice beam and a noise beam is determined to satisfy the thresholds for noise-matching and voice separation, these beams can be selected for input to a noise suppressor or a voice activity detector (VAD). The selected voice beam that is voice dominant is provided as a voice input signal to a multi-channel noise suppression process or VAD, and the noise beam that is noise dominant is provided as a noise input signal to a multi-channel noise suppression process or VAD. This should enable the noise suppression process to produce more accurate voice activity decisions and noise and voice estimates which in turn should lead to a less distorted, noise-suppressed, voice output signal produced by the noise suppression process. In other embodiments, more than two beams may be selected as input to the multi-channel noise suppressor or the VAD. Also, in embodiments in which multiples pairs of beams satisfy the thresholds for voice-separation and noise-matching, selection of the beams balances the individual measures of voice separation and noise matching in order to select an appropriate beam pair.).

Regarding Claims 5 and 19, Ramprasad et al. teach: The method of claim 1, wherein the plurality of audio streams are received from more than one speech enabled device ([0030] In one embodiment, arrangements of any suitable number of microphones and microphones clusters in the housing of a tablet computer, a laptop computer, or a desktop computer are possible. In one embodiment, distributed arrangements of microphones and microphone clusters are possible. For example, the microphones and microphone clusters of the audio system may be arranged in separate housings of tablet computers, laptop computers, desktop computers, mobile phones or other audio systems. [0037] As one example, suitable combinations of the signals from microphones 1 and 2 may generate a number of acoustic pick up beams. Beam analyzers 150 and 155 may each analyze the received microphone signals to determine which of the microphone signals will produce a beam that captures a desired source (such as a local voice) and an undesired source (such as ambient noise), respectively.).

 (Alternate Rejection) Regarding Claims 7 and 21, Ramprasad et al. teach: The method of claim 1, wherein the acoustic environment measurement comprises at least one of a signal to noise ratio, a direct to reverberant ratio, an audio signal level, or a direction of arrival of the voice trigger ([0008] An embodiment herein aims to address the problem of how to adaptively or dynamically, e.g., during in-the-field use of a mobile phone that can be in a changing ambient environment, analyze available microphone signals that generate a plurality of acoustic beams to determine an appropriate pair or group of beams, such that at least one pair shows both good voice separation and good noise matching. In one embodiment, one acoustic beam, often the one with larger SNR, is used to pick-up a desired local voice (referred to as a "voice beam") and the other beam, typically having lower SNR, is used to pick up undesired ambient noise (referred to as a " noise beam").).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Koishida et al. in view of Hatfield et al. to include the teaching of Ramprasad et al. above in order to determine which beam pairs maintain thresholds for voice-separation and noise-matching.

9.	Claims 8 and 11 are rejected under 35 U.S.C. 103 as being unpatentable over Koishida et al. in view of Hatfield et al. further in view of Yoshioka (US 2009/0103740 A1).

Regarding Claim 8, Koishida et al. teach: determining the voice trigger and the acoustic environment measurement comprises the signal to noise ratio calculated using voice trigger i) signal in an interval ii) noise in an interval during voice trigger ([0063] General-purpose VAD engines may be used for the purpose of classifying a particular segment of audio as including either speech or non-speech, with a corresponding confidence value. [0068] In some examples, an entity tracker 100 may be configured to identify background noise present in an environment, and use audio processing techniques to subtract such background noise from received audio data. [0069] Accordingly and in some examples, the device playing the background audio and/or another microphone-equipped device recording the background audio may send the captured audio signal to the entity tracker 100. In this manner, the entity tracker 100 may subtract the background audio from the audio signal received from the microphone-equipped devices. In some examples, the subtraction of the background audio signal from the recorded audio data may be performed by the device(s) that capture the audio data, or by associated audio-processing components, prior to sending the audio data to the entity tracker 100.  [0070] Additionally or alternatively, devices and/or the entity tracker 100 may be trained to recognize particular sources of background noise (e.g., from an air vent or refrigerator), and automatically ignore waveforms corresponding to such noise in recorded audio. In some examples, an entity tracker 100 may include one or more audio-recognition models trained specifically to recognize background noise. For example, audio from various noise databases may be run through unsupervised learning algorithms in order to more consistently recognize such noise. 
By allowing the entity tracker 100 to recognize irrelevant background noise, the ability of the entity tracker to recognize relevant human speech and other sounds may be improved. In some implementations, positional knowledge of a sound source may be used to focus listening from a directional microphone array. [0078] As specific examples, score determination may include evaluating one or more of (1) the amplitude of recorded speech, (2) the signal-to-noise ratio (SNR) of recorded speech, (3) a keyword confidence value indicating a likelihood that the recorded speech includes a keyword or keyword phrase, and (4) a user identification confidence value indicating a likelihood that the user is a particular person--e.g., that the user's identity is a known identity. [0080] The SNR may be calculated for the recorded speech by comparing a signal level of a user's voice to a level of background noise. In some examples, the amplitude of the input may be used to determine a proximity of the user to a corresponding microphone. It will be appreciated that the metrics discussed in the present implementations are provided as examples and are not meant to be limiting. [0081] In some examples, a selection score may be determined by combining the four metrics described above (amplitude, SNR, keyword/keyword phrase confidence, user ID confidence) into a single selection score, such as by averaging the metrics. In some examples and prior to combining, each of the metrics may be weighted by empirically-determined weights that reflect the accuracy of a metric in predicting the device/microphone and corresponding audio data stream that will provide the best user experience.).
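For illustration only, the selection score described in the cited passages of Koishida et al. ([0078], [0081]) can be sketched as a weighted average of the four metrics. This is a minimal sketch under stated assumptions: the function names, the stream names, the normalization of each metric to the range 0 to 1, and the equal default weights are hypothetical and do not appear in the reference, which states only that the metrics may be combined, such as by averaging, and may be weighted by empirically-determined weights.

```python
METRIC_NAMES = ("amplitude", "snr", "keyword_confidence", "user_id_confidence")


def selection_score(metrics, weights=None):
    """Combine the four metrics into one score by weighted averaging.

    metrics maps each name in METRIC_NAMES to a value assumed to be
    normalized to [0, 1]; weights defaults to equal weighting.
    """
    if weights is None:
        weights = {name: 1.0 for name in METRIC_NAMES}
    total_weight = sum(weights[name] for name in METRIC_NAMES)
    weighted_sum = sum(weights[name] * metrics[name] for name in METRIC_NAMES)
    return weighted_sum / total_weight


def pick_stream(streams, weights=None):
    """Return the name of the audio stream with the highest selection score."""
    return max(streams, key=lambda name: selection_score(streams[name], weights))


# Hypothetical metric values for two microphone-equipped devices.
streams = {
    "kitchen": {"amplitude": 0.5, "snr": 0.5,
                "keyword_confidence": 0.5, "user_id_confidence": 0.5},
    "living_room": {"amplitude": 0.8, "snr": 0.6,
                    "keyword_confidence": 0.9, "user_id_confidence": 0.7},
}
chosen = pick_stream(streams)
```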
Koishida et al. do not teach: The method of claim 7 wherein determining the voice trigger comprises a determined start time and a determined end time of the voice trigger, and wherein the acoustic environment measurement comprises the signal to noise ratio calculated using i) signal in an interval between the determined start time and the determined end time and ii) noise in an interval before the determined start time.
Hatfield et al. teach: determining the voice trigger comprises a determined start time and a determined end time of the voice trigger, and wherein the acoustic environment measurement comprises the signal to noise ratio calculated using i) signal in an interval between the determined start time and the determined end time and ii) noise in an interval during the determined start time ([0069] Over the course of the time shown in the figure the buffer 38 contains Pre-data (PD), which represents the data recorded by the buffer 38 before the user starts speaking the predefined trigger phrase, trigger phrase data (TP) and four command word data sections (C, C2, C3, C4). The end of the trigger phrase occurs at time T.sub.ph. [0070] In step 106 of the process of FIG. 4, the trigger phrase detection block 40 is continually attempting to detect the trigger phrase in the received microphone signals. The trigger phrase detection block 40 inevitably has a finite processing time, and so the trigger phrase is actually detected by the trigger detection block 40 at time T.sub.TPD, a time interval Tdd after the end of the actual spoken trigger phrase at T.sub.ph. [0091] FIG. 6 shows an example of the operation of the system shown in FIG. 4, and FIG. 7 is a flow chart showing the process performed. The process shown in FIG. 7 starts with step 122, in which the acoustic signals received at the or each microphone are converted into digital electrical signals representing the detected sounds. In step 124, these microphone signals are stored in the buffer 38. The axis labelled Bin in FIG. 6 shows the data received and written into the buffer 38 at any given time. The start of this writing of data to the buffer 38 may be activated by the level of sound being recorded by the microphone 18 increasing over a threshold value. In other embodiments the buffer 38 may be continuously writing. 
Over the course of the time shown in the figure the buffer 38 contains Pre-data (PD), which represents the data recorded by the buffer 38 before the user starts speaking the predefined trigger phrase, trigger phrase data (TP) and four command word data sections (C, C2, C3, C4).).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Koishida et al. to include the teaching of Hatfield et al. above in order to generate an output signal that has a higher signal-to-noise ratio.
Koishida et al. in view of Hatfield et al. do not teach: wherein the acoustic environment measurement comprises the signal to noise ratio calculated using ii) noise in an interval before the determined start time.
Yoshioka teaches: determining the voice trigger comprises a determined start time and a determined end time of the voice trigger, and wherein the acoustic environment measurement comprises the signal to noise ratio calculated using i) signal in an interval between the determined start time and the determined end time and ii) noise in an interval before the determined start time ([0012] According to the audio signal processing device, the audio signal obtained and stored before the trigger signal is obtained is considered to be an audio signal showing only an environmental noise to calculate the S/N ratio, and the sound generating period is specified on the basis of the S/N ratio, so that the specified result with high accuracy can be obtained. [0013] In the audio signal processing device, the trigger signal obtaining unit may obtain the trigger signal generated by an operating unit in accordance with a prescribed operation by the user, or may obtain the trigger signal generated by the information of an informing unit that informs the user of urging the user to give a voice. [0016] Further, in the audio signal processing device, the specifying unit may calculate the S/N ratios respectively for a plurality of frames obtained by dividing the audio signal obtained by the audio signal obtaining unit after the trigger signal is obtained at intervals of prescribed time length and specify the start time of the frame whose S/N ratio satisfies a prescribed condition as a start time of the sound generating period. [0017] Further, in the audio signal processing device, the specifying unit may calculate the S/N ratios respectively for a plurality of frames obtained by dividing the audio signal obtained by the audio signal obtaining unit after the trigger signal is obtained at intervals of prescribed time length and specify the end time of the frame whose S/N ratio satisfies the prescribed condition as an end time of the sound generating period. 
[0086] The SNR calculated by the S/N ratio calculating part 1142 as described above is an index showing the ratio of the level of the sound in the audio space at a current time relative to the level of the environmental noise. Accordingly, the SNR calculated while the user does not give a voice shows a value near 1 and the SNR calculated while the user gives a voice shows a numeric value considerably larger than 1. Thus, the condition deciding part 1143 specifies the sound generating period in accordance with the SNR sequentially calculated by the S/N ratio calculating part 1142 in such a way as described below.).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Koishida et al. in view of Hatfield et al. to include the teaching of Yoshioka above in order to determine the sound generating period in accordance with the SNR.
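For illustration only, the frame-wise S/N ratio described in the cited passages of Yoshioka ([0012], [0016], [0017], [0086]) can be sketched as follows. This is a minimal sketch and not the reference's implementation: the function names, the use of RMS as the level measure, the threshold value of 2.0, and the rule that the period runs from the first to the last frame meeting the condition are assumptions. The audio stored before the trigger is treated as environmental noise only, so the ratio is near 1 for frames without voice and considerably larger than 1 for frames with voice.

```python
import math


def rms(samples):
    """Root mean square level of a sequence of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def sound_period(pre_trigger, post_trigger, frame_len, threshold=2.0):
    """Locate the sound generating period in the post-trigger audio.

    Returns (start_index, end_index, per_frame_snr), or None when no frame
    meets the threshold. The threshold is an assumed value.
    """
    noise_level = rms(pre_trigger)  # pre-trigger audio is assumed to be noise only
    if noise_level == 0:
        raise ValueError("pre-trigger audio is silent; S/N ratio is undefined")
    # Divide the post-trigger audio into whole frames of fixed length.
    frames = [post_trigger[i:i + frame_len]
              for i in range(0, len(post_trigger) - frame_len + 1, frame_len)]
    snrs = [rms(frame) / noise_level for frame in frames]
    active = [i for i, snr in enumerate(snrs) if snr >= threshold]
    if not active:
        return None
    start = active[0] * frame_len        # start of first frame meeting the condition
    end = (active[-1] + 1) * frame_len   # end of last frame meeting the condition
    return start, end, snrs


pre = [0.1, -0.1] * 4
post = [0.1, -0.1] * 2 + [1.0, -1.0] * 2 + [0.1, -0.1] * 2
result = sound_period(pre, post, frame_len=4)
```

With these hypothetical samples, only the middle frame is well above the noise level, so the period spans sample indices 4 to 8 of the post-trigger audio.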

Regarding Claim 11, Yoshioka teaches: The method of claim 10, wherein utilizing the determined voice trigger as an anchor comprises determining a runtime interval that comprises a start time for the determined voice trigger and an end time for the determined voice trigger, and calculating the signal to noise ratio comprises comparing interference or noise during the runtime interval to the interference or noise prior to the start time of the determined voice trigger (See rejection of claim 8).

Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Koishida et al. in view of Hatfield et al. further in view of Yoshioka further in view of Vuorinen et al. (US 6847689 B1).
Regarding Claim 12, Koishida et al. in view of Hatfield et al. further in view of Yoshioka teach: calculating the signal to noise ratio comprises use of i) a portion of the audio stream during the runtime interval and ii) a portion of the audio stream before the start time (See Yoshioka, [0012] According to the audio signal processing device, the audio signal obtained and stored before the trigger signal is obtained is considered to be an audio signal showing only an environmental noise to calculate the S/N ratio, and the sound generating period is specified on the basis of the S/N ratio, so that the specified result with high accuracy can be obtained. [0016] Further, in the audio signal processing device, the specifying unit may calculate the S/N ratios respectively for a plurality of frames obtained by dividing the audio signal obtained by the audio signal obtaining unit after the trigger signal is obtained at intervals of prescribed time length and specify the start time of the frame whose S/N ratio satisfies a prescribed condition as a start time of the sound generating period. [0017] Further, in the audio signal processing device, the specifying unit may calculate the S/N ratios respectively for a plurality of frames obtained by dividing the audio signal obtained by the audio signal obtaining unit after the trigger signal is obtained at intervals of prescribed time length and specify the end time of the frame whose S/N ratio satisfies the prescribed condition as an end time of the sound generating period. [0086] The SNR calculated by the S/N ratio calculating part 1142 as described above is an index showing the ratio of the level of the sound in the audio space at a current time relative to the level of the environmental noise.).
Koishida et al. in view of Hatfield et al. further in view of Yoshioka do not, however, teach: calculating the signal to noise ratio comprises root mean square of i) a portion of the audio stream during the runtime interval and ii) a portion of the audio stream before the start time.
Vuorinen et al. teach: calculating the signal to noise ratio comprises a root mean square comparison of a new signal sequence to a previous signal sequence using an average sum of the square differences in the samples (Col. 7, lines 10-25: when the signal-to-noise ratio is high and the signal components overlap each other in both the frequency and time space. The reliability of the amplitude adaptation can be measured using an RMS error, which is performed so as to compare the new signal sequence to a previous signal sequence using an average sum of the square differences in the samples. The previous signal sequence model does not include useful signal, meaning that is formed at a point in which no useful signal was present, or the effect of the useful signal was insignificant.).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Koishida et al. in view of Hatfield et al. further in view of Yoshioka to include the teaching of Vuorinen et al. above in order to measure the reliability of the amplitude adaptation using an RMS error.
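For illustration only, the claim 12 limitation as read in this rejection (a signal to noise ratio computed from the root mean square of a portion of the audio stream during the runtime interval and of a portion before the start time) can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the sample-index representation of the start and end times, the length of the pre-trigger noise window, and the expression of the result in decibels are hypothetical, and the sketch does not reproduce the RMS error computation of Vuorinen et al.

```python
import math


def rms(samples):
    """Root mean square level of a sequence of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def trigger_snr_db(stream, start, end, noise_len):
    """SNR in dB anchored on the voice trigger interval [start, end).

    Signal: samples between the trigger start and end indices.
    Noise: up to noise_len samples immediately before the trigger start.
    """
    signal = stream[start:end]
    noise = stream[max(0, start - noise_len):start]
    if not signal or not noise:
        raise ValueError("signal and noise intervals must both be non-empty")
    noise_rms = rms(noise)
    if noise_rms == 0:
        return math.inf  # no measurable noise before the trigger
    return 20.0 * math.log10(rms(signal) / noise_rms)


# Hypothetical stream: 10 low-level noise samples followed by a 10-sample trigger.
stream = [0.1, -0.1] * 5 + [1.0, -1.0] * 5
snr_db = trigger_snr_db(stream, start=10, end=20, noise_len=10)
```

Here the trigger interval has ten times the RMS level of the preceding noise, which corresponds to approximately 20 dB.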

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Sundaram et al. (US 9734822 B1) teach: Beamformed signal selection. The selection may consider processing feedback information to identify when the current beam selection may need to be re-evaluated. The feedback information may further be used to select a beamformed signal for processing. For example, beams which detect wake-words or yield high confidence speech recognition may be favored over beams which fail to detect or recognize at a lower confidence level.
Jorgovanovic (US 2017/0090864 A1) teaches: synchronization of multiple voice-controlled devices to establish priority of one of the devices to respond to an acoustic signal, preventing other devices from responding to a single user command. The computing device may produce a combined quality value that includes one or more of the above signal strength values and potentially other values. The quality value may be calculated using the entirety of the audio input, or the computing device may identify a portion of the audio input, such as the portion including the wakeword and/or the input command/inquiry, and obtain the quality value from analysis of the portion. In some embodiments, the computing device may send the audio input to a remote device, such as a speech recognition server or signal processing device, and receive the quality value in return. Additionally or alternatively, the speech recognition server may analyze the audio input for the wakeword (step 310).
Devaraj et al. (US 2018/0061404 A1) teach: In response to determining the occurrence of a communication alteration trigger, such as repeated messages between the same two devices, the system may automatically change a mode of a speech-controlled device, such as no longer requiring a wakeword, no longer requiring an indication of a desired recipient, or automatically connecting the two speech-controlled devices in a voice-chat mode.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MOHAMMAD K ISLAM whose telephone number is (571)270-5878.  The examiner can normally be reached on Monday-Friday, EST (IFP).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached on 571-272-7453.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/MOHAMMAD K ISLAM/Primary Examiner, Art Unit 2656