DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
The Amendment filed July 27, 2022 has been entered.  Claims 1 – 20 are pending in the application.  Applicant’s amendments to the Specification and Claims have overcome each and every objection and 35 U.S.C. 112(b) rejection previously set forth in the Non-Final Office Action mailed May 3, 2022.
Response to Arguments
Applicant’s arguments with respect to claims 1 – 20 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1 – 2, 4, 9 – 10, 15, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Zheng et al. ("Air- and Bone-Conductive Integrated Microphones for Robust Speech Detection and Enhancement"), hereinafter Zheng, in view of Kechichian et al. (“Model-Based Speech Enhancement using a Bone-Conducted Signal”), hereinafter Kechichian.
Regarding claim 1, Zheng discloses an audio system comprising:
a microphone array configured to detect sounds from a local area, the sounds from the local area including a voice of a user of the audio system (Abstract, lines 1-3, "We present a novel hardware device that combines a regular microphone with a bone-conductive microphone."; Section 3, lines 21-22, "the regular microphone contains wideband speech suitable for recognition");
a contact transducer configured to detect tissue based vibrations on a portion of a head of the user, the tissue based vibrations generated by the voice of the user and pass through tissue of the user prior to being detected by the contact transducer (Section 3, lines 1-3, "When we speak, there is vibration on the bones of the head. The bone-conductive sensors, when pressed again the bones, can capture the bone vibrations.");
a controller configured to: identify, based on the trained model, the voice of the user in the sounds from the local area detected by the microphone array (Section 3, lines 30-35, "We are taking a more practical approach: using the bone sensor to enhance the wideband noisy speech for use with an existing speech recognition system. Since the bone sensor signals contain very little noise, we can combine the bone sensor signals with the close talk microphone signals to obtain a better estimate of the clean speech."; A processor or controller component to perform the speech enhancement function is inherently taught as part of the hardware device.);
and update a sound filter based on the identified voice of the user, wherein audio content is modified using the updated sound filter (Section 5, lines 1-4, "In this section, we describe how to use the bone sensor for speech enhancement in an environment with highly nonstationary noises such as when there are people talking in the background."; Section 5.2, lines 5-15, "Assuming that the noise level in the b is negligible and the additive noise is uncorrelated with the speech signal, the problem can then be formulated as follows: Sy(ω) = Sx(ω) + Sn(ω)  Sx(ω) = f(Sy(ω), Sb(ω))   Sx(ω) = H(ω)Sy(ω) where Sy, Sx, Sb and Sn are the power spectrum for noisy speech, clean speech, bone signal, and noise, respectively, and f(z) is a nonlinear mapping function.  Our goal is to find the optimal H (the Wiener filter)."),
and the modified audio content is presented by at least one audio system (Section 4.1, lines 8-12, "In this way, our integrated microphone can be directly used with any existing speech recognition system. To measure the performance of the noise removal algorithm, we used our new microphone with Microsoft’s speech recognition system.").
Zheng does not specifically disclose: determine, based on the tissue based vibrations, at least one of a spectral correlation and a spatial correlation between: (i) the voice of the user detected by the microphone array, and (ii) the sounds from the local area detected by the microphone array; train a model based on the determined at least one of the spectral correlation and the spatial correlation to distinguish between the voice of the user detected by the microphone array and other sounds from the local area detected by the microphone array.
Kechichian teaches:
determine, based on the tissue based vibrations, at least one of a spectral correlation and a spatial correlation between: (i) the voice of the user detected by the microphone array, and (ii) the sounds from the local area detected by the microphone array (Section 2, lines 1-4, “Consider an additive noise model where the AC microphone signal is given by z(n) = d(n) + u(n), with d(n) denoting the desired signal, in this case the user’s speech, u(n) the undesired signal, such as background noise, and n the discrete-time index.”; Section 2, lines 7-16, “Let the short-time Fourier transform (STFT) of the microphone signal be denoted by Z(ω) = D(ω) + U(ω), where D(ω) and U(ω) denote the STFT of the desired and undesired signal, respectively. Also, let P̂z(ω), P̂d(ω), and P̂u(ω) denote estimated PSDs of the microphone, desired, and undesired signals, respectively. Typically, P̂u(ω) is obtained using a noise estimation algorithm that estimates the noise PSD from the noisy PSD. P̂d(ω) is then obtained as max (P̂z(ω) - P̂u(ω), 0). The gain function G(ω) is usually expressed in terms of these PSDs, and the enhanced signal is given by Q(ω) = G(ω) Z(ω)”; Section 1, lines 22-28, “An example of such a sensor is a bone-conduction (BC) microphone which is placed in contact with a small area of the user’s skin. The location of the BC microphone can affect the quality and intelligibility of the captured BC speech, a preferred location being an area near the larynx. Unlike air-conduction (AC) microphones which also capture external background noise present in the user’s environment, BC sensors are immune to this background noise since the captured signal is transmitted through bone and tissue directly to the sensor.”; Section 3, lines 1-11, “The proposed method relies on trained codebooks of AC and BC clean speech PSDs, and a mapping between the two that is generated during an offline training procedure. 
During the actual online noise reduction, the AC microphone observes both the speech signal and the background noise whereas the BC microphone observes only the BC speech signal. For each short-time segment of the noisy speech, the BC codebook vector that is closest (with respect to a selected distortion criterion) to the PSD of the observed BC signal is identified in a first step. Though the BC codebook is trained on clean BC speech signals, this step is justified as the BC signal is noise-free even in the presence of background noise. In the second step, the AC codebook vector that is mapped to the selected BC codebook vector is used in the codebook-based speech enhancement algorithm, instead of the entire AC codebook.”; The gain function G(ω) expressed in terms of the power spectral densities (PSDs) reads on a spectral correlation between the voice of the user detected by the microphone array and the sounds from the local area detected by the microphone array, where Q(ω) is the frequency domain representation of the voice of the user detected by the microphone array and Z(ω) is the frequency domain representation of the sounds from the local area detected by the microphone array.  Using the air-conduction (AC) microphone and the bone-conduction (BC) microphone to determine the background noise reads on determining the spectral correlation based on the tissue based vibrations.);
train a model based on the determined at least one of the spectral correlation and the spatial correlation to distinguish between the voice of the user detected by the microphone array and other sounds from the local area detected by the microphone array (Section 1, lines 11-16, “Model-based approaches to speech enhancement, which rely on trained codebooks of gain-normalized speech and noise PSDs, have been shown to provide better performance in nonstationary noise environments In such methods, the speech and noise PSD codebook candidates and their respective gain factors that describe the observed noisy PSD best (according to a certain criterion), are obtained for each short-time segment.”).
Kechichian teaches determining a gain function, in terms of the power spectral densities (PSDs), that represents the correlation of the frequency domain representation of the voice of the user detected by the microphone array and the frequency domain representation of the sounds from the local area detected by the microphone array, based on the tissue based vibrations, and training a model to perform speech enhancement that determines the voice of the user from the sounds from the local area in order to improve the performance of a codebook-based speech enhancement system (Section 5, lines 1-3, “This paper has investigated the use of a reference signal provided by a bone-conducting microphone to improve the performance of a codebook-based speech enhancement system.”).
Zheng and Kechichian are considered to be analogous to the claimed invention because they are in the same field of speech detection audio systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zheng to incorporate the teachings of Kechichian to determine a gain function, in terms of the power spectral densities (PSDs), that represents the correlation of the frequency domain representation of the voice of the user detected by the microphone array and the frequency domain representation of the sounds from the local area detected by the microphone array, based on the tissue based vibrations, and train a model to perform speech enhancement that determines the voice of the user from the sounds from the local area.  Doing so would allow for improving the performance of a codebook-based speech enhancement system.
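For illustration only, the PSD relationships quoted from Kechichian, Section 2, can be sketched as follows. The desired-signal estimate max(P̂z(ω) - P̂u(ω), 0) and the product Q(ω) = G(ω) Z(ω) are taken from the quoted passage. The specific Wiener-type form G(ω) = P̂d(ω) / P̂z(ω) is an assumption made here, since the quoted passage states only that the gain is "usually expressed in terms of these PSDs". All function names and sample values are hypothetical.

```python
def estimate_desired_psd(noisy_psd, noise_psd):
    """Per-bin desired-signal PSD: P_d(w) = max(P_z(w) - P_u(w), 0)."""
    return [max(pz - pu, 0.0) for pz, pu in zip(noisy_psd, noise_psd)]


def wiener_type_gain(noisy_psd, noise_psd, eps=1e-12):
    """Assumed gain form G(w) = P_d(w) / P_z(w); not stated in the reference."""
    desired_psd = estimate_desired_psd(noisy_psd, noise_psd)
    # eps guards against division by zero in empty bins.
    return [pd / max(pz, eps) for pd, pz in zip(desired_psd, noisy_psd)]


def enhance(noisy_spectrum, gain):
    """Enhanced spectrum Q(w) = G(w) Z(w), applied bin by bin."""
    return [g * z for g, z in zip(gain, noisy_spectrum)]


# Hypothetical three-bin example.
noisy_psd = [4.0, 2.0, 1.0]   # P_z: AC microphone PSD
noise_psd = [1.0, 2.0, 3.0]   # P_u: estimated noise PSD
gain = wiener_type_gain(noisy_psd, noise_psd)
enhanced = enhance([2.0, 1.0, 0.5], gain)
```

Bins in which the noise estimate meets or exceeds the observed PSD receive a gain of zero, which is the effect of the max(., 0) floor in the quoted passage.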
Regarding claim 2, Zheng in view of Kechichian discloses the audio system as claimed in claim 1.  Zheng further discloses wherein the audio system is integrated into a headset (Section 1, lines 10-15, "In this paper, we propose a hardware device that combines regular microphone (air-conductive microphone) with a bone-conductive microphone with the purpose of handling noisy environment. The device is designed in such a way that people wear it just like a regular headset, and it can be plugged into any machine with a USB port.").
Regarding claim 4, Zheng in view of Kechichian discloses the audio system as claimed in claim 1.  Zheng further discloses wherein:
the updated sound filter enhances the voice of the user (Section 5, lines 1-4, "In this section, we describe how to use the bone sensor for speech enhancement in an environment with highly nonstationary noises such as when there are people talking in the background."; Section 5.2, lines 5-15, "Assuming that the noise level in the b is negligible and the additive noise is uncorrelated with the speech signal, the problem can then be formulated as follows: Sy(ω) = Sx(ω) + Sn(ω)  Sx(ω) = f(Sy(ω), Sb(ω))   Sx(ω) = H(ω)Sy(ω) where Sy, Sx, Sb and Sn are the power spectrum for noisy speech, clean speech, bone signal, and noise, respectively, and f(z) is a nonlinear mapping function.  Our goal is to find the optimal H (the Wiener filter)."),
and the controller is further configured to: modify the audio content with the updated sound filter, wherein the modified audio content enhances the voice of the user (Section 6, lines 7-8, "We then apply our speech enhancement algorithm to estimate the clean speech.");
and provide the modified audio content to a second audio system, wherein the second audio system presents the modified audio content (Section 4.1, lines 8-12, "In this way, our integrated microphone can be directly used with any existing speech recognition system. To measure the performance of the noise removal algorithm, we used our new microphone with Microsoft’s speech recognition system."; Section 6, lines 15-18, "The corrupted audio files and enhanced audio files are mixed randomly, and then played to the evaluators (the people who gave the scores) with desktop speakers.").
Regarding claim 9, Zheng discloses a method comprising:
detecting, via a microphone array of an audio system, sounds from a local area, the sounds from the local area including a voice of a user of the audio system (Abstract, lines 1-3, "We present a novel hardware device that combines a regular microphone with a bone-conductive microphone."; Section 3, lines 21-22, "the regular microphone contains wideband speech suitable for recognition");
detecting, via a contact transducer, tissue based vibrations on a portion of a head of the user, wherein the tissue based vibrations are generated by the voice of the user and pass through tissue of the user prior to being detected by the contact transducer (Section 3, lines 1-3, "When we speak, there is vibration on the bones of the head. The bone-conductive sensors, when pressed again the bones, can capture the bone vibrations.");
identifying, based on the trained model, the voice of the user in the sounds from the local area detected by the microphone array (Section 3, lines 30-35, "We are taking a more practical approach: using the bone sensor to enhance the wideband noisy speech for use with an existing speech recognition system. Since the bone sensor signals contain very little noise, we can combine the bone sensor signals with the close talk microphone signals to obtain a better estimate of the clean speech."; A processor or controller component to perform the speech enhancement function is inherently taught as part of the hardware device.);
and updating a sound filter based on the identified voice of the user, wherein audio content is modified using the updated sound filter (Section 5, lines 1-4, "In this section, we describe how to use the bone sensor for speech enhancement in an environment with highly nonstationary noises such as when there are people talking in the background."; Section 5.2, lines 5-15, "Assuming that the noise level in the b is negligible and the additive noise is uncorrelated with the speech signal, the problem can then be formulated as follows: Sy(ω) = Sx(ω) + Sn(ω)  Sx(ω) = f(Sy(ω), Sb(ω))   Sx(ω) = H(ω)Sy(ω) where Sy, Sx, Sb and Sn are the power spectrum for noisy speech, clean speech, bone signal, and noise, respectively, and f(z) is a nonlinear mapping function.  Our goal is to find the optimal H (the Wiener filter)."),
and the modified audio content is presented by at least one audio system (Section 4.1, lines 8-12, "In this way, our integrated microphone can be directly used with any existing speech recognition system. To measure the performance of the noise removal algorithm, we used our new microphone with Microsoft’s speech recognition system.").
Zheng does not specifically disclose: determining, based on the tissue based vibrations, at least one of a spectral correlation and a spatial correlation between: (i) the voice of the user detected by the microphone array, and (ii) the sounds from the local area detected by the microphone array; training a model based on the determined at least one of the spectral correlation and the spatial correlation to distinguish between the voice of the user detected by the microphone array and other sounds from the local area detected by the microphone array.
Kechichian teaches:
determining, based on the tissue based vibrations, at least one of a spectral correlation and a spatial correlation between: (i) the voice of the user detected by the microphone array, and (ii) the sounds from the local area detected by the microphone array (Section 2, lines 1-4, “Consider an additive noise model where the AC microphone signal is given by z(n) = d(n) + u(n), with d(n) denoting the desired signal, in this case the user’s speech, u(n) the undesired signal, such as background noise, and n the discrete-time index.”; Section 2, lines 7-16, “Let the short-time Fourier transform (STFT) of the microphone signal be denoted by Z(ω) = D(ω) + U(ω), where D(ω) and U(ω) denote the STFT of the desired and undesired signal, respectively. Also, let P̂z(ω), P̂d(ω), and P̂u(ω) denote estimated PSDs of the microphone, desired, and undesired signals, respectively. Typically, P̂u(ω) is obtained using a noise estimation algorithm that estimates the noise PSD from the noisy PSD. P̂d(ω) is then obtained as max (P̂z(ω) - P̂u(ω), 0). The gain function G(ω) is usually expressed in terms of these PSDs, and the enhanced signal is given by Q(ω) = G(ω) Z(ω)”; Section 1, lines 22-28, “An example of such a sensor is a bone-conduction (BC) microphone which is placed in contact with a small area of the user’s skin. The location of the BC microphone can affect the quality and intelligibility of the captured BC speech, a preferred location being an area near the larynx. Unlike air-conduction (AC) microphones which also capture external background noise present in the user’s environment, BC sensors are immune to this background noise since the captured signal is transmitted through bone and tissue directly to the sensor.”; Section 3, lines 1-11, “The proposed method relies on trained codebooks of AC and BC clean speech PSDs, and a mapping between the two that is generated during an offline training procedure. 
During the actual online noise reduction, the AC microphone observes both the speech signal and the background noise whereas the BC microphone observes only the BC speech signal. For each short-time segment of the noisy speech, the BC codebook vector that is closest (with respect to a selected distortion criterion) to the PSD of the observed BC signal is identified in a first step. Though the BC codebook is trained on clean BC speech signals, this step is justified as the BC signal is noise-free even in the presence of background noise. In the second step, the AC codebook vector that is mapped to the selected BC codebook vector is used in the codebook-based speech enhancement algorithm, instead of the entire AC codebook.”; The gain function G(ω) expressed in terms of the power spectral densities (PSDs) reads on a spectral correlation between the voice of the user detected by the microphone array and the sounds from the local area detected by the microphone array, where Q(ω) is the frequency domain representation of the voice of the user detected by the microphone array and Z(ω) is the frequency domain representation of the sounds from the local area detected by the microphone array.  Using the air-conduction (AC) microphone and the bone-conduction (BC) microphone to determine the background noise reads on determining the spectral correlation based on the tissue based vibrations.);
training a model based on the determined at least one of the spectral correlation and the spatial correlation to distinguish between the voice of the user detected by the microphone array and other sounds from the local area detected by the microphone array (Section 1, lines 11-16, “Model-based approaches to speech enhancement, which rely on trained codebooks of gain-normalized speech and noise PSDs, have been shown to provide better performance in nonstationary noise environments In such methods, the speech and noise PSD codebook candidates and their respective gain factors that describe the observed noisy PSD best (according to a certain criterion), are obtained for each short-time segment.”).
Kechichian teaches determining a gain function, in terms of the power spectral densities (PSDs), that represents the correlation of the frequency domain representation of the voice of the user detected by the microphone array and the frequency domain representation of the sounds from the local area detected by the microphone array, based on the tissue based vibrations, and training a model to perform speech enhancement that determines the voice of the user from the sounds from the local area in order to improve the performance of a codebook-based speech enhancement system (Section 5, lines 1-3, “This paper has investigated the use of a reference signal provided by a bone-conducting microphone to improve the performance of a codebook-based speech enhancement system.”).
Zheng and Kechichian are considered to be analogous to the claimed invention because they are in the same field of speech detection audio systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zheng to incorporate the teachings of Kechichian to determine a gain function, in terms of the power spectral densities (PSDs), that represents the correlation of the frequency domain representation of the voice of the user detected by the microphone array and the frequency domain representation of the sounds from the local area detected by the microphone array, based on the tissue based vibrations, and train a model to perform speech enhancement that determines the voice of the user from the sounds from the local area.  Doing so would allow for improving the performance of a codebook-based speech enhancement system.
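For illustration only, the two-step codebook selection quoted from Kechichian, Section 3, can be sketched as follows. The quoted passage refers to "a selected distortion criterion" without naming it, so squared Euclidean distance is assumed here. The codebook entries, the index-based mapping, and all function names are hypothetical.

```python
def squared_distance(a, b):
    """Assumed distortion criterion: squared Euclidean distance between PSDs."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_index(codebook, observed_psd):
    """Index of the codebook vector closest to the observed PSD."""
    return min(
        range(len(codebook)),
        key=lambda i: squared_distance(codebook[i], observed_psd),
    )


def select_ac_vector(bc_codebook, ac_codebook, bc_to_ac, observed_bc_psd):
    """Step 1: find the closest BC codebook vector to the observed BC PSD.
    Step 2: return the AC codebook vector mapped to that BC vector."""
    bc_index = nearest_index(bc_codebook, observed_bc_psd)
    return ac_codebook[bc_to_ac[bc_index]]


# Hypothetical two-entry codebooks with a mapping learned offline.
bc_codebook = [[1.0, 0.0], [0.0, 1.0]]
ac_codebook = [[2.0, 0.5], [0.5, 2.0]]
bc_to_ac = [0, 1]
selected = select_ac_vector(bc_codebook, ac_codebook, bc_to_ac, [0.1, 0.9])
```

The selected AC vector would then stand in for the entire AC codebook in the codebook-based enhancement step, as the quoted passage describes.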
Regarding claim 10, Zheng in view of Kechichian discloses the method as claimed in claim 9.  Zheng further discloses wherein:
the updated sound filter enhances the voice of the user (Section 5, lines 1-4, "In this section, we describe how to use the bone sensor for speech enhancement in an environment with highly nonstationary noises such as when there are people talking in the background."; Section 5.2, lines 5-15, "Assuming that the noise level in the b is negligible and the additive noise is uncorrelated with the speech signal, the problem can then be formulated as follows: Sy(ω) = Sx(ω) + Sn(ω)  Sx(ω) = f(Sy(ω), Sb(ω))   Sx(ω) = H(ω)Sy(ω) where Sy, Sx, Sb and Sn are the power spectrum for noisy speech, clean speech, bone signal, and noise, respectively, and f(z) is a nonlinear mapping function.  Our goal is to find the optimal H (the Wiener filter)."),
and the method further comprises: modifying the audio content with the updated sound filter, wherein the modified audio content enhances the voice of the user (Section 6, lines 7-8, "We then apply our speech enhancement algorithm to estimate the clean speech.");
and providing the modified audio content to a second audio system, wherein the second audio system presents the modified audio content (Section 4.1, lines 8-12, "In this way, our integrated microphone can be directly used with any existing speech recognition system. To measure the performance of the noise removal algorithm, we used our new microphone with Microsoft’s speech recognition system."; Section 6, lines 15-18, "The corrupted audio files and enhanced audio files are mixed randomly, and then played to the evaluators (the people who gave the scores) with desktop speakers.").
Regarding claim 15, Zheng in view of Kechichian discloses the method as claimed in claim 9.  Zheng further discloses wherein the audio system is integrated into a headset (Section 1, lines 10-15, "In this paper, we propose a hardware device that combines regular microphone (air-conductive microphone) with a bone-conductive microphone with the purpose of handling noisy environment. The device is designed in such a way that people wear it just like a regular headset, and it can be plugged into any machine with a USB port.").
Regarding claim 17, Zheng discloses a non-transitory computer readable medium configured to store program code instructions that, when executed by a processor of an audio system, cause the audio system to perform steps comprising:
detecting, via a microphone array of an audio system, sounds from a local area, the sounds from the local area including a voice of a user of the audio system (Abstract, lines 1-3, "We present a novel hardware device that combines a regular microphone with a bone-conductive microphone."; Section 3, lines 21-22, "the regular microphone contains wideband speech suitable for recognition");
detecting, via a contact transducer, tissue based vibrations on a portion of a head of the user, wherein the tissue based vibrations are generated by the voice of the user and pass through tissue of the user prior to being detected by the contact transducer (Section 3, lines 1-3, "When we speak, there is vibration on the bones of the head. The bone-conductive sensors, when pressed again the bones, can capture the bone vibrations.");
identifying, based on the trained model, the voice of the user in the sounds from the local area detected by the microphone array (Section 3, lines 30-35, "We are taking a more practical approach: using the bone sensor to enhance the wideband noisy speech for use with an existing speech recognition system. Since the bone sensor signals contain very little noise, we can combine the bone sensor signals with the close talk microphone signals to obtain a better estimate of the clean speech."; A processor or controller component to perform the speech enhancement function is inherently taught as part of the hardware device.);
and updating a sound filter based on the identified voice of the user, wherein audio content is modified using the updated sound filter (Section 5, lines 1-4, "In this section, we describe how to use the bone sensor for speech enhancement in an environment with highly nonstationary noises such as when there are people talking in the background."; Section 5.2, lines 5-15, "Assuming that the noise level in the b is negligible and the additive noise is uncorrelated with the speech signal, the problem can then be formulated as follows: Sy(ω) = Sx(ω) + Sn(ω)  Sx(ω) = f(Sy(ω), Sb(ω))   Sx(ω) = H(ω)Sy(ω) where Sy, Sx, Sb and Sn are the power spectrum for noisy speech, clean speech, bone signal, and noise, respectively, and f(z) is a nonlinear mapping function.  Our goal is to find the optimal H (the Wiener filter)."),
and the modified audio content is presented by at least one audio system (Section 4.1, lines 8-12, "In this way, our integrated microphone can be directly used with any existing speech recognition system. To measure the performance of the noise removal algorithm, we used our new microphone with Microsoft’s speech recognition system.").
Zheng does not specifically disclose: determining, based on the tissue based vibrations, at least one of a spectral correlation and a spatial correlation between: (i) the voice of the user detected by the microphone array, and (ii) the sounds from the local area detected by the microphone array; training a model based on the determined at least one of the spectral correlation and the spatial correlation to distinguish between the voice of the user detected by the microphone array and other sounds from the local area detected by the microphone array.
Kechichian teaches:
determining, based on the tissue based vibrations, at least one of a spectral correlation and a spatial correlation between: (i) the voice of the user detected by the microphone array, and (ii) the sounds from the local area detected by the microphone array (Section 2, lines 1-4, “Consider an additive noise model where the AC microphone signal is given by z(n) = d(n) + u(n), with d(n) denoting the desired signal, in this case the user’s speech, u(n) the undesired signal, such as background noise, and n the discrete-time index.”; Section 2, lines 7-16, “Let the short-time Fourier transform (STFT) of the microphone signal be denoted by Z(ω) = D(ω) + U(ω), where D(ω) and U(ω) denote the STFT of the desired and undesired signal, respectively. Also, let P̂z(ω), P̂d(ω), and P̂u(ω) denote estimated PSDs of the microphone, desired, and undesired signals, respectively. Typically, P̂u(ω) is obtained using a noise estimation algorithm that estimates the noise PSD from the noisy PSD. P̂d(ω) is then obtained as max (P̂z(ω) - P̂u(ω), 0). The gain function G(ω) is usually expressed in terms of these PSDs, and the enhanced signal is given by Q(ω) = G(ω) Z(ω)”; Section 1, lines 22-28, “An example of such a sensor is a bone-conduction (BC) microphone which is placed in contact with a small area of the user’s skin. The location of the BC microphone can affect the quality and intelligibility of the captured BC speech, a preferred location being an area near the larynx. Unlike air-conduction (AC) microphones which also capture external background noise present in the user’s environment, BC sensors are immune to this background noise since the captured signal is transmitted through bone and tissue directly to the sensor.”; Section 3, lines 1-11, “The proposed method relies on trained codebooks of AC and BC clean speech PSDs, and a mapping between the two that is generated during an offline training procedure. 
During the actual online noise reduction, the AC microphone observes both the speech signal and the background noise whereas the BC microphone observes only the BC speech signal. For each short-time segment of the noisy speech, the BC codebook vector that is closest (with respect to a selected distortion criterion) to the PSD of the observed BC signal is identified in a first step. Though the BC codebook is trained on clean BC speech signals, this step is justified as the BC signal is noise-free even in the presence of background noise. In the second step, the AC codebook vector that is mapped to the selected BC codebook vector is used in the codebook-based speech enhancement algorithm, instead of the entire AC codebook.”; The gain function G(ω) expressed in terms of the power spectral densities (PSDs) reads on a spectral correlation between the voice of the user detected by the microphone array and the sounds from the local area detected by the microphone array, where Q(ω) is the frequency domain representation of the voice of the user detected by the microphone array and Z(ω) is the frequency domain representation of the sounds from the local area detected by the microphone array.  Using the air-conduction (AC) microphone and the bone-conduction (BC) microphone to determine the background noise reads on determining the spectral correlation based on the tissue based vibrations.);
training a model based on the determined at least one of the spectral correlation and the spatial correlation to distinguish between the voice of the user detected by the microphone array and other sounds from the local area detected by the microphone array (Section 1, lines 11-16, “Model-based approaches to speech enhancement, which rely on trained codebooks of gain-normalized speech and noise PSDs, have been shown to provide better performance in nonstationary noise environments In such methods, the speech and noise PSD codebook candidates and their respective gain factors that describe the observed noisy PSD best (according to a certain criterion), are obtained for each short-time segment.”).
Kechichian teaches determining a gain function, in terms of the power spectral densities (PSDs), that represents the correlation of the frequency domain representation of the voice of the user detected by the microphone array and the frequency domain representation of the sounds from the local area detected by the microphone array, based on the tissue based vibrations, and training a model to perform speech enhancement that determines the voice of the user from the sounds from the local area in order to improve the performance of a codebook-based speech enhancement system (Section 5, lines 1-3, “This paper has investigated the use of a reference signal provided by a bone-conducting microphone to improve the performance of a codebook-based speech enhancement system.”).
Zheng and Kechichian are considered to be analogous to the claimed invention because they are in the same field of speech detection audio systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zheng to incorporate the teachings of Kechichian to determine a gain function, in terms of the power spectral densities (PSDs), that represents the correlation of the frequency domain representation of the voice of the user detected by the microphone array and the frequency domain representation of the sounds from the local area detected by the microphone array, based on the tissue based vibrations, and train a model to perform speech enhancement that determines the voice of the user from the sounds from the local area.  Doing so would allow for improving the performance of a codebook-based speech enhancement system.
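For illustration only, the PSD relations quoted from Kechichian above (P̂d(ω) = max(P̂z(ω) - P̂u(ω), 0) and Q(ω) = G(ω) Z(ω)) can be sketched in Python as follows.  Kechichian is quoted only as stating that the gain function is "usually expressed in terms of these PSDs"; the Wiener-type form G = P̂d / (P̂d + P̂u) used below is an assumption made for this sketch, and the function names and sample values are hypothetical rather than taken from the reference.

```python
def estimate_desired_psd(p_z, p_u):
    """Per-bin spectral subtraction: P_d = max(P_z - P_u, 0)."""
    return [max(z - u, 0.0) for z, u in zip(p_z, p_u)]


def wiener_type_gain(p_d, p_u, eps=1e-12):
    """Assumed gain form G = P_d / (P_d + P_u); eps guards against 0/0."""
    return [d / (d + u + eps) for d, u in zip(p_d, p_u)]


def enhance(z_spectrum, gain):
    """Enhanced spectrum Q = G * Z per bin (Z may hold complex STFT values)."""
    return [g * z for g, z in zip(gain, z_spectrum)]


# Hypothetical three-bin example: noisy PSD, noise PSD, and noisy STFT values.
p_z = [4.0, 1.0, 0.5]
p_u = [1.0, 1.0, 1.0]
p_d = estimate_desired_psd(p_z, p_u)
gain = wiener_type_gain(p_d, p_u)
q = enhance([2 + 0j, 1 + 0j, 0.5 + 0j], gain)
```

In this sketch, bins in which the estimated noise PSD meets or exceeds the noisy PSD receive a gain of zero, while bins dominated by the desired signal are passed with a gain approaching one.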
Claims 3 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Zheng in view of Kechichian and further in view of Maruri et al. (“V-Speech: Noise-Robust Speech Capturing Glasses Using Vibration"), hereinafter Maruri.
Regarding claim 3, Zheng in view of Kechichian discloses the audio system as claimed in claim 2, but does not specifically disclose wherein the contact transducer is configured to sense vibrations of a portion of a nose of the user.
Maruri teaches: wherein the contact transducer is configured to sense vibrations of a portion of a nose of the user (Section 3.1, lines 26-27, "In order to assess the best location to place the sensor, tests were performed in which a vibration sensor was located in four different places on the nose of five different participants.").  Maruri teaches locating a sensor to sense vibrations from the nose provides a high signal-to-noise ratio signal with low background interference (Section 7, lines 1-3, "While capturing speech from a user with vibration sensors located on the nasal pads of regular glasses provides a quite high SNR signal with low background noise interference, the nasal distortion present in the signal makes it not very suitable for H2H communication or ASR.").
Zheng, Kechichian, and Maruri are considered to be analogous to the claimed invention because they are in the same field of speech detection audio systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zheng in view of Kechichian to incorporate the teachings of Maruri to locate a sensor to sense vibrations from the nose.  Doing so would provide a high signal-to-noise ratio signal with low background interference.
Regarding claim 16, Zheng in view of Kechichian discloses the method as claimed in claim 15, but does not specifically disclose wherein the contact transducer is configured to be in contact with a portion of a nose of the user.
Maruri teaches: wherein the contact transducer is configured to be in contact with a portion of a nose of the user (Section 3.1, lines 26-27, "In order to assess the best location to place the sensor, tests were performed in which a vibration sensor was located in four different places on the nose of five different participants.").  Maruri teaches locating a sensor to sense vibrations from the nose provides a high signal-to-noise ratio signal with low background interference (Section 7, lines 1-3, "While capturing speech from a user with vibration sensors located on the nasal pads of regular glasses provides a quite high SNR signal with low background noise interference, the nasal distortion present in the signal makes it not very suitable for H2H communication or ASR.").
Zheng, Kechichian, and Maruri are considered to be analogous to the claimed invention because they are in the same field of speech detection audio systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zheng in view of Kechichian to incorporate the teachings of Maruri to locate a sensor to sense vibrations from the nose.  Doing so would provide a high signal-to-noise ratio signal with low background interference.
Claims 5 and 11 are rejected under 35 U.S.C. 103 as being unpatentable over Zheng in view of Kechichian and further in view of Jackson et al. (US Patent No. 10,873,798), hereinafter Jackson.
Regarding claim 5, Zheng in view of Kechichian discloses the audio system as claimed in claim 1.  Zheng further discloses wherein:
the updated sound filter enhances the voice of the user (Section 5, lines 1-4, "In this section, we describe how to use the bone sensor for speech enhancement in an environment with highly nonstationary noises such as when there are people talking in the background."; Section 5.2, lines 5-15, "Assuming that the noise level in the b is negligible and the additive noise is uncorrelated with the speech signal, the problem can then be formulated as follows: Sy(ω) = Sx(ω) + Sn(ω)  Sx(ω) = f(Sy(ω), Sb(ω))   Sx(ω) = H(ω)Sy(ω) where Sy, Sx, Sb and Sn are the power spectrum for noisy speech, clean speech, bone signal, and noise, respectively, and f(z) is a nonlinear mapping function.  Our goal is to find the optimal H (the Wiener filter)."),
and the controller is further configured to: modify the audio content with the updated filter, wherein the modified audio content enhances the voice of the user (Section 6, lines 7-8, "We then apply our speech enhancement algorithm to estimate the clean speech.").
Zheng in view of Kechichian does not specifically disclose: determine that the modified audio content includes a command; and perform an action in accordance with the command.
Jackson teaches:
determine that the modified audio content includes a command (Column 7, lines 9-12, "For example, the wearable audio device 100 may include a first microphone, such as a beamforming microphone, that is configured to detect voice commands from a user");
and perform an action in accordance with the command (Column 9, lines 10-15, "Audio outputs may be configured to change in response to inputs received at the wearable audio device 100. For example, the processing unit 150 may be configured to change the audio output provided by a speaker in response to an input corresponding to a gesture input, physical manipulation, voice command, and so on.").
Jackson teaches detecting a voice command and performing an action based on the voice command to provide voice control over the audio device output (Column 7, lines 18-30, "The processing unit 150 may receive a detection output from each microphone and distinguish between the various types of inputs. For example, the processing unit 150 may identify a detection output from the microphone(s) associated with an input (e.g., a voice command, a facial tap, and so on) and initiate a signal that is used to control a corresponding function of the wearable audio device 100, such as an output provided by an output device 140. The processing unit 150 may also identify signals from the microphone(s) associated with an ambient condition and ignore the signal and/or use the signal to control an audio output of the wearable audio device 100 (e.g., a speaker), such as acoustically cancelling or mitigating the effects of ambient noise.").
Zheng, Kechichian, and Jackson are considered to be analogous to the claimed invention because they are in the same field of speech detection audio systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zheng in view of Kechichian to incorporate the teachings of Jackson to detect a voice command and perform an action based on the voice command.  Doing so would provide voice control over the audio device output.
Regarding claim 11, Zheng in view of Kechichian discloses the method as claimed in claim 9.  Zheng further discloses wherein:
 the updated sound filter enhances the voice of the user (Section 5, lines 1-4, "In this section, we describe how to use the bone sensor for speech enhancement in an environment with highly nonstationary noises such as when there are people talking in the background."; Section 5.2, lines 5-15, "Assuming that the noise level in the b is negligible and the additive noise is uncorrelated with the speech signal, the problem can then be formulated as follows: Sy(ω) = Sx(ω) + Sn(ω)  Sx(ω) = f(Sy(ω), Sb(ω))   Sx(ω) = H(ω)Sy(ω) where Sy, Sx, Sb and Sn are the power spectrum for noisy speech, clean speech, bone signal, and noise, respectively, and f(z) is a nonlinear mapping function.  Our goal is to find the optimal H (the Wiener filter)."),
and the method further comprises: modifying the audio content with the updated filter, wherein the modified audio content enhances the voice of the user (Section 6, lines 7-8, "We then apply our speech enhancement algorithm to estimate the clean speech.").
Zheng in view of Kechichian does not specifically disclose: determining that the modified audio content includes a command; and performing an action in accordance with the command.
Jackson teaches:
determining that the modified audio content includes a command (Column 7, lines 9-12, "For example, the wearable audio device 100 may include a first microphone, such as a beamforming microphone, that is configured to detect voice commands from a user");
and performing an action in accordance with the command (Column 9, lines 10-15, "Audio outputs may be configured to change in response to inputs received at the wearable audio device 100. For example, the processing unit 150 may be configured to change the audio output provided by a speaker in response to an input corresponding to a gesture input, physical manipulation, voice command, and so on.").
Jackson teaches detecting a voice command and performing an action based on the voice command to provide voice control over the audio device output (Column 7, lines 18-30, "The processing unit 150 may receive a detection output from each microphone and distinguish between the various types of inputs. For example, the processing unit 150 may identify a detection output from the microphone(s) associated with an input (e.g., a voice command, a facial tap, and so on) and initiate a signal that is used to control a corresponding function of the wearable audio device 100, such as an output provided by an output device 140. The processing unit 150 may also identify signals from the microphone(s) associated with an ambient condition and ignore the signal and/or use the signal to control an audio output of the wearable audio device 100 (e.g., a speaker), such as acoustically cancelling or mitigating the effects of ambient noise.").
Zheng, Kechichian, and Jackson are considered to be analogous to the claimed invention because they are in the same field of speech detection audio systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zheng in view of Kechichian to incorporate the teachings of Jackson to detect a voice command and perform an action based on the voice command.  Doing so would provide voice control over the audio device output.
Claims 6, 12 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Zheng in view of Kechichian and further in view of Zhou et al. ("A Real-Time Dual-Microphone Speech Enhancement Algorithm Assisted by Bone Conduction Sensor"), hereinafter Zhou, and Bergh et al. ("Multi-Speaker Voice Activity Detection Using a Camera-assisted Microphone Array"), hereinafter Bergh.
Regarding claim 6, Zheng in view of Kechichian discloses the audio system as claimed in claim 1, but does not specifically disclose: the controller is further configured to: train an adaptive beamformer using the tissue based vibrations and the sounds from the local area.
Zhou teaches an adaptive beamformer using the tissue based vibrations and the sounds from the local area (Abstract, lines 1-5, "The quality and intelligibility of the speech are usually impaired by the interference of background noise when using internet voice calls. To solve this problem in the context of wearable smart devices, this paper introduces a dual-microphone, bone-conduction (BC) sensor assisted beamformer and a simple recurrent unit (SRU)-based neural network postfilter for real-time speech enhancement."; Figure 4, "The framework of the proposed BC signal-assisted adaptive beamforming algorithm for speech enhancement.").  Zhou teaches implementing an adaptive beamformer using tissue-based vibrations and sound provides better sound quality and intelligibility compared to using only an air-conduction microphone (Abstract, lines 12-14, "Experimental results demonstrate that the proposed real-time speech enhancement system provides significant speech sound quality and intelligibility improvements for all noise types and levels when compared with the AC-only beamformer with a postfiltering algorithm.").
Zheng, Kechichian, and Zhou are considered to be analogous to the claimed invention because they are in the same field of speech detection audio systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zheng in view of Kechichian to incorporate the teachings of Zhou to implement an adaptive beamformer using tissue-based vibrations and sound.  Doing so would provide better sound quality and intelligibility compared to using only an air-conduction microphone.
Zheng in view of Kechichian and further in view of Zhou does not specifically disclose training an adaptive beamformer.
Bergh teaches training an adaptive beamformer (Abstract, lines 3-7, "The proposed method uses face detection to identify locations of potential speech sources, and uses this information in an adaptive beamforming procedure to form a spatially directed detection algorithm to identify voice activity for individual speakers."; Section III, lines 4-10, "The experimental data consist of video and array audio (16 × 16 channels), as well as audio recorded with two handheld close-talking microphones, one for each participant. The audio power from the handheld microphones is thresholded to yield a binary variable (speaking/not speaking) for each recording, which is then used as the "ground truth" labels for the classifier training and evaluation.").  Bergh teaches training an adaptive beamformer improves accuracy (Section IV, lines 1-3, "The method we have presented has a higher accuracy than comparable methods, is robust to multiple simultaneous speakers, and works well even on moderately sized arrays.").
Zheng, Kechichian, Zhou, and Bergh are considered to be analogous to the claimed invention because they are in the same field of speech detection audio systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zheng in view of Kechichian and further in view of Zhou to incorporate the teachings of Bergh to train an adaptive beamformer.  Doing so would improve accuracy.
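As an illustrative sketch of the labeling step quoted from Bergh above (thresholding the audio power of a close-talking microphone to yield a binary speaking/not speaking variable), the following Python example computes per-frame power and applies a threshold.  The frame length, threshold value, sample data, and function names are hypothetical and are not taken from the reference, which is quoted as thresholding the audio power for each recording without specifying these details.

```python
def frame_power(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    powers = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        powers.append(sum(x * x for x in frame) / frame_len)
    return powers


def speaking_labels(samples, frame_len, threshold):
    """Label a frame 1 (speaking) when its power exceeds the threshold, else 0."""
    return [1 if p > threshold else 0 for p in frame_power(samples, frame_len)]


# Hypothetical close-talking microphone samples, two samples per frame.
close_talk = [0.0, 0.01, 0.5, -0.6, 0.02, 0.0, 0.7, 0.4]
labels = speaking_labels(close_talk, frame_len=2, threshold=0.05)
```

Labels produced in this way could then serve as the "ground truth" targets for classifier training and evaluation, as described in the quoted passage.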
Regarding claim 12, Zheng in view of Kechichian discloses the method as claimed in claim 9, but does not specifically disclose: the method further comprising: training an adaptive beamformer using the tissue based vibrations and the sounds from the local area.
Zhou teaches an adaptive beamformer using the tissue based vibrations and the sounds from the local area (Abstract, lines 1-5, "The quality and intelligibility of the speech are usually impaired by the interference of background noise when using internet voice calls. To solve this problem in the context of wearable smart devices, this paper introduces a dual-microphone, bone-conduction (BC) sensor assisted beamformer and a simple recurrent unit (SRU)-based neural network postfilter for real-time speech enhancement."; Figure 4, "The framework of the proposed BC signal-assisted adaptive beamforming algorithm for speech enhancement.").  Zhou teaches implementing an adaptive beamformer using tissue-based vibrations and sound provides better sound quality and intelligibility compared to using only an air-conduction microphone (Abstract, lines 12-14, "Experimental results demonstrate that the proposed real-time speech enhancement system provides significant speech sound quality and intelligibility improvements for all noise types and levels when compared with the AC-only beamformer with a postfiltering algorithm.").
Zheng, Kechichian, and Zhou are considered to be analogous to the claimed invention because they are in the same field of speech detection audio systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zheng in view of Kechichian to incorporate the teachings of Zhou to implement an adaptive beamformer using tissue-based vibrations and sound.  Doing so would provide better sound quality and intelligibility compared to using only an air-conduction microphone.
Zheng in view of Kechichian and further in view of Zhou does not specifically disclose training an adaptive beamformer.
Bergh teaches training an adaptive beamformer (Abstract, lines 3-7, "The proposed method uses face detection to identify locations of potential speech sources, and uses this information in an adaptive beamforming procedure to form a spatially directed detection algorithm to identify voice activity for individual speakers."; Section III, lines 4-10, "The experimental data consist of video and array audio (16 × 16 channels), as well as audio recorded with two handheld close-talking microphones, one for each participant. The audio power from the handheld microphones is thresholded to yield a binary variable (speaking/not speaking) for each recording, which is then used as the "ground truth" labels for the classifier training and evaluation.").  Bergh teaches training an adaptive beamformer improves accuracy (Section IV, lines 1-3, "The method we have presented has a higher accuracy than comparable methods, is robust to multiple simultaneous speakers, and works well even on moderately sized arrays.").
Zheng, Kechichian, Zhou, and Bergh are considered to be analogous to the claimed invention because they are in the same field of speech detection audio systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zheng in view of Kechichian and further in view of Zhou to incorporate the teachings of Bergh to train an adaptive beamformer.  Doing so would improve accuracy.
Regarding claim 18, Zheng in view of Kechichian discloses the computer readable medium as claimed in claim 17, but does not specifically disclose: the program code instructions, when executed by the processor, further cause the processor to perform steps comprising: training an adaptive beamformer using the tissue based vibrations and the sounds from the local area.
Zhou teaches an adaptive beamformer using the tissue based vibrations and the sounds from the local area (Abstract, lines 1-5, "The quality and intelligibility of the speech are usually impaired by the interference of background noise when using internet voice calls. To solve this problem in the context of wearable smart devices, this paper introduces a dual-microphone, bone-conduction (BC) sensor assisted beamformer and a simple recurrent unit (SRU)-based neural network postfilter for real-time speech enhancement."; Figure 4, "The framework of the proposed BC signal-assisted adaptive beamforming algorithm for speech enhancement.").  Zhou teaches implementing an adaptive beamformer using tissue-based vibrations and sound provides better sound quality and intelligibility compared to using only an air-conduction microphone (Abstract, lines 12-14, "Experimental results demonstrate that the proposed real-time speech enhancement system provides significant speech sound quality and intelligibility improvements for all noise types and levels when compared with the AC-only beamformer with a postfiltering algorithm.").
Zheng, Kechichian, and Zhou are considered to be analogous to the claimed invention because they are in the same field of speech detection audio systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zheng in view of Kechichian to incorporate the teachings of Zhou to implement an adaptive beamformer using tissue-based vibrations and sound.  Doing so would provide better sound quality and intelligibility compared to using only an air-conduction microphone.
Zheng in view of Kechichian and further in view of Zhou does not specifically disclose training an adaptive beamformer.
Bergh teaches training an adaptive beamformer (Abstract, lines 3-7, "The proposed method uses face detection to identify locations of potential speech sources, and uses this information in an adaptive beamforming procedure to form a spatially directed detection algorithm to identify voice activity for individual speakers."; Section III, lines 4-10, "The experimental data consist of video and array audio (16 × 16 channels), as well as audio recorded with two handheld close-talking microphones, one for each participant. The audio power from the handheld microphones is thresholded to yield a binary variable (speaking/not speaking) for each recording, which is then used as the "ground truth" labels for the classifier training and evaluation.").  Bergh teaches training an adaptive beamformer improves accuracy (Section IV, lines 1-3, "The method we have presented has a higher accuracy than comparable methods, is robust to multiple simultaneous speakers, and works well even on moderately sized arrays.").
Zheng, Kechichian, Zhou, and Bergh are considered to be analogous to the claimed invention because they are in the same field of speech detection audio systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zheng in view of Kechichian and further in view of Zhou to incorporate the teachings of Bergh to train an adaptive beamformer.  Doing so would improve accuracy.
Claims 7, 13 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Zheng in view of Kechichian and further in view of Mehta et al. ("Relationships Between Vocal Function Measures Derived from an Acoustic Microphone and a Subglottal Neck-Surface Accelerometer"), hereinafter Mehta.
Regarding claim 7, Zheng in view of Kechichian discloses the audio system as claimed in claim 1.  
Kechichian teaches:
determine, based on the tissue based vibrations, the spectral correlation between: (i) the voice of the user detected by the microphone array, and (ii) the sounds from the local area detected by the microphone array (Section 2, lines 1-4, “Consider an additive noise model where the AC microphone signal is given by z(n) = d(n) + u(n), with d(n) denoting the desired signal, in this case the user’s speech, u(n) the undesired signal, such as background noise, and n the discrete-time index.”; Section 2, lines 7-16, “Let the short-time Fourier transform (STFT) of the microphone signal be denoted by Z(ω) = D(ω) + U(ω), where D(ω) and U(ω) denote the STFT of the desired and undesired signal, respectively. Also, let P̂z(ω), P̂d(ω), and P̂u(ω) denote estimated PSDs of the microphone, desired, and undesired signals, respectively. Typically, P̂u(ω) is obtained using a noise estimation algorithm that estimates the noise PSD from the noisy PSD. P̂d(ω) is then obtained as max (P̂z(ω) - P̂u(ω), 0). The gain function G(ω) is usually expressed in terms of these PSDs, and the enhanced signal is given by Q(ω) = G(ω) Z(ω)”; Section 1, lines 22-28, “An example of such a sensor is a bone-conduction (BC) microphone which is placed in contact with a small area of the user’s skin. The location of the BC microphone can affect the quality and intelligibility of the captured BC speech, a preferred location being an area near the larynx. Unlike air-conduction (AC) microphones which also capture external background noise present in the user’s environment, BC sensors are immune to this background noise since the captured signal is transmitted through bone and tissue directly to the sensor.”; Section 3, lines 1-11, “The proposed method relies on trained codebooks of AC and BC clean speech PSDs, and a mapping between the two that is generated during an offline training procedure. 
During the actual online noise reduction, the AC microphone observes both the speech signal and the background noise whereas the BC microphone observes only the BC speech signal. For each short-time segment of the noisy speech, the BC codebook vector that is closest (with respect to a selected distortion criterion) to the PSD of the observed BC signal is identified in a first step. Though the BC codebook is trained on clean BC speech signals, this step is justified as the BC signal is noise-free even in the presence of background noise. In the second step, the AC codebook vector that is mapped to the selected BC codebook vector is used in the codebook-based speech enhancement algorithm, instead of the entire AC codebook.”; The gain function G(ω) expressed in terms of the power spectral densities (PSDs) reads on a spectral correlation between the voice of the user detected by the microphone array and the sounds from the local area detected by the microphone array, where Q(ω) is the frequency domain representation of the voice of the user detected by the microphone array and Z(ω) is the frequency domain representation of the sounds from the local area detected by the microphone array.  Using the air-conduction (AC) microphone and the bone-conduction (BC) microphone to determine the background noise reads on determining the spectral correlation based on the tissue based vibrations.);
wherein the model is trained based on both the determined spectral correlation (Section 1, lines 11-16, “Model-based approaches to speech enhancement, which rely on trained codebooks of gain-normalized speech and noise PSDs, have been shown to provide better performance in nonstationary noise environments. In such methods, the speech and noise PSD codebook candidates and their respective gain factors that describe the observed noisy PSD best (according to a certain criterion), are obtained for each short-time segment.”).
Kechichian teaches determining a gain function, in terms of the power spectral densities (PSDs), that represents the correlation of the frequency domain representation of the voice of the user detected by the microphone array and the frequency domain representation of the sounds from the local area detected by the microphone array, based on the tissue based vibrations, and training a model to perform speech enhancement that determines the voice of the user from the sounds from the local area in order to improve the performance of a codebook-based speech enhancement system (Section 5, lines 1-3, “This paper has investigated the use of a reference signal provided by a bone-conducting microphone to improve the performance of a codebook-based speech enhancement system.”).
Zheng and Kechichian are considered to be analogous to the claimed invention because they are in the same field of speech detection audio systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zheng in view of Kechichian to incorporate the teachings of Kechichian to determine a gain function, in terms of the power spectral densities (PSDs), that represents the correlation of the frequency domain representation of the voice of the user detected by the microphone array and the frequency domain representation of the sounds from the local area detected by the microphone array, based on the tissue based vibrations, and train a model to perform speech enhancement that determines the voice of the user from the sounds from the local area.  Doing so would allow for improving the performance of a codebook-based speech enhancement system.
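For illustration only, the two-step codebook selection quoted from Kechichian, Section 3 (identifying the BC codebook vector closest to the observed BC PSD, then using the AC codebook vector mapped to it) can be sketched in Python as follows.  The reference is quoted as using "a selected distortion criterion" without naming it; the squared-error distortion below is an assumption made for this sketch, and the codebook contents, the mapping, and the function names are hypothetical.

```python
def nearest_index(codebook, observed):
    """Index of the codebook vector with the smallest squared-error distortion."""
    def distortion(vec):
        return sum((a - b) ** 2 for a, b in zip(vec, observed))
    return min(range(len(codebook)), key=lambda i: distortion(codebook[i]))


def select_ac_vector(bc_codebook, ac_codebook, bc_to_ac, observed_bc_psd):
    """Step 1: find the closest BC vector. Step 2: return the AC vector mapped to it."""
    bc_index = nearest_index(bc_codebook, observed_bc_psd)
    return ac_codebook[bc_to_ac[bc_index]]


# Hypothetical two-entry codebooks and an offline-trained BC-to-AC index mapping.
bc_codebook = [[1.0, 0.2], [0.1, 0.9]]
ac_codebook = [[0.8, 0.6, 0.3], [0.2, 0.7, 0.9]]
bc_to_ac = {0: 1, 1: 0}
selected = select_ac_vector(bc_codebook, ac_codebook, bc_to_ac, [0.2, 1.0])
```

Because the BC signal is described as noise-free even in the presence of background noise, the lookup in the first step operates directly on the observed BC PSD, and only the single mapped AC vector (rather than the entire AC codebook) is passed on to the enhancement stage.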
Zheng in view of Kechichian does not specifically disclose: determine the spectral correlation and the spatial correlation.
Mehta teaches:
determine the spectral correlation and the spatial correlation (Section I, lines 1-6, "Body surface vibrations generated during speaking often provide robust signals that can be related to the underlying physiological mechanisms of voice and speech production. Accelerometer (ACC) sensors can measure these signals by taking advantage of the piezoelectric effect to transduce mechanical forces into electrical signals."; Section 3A, lines 1-6, "Table I reports the correlation coefficients between ACC based and MIC-based measures of jitter (JCV, Jlocal), shimmer (SCV, Slocal), harmonics-to-noise ratio (HNRtime, HNRspec), spectral tilt (TL8), and cepstral peak prominence (CPP). All correlation coefficients achieved statistical significance for each subject group."; The jitter, shimmer, and harmonics-to-noise ratio correlations can potentially be affected by spatial differences and read on the spatial correlations, and the spectral tilt and cepstral peak prominence correlations read on the spectral correlations.).  Mehta teaches determining spectral and spatial correlations allows for tracking vocal function deterioration (Section IV, lines 37-41, "For example, due to the high correlation between ACC- and MIC-based estimates of CPP, future work could track changes in CPP from a speaker’s ambulatory accelerometer signal to reveal deterioration of vocal function over the course of a day due to vocal fatigue").
Zheng, Kechichian, and Mehta are considered to be analogous to the claimed invention because they are in the same field of speech detection audio systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zheng in view of Kechichian to incorporate the teachings of Mehta to determine spectral and spatial correlations.  Doing so would allow for tracking vocal function deterioration.
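As an illustrative sketch of the kind of correlation coefficient quoted from Mehta above (between ACC-based and MIC-based measures such as CPP), the following Python example computes a correlation between two paired series.  The quoted passage does not state which coefficient was used; the Pearson coefficient below is an assumption made for this sketch, and the measurement values and function name are hypothetical and are not data from the reference.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical paired CPP values from an accelerometer (ACC) and a microphone (MIC).
acc_cpp = [10.0, 12.0, 14.0, 16.0]
mic_cpp = [11.0, 13.5, 15.0, 18.0]
r = pearson_r(acc_cpp, mic_cpp)
```

A coefficient near one for such paired measures would indicate that the accelerometer-derived measure tracks the microphone-derived measure closely, which is the relationship the quoted passage relies upon.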
Regarding claim 13, Zheng in view of Kechichian discloses the method as claimed in claim 9.
Kechichian teaches:
determining, based on the tissue based vibrations, the spectral correlation between: (i) the voice of the user detected by the microphone array, and (ii) the sounds from the local area detected by the microphone array (Section 2, lines 1-4, “Consider an additive noise model where the AC microphone signal is given by z(n) = d(n) + u(n), with d(n) denoting the desired signal, in this case the user’s speech, u(n) the undesired signal, such as background noise, and n the discrete-time index.”; Section 2, lines 7-16, “Let the short-time Fourier transform (STFT) of the microphone signal be denoted by Z(ω) = D(ω) + U(ω), where D(ω) and U(ω) denote the STFT of the desired and undesired signal, respectively. Also, let P̂z(ω), P̂d(ω), and P̂u(ω) denote estimated PSDs of the microphone, desired, and undesired signals, respectively. Typically, P̂u(ω) is obtained using a noise estimation algorithm that estimates the noise PSD from the noisy PSD. P̂d(ω) is then obtained as max (P̂z(ω) - P̂u(ω), 0). The gain function G(ω) is usually expressed in terms of these PSDs, and the enhanced signal is given by Q(ω) = G(ω) Z(ω)”; Section 1, lines 22-28, “An example of such a sensor is a bone-conduction (BC) microphone which is placed in contact with a small area of the user’s skin. The location of the BC microphone can affect the quality and intelligibility of the captured BC speech, a preferred location being an area near the larynx. Unlike air-conduction (AC) microphones which also capture external background noise present in the user’s environment, BC sensors are immune to this background noise since the captured signal is transmitted through bone and tissue directly to the sensor.”; Section 3, lines 1-11, “The proposed method relies on trained codebooks of AC and BC clean speech PSDs, and a mapping between the two that is generated during an offline training procedure. 
During the actual online noise reduction, the AC microphone observes both the speech signal and the background noise whereas the BC microphone observes only the BC speech signal. For each short-time segment of the noisy speech, the BC codebook vector that is closest (with respect to a selected distortion criterion) to the PSD of the observed BC signal is identified in a first step. Though the BC codebook is trained on clean BC speech signals, this step is justified as the BC signal is noise-free even in the presence of background noise. In the second step, the AC codebook vector that is mapped to the selected BC codebook vector is used in the codebook-based speech enhancement algorithm, instead of the entire AC codebook.”; The gain function G(ω) expressed in terms of the power spectral densities (PSDs) reads on a spectral correlation between the voice of the user detected by the microphone array and the sounds from the local area detected by the microphone array, where Q(ω) is the frequency domain representation of the voice of the user detected by the microphone array and Z(ω) is the frequency domain representation of the sounds from the local area detected by the microphone array.  Using the air-conduction (AC) microphone and the bone-conduction (BC) microphone to determine the background noise reads on determining the spectral correlation based on the tissue based vibrations.);
wherein the model is trained based on both the determined spectral correlation (Section 1, lines 11-16, “Model-based approaches to speech enhancement, which rely on trained codebooks of gain-normalized speech and noise PSDs, have been shown to provide better performance in nonstationary noise environments. In such methods, the speech and noise PSD codebook candidates and their respective gain factors that describe the observed noisy PSD best (according to a certain criterion), are obtained for each short-time segment.”).
Kechichian teaches determining a gain function, in terms of the power spectral densities (PSDs), that represents the correlation of the frequency domain representation of the voice of the user detected by the microphone array and the frequency domain representation of the sounds from the local area detected by the microphone array, based on the tissue based vibrations, and training a model to perform speech enhancement that determines the voice of the user from the sounds from the local area in order to improve the performance of a codebook-based speech enhancement system (Section 5, lines 1-3, “This paper has investigated the use of a reference signal provided by a bone-conducting microphone to improve the performance of a codebook-based speech enhancement system.”).
Zheng and Kechichian are considered to be analogous to the claimed invention because they are in the same field of speech detection audio systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zheng in view of Kechichian to incorporate the teachings of Kechichian to determine a gain function, in terms of the power spectral densities (PSDs), that represents the correlation of the frequency domain representation of the voice of the user detected by the microphone array and the frequency domain representation of the sounds from the local area detected by the microphone array, based on the tissue based vibrations, and train a model to perform speech enhancement that determines the voice of the user from the sounds from the local area.  Doing so would allow for improving the performance of a codebook-based speech enhancement system.
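For illustration only, the additive noise model and PSD-based gain quoted above from Kechichian, Section 2, can be sketched as follows. The Wiener-type form G = P̂d / (P̂d + P̂u) is an assumption, since the quoted passage states only that the gain function "is usually expressed in terms of these PSDs". The function names and sample values are hypothetical and are not taken from the reference.

```python
def estimate_speech_psd(noisy_psd, noise_psd):
    """P_d(w) = max(P_z(w) - P_u(w), 0) per frequency bin, as quoted from Section 2."""
    return [max(pz - pu, 0.0) for pz, pu in zip(noisy_psd, noise_psd)]


def wiener_gain(speech_psd, noise_psd, eps=1e-12):
    """Assumed Wiener-type gain G = P_d / (P_d + P_u); eps guards against division by zero."""
    return [pd / (pd + pu + eps) for pd, pu in zip(speech_psd, noise_psd)]


def enhance(noisy_spectrum, gain):
    """Q(w) = G(w) Z(w), applied bin by bin to the noisy STFT frame."""
    return [g * z for g, z in zip(gain, noisy_spectrum)]


# Hypothetical three-bin frame: only the first bin holds more power than the noise estimate.
noisy_psd = [4.0, 1.0, 0.5]
noise_psd = [1.0, 1.0, 1.0]
speech_psd = estimate_speech_psd(noisy_psd, noise_psd)
gain = wiener_gain(speech_psd, noise_psd)
enhanced = enhance([2.0 + 0j, 1.0 + 0j, 0.5 + 0j], gain)
```

In this sketch, bins where the noise estimate meets or exceeds the observed power receive zero gain, which corresponds to the max(·, 0) floor in the quoted passage.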
Zheng in view of Kechichian does not specifically disclose: determining the spectral correlation and the spatial correlation.
Mehta teaches:
determining the spectral correlation and the spatial correlation (Section I, lines 1-6, "Body surface vibrations generated during speaking often provide robust signals that can be related to the underlying physiological mechanisms of voice and speech production. Accelerometer (ACC) sensors can measure these signals by taking advantage of the piezoelectric effect to transduce mechanical forces into electrical signals."; Section 3A, lines 1-6, "Table I reports the correlation coefficients between ACC-based and MIC-based measures of jitter (JCV, Jlocal), shimmer (SCV, Slocal), harmonics-to-noise ratio (HNRtime, HNRspec), spectral tilt (TL8), and cepstral peak prominence (CPP). All correlation coefficients achieved statistical significance for each subject group."; The jitter, shimmer, and harmonics-to-noise ratio correlations can potentially be affected by spatial differences and read on the spatial correlations, and the spectral tilt and cepstral peak prominence correlations read on the spectral correlations.).  Mehta teaches that determining spectral and spatial correlations allows for tracking vocal function deterioration (Section IV, lines 37-41, "For example, due to the high correlation between ACC- and MIC-based estimates of CPP, future work could track changes in CPP from a speaker’s ambulatory accelerometer signal to reveal deterioration of vocal function over the course of a day due to vocal fatigue").
Zheng, Kechichian, and Mehta are considered to be analogous to the claimed invention because they are in the same field of speech detection audio systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zheng in view of Kechichian to incorporate the teachings of Mehta to determine spectral and spatial correlations.  Doing so would allow for tracking vocal function deterioration.
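As a non-limiting illustration of the correlation coefficients between ACC-based and MIC-based measures quoted from Mehta, the following sketch computes a correlation coefficient over paired measurements. The use of the Pearson coefficient is an assumption, since the quoted passage does not state which coefficient was used, and the sample values below are hypothetical rather than data from the reference.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical paired measures of one voice parameter (e.g. CPP) from each sensor.
acc_measure = [10.0, 12.0, 14.0, 16.0]
mic_measure = [11.0, 13.5, 15.0, 18.0]
r = pearson_r(acc_measure, mic_measure)
```

A value of r near 1 would indicate that the accelerometer-based measure tracks the microphone-based measure closely.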
Regarding claim 19, Zheng in view of Kechichian discloses the computer readable medium as claimed in claim 17.
Kechichian teaches:
determining, based on the tissue based vibrations, the spectral correlation between: (i) the voice of the user detected by the microphone array, and (ii) the sounds from the local area detected by the microphone array (Section 2, lines 1-4, “Consider an additive noise model where the AC microphone signal is given by z(n) = d(n) + u(n), with d(n) denoting the desired signal, in this case the user’s speech, u(n) the undesired signal, such as background noise, and n the discrete-time index.”; Section 2, lines 7-16, “Let the short-time Fourier transform (STFT) of the microphone signal be denoted by Z(ω) = D(ω) + U(ω), where D(ω) and U(ω) denote the STFT of the desired and undesired signal, respectively. Also, let P̂z(ω), P̂d(ω), and P̂u(ω) denote estimated PSDs of the microphone, desired, and undesired signals, respectively. Typically, P̂u(ω) is obtained using a noise estimation algorithm that estimates the noise PSD from the noisy PSD. P̂d(ω) is then obtained as max (P̂z(ω) - P̂u(ω), 0). The gain function G(ω) is usually expressed in terms of these PSDs, and the enhanced signal is given by Q(ω) = G(ω) Z(ω)”; Section 1, lines 22-28, “An example of such a sensor is a bone-conduction (BC) microphone which is placed in contact with a small area of the user’s skin. The location of the BC microphone can affect the quality and intelligibility of the captured BC speech, a preferred location being an area near the larynx. Unlike air-conduction (AC) microphones which also capture external background noise present in the user’s environment, BC sensors are immune to this background noise since the captured signal is transmitted through bone and tissue directly to the sensor.”; Section 3, lines 1-11, “The proposed method relies on trained codebooks of AC and BC clean speech PSDs, and a mapping between the two that is generated during an offline training procedure. 
During the actual online noise reduction, the AC microphone observes both the speech signal and the background noise whereas the BC microphone observes only the BC speech signal. For each short-time segment of the noisy speech, the BC codebook vector that is closest (with respect to a selected distortion criterion) to the PSD of the observed BC signal is identified in a first step. Though the BC codebook is trained on clean BC speech signals, this step is justified as the BC signal is noise-free even in the presence of background noise. In the second step, the AC codebook vector that is mapped to the selected BC codebook vector is used in the codebook-based speech enhancement algorithm, instead of the entire AC codebook.”; The gain function G(ω) expressed in terms of the power spectral densities (PSDs) reads on a spectral correlation between the voice of the user detected by the microphone array and the sounds from the local area detected by the microphone array, where Q(ω) is the frequency domain representation of the voice of the user detected by the microphone array and Z(ω) is the frequency domain representation of the sounds from the local area detected by the microphone array.  Using the air-conduction (AC) microphone and the bone-conduction (BC) microphone to determine the background noise reads on determining the spectral correlation based on the tissue based vibrations.);
wherein the model is trained based on both the determined spectral correlation (Section 1, lines 11-16, “Model-based approaches to speech enhancement, which rely on trained codebooks of gain-normalized speech and noise PSDs, have been shown to provide better performance in nonstationary noise environments. In such methods, the speech and noise PSD codebook candidates and their respective gain factors that describe the observed noisy PSD best (according to a certain criterion), are obtained for each short-time segment.”).
Kechichian teaches determining a gain function, in terms of the power spectral densities (PSDs), that represents the correlation of the frequency domain representation of the voice of the user detected by the microphone array and the frequency domain representation of the sounds from the local area detected by the microphone array, based on the tissue based vibrations, and training a model to perform speech enhancement that determines the voice of the user from the sounds from the local area in order to improve the performance of a codebook-based speech enhancement system (Section 5, lines 1-3, “This paper has investigated the use of a reference signal provided by a bone-conducting microphone to improve the performance of a codebook-based speech enhancement system.”).
Zheng and Kechichian are considered to be analogous to the claimed invention because they are in the same field of speech detection audio systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zheng in view of Kechichian to incorporate the teachings of Kechichian to determine a gain function, in terms of the power spectral densities (PSDs), that represents the correlation of the frequency domain representation of the voice of the user detected by the microphone array and the frequency domain representation of the sounds from the local area detected by the microphone array, based on the tissue based vibrations, and train a model to perform speech enhancement that determines the voice of the user from the sounds from the local area.  Doing so would allow for improving the performance of a codebook-based speech enhancement system.
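For illustration only, the two-step codebook selection quoted above from Kechichian, Section 3, can be sketched as follows. The squared Euclidean distance is an assumed stand-in for the "selected distortion criterion", and the codebooks, the index mapping, and all names are hypothetical.

```python
def squared_distance(a, b):
    """Assumed distortion criterion: squared Euclidean distance between two PSD vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def select_ac_entry(bc_psd, bc_codebook, bc_to_ac, ac_codebook):
    """Step 1: find the BC codebook vector closest to the observed BC PSD.
    Step 2: return the AC codebook vector mapped to that BC entry."""
    nearest = min(
        range(len(bc_codebook)),
        key=lambda i: squared_distance(bc_psd, bc_codebook[i]),
    )
    return nearest, ac_codebook[bc_to_ac[nearest]]


# Hypothetical two-entry codebooks and an offline-trained BC-to-AC index mapping.
bc_codebook = [[1.0, 0.0], [0.0, 1.0]]
ac_codebook = [[2.0, 2.0], [5.0, 1.0]]
bc_to_ac = [1, 0]

bc_index, ac_vector = select_ac_entry([0.9, 0.2], bc_codebook, bc_to_ac, ac_codebook)
```

The selected AC vector would then stand in for the entire AC codebook in the enhancement step, consistent with the quoted passage.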
Zheng in view of Kechichian does not specifically disclose: determining the spectral correlation and the spatial correlation.
Mehta teaches:
determining the spectral correlation and the spatial correlation (Section I, lines 1-6, "Body surface vibrations generated during speaking often provide robust signals that can be related to the underlying physiological mechanisms of voice and speech production. Accelerometer (ACC) sensors can measure these signals by taking advantage of the piezoelectric effect to transduce mechanical forces into electrical signals."; Section 3A, lines 1-6, "Table I reports the correlation coefficients between ACC-based and MIC-based measures of jitter (JCV, Jlocal), shimmer (SCV, Slocal), harmonics-to-noise ratio (HNRtime, HNRspec), spectral tilt (TL8), and cepstral peak prominence (CPP). All correlation coefficients achieved statistical significance for each subject group."; The jitter, shimmer, and harmonics-to-noise ratio correlations can potentially be affected by spatial differences and read on the spatial correlations, and the spectral tilt and cepstral peak prominence correlations read on the spectral correlations.).  Mehta teaches that determining spectral and spatial correlations allows for tracking vocal function deterioration (Section IV, lines 37-41, "For example, due to the high correlation between ACC- and MIC-based estimates of CPP, future work could track changes in CPP from a speaker’s ambulatory accelerometer signal to reveal deterioration of vocal function over the course of a day due to vocal fatigue").
Zheng, Kechichian, and Mehta are considered to be analogous to the claimed invention because they are in the same field of speech detection audio systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zheng in view of Kechichian to incorporate the teachings of Mehta to determine spectral and spatial correlations.  Doing so would allow for tracking vocal function deterioration.
Claims 8, 14 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Zheng in view of Kechichian and further in view of Shang et al. ("Enabling Secure Voice Input on Augmented Reality Headsets using Internal Body Voice"), hereinafter Shang.
Regarding claim 8, Zheng in view of Kechichian discloses the audio system as claimed in claim 1, but does not specifically disclose wherein, the controller is further configured to: determine, based on the tissue based vibrations, one or more functions describing the voice of the user detected by the microphone array within the local area, wherein the functions are selected from a group comprising: a temporal response of the voice of the user detected by the microphone array within the local area, a spectral response of the voice of the user detected by the microphone array within the local area, and a spatial response of the voice of the user detected by the microphone array within the local area; and train the model using the determined one or more functions to distinguish between the voice of the user detected by the microphone array and other sounds from the local area detected by the microphone array.
Shang teaches:
determine, based on the tissue based vibrations, one or more functions describing the voice of the user detected by the microphone array within the local area, wherein the functions are selected from a group comprising: a temporal response of the voice of the user detected by the microphone array within the local area, a spectral response of the voice of the user detected by the microphone array within the local area, and a spatial response of the voice of the user detected by the microphone array within the local area (Section 3A, lines 20-27, "After collecting the user’s voices at two channels, we first segment the voice for each word to remove the internal between neighboring words. For the voice signals of each pair of words, we transform the signals from the time domain to the time-frequency domain. Since both raw voice signals contain background noise, we further leverage spectrogram enhancement techniques to remove the noise and extract the information of the voices."; The spectrogram enhancement reads on a spectral response function.);
and train the model using the determined one or more functions to distinguish between the voice of the user detected by the microphone array and other sounds from the local area detected by the microphone array (Section 3A, lines 1-16, "The key idea underlying our system is to fully leverage two propagation paths of the human voices. When the AR user says a voice command, the normal microphone will capture the user’s voice that propagates through the air, and the contact microphone on user’s head can record the voice that only propagates through the user’s body. By comparing the information in two voices, our system can determine whether the voice is from the normal user or from two types of attackers. For a new AR user, there are two stages to use the system. In the training stage, the new user is asked to say a few words using our system. These training instances are used to quickly build a classifier. After the training stage, the system is ready to be used. In the testing stage, our system will check whether the command is from the normal user who is using the AR headset using the trained classifier.").
Shang teaches determining a function describing a user’s voice using tissue-based vibrations and training a model using the function to distinguish between the voice of the user and voice inputs received from other sources in order to determine the legitimacy of the voice received from the user (Section VII, lines 10-18, "Our system leverages a contact microphone to record the internal body propagation of the voice. A user legitimacy is determined by measuring the correlation and similarity between the internal body voice and air voice. To our best knowledge, our system is the first to protect the voice input for AR headsets. Experimental results show that our system can accept normal users with average accuracy of 97% and defend against obstruction attack and replay attack with average accuracy of 99.2% and 98%, respectively.").
Zheng, Kechichian, and Shang are considered to be analogous to the claimed invention because they are in the same field of speech detection audio systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zheng in view of Kechichian to incorporate the teachings of Shang to determine a function describing a user’s voice using tissue-based vibrations and training a model using the function to distinguish between the voice of the user and voice inputs received from other sources.  Doing so would allow for determining the legitimacy of the voice from the user.
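As a non-limiting illustration of the two-stage approach quoted from Shang, the following sketch compares features of the air-conducted and body-conducted voice and applies a decision rule learned from training instances. Cosine similarity and a midpoint threshold are assumptions standing in for the reference's "correlation and similarity" measures and its trained classifier, which the quoted passages do not specify. All names and values are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Assumed similarity measure between air-voice and body-voice feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def train_threshold(legit_scores, attack_scores):
    """Training stage (assumed rule): midpoint between the lowest legitimate
    score and the highest attack score observed in the training instances."""
    return (min(legit_scores) + max(attack_scores)) / 2.0


def is_legitimate(air_features, body_features, threshold):
    """Testing stage: accept the command if the two channels are similar enough."""
    return cosine_similarity(air_features, body_features) >= threshold


# Hypothetical training scores and test feature vectors.
threshold = train_threshold([0.90, 0.95], [0.20, 0.40])
accepted = is_legitimate([1.0, 2.0, 3.0], [1.0, 2.0, 3.1], threshold)
rejected = is_legitimate([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], threshold)
```

In this sketch, a command whose air-conducted features do not match the body-conducted features, as would be expected when the voice does not originate from the wearer, falls below the threshold and is rejected.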
Regarding claim 14, Zheng in view of Kechichian discloses the method as claimed in claim 9, but does not specifically disclose further comprising: determining, based on the tissue based vibrations, one or more functions describing the voice of the user detected by the microphone array within the local area, wherein the functions are selected from a group comprising: a temporal response of the voice of the user detected by the microphone array within the local area, a spectral response of the voice of the user detected by the microphone array within the local area, and a spatial response of the voice of the user detected by the microphone array within the local area; and training the model using the determined one or more functions to distinguish between the voice of the user detected by the microphone array and other sounds from the local area detected by the microphone array.
Shang teaches:
determining, based on the tissue based vibrations, one or more functions describing the voice of the user detected by the microphone array within the local area, wherein the functions are selected from a group comprising: a temporal response of the voice of the user detected by the microphone array within the local area, a spectral response of the voice of the user detected by the microphone array within the local area, and a spatial response of the voice of the user detected by the microphone array within the local area (Section 3A, lines 20-27, "After collecting the user’s voices at two channels, we first segment the voice for each word to remove the internal between neighboring words. For the voice signals of each pair of words, we transform the signals from the time domain to the time-frequency domain. Since both raw voice signals contain background noise, we further leverage spectrogram enhancement techniques to remove the noise and extract the information of the voices."; The spectrogram enhancement reads on a spectral response function.);
and training the model using the determined one or more functions to distinguish between the voice of the user detected by the microphone array and other sounds from the local area detected by the microphone array (Section 3A, lines 1-16, "The key idea underlying our system is to fully leverage two propagation paths of the human voices. When the AR user says a voice command, the normal microphone will capture the user’s voice that propagates through the air, and the contact microphone on user’s head can record the voice that only propagates through the user’s body. By comparing the information in two voices, our system can determine whether the voice is from the normal user or from two types of attackers. For a new AR user, there are two stages to use the system. In the training stage, the new user is asked to say a few words using our system. These training instances are used to quickly build a classifier. After the training stage, the system is ready to be used. In the testing stage, our system will check whether the command is from the normal user who is using the AR headset using the trained classifier.").
Shang teaches determining a function describing a user’s voice using tissue-based vibrations and training a model using the function to distinguish between the voice of the user and voice inputs received from other sources in order to determine the legitimacy of the voice received from the user (Section VII, lines 10-18, "Our system leverages a contact microphone to record the internal body propagation of the voice. A user legitimacy is determined by measuring the correlation and similarity between the internal body voice and air voice. To our best knowledge, our system is the first to protect the voice input for AR headsets. Experimental results show that our system can accept normal users with average accuracy of 97% and defend against obstruction attack and replay attack with average accuracy of 99.2% and 98%, respectively.").
Zheng, Kechichian, and Shang are considered to be analogous to the claimed invention because they are in the same field of speech detection audio systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zheng in view of Kechichian to incorporate the teachings of Shang to determine a function describing a user’s voice using tissue-based vibrations and training a model using the function to distinguish between the voice of the user and voice inputs received from other sources.  Doing so would allow for determining the legitimacy of the voice from the user.
Regarding claim 20, Zheng in view of Kechichian discloses the computer readable medium as claimed in claim 17, but does not specifically disclose wherein, the program code instructions, when executed by the processor, further cause the processor to perform steps comprising: determining, based on the tissue based vibrations, one or more functions describing the voice of the user detected by the microphone array within the local area, wherein the functions are selected from a group comprising: a temporal response of the voice of the user detected by the microphone array within the local area, a spectral response of the voice of the user detected by the microphone array within the local area, and a spatial response of the voice of the user detected by the microphone array within the local area; and training the model using the determined one or more functions to distinguish between the voice of the user detected by the microphone array and other sounds from the local area detected by the microphone array.
Shang teaches:
determining, based on the tissue based vibrations, one or more functions describing the voice of the user detected by the microphone array within the local area, wherein the functions are selected from a group comprising: a temporal response of the voice of the user detected by the microphone array within the local area, a spectral response of the voice of the user detected by the microphone array within the local area, and a spatial response of the voice of the user detected by the microphone array within the local area (Section 3A, lines 20-27, "After collecting the user’s voices at two channels, we first segment the voice for each word to remove the internal between neighboring words. For the voice signals of each pair of words, we transform the signals from the time domain to the time-frequency domain. Since both raw voice signals contain background noise, we further leverage spectrogram enhancement techniques to remove the noise and extract the information of the voices."; The spectrogram enhancement reads on a spectral response function.);
and training the model using the determined one or more functions to distinguish between the voice of the user detected by the microphone array and other sounds from the local area detected by the microphone array (Section 3A, lines 1-16, "The key idea underlying our system is to fully leverage two propagation paths of the human voices. When the AR user says a voice command, the normal microphone will capture the user’s voice that propagates through the air, and the contact microphone on user’s head can record the voice that only propagates through the user’s body. By comparing the information in two voices, our system can determine whether the voice is from the normal user or from two types of attackers. For a new AR user, there are two stages to use the system. In the training stage, the new user is asked to say a few words using our system. These training instances are used to quickly build a classifier. After the training stage, the system is ready to be used. In the testing stage, our system will check whether the command is from the normal user who is using the AR headset using the trained classifier.").
Shang teaches determining a function describing a user’s voice using tissue-based vibrations and training a model using the function to distinguish between the voice of the user and voice inputs received from other sources in order to determine the legitimacy of the voice received from the user (Section VII, lines 10-18, "Our system leverages a contact microphone to record the internal body propagation of the voice. A user legitimacy is determined by measuring the correlation and similarity between the internal body voice and air voice. To our best knowledge, our system is the first to protect the voice input for AR headsets. Experimental results show that our system can accept normal users with average accuracy of 97% and defend against obstruction attack and replay attack with average accuracy of 99.2% and 98%, respectively.").
Zheng, Kechichian, and Shang are considered to be analogous to the claimed invention because they are in the same field of speech detection audio systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zheng in view of Kechichian to incorporate the teachings of Shang to determine a function describing a user’s voice using tissue-based vibrations and training a model using the function to distinguish between the voice of the user and voice inputs received from other sources.  Doing so would allow for determining the legitimacy of the voice from the user.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to James Boggs whose telephone number is (571)272-2968. The examiner can normally be reached M-F 8:00 AM - 5:00 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel Washburn, can be reached at (571)272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/JAMES BOGGS/Examiner, Art Unit 2657

/DANIEL C WASHBURN/Supervisory Patent Examiner, Art Unit 2657