DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Applicant's arguments filed October 18, 2022 have been fully considered but they are not persuasive.
In response to Applicant's arguments, on page 16 of Applicant’s response, that “King, Corey, Bisio, Visser, Variani, Hershey, Chenier, Furuta, Zhan, and Watanabe, which the Examiner alleged as teaching other aspects of the present invention, appear to be silent about at least the above-cited features, and thus do not cure deficiencies of Parthasarathi, Le Roux, Nakadai and Rodriguez to produce the claimed invention.”, Visser et al. (US Patent Application Publication No. 2013/0282373) teaches “the supervised labeling value being set by separately comparing a spectrum amplitude of each frame of the enrollment speech sample with a difference between a largest spectrum amplitude of the enrollment speech sample and a spectrum threshold” in Paragraph 0207, lines 1-12, "The bin-wise VAD 2087 may indicate peaks that are associated with speech based on distances between adjacent peaks and temporal continuity. For example, the bin-wise VAD 2087 may indicate small peaks (e.g., peaks that are more than a threshold amount (e.g., 30 dB) below the maximum peak). The bin-wise voice indicator 2089 may indicate these small peaks to the peak removal block/module 2090, which may remove the small peaks from the transformed audio signal 2071. For example, if peaks are determined to be significantly lower (e.g., 30 dB) than a maximum peak, they may not be related to the speech envelope and are thus eliminated.", where the bin-wise Voice Activity Detector (VAD) that eliminates peaks significantly lower than a maximum peak reads on the supervised labeling value, which is set to 1 or 0 for each spectrum frequency depending on a comparison of the frequency amplitude to the maximum frequency amplitude for the spectrum.
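For illustration only, the threshold comparison described in the quoted passage of Visser can be sketched as follows. This is a minimal sketch and not Visser's implementation: the function name `label_peaks`, the representation of peaks as a list of decibel levels, and the treatment of a peak lying exactly at the cutoff are assumptions made here; only the 30 dB example threshold comes from the quoted paragraph.

```python
def label_peaks(peak_levels_db, threshold_db=30.0):
    """Return 1 for each peak kept and 0 for each peak treated as 'small'.

    A peak is treated as small when it lies more than threshold_db below
    the maximum peak. A peak exactly at the cutoff is kept (an assumption).
    """
    if not peak_levels_db:
        return []
    cutoff = max(peak_levels_db) - threshold_db
    return [1 if level >= cutoff else 0 for level in peak_levels_db]


# Hypothetical peak levels in dB for one transformed audio frame.
peaks_db = [-5.0, -12.0, -40.0, -36.0, -8.0]
labels = label_peaks(peaks_db)

# Peaks labeled 0 are the ones a peak-removal stage would eliminate.
kept = [level for level, keep in zip(peaks_db, labels) if keep]
```

Under these assumptions, each bin receives a binary value determined solely by its level relative to the maximum level minus a threshold, which is the comparison relied upon in the response above.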
Applicant’s remaining arguments have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1 – 20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Regarding claim 1, the limitation “the estimated speech extractor is obtained according to a vector of each frame of an enrollment speech training sample and a vector of supervised labeling value of the enrollment speech training sample in each vector dimension of the K-dimensional vector space” is indefinite because it is not clear how “a vector of supervised labeling value of the enrollment speech training sample” is used to obtain the estimated speech extractor in each vector dimension of the K-dimensional vector space.  In paragraph 0047, lines 7-16, the Specification recites “The spectrum amplitude comparison value is equal to a difference between the largest spectrum amplitude of the enrollment speech and a preset spectrum threshold. Specifically, a supervised labeling value $Y_{f,t}^{ws}$ of the enrollment speech may be set, and the spectrum of each frame of the enrollment speech is separately compared with a difference between the largest spectrum amplitude and a spectrum threshold Γ. If a spectrum amplitude of a frame of the enrollment speech (that is, a time-frequency window) is greater than a spectrum amplitude comparison value (that is, a difference between the largest spectrum amplitude of the enrollment speech and Γ), the supervised labeling value $Y_{f,t}^{ws}$ of the enrollment speech corresponding to the time-frequency window is 1, and otherwise, the value of $Y_{f,t}^{ws}$ is 0.”, disclosing that the supervised labeling value is determined from the spectrum amplitude for each frame of the enrollment speech, not the K-dimensional vector representation of the spectrum.  In paragraph 0038, lines 1-7, the Specification recites “Separately map, in a case that the enrollment speech and the mixed speech are detected in the input speech, a spectrum of the enrollment speech and a spectrum of the mixed speech into a K-dimensional vector space to obtain a vector of each frame of the enrollment speech in each vector dimension and a vector of each frame of the mixed speech in each vector dimension. In other words, the spectrum of a frame of the enrollment speech may be represented by a K-dimensional vector, and the spectrum of a frame of the non-enrollment speech may be represented by a K-dimensional vector.”, disclosing that the spectrum for each frame of the enrollment speech is mapped to a K-dimensional vector.  In paragraph 0049, lines 7-16, the Specification recites “Use the average vector of the enrollment speech in each vector dimension as a speech extractor of the target speaker in each vector dimension and separately measure a distance between the vector of each frame of the mixed speech in each vector dimension and the speech extractor of the corresponding vector dimension to estimate a mask of each frame in the mixed speech.”, disclosing that the speech extractor is obtained from the average vector of the enrollment speech in each vector dimension, but does not disclose the use of the supervised labeling value in this step of determining the speech extractor.
For examination purposes, “the estimated speech extractor is obtained according to a vector of each frame of an enrollment speech training sample and a vector of supervised labeling value of the enrollment speech training sample in each vector dimension of the K-dimensional vector space” will be interpreted to mean that the supervised labeling value is applied to the spectrum of each frame of the enrollment speech as a separate step from mapping the spectrum for each frame of the enrollment speech to a K-dimensional vector for use in obtaining the speech extractor, where the supervised labeling value is the same for all vector dimensions in the K-dimensional vector space for a frame.
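For illustration only, the interpretation adopted above for examination purposes can be sketched as follows. This is a sketch under stated assumptions, not Applicant's disclosed implementation: the function names `supervised_labels` and `speech_extractor`, the nested-list data layout, and the choice to average only the vectors whose label is 1 are assumptions made here. The labeling rule itself (1 when the amplitude exceeds the largest amplitude minus Γ, otherwise 0) follows the quoted paragraph 0047.

```python
def supervised_labels(spectrum, gamma):
    """Label each time-frequency window of the enrollment spectrum.

    spectrum[t][f] is the spectrum amplitude at frame t and frequency f.
    The label is 1 when the amplitude is greater than the comparison value
    (largest amplitude minus gamma), and 0 otherwise.
    """
    comparison = max(max(frame) for frame in spectrum) - gamma
    return [[1 if amp > comparison else 0 for amp in frame] for frame in spectrum]


def speech_extractor(embeddings, labels):
    """Average the K-dimensional vectors of the windows labeled 1.

    embeddings[t][f] is the K-dimensional vector mapped from window (t, f).
    One scalar label is applied across all K dimensions of a window
    (the assumption stated in the interpretation above).
    """
    k = len(embeddings[0][0])
    totals = [0.0] * k
    count = 0
    for frame_vectors, frame_labels in zip(embeddings, labels):
        for vector, label in zip(frame_vectors, frame_labels):
            if label:
                count += 1
                for d in range(k):
                    totals[d] += vector[d]
    if count == 0:
        raise ValueError("no window exceeded the comparison value")
    return [total / count for total in totals]


# Hypothetical 2-frame, 2-frequency enrollment spectrum and K = 2 embeddings.
spectrum = [[10.0, 2.0], [9.0, 1.0]]
embeddings = [[[1.0, 0.0], [5.0, 5.0]], [[3.0, 2.0], [7.0, 7.0]]]
labels = supervised_labels(spectrum, gamma=3.0)
extractor = speech_extractor(embeddings, labels)
```

In this sketch the labeling step operates only on spectrum amplitudes and the averaging step operates only on the mapped vectors, which reflects the two steps being treated as separate.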
Claims 2 – 10 depend from claim 1, and thus recite the limitations of claim 1, and do not resolve the indefinite language from claim 1.
Regarding claim 11, the limitation “the estimated speech extractor is obtained according to a vector of each frame of an enrollment speech training sample and a vector of supervised labeling value of the enrollment speech training sample in each vector dimension of the K-dimensional vector space” is indefinite because it is not clear how “a vector of supervised labeling value of the enrollment speech training sample” is used to obtain the estimated speech extractor in each vector dimension of the K-dimensional vector space.  In paragraph 0047, lines 7-16, the Specification recites “The spectrum amplitude comparison value is equal to a difference between the largest spectrum amplitude of the enrollment speech and a preset spectrum threshold. Specifically, a supervised labeling value $Y_{f,t}^{ws}$ of the enrollment speech may be set, and the spectrum of each frame of the enrollment speech is separately compared with a difference between the largest spectrum amplitude and a spectrum threshold Γ. If a spectrum amplitude of a frame of the enrollment speech (that is, a time-frequency window) is greater than a spectrum amplitude comparison value (that is, a difference between the largest spectrum amplitude of the enrollment speech and Γ), the supervised labeling value $Y_{f,t}^{ws}$ of the enrollment speech corresponding to the time-frequency window is 1, and otherwise, the value of $Y_{f,t}^{ws}$ is 0.”, disclosing that the supervised labeling value is determined from the spectrum amplitude for each frame of the enrollment speech, not the K-dimensional vector representation of the spectrum.  In paragraph 0038, lines 1-7, the Specification recites “Separately map, in a case that the enrollment speech and the mixed speech are detected in the input speech, a spectrum of the enrollment speech and a spectrum of the mixed speech into a K-dimensional vector space to obtain a vector of each frame of the enrollment speech in each vector dimension and a vector of each frame of the mixed speech in each vector dimension. In other words, the spectrum of a frame of the enrollment speech may be represented by a K-dimensional vector, and the spectrum of a frame of the non-enrollment speech may be represented by a K-dimensional vector.”, disclosing that the spectrum for each frame of the enrollment speech is mapped to a K-dimensional vector.  In paragraph 0049, lines 7-16, the Specification recites “Use the average vector of the enrollment speech in each vector dimension as a speech extractor of the target speaker in each vector dimension and separately measure a distance between the vector of each frame of the mixed speech in each vector dimension and the speech extractor of the corresponding vector dimension to estimate a mask of each frame in the mixed speech.”, disclosing that the speech extractor is obtained from the average vector of the enrollment speech in each vector dimension, but does not disclose the use of the supervised labeling value in this step of determining the speech extractor.
For examination purposes, “the estimated speech extractor is obtained according to a vector of each frame of an enrollment speech training sample and a vector of supervised labeling value of the enrollment speech training sample in each vector dimension of the K-dimensional vector space” will be interpreted to mean that the supervised labeling value is applied to the spectrum of each frame of the enrollment speech as a separate step from mapping the spectrum for each frame of the enrollment speech to a K-dimensional vector for use in obtaining the speech extractor, where the supervised labeling value is the same for all vector dimensions in the K-dimensional vector space for a frame.
Claims 12 – 19 depend from claim 11, and thus recite the limitations of claim 11, and do not resolve the indefinite language from claim 11.
Regarding claim 20, the limitation “the estimated speech extractor is obtained according to a vector of each frame of an enrollment speech training sample and a vector of supervised labeling value of the enrollment speech training sample in each vector dimension of the K-dimensional vector space” is indefinite because it is not clear how “a vector of supervised labeling value of the enrollment speech training sample” is used to obtain the estimated speech extractor in each vector dimension of the K-dimensional vector space.  In paragraph 0047, lines 7-16, the Specification recites “The spectrum amplitude comparison value is equal to a difference between the largest spectrum amplitude of the enrollment speech and a preset spectrum threshold. Specifically, a supervised labeling value $Y_{f,t}^{ws}$ of the enrollment speech may be set, and the spectrum of each frame of the enrollment speech is separately compared with a difference between the largest spectrum amplitude and a spectrum threshold Γ. If a spectrum amplitude of a frame of the enrollment speech (that is, a time-frequency window) is greater than a spectrum amplitude comparison value (that is, a difference between the largest spectrum amplitude of the enrollment speech and Γ), the supervised labeling value $Y_{f,t}^{ws}$ of the enrollment speech corresponding to the time-frequency window is 1, and otherwise, the value of $Y_{f,t}^{ws}$ is 0.”, disclosing that the supervised labeling value is determined from the spectrum amplitude for each frame of the enrollment speech, not the K-dimensional vector representation of the spectrum.  In paragraph 0038, lines 1-7, the Specification recites “Separately map, in a case that the enrollment speech and the mixed speech are detected in the input speech, a spectrum of the enrollment speech and a spectrum of the mixed speech into a K-dimensional vector space to obtain a vector of each frame of the enrollment speech in each vector dimension and a vector of each frame of the mixed speech in each vector dimension. In other words, the spectrum of a frame of the enrollment speech may be represented by a K-dimensional vector, and the spectrum of a frame of the non-enrollment speech may be represented by a K-dimensional vector.”, disclosing that the spectrum for each frame of the enrollment speech is mapped to a K-dimensional vector.  In paragraph 0049, lines 7-16, the Specification recites “Use the average vector of the enrollment speech in each vector dimension as a speech extractor of the target speaker in each vector dimension and separately measure a distance between the vector of each frame of the mixed speech in each vector dimension and the speech extractor of the corresponding vector dimension to estimate a mask of each frame in the mixed speech.”, disclosing that the speech extractor is obtained from the average vector of the enrollment speech in each vector dimension, but does not disclose the use of the supervised labeling value in this step of determining the speech extractor.
For examination purposes, “the estimated speech extractor is obtained according to a vector of each frame of an enrollment speech training sample and a vector of supervised labeling value of the enrollment speech training sample in each vector dimension of the K-dimensional vector space” will be interpreted to mean that the supervised labeling value is applied to the spectrum of each frame of the enrollment speech as a separate step from mapping the spectrum for each frame of the enrollment speech to a K-dimensional vector for use in obtaining the speech extractor, where the supervised labeling value is the same for all vector dimensions in the K-dimensional vector space for a frame.
The following is a quotation of 35 U.S.C. 112(d):
(d) REFERENCE IN DEPENDENT FORMS.—Subject to subsection (e), a claim in dependent form shall contain a reference to a claim previously set forth and then specify a further limitation of the subject matter claimed. A claim in dependent form shall be construed to incorporate by reference all the limitations of the claim to which it refers.

The following is a quotation of pre-AIA 35 U.S.C. 112, fourth paragraph:
Subject to the following paragraph [i.e., the fifth paragraph of pre-AIA 35 U.S.C. 112], a claim in dependent form shall contain a reference to a claim previously set forth and then specify a further limitation of the subject matter claimed. A claim in dependent form shall be construed to incorporate by reference all the limitations of the claim to which it refers.

Claims 3 and 13 are rejected under 35 U.S.C. 112(d) or pre-AIA 35 U.S.C. 112, 4th paragraph, as being of improper dependent form for failing to further limit the subject matter of the claim upon which it depends, or for failing to include all the limitations of the claim upon which it depends.
Claim 3 depends from claim 1, and fails to further limit the subject matter of claim 1 because the limitation “calculating the average vector of the enrollment speech in each vector dimension based on the vector of an effective frame of the enrollment speech in each vector dimension, the effective frame of the enrollment speech being a frame in the enrollment speech with a spectrum amplitude greater than a spectrum amplitude comparison value, and the spectrum amplitude comparison value being equal to a difference between the largest spectrum amplitude of the enrollment speech and a preset spectrum threshold” is covered by the claim 1 limitations “calculating an average vector of the enrollment speech in each vector dimension based on the vector of each frame of the enrollment speech in each vector dimension” and “the estimated speech extractor is obtained according to a vector of each frame of an enrollment speech training sample and a vector of supervised labeling value of the enrollment speech training sample in each vector dimension of the K-dimensional vector space, the supervised labeling value being set by separately comparing a spectrum amplitude of each frame of the enrollment speech sample with a difference between a largest spectrum amplitude of the enrollment speech sample and a spectrum threshold”.
Claim 13 depends from claim 11, and fails to further limit the subject matter of claim 11 because the limitation “calculating the average vector of the enrollment speech in each vector dimension based on the vector of an effective frame of the enrollment speech in each vector dimension, the effective frame of the enrollment speech being a frame in the enrollment speech with a spectrum amplitude greater than a spectrum amplitude comparison value, and the spectrum amplitude comparison value being equal to a difference between the largest spectrum amplitude of the enrollment speech and a preset spectrum threshold” is covered by the claim 11 limitations “calculating an average vector of the enrollment speech in each vector dimension based on the vector of each frame of the enrollment speech in each vector dimension” and “the estimated speech extractor is obtained according to a vector of each frame of an enrollment speech training sample and a vector of supervised labeling value of the enrollment speech training sample in each vector dimension of the K-dimensional vector space, the supervised labeling value being set by separately comparing a spectrum amplitude of each frame of the enrollment speech sample with a difference between a largest spectrum amplitude of the enrollment speech sample and a spectrum threshold”.
Applicant may cancel the claims, amend the claims to place the claims in proper dependent form, rewrite the claims in independent form, or present a sufficient showing that the dependent claims comply with the statutory requirements.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 3 – 4, 6, 11, 13 – 14, 16 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Parthasarathi in view of Le Roux (US Patent No. 10,529,349), hereinafter Le Roux, Rodriguez et al. (US Patent Application Publication No. 2015/0112682), hereinafter Rodriguez, Visser et al. (US Patent Application Publication No. 2013/0282373), hereinafter Visser, and Nakadai et al. (US Patent No. 8,392,185), hereinafter Nakadai.
Regarding claim 1, as best understood based on the 35 U.S.C. 112(b) issues identified above, Parthasarathi teaches a mixed speech recognition method, applied to a computer device (Figure 1, “Server(s) 120”), the method comprising:
monitoring speech input and detecting an enrollment speech and a mixed speech from the speech input, the enrollment speech comprising preset speech information, and the mixed speech being non-enrollment speech inputted after the enrollment speech (Abstract, lines 1-6, "A system configured to process speech commands may classify incoming audio as desired speech, undesired speech, or non-speech. Desired speech is speech that is from a same speaker as reference speech. The reference speech may be obtained from a configuration session or from a first portion of input speech that includes a wakeword.");
separately mapping a spectrum of the enrollment speech and a spectrum of the mixed speech into a K-dimensional vector space by using a deep neural network to obtain a vector of each frame of the enrollment speech in each vector dimension and a vector of each frame of the mixed speech in each vector dimension, K being not less than 1 (Column 8, lines 22-28, "The AFE may reduce noise in the audio data and divide the digitized audio data into frames representing a time intervals for which the AFE determines a number of values, called features, representing the qualities of the audio data, along with a set of those values, called an audio feature vector, representing the features/qualities of the audio data within the frame."; Column 16, lines 21-29, "For ASR processing the base input is typically audio data in the form of audio feature vectors corresponding to audio frames. As noted above, typically acoustic features (such as log-filter bank energies (LFBE) features, MFCC features, or other features) are determined and used to create audio feature vectors for each audio frame. It is possible to feed audio data into an RNN, using the amplitude and (phase) spectrum of a fast-Fourier transform (FFT), or other technique that projects an audio signal into a sequence of data.");
calculating an average vector of the enrollment speech in each vector dimension based on the vector of each frame of the enrollment speech in each vector dimension (Column 22, lines 45-47, "For the mean estimator, the system may compute the average feature values over the reference audio data.").
Parthasarathi does not specifically disclose: the deep neural network being pre-trained by minimizing an objective function that describes a spectral error between a speech of the target speaker recovered by an estimated mask and a reference speech of the target speaker, wherein the estimated mask of the target speaker is obtained by measuring a distance between a vector of each frame of a mixed speech training sample in each vector dimension and an estimated speech extractor in each vector dimension of the K-dimensional vector space.
Le Roux teaches:
the deep neural network being pre-trained by minimizing an objective function that describes a spectral error between a speech of the target speaker recovered by an estimated mask and a reference speech of the target speaker (Column 3, lines 15-17, "Some embodiments of the present disclosure include training a deep neural network (DNN)-based enhancement system through a phase reconstruction stage."; Column 3, lines 44-48, "Accordingly, embodiments of the present disclosure train the network or DNN-based enhancement system to minimize an objective function including losses defined on the outcome of one or multiple steps of such iterative procedures."; Column 4, lines 23-26, "According to an embodiment of the present disclosure, an audio signal processing system for transforming an input audio signal, wherein the input audio signal includes a mixture of one or more target audio signals."; Column 16, line 66 - Column 17, line 2, "The Error Computation module 1030 can use the outputs of the spectrogram estimation from mask module 1013 and the reference source signals 1034 to compute a spectrogram estimation loss"; The spectrogram estimation loss reads on the spectral error, the target audio signals read on the speech of the target speaker, and the reference source signal reads on the reference speech.),
wherein the estimated mask of the target speaker is obtained by measuring a distance between a vector of each frame of a mixed speech training sample in each vector dimension and an estimated speech extractor in each vector dimension of the K-dimensional vector space (Column 13, lines 21-31, "FIG. 9A is a block diagram illustrating a single-channel mask inference network architecture 900A, according to embodiments of the present disclosure. A sequence of feature vectors obtained from the input mixture, for example the log magnitude of the short-time Fourier transform of the input mixture, is used as input to a mixture encoder 910. For example, the dimension of the input vector in the sequence can be F. The mixture encoder 910 is composed of multiple bidirectional long short-term memory (BLSTM) neural network layers, from the first BLSTM layer 930 to the last BLSTM layer 935."; Column 13, lines 40-50, "For each time frame and each frequency in a time-frequency domain, for example the short-time Fourier transform domain, the linear layer 940 uses output of the last BLSTM layer 935 to output C numbers, where C is the number of target speakers. The non-linearity 945 is applied to this set of C numbers for each time frame and each frequency, leading to mask values which indicate, for each time frame, each frequency, and each target speaker, the dominance of that target speaker in the input mixture at that time frame and that frequency."; Column 27, lines 1-7, "An aspect can include the error on the target audio signal estimates includes a distance between the target audio signal estimates and the reference target audio signals. Further, an aspect can include the error on the target audio signal estimates includes a distance between the spectrograms of target audio signal estimates and the spectrograms of the reference target audio signals."; The mask values which indicate the dominance of a target speaker read on the estimated mask of the target speaker, the input mixture reads on the mixed speech training sample, and the reference target audio signals reads on the estimated speech extractor.).
Le Roux teaches training a deep neural network by minimizing an objective function including spectral loss between target audio and a reference signal and generating a mask for a target speaker by comparing the feature vectors of an audio input containing multiple speakers and a reference target audio in order to separate the speech of multiple speakers in an audio signal (Column 7, lines 59-62, "The present disclosure relates to audio signals, and more particularly to using an end-to-end approach for single-channel speaker-independent multi-speaker speech separation.").
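For illustration only, the two claimed elements discussed above (a mask estimated by comparing each mixed-speech vector with a speech extractor, and an objective function describing a spectral error between the mask-recovered speech and a reference speech) can be sketched as follows. This is a sketch of the claim language under stated assumptions and is not the network of Le Roux: the use of an inner product passed through a sigmoid as the "distance", the squared-error form of the objective, and all function names are assumptions made here and are not recited in the claims or the cited passages.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def estimate_mask(mixed_vectors, extractor):
    """Estimate one mask value per time-frequency window.

    mixed_vectors is a flat list of K-dimensional vectors, one per window.
    The comparison with the extractor is an inner product squashed to (0, 1)
    (an assumed choice of 'distance').
    """
    return [
        sigmoid(sum(a * v for a, v in zip(extractor, vector)))
        for vector in mixed_vectors
    ]


def spectral_error(mixture_spectrum, reference_spectrum, mask):
    """Squared error between mask-recovered and reference spectra (assumed form)."""
    return sum(
        (ref - mix * m) ** 2
        for mix, ref, m in zip(mixture_spectrum, reference_spectrum, mask)
    )


# Hypothetical values: two windows, K = 2.
mixed_vectors = [[1.0, 2.0], [3.0, 4.0]]
zero_extractor = [0.0, 0.0]
mask = estimate_mask(mixed_vectors, zero_extractor)  # every mask value is 0.5
loss = spectral_error([2.0, 4.0], [1.0, 2.0], mask)
```

During pre-training, a quantity of this kind would be minimized with respect to the parameters of the network that produces the vectors; that optimization step is omitted from the sketch.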
Parthasarathi and Le Roux are considered to be analogous to the claimed invention because they are in the same field of speech recognition systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Parthasarathi to incorporate the teachings of Le Roux to train a deep neural network by minimizing an objective function including spectral loss between target audio and a reference signal and generate a mask for a target speaker by comparing the feature vectors of an audio input containing multiple speakers and a reference target audio.  Doing so would allow for separating the speech of multiple speakers in an audio signal.
Parthasarathi in view of Le Roux does not specifically disclose: the estimated speech extractor is obtained according to a vector of each frame of an enrollment speech training sample in each vector dimension of the K-dimensional vector space.
Rodriguez teaches:
the estimated speech extractor is obtained according to a vector of each frame of an enrollment speech training sample in each vector dimension of the K-dimensional vector space (Paragraph 0071, lines 4-7, "In speaker verification, two voice prints are compared, one of the speaker known to the system in advance (e.g. from previous enrollment) and another extracted from the received audio data."; Paragraph 0094, lines 1-8, "If the parameters describing average feature vectors of audio data are calculated in a system according to the invention, D dimensional Mel Frequency Cepstral Coefficients MFCCs are one possible option to describe feature vectors of audio data. Thus, average feature vectors of audio data may e.g. be calculated by calculating the mean µmfcc,d of D dimensional MFCCs in one, two, three, or more or each dimension"; Paragraph 0097 line 1 - Paragraph 0098 line 6, "MFCCs: mfcc j t d, may be extracted. Herein J is the number of the considered audio data files, t is the frame index with a value between 1 and Tj (tϵ[1;Tj]), wherein Tj is the total number of (speech) frames for audio data file j. Optionally, only those parts of the audio data file j comprising voice signals are taken into account for extracting the MFCCs, e.g. by using a Voice Activity Detector (VAD). d is a value between 1 and D (dϵ[1;D]) representing the considered dimension."; The feature vectors read on the enrollment speech vectors.).
Rodriguez teaches extracting speech from audio data by comparing the audio data to enrollment speech based on feature vectors of the speech in order to analyze the speech to determine if the speech matches the target speaker (Paragraph 0008, lines 2-5, "This voice utterance is analyzed using biometric voice data to verify that the speaker's voice corresponds to the identity of the speaker that is to be verified.").
Parthasarathi, Le Roux, and Rodriguez are considered to be analogous to the claimed invention because they are in the same field of speech recognition systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Parthasarathi in view of Le Roux to incorporate the teachings of Rodriguez to extract speech from audio data by comparing the audio data to enrollment speech based on feature vectors of the speech.  Doing so would allow for analyzing the speech to determine if the speech matches the target speaker.
Parthasarathi in view of Le Roux and Rodriguez does not specifically disclose: the estimated speech extractor is obtained according to a vector of supervised labeling value of the enrollment speech training sample, the supervised labeling value being set by separately comparing a spectrum amplitude of each frame of the enrollment speech sample with a difference between a largest spectrum amplitude of the enrollment speech sample and a spectrum threshold.
Visser teaches:
the estimated speech extractor is obtained according to a vector of supervised labeling value of the enrollment speech training sample, the supervised labeling value being set by separately comparing a spectrum amplitude of each frame of the enrollment speech sample with a difference between a largest spectrum amplitude of the enrollment speech sample and a spectrum threshold (Paragraph 0207, lines 1-12, "The bin-wise VAD 2087 may indicate peaks that are associated with speech based on distances between adjacent peaks and temporal continuity. For example, the bin-wise VAD 2087 may indicate small peaks (e.g., peaks that are more than a threshold amount (e.g., 30 dB) below the maximum peak). The bin-wise voice indicator 2089 may indicate these small peaks to the peak removal block/module 2090, which may remove the small peaks from the transformed audio signal 2071. For example, if peaks are determined to be significantly lower (e.g., 30 dB) than a maximum peak, they may not be related to the speech envelope and are thus eliminated."; The bin-wise Voice Activity Detector (VAD) that eliminates peaks significantly lower than a maximum peak reads on the supervised labeling value, and the threshold amount reads on the spectrum threshold.).
Visser teaches comparing the amplitude of a frequency bin of a spectrum to the peak frequency bin for that spectrum and eliminating the amplitude value for the frequency bin if it is more than a threshold amount below the peak frequency bin value in order to reduce speech detection errors in low signal-to-noise scenarios (Paragraph 0062, lines 1-6, "Use of a voice activity detector as described herein may be expected to reduce speech processing errors that are often experienced in traditional voice activity detection, particularly in low signal-to-noise-ratio (SNR) scenarios, in non-stationary noise and competing voices cases, and other cases where voice may be present.").
Parthasarathi, Le Roux, Rodriguez, and Visser are considered to be analogous to the claimed invention because they are in the same field of speech recognition systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Parthasarathi in view of Le Roux and Rodriguez to incorporate the teachings of Visser to compare the amplitude of a frequency bin of a spectrum to the peak frequency bin for that spectrum and eliminate the amplitude value for the frequency bin if it is more than a threshold amount below the peak frequency bin value.  Doing so would allow for reducing speech detection errors in low signal-to-noise scenarios.
Parthasarathi in view of Le Roux, Rodriguez, and Visser does not specifically disclose: determining a speech extractor of the target speaker in each vector dimension, and separately measuring a distance between the vector of each frame of the mixed speech in each vector dimension and the speech extractor of the corresponding vector dimension to obtain a mask of each frame in the mixed speech; and determining speech belonging to the target speaker in the mixed speech based on the mask of each frame of the mixed speech.
Nakadai teaches:
determining a speech extractor of the target speaker in each vector dimension, and separately measuring a distance between the vector of each frame of the mixed speech in each vector dimension and the speech extractor of the corresponding vector dimension to obtain a mask of each frame in the mixed speech (Column 2, lines 39-42, "In the method for generating a soft mask for a speech recognition system according to the invention, the soft mask is determined using a function of reliability of separation, which has at least one parameter."; Column 5, lines 11-12, "An acoustic feature set and a mask are calculated for each time frame."; Column 7, lines 20-27, "Feature vector of 48 spectral-related features are used. The MFM is a vector corresponding to 24 static spectral features and 24 dynamic spectral features. Each element of a vector represents the reliability of each feature. In conventional MFM generation, a binary MFM (i.e., 1 for reliable and 0 for unreliable) was used. The mask generating section 103 generates a soft MFM whose element of vector ranges from 0.0 to 1.0.");
and determining speech belonging to the target speaker in the mixed speech based on the mask of each frame of the mixed speech (Column 1, lines 53-62, "A speech recognition system according to the invention includes a sound source separating section which separates mixed speeches from multiple sound sources; a mask generating section which generates a soft mask which can take continuous values between 0 and 1 for each separated speech according to reliability of separation in separating operation of the sound source separating section; and a speech recognizing section which recognizes speeches separated by the sound source separating section using soft masks generated by the mask generating section.").
Nakadai teaches comparing the features of the mixed speech with the features of the reference speech from the target speaker to generate a mask for each frame of the mixed speech that corresponds to speech from the target speaker in order to improve word recognition rate when recognizing speech from multiple sources (Column 11, lines 34-37, "use of appropriately designed and adjusted soft masks has improved word recognition rate of the speech recognition system for simultaneous recognition of multiple sources").
Parthasarathi, Le Roux, Rodriguez, Visser, and Nakadai are considered to be analogous to the claimed invention because they are in the same field of speech recognition systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Parthasarathi in view of Le Roux, Rodriguez, and Visser to incorporate the teachings of Nakadai to compare the features of the mixed speech with the features of the reference speech from the target speaker to generate a mask for each frame of the mixed speech that corresponds to speech from the target speaker.  Doing so would allow for improving word recognition rate when recognizing speech from multiple sources.
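For illustration only, the limitation of measuring a distance between the vector of each frame of the mixed speech and the speech extractor to obtain a mask of each frame may be sketched as follows. The sketch is not taken from any cited reference or from the application. It assumes one K-dimensional vector per frame, and it assumes that the "distance" is an inner product passed through a sigmoid so that the mask falls between 0 and 1. The function name frame_masks and its parameters are hypothetical.

```python
import math


def frame_masks(mixed_vectors, extractor):
    """Return one soft mask value in (0, 1) per frame of the mixed speech.

    mixed_vectors: list of frames, each a list of K floats (assumed layout).
    extractor:     list of K floats, the per-dimension speech extractor.
    The sigmoid-of-inner-product measure is an assumption of this sketch.
    """
    masks = []
    for frame in mixed_vectors:
        # Compare the frame vector with the extractor dimension by dimension.
        score = sum(v * a for v, a in zip(frame, extractor))
        # Squash the score so that frames close to the extractor approach 1.
        masks.append(1.0 / (1.0 + math.exp(-score)))
    return masks
```

Under these assumptions, a frame whose vector aligns with the extractor receives a mask above 0.5, and a frame pointing away from it receives a mask below 0.5.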
Regarding claim 3, as best understood based on the 35 U.S.C. 112(b) issues identified above, Parthasarathi in view of Le Roux, Rodriguez, Visser, and Nakadai discloses the mixed speech recognition method as claimed in claim 1.
Parthasarathi further discloses:
wherein the calculating an average vector of the enrollment speech in each vector dimension based on the vector of each frame of the enrollment speech in each vector dimension comprises: calculating the average vector of the enrollment speech in each vector dimension based on the vector of an effective frame of the enrollment speech in each vector dimension (Column 22, lines 45-47, "For the mean estimator, the system may compute the average feature values over the reference audio data.").
Visser further teaches:
the effective frame of the enrollment speech being a frame in the enrollment speech with a spectrum amplitude greater than a spectrum amplitude comparison value, and the spectrum amplitude comparison value being equal to a difference between the largest spectrum amplitude of the enrollment speech and a preset spectrum threshold (Paragraph 0207, lines 1-12, "The bin-wise VAD 2087 may indicate peaks that are associated with speech based on distances between adjacent peaks and temporal continuity. For example, the bin-wise VAD 2087 may indicate small peaks (e.g., peaks that are more than a threshold amount (e.g., 30 dB) below the maximum peak). The bin-wise voice indicator 2089 may indicate these small peaks to the peak removal block/module 2090, which may remove the small peaks from the transformed audio signal 2071. For example, if peaks are determined to be significantly lower (e.g., 30 dB) than a maximum peak, they may not be related to the speech envelope and are thus eliminated."; The bin-wise Voice Activity Detector (VAD) that eliminates peaks significantly lower than a maximum peak reads on determining an effective frame, and the threshold amount reads on the preset spectrum threshold.).
Visser teaches comparing the amplitude of a frequency bin of a spectrum to the peak frequency bin for that spectrum and eliminating the amplitude value for the frequency bin if it is more than a threshold amount below the peak frequency bin value in order to reduce speech detection errors in low signal-to-noise scenarios (Paragraph 0062, lines 1-6, "Use of a voice activity detector as described herein may be expected to reduce speech processing errors that are often experienced in traditional voice activity detection, particularly in low signal-to-noise-ratio (SNR) scenarios, in non-stationary noise and competing voices cases, and other cases where voice may be present.").
Parthasarathi, Le Roux, Rodriguez, Visser, and Nakadai are considered to be analogous to the claimed invention because they are in the same field of speech recognition systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Parthasarathi in view of Le Roux, Rodriguez, Visser, and Nakadai to further incorporate the teachings of Visser to compare the amplitude of a frequency bin of a spectrum to the peak frequency bin for that spectrum and eliminate the amplitude value for the frequency bin if it is more than a threshold amount below the peak frequency bin value.  Doing so would allow for reducing speech detection errors in low signal-to-noise scenarios.
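For illustration only, the effective-frame determination recited in claim 3 may be sketched as follows: the comparison value is the largest spectrum amplitude minus a preset threshold, and a frame is effective when its amplitude exceeds that value. The sketch assumes one amplitude per frame expressed in dB and borrows the 30 dB figure from the example in Visser's Paragraph 0207 as a default. The function name effective_frames and its parameters are hypothetical.

```python
def effective_frames(frame_amplitudes_db, threshold_db=30.0):
    """Return the indices of frames treated as effective.

    frame_amplitudes_db: one spectrum amplitude per frame, in dB (assumed).
    threshold_db:        preset spectrum threshold; 30 dB is an assumed default.
    """
    # Comparison value = largest amplitude minus the preset threshold.
    comparison = max(frame_amplitudes_db) - threshold_db
    # A frame is effective only if its amplitude is strictly greater.
    return [i for i, amp in enumerate(frame_amplitudes_db) if amp > comparison]
```

A frame whose amplitude exactly equals the comparison value is not effective in this sketch, consistent with the "greater than" wording of the claim.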
Regarding claim 4, as best understood based on the 35 U.S.C. 112(b) issues identified above, Parthasarathi in view of Le Roux, Rodriguez, Visser, and Nakadai discloses the mixed speech recognition method as claimed in claim 3.
Rodriguez further teaches: wherein the calculating the average vector of the enrollment speech in each vector dimension based on the vector of an effective frame of the enrollment speech in each vector dimension comprises:
summing, after the vector of each frame of the enrollment speech in the corresponding vector dimension is multiplied by a supervised labeling value of the corresponding frame, vector dimensions to obtain a total vector of the effective frame of the enrollment speech in the corresponding vector dimension; and separately dividing the total vector of the effective frame of the enrollment speech in each vector dimension by the sum of the supervised labeling values of the frames of the enrollment speech to obtain the average vector of the enrollment speech in each vector dimension (Paragraph 0094, lines 1-8, "If the parameters describing average feature vectors of audio data are calculated in a system according to the invention, D dimensional Mel Frequency Cepstral Coefficients MFCCs are one possible option to describe feature vectors of audio data. Thus, average feature vectors of audio data may e.g. be calculated by calculating the mean µmfcc,d of D dimensional MFCCs in one, two, three, or more or each dimension"; Paragraph 0098, lines 1-4, "Optionally, only those parts of the audio data file j comprising voice signals are taken into account for extracting the MFCCs, e.g. by using a Voice Activity Detector (VAD)."; Using the Voice Activity Detector to determine part of the audio data to be taken into account for extracting features for calculating the average feature vectors reads on multiplying the vectors by a supervised labeling value before summing the vectors when calculating the average vector.).
Rodriguez teaches averaging the feature vectors for frames containing valid speech in order to analyze the speech to determine if the speech matches the target speaker (Paragraph 0008, lines 2-5, "This voice utterance is analyzed using biometric voice data to verify that the speaker's voice corresponds to the identity of the speaker that is to be verified.").
Parthasarathi, Le Roux, Rodriguez, Visser, and Nakadai are considered to be analogous to the claimed invention because they are in the same field of speech recognition systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Parthasarathi in view of Le Roux, Rodriguez, Visser, and Nakadai to further incorporate the teachings of Rodriguez to average the feature vectors for frames containing valid speech.  Doing so would allow for analyzing the speech to determine if the speech matches the target speaker.
Visser further teaches:
the supervised labeling value of a frame in the enrollment speech being 1 when a spectrum amplitude of the frame is greater than the spectrum amplitude comparison value; and being 0 when the spectrum amplitude of the frame is not greater than the spectrum amplitude comparison value (Paragraph 0207, lines 1-12, "The bin-wise VAD 2087 may indicate peaks that are associated with speech based on distances between adjacent peaks and temporal continuity. For example, the bin-wise VAD 2087 may indicate small peaks (e.g., peaks that are more than a threshold amount (e.g., 30 dB) below the maximum peak). The bin-wise voice indicator 2089 may indicate these small peaks to the peak removal block/module 2090, which may remove the small peaks from the transformed audio signal 2071. For example, if peaks are determined to be significantly lower (e.g., 30 dB) than a maximum peak, they may not be related to the speech envelope and are thus eliminated."; The bin-wise Voice Activity Detector (VAD) that eliminates peaks significantly lower than a maximum peak reads on the supervised labeling value, eliminating a peak reads on the supervised labeling value being 0, not eliminating a peak reads on the supervised labeling value being 1, and the threshold amount below the maximum peak reads on the spectrum amplitude comparison value.).
Visser teaches comparing the amplitude of a frequency bin of a spectrum to the peak frequency bin for that spectrum and eliminating the amplitude value for the frequency bin if it is more than a threshold amount below the peak frequency bin value in order to reduce speech detection errors in low signal-to-noise scenarios (Paragraph 0062, lines 1-6, "Use of a voice activity detector as described herein may be expected to reduce speech processing errors that are often experienced in traditional voice activity detection, particularly in low signal-to-noise-ratio (SNR) scenarios, in non-stationary noise and competing voices cases, and other cases where voice may be present.").
Parthasarathi, Le Roux, Rodriguez, Visser, and Nakadai are considered to be analogous to the claimed invention because they are in the same field of speech recognition systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Parthasarathi in view of Le Roux, Rodriguez, Visser, and Nakadai to further incorporate the teachings of Visser to compare the amplitude of a frequency bin of a spectrum to the peak frequency bin for that spectrum and eliminate the amplitude value for the frequency bin if it is more than a threshold amount below the peak frequency bin value.  Doing so would allow for reducing speech detection errors in low signal-to-noise scenarios.
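For illustration only, the averaging recited in claim 4 may be sketched as follows: each frame vector is multiplied by its supervised labeling value (1 or 0), the products are summed per vector dimension, and each sum is divided by the sum of the labeling values. The sketch assumes one K-dimensional vector and one label per frame. The function name labeled_average and its parameters are hypothetical and are not drawn from any cited reference.

```python
def labeled_average(frame_vectors, labels):
    """Average the frame vectors per dimension, counting only labeled frames.

    frame_vectors: list of frames, each a list of K floats (assumed layout).
    labels:        supervised labeling value per frame, each 1 or 0.
    """
    k = len(frame_vectors[0])
    total = [0.0] * k
    for vec, label in zip(frame_vectors, labels):
        for d in range(k):
            # Frames labeled 0 contribute nothing to the total.
            total[d] += label * vec[d]
    count = sum(labels)
    if count == 0:
        raise ValueError("no effective frames to average")
    # Divide each per-dimension total by the sum of the labeling values.
    return [t / count for t in total]
```

For example, with labels [1, 0, 1] the second frame is excluded and the result is the mean of the first and third frames in each dimension.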
Regarding claim 6, as best understood based on the 35 U.S.C. 112(b) issues identified above, Parthasarathi in view of Le Roux, Rodriguez, Visser, and Nakadai discloses the mixed speech recognition method as claimed in claim 1.
Parthasarathi further discloses:
wherein the average vector of the enrollment speech in each vector dimension is used as the speech extractor of the target speaker in each vector dimension (Column 22, lines 43-54, "The LAMS method may allow the system to keep the features in the desired range and for better distinguishing features between the desired and interfering speech. For the mean estimator, the system may compute the average feature values over the reference audio data. For the task of recognizing speech from the desired talker, this constraint is advantageous. The reference audio data may be used as an example of the desired talker's speech, and then by subtracting the LAMS, the system may shift the features corresponding to the desired speaker closer to being zero-mean. This allows the system to train a classifier, e.g., a DNN, to better classify a desired talker's speech.").
Regarding claim 11, as best understood based on the 35 U.S.C. 112(b) issues identified above, Parthasarathi teaches a mixed speech recognition apparatus (Figure 1, “Server(s) 120”), comprising:
a memory (Column 9, lines 32-34, “The device performing NLU processing 260 (e.g., server 120) may include various components, including potentially dedicated processor(s), memory, storage, etc.”);
and a processor coupled to the memory (Column 9, lines 32-34, “The device performing NLU processing 260 (e.g., server 120) may include various components, including potentially dedicated processor(s), memory, storage, etc.”) and configured to perform:
monitoring speech input and detecting an enrollment speech and a mixed speech from the speech input, the enrollment speech comprising preset speech information, and the mixed speech being non-enrollment speech inputted after the enrollment speech (Abstract, lines 1-6, "A system configured to process speech commands may classify incoming audio as desired speech, undesired speech, or non-speech. Desired speech is speech that is from a same speaker as reference speech. The reference speech may be obtained from a configuration session or from a first portion of input speech that includes a wakeword.");
separately mapping a spectrum of the enrollment speech and a spectrum of the mixed speech into a K-dimensional vector space by using a deep neural network to obtain a vector of each frame of the enrollment speech in each vector dimension and a vector of each frame of the mixed speech in each vector dimension, K being not less than 1 (Column 8, lines 22-28, "The AFE may reduce noise in the audio data and divide the digitized audio data into frames representing a time intervals for which the AFE determines a number of values, called features, representing the qualities of the audio data, along with a set of those values, called an audio feature vector, representing the features/qualities of the audio data within the frame."; Column 16, lines 21-29, "For ASR processing the base input is typically audio data in the form of audio feature vectors corresponding to audio frames. As noted above, typically acoustic features (such as log-filter bank energies (LFBE) features, MFCC features, or other features) are determined and used to create audio feature vectors for each audio frame. It is possible to feed audio data into an RNN, using the amplitude and (phase) spectrum of a fast-Fourier transform (FFT), or other technique that projects an audio signal into a sequence of data.");
calculating an average vector of the enrollment speech in each vector dimension based on the vector of each frame of the enrollment speech in each vector dimension (Column 22, lines 45-47, "For the mean estimator, the system may compute the average feature values over the reference audio data.").
Parthasarathi does not specifically disclose: the deep neural network being pre-trained by minimizing an objective function that describes a spectral error between a speech of the target speaker recovered by an estimated mask and a reference speech of the target speaker, wherein the estimated mask of the target speaker is obtained by measuring a distance between a vector of each frame of a mixed speech training sample in each vector dimension and an estimated speech extractor in each vector dimension of the K-dimensional vector space.
Le Roux teaches:
the deep neural network being pre-trained by minimizing an objective function that describes a spectral error between a speech of the target speaker recovered by an estimated mask and a reference speech of the target speaker (Column 3, lines 15-17, "Some embodiments of the present disclosure include training a deep neural network (DNN)-based enhancement system through a phase reconstruction stage."; Column 3, lines 44-48, "Accordingly, embodiments of the present disclosure train the network or DNN-based enhancement system to minimize an objective function including losses defined on the outcome of one or multiple steps of such iterative procedures."; Column 4, lines 23-26, "According to an embodiment of the present disclosure, an audio signal processing system for transforming an input audio signal, wherein the input audio signal includes a mixture of one or more target audio signals."; Column 16, line 66 - Column 17, line 2, "The Error Computation module 1030 can use the outputs of the spectrogram estimation from mask module 1013 and the reference source signals 1034 to compute a spectrogram estimation loss"; The spectrogram estimation loss reads on the spectral error, the target audio signals read on the speech of the target speaker, and the reference source signal reads on the reference speech.),
wherein the estimated mask of the target speaker is obtained by measuring a distance between a vector of each frame of a mixed speech training sample in each vector dimension and an estimated speech extractor in each vector dimension of the K-dimensional vector space (Column 13, lines 21-31, "FIG. 9A is a block diagram illustrating a single-channel mask inference network architecture 900A, according to embodiments of the present disclosure. A sequence of feature vectors obtained from the input mixture, for example the log magnitude of the short-time Fourier transform of the input mixture, is used as input to a mixture encoder 910. For example, the dimension of the input vector in the sequence can be F. The mixture encoder 910 is composed of multiple bidirectional long short-term memory (BLSTM) neural network layers, from the first BLSTM layer 930 to the last BLSTM layer 935."; Column 13, lines 40-50, "For each time frame and each frequency in a time-frequency domain, for example the short-time Fourier transform domain, the linear layer 940 uses output of the last BLSTM layer 935 to output C numbers, where C is the number of target speakers. The non-linearity 945 is applied to this set of C numbers for each time frame and each frequency, leading to mask values which indicate, for each time frame, each frequency, and each target speaker, the dominance of that target speaker in the input mixture at that time frame and that frequency."; Column 27, lines 1-7, "An aspect can include the error on the target audio signal estimates includes a distance between the target audio signal estimates and the reference target audio signals. Further, an aspect can include the error on the target audio signal estimates includes a distance between the spectrograms of target audio signal estimates and the spectrograms of the reference target audio signals."; The mask values which indicate the dominance of a target speaker read on the estimated mask of the target speaker, the input mixture reads on the mixed speech training sample, and the reference target audio signals read on the estimated speech extractor.).
Le Roux teaches training a deep neural network by minimizing an objective function including spectral loss between target audio and a reference signal and generating a mask for a target speaker by comparing the feature vectors of an audio input containing multiple speakers and a reference target audio in order to separate the speech of multiple speakers in an audio signal (Column 7, lines 59-62, "The present disclosure relates to audio signals, and more particularly to using an end-to-end approach for single-channel speaker-independent multi-speaker speech separation.").
Parthasarathi and Le Roux are considered to be analogous to the claimed invention because they are in the same field of speech recognition systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Parthasarathi to incorporate the teachings of Le Roux to train a deep neural network by minimizing an objective function including spectral loss between target audio and a reference signal and generate a mask for a target speaker by comparing the feature vectors of an audio input containing multiple speakers and a reference target audio.  Doing so would allow for separating the speech of multiple speakers in an audio signal.
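For illustration only, an objective function describing a spectral error between speech recovered by an estimated mask and a reference speech may be sketched as follows. The sketch assumes magnitude spectra stored as lists of frames, one mask value per frame, and a sum of squared differences as the error measure. These choices are assumptions of the sketch rather than the formulation of Le Roux or of the application, and the function name spectral_error is hypothetical.

```python
def spectral_error(mixed_spectrum, reference_spectrum, masks):
    """Sum of squared differences between masked mixture and reference.

    mixed_spectrum:     list of frames, each a list of magnitudes (assumed).
    reference_spectrum: same shape, reference speech of the target speaker.
    masks:              one estimated mask value per frame (assumed layout).
    """
    total = 0.0
    for mixed_frame, ref_frame, mask in zip(mixed_spectrum, reference_spectrum, masks):
        for mixed_bin, ref_bin in zip(mixed_frame, ref_frame):
            # Recovered speech is the mixture scaled by the estimated mask.
            recovered = mixed_bin * mask
            total += (ref_bin - recovered) ** 2
    return total
```

Training under such an objective would adjust the network so that the masked mixture approaches the reference spectrum, which drives this value toward zero.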
Parthasarathi in view of Le Roux does not specifically disclose: the estimated speech extractor is obtained according to a vector of each frame of an enrollment speech training sample in each vector dimension of the K-dimensional vector space.
Rodriguez teaches:
the estimated speech extractor is obtained according to a vector of each frame of an enrollment speech training sample in each vector dimension of the K-dimensional vector space (Paragraph 0071, lines 4-7, "In speaker verification, two voice prints are compared, one of the speaker known to the system in advance (e.g. from previous enrollment) and another extracted from the received audio data."; Paragraph 0094, lines 1-8, "If the parameters describing average feature vectors of audio data are calculated in a system according to the invention, D dimensional Mel Frequency Cepstral Coefficients MFCCs are one possible option to describe feature vectors of audio data. Thus, average feature vectors of audio data may e.g. be calculated by calculating the mean µmfcc,d of D dimensional MFCCs in one, two, three, or more or each dimension"; Paragraph 0097 line 1 - Paragraph 0098 line 6, "MFCCs: mfcc j t d, may be extracted. Herein J is the number of the considered audio data files, t is the frame index with a value between 1 and Tj (tϵ[1;Tj]), wherein Tj is the total number of (speech) frames for audio data file j. Optionally, only those parts of the audio data file j comprising voice signals are taken into account for extracting the MFCCs, e.g. by using a Voice Activity Detector (VAD). d is a value between 1 and D (dϵ[1;D]) representing the considered dimension."; The feature vectors read on the enrollment speech vectors.).
Rodriguez teaches extracting speech from audio data by comparing the audio data to enrollment speech based on feature vectors of the speech in order to analyze the speech to determine if the speech matches the target speaker (Paragraph 0008, lines 2-5, "This voice utterance is analyzed using biometric voice data to verify that the speaker's voice corresponds to the identity of the speaker that is to be verified.").
Parthasarathi, Le Roux, and Rodriguez are considered to be analogous to the claimed invention because they are in the same field of speech recognition systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Parthasarathi in view of Le Roux to incorporate the teachings of Rodriguez to extract speech from audio data by comparing the audio data to enrollment speech based on feature vectors of the speech.  Doing so would allow for analyzing the speech to determine if the speech matches the target speaker.
Parthasarathi in view of Le Roux and Rodriguez does not specifically disclose: the estimated speech extractor is obtained according to a vector of supervised labeling value of the enrollment speech training sample, the supervised labeling value being set by separately comparing a spectrum amplitude of each frame of the enrollment speech sample with a difference between a largest spectrum amplitude of the enrollment speech sample and a spectrum threshold.
Visser teaches:
the estimated speech extractor is obtained according to a vector of supervised labeling value of the enrollment speech training sample, the supervised labeling value being set by separately comparing a spectrum amplitude of each frame of the enrollment speech sample with a difference between a largest spectrum amplitude of the enrollment speech sample and a spectrum threshold (Paragraph 0207, lines 1-12, "The bin-wise VAD 2087 may indicate peaks that are associated with speech based on distances between adjacent peaks and temporal continuity. For example, the bin-wise VAD 2087 may indicate small peaks (e.g., peaks that are more than a threshold amount (e.g., 30 dB) below the maximum peak). The bin-wise voice indicator 2089 may indicate these small peaks to the peak removal block/module 2090, which may remove the small peaks from the transformed audio signal 2071. For example, if peaks are determined to be significantly lower (e.g., 30 dB) than a maximum peak, they may not be related to the speech envelope and are thus eliminated."; The bin-wise Voice Activity Detector (VAD) that eliminates peaks significantly lower than a maximum peak reads on the supervised labeling value, and the threshold amount reads on the spectrum threshold.).
Visser teaches comparing the amplitude of a frequency bin of a spectrum to the peak frequency bin for that spectrum and eliminating the amplitude value for the frequency bin if it is more than a threshold amount below the peak frequency bin value in order to reduce speech detection errors in low signal-to-noise scenarios (Paragraph 0062, lines 1-6, "Use of a voice activity detector as described herein may be expected to reduce speech processing errors that are often experienced in traditional voice activity detection, particularly in low signal-to-noise-ratio (SNR) scenarios, in non-stationary noise and competing voices cases, and other cases where voice may be present.").
Parthasarathi, Le Roux, Rodriguez, and Visser are considered to be analogous to the claimed invention because they are in the same field of speech recognition systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Parthasarathi in view of Le Roux and Rodriguez to incorporate the teachings of Visser to compare the amplitude of a frequency bin of a spectrum to the peak frequency bin for that spectrum and eliminate the amplitude value for the frequency bin if it is more than a threshold amount below the peak frequency bin value.  Doing so would allow for reducing speech detection errors in low signal-to-noise scenarios.
Parthasarathi in view of Le Roux, Rodriguez, and Visser does not specifically disclose: determining a speech extractor of the target speaker in each vector dimension, and separately measuring a distance between the vector of each frame of the mixed speech in each vector dimension and the speech extractor of the corresponding vector dimension to obtain a mask of each frame in the mixed speech; and determining speech belonging to the target speaker in the mixed speech based on the mask of each frame of the mixed speech.
Nakadai teaches:
determining a speech extractor of the target speaker in each vector dimension, and separately measuring a distance between the vector of each frame of the mixed speech in each vector dimension and the speech extractor of the corresponding vector dimension to obtain a mask of each frame in the mixed speech (Column 2, lines 39-42, "In the method for generating a soft mask for a speech recognition system according to the invention, the soft mask is determined using a function of reliability of separation, which has at least one parameter."; Column 5, lines 11-12, "An acoustic feature set and a mask are calculated for each time frame."; Column 7, lines 20-27, "Feature vector of 48 spectral-related features are used. The MFM is a vector corresponding to 24 static spectral features and 24 dynamic spectral features. Each element of a vector represents the reliability of each feature. In conventional MFM generation, a binary MFM (i.e., 1 for reliable and 0 for unreliable) was used. The mask generating section 103 generates a soft MFM whose element of vector ranges from 0.0 to 1.0.");
and determining speech belonging to the target speaker in the mixed speech based on the mask of each frame of the mixed speech (Column 1, lines 53-62, "A speech recognition system according to the invention includes a sound source separating section which separates mixed speeches from multiple sound sources; a mask generating section which generates a soft mask which can take continuous values between 0 and 1 for each separated speech according to reliability of separation in separating operation of the sound source separating section; and a speech recognizing section which recognizes speeches separated by the sound source separating section using soft masks generated by the mask generating section.").
Nakadai teaches comparing the features of the mixed speech with the features of the reference speech from the target speaker to generate a mask for each frame of the mixed speech that corresponds to speech from the target speaker in order to improve word recognition rate when recognizing speech from multiple sources (Column 11, lines 34-37, "use of appropriately designed and adjusted soft masks has improved word recognition rate of the speech recognition system for simultaneous recognition of multiple sources").
Parthasarathi, Le Roux, Rodriguez, Visser, and Nakadai are considered to be analogous to the claimed invention because they are in the same field of speech recognition systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Parthasarathi in view of Le Roux, Rodriguez, and Visser to incorporate the teachings of Nakadai to compare the features of the mixed speech with the features of the reference speech from the target speaker to generate a mask for each frame of the mixed speech that corresponds to speech from the target speaker.  Doing so would allow for improving word recognition rate when recognizing speech from multiple sources.
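For illustration only, the per-frame mask recited in the limitation above can be sketched as follows. The sketch assumes the distance is scored as a sigmoid of the inner product between each frame vector and the speech extractor. That scoring function, the function names, and the toy values are assumptions made for this example and are not taken from the claims or from Nakadai.

```python
import math


def sigmoid(x):
    """Squash a real-valued score into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def frame_masks(frame_vectors, extractor):
    """Return one soft mask value per frame of the mixed speech.

    Assumption: closeness to the extractor is scored by an inner product,
    so frames whose vectors align with the extractor get a mask near 1.
    """
    masks = []
    for vector in frame_vectors:
        score = sum(a * b for a, b in zip(vector, extractor))
        masks.append(sigmoid(score))
    return masks


def apply_masks(frame_amplitudes, masks):
    """Weight each frame amplitude by its mask to keep the target speaker."""
    return [amp * m for amp, m in zip(frame_amplitudes, masks)]


# Toy data: frame 0 aligns with the extractor, frame 1 opposes it.
extractor = [1.0, 0.0]
frames = [[4.0, 0.0], [-4.0, 0.0]]
masks = frame_masks(frames, extractor)
recovered = apply_masks([2.0, 2.0], masks)
```

In this toy case the first frame is largely retained and the second is largely suppressed, which corresponds to a soft mask taking continuous values between 0 and 1.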
Regarding claim 13, as best understood based on the 35 U.S.C. 112(b) issues identified above, Parthasarathi in view of Le Roux, Rodriguez, Visser, and Nakadai discloses the mixed speech recognition apparatus as claimed in claim 11.
Parthasarathi further discloses:
wherein the calculating an average vector of the enrollment speech in each vector dimension based on the vector of each frame of the enrollment speech in each vector dimension comprises: calculating the average vector of the enrollment speech in each vector dimension based on the vector of an effective frame of the enrollment speech in each vector dimension (Column 22, lines 45-47, "For the mean estimator, the system may compute the average feature values over the reference audio data.").
Visser further teaches:
the effective frame of the enrollment speech being a frame in the enrollment speech with a spectrum amplitude greater than a spectrum amplitude comparison value, and the spectrum amplitude comparison value being equal to a difference between the largest spectrum amplitude of the enrollment speech and a preset spectrum threshold (Paragraph 0207, lines 1-12, "The bin-wise VAD 2087 may indicate peaks that are associated with speech based on distances between adjacent peaks and temporal continuity. For example, the bin-wise VAD 2087 may indicate small peaks (e.g., peaks that are more than a threshold amount (e.g., 30 dB) below the maximum peak). The bin-wise voice indicator 2089 may indicate these small peaks to the peak removal block/module 2090, which may remove the small peaks from the transformed audio signal 2071. For example, if peaks are determined to be significantly lower (e.g., 30 dB) than a maximum peak, they may not be related to the speech envelope and are thus eliminated."; The bin-wise Voice Activity Detector (VAD) that eliminates peaks significantly lower than a maximum peak reads on determining an effective frame, and the threshold amount reads on the preset spectrum threshold.).
Visser teaches comparing the amplitude of a frequency bin of a spectrum to the peak frequency bin for that spectrum and eliminating the amplitude value for the frequency bin if it is more than a threshold amount below the peak frequency bin value in order to reduce speech detection errors in low signal-to-noise scenarios (Paragraph 0062, lines 1-6, "Use of a voice activity detector as described herein may be expected to reduce speech processing errors that are often experienced in traditional voice activity detection, particularly in low signal-to-noise-ratio (SNR) scenarios, in non-stationary noise and competing voices cases, and other cases where voice may be present.").
Parthasarathi, Le Roux, Rodriguez, Visser, and Nakadai are considered to be analogous to the claimed invention because they are in the same field of speech recognition systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Parthasarathi in view of Le Roux, Rodriguez, Visser, and Nakadai to further incorporate the teachings of Visser to compare the amplitude of a frequency bin of a spectrum to the peak frequency bin for that spectrum and eliminate the amplitude value for the frequency bin if it is more than a threshold amount below the peak frequency bin value.  Doing so would allow for reducing speech detection errors in low signal-to-noise scenarios.
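As a minimal sketch of the effective-frame test discussed above, the comparison value can be computed as the largest amplitude minus a threshold, and frames above that value kept. The function name and the sample amplitudes are hypothetical; expressing amplitudes in dB and defaulting to the 30 dB figure follow the example quoted from Visser, but are otherwise assumptions of this sketch.

```python
def effective_frame_indices(frame_amplitudes_db, threshold_db=30.0):
    """Return indices of frames whose amplitude exceeds (max - threshold)."""
    comparison_value = max(frame_amplitudes_db) - threshold_db
    return [
        index
        for index, amplitude in enumerate(frame_amplitudes_db)
        if amplitude > comparison_value
    ]


# Hypothetical per-frame amplitudes in dB; the largest is -5 dB,
# so the comparison value is -35 dB.
amps = [-5.0, -40.0, -12.0, -36.0]
effective = effective_frame_indices(amps)
```

With these values, only the frames at -5 dB and -12 dB exceed the comparison value and are treated as effective.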
Regarding claim 14, as best understood based on the 35 U.S.C. 112(b) issues identified above, Parthasarathi in view of Le Roux, Rodriguez, Visser, and Nakadai discloses the mixed speech recognition apparatus as claimed in claim 13.
Rodriguez further teaches: wherein the calculating the average vector of the enrollment speech in each vector dimension based on the vector of an effective frame of the enrollment speech in each vector dimension comprises:
summing, after the vector of each frame of the enrollment speech in the corresponding vector dimension is multiplied by a supervised labeling value of the corresponding frame, vector dimensions to obtain a total vector of the effective frame of the enrollment speech in the corresponding vector dimension; and separately dividing the total vector of the effective frame of the enrollment speech in each vector dimension by the sum of the supervised labeling values of the frames of the enrollment speech to obtain the average vector of the enrollment speech in each vector dimension (Paragraph 0094, lines 1-8, "If the parameters describing average feature vectors of audio data are calculated in a system according to the invention, D dimensional Mel Frequency Cepstral Coefficients MFCCs are one possible option to describe feature vectors of audio data. Thus, average feature vectors of audio data may e.g. be calculated by calculating the mean µmfcc,d of D dimensional MFCCs in one, two, three, or more or each dimension"; Paragraph 0098, lines 1-4, "Optionally, only those parts of the audio data file j comprising voice signals are taken into account for extracting the MFCCs, e.g. by using a Voice Activity Detector (VAD)."; Using the Voice Activity Detector to determine part of the audio data to be taken into account for extracting features for calculating the average feature vectors reads on multiplying the vectors by a supervised labeling value before summing the vectors when calculating the average vector.).
Rodriguez teaches averaging the feature vectors for frames containing valid speech in order to analyze the speech to determine if the speech matches the target speaker (Paragraph 0008, lines 2-5, "This voice utterance is analyzed using biometric voice data to verify that the speaker's voice corresponds to the identity of the speaker that is to be verified.").
Parthasarathi, Le Roux, Rodriguez, Visser, and Nakadai are considered to be analogous to the claimed invention because they are in the same field of speech recognition systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Parthasarathi in view of Le Roux, Rodriguez, Visser, and Nakadai to further incorporate the teachings of Rodriguez to average the feature vectors for frames containing valid speech.  Doing so would allow for analyzing the speech to determine if the speech matches the target speaker.
Visser further teaches:
the supervised labeling value of a frame in the enrollment speech being 1 when a spectrum amplitude of the frame is greater than the spectrum amplitude comparison value; and being 0 when the spectrum amplitude of the frame is not greater than the spectrum amplitude comparison value (Paragraph 0207, lines 1-12, "The bin-wise VAD 2087 may indicate peaks that are associated with speech based on distances between adjacent peaks and temporal continuity. For example, the bin-wise VAD 2087 may indicate small peaks (e.g., peaks that are more than a threshold amount (e.g., 30 dB) below the maximum peak). The bin-wise voice indicator 2089 may indicate these small peaks to the peak removal block/module 2090, which may remove the small peaks from the transformed audio signal 2071. For example, if peaks are determined to be significantly lower (e.g., 30 dB) than a maximum peak, they may not be related to the speech envelope and are thus eliminated."; The bin-wise Voice Activity Detector (VAD) that eliminates peaks significantly lower than a maximum peak reads on the supervised labeling value, eliminating a peak reads on the supervised labeling value being 0, not eliminating a peak reads on the supervised labeling value being 1, and the threshold amount below the maximum peak reads on the spectrum amplitude comparison value.).
Visser teaches comparing the amplitude of a frequency bin of a spectrum to the peak frequency bin for that spectrum and eliminating the amplitude value for the frequency bin if it is more than a threshold amount below the peak frequency bin value in order to reduce speech detection errors in low signal-to-noise scenarios (Paragraph 0062, lines 1-6, "Use of a voice activity detector as described herein may be expected to reduce speech processing errors that are often experienced in traditional voice activity detection, particularly in low signal-to-noise-ratio (SNR) scenarios, in non-stationary noise and competing voices cases, and other cases where voice may be present.").
Parthasarathi, Le Roux, Rodriguez, Visser, and Nakadai are considered to be analogous to the claimed invention because they are in the same field of speech recognition systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Parthasarathi in view of Le Roux, Rodriguez, Visser, and Nakadai to further incorporate the teachings of Visser to compare the amplitude of a frequency bin of a spectrum to the peak frequency bin for that spectrum and eliminate the amplitude value for the frequency bin if it is more than a threshold amount below the peak frequency bin value.  Doing so would allow for reducing speech detection errors in low signal-to-noise scenarios.
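The arithmetic recited in claim 14 can be illustrated with a short sketch: each frame vector is multiplied by a 0/1 supervised labeling value, the products are summed per vector dimension, and each sum is divided by the sum of the labels. The function names and toy data are hypothetical and are not drawn from Rodriguez or Visser; the sketch only shows how a label vector turns an average over all frames into an average over effective frames.

```python
def supervised_labels(frame_amplitudes, threshold):
    """Label a frame 1 if its amplitude exceeds (max - threshold), else 0."""
    comparison_value = max(frame_amplitudes) - threshold
    return [1 if a > comparison_value else 0 for a in frame_amplitudes]


def labeled_average(frame_vectors, labels):
    """Average frame vectors per dimension, counting only frames labeled 1."""
    total_labels = sum(labels)
    if total_labels == 0:
        raise ValueError("no effective frames to average")
    dims = len(frame_vectors[0])
    return [
        sum(vector[k] * label for vector, label in zip(frame_vectors, labels))
        / total_labels
        for k in range(dims)
    ]


# Toy data: the middle frame is quiet and should not affect the average.
amps = [10.0, 2.0, 9.0]
vectors = [[1.0, 2.0], [100.0, 100.0], [3.0, 4.0]]
labels = supervised_labels(amps, threshold=5.0)
average = labeled_average(vectors, labels)
```

Here the quiet frame receives a label of 0, so its large vector values drop out of both the numerator and the denominator.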
Regarding claim 16, as best understood based on the 35 U.S.C. 112(b) issues identified above, Parthasarathi in view of Le Roux, Rodriguez, Visser, and Nakadai discloses the mixed speech recognition apparatus as claimed in claim 11.
Parthasarathi further discloses:
wherein the average vector of the enrollment speech in each vector dimension is used as the speech extractor of the target speaker in each vector dimension (Parthasarathi, Column 22, lines 43-54, "The LAMS method may allow the system to keep the features in the desired range and for better distinguishing features between the desired and interfering speech. For the mean estimator, the system may compute the average feature values over the reference audio data. For the task of recognizing speech from the desired talker, this constraint is advantageous. The reference audio data may be used as an example of the desired talker's speech, and then by subtracting the LAMS, the system may shift the features corresponding to the desired speaker closer to being zero-mean. This allows the system to train a classifier, e.g., a DNN, to better classify a desired talker's speech.").
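The mean estimator described in the quoted passage can be sketched in a few lines: an average feature vector is computed over the reference audio data and then subtracted from each frame, which shifts the desired speaker's features toward zero mean. The function names and sample values are hypothetical, and the sketch is not asserted to be Parthasarathi's implementation.

```python
def mean_vector(reference_frames):
    """Average the reference feature vectors in each vector dimension."""
    count = len(reference_frames)
    dims = len(reference_frames[0])
    return [sum(frame[k] for frame in reference_frames) / count for k in range(dims)]


def subtract_mean(frames, mean):
    """Shift every frame by the reference mean, dimension by dimension."""
    return [[x - m for x, m in zip(frame, mean)] for frame in frames]


# Toy reference data with two frames and two feature dimensions.
reference = [[1.0, 4.0], [3.0, 8.0]]
mean = mean_vector(reference)
shifted = subtract_mean(reference, mean)
```

After subtraction, the reference frames themselves average to zero in each dimension.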
Regarding claim 20, as best understood based on the 35 U.S.C. 112(b) issues identified above, Parthasarathi teaches a non-transitory computer-readable storage medium, storing a computer program, the computer program, when executed by a processor (Column 9, lines 32-34, “The device performing NLU processing 260 (e.g., server 120) may include various components, including potentially dedicated processor(s), memory, storage, etc.”), implementing:
monitoring speech input and detecting an enrollment speech and a mixed speech from the speech input, the enrollment speech comprising preset speech information, and the mixed speech being non-enrollment speech inputted after the enrollment speech (Abstract, lines 1-6, "A system configured to process speech commands may classify incoming audio as desired speech, undesired speech, or non-speech. Desired speech is speech that is from a same speaker as reference speech. The reference speech may be obtained from a configuration session or from a first portion of input speech that includes a wakeword.");
separately mapping a spectrum of the enrollment speech and a spectrum of the mixed speech into a K-dimensional vector space by using a deep neural network to obtain a vector of each frame of the enrollment speech in each vector dimension and a vector of each frame of the mixed speech in each vector dimension, K being not less than 1 (Column 8, lines 22-28, "The AFE may reduce noise in the audio data and divide the digitized audio data into frames representing a time intervals for which the AFE determines a number of values, called features, representing the qualities of the audio data, along with a set of those values, called an audio feature vector, representing the features/qualities of the audio data within the frame."; Column 16, lines 21-29, "For ASR processing the base input is typically audio data in the form of audio feature vectors corresponding to audio frames. As noted above, typically acoustic features (such as log-filter bank energies (LFBE) features, MFCC features, or other features) are determined and used to create audio feature vectors for each audio frame. It is possible to feed audio data into an RNN, using the amplitude and (phase) spectrum of a fast-Fourier transform (FFT), or other technique that projects an audio signal into a sequence of data.");
calculating an average vector of the enrollment speech in each vector dimension based on the vector of each frame of the enrollment speech in each vector dimension (Column 22, lines 45-47, "For the mean estimator, the system may compute the average feature values over the reference audio data.").
Parthasarathi does not specifically disclose: the deep neural network being pre-trained by minimizing an objective function that describes a spectral error between a speech of the target speaker recovered by an estimated mask and a reference speech of the target speaker, wherein the estimated mask of the target speaker is obtained by measuring a distance between a vector of each frame of a mixed speech training sample in each vector dimension and an estimated speech extractor in each vector dimension of the K-dimensional vector space.
Le Roux teaches:
the deep neural network being pre-trained by minimizing an objective function that describes a spectral error between a speech of the target speaker recovered by an estimated mask and a reference speech of the target speaker (Column 3, lines 15-17, "Some embodiments of the present disclosure include training a deep neural network (DNN)-based enhancement system through a phase reconstruction stage."; Column 3, lines 44-48, "Accordingly, embodiments of the present disclosure train the network or DNN-based enhancement system to minimize an objective function including losses defined on the outcome of one or multiple steps of such iterative procedures."; Column 4, lines 23-26, "According to an embodiment of the present disclosure, an audio signal processing system for transforming an input audio signal, wherein the input audio signal includes a mixture of one or more target audio signals."; Column 16, line 66 - Column 17, line 2, "The Error Computation module 1030 can use the outputs of the spectrogram estimation from mask module 1013 and the reference source signals 1034 to compute a spectrogram estimation loss"; The spectrogram estimation loss reads on the spectral error, the target audio signals read on the speech of the target speaker, and the reference source signal reads on the reference speech.),
wherein the estimated mask of the target speaker is obtained by measuring a distance between a vector of each frame of a mixed speech training sample in each vector dimension and an estimated speech extractor in each vector dimension of the K-dimensional vector space (Column 13, lines 21-31, "FIG. 9A is a block diagram illustrating a single-channel mask inference network architecture 900A, according to embodiments of the present disclosure. A sequence of feature vectors obtained from the input mixture, for example the log magnitude of the short-time Fourier transform of the input mixture, is used as input to a mixture encoder 910. For example, the dimension of the input vector in the sequence can be F. The mixture encoder 910 is composed of multiple bidirectional long short-term memory (BLSTM) neural network layers, from the first BLSTM layer 930 to the last BLSTM layer 935."; Column 13, lines 40-50, "For each time frame and each frequency in a time-frequency domain, for example the short-time Fourier transform domain, the linear layer 940 uses output of the last BLSTM layer 935 to output C numbers, where C is the number of target speakers. The non-linearity 945 is applied to this set of C numbers for each time frame and each frequency, leading to mask values which indicate, for each time frame, each frequency, and each target speaker, the dominance of that target speaker in the input mixture at that time frame and that frequency."; Column 27, lines 1-7, "An aspect can include the error on the target audio signal estimates includes a distance between the target audio signal estimates and the reference target audio signals. Further, an aspect can include the error on the target audio signal estimates includes a distance between the spectrograms of target audio signal estimates and the spectrograms of the reference target audio signals."; The mask values which indicate the dominance of a target speaker read on the estimated mask of the target speaker, the input mixture reads on the mixed speech training sample, and the reference target audio signals read on the estimated speech extractor.).
Le Roux teaches training a deep neural network by minimizing an objective function including spectral loss between target audio and a reference signal and generating a mask for a target speaker by comparing the feature vectors of an audio input containing multiple speakers and a reference target audio in order to separate the speech of multiple speakers in an audio signal (Column 7, lines 59-62, "The present disclosure relates to audio signals, and more particularly to using an end-to-end approach for single-channel speaker-independent multi-speaker speech separation.").
Parthasarathi and Le Roux are considered to be analogous to the claimed invention because they are in the same field of speech recognition systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Parthasarathi to incorporate the teachings of Le Roux to train a deep neural network by minimizing an objective function including spectral loss between target audio and a reference signal and generate a mask for a target speaker by comparing the feature vectors of an audio input containing multiple speakers and a reference target audio.  Doing so would allow for separating the speech of multiple speakers in an audio signal.
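By way of a hedged example, an objective function describing a spectral error between mask-recovered speech and reference speech could take the following form. The squared-error form, the function name, and the toy spectra are assumptions chosen for illustration; Le Roux refers to a spectrogram estimation loss, and this sketch does not purport to reproduce its exact definition.

```python
def spectral_error(mixture_spectrum, reference_spectrum, mask):
    """Sum of squared differences between reference and masked mixture.

    Assumption: the recovered target spectrum is the element-wise product
    of the mixture spectrum and the estimated mask.
    """
    return sum(
        (reference - mixture * m) ** 2
        for mixture, reference, m in zip(mixture_spectrum, reference_spectrum, mask)
    )


# Toy magnitudes for two time-frequency points.
mixture = [2.0, 4.0]
reference = [1.0, 4.0]

good_mask_error = spectral_error(mixture, reference, [0.5, 1.0])
poor_mask_error = spectral_error(mixture, reference, [1.0, 1.0])
```

Training that minimizes such a quantity would favor the first mask, which recovers the reference exactly in this toy case, over the second.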
Parthasarathi in view of Le Roux does not specifically disclose: the estimated speech extractor is obtained according to a vector of each frame of an enrollment speech training sample in each vector dimension of the K-dimensional vector space.
Rodriguez teaches:
the estimated speech extractor is obtained according to a vector of each frame of an enrollment speech training sample in each vector dimension of the K-dimensional vector space (Paragraph 0071, lines 4-7, "In speaker verification, two voice prints are compared, one of the speaker known to the system in advance (e.g. from previous enrollment) and another extracted from the received audio data."; Paragraph 0094, lines 1-8, "If the parameters describing average feature vectors of audio data are calculated in a system according to the invention, D dimensional Mel Frequency Cepstral Coefficients MFCCs are one possible option to describe feature vectors of audio data. Thus, average feature vectors of audio data may e.g. be calculated by calculating the mean µmfcc,d of D dimensional MFCCs in one, two, three, or more or each dimension"; Paragraph 0097 line 1 - Paragraph 0098 line 6, "MFCCs: mfcc j t d, may be extracted. Herein J is the number of the considered audio data files, t is the frame index with a value between 1 and Tj (tϵ[1;Tj]), wherein Tj is the total number of (speech) frames for audio data file j. Optionally, only those parts of the audio data file j comprising voice signals are taken into account for extracting the MFCCs, e.g. by using a Voice Activity Detector (VAD). d is a value between 1 and D (dϵ[1;D]) representing the considered dimension."; The feature vectors read on the enrollment speech vectors.).
Rodriguez teaches extracting speech from audio data by comparing the audio data to enrollment speech based on feature vectors of the speech in order to analyze the speech to determine if the speech matches the target speaker (Paragraph 0008, lines 2-5, "This voice utterance is analyzed using biometric voice data to verify that the speaker's voice corresponds to the identity of the speaker that is to be verified.").
Parthasarathi, Le Roux, and Rodriguez are considered to be analogous to the claimed invention because they are in the same field of speech recognition systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Parthasarathi in view of Le Roux to incorporate the teachings of Rodriguez to extract speech from audio data by comparing the audio data to enrollment speech based on feature vectors of the speech.  Doing so would allow for analyzing the speech to determine if the speech matches the target speaker.
Parthasarathi in view of Le Roux and Rodriguez does not specifically disclose: the estimated speech extractor is obtained according to a vector of supervised labeling value of the enrollment speech training sample, the supervised labeling value being set by separately comparing a spectrum amplitude of each frame of the enrollment speech sample with a difference between a largest spectrum amplitude of the enrollment speech sample and a spectrum threshold.
Visser teaches:
the estimated speech extractor is obtained according to a vector of supervised labeling value of the enrollment speech training sample, the supervised labeling value being set by separately comparing a spectrum amplitude of each frame of the enrollment speech sample with a difference between a largest spectrum amplitude of the enrollment speech sample and a spectrum threshold (Paragraph 0207, lines 1-12, "The bin-wise VAD 2087 may indicate peaks that are associated with speech based on distances between adjacent peaks and temporal continuity. For example, the bin-wise VAD 2087 may indicate small peaks (e.g., peaks that are more than a threshold amount (e.g., 30 dB) below the maximum peak). The bin-wise voice indicator 2089 may indicate these small peaks to the peak removal block/module 2090, which may remove the small peaks from the transformed audio signal 2071. For example, if peaks are determined to be significantly lower (e.g., 30 dB) than a maximum peak, they may not be related to the speech envelope and are thus eliminated."; The bin-wise Voice Activity Detector (VAD) that eliminates peaks significantly lower than a maximum peak reads on the supervised labeling value, and the threshold amount reads on the spectrum threshold.).
Visser teaches comparing the amplitude of a frequency bin of a spectrum to the peak frequency bin for that spectrum and eliminating the amplitude value for the frequency bin if it is more than a threshold amount below the peak frequency bin value in order to reduce speech detection errors in low signal-to-noise scenarios (Paragraph 0062, lines 1-6, "Use of a voice activity detector as described herein may be expected to reduce speech processing errors that are often experienced in traditional voice activity detection, particularly in low signal-to-noise-ratio (SNR) scenarios, in non-stationary noise and competing voices cases, and other cases where voice may be present.").
Parthasarathi, Le Roux, Rodriguez, and Visser are considered to be analogous to the claimed invention because they are in the same field of speech recognition systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Parthasarathi in view of Le Roux and Rodriguez to incorporate the teachings of Visser to compare the amplitude of a frequency bin of a spectrum to the peak frequency bin for that spectrum and eliminate the amplitude value for the frequency bin if it is more than a threshold amount below the peak frequency bin value.  Doing so would allow for reducing speech detection errors in low signal-to-noise scenarios.
Parthasarathi in view of Le Roux, Rodriguez, and Visser does not specifically disclose: determining a speech extractor of the target speaker in each vector dimension, and separately measuring a distance between the vector of each frame of the mixed speech in each vector dimension and the speech extractor of the corresponding vector dimension to obtain a mask of each frame in the mixed speech; and determining speech belonging to the target speaker in the mixed speech based on the mask of each frame of the mixed speech.
Nakadai teaches:
determining a speech extractor of the target speaker in each vector dimension, and separately measuring a distance between the vector of each frame of the mixed speech in each vector dimension and the speech extractor of the corresponding vector dimension to obtain a mask of each frame in the mixed speech (Column 2, lines 39-42, "In the method for generating a soft mask for a speech recognition system according to the invention, the soft mask is determined using a function of reliability of separation, which has at least one parameter."; Column 5, lines 11-12, "An acoustic feature set and a mask are calculated for each time frame."; Column 7, lines 20-27, "Feature vector of 48 spectral-related features are used. The MFM is a vector corresponding to 24 static spectral features and 24 dynamic spectral features. Each element of a vector represents the reliability of each feature. In conventional MFM generation, a binary MFM (i.e., 1 for reliable and 0 for unreliable) was used. The mask generating section 103 generates a soft MFM whose element of vector ranges from 0.0 to 1.0.");
and determining speech belonging to the target speaker in the mixed speech based on the mask of each frame of the mixed speech (Column 1, lines 53-62, "A speech recognition system according to the invention includes a sound source separating section which separates mixed speeches from multiple sound sources; a mask generating section which generates a soft mask which can take continuous values between 0 and 1 for each separated speech according to reliability of separation in separating operation of the sound source separating section; and a speech recognizing section which recognizes speeches separated by the sound source separating section using soft masks generated by the mask generating section.").
Nakadai teaches comparing the features of the mixed speech with the features of the reference speech from the target speaker to generate a mask for each frame of the mixed speech that corresponds to speech from the target speaker in order to improve word recognition rate when recognizing speech from multiple sources (Column 11, lines 34-37, "use of appropriately designed and adjusted soft masks has improved word recognition rate of the speech recognition system for simultaneous recognition of multiple sources").
Parthasarathi, Le Roux, Rodriguez, Visser, and Nakadai are considered to be analogous to the claimed invention because they are in the same field of speech recognition systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Parthasarathi in view of Le Roux, Rodriguez, and Visser to incorporate the teachings of Nakadai to compare the features of the mixed speech with the features of the reference speech from the target speaker to generate a mask for each frame of the mixed speech that corresponds to speech from the target speaker.  Doing so would allow for improving word recognition rate when recognizing speech from multiple sources.
Claims 5 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Parthasarathi in view of Le Roux, Rodriguez, Visser, and Nakadai and further in view of King et al. (“Robust Speech Recognition Via Anchor Word Representations”), hereinafter King, Yu (US Patent No. 9,818,431), and Variani et al. (“Deep Neural Networks for Small Footprint Text-Dependent Speaker Verification”), hereinafter Variani.
Regarding claim 5, as best understood based on the 35 U.S.C. 112(b) issues identified above, Parthasarathi in view of Le Roux, Rodriguez, Visser, and Nakadai discloses the mixed speech recognition method as claimed in claim 1, but does not specifically disclose: after calculating the average vector of the enrollment speech in each vector dimension based on the vector of each frame of the enrollment speech in each vector dimension, inputting the average vector of the enrollment speech in each vector dimension and the vector of each frame of the mixed speech in each vector dimension to a pre-trained feedforward neural network to obtain a normalized vector of each frame in each vector dimension.
King teaches:
after calculating the average vector of the enrollment speech in each vector dimension based on the vector of each frame of the enrollment speech in each vector dimension, inputting the average vector of the enrollment speech in each vector dimension and the vector of each frame of the mixed speech in each vector dimension to a pre-trained feedforward neural network to obtain a normalized vector of each frame in each vector dimension (Section 1, lines 66-68, "Our first method, termed anchored mean subtraction (AMS), extracts the anchor word mean to normalize the speech features in the utterance."; Section 2, lines 1-2, "We briefly review two methods of embedding anchor word information into a DNN-based AM"; Section 2.2, lines 14-17, "The encoder network in Figure 3, which is a single-layer LSTM model consuming a variable length sequence of features from the anchor word segment, generates an embedding of the desired speech."; The anchor word reads on the enrollment speech, the utterance reads on the mixed speech, the Long Short Term Memory (LSTM) model reads on the feedforward neural network, and normalizing the speech features reads on obtaining a normalized vector.).
King teaches the use of a feedforward neural network and normalizing the speech features to reduce word error rate when background speech is present (Abstract, lines 6-10, "We expand on our previous work on device-directed speech detection in the far-field speech setting and introduce two approaches for robust acoustic modeling. Both methods are based on the idea of using an anchor word taken from the device directed speech."; Abstract, lines 16-19, "Results on an in-house dataset reveal that, in the presence of background speech, the proposed approaches can achieve up to 35% relative word error rate reduction.").
Parthasarathi, Le Roux, Rodriguez, Visser, Nakadai, and King are considered to be analogous to the claimed invention because they are in the same field of speech recognition systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Parthasarathi in view of Le Roux, Rodriguez, Visser, and Nakadai to incorporate the teachings of King to use a feedforward neural network and normalize the speech features.  Doing so would allow for the reduction of the word error rate when background speech is present.
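For illustration only, the anchored mean subtraction (AMS) quoted from King above may be sketched as follows. This is a minimal sketch and not King's implementation: the function names, the list-of-lists feature layout, and the sample values are assumptions introduced here, and the neural network stage is omitted.

```python
from typing import List

Frame = List[float]  # one feature vector per frame (assumed layout)


def anchor_mean(anchor_frames: List[Frame]) -> Frame:
    """Average the anchor-word (enrollment) frames in each feature dimension."""
    if not anchor_frames:
        raise ValueError("anchor segment must contain at least one frame")
    dims = len(anchor_frames[0])
    count = len(anchor_frames)
    return [sum(frame[d] for frame in anchor_frames) / count for d in range(dims)]


def anchored_mean_subtraction(
    utterance_frames: List[Frame], anchor_frames: List[Frame]
) -> List[Frame]:
    """Normalize each utterance frame by subtracting the anchor-word mean."""
    mean = anchor_mean(anchor_frames)
    return [[x - m for x, m in zip(frame, mean)] for frame in utterance_frames]


# Hypothetical two-dimensional features, chosen only to show the arithmetic.
anchor = [[1.0, 2.0], [3.0, 4.0]]      # anchor-word frames, mean is [2.0, 3.0]
utterance = [[2.0, 3.0], [5.0, 1.0]]   # frames of the utterance
normalized = anchored_mean_subtraction(utterance, anchor)
print(normalized)  # [[0.0, 0.0], [3.0, -2.0]]
```

In this sketch the anchor-word mean plays the role of the average vector of the enrollment speech, and the subtracted frames play the role of the normalized vectors.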
Parthasarathi in view of Le Roux, Rodriguez, Visser, and Nakadai and further in view of King does not disclose: wherein the step of separately measuring the distance between the vector of each frame of the mixed speech in each vector dimension and the speech extractor of the corresponding vector dimension is replaced by separately measuring a distance between the normalized vector of each frame in each vector dimension and a preset speech extractor to obtain the mask of each frame of the mixed speech; wherein the speech belonging to the target speaker in the mixed speech is determined based on the mask of each frame of the mixed speech.
Yu teaches:
wherein the step of separately measuring the distance between the vector of each frame of the mixed speech in each vector dimension and the speech extractor of the corresponding vector dimension is replaced by separately measuring a distance between the normalized vector of each frame in each vector dimension and a preset speech extractor to obtain the mask of each frame of the mixed speech (Column 2, line 64 - column 3, line 7, "The technology described herein uses a multiple-output layer RNN to process an acoustic signal comprising speech from multiple speakers to trace an individual speaker's speech. The multiple-output layer RNN has multiple output layers, each of which is meant to trace one speaker (or noise) and represents the mask for that speaker (or noise). The output layer for each speaker (or noise) can have the same dimensions and can be normalized for each output unit across all output layers. The rest of the layers in the multiple-output layer RNN are shared across all the output layers."; The speech from multiple speakers reads on the mixed speech and an RNN output layer tracing one speaker and representing a mask reads on measuring a distance between a normalized vector of the mixed speech and the preset speech extractor to obtain a mask.);
wherein the speech belonging to the target speaker in the mixed speech is determined based on the mask of each frame of the mixed speech (Column 11, lines 54-59, "At step 460 an acoustic mask for a speaker-specific signal within the acoustic information is generated at one of the at least two output layers. In one aspect, an acoustic mask is generated at each output layer. The acoustic mask can be used to isolate a signal associated with a single speaker for further processing.").
Yu teaches generating a mask to isolate the speech of a single speaker in speech with multiple speakers in order to determine the content of a question spoken by a person when other people are also speaking (Column 4, lines 46-51, "The ASR model using a multiple-output layer RNN model described herein can process the inputted data to determine computer-usable information. For example, a query spoken by a user into a far-field microphone while multiple people in the room are talking may be processed to determine the content of the query (i.e., what the user is asking for).").
Parthasarathi, Le Roux, Rodriguez, Visser, Nakadai, King, and Yu are considered to be analogous to the claimed invention because they are in the same field of speech recognition systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Parthasarathi in view of Le Roux, Rodriguez, Visser, and Nakadai and further in view of King to incorporate the teachings of Yu to generate a mask to isolate the speech of a single speaker in speech with multiple speakers.  Doing so would allow for determining the content of a question spoken by a person when other people are also speaking.
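For illustration only, the mask generation quoted from Yu above (one output layer per speaker, normalized for each output unit across all output layers, then used to isolate a single speaker) may be sketched as follows. This is a minimal sketch and not Yu's implementation: the sum-to-one normalization, the function names, and the sample values are assumptions introduced here, and the shared RNN layers are omitted.

```python
from typing import List


def normalize_across_layers(layer_outputs: List[List[float]]) -> List[List[float]]:
    """Turn non-negative per-layer activations into masks that sum to 1
    for each output unit across all output layers (assumed normalization)."""
    n_layers = len(layer_outputs)
    n_units = len(layer_outputs[0])
    masks = [[0.0] * n_units for _ in range(n_layers)]
    for u in range(n_units):
        total = sum(layer[u] for layer in layer_outputs)
        for k, layer in enumerate(layer_outputs):
            # Fall back to an even split when every layer is silent in this unit.
            masks[k][u] = layer[u] / total if total > 0 else 1.0 / n_layers
    return masks


def apply_mask(mixture_frame: List[float], mask: List[float]) -> List[float]:
    """Scale each unit of the mixed-speech frame by the speaker's mask value."""
    return [m * x for m, x in zip(mask, mixture_frame)]


# Hypothetical activations for two output layers (two speakers), two units each.
outputs = [[3.0, 1.0], [1.0, 1.0]]
masks = normalize_across_layers(outputs)
speaker0 = apply_mask([8.0, 4.0], masks[0])
print(masks)     # [[0.75, 0.5], [0.25, 0.5]]
print(speaker0)  # [6.0, 2.0]
```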
Parthasarathi in view of Le Roux, Rodriguez, Visser, and Nakadai and further in view of King and Yu does not disclose: the preset speech extractor being a centroid of speech extractors of the target speaker obtained during a training process of a recognition network, the recognition network including the feed forward neural network and the deep neural network.
Variani teaches:
the preset speech extractor being a centroid of speech extractors of the target speaker obtained during a training process of a recognition network, the recognition network including the feed forward neural network and the deep neural network (Abstract, lines 1-9, "In this paper we investigate the use of deep neural networks (DNNs) for a small footprint text-dependent speaker verification task. At development stage, a DNN is trained to classify speakers at the frame level. During speaker enrollment, the trained DNN is used to extract speaker specific features from the last hidden layer. The average of these speaker features, or d-vector, is taken as the speaker model. At evaluation stage, a d-vector is extracted for each utterance and compared to the enrolled speaker model to make a verification decision."; Section 3.1, lines 14-20, "Once the DNN has been trained successfully, we use the accumulated output activations of the last hidden layer as a new speaker representation. That is, for every frame of a given utterance belonging to a new speaker, we compute the output activations of the last hidden layer using standard feedforward propagation in the trained DNN, and then accumulate those activations to form a new compact representation of that speaker, the d-vector."; The speaker model reads on the preset speech extractor, and the average of the speaker specific features reads on the centroid of speech extractors.).
Variani teaches using an average of speaker specific features, obtained through training a neural network, as a speaker model to determine if speech from a specific speaker occurs in an utterance in order to improve speaker verification performance (Abstract, lines 9-16, "Experimental results show the DNN based speaker verification system achieves good performance compared to a popular i-vector system on a small footprint text-dependent speaker verification task. In addition, the DNN based system is more robust to additive noise and outperforms the i-vector system at low False Rejection operating points. Finally the combined system outperforms the i-vector system by 14% and 25% relative in equal error rate (EER) for clean and noisy conditions respectively.").
Parthasarathi, Le Roux, Rodriguez, Visser, Nakadai, King, Yu, and Variani are considered to be analogous to the claimed invention because they are in the same field of speech recognition systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Parthasarathi in view of Le Roux, Rodriguez, Visser, and Nakadai and further in view of King and Yu to incorporate the teachings of Variani to use an average of speaker specific features, obtained through training a neural network, as a speaker model to determine if speech from a specific speaker occurs in an utterance.  Doing so would allow for improving speaker verification performance.
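For illustration only, the d-vector speaker model quoted from Variani above (averaging per-frame activations of the last hidden layer and comparing the result to the enrolled speaker model) may be sketched as follows. This is a minimal sketch and not Variani's implementation: the per-frame activations are taken as given rather than computed by a trained DNN, and the cosine similarity score, the 0.8 threshold, and all names are assumptions introduced here.

```python
import math
from typing import List


def d_vector(frame_activations: List[List[float]]) -> List[float]:
    """Average the per-frame hidden-layer activations into one speaker vector."""
    count = len(frame_activations)
    dims = len(frame_activations[0])
    return [sum(f[d] for f in frame_activations) / count for d in range(dims)]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def verify(enrolled_model: List[float], test_activations: List[List[float]],
           threshold: float = 0.8) -> bool:
    """Accept the utterance if its d-vector is close enough to the speaker model.
    The threshold value is hypothetical."""
    return cosine_similarity(enrolled_model, d_vector(test_activations)) >= threshold


# Hypothetical activations; a real system would obtain them from the trained DNN.
enrolled_model = d_vector([[1.0, 0.0], [3.0, 0.0]])          # [2.0, 0.0]
same = verify(enrolled_model, [[4.0, 0.0], [2.0, 0.0]])       # True
different = verify(enrolled_model, [[0.0, 1.0], [0.0, 3.0]])  # False
```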
Regarding claim 15, as best understood based on the 35 U.S.C. 112(b) issues identified above, Parthasarathi in view of Le Roux, Rodriguez, Visser, and Nakadai discloses the mixed speech recognition apparatus as claimed in claim 11, but does not specifically disclose: after calculating the average vector of the enrollment speech in each vector dimension based on the vector of each frame of the enrollment speech in each vector dimension, inputting the average vector of the enrollment speech in each vector dimension and the vector of each frame of the mixed speech in each vector dimension to a pre-trained feedforward neural network to obtain a normalized vector of each frame in each vector dimension.
King teaches:
after calculating the average vector of the enrollment speech in each vector dimension based on the vector of each frame of the enrollment speech in each vector dimension, inputting the average vector of the enrollment speech in each vector dimension and the vector of each frame of the mixed speech in each vector dimension to a pre-trained feedforward neural network to obtain a normalized vector of each frame in each vector dimension (Section 1, lines 66-68, "Our first method, termed anchored mean subtraction (AMS), extracts the anchor word mean to normalize the speech features in the utterance."; Section 2, lines 1-2, "We briefly review two methods of embedding anchor word information into a DNN-based AM"; Section 2.2, lines 14-17, "The encoder network in Figure 3, which is a single-layer LSTM model consuming a variable length sequence of features from the anchor word segment, generates an embedding of the desired speech."; The anchor word reads on the enrollment speech, the utterance reads on the mixed speech, the Long Short Term Memory (LSTM) model reads on the feedforward neural network, and normalizing the speech features reads on obtaining a normalized vector.).
King teaches the use of a feedforward neural network and normalizing the speech features to reduce word error rate when background speech is present (Abstract, lines 6-10, "We expand on our previous work on device-directed speech detection in the far-field speech setting and introduce two approaches for robust acoustic modeling. Both methods are based on the idea of using an anchor word taken from the device directed speech."; Abstract, lines 16-19, "Results on an in-house dataset reveal that, in the presence of background speech, the proposed approaches can achieve up to 35% relative word error rate reduction.").
Parthasarathi, Le Roux, Rodriguez, Visser, Nakadai, and King are considered to be analogous to the claimed invention because they are in the same field of speech recognition systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Parthasarathi in view of Le Roux, Rodriguez, Visser, and Nakadai to incorporate the teachings of King to use a feedforward neural network and normalize the speech features.  Doing so would allow for the reduction of the word error rate when background speech is present.
Parthasarathi in view of Le Roux, Rodriguez, Visser, and Nakadai and further in view of King does not disclose: wherein the step of separately measuring the distance between the vector of each frame of the mixed speech in each vector dimension and the speech extractor of the corresponding vector dimension is replaced by separately measuring a distance between the normalized vector of each frame in each vector dimension and a preset speech extractor to obtain the mask of each frame of the mixed speech; wherein the speech belonging to the target speaker in the mixed speech is determined based on the mask of each frame of the mixed speech.
Yu teaches:
wherein the step of separately measuring the distance between the vector of each frame of the mixed speech in each vector dimension and the speech extractor of the corresponding vector dimension is replaced by separately measuring a distance between the normalized vector of each frame in each vector dimension and a preset speech extractor to obtain the mask of each frame of the mixed speech (Column 2, line 64 - column 3, line 7, "The technology described herein uses a multiple-output layer RNN to process an acoustic signal comprising speech from multiple speakers to trace an individual speaker's speech. The multiple-output layer RNN has multiple output layers, each of which is meant to trace one speaker (or noise) and represents the mask for that speaker (or noise). The output layer for each speaker (or noise) can have the same dimensions and can be normalized for each output unit across all output layers. The rest of the layers in the multiple-output layer RNN are shared across all the output layers."; The speech from multiple speakers reads on the mixed speech and an RNN output layer tracing one speaker and representing a mask reads on measuring a distance between a normalized vector of the mixed speech and the preset speech extractor to obtain a mask.);
wherein the speech belonging to the target speaker in the mixed speech is determined based on the mask of each frame of the mixed speech (Column 11, lines 54-59, "At step 460 an acoustic mask for a speaker-specific signal within the acoustic information is generated at one of the at least two output layers. In one aspect, an acoustic mask is generated at each output layer. The acoustic mask can be used to isolate a signal associated with a single speaker for further processing.").
Yu teaches generating a mask to isolate the speech of a single speaker in speech with multiple speakers in order to determine the content of a question spoken by a person when other people are also speaking (Column 4, lines 46-51, "The ASR model using a multiple-output layer RNN model described herein can process the inputted data to determine computer-usable information. For example, a query spoken by a user into a far-field microphone while multiple people in the room are talking may be processed to determine the content of the query (i.e., what the user is asking for).").
Parthasarathi, Le Roux, Rodriguez, Visser, Nakadai, King, and Yu are considered to be analogous to the claimed invention because they are in the same field of speech recognition systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Parthasarathi in view of Le Roux, Rodriguez, Visser, and Nakadai and further in view of King to incorporate the teachings of Yu to generate a mask to isolate the speech of a single speaker in speech with multiple speakers.  Doing so would allow for determining the content of a question spoken by a person when other people are also speaking (Yu, Column 4, lines 46-51).
Parthasarathi in view of Le Roux, Rodriguez, Visser, and Nakadai and further in view of King and Yu does not disclose: the preset speech extractor being a centroid of speech extractors of the target speaker obtained during a training process of a recognition network, the recognition network including the feed forward neural network and the deep neural network.
Variani teaches:
the preset speech extractor being a centroid of speech extractors of the target speaker obtained during a training process of a recognition network, the recognition network including the feed forward neural network and the deep neural network (Abstract, lines 1-9, "In this paper we investigate the use of deep neural networks (DNNs) for a small footprint text-dependent speaker verification task. At development stage, a DNN is trained to classify speakers at the frame level. During speaker enrollment, the trained DNN is used to extract speaker specific features from the last hidden layer. The average of these speaker features, or d-vector, is taken as the speaker model. At evaluation stage, a d-vector is extracted for each utterance and compared to the enrolled speaker model to make a verification decision."; Section 3.1, lines 14-20, "Once the DNN has been trained successfully, we use the accumulated output activations of the last hidden layer as a new speaker representation. That is, for every frame of a given utterance belonging to a new speaker, we compute the output activations of the last hidden layer using standard feedforward propagation in the trained DNN, and then accumulate those activations to form a new compact representation of that speaker, the d-vector."; The speaker model reads on the preset speech extractor, and the average of the speaker specific features reads on the centroid of speech extractors.).
Variani teaches using an average of speaker specific features, obtained through training a neural network, as a speaker model to determine if speech from a specific speaker occurs in an utterance in order to improve speaker verification performance (Abstract, lines 9-16, "Experimental results show the DNN based speaker verification system achieves good performance compared to a popular i-vector system on a small footprint text-dependent speaker verification task. In addition, the DNN based system is more robust to additive noise and outperforms the i-vector system at low False Rejection operating points. Finally the combined system outperforms the i-vector system by 14% and 25% relative in equal error rate (EER) for clean and noisy conditions respectively.").
Parthasarathi, Le Roux, Rodriguez, Visser, Nakadai, King, Yu, and Variani are considered to be analogous to the claimed invention because they are in the same field of speech recognition systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Parthasarathi in view of Le Roux, Rodriguez, Visser, and Nakadai and further in view of King and Yu to incorporate the teachings of Variani to use an average of speaker specific features, obtained through training a neural network, as a speaker model to determine if speech from a specific speaker occurs in an utterance.  Doing so would allow for improving speaker verification performance.
Claims 7 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Parthasarathi in view of Le Roux, Rodriguez, Visser, and Nakadai, and further in view of Hershey et al. ("Deep Clustering: Discriminative Embeddings for Segmentation and Separation"), hereinafter Hershey.
Regarding claim 7, as best understood based on the 35 U.S.C. 112(b) issues identified above, Parthasarathi in view of Le Roux, Rodriguez, Visser, and Nakadai discloses the mixed speech recognition method as claimed in claim 1, but does not specifically disclose: wherein the mixed speech includes speeches of multiple speakers, and after the separately mapping a spectrum of the enrollment speech and a spectrum of the mixed speech into a K-dimensional vector space, the method further comprises: processing the vector of each frame of the mixed speech in each vector dimension based on a clustering algorithm to determine, for each of the multiple speakers in the mixed speech, a centroid vector corresponding to the speaker in each vector dimension; and using a target centroid vector of the mixed speech in each vector dimension as the speech extractor of the target speaker in the corresponding vector dimension, the target centroid vector being a centroid vector with the smallest distance from the average vector of the enrollment speech in the same vector dimension.
Hershey teaches:
wherein the mixed speech includes speeches of multiple speakers, and after the separately mapping a spectrum of the enrollment speech and a spectrum of the mixed speech into a K-dimensional vector space, the method further comprises: processing the vector of each frame of the mixed speech in each vector dimension based on a clustering algorithm to determine, for each of the multiple speakers in the mixed speech, a centroid vector corresponding to the speaker in each vector dimension; and using a target centroid vector of the mixed speech in each vector dimension as the speech extractor of the target speaker in the corresponding vector dimension, the target centroid vector being a centroid vector with the smallest distance from the average vector of the enrollment speech in the same vector dimension (Section 1, lines 23-25, "In this work, we consider a more open and difficult task of speaker-independent separation of two or more speakers, with no special constraint on vocabulary and grammar."; Section 1, lines 80-92, "Learned feature transformations known as embeddings have recently been gaining significant interest in many fields. Unsupervised embeddings obtained by auto-associative deep networks, used with relatively simple clustering algorithms, have recently been shown to outperform spectral clustering methods in some cases.  In our framework a deep network assigns embedding vectors to each time-frequency region of the spectrogram, according to an objective function that minimizes the distances between embeddings of time-frequency bins dominated by the same source, while maximizing the distances between embeddings for those dominated by different sources. Thus the clusters in the embedding can represent the inferred spectral masking patterns of the sources, in a permutation free way."; The embedding vectors read on the vector dimensions, and using a function that minimizes the distances between embeddings of time-frequency bins dominated by the same source, while maximizing the distances between embeddings for those dominated by different sources reads on using a target centroid vector of the mixed speech in each vector dimension as the speech extractor.).
Hershey teaches using a clustering algorithm to separate speech from multiple speakers in order to improve the signal quality of the separated speech (Abstract, lines 10-20, "This yields a deep network-based analogue to spectral clustering, in that the embeddings form a low-rank pairwise affinity matrix that approximates the ideal affinity matrix, while enabling much faster performance. At test time, the clustering step “decodes” the segmentation implicit in the embeddings by optimizing K-means with respect to the unknown assignments. Preliminary experiments on single-channel mixtures from multiple speakers show that a speaker-independent model trained on two-speaker mixtures can improve signal quality for mixtures of held-out speakers by an average of 6dB. More dramatically, the same model does surprisingly well with three-speaker mixtures.").
Parthasarathi, Le Roux, Rodriguez, Visser, Nakadai, and Hershey are considered to be analogous to the claimed invention because they are in the same field of speech recognition systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Parthasarathi in view of Le Roux, Rodriguez, Visser, and Nakadai to incorporate the teachings of Hershey to use a clustering algorithm to separate speech from multiple speakers.  Doing so would improve the signal quality of the separated speech.
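For illustration only, the clustering step discussed above may be sketched as follows. The K-means clustering of embedding vectors follows the clustering step referenced in Hershey's abstract, while the selection of the centroid nearest the enrollment average follows the language of the claim limitation. This is a minimal sketch and not Hershey's implementation: the embeddings are taken as given rather than produced by a deep network, and the deterministic initialization, the squared Euclidean distance, the function names, and the sample values are assumptions introduced here.

```python
from typing import List

Vector = List[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors: List[Vector]) -> Vector:
    """Average a non-empty list of vectors in each dimension."""
    count = len(vectors)
    return [sum(v[d] for v in vectors) / count for d in range(len(vectors[0]))]


def k_means(points: List[Vector], k: int, iterations: int = 10) -> List[Vector]:
    """Plain K-means; the first k points seed the centroids (an assumption
    made here so the example is deterministic)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters: List[List[Vector]] = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean_vector(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids


def target_centroid(centroids: List[Vector], enrollment_mean: Vector) -> Vector:
    """Pick the centroid with the smallest distance to the enrollment average."""
    return min(centroids, key=lambda c: sq_dist(c, enrollment_mean))


# Hypothetical two-dimensional embeddings of a two-speaker mixture.
mixed = [[0.0, 0.0], [10.0, 10.0], [0.0, 2.0], [10.0, 12.0]]
centroids = k_means(mixed, k=2)
extractor = target_centroid(centroids, enrollment_mean=[9.0, 10.0])
print(centroids)  # [[0.0, 1.0], [10.0, 11.0]]
print(extractor)  # [10.0, 11.0]
```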
Regarding claim 17, as best understood based on the 35 U.S.C. 112(b) issues identified above, Parthasarathi in view of Le Roux, Rodriguez, Visser, and Nakadai discloses the mixed speech recognition apparatus as claimed in claim 11, but does not specifically disclose: wherein the mixed speech includes speeches of multiple speakers, and after the separately mapping a spectrum of the enrollment speech and a spectrum of the mixed speech into a K-dimensional vector space, the method further comprises: processing the vector of each frame of the mixed speech in each vector dimension based on a clustering algorithm to determine, for each of the multiple speakers in the mixed speech, a centroid vector corresponding to the speaker in each vector dimension; and using a target centroid vector of the mixed speech in each vector dimension as the speech extractor of the target speaker in the corresponding vector dimension, the target centroid vector being a centroid vector with the smallest distance from the average vector of the enrollment speech in the same vector dimension.
Hershey teaches:
wherein the mixed speech includes speeches of multiple speakers, and after the separately mapping a spectrum of the enrollment speech and a spectrum of the mixed speech into a K-dimensional vector space, the method further comprises: processing the vector of each frame of the mixed speech in each vector dimension based on a clustering algorithm to determine, for each of the multiple speakers in the mixed speech, a centroid vector corresponding to the speaker in each vector dimension; and using a target centroid vector of the mixed speech in each vector dimension as the speech extractor of the target speaker in the corresponding vector dimension, the target centroid vector being a centroid vector with the smallest distance from the average vector of the enrollment speech in the same vector dimension (Section 1, lines 23-25, "In this work, we consider a more open and difficult task of speaker-independent separation of two or more speakers, with no special constraint on vocabulary and grammar."; Section 1, lines 80-92, "Learned feature transformations known as embeddings have recently been gaining significant interest in many fields. Unsupervised embeddings obtained by auto-associative deep networks, used with relatively simple clustering algorithms, have recently been shown to outperform spectral clustering methods in some cases.  In our framework a deep network assigns embedding vectors to each time-frequency region of the spectrogram, according to an objective function that minimizes the distances between embeddings of time-frequency bins dominated by the same source, while maximizing the distances between embeddings for those dominated by different sources. Thus the clusters in the embedding can represent the inferred spectral masking patterns of the sources, in a permutation free way."; The embedding vectors read on the vector dimensions, and using a function that minimizes the distances between embeddings of time-frequency bins dominated by the same source, while maximizing the distances between embeddings for those dominated by different sources reads on using a target centroid vector of the mixed speech in each vector dimension as the speech extractor.).
Hershey teaches using a clustering algorithm to separate speech from multiple speakers in order to improve the signal quality of the separated speech (Abstract, lines 10-20, "This yields a deep network-based analogue to spectral clustering, in that the embeddings form a low-rank pairwise affinity matrix that approximates the ideal affinity matrix, while enabling much faster performance. At test time, the clustering step “decodes” the segmentation implicit in the embeddings by optimizing K-means with respect to the unknown assignments. Preliminary experiments on single-channel mixtures from multiple speakers show that a speaker-independent model trained on two-speaker mixtures can improve signal quality for mixtures of held-out speakers by an average of 6dB. More dramatically, the same model does surprisingly well with three-speaker mixtures.").
Parthasarathi, Le Roux, Rodriguez, Visser, Nakadai, and Hershey are considered to be analogous to the claimed invention because they are in the same field of speech recognition systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Parthasarathi in view of Le Roux, Rodriguez, Visser, and Nakadai to incorporate the teachings of Hershey to use a clustering algorithm to separate speech from multiple speakers.  Doing so would improve the signal quality of the separated speech.
Claims 8 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Parthasarathi in view of Le Roux, Rodriguez, Visser, and Nakadai, and further in view of Chenier et al. (US Patent No. 10,192,553), hereinafter Chenier.
Regarding claim 8, as best understood based on the 35 U.S.C. 112(b) issues identified above, Parthasarathi in view of Le Roux, Rodriguez, Visser, and Nakadai discloses the mixed speech recognition method as claimed in claim 1, but does not specifically disclose: wherein after the calculating an average vector of the enrollment speech in each vector dimension based on the vector of each frame of the enrollment speech in each vector dimension, the method further comprises: separately comparing a distance between M preset speech extractors and the average vector of the enrollment speech in each vector dimension, M being greater than 1; and using a speech extractor with the smallest distance from the average vector of the enrollment speech in a vector dimension in the M preset speech extractors as the speech extractor of the target speaker in the corresponding vector dimension.
Chenier teaches:
wherein after the calculating an average vector of the enrollment speech in each vector dimension based on the vector of each frame of the enrollment speech in each vector dimension, the method further comprises: separately comparing a distance between M preset speech extractors and the average vector of the enrollment speech in each vector dimension, M being greater than 1; and using a speech extractor with the smallest distance from the average vector of the enrollment speech in a vector dimension in the M preset speech extractors as the speech extractor of the target speaker in the corresponding vector dimension (Column 29, lines 34-56, "Speaker identification system 226, in some embodiments, may correspond to any suitable device/system capable of identifying a particular person's voice from an audio signal. Speaker identification system 226 may determine whether a current voice being used to speak matches known voice biometric data associated with a particular individual's voice. In some embodiments, voice biometric data may be stored within user accounts module 268 for various individuals having a user profile stored thereby. For example, individual 2 may have a user account on computing system 200 (e.g., stored within user accounts module 268), which may be associated with electronic device 100a. Stored within the user account may be voice biometric data associated with a voice of individual 2. Therefore, when an utterance, such as utterance 4, is detected by electronic device 100a, and subsequently when audio data representing that utterance is received by computing system 200, speaker identification system 226 may determine whether the voice used to speak utterance 4 matches to at least a predefined confidence threshold the stored voice biometric information associated with individual 2 stored by their user account. If so, then this may indicate that individual 2 is the likely speaker of utterance 4."; Known voice biometric data stored for various individuals reads on M preset speech extractors, and matching an utterance to the stored voice biometric information within a predefined confidence threshold reads on using a speech extractor with the smallest distance from the average vector of the enrollment speech in a vector dimension as the speech extractor.).
Chenier teaches using stored voice biometric data for multiple speakers to identify the speaker in a speech sample by matching the stored voice information to the voice information from the speech sample in order to separate user speech from speech from other sources (Column 6, lines 31-36, "In some embodiments, determining whether the sounds correct to speech or non-speech may include performing speaker identification techniques to the audio data to determine whether the sounds correspond to a known voice, such as a voice of an individual associated with the recipient device."; Column 6, lines 44-55, " However, non-speech may also correspond to speech that originates from a non-human source, such as a television, radio, speaker, or other audio output device. As an illustrative example, a radio may be currently playing in the vicinity of a recipient device when a communications session is established. In this particular scenario, the speech activity detection system may determine that audio data representing sounds received by the recipient device may correspond to speech, as it may “hear” the speech output by the radio. However, this speech is not associated with an individual interacting with the recipient device.").
Parthasarathi, Le Roux, Rodriguez, Visser, Nakadai, and Chenier are considered to be analogous to the claimed invention because they are in the same field of speech recognition systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Parthasarathi in view of Le Roux, Rodriguez, Visser, and Nakadai to incorporate the teachings of Chenier to use stored voice biometric data for multiple speakers to identify the speaker in a speech sample by matching the stored voice information to the voice information from the speech sample.  Doing so would allow for the separation of user speech from speech from other sources.
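For illustration only, the selection step recited in claim 8 can be sketched as follows. This is a minimal sketch of one possible reading of the claim language, not an implementation taken from Chenier or from the application: the Euclidean distance metric, the function names `average_vector` and `nearest_extractor`, and the sample values are all assumptions introduced here.

```python
import math

def average_vector(frames):
    """Mean of the per-frame vectors, computed dimension by dimension."""
    count = len(frames)
    return [sum(column) / count for column in zip(*frames)]

def nearest_extractor(extractors, target):
    """Index of the preset extractor closest to target.

    Euclidean distance is an assumption; the claim only recites "a distance".
    """
    distances = [math.dist(candidate, target) for candidate in extractors]
    return min(range(len(extractors)), key=distances.__getitem__)

# Hypothetical data: three enrollment frames and M = 3 preset extractors.
enrollment_frames = [[0.9, 0.1], [1.1, -0.1], [1.0, 0.0]]
preset_extractors = [[-1.0, 0.0], [1.0, 0.2], [0.0, 1.0]]

mean_vector = average_vector(enrollment_frames)
selected = nearest_extractor(preset_extractors, mean_vector)
print("selected extractor index:", selected)
```

Under this reading, the preset extractor with the smallest distance to the enrollment average is taken as the target speaker's extractor, whereas the passage cited from Chenier describes matching against stored voice biometric data to at least a predefined confidence threshold.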
Regarding claim 18, as best understood based on the 35 U.S.C. 112(b) issues identified above, Parthasarathi in view of Le Roux, Rodriguez, Visser, and Nakadai discloses the mixed speech recognition apparatus as claimed in claim 11, but does not specifically disclose: wherein after the calculating an average vector of the enrollment speech in each vector dimension based on the vector of each frame of the enrollment speech in each vector dimension, the method further comprises: separately comparing a distance between M preset speech extractors and the average vector of the enrollment speech in each vector dimension, M being greater than 1; and using a speech extractor with the smallest distance from the average vector of the enrollment speech in a vector dimension in the M preset speech extractors as the speech extractor of the target speaker in the corresponding vector dimension.
Chenier teaches:
wherein after the calculating an average vector of the enrollment speech in each vector dimension based on the vector of each frame of the enrollment speech in each vector dimension, the method further comprises: separately comparing a distance between M preset speech extractors and the average vector of the enrollment speech in each vector dimension, M being greater than 1; and using a speech extractor with the smallest distance from the average vector of the enrollment speech in a vector dimension in the M preset speech extractors as the speech extractor of the target speaker in the corresponding vector dimension (Column 29, lines 34-56, "Speaker identification system 226, in some embodiments, may correspond to any suitable device/system capable of identifying a particular person's voice from an audio signal. Speaker identification system 226 may determine whether a current voice being used to speak matches known voice biometric data associated with a particular individual's voice. In some embodiments, voice biometric data may be stored within user accounts module 268 for various individuals having a user profile stored thereby. For example, individual 2 may have a user account on computing system 200 (e.g., stored within user accounts module 268), which may be associated with electronic device 100a. Stored within the user account may be voice biometric data associated with a voice of individual 2. Therefore, when an utterance, such as utterance 4, is detected by electronic device 100a, and subsequently when audio data representing that utterance is received by computing system 200, speaker identification system 226 may determine whether the voice used to speak utterance 4 matches to at least a predefined confidence threshold the stored voice biometric information associated with individual 2 stored by their user account. If so, then this may indicate that individual 2 is the likely speaker of utterance 4."; Known voice biometric data stored for various individuals reads on M preset speech extractors, and matching an utterance to the stored voice biometric information within a predefined confidence threshold reads on using a speech extractor with the smallest distance from the average vector of the enrollment speech in a vector dimension as the speech extractor.).
Chenier teaches using stored voice biometric data for multiple speakers to identify the speaker in a speech sample by matching the stored voice information to the voice information from the speech sample in order to separate user speech from speech from other sources (Column 6, lines 31-36, "In some embodiments, determining whether the sounds correct to speech or non-speech may include performing speaker identification techniques to the audio data to determine whether the sounds correspond to a known voice, such as a voice of an individual associated with the recipient device."; Column 6, lines 44-55, " However, non-speech may also correspond to speech that originates from a non-human source, such as a television, radio, speaker, or other audio output device. As an illustrative example, a radio may be currently playing in the vicinity of a recipient device when a communications session is established. In this particular scenario, the speech activity detection system may determine that audio data representing sounds received by the recipient device may correspond to speech, as it may “hear” the speech output by the radio. However, this speech is not associated with an individual interacting with the recipient device.").
Parthasarathi, Le Roux, Rodriguez, Visser, Nakadai, and Chenier are considered to be analogous to the claimed invention because they are in the same field of speech recognition systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Parthasarathi in view of Le Roux, Rodriguez, Visser, and Nakadai to incorporate the teachings of Chenier to use stored voice biometric data for multiple speakers to identify the speaker in a speech sample by matching the stored voice information to the voice information from the speech sample.  Doing so would allow for the separation of user speech from speech from other sources.
Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Parthasarathi in view of Le Roux, Rodriguez, Visser, and Nakadai and further in view of Chen et al. (“Deep Attractor Network for Single-microphone Speaker Separation”), hereinafter Chen, and Watanabe et al. (US Patent Application Publication No. 2018/0261225), hereinafter Watanabe.
Regarding claim 10, as best understood based on the 35 U.S.C. 112(b) issues identified above, Parthasarathi in view of Le Roux, Rodriguez, Visser, and Nakadai discloses the mixed speech recognition method as claimed in claim 1, but does not specifically disclose: wherein the deep neural network is composed of four layers of bidirectional long short-term memory networks, each layer of the bidirectional long short-term memory network has 600 nodes.
Chen teaches:
wherein the deep neural network is composed of four layers of bidirectional long short-term memory networks, each layer of the bidirectional long short-term memory network has 600 nodes (Section 3.1, lines 18-19, "The network contained 4 Bi-directional LSTM layers with 600 hidden units in each layer.").
Chen teaches using a neural network with four bidirectional long short-term memory (LSTM) layers with 600 nodes in each layer in order to perform speech separation with improvement over previous methods (Abstract, lines 5-8, "We propose a novel deep learning framework for single channel speech separation by creating attractor points in high dimensional embedding space of the acoustic signals which pull together the time-frequency bins corresponding to each source."; Abstract, lines 18-20, "We evaluated our system on Wall Street Journal dataset and show 5.49% improvement over the previous state-of-the-art methods.").
Parthasarathi, Le Roux, Rodriguez, Visser, Nakadai, and Chen are considered to be analogous to the claimed invention because they are in the same field of speech recognition systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Parthasarathi in view of Le Roux, Rodriguez, Visser, and Nakadai to incorporate the teachings of Chen to use a neural network with four bidirectional long short-term memory (LSTM) layers with 600 nodes in each layer.  Doing so would allow for performing speech separation with improvement over previous methods.
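For illustration only, the size of a network with the recited topology (four bidirectional LSTM layers, 600 nodes per layer) can be estimated with a short calculation. This is a rough sketch under stated assumptions: the input feature dimension of 129 and the single-bias-per-gate parameter formula are assumptions introduced here, and the helper names are hypothetical; none of these values are recited in the claim.

```python
def lstm_direction_params(input_size, hidden_size):
    """Parameters of one LSTM direction: 4 gates, each with input weights,
    recurrent weights, and one bias vector (single-bias convention assumed)."""
    return 4 * (hidden_size * (input_size + hidden_size) + hidden_size)

def bilstm_stack_params(input_size, hidden_size, num_layers):
    """Total parameters of a stacked bidirectional LSTM."""
    total = 0
    layer_input = input_size
    for _ in range(num_layers):
        # Two directions per layer, each seeing the same layer input.
        total += 2 * lstm_direction_params(layer_input, hidden_size)
        # The next layer receives the concatenated forward/backward outputs.
        layer_input = 2 * hidden_size
    return total

ASSUMED_INPUT_DIM = 129  # assumption, not taken from the claim
total_params = bilstm_stack_params(ASSUMED_INPUT_DIM, hidden_size=600, num_layers=4)
print("approximate parameter count:", total_params)
```

The estimate covers only the recurrent layers; any output or embedding layer following the stack would add to it.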
Parthasarathi in view of Le Roux, Rodriguez, Visser, and Nakadai and further in view of Chen does not disclose: a value of K is 40.
Watanabe teaches:
a value of K is 40 (Paragraph 0078, lines 1-3, "Some embodiments use 40-dimensional log Mel filterbank coefficients as an input feature vector for both noisy and enhanced speech signals (DO=40).").
Watanabe teaches using a 40-dimension vector to represent speech features to allow for speech recognition in noisy environments (Paragraph 0007, lines 1-3, "It is another object of some embodiments to provide the speech recognition system suitable for recognizing speech in noisy environments.").
Parthasarathi, Le Roux, Rodriguez, Visser, Nakadai, Chen, and Watanabe are considered to be analogous to the claimed invention because they are in the same field of speech recognition systems.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Parthasarathi in view of Le Roux, Rodriguez, Visser, and Nakadai and further in view of Chen to incorporate the teachings of Watanabe to use a 40-dimension vector to represent speech features.  Doing so would allow for speech recognition in noisy environments.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to James Boggs, whose telephone number is (571) 272-2968. The examiner can normally be reached M-F, 8:00 AM - 5:00 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel Washburn, can be reached at (571) 272-5551. The fax phone number for the organization where this application or proceeding is assigned is (571) 273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/JAMES BOGGS/
Examiner, Art Unit 2657

/DANIEL C WASHBURN/
Supervisory Patent Examiner, Art Unit 2657