DETAILED ACTION

Introduction
1.	This office action is in response to Applicant’s submission filed on 05/10/2022. Claims 1-14 are pending in the application. As such, Claims 1-14 have been examined.

Notice of Pre-AIA or AIA Status
2.	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment
3.	The response filed on 05/10/2022 has been accepted and considered in this Office Action. Claims 1-14 have been examined. Claims 1-11 are no longer interpreted under 35 U.S.C. 112(f) in view of the corresponding amendments made to said claims.

Response to Arguments
4.	With respect to the rejections of Claims 1-9 and 12 under 35 U.S.C. 103 as being unpatentable over (a) Spengler et al. (U.S. Patent Application Publication: 2007/0288242), in view of (b) Lovekin et al. (J. M. Lovekin, R. E. Yantorno, K. R. Krishnamachari, D. S. Benincasa and S. J. Wenndt, “Developing usable speech criteria for speaker identification technology,” 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221), 2001, pp. 421-424 vol.1), hereinafter referred to as SPENGLER and LOVEKIN, Applicant appears to present the following position in the Remarks filed 05/10/2022, pp. 7-9:
	With respect to Claims 1-3, 9, and 12 Applicant appears to argue as follows:
“…For example, the Office Action asserts that Spengler describes subdividing a speech signal into a set of frames (Office Action at p. 6). However, Spengler only includes descriptions of "framed speech" and aligning speech/utterances with an observation frame or window (Spengler at ¶¶[0061]-[0063]). There is nothing in Spengler that describes whether the observation frames or windows overlap or not. Therefore, it is believed that Spengler does not disclose or suggest the newly amended features of Claim 1. Lovekin is also silent with respect to overlapping frames and does not cure the deficiencies in Spengler. Therefore, no combination of Spengler and Lovekin discloses or suggests every feature recited in amended Claim 1, and amended Claim 1 is believed to be in condition for allowance together with any claim depending therefrom. Amended Claim 12 is likewise believed to be in condition for allowance. Withdrawal of the rejection of Claims 1-3, 9, and 12 under 35 U.S.C. 103 is respectfully requested. As all other rejections of record rely upon Spengler for describing the above distinguished features, and the above-distinguished features are not disclosed or suggested by Spengler, alone or in combination with any other art of record, it is respectfully submitted that a prima facie case of obviousness cannot be maintained. Therefore, it is respectfully requested that the rejection of Claims 4-8 under 35 U.S.C. 103 be withdrawn.
It is believed that this response renders the objection to Claims 10-11 moot, and that these claims are in condition for allowance. It is also believed that new Claims 13-14 are in condition for allowance at least by virtue of their dependency from amended Claim 1…”

In response, Examiner respectfully notes that the combination of SPENGLER and LOVEKIN clearly and unambiguously discloses the argued limitation concerning “overlapping frames.” For example, the combination of SPENGLER and LOVEKIN discloses, see e.g., “…receiv[ing] a framed speech signal…” and how “…the actual speech/utterance can be aligned in an observation frame or window using, for example, a convolution-based algorithm to enhance analysis of the speech. To perform the alignment, the user-speech template can be divided into a plurality of time slices or vectors….” (See e.g., SPENGLER paras. 61-63, Figs. 5, 6, 8-12, 23). Further, as evidenced in SPENGLER par. 59, the capability for “overlapping frames” can be observed in how “…an initial integrity check, for example, can include performing a dynamic range utilization analysis on the sampled (speech) data to determine if the speech is below a preselected minimum threshold level indicating the dynamic range of speech was used effectively, i.e., the utterance was too quiet. Dynamic range utilization can be performed by first over-sampling and then down-sampling the data signal to increase dynamic range and decrease noise. For example, if a sample rate of 48000 Hz is supported by the selected audio hardware, the recording software/program product, e.g., audio handler 35 or speech recognizer 31, can sample at this rate, and add 6 adjacent samples together…” Examiner directs Applicant to SPENGLER’s support for performing “first over-sampling and then down-sampling the data signal…” In this respect, one of ordinary skill in the art of Digital Signal Processing would recognize that, in a sequence of frames obtained after down-sampling, the sampled data signal (e.g., the “framed speech”) can inherently attain overlapping capabilities.
Examiner therefore respectfully disagrees and maintains that the combination of SPENGLER and LOVEKIN is not precluded from teaching the newly amended limitations comprising “overlapping” as broadly presented.
For at least the reasons provided above, Applicant’s arguments are found not persuasive, and the rejections of Claims 1-3, 9, and 12 under 35 U.S.C. 103 are sustained and updated accordingly.
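For illustration of the terminology only, the two signal-processing operations discussed above (adding 6 adjacent samples of a 48000 Hz signal together, and subdividing a signal into frames that may or may not overlap) can be sketched as follows. The function names, the toy signal, and the frame length and hop values are assumptions introduced for this sketch and are not taken from SPENGLER, LOVEKIN, or the application.

```python
def downsample_by_summing(samples, factor=6):
    """Add each run of `factor` adjacent samples together (e.g. 48000 Hz -> 8000 Hz)."""
    usable = len(samples) - len(samples) % factor  # drop an incomplete trailing group
    return [sum(samples[i:i + factor]) for i in range(0, usable, factor)]


def split_into_frames(samples, frame_len, hop):
    """Subdivide a signal into frames; consecutive frames overlap only when hop < frame_len."""
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    last_start = len(samples) - frame_len
    return [samples[start:start + frame_len] for start in range(0, last_start + 1, hop)]


# Hypothetical toy data: 48 samples standing in for a signal sampled at 48000 Hz.
signal_48k = list(range(48))
signal_8k = downsample_by_summing(signal_48k)                           # 8 samples remain
overlapping = split_into_frames(signal_8k, frame_len=4, hop=2)          # frames share 2 samples
non_overlapping = split_into_frames(signal_8k, frame_len=4, hop=4)      # frames share none
```

In this sketch, whether frames overlap is determined solely by the hop chosen at the framing step; the down-sampling step only changes how many samples are available to be framed.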
With respect to the art rejections of dependent Claims 4-8 under 35 U.S.C. 103, to the extent that said claims are argued on the same rationale presented in the Remarks filed 05/10/2022, Examiner respectfully notes as follows. For completeness, should these claims likewise be traversed for reasons similar to those presented for independent Claim 1, Examiner respectfully directs Applicant to the reasons provided above in the response directed to Claims 1-3, 9, and 12. For at least the same reasons, Examiner likewise respectfully disagrees, and Applicant’s arguments are also found not persuasive.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

5.	Claims 1-3, 9, and 12 are rejected under 35 U.S.C. 103 as being unpatentable over (a) Spengler et al. (U.S. Patent Application Publication: 2007/0288242), in view of (b) Lovekin et al. (J. M. Lovekin, R. E. Yantorno, K. R. Krishnamachari, D. S. Benincasa and S. J. Wenndt, “Developing usable speech criteria for speaker identification technology,” 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221), 2001, pp. 421-424 vol.1), hereinafter referred to as SPENGLER and LOVEKIN.
	With respect to Claim 1, SPENGLER discloses:
1. A speaker recognition system for assessing the identity of a speaker through a speech signal based on speech uttered by said speaker, the system comprising: 

 a processor configured to (See e.g., “… when implemented by … processor,…” SPENGLER paras. 53-55, 59-64, Figs. 5, 6, 8-12, 16, 23) overlapping frames (See e.g., “…receive a framed speech signal…” and “…the actual speech/utterance can be aligned in an observation frame or window using, for example, a convolution-based algorithm to enhance analysis of the speech. To perform the alignment, the user-speech template can be divided into a plurality of time slices or vectors….” and how “overlapping frames” can be observed in performing “…an initial integrity check, for example, can include performing a dynamic range utilization analysis on the sampled (speech) data to determine if the speech is below a preselected minimum threshold level indicating the dynamic range of speech was used effectively, i.e., the utterance was too quiet. Dynamic range utilization can be performed by first over-sampling and then down-sampling the data signal to increase dynamic range and decrease noise. For example, if a sample rate of 48000 Hz is supported by the selected audio hardware, the recording software/program product, e.g., audio handler 35 or speech recognizer 31, can sample at this rate, and add 6 adjacent samples together…” See e.g., SPENGLER paras. 59-63, Fig. 5, 6, 8-12, 23); 
perform spectral analysis of (See e.g., “…A Short Time Fourier transformation is then performed on each time slice to form Fourier transformed data defining a spectrograph…taking the log of the absolute value of the complex data. The converted amplitude values are then thresholded by a centering 
threshold to normalize the energy values within each time slice. The Sum of each time slice, equivalent to the geometric mean of the frequency bins for the respective time slice… Mean positions of peaks of the convolution are then determined to identify the center of the speech, and the user-speech template is cyclically shifted to center the speech in the observation frame or window…,” “…convert sampled data to frequency domain…perform speech alignment…determine noise contour…perform noise removal process…,” “…to perform the operations of determining a background noise contour for 
noise within the observation frame or window and removing the noise from within and around speech formants of the aligned user speech template using a nonlinear noise removal process such as, for example, by thresholding bins of equalized portions of the user-speech template…by first estimating noise power (see FIG. 8) in each bin for each of a plurality of time slices, e.g., twenty, on either side of the speech near and preferably outside the boundaries of the speech for each of the frequency ranges defining the bins, and equalizing the energy values of the each bin across each of the frequency ranges in response to the estimated noise power to thereby “flatten” the spectrum…” See e.g., SPENGLER paras. 59-63, Figs. 3-6, 8-12, 23); 

(See e.g., “…develop a set of feature vectors…” “…operation of developing a set of feature vectors representing energy of the frequency content of the user-speech template to determine a unique pattern…” See e.g., SPENGLER paras. 61-64, Figs. 3-5, 6, 8-15, 23); and

[assessing the identity of the speaker] (See e.g., “… when implemented by a 1.6 GHZ, Pentium IV processor, Hidden Markov Model training on an utterance encapsulated within a 1.5 second frame can be performed in less than approximately 400 milliseconds for each word/utterance and recognition of such word/utterance (command annunciation) using a Hidden Markov Model recognition engine/classifier can be performed in less than 250 milliseconds…,” “…the recognize mode can include noise removal, feature extraction, speech alignment, and speech recognition functions…,” “…the speech actuated command program product 51 also provide a core speech recognizer engine/classifier which can include both Hidden Markov and Neural Net modeling and models which can recognize sound patterns of the speech/utterances…,” “…associate an index and/or function or state to the speech model…” See e.g., SPENGLER paras. 53-55, 59-64, Figs. 5, 6, 8-12, 16, 23).
SPENGLER does not explicitly disclose, but LOVEKIN discloses, a [speaker recognition system] and [assessing the identity of the speaker] (“…speaker identification (SID)…criteria for usable speech frames for SID. Voiced speech, of which usable speech is entirely comprised, is shown to be information rich for SID…performing a frame based (Target to Interferer Ratio) TIR as opposed to an overall TIR. Usable frames of speech are separated and collected into a file for each speaker by calculating the TIR for each frame individually to determine if it exceeds a predetermined threshold…,” and how “…it is meaningful to extract only voiced frames from the full speaker utterances, and assess the performance of the SID system with these segments to approximate the performance with usable segments. The voiced-only speech is extracted using the Spectral Flatness Method (SFM) [6]…” See e.g., LOVEKIN, Abstract, §§ 2, 3).
SPENGLER and LOVEKIN can be considered analogous art because they are from a similar field of endeavor in speech processing techniques and applications. Thus, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of SPENGLER in view of LOVEKIN’s techniques comprising, see e.g., speaker identification architectures comprising “…speaker identification (SID)…criteria for usable speech frames for SID…” such that “…voiced speech, of which usable speech is entirely comprised, is shown to be information rich for SID…performing a frame based (Target to Interferer Ratio) TIR as opposed to an overall TIR.…” in order to advantageously enhance speaker identification since, see e.g., “…it is meaningful to extract only voiced frames from the full speaker utterances, and assess the performance of the SID system with these segments to approximate the performance with usable segments. The voiced-only speech is extracted using the Spectral Flatness Method (SFM) [6]…,” (See e.g., LOVEKIN, Abstract, §§ 2, 3).

With respect to Claim 2, SPENGLER in view of LOVEKIN discloses:
2. The system of claim 1, wherein the processor is further configured to (See e.g., “…to perform the operations of determining a background noise contour for noise within the observation frame or window and removing the noise from within and around speech formants of the aligned user speech template using a nonlinear noise removal process such as, for example, by thresholding bins of equalized portions of the user-speech template…by first estimating noise power (see FIG. 8) in each bin for each of a plurality of time slices, e.g., twenty, on either side of the speech near and preferably outside the boundaries of the speech for each of the frequency ranges defining the bins, and equalizing the energy values of the each bin across each of the frequency ranges in response to the estimated noise power to thereby “flatten” the spectrum…” See e.g., SPENGLER paras. 59-63, Figs. 3-6, 8-12, 23) a flatness of [[the]] a frequency spectrum of such frame (See e.g., “…“…it is meaningful to extract only voiced frames  from the full speaker utterances, and assess the performance of the SID system with these segments to approximate the performance with usable segments. The voiced-only speech is extracted using the Spectral Flatness Method (SFM) [6]…”  See e.g., LOVEKIN, Abstract, §§ 2, 3). 

With respect to Claim 3, SPENGLER in view of LOVEKIN discloses:
3. The system of claim 2, wherein said processor is further (See e.g., “…to perform the operations of determining a background noise contour for noise within the observation frame or window and removing the noise from within and around speech formants of the aligned user speech template using a nonlinear noise removal process such as, for example, by thresholding bins of equalized portions of the user-speech template…by first estimating noise power (see FIG. 8) in each bin for each of a plurality of time slices, e.g., twenty, on either side of the speech near and preferably outside the boundaries of the speech for each of the frequency ranges defining the bins, and equalizing the energy values of the each bin across each of the frequency ranges in response to the estimated noise power to thereby “flatten” the spectrum…” See e.g., SPENGLER paras. 61-63, Figs. 3-6, 8-12, 23) (See e.g., “…“…it is meaningful to extract only voiced frames  from the full speaker utterances, and assess the performance of the SID system with these segments to approximate the performance with usable segments. The voiced-only speech is extracted using the Spectral Flatness Method (SFM) [6]…”  See e.g., LOVEKIN, Abstract, §§ 2, 3). 

With respect to Claim 9, SPENGLER in view of LOVEKIN discloses:
9. The system of claim 1, wherein the processor is further configured to (See e.g., “…using a Hidden Markov Model recognition engine/classifier…,” “…the recognize mode can include noise removal, feature extraction, speech alignment, and speech recognition functions…,” “…the speech 
actuated command program product 51 also provide a core speech recognizer engine/classifier which can include both Hidden Markov and Neural Net modeling and models which can recognize sound patterns of the speech/utterances…,” “…associate an index and/or function or state to the speech model…” See e.g., SPENGLER paras. 53-55, 59-64, Figs. 5, 6, 8-12, 16, 23) [a likelihood that [[the]] a speaker having uttered said speech is said known speaker, said generating the score being based on said audio features extracted from the frames which have not been discarded]. 
SPENGLER does not explicitly disclose, but LOVEKIN discloses, capabilities for the classification module of SPENGLER to be configured with speaker identification functionalities by using speaker’s models for testing and training in order to be [a likelihood that [[the]] a speaker having uttered said speech is said known speaker (See e.g., how “…speech from any of …previously trained speakers using different speech samples, which will then 
compare the given speech to the speaker’s models in an attempt to find a match… Voiced-only segments extracted at 37 Spectral Flatness Method (SFM) were used for this purpose in place of the actual usable segments. Table 1 shows the different training and testing situations, with accompanying speaker identification results… SID accuracy for normal when voiced only segments were used for training and testing, approximately 80% speaker ID accuracy was achieved. It was realized that less information was available when removing the unvoiced portions of the speech. Correct identification of 75.8%...,” “…38 speakers are separated into two groups of 19 speakers each. Group A contains 14 female speakers and 5 male speakers. Group B contains 19 male speakers… Group A(i) + Group A(i+l)… Group B(i) + Group B(i+l)… Group A(i) + Group B(i)…” See e.g., LOVEKIN, Abstract, §§ 2, 3, Fig. 1, Table 1), said generating the score being based on said audio features extracted from the frames which have not been discarded] (“…speaker identification (SID)…criteria for usable speech frames for SID. Voiced speech, 
of which usable speech is entirely comprised, is shown to be information rich for SID…performing a frame based (Target to Interferer Ratio) TIR as opposed to an overall TIR. Usable frames of speech are separated and collected into a file for each speaker by calculating the TIR for each frame individually to determine if it exceeds a predetermined threshold…,” and how “…it is meaningful to extract only voiced frames  from the full speaker utterances, and assess the performance of the SID system with these segments to approximate the performance with usable segments. The voiced-only speech is extracted using the Spectral Flatness Method (SFM) [6]…”  See e.g., LOVEKIN, Abstract, §§ 2, 3, Fig. 1, Table 1).
SPENGLER and LOVEKIN can be considered analogous art because they are from a similar field of endeavor in speech processing techniques and applications. Thus, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of SPENGLER’s core speech recognizer engine/classifier with LOVEKIN’s techniques comprising, see e.g., a speaker identification architecture with speaker’s models in an attempt to find a match comprising “…speaker identification (SID)…criteria for usable speech frames for SID…” such that “…voiced speech, of which usable speech is entirely comprised, is shown to be information rich for SID…performing a frame based (Target to Interferer Ratio) TIR as opposed to an overall TIR.…” in order to advantageously attain speaker identification and classification capabilities since, see e.g., “…it is meaningful to extract only voiced frames from the full speaker utterances, and assess the performance of the SID system with these segments to approximate the performance with usable segments. The voiced-only speech is extracted using the Spectral Flatness Method (SFM) [6]…,” (See e.g., LOVEKIN, Abstract, §§ 2, 3).
With respect to Claim 12, SPENGLER discloses:
12. A method for assessing the identity of a speaker through a speech signal based on speech uttered by said speaker, the method comprising: 

subdividing, with a processor, (See e.g., “… when implemented by … processor,…” SPENGLER paras. 53-55, 59-64, Figs. 5, 6, 8-12, 16, 23) said speech signal over time into a set of overlapping frames (See e.g., “…receive a framed speech signal…” and “…the actual speech/utterance can be aligned in an observation frame or window using, for example, a convolution-based algorithm to enhance analysis of the speech. To perform the alignment, the user-speech template can be divided into a plurality of time slices or vectors….” and how “overlapping frames” can be observed in performing “…an initial integrity check, for example, can include performing a dynamic range utilization analysis on the sampled (speech) data to determine if the speech is below a preselected minimum threshold level indicating the dynamic range of speech was used effectively, i.e., the utterance was too quiet. Dynamic range utilization can be performed by first over-sampling and then down-sampling the data signal to increase dynamic range and decrease noise. For example, if a sample rate of 48000 Hz is supported by the selected audio hardware, the recording software/program product, e.g., audio handler 35 or speech recognizer 31, can sample at this rate, and add 6 adjacent samples together…” See e.g., SPENGLER paras. 59-63, Fig. 5, 6, 8-12, 23)); 
spectrally analyzing, with the processor, the frames of the set and discarding frames affected by noise and frames which do not comprise a speech based on such spectral analysis of the frames (See e.g., “…A Short Time Fourier transformation is then performed on each time slice to form Fourier transformed data defining a spectrograph…taking the log of the absolute value of the complex data. The converted amplitude values are then thresholded by a centering 
threshold to normalize the energy values within each time slice. The Sum of each time slice, equivalent to the geometric mean of the frequency bins for the respective time slice… Mean positions of peaks of the convolution are then determined to identify the center of the speech, and the user-speech template is cyclically shifted to center 
the speech in the observation frame or window…,” “…convert sampled data to frequency domain…perform speech alignment…determine noise contour…perform noise removal process…,” “…to perform the operations of determining a background noise contour for noise within the observation frame or window and removing the noise from within and around speech formants of the aligned user speech template using a nonlinear noise removal process such as, for example, by thresholding bins of equalized portions of the user-speech template…by first estimating noise power (see FIG. 8) in each bin for each of a plurality of time slices, e.g., twenty, on either side of the speech near and preferably outside the boundaries of the speech for each of the frequency ranges defining the bins, and equalizing the energy values of the each bin across each of the frequency ranges in response to the estimated noise power to thereby “flatten” the spectrum…” See e.g., SPENGLER paras. 59-63, Figs. 3-6, 8-12, 23); 
extracting, with the processor, audio features from frames which have not been discarded (See e.g., “…develop a set of feature vectors…” “…operation of developing a set of feature vectors representing energy of the frequency content of the user-speech template to determine a unique pattern…” See 
e.g., SPENGLER paras. 61-64, Figs. 3-5, 6, 8-15, 23); and

 processing, with the processor, the audio features extracted from the frames which have not been discarded for [assessing the identity of the speaker] (See e.g., “… when implemented by a 1.6 GHZ, Pentium IV processor, Hidden Markov Model training on an utterance encapsulated within a 1.5 second frame can be performed in less than approximately 400 milliseconds for each word/utterance and recognition of such word/utterance (command annunciation) using a Hidden Markov Model recognition engine/classifier can be performed in less than 250 milliseconds…,” “…the recognize mode can include noise removal, feature extraction, speech alignment, and speech recognition functions…,” “…the speech actuated command program product 51 also provide a core speech recognizer engine/classifier which can include both Hidden Markov and Neural Net modeling and models which can recognize sound patterns of the speech/utterances…,” “…associate an index and/or function or state to the speech model…” See e.g., SPENGLER paras. 53-55, 59-64, Figs. 5, 6, 8-12, 16, 23).

SPENGLER does not explicitly disclose, but LOVEKIN discloses, a [speaker recognition system and method] and [assessing the identity of the speaker] (“…speaker identification (SID)…criteria for usable speech frames for SID. Voiced speech, of which usable speech is entirely comprised, is shown to be information rich for SID…performing a frame based (Target to Interferer Ratio) TIR as opposed to an overall TIR. Usable frames of speech are separated and collected into a file for each speaker by calculating the TIR for each frame individually to determine if it exceeds a predetermined threshold…,” and how “…it is meaningful to extract only voiced frames from the full speaker utterances, and assess the performance of the SID system with these segments to approximate the performance with usable segments. The voiced-only speech is extracted using the Spectral Flatness Method (SFM) [6]…” See e.g., LOVEKIN, Abstract, §§ 2, 3).
SPENGLER and LOVEKIN can be considered analogous art because they are from a similar field of endeavor in speech processing techniques and applications. Thus, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of SPENGLER in view of LOVEKIN’s techniques comprising, see e.g., speaker identification architectures comprising “…speaker identification (SID)…criteria for usable speech frames for SID…” such that “…voiced speech, of which usable speech is entirely comprised, is shown to be information rich for SID…performing a frame based (Target to Interferer Ratio) TIR as opposed to an overall TIR.…” in order to advantageously enhance speaker identification since, see e.g., “…it is meaningful to extract only voiced frames from the full speaker utterances, and assess the performance of the SID system with these segments to approximate the performance with usable segments. The voiced-only speech is extracted using the Spectral Flatness Method (SFM) [6]…,” (See e.g., LOVEKIN, Abstract, §§ 2, 3).

6.	Claims 4-8 are rejected under 35 U.S.C. 103 as being unpatentable over (a) Spengler et al. (U.S. Patent Application Publication: 2007/0288242), in view of (b) Lovekin et al. (J. M. Lovekin, R. E. Yantorno, K. R. Krishnamachari, D. S. Benincasa and S. J. Wenndt, “Developing usable speech criteria for speaker identification technology,” 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221), 2001, pp. 421-424 vol.1), and further in view of (c) Moattar et al. (M. H. Moattar and M. M. Homayounpour, “A simple but efficient real-time Voice Activity Detection algorithm,” 2009 17th European Signal Processing Conference, 2009, pp. 2549-2553), hereinafter referred to as SPENGLER, LOVEKIN, and MOATTAR.

With respect to Claim 4, SPENGLER in view of LOVEKIN discloses:
4. The system of claim 3, wherein the processor is further configured to (See e.g., “…to perform the operations of determining a background noise contour for noise within the observation frame or window and removing the noise from within and around speech formants of the aligned user speech template using a nonlinear noise removal process such as, for example, by thresholding bins of equalized portions of the user-speech template…by first estimating noise power (see FIG. 8) in each bin for each of a plurality of time slices, e.g., twenty, on either side of the speech near and preferably outside the boundaries of the speech for each of the frequency ranges defining the bins, and equalizing the energy values of the each bin across each of the frequency ranges in response to the estimated noise power to thereby “flatten” the spectrum…” See e.g., SPENGLER paras. 61-63, Figs. 3-6, 8-12, 23) (See e.g., “…it is meaningful to extract only voiced frames  from the full speaker utterances, and assess the performance of the SID system with these segments to approximate the performance with usable segments. The voiced-only speech is extracted using the Spectral Flatness Method (SFM) [6]…”  See e.g., LOVEKIN, Abstract, §§ 2, 3) by [generating a corresponding flatness parameter based on a ratio of: [[the]] a geometric mean of samples of [[the]] an energy density of said frame; to [[the]] an arithmetic mean of said samples of the energy density of said frame].
SPENGLER in view of LOVEKIN does not explicitly disclose, but MOATTAR discloses, [generating a corresponding flatness parameter based on a ratio of: [[the]] a geometric mean of samples of the energy density of said frame; to [[the]] an arithmetic mean of said samples of the energy density of said frame] (See e.g., flatness parameter based ratio capabilities according to, see e.g., “…Spectral Flatness Measure (SFM)…a measure of the noisiness of spectrum and is a good feature in Voiced/Unvoiced/Silence detection…feature is calculated using the following equation: SFM_dB = 10 log10(G_m / A_m) where A_m and G_m are arithmetic and geometric means of speech spectrum respectively…,” See e.g., MOATTAR Abstract, §§2, 3).
SPENGLER, LOVEKIN, and MOATTAR can be considered analogous art because they are from a similar field of endeavor in speech processing techniques and applications.  Thus, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of SPENGLER and LOVEKIN in view of MOATTAR’s techniques comprising, e.g., a voice activity detection (VAD) architecture using a Spectral Flatness Measure (SFM) algorithmic implementation, in order to advantageously improve the performance of speech/audio processing, considering, e.g., that SFM is a “…measure of the noisiness of spectrum and is a good feature in Voiced/Unvoiced/Silence detection …,” and that the method “…uses[ing] short-term features such as Spectral Flatness (SF) and Short-term Energy. This helps the method to be appropriate for online processing tasks…,” (See e.g., MOATTAR, Abstract, §§ 2, 3).
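For illustration only, the relation quoted from MOATTAR (SFMdb = 10log10 (Gm / Am)) can be sketched as follows. The function name, the `floor` parameter, and the use of a plain list of energy-density samples are assumptions of this sketch and do not appear in any cited reference.

```python
import math


def spectral_flatness_db(energy_density, floor=1e-12):
    """Return 10*log10(Gm/Am) for one frame's energy-density samples.

    Gm is the geometric mean and Am the arithmetic mean of the samples.
    The `floor` clamp is an assumption of this sketch so that log() is
    always defined; it is not taken from the cited reference.
    """
    if not energy_density:
        raise ValueError("frame has no energy-density samples")
    samples = [max(float(x), floor) for x in energy_density]
    n = len(samples)
    arithmetic_mean = sum(samples) / n
    # Geometric mean computed in the log domain to avoid overflow/underflow.
    geometric_mean = math.exp(sum(math.log(x) for x in samples) / n)
    return 10.0 * math.log10(geometric_mean / arithmetic_mean)
```

Under this sketch a perfectly flat (noise-like) spectrum yields a value near 0 dB, while a spectrum dominated by a few strong components yields a strongly negative value.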

With respect to Claim 5, SPENGLER in view of LOVEKIN and further in view of  MOATTAR discloses:

5. The system of claim 4, wherein the processor is further configured to (See e.g., flatness parameter based ratio capabilities according to see e.g., “…Spectral Flatness Measure (SFM)…a measure of the noisiness of spectrum and is a good feature in Voiced/Unvoiced/Silence detection…feature is   calculated using the following equation: SFMdb = 10log10 (Gm / Am)  where Am and Gm  are arithmetic and geometric means of speech spectrum respectively…,” with the VAD Algorithmic implementation capabilities for discarding if the corresponding flatness parameter is higher than a corresponding first threshold according to instructions comprising see e.g.,  “…2- Set one primary threshold for each feature…Primary Threshold for SFM (SF_PrimThresh)… 3-4 Set Decision threshold for…SFM…Thresh_SF = SF_PrimThresh…3-5-…If ((SFM(i)-Min_SF)>=Thresh_SF) then Counter++…,” See e.g., MOATTAR, Abstract, §§ 2, 3).
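As a hedged illustration of the discard condition recited above (discarding a frame whose flatness parameter is higher than a first threshold), a minimal sketch follows. It applies an absolute comparison on the linear Gm/Am ratio, which is an assumption of this sketch; the passage quoted from MOATTAR instead compares (SFM(i)-Min_SF) against Thresh_SF. The function names and the default threshold of 0.5 are hypothetical.

```python
import math


def flatness_ratio(energy_density, floor=1e-12):
    """Linear flatness: geometric mean divided by arithmetic mean."""
    samples = [max(float(x), floor) for x in energy_density]
    n = len(samples)
    geometric_mean = math.exp(sum(math.log(x) for x in samples) / n)
    return geometric_mean / (sum(samples) / n)


def discard_flat_frames(frames, first_threshold=0.5):
    """Return the indices of frames that are kept.

    A frame is discarded when its flatness exceeds `first_threshold`
    (a hypothetical value chosen only for this example).
    """
    kept = []
    for index, energy_density in enumerate(frames):
        if flatness_ratio(energy_density) > first_threshold:
            continue  # flat, noise-like spectrum: discard the frame
        kept.append(index)
    return kept
```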





With respect to Claim 6, SPENGLER in view of LOVEKIN discloses:
6. The system of claim 1, wherein the processor is further configured to  (See e.g., “…to perform the operations of determining a background noise contour for noise within the observation frame or window and removing the noise from within and around speech formants of the aligned user speech template using a nonlinear noise removal process such as, for example, by thresholding bins of equalized portions of the user-speech template…by first estimating noise power (see FIG. 8) in each bin for each of a plurality of time slices, e.g., twenty, on either side of the speech near and preferably outside the boundaries of the speech for each of the frequency ranges defining the bins, and equalizing the energy values of the each bin across each of the frequency ranges in response to the estimated noise power to thereby “flatten” the spectrum…” See e.g., SPENGLER paras. 59-63, Figs. 3-6, 8-12, 23) (See e.g., “…it is meaningful to extract only voiced frames  from the full speaker utterances, and assess the performance of the SID system with these segments to approximate the performance with usable segments. The voiced-only speech is extracted using the Spectral Flatness Method (SFM) [6]…”  See e.g., LOVEKIN, Abstract, §§ 2, 3) [whether a frame has to be discarded based on how [[the]] a spectral energy of said frame is distributed over frequency].

SPENGLER in view of LOVEKIN does not explicitly disclose, but MOATTAR discloses [whether a frame has to be discarded based on how the spectral energy of said frame is distributed over frequency] (See e.g., capabilities for discarding a frame based on how the spectral energy of said frame is distributed over frequency according to, e.g., “…2-…Primary Threshold for Energy (Energy_PrimThresh)…3-1-Compute frame energy (E(i))…3-4-Set Decision threshold for E…and SFM…3-5-…If(E(i))-Min_E)>=Thrsh_E) then Counter ++…,” See e.g., MOATTAR, Abstract, §§ 2, 3).
SPENGLER, LOVEKIN, and MOATTAR can be considered analogous art because they are from a similar field of endeavor in speech processing techniques and applications.  Thus, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of SPENGLER and LOVEKIN in view of MOATTAR’s techniques comprising, e.g., a voice activity detection (VAD) architecture using a Spectral Flatness Measure (SFM) algorithmic implementation, in order to advantageously improve the performance of speech/audio processing, considering, e.g., that SFM is a “…measure of the noisiness of spectrum and is a good feature in Voiced/Unvoiced/Silence detection …,” and that the method “…uses[ing] short-term features such as Spectral Flatness (SF) and Short-term Energy. This helps the method to be appropriate for online processing tasks…,” (See e.g., MOATTAR, Abstract, §§ 2, 3).
With respect to Claim 7, SPENGLER in view of LOVEKIN and further in view of  MOATTAR discloses:

7. The system of claim 6, wherein said processor is further configured to (See e.g., capabilities for filtering mode discarding a frame based on energy estimator assessment  according to substantial amount of energy above an upper frequency threshold based on see e.g., “…2-…Primary Threshold for F (F_PrimThresh)…3-2-1-Find F(i)…as the most dominant frequency component…3-4-Set Decision threshold for…F and SFM…Thresh_F=F_PrimThresh…If(F(i))-Min_F)>=Thrsh_F) then Counter ++…,” See e.g., MOATTAR, Abstract, §§ 2, 3).
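One possible way to quantify "a substantial amount of energy above an upper frequency threshold" is to compare the energy below the threshold with the total energy of the frame. The sketch below is illustrative only: the function names, the 4000 Hz cutoff, and the 0.8 minimum ratio are hypothetical values, and the approach is not taken from MOATTAR, whose quoted steps rely on the most dominant frequency component F(i).

```python
def low_band_energy_ratio(energy_density, bin_frequencies_hz, upper_frequency_hz):
    """Ratio of energy below `upper_frequency_hz` to total frame energy."""
    if len(energy_density) != len(bin_frequencies_hz):
        raise ValueError("one frequency is required per energy sample")
    total = sum(energy_density)
    if total <= 0:
        return 0.0  # silent frame: treated as having no low-band energy
    low = sum(e for e, f in zip(energy_density, bin_frequencies_hz)
              if f < upper_frequency_hz)
    return low / total


def should_discard(energy_density, bin_frequencies_hz,
                   upper_frequency_hz=4000.0, min_low_ratio=0.8):
    """Discard when too little of the energy lies below the cutoff."""
    ratio = low_band_energy_ratio(energy_density, bin_frequencies_hz,
                                  upper_frequency_hz)
    return ratio < min_low_ratio
```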

With respect to Claim 8, SPENGLER in view of LOVEKIN and further in view of  MOATTAR discloses:
8. The system of claim 7, wherein the processor is further configured to of: [[the]] an energy of the frame pertaining to frequencies lower than said upper frequency threshold (See e.g., “…2-…Primary Threshold for Energy (Energy_PrimThresh)…3-1-Compute frame energy (E(i))…3-4-Set Decision threshold for E…and SFM…If(E(i))-Min_E)>=Thrsh_E) then Counter ++…,” “…2-…Primary Threshold for F (F_PrimThresh)…3-2-1-Find F(i)…as the most dominant frequency component…3-4-Set Decision threshold for…F and SFM…Thresh_F=F_PrimThresh…If(F(i))-Min_F)>=Thrsh_F) then Counter ++…,” See e.g., MOATTAR, Abstract, §§ 2, 3); to [[the]] a total energy of the frame, wherein: the processor (See e.g., “…2-…Primary Threshold for Energy (Energy_PrimThresh)…3-1-Compute frame energy (E(i))…3-4-Set Decision threshold for E…and SFM…3-5-…If(E(i))-Min_E)>=Thrsh_E) then Counter ++…3-6-If Counter > 1…3-7-If current frame is marked as silence, update the energy minimum value: Min_E…3-8-Thresh_E = Energy_PrimThresh*log(Min_E)…,” See e.g., MOATTAR, Abstract, §§ 2, 3).
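The counter-based comparisons quoted from MOATTAR in this rejection (energy E, dominant frequency F, and SFM, each compared against its own threshold relative to a running minimum) can be sketched as below. This is a non-authoritative illustration: the quoted passage elides what follows "If Counter > 1", so the sketch assumes that a count above 1 marks the frame as speech. The function names and dictionary keys are hypothetical, and the threshold update steps (3-7, 3-8) are intentionally omitted.

```python
def count_feature_votes(energy, dominant_freq, sfm, minima, thresholds):
    """Count how many features exceed their minimum by at least the threshold."""
    counter = 0
    if energy - minima["E"] >= thresholds["E"]:
        counter += 1
    if dominant_freq - minima["F"] >= thresholds["F"]:
        counter += 1
    if sfm - minima["SFM"] >= thresholds["SFM"]:
        counter += 1
    return counter


def is_speech_frame(energy, dominant_freq, sfm, minima, thresholds):
    """Assumption of this sketch: more than one vote marks the frame as speech."""
    return count_feature_votes(energy, dominant_freq, sfm,
                               minima, thresholds) > 1
```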



7.	Claims 13 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over (a) Spengler et al., (U.S. Patent Application Publication: 2007/0288242), in view of (b) Lovekin et al., (J. M. Lovekin, R. E. Yantorno, K. R. Krishnamachari, D. S. Benincasa and S. J. Wenndt, “Developing usable speech criteria for speaker identification technology,” 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221), 2001, pp. 421-424 vol.1), and further in view of (c) Konuma et al., (T. Konuma, T. Suzuki, M. Yamada, Y. Ohno, M. Hoshimi and K. Niyada, "Speaker independent speech recognition method with constrained time alignment near phoneme discriminative frame," 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, 1997, pp. 458-465), hereinafter referred to as SPENGLER, LOVEKIN, and KONUMA.

With respect to Claim 13, SPENGLER in view of LOVEKIN does not explicitly disclose, but KONUMA discloses: Claim 13. (New) The system of claim 1, wherein an overlap amount for consecutive frames of the set of overlapping frames is fixed for all frames (See e.g., how “…Acoustic model (3) … is trained by samples that the discriminative frame and successive part nearby with a certain length are averaged without time alignment and the rest of frames before or after the successive frames are time aligned using DP technique considering endpoint and discriminative frame labels (figure 4). The number of frames for DP matching part is 112 of the total syllable duration…” See e.g., KONUMA Sections 3, 3.1, 3.1.1, Figs. 2-4).
SPENGLER, LOVEKIN, and KONUMA can be considered analogous art because they are from a similar field of endeavor in speech processing techniques and applications.  Thus, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of SPENGLER and LOVEKIN in view of KONUMA’s techniques comprising, e.g., fixed and dynamic overlapping frames, in order to advantageously help “…estimate dynamic spectral features near phoneme discriminative frames sufficiently…,” (See e.g., KONUMA, Abstract, §§ 3, 3.1, 3.1.1, Figs. 2-4).

With respect to Claim 14, SPENGLER in view of LOVEKIN does not explicitly disclose, but KONUMA discloses: Claim 14. (New) The system of claim 1, wherein an overlap amount between consecutive frames of the set of overlapping frames varies among the frames (See e.g., how “…Acoustic model (3) … is trained by samples that the discriminative frame and successive part nearby with a certain length are averaged without time alignment and the rest of frames before or after the successive frames are time aligned using DP technique considering endpoint and discriminative frame labels (figure 4). The number of frames for DP matching part is 112 of the total syllable duration…” See e.g., KONUMA Sections 3, 3.1, 3.1.1, Figs. 2-4).
SPENGLER, LOVEKIN, and KONUMA can be considered analogous art because they are from a similar field of endeavor in speech processing techniques and applications.  Thus, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of SPENGLER and LOVEKIN in view of KONUMA’s techniques comprising, e.g., fixed and dynamic overlapping frames, in order to advantageously help “…estimate dynamic spectral features near phoneme discriminative frames sufficiently…,” (See e.g., KONUMA, Abstract, §§ 3, 3.1, 3.1.1, Figs. 2-4).
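To make the distinction between Claims 13 and 14 concrete, the following sketch subdivides a signal into overlapping frames with either a fixed overlap (constant hop) or an overlap that varies from frame to frame (per-frame hops). It illustrates the claim language only and does not reproduce the DP time alignment described in KONUMA; the function names and parameters are hypothetical.

```python
def frame_fixed_overlap(samples, frame_length, hop_length):
    """Frames with the same overlap (frame_length - hop_length) throughout."""
    if not 0 < hop_length < frame_length:
        raise ValueError("hop must be positive and shorter than the frame")
    last_start = len(samples) - frame_length
    return [samples[s:s + frame_length]
            for s in range(0, last_start + 1, hop_length)]


def frame_variable_overlap(samples, frame_length, hop_lengths):
    """Frames whose overlap with the next frame varies with each hop."""
    starts = [0]
    for hop in hop_lengths:
        if not 0 < hop < frame_length:
            raise ValueError("each hop must be positive and shorter than the frame")
        starts.append(starts[-1] + hop)
    # Keep only frames that fit entirely inside the signal.
    return [samples[s:s + frame_length]
            for s in starts if s + frame_length <= len(samples)]
```

With a ten-sample signal, a frame length of 4, and a hop of 2, every pair of consecutive frames shares 2 samples; with hops of 1, 3, and 2, consecutive frames share 3, 1, and 2 samples respectively.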

Allowable Subject Matter
8.	Claims 10 and 11 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Conclusion
9.       The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure.  Yanna et al., (Ma, Yanna, and Akinori Nishihara. "Efficient voice activity detection algorithm using long-term spectral flatness measure." EURASIP Journal on Audio, Speech, and Music Processing 2013.1 (2013): 1-18.), already of record, discloses, see e.g., “…a novel and robust voice activity detection (VAD) algorithm utilizing long-term spectral flatness measure (LSFM) which is capable of working at 10 dB and lower signal-to-noise ratios(SNRs). This new LSFM-based VAD improves speech detection robustness in various noisy environments by employing a low-variance spectrum estimate and an adaptive threshold. The discriminative power of the new LSFM feature is shown by conducting an analysis of the speech/non-speech LSFM distributions. The proposed algorithm was evaluated under 12 types of noises (11 from NOISEX-92 and speech-shaped noise) and five types of SNR in core TIMIT test corpus...” (See e.g., Yanna et al., Abstract). 
Please, see PTO-892 for more details. 
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 

10.	Any inquiry concerning this communication or earlier communications from the examiner should be directed to Edgar Guerra-Erazo whose telephone number is (571) 270-3708.  The examiner can normally be reached on M-F 7:30a.m.-5:00p.m. EST. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Bhavesh Mehta can be reached on (571) 272-7453.  The fax phone number for the organization where this application or proceeding is assigned is (571) 273-8300. 
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at
http://www.uspto.gov/interviewpractice. 
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/EDGAR X GUERRA-ERAZO/            Primary Examiner, Art Unit 2656