DETAILED ACTION
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 04/05/2022 has been entered.

Claim Interpretation
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier.  Such claim limitation(s) is/are: “a speaker verification module coupled with the input, the speaker verification module arranged to process the audio signal,” “an audio validation module arranged to generate an output…”, and “a gating module configured to gate the output…” in claim 99.
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may:  (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph.
 
Response to Amendment
This communication is responsive to the applicant’s amendment dated 02/21/2022.  The applicant(s) amended claims 80 and 99.

Response to Arguments
Applicant's arguments with respect to claims 80 and 99 have been considered but are moot in view of the new ground(s) of rejection because the arguments pertain to the newly amended limitations.

Regarding claims 80 and 99, the Applicant argues that Mamkina does not teach “determining a distribution of acoustic classes in the received audio.” (Remarks: pg. 10) The Examiner respectfully disagrees.
First, the Examiner notes that nothing in the claims recites determining a distribution of acoustic classes in the received audio. The claims recite only outputting or providing such a distribution.
Second, Bocklet teaches speaker verification methods using mixture models, which reads on the distribution of acoustic speech classes.
 
Claim Rejections - 35 USC § 103
Claims 80-84, 90-96, and 99 are rejected under 35 U.S.C. 103 as being unpatentable over Mamkina et al. (US 10032451 B1) in view of Bocklet et al. (US 20170200451 A1), further in view of Aviles-Casco Vaquero et al. (US 20170351487 A1).

Regarding claims 80 and 99, Mamkina teaches:
“A speaker verification method to provide a speaker verification output” (col. 12, lines 43-52; ‘The output from the ASR component 250 may be sent to a user recognition module 802. The user recognition module 802 performs user recognition using the audio data 111, and optionally the ASR component output.’) comprising the steps of:
“receiving audio comprising speech” (col. 4, lines 62-67; ‘During runtime, as shown in FIG. 1, the microphone 103 of the speech-controlled device 110 (or a separate microphone array depending upon implementation) captures an utterance (i.e., input audio 11) spoken by the user 5.’);
“performing a speaker verification process on the received audio” (col. 5, lines 1-62; ‘The server(s) 120 may determine (156) user recognition data indicating a confidence that a particular user(s) of the speech-controlled device 110 spoke the utterance.’), the speaker verification process configured to output:
“(i) a speaker ID score representing the likelihood that the received speech is from a particular speaker” (col. 5, lines 1-62; ‘The server(s) 120 may determine (156) user recognition data indicating a confidence that a particular user(s) of the speech-controlled device 110 spoke the utterance.’), and
Mamkina teaches user recognition confidence (col. 24, lines 31-45; ‘The user recognition module 802 may then output user recognition confidence data 811 which reflects a certain confidence that the input utterance was spoken by one or more particular users. The user recognition confidence data 811 may not indicate access privileges of the user(s). The user recognition confidence data 811 may include an indicator of the verified user (such as a user ID corresponding to the speaker of the utterance) along with a confidence value corresponding to the user ID, such as a numeric value or binned value as discussed below.’).
However, Mamkina does not expressly teach:
“the speaker verification process configured to output: (ii) a sound classification representing a distribution of acoustic speech classes detected in the speech of the received audio”;
“performing an audio validation process on the received audio to generate an output indicating a validity of the received audio, wherein the audio validation process is based at least in part on the sound classification of speech from the speaker verification process, wherein the output indicating the validity of the received audio is based on a likelihood that the received audio is a product of a replay attack” (emphasis added); and
“gating the output of the speaker verification process based on the output of the audio validation process, such that the speaker ID score is output only for valid received audio.”
Bocklet teaches:
“the speaker verification process configured to output: (ii) a sound classification representing a distribution of acoustic speech classes detected in the speech of the received audio” (par. 0071; ‘Also as shown, CPU 801 may include feature extraction module 202 and classifier module 204. In the example of system 800, system memory 803 may store automatic speaker verification data such as utterance recording data, features, coefficients, replay or original indicators, universal background models, mixture models, scores, super vectors, support vector machine data, or the like as discussed herein.’ The use of mixture models for speaker verification reads on the sound classification.);
“performing an audio validation process on the received audio to generate an output indicating a validity of the received audio, wherein the audio validation process is based at least in part on the sound classification of speech from the speaker verification process, wherein the output indicating the validity of the received audio is based on a likelihood that the received audio is a product of a replay attack” (par. 0023; ‘For example, when the classification is based on the statistical classification, a score for the utterance may be determined based on a ratio of a log-likelihood the utterance was produced by a replay mixture model to a log-likelihood the utterance was produced by an original mixture model. In this context, the term "produced by" indicates a likelihood the utterance has similar characteristics as the utterances used to train the pertinent mixture model (e.g., replay or original). The mixture models may be Gaussian mixture models (GMMs) pre-trained based on many recordings of original utterances and replay utterances as is discussed further herein.’; par. 0025; ‘In either case, an utterance classified as a replay or replayed utterance may cause the automatic speaker verification system to deny access to the system.’).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date to modify Mamkina’s user recognition methods by incorporating Bocklet’s utterance classification in order to generate an output indicating a validity of the received audio based on the likelihood of a replayed utterance (replay attack). The combination detects replay attacks and rejects system access requests based on such detection. (Bocklet: par. 0021)
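For illustration only, the statistical classification quoted from Bocklet par. 0023 can be sketched as follows. This is a minimal sketch and not a characterization of Bocklet's actual implementation. It assumes one-dimensional features, that the "ratio" of log-likelihoods is taken in the usual log domain as a difference, and a decision threshold of zero. The names `gmm_log_likelihood`, `replay_score`, `is_replay`, `REPLAY_GMM`, and `ORIGINAL_GMM`, and all model parameters, are hypothetical.

```python
import math


def gmm_log_likelihood(frames, weights, means, variances):
    """Average per-frame log-likelihood of 1-D feature frames under a GMM."""
    total = 0.0
    for x in frames:
        # Log of each weighted Gaussian component density evaluated at x.
        comps = [
            math.log(w) - 0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
            for w, m, v in zip(weights, means, variances)
        ]
        # Log-sum-exp over components for numerical stability.
        peak = max(comps)
        total += peak + math.log(sum(math.exp(c - peak) for c in comps))
    return total / len(frames)


def replay_score(frames, replay_gmm, original_gmm):
    """Log-likelihood ratio; positive means the frames look more like a replay."""
    return (gmm_log_likelihood(frames, *replay_gmm)
            - gmm_log_likelihood(frames, *original_gmm))


# Hypothetical toy models as (weights, means, variances); not trained on real data.
REPLAY_GMM = ([0.5, 0.5], [-2.0, -1.0], [1.0, 1.0])
ORIGINAL_GMM = ([0.5, 0.5], [1.0, 2.0], [1.0, 1.0])


def is_replay(frames, threshold=0.0):
    """Classify the utterance as a replay when the score exceeds the threshold."""
    return replay_score(frames, REPLAY_GMM, ORIGINAL_GMM) > threshold
```

Under these assumptions, frames clustered near the replay model's means yield a positive score and are classified as a replayed utterance, which per Bocklet par. 0025 may cause the system to deny access.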
However, Mamkina and Bocklet do not expressly teach:
“gating the output of the speaker verification process based on the output of the audio validation process, such that the speaker ID score is output only for valid received audio.”
In a similar field of endeavor, Aviles-Casco Vaquero teaches:
“gating the output of the speaker verification process based on the output of the audio validation process, such that the speaker ID score is output only for valid received audio” (par. 0186; ‘In general it is a good idea to combine this approach with the use of the antispoofing output scores as a filter, so that, if the input is clearly a spoof it is directly rejected instead of using the antispoofing scores only to modify the weights.’).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date to modify the user recognition methods of Mamkina in view of Bocklet by using the antispoofing output scores taught by Aviles-Casco Vaquero as a filter, in order to reject inputs that are spoofs.
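For illustration only, the gating limitation as mapped above (antispoofing output used as a filter so that a clear spoof is directly rejected) can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of any cited reference. The function name `gate_speaker_score`, the 0-to-1 replay likelihood scale, and the default threshold of 0.5 are hypothetical.

```python
from typing import Optional


def gate_speaker_score(speaker_id_score: float,
                       replay_likelihood: float,
                       replay_threshold: float = 0.5) -> Optional[float]:
    """Return the speaker ID score only when the audio is judged valid.

    replay_likelihood is an assumed 0..1 estimate that the audio is a
    replay attack; at or above the threshold the score is withheld (None).
    """
    audio_is_valid = replay_likelihood < replay_threshold
    return speaker_id_score if audio_is_valid else None
```

In this sketch a downstream consumer receives `None` for audio judged to be a replay, so the speaker ID score is output only for valid received audio.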

Regarding claim 81 (dep. on claim 80), the combination of Mamkina in view of Bocklet and Aviles-Casco Vaquero further teaches:
“wherein the audio validation process is additionally based at least in part on the speaker ID score from the speaker verification process” (Aviles-Casco Vaquero: par. 0187; ‘This approach is advantageous if we expect there to be correlation between the speaker recognition and antispoofing scores.’).

Regarding claim 82 (dep. on claim 80), the combination of Mamkina in view of Bocklet and Aviles-Casco Vaquero further teaches:
“performing an anti-spoofing (AS) process based on the received audio and the sound classification” (Aviles-Casco Vaquero: par. 0019; ‘The method may further comprise performing an antispoofing process on at least one of the first and second portions of the received signal.’).

Regarding claim 83 (dep. on claim 82), the combination of Mamkina in view of Bocklet and Aviles-Casco Vaquero further teaches:
“wherein the anti-spoofing process comprise at least one of the following: an anti-spoofing method using received audio and an indication of the acoustic classes present in speech; an ultrasonic-power-level-based anti-spoofing system; a magnetic-power-level-based anti-spoofing system; and a loudspeaker-detection-based anti-spoofing system” (Aviles-Casco Vaquero: par. 0174; ‘The antispoofing feature vector is composed of different metrics, for example by the spectral ratio, low frequency ratio and feature vector squared Mahalanobis distance.’).

Regarding claim 84 (dep. on claim 83), the combination of Mamkina in view of Bocklet and Aviles-Casco Vaquero further teaches:
“performing multiple different anti-spoofing processes, and combining or fusing the outputs of such different anti-spoofing processes to provide an anti-spoofing decision” (Aviles-Casco Vaquero: par. 0180; ‘In this example, the voice first and second segments, which may be the trigger and the command as discussed previously, are subject to separate antispoofing detection processes (which may be the same or different), to obtain two antispoofing output scores, one for the trigger and one for the command.’).

Regarding claim 90 (dep. on claim 80), the combination of Mamkina in view of Bocklet and Aviles-Casco Vaquero further teaches:
“buffering the received audio; and responsive to the step of gating the output of the speaker verification process, outputting the valid speaker recognition output and the buffered audio” (Aviles-Casco Vaquero: par. 0186; ‘In general it is a good idea to combine this approach with the use of the antispoofing output scores as a filter, so that, if the input is clearly a spoof it is directly rejected instead of using the antispoofing scores only to modify the weights.’ Buffering data/audio is well-known in the art.).

Regarding claim 91 (dep. on claim 80), the combination of Mamkina in view of Bocklet and Aviles-Casco Vaquero further teaches:
“performing a plurality of different speaker recognition processes to provide a respective plurality of speaker recognition scores, and fusing the plurality of speaker recognition scores to provide the speaker ID score” (Aviles-Casco Vaquero: par. 0187; ‘A third option is to use the antispoofing scores as additional scores for the speaker recognition task, and fuse them with the speaker recognition scores.’).
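For illustration only, fusing a plurality of scores into a single score can be sketched as a weighted average. The quoted passage of Aviles-Casco Vaquero does not state a particular fusion rule, so the weighted average, the function name `fuse_scores`, and the default equal weights are assumptions made for this sketch.

```python
def fuse_scores(scores, weights=None):
    """Fuse several recognition scores into one by weighted averaging."""
    if not scores:
        raise ValueError("at least one score is required")
    if weights is None:
        # Assumed default: every score contributes equally.
        weights = [1.0] * len(scores)
    if len(weights) != len(scores):
        raise ValueError("scores and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s * w for s, w in zip(scores, weights)) / total_weight
```

For example, `fuse_scores([1.0, 0.0], [3.0, 1.0])` weights the first score three times as heavily as the second.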

Regarding claim 92 (dep. on claim 80), the combination of Mamkina in view of Bocklet and Aviles-Casco Vaquero further teaches:
“performing a classification of the received audio to identify a sound classification, the sound classification identifying acoustic classes present in the received audio” (Mamkina: col. 7, lines 40-55; ‘For example, the ASR module 250 may compare the audio data 111 with models for sounds (e.g., subword units or phonemes) and sequences of sounds to identify words that match the sequence of sounds spoken in the spoken utterance of the audio data 111.’); and
“based on the identified sound classification, scoring the received audio against a stored template of the acoustic classes produced by enrolled speakers to identify a speaker for the received audio from the enrolled speakers” (Mamkina: col. 5, lines 1-62; ‘The server(s) 120 may determine (156) user recognition data indicating a confidence that a particular user(s) of the speech-controlled device 110 spoke the utterance.’).

Regarding claim 93 (dep. on claim 80), the combination of Mamkina in view of Bocklet and Aviles-Casco Vaquero further teaches:
“wherein the step of performing a speaker recognition process is performed responsive to receipt of a trigger signal, for example a keyword detection” (Mamkina: col. 6, lines 29-37; ‘For example, the device 110 may convert input audio 11 into audio data, and process the audio data with the wakeword detection module 220 to determine whether speech is detected, and if so, if the audio data comprising speech matches an audio signature and/or model corresponding to a particular keyword.’).

Regarding claim 94 (dep. on claim 93), the combination of Mamkina in view of Bocklet and Aviles-Casco Vaquero further teaches:
“wherein the method comprises the step of monitoring for a trigger signal, for example performing a voice keyword detection process” (Mamkina: col. 6, lines 29-37; ‘For example, the device 110 may convert input audio 11 into audio data, and process the audio data with the wakeword detection module 220 to determine whether speech is detected, and if so, if the audio data comprising speech matches an audio signature and/or model corresponding to a particular keyword.’).

Regarding claim 95, the combination of Mamkina in view of Bocklet and Aviles-Casco Vaquero further teaches:
“performing at least a portion of the method according to claim 80 as part of a primary biometrics scoring” (Mamkina: col. 31, lines 25-51; ‘The API may also pass other data such as a source of user recognition data (e.g., whether the system recognized the user using speech analysis, a passcode, a passphrase, a fingerprint, biometric data, etc. or some combination thereof).’) and
“performing a secondary biometrics scoring based on the received audio to provide a second speaker ID score, the secondary biometrics scoring performed responsive to the step of gating of a speaker verification output for valid received audio from the primary biometrics scoring, wherein the secondary biometrics scoring is selected to be different to the primary biometrics scoring” (Mamkina: col. 31, lines 25-51; ‘The API may also pass other data such as a source of user recognition data (e.g., whether the system recognized the user using speech analysis, a passcode, a passphrase, a fingerprint, biometric data, etc. or some combination thereof).’).

Regarding claim 96 (dep. on claim 95), the combination of Mamkina in view of Bocklet and Aviles-Casco Vaquero further teaches:
“wherein the method comprises the step of fusing the speaker ID score from the primary biometrics scoring with the second speaker ID score of the secondary biometrics scoring to provide a speaker authentication result” (Aviles-Casco Vaquero: par. 0187; ‘A third option is to use the antispoofing scores as additional scores for the speaker recognition task, and fuse them with the speaker recognition scores.’).

Claims 85-89 are rejected under 35 U.S.C. 103 as being unpatentable over Mamkina in view of Bocklet and Aviles-Casco Vaquero as applied to claim 80 above, and further in view of Hayakawa et al. (US 20160111112 A1).

Regarding claim 85 (dep. on claim 80), the combination of Mamkina in view of Bocklet and Aviles-Casco Vaquero does not expressly teach:
“performing a speaker change detection (SCD) process based on a combination of at least one or more of the following: the speaker ID score; the sound classification; the received audio.”
Hayakawa teaches:
“performing a speaker change detection (SCD) process based on a combination of at least one or more of the following: the speaker ID score; the sound classification; the received audio” (par. 0023; ‘The processing unit 13 executes a speaker change detection process to add identification information identifying a speaker speaking in each frame to the frame on the basis of the digitized voice signal.’).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date to modify the user recognition method taught by Mamkina in view of Bocklet and Aviles-Casco Vaquero by incorporating the speaker change detection methods taught by Hayakawa in order to detect fraudulent conversations. (Hayakawa: par. 0087)

Regarding claim 86 (dep. on claim 85), the combination of Mamkina in view of Bocklet, Aviles-Casco Vaquero, and Hayakawa further teaches:
“wherein the audio validation process is configured to generate an output indicating that a part of the received audio preceding a speaker change is valid” (Mamkina: col. 6, lines 29-37; ‘For example, the device 110 may convert input audio 11 into audio data, and process the audio data with the wakeword detection module 220 to determine whether speech is detected, and if so, if the audio data comprising speech matches an audio signature and/or model corresponding to a particular keyword.’).

Regarding claim 87 (dep. on claim 85), the combination of Mamkina in view of Bocklet, Aviles-Casco Vaquero, and Hayakawa further teaches:
“wherein the SCD process is based on at least one of the following: a time-windowed speaker ID score; monitoring for a change in the fundamental frequency, or F0, of the received audio; monitoring for a change in the distribution of the acoustic classes of the received audio identified from the sound classification; monitoring for a change in the fundamental frequency of the received audio for a particular acoustic class identified from the sound classification; accent tracking; emotion tracking; or any other suitable speaker change detection method” (Hayakawa: par. 0030; ‘Note that the feature extracting unit 21 may obtain an integrated value of power and a pitch frequency, which are prosodic information, from each frame as features.’).
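For illustration only, one of the listed alternatives (monitoring for a change in the fundamental frequency, F0) can be sketched as a comparison of mean pitch in adjacent windows. This is not Hayakawa's method; Hayakawa is cited only for extracting pitch frequency as a feature. The function name `detect_speaker_change`, the non-overlapping window scheme, and the 20% relative threshold are assumptions, and the sketch resolves a change only to window granularity.

```python
def detect_speaker_change(f0_track, window=3, rel_threshold=0.2):
    """Return frame indices where mean F0 shifts sharply between windows.

    f0_track is a list of per-frame pitch estimates in Hz. Adjacent,
    non-overlapping windows of `window` frames are compared, and a change
    is flagged when the relative difference of their means exceeds
    rel_threshold.
    """
    changes = []
    for i in range(window, len(f0_track) - window + 1, window):
        before = sum(f0_track[i - window:i]) / window
        after = sum(f0_track[i:i + window]) / window
        if before > 0 and abs(after - before) / before > rel_threshold:
            changes.append(i)
    return changes
```

With a hypothetical track that moves from roughly 120 Hz to roughly 210 Hz at frame 6, the sketch flags a single change at that window boundary.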

Regarding claim 88 (dep. on claim 85), the combination of Mamkina in view of Bocklet, Aviles-Casco Vaquero, and Hayakawa further teaches:
“comprising performing multiple different SCD processes, and combining or fusing the outputs of such different SCD processes to provide an SCD decision” (Aviles-Casco Vaquero: par. 0187; ‘A third option is to use the antispoofing scores as additional scores for the speaker recognition task, and fuse them with the speaker recognition scores.’).

Regarding claim 89 (dep. on claim 85), the combination of Mamkina in view of Bocklet, Aviles-Casco Vaquero, and Hayakawa further teaches:
“wherein an output of the audio validation process is used as an input to the speaker verification process” (Aviles-Casco Vaquero: par. 0187; ‘A third option is to use the antispoofing scores as additional scores for the speaker recognition task, and fuse them with the speaker recognition scores.’).

Claims 97-98 are rejected under 35 U.S.C. 103 as being unpatentable over Mamkina in view of Bocklet and Aviles-Casco Vaquero as applied to claim 95 above, and further in view of North et al. (US 20150161370 A1).

Regarding claim 97 (dep. on claim 95), the combination of Mamkina in view of Bocklet and Aviles-Casco Vaquero does not expressly teach:
“wherein the speaker recognition method is configured such that: the primary biometrics scoring is selected to have a relatively high False Acceptance Rate (FAR), and a relatively low False Rejection Rate (FRR).”
In a similar field of endeavor (fraud prevention), North teaches:
“wherein the speaker recognition method is configured such that: the primary biometrics scoring is selected to have a relatively high False Acceptance Rate (FAR), and a relatively low False Rejection Rate (FRR)” (par. 0069; ‘Therefore, processor 126 may adjust the at least one threshold to higher or lower threshold than previous or subsequent threshold(s), i.e., the at least one threshold is adjusted to increase the probability of false-rejections or false acceptances.’).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date to modify the user recognition methods taught by Mamkina in view of Bocklet and Aviles-Casco Vaquero by incorporating North’s dynamically adjustable voice authentication thresholds such that the primary biometrics scoring is selected to have a relatively high False Acceptance Rate (FAR), and a relatively low False Rejection Rate (FRR). The combination would provide sufficient security for mobile site security and automation applications. (North: par. 0006)

Regarding claim 98 (dep. on claim 97), the combination of Mamkina in view of Bocklet, Aviles-Casco Vaquero, and North further teaches:
“the secondary biometrics scoring is selected to have a relatively lower FAR than the primary biometrics scoring” (North: abstract; ‘The authentication includes voice authentication having at least one threshold that may be dynamically adjustable between false-rejection and false-acceptance.’).
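For illustration only, the relationship between an adjustable threshold and the false-acceptance and false-rejection rates can be sketched as follows. The score lists and both thresholds are hypothetical values chosen for the example and are not taken from North or any other cited reference. The sketch assumes a trial is accepted when its score is at or above the threshold.

```python
def far_frr(genuine_scores, impostor_scores, threshold):
    """Return (FAR, FRR) when a trial is accepted if score >= threshold."""
    false_accepts = sum(1 for s in impostor_scores if s >= threshold)
    false_rejects = sum(1 for s in genuine_scores if s < threshold)
    return (false_accepts / len(impostor_scores),
            false_rejects / len(genuine_scores))


# Hypothetical match scores for enrolled (genuine) and impostor speakers.
GENUINE = [0.9, 0.8, 0.7, 0.6]
IMPOSTOR = [0.5, 0.4, 0.3, 0.65]

# A low threshold yields a relatively high FAR and low FRR (primary stage);
# a high threshold yields a lower FAR at the cost of a higher FRR (secondary stage).
primary = far_frr(GENUINE, IMPOSTOR, threshold=0.35)
secondary = far_frr(GENUINE, IMPOSTOR, threshold=0.75)
```

On this toy data the lower threshold accepts every genuine trial but also most impostor trials, while the higher threshold rejects every impostor trial and half of the genuine trials.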

Conclusion
Other pertinent prior art is listed in the PTO-892.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MARK VILLENA whose telephone number is (571)270-3191.  The examiner can normally be reached from 10 am to 6 pm EST, Monday through Friday.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Richemond Dorvil, can be reached at (571) 272-7602.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.


MARK VILLENA
Examiner
Art Unit 2658


/MARK VILLENA/
Examiner, Art Unit 2658