DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Introduction
This office action is in response to communications filed on 10/05/2022. Claims 1, 3-12 and 14-22 are pending, and likewise Claims 1, 3-12 and 14-22 have been examined.

Response to Amendment
The amendment filed 10/05/2022 has been fully considered by the Examiner. The objections to Claims 3 and 14 have been withdrawn.

Response to Arguments
Applicant's arguments filed 10/05/2022 have been fully considered but they are not persuasive.

Examiner believes that the prior art of record still teaches the claim limitations as amended. Applicant argues that Wu(2020), Khoury and Kumar fail to teach receiving audio data characterizing speech obtained by a user device; generating, using a shallow discriminator model comprising a single intelligent pooling layer and a single fully-connected layer, a score indicating a presence of synthetic speech in the audio data; and determining whether the score satisfies a synthetic speech detection threshold.
	As shown in the previous rejection of Claim 1, receiving audio data characterizing speech obtained by a user device and determining whether the score satisfies a synthetic speech detection threshold are taught by Khoury, a shallow discriminator model is taught by Kumar, and Wu(2020) teaches generating, using a discriminator model, a score indicating a presence of synthetic speech in the audio data. The newly added limitation comprising a single intelligent pooling layer and a single fully-connected layer is not taught by that combination of references. However, Cai, which was previously applied to the dependent claims, does teach the limitation comprising a single intelligent pooling layer and a single fully-connected layer (see Cai, Pg 3, Table 1, which has a single fully connected output layer as well as a single Global Average Pooling layer).
	Wang, Yamagishi, Vaquero and Wu(2019) are not relied upon to teach the above limitation.
In response to applicant's argument that the examiner's conclusion of obviousness is based upon improper hindsight reasoning, it must be recognized that any judgment on obviousness is in a sense necessarily a reconstruction based upon hindsight reasoning.  But so long as it takes into account only knowledge which was within the level of ordinary skill at the time the claimed invention was made, and does not include knowledge gleaned only from the applicant's disclosure, such a reconstruction is proper.  See In re McLaughlin, 443 F.2d 1392, 170 USPQ 209 (CCPA 1971).
For these reasons Examiner believes that the limitations are taught by the prior art of record, with the addition of Cai, which was originally applied to a dependent claim.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1, 3-6, 8, 12, 14-17 and 19 is/are rejected under 35 U.S.C. 103 as being unpatentable over Wu et al. “Defense for Black-box Attacks on Anti-spoofing Models by Self-Supervised Learning” hereinafter Wu(2020), and further in view of Khoury et al. (WO 2018160943 A1), and further in view of Kumar et al. “Spoof detection using time-delay shallow neural network and feature switching” hereinafter Kumar, and further in view of Cai et al. “The DKU Replay Detection System for the ASVspoof 2019 Challenge: On Data Augmentation, Feature Representation, Classification, and Fusion” hereinafter Cai.

Regarding Claim 1:
Wu(2020) teaches a method comprising: receiving, audio data characterizing speech(Pg 2, Fig 1, (b), (picture of audio signal) -> mockingjay -> Anti-spoofing model -> Spoofing or non-spoofing. Pg 1, Introduction, Ln 1-6, Automatic speaker verification, …. Speech… ..unprotected ASV models are highly vulnerable to spoofing); 
generating, using a trained self-supervised model, a plurality of audio feature vectors each representative of audio features of a portion of the audio data(Pg 2-3, 3, Proposed method, 3.1. Mockingjay, Para 1, Ln 1-3, The Mockingjay [30] approach learns representations of speech by solving a self-supervised masked-prediction task….. The model. & Ln 5-6, The transformer encoder produces a representation vector for each time frame. & Ln 8-10, the masked-prediction task requires the model to take a sequence of frames as input. & Ln 12-13, After training, the representations produced by the transformer network are inputs to the anti-spoofing model);
generating, using a discriminator model, a score indicating a presence of synthetic speech in the audio data based on the corresponding audio features of each audio feature vector of the plurality of audio feature vectors(Pg 2-3, 3, Proposed method, 3.1. Mockingjay, Para 1, Ln 12-13, After training, the representations produced by the transformer network are inputs to the anti-spoofing model. Pg 2, Fig 1, (b), Anti-spoofing model -> Spoofing or non-spoofing. Pg 3, 4. Experiment, 4.1 Experiment setup, Para 2, Ln 5-8, Two high-performance anti-spoofing models are adopted: LCNN [19] and SENet [20]); 
determining, that the speech in the audio data obtained by the user device comprises synthetic speech(Pg 2, Fig 1, (b), (picture of audio signal) -> mockingjay -> Anti-spoofing model -> Spoofing or non-spoofing. Pg 1, Introduction, Ln 1-6, Automatic speaker verification, …. Speech… ..unprotected ASV models are highly vulnerable to spoofing. The audio is speech).
Wu(2020) does not teach receiving, at data processing hardware, audio data characterizing speech obtained by a user device;
generating, by the data processing hardware;
generating, by the data processing hardware;
determining, by the data processing hardware, whether the score satisfies a synthetic speech detection threshold; 
and when the score satisfies the synthetic speech detection threshold, determining, by the data processing hardware, that the speech in the audio data obtained by the user device comprises synthetic speech.
In the same field of Anti-spoofing models, Khoury teaches receiving, at data processing hardware(Para [0061], Ln 1-5, The present invention generally relates to an apparatus for performing the operations described herein. This apparatus may….. comprise a general-purpose computer), 
audio data characterizing speech obtained by a user device(Para [0045], Ln 4, A spoofing call, using …a computer microphone… or other unexpected audio captured device);
generating, by the data processing hardware(Para [0061], Ln 1-5, The present invention generally relates to an apparatus for performing the operations described herein. This apparatus may….. comprise a general-purpose computer);
generating, by the data processing hardware(Para [0061], Ln 1-5, The present invention generally relates to an apparatus for performing the operations described herein. This apparatus may….. comprise a general-purpose computer);
determining, by the data processing hardware, whether the score satisfies a synthetic speech detection threshold(Para [0030], Ln 10-13, A binary classifier 130 may compare the resulting classification with a predetermined threshold score, resulting in a determination that the voice sample or audio source is "genuine" or "fraudulent". Para [0061], Ln 1-5, The present invention generally relates to an apparatus for performing the operations described herein. This apparatus may….. comprise a general-purpose computer); 
and when the score satisfies the synthetic speech detection threshold, determining, by the data processing hardware, that the speech in the audio data obtained by the user device comprises synthetic speech(Para [0030], Ln 10-13, A binary classifier 130 may compare the resulting classification with a predetermined threshold score, resulting in a determination that the voice sample or audio source is "genuine" or "fraudulent". Para [0061], Ln 1-5, The present invention generally relates to an apparatus for performing the operations described herein. This apparatus may….. comprise a general-purpose computer).
It would have been obvious for one skilled in the art, at the effective time of filing, to modify Wu(2020) with the computer hardware, voice input device, and binary classification score threshold of Khoury, as the computer hardware provides an environment for the system to be executed(Para [0061], Ln 1-2), the voice input device provides the system with speech input to operate on(Para [0045], Ln 2-6), and the binary classification score threshold enables the system to output the final decision of “genuine” or “spoofed”(Para [0035], Ln 10-12).
The combination of Wu(2020) and Khoury does not teach a shallow discriminator model.
In the same field of Anti-Spoofing models, Kumar teaches a shallow discriminator model(Pg 1, Introduction, Para 3, Ln 10-11, The network was made shallow since this is binary classification problem with limited data).
It would have been obvious for one skilled in the art, at the effective time of filing, to modify the combination of Wu(2020) and Khoury with the shallow discriminator model of Kumar, because of its performance improvements on a binary classification problem with limited data(Pg 1, Introduction, Para 3, Ln 10-13).
The combination of Wu(2020), Khoury and Kumar does not teach discriminator model comprising a single intelligent pooling layer and a single fully-connected layer.
In the same field of Anti-spoofing models, Cai teaches discriminator model comprising a single intelligent pooling layer and a single fully-connected layer(Pg 2, Fig 1, and Pg 3, Table 1, GAP Layer and Output Fully-connected Layer. Pg 2, Para 5, Ln 4-5, we adopt a global average pooling (GAP) layer).
It would have been obvious for one skilled in the art, at the effective time of filing, to modify the combination of Wu(2020), Khoury and Kumar with the fully connected output layer and Global Average Pooling (GAP) layer of Cai, as the output layer produces a score for the possible classes of the utterance, identifying whether the input is spoofed, and helps increase the performance of the system(Pg 2, Para 7, Ln 3-4 bona fide and spoof categories. & Ln 5-6, the final utterance-level score can be directly fetched from the DNN output. Abstract, Ln 17-20), and the GAP layer allows the whole sequence to be aggregated together over time(Pg 2, Para 5(4th whole), Ln 3-4) as well as increasing performance by reducing overfitting.
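
For illustration only, the structure discussed for Claim 1 (per-frame feature vectors, one pooling layer, one fully-connected layer, and a threshold comparison) can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from the application or from any cited reference: mean pooling stands in for the pooling layer, a sigmoid output and the "greater than or equal" comparison are assumed, and all names, weights, and the 0.5 threshold are hypothetical.

```python
import math

def mean_pool(frames):
    """Single pooling layer: average the per-frame feature vectors into one vector."""
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]

def fully_connected_score(vector, weights, bias):
    """Single fully-connected layer with an assumed sigmoid output in (0, 1)."""
    z = sum(w * x for w, x in zip(weights, vector)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def detect_synthetic(frames, weights, bias, threshold=0.5):
    """Return (score, flagged); flagged is True when the score meets the threshold."""
    pooled = mean_pool(frames)
    score = fully_connected_score(pooled, weights, bias)
    return score, score >= threshold  # assumed reading of "satisfies the threshold"

# Hypothetical inputs: three frames of two-dimensional features.
frames = [[0.2, 0.8], [0.4, 0.6], [0.6, 0.4]]
weights = [1.0, -1.0]  # hypothetical learned weights
bias = 0.0             # hypothetical learned bias
score, flagged = detect_synthetic(frames, weights, bias)
print(f"score={score:.4f} flagged={flagged}")
```

In this sketch the frames would come from the self-supervised model and the weights from training the discriminator; both are replaced by fixed numbers here.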

Regarding Claim 3:
	The combination of Wu(2020), Khoury, Kumar and Cai teaches the method of claim 1, but does not teach further comprising: generating, by the data processing hardware, using the intelligent pooling layer of the shallow discriminator model, a single final audio feature vector based on each audio feature vector of the plurality of audio feature vectors, wherein generating the score indicating the presence of the synthetic speech in the audio data is based on the single final audio feature vector.
In the same field of Anti-spoofing models, Cai teaches further comprising: generating, by the data processing hardware, using the intelligent pooling layer of the shallow discriminator model, a single final audio feature vector based on each audio feature vector of the plurality of audio feature vectors(Pg 2, Para 5(4th whole), Ln 4-8, global average pooling (GAP) layer on the top of CNN [25]. Given CNN learned feature maps F ∈ R^(C×H×W), the GAP layer accumulates mean statistics along with the time–frequency axis, and the corresponding output is defined as:. & Pg 2, eq (1). & Pg 2, Para 6, Ln 1-2, fixed-dimensional utterance-level representation V = [v1, v2, · · · , vC ] from the output of GAP), 
wherein generating the score indicating the presence of the synthetic speech in the audio data is based on the single final audio feature vector(Pg 2, Fig 1, GAP -> Bona fide or Spoof).
It would have been obvious for one skilled in the art, at the effective time of filing, to modify the combination of Wu(2020), Khoury, Kumar and Cai with the Global Average Pooling layer of Cai, as it allows them to aggregate the whole sequence together over time(Pg 2, Para 5(4th whole), Ln 3-4).

Regarding Claim 4:
	The combination of Wu(2020), Khoury, Kumar and Cai teaches the method of claim 3, but does not teach wherein the single final audio feature vector comprises an averaging of each audio feature vector of the plurality of audio feature vectors.
	In the same field of Anti-spoofing models, Cai teaches wherein the single final audio feature vector comprises an averaging of each audio feature vector of the plurality of audio feature vectors(Pg 2, Para 4, Ln 3-5, For a given feature sequence of size D ×L…… block of shape C×H×W. & Pg 2, Para 5, Ln 4-5, Concerning about that, we adopt a global average pooling (GAP) layer on the top of CNN [25].  & Pg 2, eq (1) shows averaging. & Pg 2 Para 6, Ln 1-2, fixed-dimensional utterance-level representation V = [v1, v2, · · · , vC ] from the output of GAP).
It would have been obvious for one skilled in the art, at the effective time of filing, to modify the combination of Wu(2020), Khoury, Kumar and Cai with the Global Average Pooling layer of Cai, as it allows them to aggregate the whole sequence together over time(Pg 2, Para 5(4th whole), Ln 3-4).
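
For reference, the averaging relied upon for Claim 4 can be written out from the definitions quoted above (feature maps F of shape C×H×W, mean statistics taken along the time and frequency axes, and output V = [v1, v2, ..., vC]). The expression below is a reconstruction from those quoted definitions and is assumed to correspond to Cai's eq (1); it is not copied from Cai.

```latex
v_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} F_{c,i,j},
\qquad c = 1, \dots, C
```

Under this reading, each element of V is the arithmetic mean of one channel's feature map, so V is a single fixed-dimensional vector regardless of the length of the input.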

Regarding Claim 5:
	The combination of Wu(2020), Khoury, Kumar and Cai teaches the method of claim 3, but does not teach wherein the single final audio feature vector comprises an aggregate of each audio feature vector of the plurality of audio feature vectors.
	In the same field of Anti-spoofing models, Cai teaches wherein the single final audio feature vector comprises an aggregate of each audio feature vector of the plurality of audio feature vectors(Pg 2, Para 5, Ln 3, how to aggregate the whole sequence together. & Ln 6-8, the GAP layer accumulates mean statistics along with the time–frequency axis, and the corresponding output is defined as:. Eq (1). & Pg 2, Para 6, Ln 1-2, fixed-dimensional utterance-level representation V = [v1, v2, · · · , vC ] from the output of GAP).
	It would have been obvious for one skilled in the art, at the effective time of filing, to modify the combination of Wu(2020), Khoury, Kumar and Cai with the Global Average Pooling layer of Cai, as it allows them to aggregate the whole sequence together over time(Pg 2, Para 5(4th whole), Ln 3-4).

Regarding Claim 6:
	The combination of Wu(2020), Khoury, Kumar and Cai teaches the method of claim 3, but does not teach wherein the single fully-connected layer configured to receive, as input, the single final audio feature vector and generate, as output, the score.
	In the same field of Anti-spoofing models, Cai teaches wherein the single fully-connected layer configured to receive, as input, the single final audio feature vector and generate, as output, the score(Pg 2, Para 6, Ln 1-2, fixed-dimensional utterance-level representation V = [v1, v2, · · · , vC ] from the output of GAP. & Pg 2, Para 7, Ln 1-4(next column), further process the utterance-level representation through a fully-connected feed-forward network and build an output layer on top. The two units in the output layer are represented as bona fide and spoof categories. Pg 3, Table 1).
	It would have been obvious for one skilled in the art, at the effective time of filing, to modify the combination of Wu(2020), Khoury, Kumar and Cai with the fully connected layer and output, after the Global Average Pooling layer, of Cai, as it produces a score for the possible classes of the utterance, identifying whether the input is spoofed, and helps increase the performance of the system(Pg 2, Para 7, Ln 3-4 bona fide and spoof categories. & Ln 5-6, the final utterance-level score can be directly fetched from the DNN output. Abstract, Ln 17-20).

Regarding Claim 8:
The combination of Wu(2020), Khoury, Kumar and Cai teaches the method of claim 1, and Wu(2020) teaches wherein the trained self-supervised model is trained on a first training dataset comprising only training samples of human-originated speech(Pg 3, 4. Experiment, 4.1. Experiment setup, Para 2, Ln 4-5, where we pre-train our Mockingjay model on 360 hours of speech on the LibriSpeech dataset [36]).

Regarding Claim 12:
Wu(2020) teaches receiving audio data characterizing speech in audio data(Pg 2, Fig 1, (b), (picture of audio signal) -> mockingjay -> Anti-spoofing model -> Spoofing or non-spoofing. Pg 1, Introduction, Ln 1-6, Automatic speaker verification, …. Speech… ..unprotected ASV models are highly vulnerable to spoofing); 
generating using a trained self-supervised model, a plurality of audio feature vectors each representative of audio features of a portion of the audio data(Pg 2-3, 3, Proposed method, 3.1. Mockingjay, Para 1, Ln 1-3, The Mockingjay [30] approach learns representations of speech by solving a self-supervised masked-prediction task….. The model. & Ln 5-6, The transformer encoder produces a representation vector for each time frame. & Ln 8-10, the masked-prediction task requires the model to take a sequence of frames as input. & Ln 12-13, After training, the representations produced by the transformer network are inputs to the anti-spoofing model); 
generating using a discriminator model, a score indicating a presence of synthetic speech in the audio data based on the corresponding audio features of each audio feature vector of the plurality of audio feature vectors(Pg 2-3, 3, Proposed method, 3.1. Mockingjay, Para 1, Ln 12-13, After training, the representations produced by the transformer network are inputs to the anti-spoofing model. Pg 2, Fig 1, (b), Anti-spoofing model -> Spoofing or non-spoofing. Pg 3, 4. Experiment, 4.1 Experiment setup, Para 2, Ln 5-8, Two high-performance anti-spoofing models are adopted: LCNN [19] and SENet [20]); 
determining that the speech in the audio data obtained by the user device comprises synthetic speech(Pg 2, Fig 1, (b), (picture of audio signal) -> mockingjay -> Anti-spoofing model -> Spoofing or non-spoofing. Pg 1, Introduction, Ln 1-6, Automatic speaker verification, …. Speech… ..unprotected ASV models are highly vulnerable to spoofing. The audio is speech).
Wu(2020) does not teach a system comprising: data processing hardware; and memory hardware in communication with the data processing hardware, the memory hardware storing instructions that when executed on the data processing hardware cause the data processing hardware to perform operations comprising: 
obtained by a user device;
determining whether the score satisfies a synthetic speech detection threshold; 
and when the score satisfies the synthetic speech detection threshold.
In the same field of Anti-spoofing models, Khoury teaches a system comprising: data processing hardware; and memory hardware in communication with the data processing hardware, the memory hardware storing instructions that when executed on the data processing hardware cause the data processing hardware to perform operations comprising(Para [0061], Ln 1-6, The present invention generally relates to an apparatus for performing the operations described herein. This apparatus may….. GPU… comprise a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer): 
obtained by a user device(Para [0045], Ln 4, A spoofing call, using …a computer microphone… or other unexpected audio captured device);
determining whether the score satisfies a synthetic speech detection threshold(Para [0030], Ln 10-13, A binary classifier 130 may compare the resulting classification with a predetermined threshold score, resulting in a determination that the voice sample or audio source is "genuine" or "fraudulent"); 
and when the score satisfies the synthetic speech detection threshold(Para [0030], Ln 10-13, A binary classifier 130 may compare the resulting classification with a predetermined threshold score, resulting in a determination that the voice sample or audio source is "genuine" or "fraudulent").
	It would have been obvious for one skilled in the art, at the effective time of filing, to modify Wu(2020) with the computer hardware, voice input device, and binary classification score threshold of Khoury, as the computer hardware provides an environment for the system to be executed(Para [0061], Ln 1-2), the voice input device provides the system with speech input to operate on(Para [0045], Ln 2-6), and the binary classification score threshold enables the system to output the final decision of “genuine” or “spoofed”(Para [0035], Ln 10-12).
The combination of Wu(2020) and Khoury does not teach a shallow discriminator model.
In the same field of Anti-Spoofing models, Kumar teaches a shallow discriminator model(Pg 1, Introduction, Para 3, Ln 10-11, The network was made shallow since this is binary classification problem with limited data).
It would have been obvious for one skilled in the art, at the effective time of filing, to modify the combination of Wu(2020) and Khoury with the shallow discriminator model of Kumar, because of its performance improvements on a binary classification problem with limited data(Pg 1, Introduction, Para 3, Ln 10-13).
The combination of Wu(2020), Khoury and Kumar does not teach discriminator model comprising a single intelligent pooling layer and a single fully-connected layer.
In the same field of Anti-spoofing models, Cai teaches discriminator model comprising a single intelligent pooling layer and a single fully-connected layer(Pg 2, Fig 1, and Pg 3, Table 1, GAP Layer and Output Fully-connected Layer. Pg 2, Para 6, Ln 1-2, fixed-dimensional utterance-level representation V = [v1, v2, · · · , vC ] from the output of GAP. & Pg 2, Para 7, Ln 1-4(next column), further process the utterance-level representation through a fully-connected feed-forward network and build an output layer on top. The two units in the output layer are represented as bona fide and spoof categories).
It would have been obvious for one skilled in the art, at the effective time of filing, to modify the combination of Wu(2020), Khoury and Kumar with the output layer and Global Average Pooling (GAP) layer of Cai, as the output layer produces a score for the possible classes of the utterance, identifying whether the input is spoofed, and helps increase the performance of the system(Pg 2, Para 7, Ln 3-4 bona fide and spoof categories. & Ln 5-6, the final utterance-level score can be directly fetched from the DNN output. Abstract, Ln 17-20), and the GAP layer allows the whole sequence to be aggregated together over time(Pg 2, Para 5(4th whole), Ln 3-4).

Regarding Claim 14:
Claim 14 contains similar limitations as Claim 3, and is therefore rejected for the same reasons.

Regarding Claim 15:
	Claim 15 contains similar limitations as Claim 4, and is therefore rejected for the same reasons.

Regarding Claim 16:
	Claim 16 contains similar limitations as Claim 5, and is therefore rejected for the same reasons.

Regarding Claim 17:
	Claim 17 contains similar limitations as Claim 6, and is therefore rejected for the same reasons.

Regarding Claim 19:
Claim 19 contains similar limitations as Claim 8, and is therefore rejected for the same reasons.

Claim(s) 7 and 18 is/are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Wu(2020), Khoury, Kumar and Cai as applied to claim 1 above, and further in view of Wang (WO 2021075063 A1).

Regarding Claim 7:
	The combination of Wu(2020), Khoury, Kumar and Cai teaches the method of claim 1, but does not teach wherein the shallow discriminator model comprises one of a logistic regression model, a linear discriminant analysis model, or a random forest model.
	In the same field of Anti-spoofing models, Wang teaches the shallow discriminator model comprises one of a logistic regression model, a linear discriminant analysis model, or a random forest model(Para [0023], Ln 1-6, spoofing detection,….NN evaluation unit 50 calculates the posterior of node “spoof” as the score…. NN evaluation unit 50 can also output hidden layers as a new feature set for the input audio. Then the feature set can be used together with any classifiers, such as…. probabilistic linear discriminant analysis (PLDA). Shallow is taught by combination with Kumar in Claim 1 above).
	It would have been obvious for one skilled in the art, at the effective time of filing, to modify the combination of Wu(2020), Khoury, Kumar and Cai with the linear discriminant analysis classifier of Wang, as it is a common, well-known classifier that can be used to pick an output class for the input(Para [0023], Ln 1-6, calculates….the score…Note that…..hidden layers as a new feature set…..can be used together with classifiers).

Regarding Claim 18:
	Claim 18 contains similar limitations as Claim 7, and is therefore rejected for the same reasons.

Claim(s) 9 and 20 is/are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Wu(2020), Khoury, Kumar and Cai as applied to claim 8 above, and further in view of Yamagishi et al. “ASVspoof 2019: Automatic Speaker Verification Spoofing and Countermeasures Challenge Evaluation Plan” hereinafter Yamagishi.

Regarding Claim 9:
The combination of Wu(2020), Khoury, Kumar and Cai teaches the method of claim 8, wherein the shallow discriminator model is trained on a second training dataset comprising training samples of synthetic speech, the second training dataset smaller than the first training dataset(Pg 3, 4.1. Experiment setup, para 1, Ln 1-4, we use the LA partition of the ASVspoof 2019 challenge, which contains fake audios generated by text to speech and voice conversion. The dataset is itself divided into three parts: training, development, and evaluation. Para 2, Ln 4-5, where we pre-train our Mockingjay model on 360 hours of speech on the LibriSpeech dataset [36]).
The combination of Wu(2020), Khoury, Kumar and Cai does not specifically teach the second training dataset smaller than the first training dataset.
In the same field of Anti-spoofing models, Yamagishi teaches the second training dataset smaller than the first training dataset(Yamagishi does not itself teach two data sets where the second is smaller; rather, Yamagishi teaches the size of the ASVspoof 2019 data that is used in Wu(2020). This shows that LibriSpeech is larger than the ASVspoof 2019 data set. See Yamagishi Pg 3, Table 1, number of utterances in training and development sets of the ASVspoof 2019 database. The duration of each utterance is in the order of one to two seconds. See Logical access (LA). ASVspoof is smaller because Training: Bona fide + Spoof = 25380 utterances of max length 2 seconds = 50760 seconds. 50760s < 360hrs of LibriSpeech. Yamagishi is from the official website for ASVspoof 2019 challenge).
It would have been obvious for one skilled in the art, at the effective time of filing, to modify the combination of Wu(2020), Khoury, Kumar and Cai with the ASVspoof 2019 dataset of Yamagishi, as it provides the Anti-spoofing model with data to be trained on(Yamagishi, Pg 3, 3 Data conditions, Para 2, ln 1-5), and Wu(2020) states that it uses that specific data set(Pg 3, 4.1. Experiment setup, para 1, Ln 1-4).
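
The size comparison relied upon for Claim 9 can be checked with a short calculation. The sketch below uses only the figures stated above (25380 training utterances, an upper bound of 2 seconds per utterance, and 360 hours of LibriSpeech); the variable names are illustrative, and the 2-second bound is the upper estimate used above rather than a measured duration.

```python
# Figures as stated in the rejection of Claim 9.
ASVSPOOF_LA_TRAIN_UTTERANCES = 25380  # bona fide + spoof, training set
MAX_UTTERANCE_SECONDS = 2             # upper bound used above for each utterance
LIBRISPEECH_HOURS = 360               # pre-training data reported by Wu(2020)

SECONDS_PER_HOUR = 3600

# Upper bound on the total duration of the ASVspoof 2019 LA training data.
asvspoof_seconds = ASVSPOOF_LA_TRAIN_UTTERANCES * MAX_UTTERANCE_SECONDS
asvspoof_hours = asvspoof_seconds / SECONDS_PER_HOUR

# LibriSpeech duration converted to the same unit.
librispeech_seconds = LIBRISPEECH_HOURS * SECONDS_PER_HOUR

print(f"ASVspoof upper bound: {asvspoof_seconds} s ({asvspoof_hours:.1f} h)")
print(f"LibriSpeech:          {librispeech_seconds} s ({LIBRISPEECH_HOURS} h)")
print(f"LibriSpeech is at least {librispeech_seconds / asvspoof_seconds:.1f}x larger")
```

Even at the 2-second upper bound, the ASVspoof training data amounts to about 14.1 hours, compared with 360 hours of LibriSpeech.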

Regarding Claim 20:
	Claim 20 contains similar limitations as Claim 9, and is therefore rejected for the same reasons.

Claim(s) 10 and 21 is/are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Wu(2020), Khoury, Kumar and Cai as applied to claim 1 above, and further in view of Vaquero et al. (US 20200279568 A1).

Regarding Claim 10:
The combination of Wu(2020), Khoury, Kumar and Cai teaches the method of claim 1, but does not teach wherein the data processing hardware resides on the user device.
In the same field of Anti-spoofing models, Vaquero teaches wherein the data processing hardware resides on the user device(Para [0076], Ln 1-5, In this embodiment, the smartphone 10 is provided with speaker recognition functionality, and with control functionality. Thus, the smartphone 10 is able to perform various functions in response to spoken commands from an enrolled user. Speaker recognition functionality is on the same device that receives the spoken input).
It would have been obvious for one skilled in the art, at the effective time of filing, to modify the combination of Wu(2020), Khoury, Kumar and Cai with the single-device system of Vaquero, as it removes the need for an additional device when the voice input is a command intended for the original device(Para [0076], Ln 8-20, Thus, certain embodiments of the invention relate to operation of a smartphone….. in which the speaker recognition functionality is performed in the device that is intended to carry out the spoken command. Certain other embodiments relate to systems in which the speaker recognition functionality is performed on a smartphone or other device, which then transmits the commands to a separate device).

Regarding Claim 21:
Claim 21 contains similar limitations as Claim 10, and is therefore rejected for the same reasons.

Claim(s) 11 and 22 is/are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Wu(2020), Khoury, Kumar and Cai as applied to claim 1 above, and further in view of Wu et al. “INCREASING COMPACTNESS OF DEEP LEARNING BASED SPEECH ENHANCEMENT MODELS WITH PARAMETER PRUNING AND QUANTIZATION TECHNIQUES” hereinafter Wu(2019).

Regarding Claim 11:
The combination of Wu(2020), Khoury, Kumar and Cai teaches the method of claim 1, wherein the trained self-supervised model comprises a representation model derived from a trained self-supervised model(Pg 2-3, 3, Proposed method, 3.1. Mockingjay, Para 1, Ln 1-3, The Mockingjay [30] approach learns representations of speech by solving a self-supervised masked-prediction task….. The model. & Ln 5-6, The transformer encoder produces a representation vector for each time frame. Pg 2, fig 1, (c) and (b), MockingJay -> Pre-trained-> (b)).
The combination of Wu(2020), Khoury, Kumar and Cai does not teach derived from a larger model.
In the same field of Speech models, Wu(2019) teaches derived from a larger model(Abstract, Ln 5-9, In this study, we propose a novel parameter pruning (PP) technique, which removes redundant channels in a neural network. In addition, a parameter quantization (PQ) technique was applied to reduce the size of a neural network. Pg 2, 3.1.3 Channel pruning, Ln 1-2, In our proposed parameter pruning (PP) technique, the pruning mechanism contains a retraining step. The model is a trained model before its size is reduced due to pruning).
It would have been obvious for one skilled in the art, at the effective time of filing, to modify the combination of Wu(2020), Khoury, Kumar and Cai with the pruning technique of Wu(2019), as it lowers the size of the model, with minimal effects on performance(Abstract, Ln 12-17).
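
For illustration only, the general idea of deriving a smaller model from a larger trained model by removing channels can be sketched as follows. This is a generic sketch and not the method of Wu(2019): ranking channels by the L1 norm of their weights is an assumed criterion chosen for simplicity, the retraining step that Wu(2019) describes is omitted, and the function name, weights, and keep ratio are hypothetical.

```python
def prune_channels(channels, keep_ratio):
    """Keep the channels with the largest L1 weight norm (assumed criterion).

    channels: list of per-channel weight lists from an already trained layer.
    Returns the indices of the kept channels and their weights.
    """
    if not 0 < keep_ratio <= 1:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, round(len(channels) * keep_ratio))
    norms = [sum(abs(w) for w in channel) for channel in channels]
    # Rank channel indices from largest to smallest norm.
    ranked = sorted(range(len(channels)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:n_keep])  # restore the original channel order
    return kept, [channels[i] for i in kept]

# Hypothetical trained layer with four channels of two weights each.
channels = [[0.9, -0.8], [0.01, 0.02], [0.5, 0.4], [-0.03, 0.0]]
kept, pruned = prune_channels(channels, keep_ratio=0.5)
print(kept)    # indices of the channels that survive pruning
print(pruned)  # weights of the smaller, derived layer
```

The pruned layer has half as many channels as the original, which is the sense in which the resulting model is derived from a larger one.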

Regarding Claim 22:
	Claim 22 contains similar limitations as Claim 11, and is therefore rejected for the same reasons.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ALEXANDER G MARLOW whose telephone number is (571)272-4536. The examiner can normally be reached Monday - Thursday 10:00 am - 8:00 pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Richemond Dorvil, can be reached on (571)272-7602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/ALEXANDER G MARLOW/           Assistant Examiner, Art Unit 2658            



/RICHEMOND DORVIL/           Supervisory Patent Examiner, Art Unit 2658