DETAILED ACTION
This communication is in response to the Application filed on 03 March 2020. Claims 1-20 are pending and have been examined.
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
The two information disclosure statements (IDS) submitted on 03 March 2020 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.

Compact Prosecution
In the interest of compact prosecution, the examiner suggests that the applicant amend the independent claims so as to require using both the first subset and the second subset.

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claim(s) 1-2, 4-5, 10-13, and 16-18 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by US 9542948, hereinafter referred to as Roblek et al. 

Regarding claim 1, Roblek et al. discloses a computer-implemented method comprising: 

obtaining, using a hardware processor (Roblek et al., col. 18, lines 48-52), training data stored on one or more computer readable storage mediums (Roblek et al., col. 18, lines 58-62), the training data including a plurality of utterances of a plurality of speakers (“During stage (A), the computing system 120 obtains a set of training utterances 122,” Roblek et al., col. 4, lines 38-39.); and 

performing a plurality of tasks to train a machine learning model that converts an utterance of the plurality of utterances into a feature vector, each task using one of a plurality of subsets of training data (“As refers to in this Specification, a text-dependent speaker verification task refers to a computation task where a user speaks specific words or phrase that is predetermined. In other words, the input used for verification may be a predetermined word or phrase expected by the speaker verification model. The speaker verification model 600 may be based on a neural network trained to classify training speakers with distinctive feature vectors. The trained neural network may be used to extract one or more speaker-specific feature vectors from one or more utterances. The speaker-specific feature vectors may be used for speaker verification, for example, to verify the identity of a previously enrolled speaker,” Roblek et al., col. 12, lines 2-14.), wherein the plurality of subsets of training data includes: 

a first subset of training data including utterances of a first number of speakers among the plurality of speakers (“During stage (A), the computing system 120 obtains a set of training utterances 122, and inputs the set of training utterances 122 to a supervised neural network 140. In some implementations, the training utterances 122 may be one or more predetermined words spoken by the training speakers that were recorded and accessible by the computing system 120,” Roblek et al., col. 4, lines 38-44.), and 

at least one second subset of training data, each second subset including utterances of a number of speakers among the plurality of speakers that is less than the first number of speakers among the plurality of speakers (No requirement to map under claim interpretation of requiring only one (i.e., first) subset of training data.).
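For illustration only, the subset arrangement recited in claim 1 can be sketched as below. This is a minimal sketch and is not taken from the application or from Roblek et al.; the function name `build_speaker_subsets`, its parameters, and the choice of selecting speakers in sorted order are all hypothetical.

```python
from collections import defaultdict


def build_speaker_subsets(utterances, first_count, second_count):
    """Split (speaker_id, utterance) pairs into two training subsets.

    Hypothetical helper: the first subset covers `first_count` speakers,
    the second covers `second_count` speakers, with second_count < first_count.
    """
    if second_count >= first_count:
        raise ValueError("second subset must cover fewer speakers than the first")

    # Group utterances by speaker, preserving their original order.
    by_speaker = defaultdict(list)
    for speaker_id, utterance in utterances:
        by_speaker[speaker_id].append(utterance)

    speakers = sorted(by_speaker)
    if first_count > len(speakers):
        raise ValueError("not enough distinct speakers in the training data")

    def subset(n):
        # Assumption: take the first n speakers in sorted order.
        return [(s, u) for s in speakers[:n] for u in by_speaker[s]]

    return subset(first_count), subset(second_count)


data = [("a", "u1"), ("b", "u2"), ("c", "u3"), ("a", "u4")]
first_subset, second_subset = build_speaker_subsets(data, first_count=3, second_count=1)
```

Each training task would then consume one of the returned subsets; how the tasks are scheduled is outside the scope of this sketch.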

As to claim 11, product claim 11 and method claim 1 are related as method and product of using same, with each claimed element’s function corresponding to the method step. Accordingly claim 11 is similarly rejected under the same rationale as applied above with respect to method claim. And, Roblek et al., col. 18, lines 48-62, teach processor(s), CRM, computer code, and memory. 
As to claim 16, apparatus claim 16 and method claim 1 are related as method and apparatus of using same, with each claimed element’s function corresponding to the method step. Accordingly claim 16 is similarly rejected under the same rationale as applied above with respect to method claim. And, Roblek et al., col. 18, lines 48-62, teach processor(s), CRM, computer code, and memory. 

Regarding claim 2, Roblek et al. discloses the computer-implemented method of claim 1, wherein performing the plurality of tasks to train the machine learning model includes performing the plurality of tasks according to a multi-task training technique (“FIG. 6 is a block diagram of an example speaker verification model 600 for verifying the identity of an enrolled user. As discussed above, a neural network-based speaker verification method may be used for a small footprint text-dependent speaker verification task. As refers to in this Specification, a text-dependent speaker verification task refers to a computation task where a user speaks specific words or phrase that is predetermined. In other words, the input used for verification may be a predetermined word or phrase expected by the speaker verification model. The speaker verification model 600 may be based on a neural network trained to classify training speakers with distinctive feature vectors. The trained neural network may be used to extract one or more speaker-specific feature vectors from one or more utterances. The speaker-specific feature vectors may be used for speaker verification, for example, to verify the identity of a previously enrolled speaker,” Roblek et al., col. 11, line 65 – col. 12, line 14. The speaker verification on multiple speakers is a multi-task training technique.).  

Regarding claim 10, Roblek et al. discloses the computer-implemented method of claim 1, wherein the utterances of the first subset of training data are obtained by combining two or more audio recordings (“Each training speaker may speak a predetermined utterance to a computing device, and the computing device may record an audio signal that includes the utterance. For example, each training speaker may be prompted to speak the training phrase “Hello Phone.” In some implementations, each training speaker may be [...] The recorded audio signal of each training speaker may be transmitted to the computing system 120, and the computing system 120 may collect the recorded audio signals and select the set of training utterances 122. In other implementations, the various training utterances 122 may include utterances of different words,” Roblek et al., col. 4, lines 44-56.).

Regarding claim 4, Roblek et al. discloses the computer-implemented method of claim 1, wherein 

the machine learning model includes a first model for converting an utterance of the plurality of utterances into a feature vector (“In certain aspects, generating the set of labeled pairs of feature vectors includes inputting speech data that corresponds to a first utterance spoken by the particular speaker to the first neural network, in response to inputting the speech data that corresponds to the first utterance spoken by the particular speaker to the first neural network, determining a first feature vector based on output at the hidden layer of the first neural network, inputting speech data that corresponds to a second utterance spoken by the particular speaker to the first neural network, in response to inputting the speech data that corresponds to the second utterance spoken by the particular speaker to the first neural network, determining a second feature vector based on output at the hidden layer of the first neural network, and labeling the first feature vector and the second feature vector with an indication that the second neural network is to output that the utterances corresponding to the first feature vector and the second feature vector were likely spoken by the same speaker,” Roblek et al. [...]; “For example, the neural network 140 may include an input layer for inputting information about the training utterances 122, several hidden layers for processing the training utterances 122, and an output layer for providing output. The weights or other parameters of one or more hidden layers may be adjusted so that the trained neural network produces the desired target vector corresponding to each training utterance 122,” Roblek et al., col. 5, lines 4-11.) and a second model for identifying a speaker of the plurality of speakers from a feature vector (Roblek et al., fig. 8, is a block diagram of an example of speaker verification using an evaluation vector similarity model.), and 

each utterance of the plurality of utterances in the training data is paired with an identification of a speaker of the plurality of speakers corresponding thereto (“During stage (C), the computing system 120 obtains labeled pairs of feature vectors 126, and inputs the labeled pairs of feature vectors 126 to a second supervised neural network 141. The labeled pairs of feature vectors may represent characteristics of voices of multiple different speakers. In some implementations, the labeled pairs of feature vectors 126 may be outputs from inputting speech data corresponding to utterances from multiple different speakers to the speaker verification model 144,” Roblek et al., col. 4, lines 27-35.).  
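As to claim 12, product claim 12 and method claim 4 are related as method and product of using same, with each claimed element’s function corresponding to the method step. Accordingly claim 12 is similarly rejected under the same rationale as applied above with respect to method claim. And, Roblek et al., col. 18, lines 48-62, teach processor(s), CRM, computer code, and memory. 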
As to claim 17, apparatus claim 17 and method claim 4 are related as method and apparatus of using same, with each claimed element’s function corresponding to the method step. Accordingly claim 17 is similarly rejected under the same rationale as applied above with respect to method claim. And, Roblek et al., col. 18, lines 48-62, teach processor(s), CRM, computer code, and memory. 

Regarding claim 5, Roblek et al. discloses the computer-implemented method of claim 4, further comprising producing, using the hardware processor, a converter that converts an utterance of a speaker of the plurality of speakers into a feature vector by training the first model (Roblek et al., col. 2, lines 12-31).  
As to claim 13, product claim 13 and method claim 5 are related as method and product of using same, with each claimed element’s function corresponding to the method step. Accordingly claim 13 is similarly rejected under the same rationale as applied above with respect to method claim. And, Roblek et al., col. 18, lines 48-62, teach processor(s), CRM, computer code, and memory. 
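As to claim 18, apparatus claim 18 and method claim 5 are related as method and apparatus of using same, with each claimed element’s function corresponding to the method step. Accordingly claim 18 is similarly rejected under the same rationale as applied above with respect to method claim. And, Roblek et al., col. 18, lines 48-62, teach processor(s), CRM, computer code, and memory. 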


Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 6-7, 14, and 19 is/are rejected under 35 U.S.C. 103 as being unpatentable over US 9542948, hereinafter referred to as Roblek et al., in view of “Ensemble Additive Margin Softmax for Speaker Verification”, hereinafter referred to as Yu et al. 

Regarding claim 6, Roblek et al. discloses the computer-implemented method of claim 4, but not wherein performing the plurality of tasks includes using a value of at least one hyperparameter of the task using the first subset of training data that is different from the value of the at least one hyperparameter of the task using the at least one second subset of training data. Yu et al. is cited to disclose wherein performing the plurality of tasks includes using a value of at least one hyperparameter of the task using the first subset of training data that is different from the value of the at least one hyperparameter of the task using the at least one second subset of training data (Yu et al., sec. 3.1, 3rd para). Yu et al. benefits Roblek et al. by incorporating the AM-Softmax loss function to improve speaker verification (Yu et al., Abstract). Therefore, it would be obvious for one skilled in the art to combine the teachings of Roblek et al. with those of Yu et al. to improve the speaker identification of Roblek et al. 
As to claim 14, product claim 14 and method claim 6 are related as method and product of using same, with each claimed element’s function corresponding to the method step. Accordingly claim 14 is similarly rejected under the same rationale as applied above with respect to method claim. And, Roblek et al., col. 18, lines 48-62, teach processor(s), CRM, computer code, and memory. 
As to claim 19, apparatus claim 19 and method claim 6 are related as method and apparatus of using same, with each claimed element’s function corresponding to the method step. Accordingly claim 19 is similarly rejected under the same rationale as applied above with respect to method claim. And, Roblek et al., col. 18, lines 48-62, teach processor(s), CRM, computer code, and memory. 

Regarding claim 7, Roblek et al., as modified by Yu et al., discloses the computer-implemented method of claim 6, wherein the at least one hyperparameter is a margin of loss function (Yu et al., sec. 3.1, 3rd para).  
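For illustration only, the role of a loss-function margin as a per-task hyperparameter can be sketched with the standard additive-margin softmax form, in which the margin is subtracted from the target-class cosine before scaling. This is a minimal sketch and is not reproduced from Yu et al.; the function name `am_softmax_loss`, the scale of 30.0, the two margin values, and the example cosines are assumptions chosen only to show that two tasks may use different margin values.

```python
import math


def am_softmax_loss(cosines, target, margin, scale=30.0):
    """Additive-margin softmax loss for a single example.

    cosines: cosine similarity between the embedding and each class weight.
    target:  index of the true class (speaker).
    margin:  additive margin m subtracted from the target cosine.
    scale:   scaling factor s applied to every logit (assumed value).
    """
    logits = [
        scale * (c - margin) if j == target else scale * c
        for j, c in enumerate(cosines)
    ]
    # Numerically stable log-sum-exp over all classes.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    # Negative log-probability of the target class.
    return log_sum - logits[target]


cosines = [0.8, 0.3, 0.1]  # hypothetical similarities for three speakers
loss_first_task = am_softmax_loss(cosines, target=0, margin=0.35)
loss_second_task = am_softmax_loss(cosines, target=0, margin=0.10)
```

With the same input, the task using the larger margin yields the larger loss, so the margin value directly controls how strictly each task separates speakers.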


Claim 8 is/are rejected under 35 U.S.C. 103 as being unpatentable over US 9542948, hereinafter referred to as Roblek et al., in view of US 20180293988, hereinafter referred to as Huang et al.

Regarding claim 8, Roblek et al. discloses the computer-implemented method of claim 1, but not wherein the utterances of the second subset of training data are recorded in a substantially similar acoustic environment. Huang et al. is cited to disclose wherein the utterances of the second subset of training data are recorded in a substantially similar acoustic environment (“For example, the corpus of speech samples may include a large corpus of speech samples corresponding to multiple speakers (e.g., 30-40 or more speakers) from a diverse population in terms of gender, ethnicity, language, etc., but also may include emotional state (where anger may change voice inflections by one example), health state (where coughing sneezing, raspy voice, and so forth may affect audio), or any other factor that could affect the prediction accuracy. Furthermore, the corpus of speech samples may include several training and test utterances. In an implementation, the corpus of speech samples may be recorded in a clean lab environment with high quality microphones such that there is minimal ambient noise,” Huang et al., para [0082].). Huang et al. benefits Roblek et al. by considering the actual run-time noisy acoustic environment in which the audio was captured in order to allow accurate results where true speakers are provided access to things locked by speaker verification while imposters are denied access (Huang et al., para [0002]). Therefore, it would be obvious for one skilled in the art to combine the teachings of Roblek et al. with those of Huang et al. to improve the speaker identification of Roblek et al.


Claims 9, 15, and 20 is/are rejected under 35 U.S.C. 103 as being unpatentable over US 9542948, hereinafter referred to as Roblek et al., in view of US 20180254051, hereinafter referred to as Church et al.

Regarding claim 9, Roblek et al. discloses the computer-implemented method of claim 1, but not wherein the utterances of the second subset of training data are obtained from a single continuous recording. Church et al. is cited to disclose wherein the utterances of the second subset of training data are obtained from a single continuous recording (“In one or more embodiments, the speaker clustering module 208 can utilize a speaker recognition engine to label or assign roles to the different speakers for the audio conversations in the audio data 202. In the call center example, the speaker recognition engine can start with a training set of k=10 audio conversations where there is a single agent that speaks on all k calls, and there are k different customers that speak on each of the k calls. Two-speaker diarization 206 is applied to each of the k calls. Speaker models or speaker representations (such as I-vectors) are trained on all clusters to produce a total of 20 models (or, in this case, i-vectors). Using agglomerative clustering, the 10 closest models are found after a constraint is considered. A constraint, for example, can be that only one i-vector from each call can be assigned in the 10 closest models group. This i-vector representation is used to directly detect (using speaker recognition techniques across a database of conversations) which speaker is the agent in the diarized text files,” Church et al., para [0021]. Here, conversation recordings are captured which include utterances from multiple speakers. This is in accordance with para [0017] of the applicant’s specification.). Church et al. benefits Roblek et al. by incorporating speaker diarization (Church et al., para [0002]) into the speaker identification method of Roblek et al. Therefore, it would be obvious for one skilled in the art to combine the teachings of Church et al. with those of Roblek et al. to allow Roblek et al. to answer “who spoke when?”.
As to claim 15, product claim 15 and method claim 9 are related as method and product of using same, with each claimed element’s function corresponding to the method step. Accordingly claim 15 is similarly rejected under the same rationale as applied above with respect to method claim. And, Roblek et al., col. 18, lines 48-62, teach processor(s), CRM, computer code, and memory. 
As to claim 20, apparatus claim 20 and method claim 9 are related as method and apparatus of using same, with each claimed element’s function corresponding to the method step. Accordingly claim 20 is similarly rejected under the same rationale as applied above with respect to method claim. And, Roblek et al., col. 18, lines 48-62, teach processor(s), CRM, computer code, and memory. 

Allowable Subject Matter
Claim 3 is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Conclusion
Other related prior art is listed in the attached PTO-892. Of particular interest is Wang et al., which describes two sets of training data.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ANNE L THOMAS-HOMESCU whose telephone number is (571)272-0899.  The examiner can normally be reached on Mon-Fri 8-6.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached at (571)272-7453.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.


/ANNE L THOMAS-HOMESCU/Primary Examiner, Art Unit 2659