DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
Applicant's arguments filed 10/24/2022 have been fully considered but they are not persuasive.
Applicant argues that “It appears that neither one of Ma, Wang, or Qin, nor any of the other cited prior art documents disclose any details relating to the detection of a mimicked voice input, nor the diarizing of a second voice data into the constituent words of the voice signal and the training of a network using a composite voice signal and the diarized constituent words of the second voice data to determine whether a voice input signal is a mimicked voice input signal. It is not clear that Ma, Wang, Qin, or indeed any of the other cited prior art is concerned with the detection of a mimicked voice input, nor indeed the training of a neural network using a composite voice signal and the diarized words of an input second voice signal”.
Regarding applicant’s arguments, the examiner respectfully disagrees. Firstly, the examiner contends that prior art Wang is related to detecting abnormality of a caller. More specifically, as described in p. 0010, Wang is related to improving the accuracy of identifying a voice forgery. This is directly related to a mimicked voice signal. The claim fails to provide any detail about how the training of the network is unique to the determination of a mimicked voice input signal. Furthermore, Qin is related to the diarization of different meeting attendees in a meeting conference. Diarization is a common term in the art related to the partitioning or segmentation of a voice signal based on the identity of the speaker who is speaking at a particular moment. This is specifically described in Qin in p. 0090-0091. Finally, Ma is capable of bringing all of these features together. Ma provides the environment, such as shown in Fig. 2A, in which the system takes voice data from different speakers and, based on the voice features of each speaker and historical voice features, classifies the audio signal according to the identity of each speaker. Although Ma does not explicitly mention diarization, Ma provides for classification of voice signals into the identities of their speakers. The combination of these prior art references provides the language of the claim as detailed in the rejection below.
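For illustration only, the general meaning of diarization stated above (partitioning a voice signal by the identity of the speaker at each moment) can be sketched as follows. This is a minimal sketch under stated assumptions and is not a characterization of the method of Qin, Ma, or the claims: the function name `diarize`, the per-frame speaker labels, and the one-second frame length are all hypothetical, and the labels are assumed to have been produced already by some speaker classifier.

```python
from itertools import groupby

def diarize(frame_labels, frame_seconds=1.0):
    """Collapse per-frame speaker labels into (speaker, start, end) segments.

    frame_labels is a hypothetical input: one speaker label per fixed-length
    audio frame, assumed to come from an upstream speaker classifier.
    """
    segments = []
    index = 0
    for speaker, run in groupby(frame_labels):
        length = len(list(run))  # number of consecutive frames by this speaker
        start = index * frame_seconds
        end = (index + length) * frame_seconds
        segments.append((speaker, start, end))
        index += length
    return segments

# Hypothetical labels for six one-second frames spoken by speakers "A" and "B".
labels = ["A", "A", "B", "B", "B", "A"]
segments = diarize(labels)
```

Each resulting tuple names a speaker together with the start and end time of one uninterrupted turn, which is the sense of "segmentation based on the identity of the speaker" used in the discussion above.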

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 3-4, 10-11, 13-14 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Ma (US PG Pub 2020043471) in view of Qin (US PG Pub 20200349953) and further in view of Wang (US PG Pub 20200228648).

	As per claims 1 and 11, Ma discloses:
	A method and system for training a network to detect mimicked voice input, the method comprising:
	receiving first voice data comprising at least a voice signal of a first individual and a voice signal of a second individual (Ma; p. 0026 - the voice interaction device 100a may receive voice data of a plurality of users, and each piece of the voice data may be forwarded to the backend server 100b);
	combining the voice signal of the first individual and the voice signal of the second individual to create a composite voice signal (Ma; p. 0027 - the backend server 100b may perform clustering on all historical voice data to obtain a voice feature cluster 1, a voice feature cluster 2, a voice feature cluster 3, and a voice feature cluster 4, each voice feature cluster including at least one historical voice feature vector with a similar feature);
	receiving second voice data comprising at least another voice signal of the first individual (Ma; p. 0031 - the backend server 100b may keep receiving voice data sent by the voice interaction device 100a, to form more historical voice data. In order to ensure that the backend server 100b may continually find a new high-frequency user, the backend server 100b may perform re-clustering on the historical voice data regularly or quantitatively).
	Ma, however, fails to disclose diarizing the second voice data into the constituent words of the voice signal.
	Qin does teach diarizing the second voice data into the constituent words of the voice signal (Qin; p. 0062 - the conversion of the audio signals to text that is used in conjunction with speaker identification, and generation of a transcript that is diarized to identify speakers, are provided by meeting server 135. The functions performed by the server include the synchronization, recognition, fusion, and diarization functions).
	Therefore, it would have been obvious to one of ordinary skill in the art to modify the method and system of Ma to include diarizing the second voice data into the constituent words of the voice signal, as taught by Qin, in order to accurately attribute speech of each attendee in the transcript of a meeting (Qin; p. 0001).
	Furthermore, although Ma teaches the use of a neural network, Ma in view of Qin fails to explicitly disclose training the network using the composite voice signal and the diarized constituent words of the second voice data to determine whether a voice input signal is a mimicked voice input signal.
	Wang does teach training the network using the composite voice signal and the diarized constituent words of the second voice data to determine whether a voice input signal is a mimicked voice input signal (Wang; p. 0076-0078 - in a first-stage training, a single detection of a corresponding detection type is performed using a face classification detection model, a voiceprint classification detection model, a limb movement classification detection model, and/or a lip language classification detection model, and input data for a second stage is generated according to the acquired corresponding feature data; in a second-stage training, the feature data input at this stage is detected using a fully connected convolutional network, and a current training parameter of the two-stage neural network detection model is adjusted using a training result of this stage).
	Therefore, it would have been obvious to one of ordinary skill in the art to modify the method of Ma in view of Qin to include training the network using the composite voice signal and the second voice data to determine whether a voice input signal is a mimicked voice input signal, as taught by Wang, in order to efficiently detect when an Artificial Intelligence voice is imitating the voice of a person by detecting particular abnormalities in the audio because human instincts may not accurately identify a voice forgery (Wang; p. 0004-0008).
	And further, Ma in view of Wang fails to disclose diarizing the second voice data into the constituent words of the voice signal.
	Qin does teach diarizing the second voice data into the constituent words of the voice signal (Qin; p. 0062 - the conversion of the audio signals to text that is used in conjunction with speaker identification, and generation of a transcript that is diarized to identify speakers, are provided by meeting server 135. The functions performed by the server include the synchronization, recognition, fusion, and diarization functions).
	Therefore, it would have been obvious to one of ordinary skill in the art to modify the method and system of Ma and Wang to include diarizing the second voice data into the constituent words of the voice signal, as taught by Qin, in order to accurately attribute speech of each attendee in the transcript of a meeting (Qin; p. 0001).

	As per claims 3 and 13, Ma in view of Qin and Wang discloses:
	The method and system according to claims 1 and 11, wherein the first individual and the second individual are from a first household (Ma; p. 0090 - The smart speaker usually does not belong to a specific user but is jointly used by a plurality of users with a limited scale. For example, a number of users using a speaker device in a home usually does not exceed 10 persons. In addition, because family members are different in age and gender, etc., differences in their voiceprint characteristics are relatively obvious).

	As per claims 4 and 14, Ma in view of Qin and Wang discloses:
	The method and system according to claims 1 and 11, wherein the first individual is from a first household and the second individual is from a second household (Ma; p. 0090 - The smart speaker usually does not belong to a specific user but is jointly used by a plurality of users with a limited scale. For example, a number of users using a speaker device in a home usually does not exceed 10 persons. In addition, because family members are different in age and gender, etc., differences in their voiceprint characteristics are relatively obvious).

	As per claims 6 and 16, Ma in view of Qin and Wang discloses:
	The method and system according to claims 1 and 11, upon which claims 6 and 16 depend.
	And further, Qin does teach diarizing the first voice data (Qin; p. 0062 - the conversion of the audio signals to text that is used in conjunction with speaker identification, and generation of a transcript that is diarized to identify speakers, are provided by meeting server 135. The functions performed by the server include the synchronization, recognition, fusion, and diarization functions).
	Therefore, it would have been obvious to one of ordinary skill in the art to modify the method and system of Ma and Wang to include diarizing the first voice data, as taught by Qin, in order to accurately attribute speech of each attendee in the transcript of a meeting (Qin; p. 0001).

	As per claims 10 and 20, Ma in view of Qin and Wang discloses:
	The method and system according to claims 1 and 11, the method and system comprising matching like words from the voice signal of the first individual and a voice signal of the second individual (Ma; p. 0086 - Afterwards, the target clustering model parameter may be updated regularly or quantitatively. For example, 20 groups of wake-up word voice data (that is, sample voice data) containing actual identity labels (that is, sample user identity labels) of speakers, each group containing 10 speakers, and each speaker containing 10 pieces of wake-up word voice data. Wake-up word voice data of 7 speakers is randomly selected from each group as a training set, and wake-up word voice data of the remaining 3 speakers is used as a verification set. For each group of data, after an i-vector of the wake-up word voice data is extracted and dimensionality reduction is performed on the i-vector, the training set is used to train a DBSCAN clustering model to maximize the JC).


	Claims 2 and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Ma in view of Qin and Wang, and further in view of Zadeh (US PG Pub 20140201126).

	As per claims 2 and 12, Ma in view of Qin and Wang discloses:
	The method and system according to claims 1 and 11, upon which claims 2 and 12 depend.
	Ma in view of Qin and Wang, however, fails to disclose computing a cartesian product of the voice signals of the first voice data.
	Zadeh does teach computing a cartesian product of the voice signals of the first voice data (Zadeh; p. 0506 - The membership function of A, μ_A, may be elicited by asking a succession of questions of the form: To what degree does the number, a, fit your perception of A? Example: To what degree does 50 minutes fit your perception of about 45 minutes? The same applies to B. The fuzzy set, A, may be interpreted as the possibility distribution of X. The concept of a Z-number may be generalized in various ways. In particular, X may be assumed to take values in R^n, in which case A is a Cartesian product of fuzzy numbers; also see p. 0537, 0544, 0912 and 0968).
	Therefore, it would have been obvious to one of ordinary skill in the art to modify the method and system of Ma in view of Qin and Wang to include computing a cartesian product of the voice signals of the first voice data, as taught by Zadeh, in order to provide efficient correlation of data and pattern recognition between two sets of data (Zadeh; p. 0506 & p. 0182).
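For illustration only, the ordinary mathematical meaning of a Cartesian product, as applied to two sets of voice signals, can be sketched as follows. This is a minimal sketch under stated assumptions and is not a characterization of Zadeh or of the claimed computation: the function name `pair_signals` and the string identifiers standing in for voice signals are hypothetical.

```python
from itertools import product

def pair_signals(signals_a, signals_b):
    """Return every ordered pair (a, b) with a from signals_a and b from signals_b."""
    return list(product(signals_a, signals_b))

# Hypothetical identifiers standing in for recorded voice signals.
first_individual = ["a1", "a2"]
second_individual = ["b1", "b2", "b3"]
pairs = pair_signals(first_individual, second_individual)
```

The result contains one pair for every combination of a signal from the first set with a signal from the second set, so two signals and three signals yield six pairs.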

	Claims 5, 9, 15 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Ma in view of Qin and Wang, and further in view of Tsujikawa (US PG Pub 20170263257).

	As per claims 5 and 15, Ma in view of Qin and Wang discloses:
	The method and system according to claims 1 and 11, upon which claims 5 and 15 depend.
	Ma in view of Wang, however, fails to disclose wherein combining the voice signal of the first individual and the voice signal of the second individual comprises a superimposition operation.
	Tsujikawa does teach wherein combining the voice signal of the first individual and the voice signal of the second individual comprises a superimposition operation (Tsujikawa; p. 0030 - According to this configuration, the voices of the plurality of unspecified speakers are acquired, and the noise is acquired. The noise is superimposed onto the voices of the plurality of unspecified speakers. The unspecified speaker voice dictionary, which is used for generating the personal voice dictionary for identifying the speaker to be identified, is generated on the basis of the features of the voices of the plurality of unspecified speakers onto which the noise has been superimposed).
	Therefore, it would have been obvious to one of ordinary skill in the art to modify the method and system of Ma and Wang to include wherein combining the voice signal of the first individual and the voice signal of the second individual comprises a superimposition operation, as taught by Tsujikawa, in order to improve the accuracy of speaker identification (Tsujikawa; p. 0031).
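For illustration only, a superimposition of two signals in the general signal-processing sense (sample-by-sample addition) can be sketched as follows. This is a minimal sketch under stated assumptions and is not a characterization of Tsujikawa or of the claimed operation: the function name `superimpose`, the representation of a signal as a list of float samples at a common sample rate, and the zero-padding of the shorter signal are all hypothetical choices.

```python
def superimpose(signal_a, signal_b):
    """Add two sampled signals sample by sample, zero-padding the shorter one.

    Both inputs are assumed to be sequences of float samples taken at the
    same sample rate; that assumption is not drawn from any cited reference.
    """
    length = max(len(signal_a), len(signal_b))
    padded_a = list(signal_a) + [0.0] * (length - len(signal_a))
    padded_b = list(signal_b) + [0.0] * (length - len(signal_b))
    return [x + y for x, y in zip(padded_a, padded_b)]

# Hypothetical sample values for two short signals of unequal length.
composite = superimpose([1.0, 2.0, 3.0], [0.5, 0.5])
```

The composite has the length of the longer input, and each of its samples is the sum of the corresponding samples of the two inputs.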

	As per claims 9 and 19, Ma in view of Qin and Wang discloses:
	The method and system according to claims 1 and 11, upon which claims 9 and 19 depend.
	Ma in view of Wang, however, fails to disclose combining at least one of the voice signal of the first individual or the voice signal of the second individual with a reference voice signal.
	Tsujikawa does teach combining at least one of the voice signal of the first individual or the voice signal of the second individual with a reference voice signal (Tsujikawa; p. 0030 - According to this configuration, the voices of the plurality of unspecified speakers are acquired, and the noise is acquired. The noise is superimposed onto the voices of the plurality of unspecified speakers. The unspecified speaker voice dictionary, which is used for generating the personal voice dictionary for identifying the speaker to be identified, is generated on the basis of the features of the voices of the plurality of unspecified speakers onto which the noise has been superimposed).
	Therefore, it would have been obvious to one of ordinary skill in the art to modify the method and system of Ma and Wang to include combining at least one of the voice signal of the first individual or the voice signal of the second individual with a reference voice signal, as taught by Tsujikawa, in order to improve the accuracy of speaker identification (Tsujikawa; p. 0031).

	Claims 8 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Ma in view of Qin and Wang, and further in view of Seto (US PG Pub 20160240188).

	As per claims 8 and 18, Ma in view of Qin and Wang discloses:
	The method and system according to claims 1 and 11, upon which claims 8 and 18 depend.
	Ma in view of Qin and Wang, however, fails to disclose adjusting the tempo of at least one of the voice signal of the first individual and the voice signal of a second individual.
	Seto does teach adjusting the tempo of at least one of the voice signal of the first individual and the voice signal of a second individual (Seto; p. 0023 - An example of the processing method includes superimposition of an environmental noise expected in an environment where the speech recognition device is used, change of volume, change of speed, or a combination thereof, and the processing method may be any method as long as the method does not erase features of an utterance of a user. On the other hand, superimposition of a speech in which a voice of a person is mixed and change of a frequency are avoided).
	Therefore, it would have been obvious to one of ordinary skill in the art to modify the method and system of Ma in view of Qin and Wang to include adjusting the tempo of at least one of the voice signal of the first individual and the voice signal of a second individual, as taught by Seto, in order to provide a speech recognition device and a speech recognition method that automatically switch to a proper acoustic model without requiring a user to perform special operations such as registration and utterance of a word (Seto; p. 0006).

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. The prior art made of record and not relied upon includes:
	Enzinger (US PG Pub 20210193174) discloses that various voice phishing (vishing) detectors detect respective types of threats and can be used or activated individually or in various combinations. A tampering detector utilizes deep scattering spectra and shifted delta cepstra features to detect tampering in the form of voice conversion, speech synthesis, or splicing. A content detector predicts a likelihood that word patterns on an incoming voice signal are indicative of a vishing threat. A spoofing detector authenticates or repudiates a purported speaker based on comparison of voice profiles. The vishing detectors can be provided as an authentication service or embedded in communication equipment. Machine learning and signal processing aspects are disclosed, along with applications to mobile telephony and call centers (Enzinger; Abstract).
	Malik (US PG Pub 20210279427) discloses a method and system for automated voice casting that compares candidate voice samples from candidate speakers in a target language with a primary voice sample from a primary speaker in a primary language. Utterances in the audio samples of the candidate speakers and the primary speaker are identified and typed, and voice samples are generated that meet applicable utterance type criteria. A neural network is used to generate an embedding for the voice samples. A voice sample can include groups of different utterance types, with embeddings generated for each utterance group in the voice sample and then combined in a weighted form wherein the resulting embedding emphasizes selected utterance types. Similarities between embeddings for the candidate voice samples relative to the primary voice sample are evaluated and used to select a candidate speaker that is a vocal match (Malik; Abstract).
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Rodrigo A Chavez whose telephone number is (571)270-0139. The examiner can normally be reached Monday - Friday 9-6 ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Richemond Dorvil, can be reached at (571)272-7602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/RODRIGO A CHAVEZ/ Examiner, Art Unit 2658
/RICHEMOND DORVIL/ Supervisory Patent Examiner, Art Unit 2658