DETAILED ACTION

Introduction
This office action is in response to Applicant’s submission filed on 3/31/2021. Claims
1-20 are pending in the application. As such, claims 1-20 have been examined.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
The information disclosure statement (IDS) was submitted on 3/31/2021. The submission is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Drawings
The drawings filed on 3/31/2021 are accepted and considered by the Examiner.


Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim 1 is rejected under 35 U.S.C. 103 as being unpatentable over Chen et al. (US Patent Application Publication No: US 20220122615 A1) hereinafter as Chen, in view of Parada et al. (US Patent Application Publication No: US 20180075860 A1) hereinafter as Parada, and further in view of Lesso (US Patent Application Publication No: US 20200075028 A1) hereinafter as Lesso.
Regarding claim 1, Chen discloses: A method, comprising: receiving, by a device, ([0120] The apparatus 900 may comprise at least one processor 910 and a memory 920 storing computer-executable instructions.)
audio data identifying a conversation including a plurality of speakers; ([0001] Speaker diarization is a process of partitioning an input audio stream into a number of portions corresponding to different speakers respectively. Speaker diarization aims to determine “who spoke when?” within an audio stream, and time intervals during which each speaker is active. Through applying speaker diarization to an audio stream involving speeches from multiple speakers, it may be determined that each speech utterance in the audio stream is spoken by which speaker, or what speech utterances have been spoken by each speaker.)
processing, by the device, the audio data, with a plurality of clustering models, to identify a plurality of speaker segments; ([0024] At 130, the speech segments obtained at 120 may be clustered into a plurality of clusters. Through the clustering operation at 130, the speech segments may be merged based on similarity, such that there is a one-to-one correspondence between the resulted clusters and the speakers. Firstly, speaker feature vectors or embedding vectors may be obtained for the speech segments, e.g., i-vectors, x-vectors, etc. Then speech similarity scoring may be performed with the speaker feature vectors among the speech segments. For example, the speech similarity scoring may be based on. e.g., probabilistic linear discriminant analysis (PLDA), Bayesian information criterion (BIC), generalized likelihood ratio (GLR), Kullback-Leibler divergence (KLD), etc. Thereafter, the speech segments may be merged based on similarity scores under a predetermined clustering strategy. e.g., agglomerative hierarchical clustering (AHC), etc. For example, those speech segments having high similarity scores among each other may be merged into a cluster.)
selecting, by the device, a rectification model to rectify each of the plurality of errors based on a cause of a corresponding one of the plurality of errors and based on features of a corresponding one of the plurality of speaker segments; ([0016] In some cases, the clusters may be further used for establishing or initializing a speaker classification model, e.g., a hidden Markov model (HMM), for refining frame alignment between the audio stream and the speakers. Since it is possible to miss or fail to detect some speaker change points during the speech segmentation. e.g., a speech segment may comprise speech utterances from different speakers, the resulted clusters will sometimes be impure in terms of speakers accordingly, e.g., a cluster may comprise speech utterances or speech segments from different speakers, thus reducing accuracy of speaker diarization. Moreover, the speaker classification model established based on such impure clusters would further result in inaccurate frame alignment.[0032] In some implementations, the process 200 may adopt speaker acoustic features, as the traditional speaker diarization does, for the following operations. e.g., segmentation, clustering, modeling, alignment, etc. In this case, a speaker acoustic feature vector may be extracted for each speech frame in the audio stream. The extracted speaker acoustic feature vectors of the speech frames in the audio stream may be further used for the following operations.) 
re-segmenting, by the device, the audio data with the rectification models to generate re-segmented audio data; ([0028] At 150, frame alignment may be performed through the speaker classification model. The frame alignment may also be referred to as frame re-segmentation. Speech frames in the audio stream may be provided to the speaker classification model, e.g., HMM, to align to respective HMM states of the HMM, and accordingly to align to respective speakers. The final result of the speaker diarization would be provided after the frame alignment.)
determining, by the device, a plurality of modified diarization error rates for the plurality of speaker segments based on the re-segmented audio data; ([0045] Through continuously performing the above updating operation of the speaker classification model along with aligning speech frames to different speakers, the speaker classification model may be continuously improved in terms of its performance of discriminating different speakers during the speaker diarization, and accordingly the accuracy of the speaker diarization is also improved.)
Chen does not explicitly, but Parada discloses: determining, by the device, a plurality of diarization error rates for the plurality of speaker segments; ([0042] At step 250, a microphone pair may be selected based on the confidence measure. For example, the pair with the lowest uncertainty may be selected. This may comprise a microphone pair that yields the lowest Diarization Error Rate (DER) in the model. This method is described further below with regards to “Alignment” and “Channel selection.”)
identifying, by the device, a plurality of errors in the plurality of speaker segments based on comparing each of the plurality of diarization error rates to one or more thresholds; (Fig. 17 shows a table of error rates, including missed speech, false alarm, speaker error and DER.  Also see [0146].)
Chen and Parada are considered analogous art because they are both in the related art of diarization. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Chen to combine the teaching of Parada, to incorporate the above mentioned claim limitations, because the combination of the disclosures would allow audio sources to be identified for each audio signal based on the clustering, and the audio signals to be segmented (Parada, background/summary).
Chen in view of Parada does not explicitly, but Lesso discloses: selecting, by the device, a speaker segment, of the plurality of speaker segments, based on the plurality of modified diarization error rates; ([0038] Obtaining information about times at which there may have been a speaker change, based on the speaker recognition scores obtained from the biometric process, may comprise: [0039] examining successive speaker recognition scores for the sections of the audio signal in order to determine a series of difference values between pairs of consecutive speaker recognition scores for the successive sections of the audio signal; and [0040] determining that there may have been a speaker change when one of said difference values exceeds a threshold value.)
and performing, by the device, one or more actions based on the speaker segment. ([0077] FIG. 3 illustrates a possible attack on a system using speech recognition, and specifically a voice assistant system that allows an enrolled user to speak commands that the system will act upon. The voice assistant system uses speaker recognition to ensure that a command was spoken by the enrolled user, before it acts upon the command.)
Chen, Parada and Lesso are considered analogous art because they are in the related art of diarization and/or speaker recognition.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Chen, in view of Parada, to combine the teaching of Lesso, to incorporate the above mentioned claim limitations, because the combination of the disclosures would allow a user to issue voice commands, causing the system to perform some action, or retrieve some requested information (Lesso, background/summary).
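For illustration only, and not as a characterization of any cited reference's implementation, the comparison of diarization error rates to a threshold recited in claim 1 can be sketched as follows. The sketch assumes the conventional definition of DER as the sum of missed speech, false alarm, and speaker error time divided by total reference speech time (the same components tabulated in Parada, Fig. 17). The function names, the per-segment tuple layout, and the threshold value are hypothetical.

```python
def diarization_error_rate(missed, false_alarm, speaker_error, total_speech):
    """Conventional DER: misattributed time divided by total reference speech time."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + speaker_error) / total_speech


def flag_segments(segment_errors, threshold):
    """Return ids of segments whose DER exceeds a (hypothetical) threshold."""
    flagged = []
    for seg_id, (missed, false_alarm, speaker_error, total) in segment_errors.items():
        der = diarization_error_rate(missed, false_alarm, speaker_error, total)
        if der > threshold:
            flagged.append(seg_id)
    return flagged


# Hypothetical per-segment error times in seconds:
# (missed speech, false alarm, speaker error, total reference speech)
segments = {
    "seg_a": (0.5, 0.2, 0.3, 10.0),
    "seg_b": (1.0, 0.5, 2.5, 10.0),
}
flagged = flag_segments(segments, threshold=0.25)
```

Under these assumed values, only the second segment has a DER above the threshold and would be identified as containing an error.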

Claim 2 is rejected under 35 U.S.C. 103 as being unpatentable over Chen, in view of Parada, further in view of Lesso, and furthermore in view of (Bassiou, N., Moschou, V., & Kotropoulos, C. (2010). Speaker diarization exploiting the eigengap criterion and cluster ensembles. IEEE transactions on audio, speech, and language processing, 18(8), 2134-2144.) hereinafter as Bassiou.
	Regarding claim 2, Chen in view of Parada, and further in view of Lesso discloses: The method of claim 1, 
Chen further discloses: an agglomerative clustering model, ([0024]  e.g., agglomerative hierarchical clustering (AHC), etc.)
	Chen in view of Parada, and further in view of Lesso does not explicitly, but Bassiou discloses: wherein the plurality of clustering models includes: a k-means clustering model, a spectral clustering model, ([Sect 1, introduction] Many speaker clustering methods have been developed ranging from hierarchical ones, such as the bottom-up (or agglomerative) methods and the top-down (or divisive) ones, to optimization methods including the k-means algorithm [10] or the autoassociative neural networks [11] to mention a few. ... The proposed method exploits concepts from spectral graph theory and cluster ensembles in speaker diarization.)
	and an ensemble model. ([sect III, part c, clustering module] Thus, it is necessary to use multiple algorithms in order to reveal the natural groupings of the data [40]. Cluster ensembles are collections of clusterings, which are of the same “kind,” e.g., collections of partitions or collections of hierarchies [41]. Here, we are interested in collections of partitions. They combine the different partitions in order to improve the clustering performance, as is assessed in Section V-B, and increase the robustness to outliers.)
Chen, Parada, Lesso, and Bassiou are considered analogous art because they are in the related art of diarization and/or speaker recognition.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Chen, in view of Parada, further in view of Lesso, to combine the teaching of Bassiou, to incorporate the above mentioned claim limitations, because the combination of the disclosures would improve clustering quality (Bassiou, Sect VI, discussion and conclusion).

Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Chen, in view of Parada, further in view of Lesso, and furthermore in view of Gorodetski et al. (US Patent Application Publication No: US 20160217793 A1) hereinafter as Gorodetski.
	Regarding claim 4, Chen in view of Parada, and further in view of Lesso discloses: The method of claim 1, 
	Chen additionally discloses: wherein processing the audio data, with the plurality of clustering models, to identify the plurality of speaker segments comprises: processing the audio data, with one or more machine learning models, to extract features from the audio data; ([0034] According to an exemplary process of extracting speaker bottleneck features, for each speech frame in the audio steam, a speaker acoustic feature of the speech frame may be extracted, and then a speaker bottleneck feature of the speech frame may be generated based on the speaker acoustic feature through a neural network. e.g., deep neural network (DNN). The DNN may be trained to classify among a number of N speakers with the loss function to be cross-entropy. The DNN may comprise an input layer, a plurality of hidden layers, an output layer, etc.)
	Parada additionally discloses: processing the features, with the plurality of clustering models, to generate a list of class labels; (See fig. 11 where there is a table for class label. [0124] In one example, the simulated room 900 may comprise a doctor's office. Audio source 910 may be a doctor and another audio source, such as audio source 940, may be a patient. Microphones may be distributed around the room 900. In this example, to achieve a robust ASR performance, the best microphone channel may be selected to use for ASR. Received audio may be segmented to distinguish between the doctor's speech and the patient's speech. The methods described above may be used to detect the most reliable microphone pair and to robustly segment the audio, labeling it with who spoke and when. This method may not require knowledge of the microphone positions.)
	Chen in view of Parada, and further in view of Lesso does not explicitly, but Gorodetski discloses: generating timestamp segments based on the list of class labels; ([0036] In addition to the transcription 106 from the STT server 104, STT server 104 may also output time stamps associated with particular transcription segments, words, or phrases, and may also include a confidence score in the automated transcription. The transcription 106 may also identify homogeneous speaker speech segments. Homogenous speech segments are those segments of the transcription that have a high likelihood of originating from a single speaker.)
	and identifying the plurality of speaker segments based on the timestamp segments. ([0036] In addition to the transcription 106 from the STT server 104, STT server 104 may also output time stamps associated with particular transcription segments, words, or phrases, and may also include a confidence score in the automated transcription. The transcription 106 may also identify homogeneous speaker speech segments. Homogenous speech segments are those segments of the transcription that have a high likelihood of originating from a single speaker.)
Chen, Parada, Lesso, and Gorodetski are considered analogous art because they are in the related art of diarization and/or speaker recognition.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Chen, in view of Parada, further in view of Lesso, to combine the teaching of Gorodetski, to incorporate the above mentioned claim limitations, because the combination of the disclosures would provide an acoustic signature for a common speaker based only on statistical models of the speakers (Gorodetski, background).
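For illustration only, the step of generating timestamp segments from a list of class labels, as mapped above for claim 4, can be sketched as collapsing runs of identical per-frame labels into (start, end, label) tuples. This is a minimal sketch and not the method of Parada or Gorodetski; the function name, the fixed frame duration, and the example labels are assumptions introduced here.

```python
from itertools import groupby


def labels_to_segments(labels, frame_seconds=0.01):
    """Collapse runs of identical per-frame labels into (start, end, label) tuples.

    frame_seconds is a hypothetical fixed frame duration; start and end are in seconds.
    """
    segments = []
    start = 0  # index of the first frame in the current run
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((start * frame_seconds, (start + length) * frame_seconds, label))
        start += length
    return segments


# Hypothetical frame-level class labels for two speakers, 0.5 s per frame.
frame_labels = ["A", "A", "B", "B", "B", "A"]
timestamp_segments = labels_to_segments(frame_labels, frame_seconds=0.5)
```

Each resulting tuple corresponds to a contiguous span attributed to a single class label, which is the sense in which speaker segments can be identified from timestamp segments.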
Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Chen, in view of Parada, further in view of Lesso, furthermore in view of Gorodetski, and furthermore in view of Provost et al. (US Patent Application Publication No: US 20200075040 A1) hereinafter as Provost.
	Regarding claim 5, Chen in view of Parada, further in view of Lesso, and furthermore in view of Gorodetski discloses: The method of claim 4, 
	Chen in view of Parada, further in view of Lesso, and furthermore in view of Gorodetski does not explicitly, but Provost discloses: wherein the features of the audio data include Mel-frequency cepstral (MFC) coefficients and first order MFC coefficients. ([0043] Mel Frequency Cepstral Coefficients (MFCC) may be extracted from the audio data, as shown in bubble 212. For example, the first 13 MFCCs and first-order delta-coefficients may be extracted. For speaker verification, i-vectors (identity vectors) may be extracted from the MFCCs, as depicted in FIG. 2A. For emotion recognition, five statistics may be computed over the MFCCs and first-order deltas (e.g., mean, standard deviation, maximum, minimum, range), as shown in bubble 214.)
Chen, Parada, Lesso, Gorodetski, and Provost are considered analogous art because they are in the related art of diarization and/or speaker recognition and/or audio/emotion analysis.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Chen, in view of Parada, further in view of Lesso, and furthermore in view of Gorodetski, to combine the teaching of Provost, to incorporate the above mentioned claim limitations, because the combination of the disclosures would enable measuring and interpreting the relationship between an individual's social environment and mood, in particular, emotion and/or mood recognition based on analysis of speech data included in audio signals (Provost, field of the disclosure).
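For illustration only, first-order delta coefficients of the kind referenced in Provost [0043] are commonly computed from a sequence of MFCC frames with a regression over neighboring frames. The sketch below assumes that common regression formula and edge handling by repeating the first and last frames; Provost does not specify a formula, so the function name, the window width, and the input values are hypothetical.

```python
def delta(frames, width=2):
    """First-order delta coefficients via the common regression formula.

    frames: list of per-frame coefficient lists (e.g., 13 MFCCs per frame).
    width:  number of neighboring frames used on each side (assumed value).
    Frames beyond either end are treated as copies of the first or last frame.
    """
    if width < 1:
        raise ValueError("width must be at least 1")
    denom = 2 * sum(n * n for n in range(1, width + 1))
    last = len(frames) - 1
    deltas = []
    for t in range(len(frames)):
        row = []
        for k in range(len(frames[t])):
            acc = 0.0
            for n in range(1, width + 1):
                ahead = frames[min(t + n, last)][k]
                behind = frames[max(t - n, 0)][k]
                acc += n * (ahead - behind)
            row.append(acc / denom)
        deltas.append(row)
    return deltas


# Hypothetical one-coefficient "MFCC" track that rises by 1.0 per frame.
mfcc = [[0.0], [1.0], [2.0], [3.0]]
first_order = delta(mfcc, width=1)
```

With a width of one frame, interior frames of this rising track yield a delta equal to the per-frame slope, while the edge frames yield smaller values because of the repeated boundary frames.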

Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over Chen, in view of Parada, further in view of Lesso, furthermore in view of Gorodetski, and furthermore in view of applicant supplied reference, Haukioja et al. (US Patent Application Publication No: US 20190253558 A1) hereinafter as Haukioja.
	Regarding claim 6, Chen in view of Parada, further in view of Lesso, and furthermore in view of Gorodetski discloses: The method of claim 4, 
	Chen in view of Parada, further in view of Lesso, and furthermore in view of Gorodetski does not explicitly, but Haukioja discloses: wherein processing the audio data, with the plurality of clustering models, to identify the plurality of speaker segments comprises: reducing a quantity of the features extracted from the audio data. ([0059] The system may additionally utilize Bayesian feature selection in order to reduce the number of features, strain on processing resources, lower the error rate, and optimize SLA metric prediction. The system may also use neural networks for feature generation and selection.)
	Chen, Parada, Lesso, Gorodetski, and Haukioja are considered analogous art because they are in the related art of diarization and/or speaker recognition.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Chen, in view of Parada, further in view of Lesso, and furthermore in view of Gorodetski, to combine the teaching of Haukioja, to incorporate the above mentioned claim limitations, because the combination of the disclosures would reduce the strain on the processing resources (Haukioja, [0059]).

Claims 7 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Chen, in view of Parada, further in view of Lesso, furthermore in view of McLaren et al. (US Patent Application Publication No: US 20160283185 A1) hereinafter as McLaren, furthermore in view of (Redonnet, S., & Cunha, G. (2015). An advanced hybrid method for the acoustic prediction. Advances in Engineering Software, 88, 30-52.) hereinafter as Redonnet, and furthermore in view of (Lopez-Otero, P., Docio-Fernandez, L., & Garcia-Mateo, C. (2010, March). Novel strategies for reducing the false alarm rate in a speaker segmentation system. In 2010 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 4970-4973). IEEE.) hereinafter as Lopez-Otero.
	Regarding claim 7, Chen in view of Parada, further in view of Lesso discloses: The method of claim 1, 
	Chen additionally discloses: wherein re-segmenting the audio data with the rectification models to generate the re-segmented audio data comprises one or more of: processing the audio data, with a hidden Markov model, to generate the re-segmented audio data; ([0016] The number of speakers may be already known or estimated. In some cases, the clusters may be further used for establishing or initializing a speaker classification model, e.g., a hidden Markov model (HMM), for refining frame alignment between the audio stream and the speakers.)
	Chen in view of Parada, further in view of Lesso does not explicitly, but McLaren discloses: processing the audio data, with a median filtering model, to generate the re-segmented audio data; ([0047] The illustrative feature scoring module 410 utilizes an LLR generation module 412, to generate the seed and non-seed scores using a measure of model log likelihood ratio (LLR). The illustrative LLR generation module 412 generates an LLR value for each feature 318, which is computed as a ratio of the seed score to the non-seed score in the log domain. In some embodiments, an LLR smoothing module 414 may be used to ignore relatively insignificant fluctuations in the LLR values. The LLR smoothing module 414 may employ, for example, a median filtering algorithm.)
	Chen, Parada, Lesso, and McLaren are considered analogous art because they are in the related art of diarization and/or speaker recognition.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Chen, in view of Parada, further in view of Lesso, to combine the teaching of McLaren, to incorporate the above mentioned claim limitations, because the combination of the disclosures would achieve high precision in speaker diarization with only a minimal amount of user interaction with the system (McLaren, [0015]).
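For illustration only, median filtering of a per-frame score sequence, of the general kind McLaren [0047] describes for smoothing LLR values, can be sketched as replacing each value with the median of a sliding window around it. This is a generic sketch and not McLaren's implementation; the function name, the window size, the edge handling (repeating the boundary values), and the example scores are assumptions.

```python
from statistics import median


def median_filter(values, window=3):
    """Replace each value with the median of a centered window of odd length.

    Positions beyond either end reuse the nearest boundary value (an assumed
    edge-handling choice).
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    last = len(values) - 1
    smoothed = []
    for i in range(len(values)):
        neighborhood = [values[min(max(j, 0), last)]
                        for j in range(i - half, i + half + 1)]
        smoothed.append(median(neighborhood))
    return smoothed


# Hypothetical per-frame scores with a single spurious spike at index 2.
scores = [1.0, 1.2, 9.0, 1.1, 1.0]
smoothed_scores = median_filter(scores, window=3)
```

The isolated spike is suppressed while the surrounding values are left essentially unchanged, which is why such a filter tends to ignore relatively insignificant, short-lived fluctuations.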
	Chen in view of Parada, further in view of Lesso, and furthermore in view of McLaren does not explicitly, but Redonnet discloses: processing the audio data, with a cluster time-interpolated reconstruction model, to generate the re-segmented audio data; ([sect II. 4) Cluster marginal reconstruction] In other solution the Pre-estimating of the missing components should be done in some manner, and then use the complete vector to identify cluster membership. These are used for the interpolation along time or frequency for pre-estimating the missing features at first. These methods are called cluster time interpolated reconstruction or cluster frequency interpolated reconstruction.)
	Chen, Parada, Lesso, McLaren, and Redonnet are considered analogous art because they are in the related art of diarization and/or speaker recognition and/or audio feature restoration.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Chen, in view of Parada, further in view of Lesso, furthermore in view of McLaren, to combine the teaching of Redonnet, to incorporate the above mentioned claim limitations, because the combination of the disclosures would obtain good reconstruction results and increase recognition accuracy (Redonnet, Sect I).
	Chen in view of Parada, further in view of Lesso, furthermore in view of McLaren, and furthermore in view of Redonnet does not explicitly, but Lopez-Otero discloses: or processing the audio data, with a false alarm reduction model, to generate the re-segmented audio data. ([sect 3. False Alarm Rejection Strategies] See section 3.2 for details.)
	Chen, Parada, Lesso, McLaren, Redonnet, and Lopez-Otero are considered analogous art because they are in the related art of diarization and/or speaker recognition/segmentation and/or audio feature restoration.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Chen, in view of Parada, further in view of Lesso, furthermore in view of McLaren, furthermore in view of Redonnet, to combine the teaching of Lopez-Otero, to incorporate the above mentioned claim limitations, because the combination of the disclosures would achieve expected results of reducing the number of false alarms (Lopez-Otero, Conclusion and future work).

Regarding claim 14, although different in scope from claim 7, it recites the elements of the method of claim 7 as a device.  Thus, the analysis in rejecting claim 7 is equally applicable to claim 14.

Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over Chen, in view of Parada, further in view of Lesso, furthermore in view of Bassiou, furthermore in view of (Xiao, B., Georgiou, P. G., Imel, Z. E., Atkins, D. C., & Narayanan, S. S. (2013, September). Modeling therapist empathy and vocal entrainment in drug addiction counseling. In Interspeech (pp. 2861-2865).) hereinafter as Xiao, and furthermore in view of Quan et al. (US Patent Application Publication No: US 20190253558 A1) hereinafter as Quan.
Regarding claim 8, Chen discloses: A device, comprising: one or more memories; and one or more processors, communicatively coupled to the one or more memories, configured to: ([0120] The apparatus 900 may comprise at least one processor 910 and a memory 920 storing computer-executable instructions.)
receive audio data identifying a conversation including a plurality of speakers; ([0001] Speaker diarization is a process of partitioning an input audio stream into a number of portions corresponding to different speakers respectively. Speaker diarization aims to determine “who spoke when?” within an audio stream, and time intervals during which each speaker is active. Through applying speaker diarization to an audio stream involving speeches from multiple speakers, it may be determined that each speech utterance in the audio stream is spoken by which speaker, or what speech utterances have been spoken by each speaker.)
process the audio data, with a plurality of clustering models, to identify a plurality of speaker segments; ([0024] At 130, the speech segments obtained at 120 may be clustered into a plurality of clusters. Through the clustering operation at 130, the speech segments may be merged based on similarity, such that there is a one-to-one correspondence between the resulted clusters and the speakers. Firstly, speaker feature vectors or embedding vectors may be obtained for the speech segments, e.g., i-vectors, x-vectors, etc. Then speech similarity scoring may be performed with the speaker feature vectors among the speech segments. For example, the speech similarity scoring may be based on. e.g., probabilistic linear discriminant analysis (PLDA), Bayesian information criterion (BIC), generalized likelihood ratio (GLR), Kullback-Leibler divergence (KLD), etc. Thereafter, the speech segments may be merged based on similarity scores under a predetermined clustering strategy. e.g., agglomerative hierarchical clustering (AHC), etc. For example, those speech segments having high similarity scores among each other may be merged into a cluster.)
	an agglomerative clustering model, ([0024]  e.g., agglomerative hierarchical clustering (AHC), etc.)
selecting a rectification model to rectify each of the plurality of errors based on a cause of a corresponding one of the plurality of errors and based on features of a corresponding one of the plurality of speaker segments; ([0016] In some cases, the clusters may be further used for establishing or initializing a speaker classification model, e.g., a hidden Markov model (HMM), for refining frame alignment between the audio stream and the speakers. Since it is possible to miss or fail to detect some speaker change points during the speech segmentation. e.g., a speech segment may comprise speech utterances from different speakers, the resulted clusters will sometimes be impure in terms of speakers accordingly, e.g., a cluster may comprise speech utterances or speech segments from different speakers, thus reducing accuracy of speaker diarization. Moreover, the speaker classification model established based on such impure clusters would further result in inaccurate frame alignment.[0032] In some implementations, the process 200 may adopt speaker acoustic features, as the traditional speaker diarization does, for the following operations. e.g., segmentation, clustering, modeling, alignment, etc. In this case, a speaker acoustic feature vector may be extracted for each speech frame in the audio stream. The extracted speaker acoustic feature vectors of the speech frames in the audio stream may be further used for the following operations.) 
re-segment the audio data with the rectification models to generate re-segmented audio data; ([0028] At 150, frame alignment may be performed through the speaker classification model. The frame alignment may also be referred to as frame re-segmentation. Speech frames in the audio stream may be provided to the speaker classification model, e.g., HMM, to align to respective HMM states of the HMM, and accordingly to align to respective speakers. The final result of the speaker diarization would be provided after the frame alignment.)
determine a plurality of modified diarization error rates for the plurality of speaker segments based on the re-segmented audio data; ([0045] Through continuously performing the above updating operation of the speaker classification model along with aligning speech frames to different speakers, the speaker classification model may be continuously improved in terms of its performance of discriminating different speakers during the speaker diarization, and accordingly the accuracy of the speaker diarization is also improved.)
Chen does not explicitly, but Parada discloses: determining a plurality of diarization error rates for the plurality of speaker segments; ([0042] At step 250, a microphone pair may be selected based on the confidence measure. For example, the pair with the lowest uncertainty may be selected. This may comprise a microphone pair that yields the lowest Diarization Error Rate (DER) in the model. This method is described further below with regards to “Alignment” and “Channel selection.”)
identify a plurality of errors in the plurality of speaker segments based on comparing each of the plurality of diarization error rates to one or more thresholds; (Fig. 17 shows a table of error rates, including missed speech, false alarm, speaker error and DER.  Also see [0146].)
Chen and Parada are considered analogous art because they are both in the related art of diarization. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Chen to combine the teaching of Parada, to incorporate the above mentioned claim limitations, because the combination of the disclosures would allow audio sources to be identified for each audio signal based on the clustering, and the audio signals to be segmented (Parada, background/summary).
Chen in view of Parada does not explicitly, but Lesso discloses: select one of the plurality of speaker segments based on the plurality of modified diarization error rates; ([0038] Obtaining information about times at which there may have been a speaker change, based on the speaker recognition scores obtained from the biometric process, may comprise: [0039] examining successive speaker recognition scores for the sections of the audio signal in order to determine a series of difference values between pairs of consecutive speaker recognition scores for the successive sections of the audio signal; and [0040] determining that there may have been a speaker change when one of said difference values exceeds a threshold value.)
and performing, by the device, one or more actions based on the speaker segment. ([0077] FIG. 3 illustrates a possible attack on a system using speech recognition, and specifically a voice assistant system that allows an enrolled user to speak commands that the system will act upon. The voice assistant system uses speaker recognition to ensure that a command was spoken by the enrolled user, before it acts upon the command.)
Chen, Parada and Lesso are considered analogous art because they are in the related art of diarization and/or speaker recognition.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Chen, in view of Parada, to combine the teaching of Lesso, to incorporate the above mentioned claim limitations, because the combination of the disclosures would allow a user to issue voice commands, causing the system to perform some action or retrieve some requested information (Lesso, background/summary).
Chen in view of Parada, and further in view of Lesso does not explicitly disclose, but Bassiou discloses: wherein the plurality of clustering models includes: a k-means clustering model, a spectral clustering model, ([Sect 1, introduction] Many speaker clustering methods have been developed ranging from hierarchical ones, such as the bottom-up (or agglomerative) methods and the top-down (or divisive) ones, to optimization methods including the k-means algorithm [10] or the autoassociative neural networks [11] to mention a few. ... The proposed method exploits concepts from spectral graph theory and cluster ensembles in speaker diarization.)
Chen in view of Parada, further in view of Lesso, and furthermore in view of Bassiou does not explicitly disclose, but Xiao discloses: calculate an empathy score based on the one of the plurality of speaker segments; (See sections 3.3 and 3.4 regarding empathy scoring.)
Chen, Parada, Lesso, Bassiou, and Xiao are considered analogous art because they are in the related art of diarization and/or speaker recognition and/or audio analysis.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Chen, in view of Parada, further in view of Lesso, furthermore in view of Bassiou, to combine the teaching of Xiao, to incorporate the above mentioned claim limitations, because the combination of the disclosures would enable modeling the relation of entrainment and empathy (Xiao, Sect I).
Chen in view of Parada, further in view of Lesso, furthermore in view of Bassiou, and furthermore in view of Xiao does not explicitly disclose, but Quan discloses: and perform one or more actions based on the empathy score. ([0019] As shown in FIG. 2, a method 200 for remotely coaching a set of participants includes: collecting a set of inputs associated with a set of one or more participants S210; for each of the set of participants, determining a set of one or more scores (e.g., impact score) associated with the participant based on the set of inputs S220; and organizing the set of participants based on the scores S240. Additionally, the method 200 can include any or all of: assigning a set of tags and/or labels to any or all of the set of participants (e.g., based on the set of scores) S230; determining a set of conversation topics associated with a participant of the set of participants S250; recommending the set of one or more conversation topics to a coach associated with the participant at a dashboard S260; receiving an input S270; updating the dashboard based on the input S280; triggering an action S290; and/or any other suitable process(es). Also see [0166-0167].  The impact score here is interpreted as equivalent to the empathy score.)
Chen, Parada, Lesso, Bassiou, Xiao, and Quan are considered analogous art because they are in the related art of diarization and/or speaker recognition and/or audio emotion analysis.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Chen, in view of Parada, further in view of Lesso, furthermore in view of Bassiou, and furthermore in view of Xiao, to combine the teaching of Quan, to incorporate the above mentioned claim limitations, because the combination of the disclosures would prioritize the relationship between two parties through a scoring system (Quan, [0099]).


Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Chen, in view of Parada, further in view of Lesso, furthermore in view of Bassiou, furthermore in view of Xiao, furthermore in view of Quan, and furthermore in view of (Anguera, X., Bozonnet, S., Evans, N., Fredouille, C., Friedland, G., & Vinyals, O. (2012). Speaker diarization: A review of recent research. IEEE Transactions on audio, speech, and language processing, 20(2), 356-370.) hereinafter as Anguera.
Regarding claim 9, Chen in view of Parada, further in view of Lesso, furthermore in view of Xiao, and furthermore in view of Quan discloses: The device of claim 8,
Parada additionally discloses: wherein the plurality of errors includes one or more of: an improper identification of a speaker error, a false alarm speech error, a missed speech error, (Fig. 17 shows a table of error rates, including missed speech, false alarm, speaker error and DER.  Also see [0146].)
Chen in view of Parada, further in view of Lesso, furthermore in view of Xiao, and furthermore in view of Quan does not explicitly disclose, but Anguera discloses: or an overlapping speaker error. ([sect V. Performance Evaluation] Speaker overlap error refers to the case when the wrong number of speakers is hypothesized when multiple speakers speak at the same time. See fig. 3(a)&(b))
Chen, Parada, Lesso, Bassiou, Xiao, Quan, and Anguera are considered analogous art because they are in the related art of diarization and/or speaker recognition and/or audio emotion analysis.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Chen, in view of Parada, further in view of Lesso, furthermore in view of Bassiou, furthermore in view of Xiao, and furthermore in view of Quan, to combine the teaching of Anguera, to incorporate the above mentioned claim limitations, because it can be advantageous to automatically determine the number of speakers involved in addition to the periods when each speaker is active (Anguera, Sect I, Intro).

Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Chen, in view of Parada, further in view of Lesso, furthermore in view of Bassiou, furthermore in view of Xiao, furthermore in view of Quan, and furthermore in view of Krishnan (US Patent Application Publication No: US 20210306457 A1) hereinafter as Krishnan.
Regarding claim 11, Chen in view of Parada, further in view of Lesso, furthermore in view of Xiao, and furthermore in view of Quan discloses: The device of claim 8,
Chen in view of Parada, further in view of Lesso, furthermore in view of Xiao, and furthermore in view of Quan does not explicitly disclose, but Krishnan discloses: wherein the empathy score provides an indication of whether one of the plurality of speakers, associated with the empathy score, is empathetic, neutral, or non-empathetic. ([0025] The behavioral analysis module 128 analyzes the identified emotion(s) and the sentiment(s) to determine one or more behavior of the customer on the call. In some embodiments, a determined emotion and a determined sentiment are used to predict a behavior of the customer. The behaviors determined by the behavior analysis module 128 include polite, impolite, friendly, rude, empathetic and neutral.)
Chen, Parada, Lesso, Bassiou, Xiao, Quan, and Krishnan are considered analogous art because they are in the related art of diarization and/or speaker recognition and/or audio emotion analysis.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Chen, in view of Parada, further in view of Lesso, furthermore in view of Bassiou, furthermore in view of Xiao, and furthermore in view of Quan, to combine the teaching of Krishnan, to incorporate the above mentioned claim limitations, because it fills a need for techniques for analyzing the behavior of the speaking parties in a conversation (Krishnan, background).

Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Chen, in view of Parada, further in view of Lesso, furthermore in view of Bassiou, furthermore in view of Xiao, furthermore in view of Quan, furthermore in view of Martin et al. (US Patent Application Publication No: US 20210344636 A1) hereinafter as Martin, and furthermore in view of Ganesh et al. (US Patent Application Publication No: US 20120260201 A1) hereinafter as Ganesh.
Regarding claim 12, Chen in view of Parada, further in view of Lesso, furthermore in view of Xiao, and furthermore in view of Quan discloses: The device of claim 8,
Chen in view of Parada, further in view of Lesso, furthermore in view of Xiao, and furthermore in view of Quan does not explicitly disclose, but Martin discloses: wherein the one or more processors, when performing the one or more actions, are configured to one or more of: provide the empathy score for display; (See Fig. 9; one of the metrics displayed on the third to the left from the bottom display is an empathy metric.)
schedule training for one of the plurality of speakers associated with the empathy score; ([0036] The server 110 may also include alert system 165. The alert system 165 may be configured to provide real-time alerts that enable an organization to act on problematic trends. The alert system 165 in conjunction with the scoring engine 130 can be configured to provide alerts on individuals (e.g., customers in need, agents needing training) as well as on broader emerging trends. Put another way, in some implementations, the alert system 165 may enable an organization to identify, mitigate, and address various problematic trends earlier and more effectively.)
Chen, Parada, Lesso, Bassiou, Xiao, Quan, and Martin are considered analogous art because they are in the related art of diarization and/or speaker recognition and/or audio emotion analysis and/or data analytics.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Chen, in view of Parada, further in view of Lesso, furthermore in view of Bassiou, furthermore in view of Xiao, and furthermore in view of Quan, to combine the teaching of Martin, to incorporate the above mentioned claim limitations, because it enables the organization to develop an automated objective scoring process that uses any combination of classifiers in the library of classifiers (Martin, summary).
Chen in view of Parada, further in view of Lesso, furthermore in view of Xiao, furthermore in view of Quan, and furthermore in view of Martin does not explicitly disclose, but Ganesh discloses: or cause a salary increase or a promotion to be implemented for one of the plurality of speakers associated with the empathy score. ([0066] Data clouds indicating positive feedback for the Reliability, Crisis scenarios and Empathy to individual service attributes can aid decisions relating to employee promotions or raises.)
Chen, Parada, Lesso, Bassiou, Xiao, Quan, Martin and Ganesh are considered analogous art because they are in the related art of diarization and/or speaker recognition and/or audio emotion analysis and/or data analytics.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Chen, in view of Parada, further in view of Lesso, furthermore in view of Bassiou, furthermore in view of Xiao, furthermore in view of Quan, and furthermore in view of Martin, to combine the teaching of Ganesh, to incorporate the above mentioned claim limitations, because the collected data can be used to evaluate employee and customer engagements (Ganesh, summary).

Claim 13 is rejected under 35 U.S.C. 103 as being unpatentable over Chen, in view of Parada, further in view of Lesso, furthermore in view of Bassiou, furthermore in view of Xiao, furthermore in view of Quan, furthermore in view of Martin, furthermore in view of Feast (US Patent Application Publication No: US 20190158671 A1) hereinafter as Feast, and furthermore in view of Ni et al. (US Patent Application Publication No: US 20200089767 A1) hereinafter as Ni.
Regarding claim 13, Chen in view of Parada, further in view of Lesso, furthermore in view of Xiao, and furthermore in view of Quan discloses: The device of claim 8,
Chen in view of Parada, further in view of Lesso, furthermore in view of Xiao, and furthermore in view of Quan does not explicitly disclose, but Martin discloses: or retrain one or more of the rectification models based on the empathy score. ([0036] In some implementations, the coaching alert UI 167 or another user interface may enable a supervisor to override a particular score, e.g., for a component in the rubric. For example, a classifier may have incorrectly tagged a scoring unit and this mistake may be discovered during a collaboration, audit, or other review of an alert. The incorrect tag may be marked and saved, e.g., in the data store 140. The server 110 may be configured to use these incorrectly tagged examples to update one or more of the classifiers. For example, for machine-learned classifiers, the incorrectly tagged examples may be used to periodically retrain or update training of the classifier. Thus, the alert system 165 may provide a feedback loop that makes the classifier library 132 more accurate over time.)
Chen in view of Parada, further in view of Lesso, furthermore in view of Xiao, furthermore in view of Quan, and furthermore in view of Martin does not explicitly disclose, but Feast discloses: wherein the one or more processors, when performing the one or more actions, are configured to one or more of: cause a reward to be implemented for one of the plurality of speakers associated with the empathy score; ([0064] Metrics associated with voice inputs may be used to identify emotional exhaustion of an agent or a customer. Such voice inputs may be analyzed to identify current voice behavioral data that may be compared to historical trends. Such voice metrics may be related to a pitch, a tone, a spoken pace/pace change, or a vocal effort. ... For example, that fact that agent is continually readjusting and coordinating their effort to deal with a caller by remaining calm, actively listening, increasing patience, or showing empathy may be used to identify an emotional state or metric associated with that agent. ... This data may also be used to make other determinations or calculations that may relate to identifying incentives, rewards, or performance metrics to associate with a particular agent.)
Chen, Parada, Lesso, Bassiou, Xiao, Quan, Martin and Feast are considered analogous art because they are in the related art of diarization and/or speaker recognition and/or audio emotion analysis and/or data analytics.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Chen, in view of Parada, further in view of Lesso, furthermore in view of Bassiou, furthermore in view of Xiao, furthermore in view of Quan, and furthermore in view of Martin, to combine the teaching of Feast, to incorporate the above mentioned claim limitations, because it would improve customer and employee interactions (Feast, background of invention).
Chen in view of Parada, further in view of Lesso, furthermore in view of Xiao, furthermore in view of Quan, furthermore in view of Martin, and furthermore in view of Feast does not explicitly disclose, but Ni discloses: cause a refund to be provided to one of the plurality of speakers associated with the empathy score; ([0120] Examples of agent issues include, but are not limited to, communication-related issues (e.g., is the agent hard to understand?, is the agent able to explain the issue coherently?, etc.), attentiveness of the agent, agent behavior (friendliness, rudeness, empathy (or lack thereof) towards the customer), etc. Examples of product issues include, but are not limited to, product (or product feature) satisfaction/dissatisfaction, integration with other devices, etc. Examples of policy issues include, but are not limited to, payment offerings, refund/exchange policy, etc.)
Chen, Parada, Lesso, Bassiou, Xiao, Quan, Martin, Feast and Ni are considered analogous art because they are in the related art of diarization and/or speaker recognition and/or audio emotion analysis and/or data analytics.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Chen, in view of Parada, further in view of Lesso, furthermore in view of Bassiou, furthermore in view of Xiao, furthermore in view of Quan, furthermore in view of Martin, and furthermore in view of Feast, to combine the teaching of Ni, to incorporate the above mentioned claim limitations, because it would enable classifying and quantifying of sentiment between customer and agent (Ni, summary).

Allowable Subject Matter
Claims 3 and 10 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.  Similarly, claims 15-20 are found allowable over the prior art of record for at least the following rationale: 
The prior art of record has been respectfully considered and is found to fail to teach or fairly suggest, either individually or in any reasonable combination, the underlined portion of: “calculate an empathy score based on the emotion score, the intent score, and the sentiment score;” Furthermore, it would not have been obvious to one of ordinary skill in the art to modify the prior art in order to arrive at the claimed invention.  Therefore, claims 15-20 are allowed.
Any comments considered necessary by applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee.  Such submissions should be clearly labeled “Comments on Statement of Reasons for Allowance.”
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Ghaemmaghami et al. (US Patent Application Publication No: US 20190304470 A1) hereinafter as Ghaemmaghami.  Ghaemmaghami discloses a method and system of automatically diarizing a sound recording.
Yeon et al. (US Patent Application Publication No: US 20200051558 A1) hereinafter as Yeon.  Yeon teaches a method and device for detecting an utterance that may identify a speaker based on speech recognition and may perform an action corresponding to the voice data.
McGarvey et al. (US Patent Application Publication No: US 20200020454 A1) hereinafter as McGarvey.  McGarvey teaches a method and system that receives rating and personality assessment information from patients, records interaction variables and emotional reaction information from care interactions, and uses that information to match the patients with healthcare providers and determine an empathy meter score for the care interactions.
Chaudhuri (US Patent Application Publication No: US 20200279279 A1) hereinafter as Chaudhuri. Chaudhuri teaches a method and system of identity detection and emotion detection.
	 (Zhou, L., Gao, J., Li, D., & Shum, H. Y. (2020). The design and implementation of xiaoice, an empathetic social chatbot. Computational Linguistics, 46(1), 53-93.) hereinafter as Zhou.  Zhou discloses the development of Microsoft XiaoIce, a popular social chatbot.  
(Unit, R. H. (2006). Empathy in health care providers–validation study of the Polish version of the Jefferson Scale of Empathy. Advances in medical sciences, 51, 219-225.) hereinafter as Kliszcz.  Kliszcz discusses the meaning of empathy as a critical component of the interpersonal relationship that needs to be measured.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Phillip H Lam whose telephone number is (571)272-1721. The examiner can normally be reached 10 AM-6 PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta can be reached on (571) 272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/PHILIP H LAM/Examiner, Art Unit 2656
/HUYEN X VO/Primary Examiner, Art Unit 2656