DETAILED ACTION
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 05/27/2022 has been entered.
This communication is in response to the Amendments and Arguments filed on 05/09/2022.
Claims 1-14 are pending and have been examined.
All previous objections/rejections not mentioned in this Office Action have been withdrawn by the examiner. 
Notice of Pre-AIA or AIA Status
The present application is being examined under the pre-AIA first to invent provisions.
Response to Arguments
Applicant's arguments filed 05/09/2022 have been fully considered but they are not persuasive. 
Regarding Applicant's arguments, on pages 8-9, Applicant asserts that Balasubramaniam does not teach a change vector, as a vector requires both magnitude and direction. The Examiner respectfully disagrees with this assertion. While it is true that a vector can identify magnitude and direction, this specific requirement is not recited in the claim language. Thus, the broader interpretation of a vector as a set of values is available when interpreting the claim. This is not the same as the interpretation Applicant alleges, i.e., that a vector is read as only a scalar value. As Balasubramaniam teaches a set of values, the reference teaches a change vector as presented in the previous Office Action (see Balasubramaniam (13:55-14:20)).
Applicant further asserts on pages 8-9 that Balasubramaniam does not utilize speech feature data to separate audio data representing the customer’s voice from the audio data representing the agent’s voice. In response to Applicant's argument that the references fail to show certain features of Applicant’s invention, it is noted that the features upon which Applicant relies (i.e., using speech feature data to separate audio of a customer voice from audio of an agent voice) are not recited in the rejected claim(s). Although the claims are interpreted in light of the specification, limitations from the specification are not read into the claims. See In re Van Geuns, 988 F.2d 1181, 26 USPQ2d 1057 (Fed. Cir. 1993). The claim language recites “clustering the plurality of feature amounts for each speaker.” This limitation is broadly interpreted to mean that the feature amounts for each speaker are clustered, which is supported by the recited claim language. If Applicant intends the feature clustering to be interpreted as being used for speaker diarization, it is suggested that the claim language be amended to reflect the desired interpretation.
On page 10, regarding the 101 rejection, Applicant asserts that the claims being distinguishable over the cited references leads the features to be "unconventional in combination", thereby providing significantly more than the judicial exception. Aside from the first statement not being confirmed, as previously discussed, the consideration of whether or not claims are directed to patent eligible subject matter is distinct from issues of novelty. That a claim may overcome the cited art does not automatically mean it is patent eligible. It must also meet the separate standards of a 101 analysis. Please see MPEP 2106.04 for further detail regarding judicial exceptions, especially the statement “even newly discovered or novel judicial exceptions are still exceptions.” In this case, regardless of the standing of the claims with respect to prior art, the claim language still falls under the Abstract - Mental Process classification, which is a judicial exception. Regarding the specific argument that the claims amount to significantly more than the judicial exception, the Examiner respectfully disagrees. A human is capable of performing all of the claim limitations, and the additional elements of generalized computer components amount to mere instructions to implement an abstract idea on a computer. As per MPEP 2106.05, this is not enough to qualify as “significantly more” and falls under the well-understood, routine, and conventional activity designation.
Hence, Applicant’s arguments are not persuasive.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-14 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.

Regarding claim(s) 1, 13, and 14, the limitation(s) of “detecting a plurality of voice sections”, “calculating a plurality of feature amounts”, “determining a plurality of emotions”, “classifying a plurality of first feature amounts”, “generating a change vector”, and “clustering the plurality of feature amounts”, as drafted, are processes that, under their broadest reasonable interpretation, cover performance of the limitations in the mind but for the recitation of generic computer components. More specifically, the limitations read on the mental process of a human hearing another human speak and recognizing discrete portions of the speech, writing down specific acoustic features heard in each portion of speech, recognizing and writing down an emotion for each portion, writing the acoustic features in groups according to the associated emotion and the differences between the acoustic features of one emotion versus those of the other emotion, determining a set of values that describes a relationship between the different groups, and finalizing groupings of the acoustic features based on the relationship descriptor. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the --Mental Processes-- grouping of abstract ideas. Accordingly, the claim recites an abstract idea.
This judicial exception is not integrated into a practical application because the recitation of a “computer-readable recording medium” in claim 1, and an “apparatus”, “memory”, and “processor” in claim 14, reads on generalized computer components, based upon the claim interpretation wherein the structure is interpreted using [0052] in the specification. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. Claim 13 does not recite any additional limitations. The claim is directed to an abstract idea.
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the integration of the abstract idea into a practical application, the additional element of using generalized computer components to detect, calculate, determine, classify, generate, and cluster amounts to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. The claim is not patent eligible.

With respect to claim(s) 2, the claim(s) recite(s) “specifies a combination”, which reads on a human identifying vectors that are the most similar. No additional limitations are present.

With respect to claim(s) 3, the claim(s) recite(s) “correcting the plurality of second feature amounts” and “clusters the plurality of first feature amounts”, which reads on a human changing the values for the features associated with second emotion by the vector amount connecting the features for the first and second emotions, and clustering the features for the first emotion with the new feature values. No additional limitations are present.

With respect to claim(s) 4, the claim(s) recite(s) “generates the change vector”, which reads on a human calculating a vector connecting the cluster of the first emotion, which is neutral, and the cluster of the second emotion, which is not neutral. No additional limitations are present.

With respect to claim(s) 5, the claim(s) recite(s) “associating the voice section”, which reads on a human recognizing that a particular grouping of acoustic features is associated with the voice of a particular speaker. No additional limitations are present.

With respect to claim(s) 6, the claim(s) recite(s) “evaluates the similarity”, which reads on a human using pen and paper to calculate a cosine similarity between vectors. No additional limitations are present.

With respect to claim(s) 7, the claim(s) recite(s) “determines the plurality of emotions”, which reads on a human recognizing the emotions associated with acoustic features of a voice. No additional limitations are present.

With respect to claim(s) 8, the claim(s) recite(s) “determines the plurality of emotions…based on a face image”, which reads on a human looking at a speaker and recognizing the emotions associated with different facial expressions. No additional limitations are present.

With respect to claim(s) 9, the claim(s) recite(s) “determines the plurality of emotions…based on a biological information”, which reads on a human recognizing a speaker’s breathing pattern or heartbeat and associating it with a particular emotion. No additional limitations are present.

With respect to claim(s) 10, the claim(s) recite(s) “calculates the plurality of feature amounts”, which reads on a human writing down the pitch or volume of a speaker. No additional limitations are present.

With respect to claim(s) 11, the claim(s) recite(s) “extracts one of…”, which reads on a human recognizing and writing down a pitch or volume of a speaker. No additional limitations are present.

With respect to claim(s) 12, the claim(s) recite(s) “calculates the plurality of feature amounts”, which reads on a human calculating acoustic features using model equations developed from known data and the recognized features of the speaker’s voice. No additional limitations are present.
These dependent claims likewise do not integrate the judicial exception into a practical application, and they fail to include additional elements that are sufficient to amount to significantly more than the judicial exception.
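For illustration only, the pen-and-paper cosine similarity calculation noted above with respect to claim 6 can be sketched in Python as follows. The function name and the sample values are hypothetical; they are not taken from the claims, the specification, or the cited art, and the sketch is not offered as Applicant's implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (hypothetical helper)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Direction is undefined for a zero vector.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Assumed sample values: perpendicular vectors and parallel vectors.
perpendicular = cosine_similarity([1.0, 0.0], [0.0, 1.0])
parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0])
```

Each step (one dot product, two norms, one division) is ordinary arithmetic of the kind that can be carried out by hand for short vectors.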
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claim(s) 1, 4, 7, 10, 11, 13, and 14 is/are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Balasubramaniam et al. (US Patent No. 10896428), hereinafter Balasubramaniam.

Regarding claims 1, 13, and 14, Balasubramaniam teaches
(claim 1) A non-transitory computer-readable recording medium having stored therein a program that causes a computer to execute a procedure, the procedure comprising (an example computing device includes one or more computer readable medium drives, i.e. non-transitory computer-readable recording medium (19:2-10), that include computer program instructions, i.e. having stored therein a program, that one or more processors execute, i.e. causes a computer to execute a procedure (19:18-25)):
(claim 13) A voice processing method comprising (a method (20:29-33)):
(claim 14) A voice processing apparatus comprising (an example computing device, i.e. apparatus (19:1-5)):
(claim 14) a memory (one or more computer readable medium drives, i.e. memory (19:2-10)); and
(claim 14) a processor coupled to the memory and the processor configured to (an example computing device includes one or more computer processors and one or more computer readable memories that include instructions that the processors execute, i.e. coupled to the memory (19:1-25)):

detecting a plurality of voice sections from an input sound that includes voices of a plurality of speakers (the audio data signal, i.e. an input sound, may be windowed into a succession of frames to be processed individually, i.e. detecting a plurality of voice sections, and the audio data may be further separated into data including the voice of the customer separate from data including the voice of the agent, i.e. includes voices of a plurality of speakers (12:9-30));
calculating a plurality of feature amounts of each of the plurality of voice sections (the speech analyzer extracts audio features, i.e. calculating a plurality of feature amounts, from the frames of the audio data, i.e. of each of the plurality of voice sections (12:35-44));
determining a plurality of emotions, corresponding to the plurality of voice sections respectively, of a speaker of the plurality of speakers (the feature vectors generated from the extracted audio features for each frame, i.e. corresponding to the plurality of voice sections respectively (12:45-62), where the audio data was also separated into data for just the customer and just the agent, i.e. of a speaker of the plurality of speakers (12:27-34), are classified into particular classifications, such as a classification associated with a particular emotion, i.e. determining a plurality of emotions (13:17-26));
classifying a plurality of first feature amounts of a first voice section determined as a first emotion of the plurality of emotions of the speaker and a plurality of second feature amounts of a second voice section determined as a second emotion of the plurality of emotions of the speaker into a plurality of first clusters and a plurality of second clusters, respectively (audio features are extracted from the frame, where each frame is processed separately, i.e. a plurality of first feature amounts of a first voice section...and a plurality of second feature amounts of a second voice section (12:9-17,35-52), where a feature vector for the frame is generated using the features, and the feature vectors are classified as being associated with a particular emotion, i.e. determined as a first emotion of the plurality of emotions of the speaker... determined as a second emotion of the plurality of emotions of the speaker (12:63-13:16),(13:18-24), where the feature vectors are classified as being associated with particular emotions using a k-means clustering model, i.e. classifying...into...first clusters and...second clusters (13:16-32), and the audio data representing the customer is separated from audio data representing the agent, and processing is performed on the frames of the audio data, i.e. plurality of first and second clusters, respectively (12:18-34));
generating a change vector coupled to one of the plurality of first clusters and one of the plurality of second clusters, based on a combination of each of the plurality of first clusters and each of the plurality of second clusters (the feature vectors generated from the extracted audio features for each frame (12:45-62), are classified, such as through a k-means clustering trained model into classifications associated with a particular emotion, i.e. based on a combination of each of the plurality of first clusters and each of the plurality of second clusters (13:17-26), and the different classifications have speech feature data that includes different values, with each value representing a degree to which a different emotion of the n different emotions corresponds to the voice represented by the audio data, where the audio data is separated into data representing the customer and data representing the agent, i.e. generates the change vector coupled to one of the plurality of first clusters and one of the plurality of second clusters (12:18-34),(13:55-14:20)); and
clustering the plurality of feature amounts for each speaker, based on the change vector (the score generator may generate state score data that represents a classification, such as an emotion, i.e. clustering the plurality of feature amounts, using an input vector that includes the speech feature data, i.e. based on the change vector, and the emotions correspond to the voice represented by the audio data, where the audio data is separated into data representing the customer and data representing the agent, i.e. for each speaker (12:18-34),(13:55-14:20),(14:44-67)).  

Regarding claim 4, Balasubramaniam teaches
generates the change vector coupled to one of the plurality of first clusters of the first voice section determined as a neutral emotion of the plurality of emotions and one of the plurality of second clusters of the second voice section determined as an emotion other than the neutral emotion (the feature vectors generated from the extracted audio features for each frame, where the audio data was also separated into data for just the customer and just the agent, are classified, such as through a k-means clustering trained model into classifications associated with a particular emotion, i.e. one of the plurality of first clusters of the first voice section...and one of the plurality of second clusters of the second voice section  (12:27-34,45-62),(13:17-26), where the model may differentiate between a set of emotions, including neutral, i.e. determined as a neutral emotion of the plurality of emotions, and other emotions, such as anger, boredom, disgust, anxiety, happiness, and sadness, i.e. determined as an emotion other than the neutral emotion (13:48-54), and the different classifications have speech feature data that includes different values, with each value representing a degree to which a different emotion of the n different emotions corresponds to the voice, i.e. generates the change vector coupled to one of the plurality of first clusters…neutral…and one of the plurality of second clusters...emotion other than the neutral emotion (13:55-14:20)).  

Regarding claim 7, Balasubramaniam teaches claim 1, and further teaches
determines the plurality of emotions of the speaker, based on the plurality of feature amounts of the voices included in each of the plurality of voice sections (the feature vectors generated from the extracted audio features for each frame, i.e. based on the plurality of feature amounts of the voices included in each of the plurality of voice sections (12:45-62), where the audio data was also separated into data for just the customer and just the agent, i.e. of the speaker (12:27-34), are classified into particular classifications, such as a classification associated with a particular emotion, i.e. determines the plurality of emotions (13:17-26)).  


Regarding claim 10, Balasubramaniam teaches claim 1, and further teaches
calculates the plurality of feature amounts related to a harmonicity, periodicity or signal strength as the plurality of feature amounts of each of the plurality of voice sections (extracted audio features include, i.e. calculates the plurality of feature amounts related to, pitch metrics, i.e. periodicity, and the average energy of the speech, i.e. signal strength, where each frame is processed individually to generate speech feature data, i.e. each of the plurality of voice sections (12:09-17,35-60)).  

Regarding claim 11, Balasubramaniam teaches claim 1, and further teaches
extracts one of a spectrum correlation of the input sound, a formant frequency, an autocorrelation coefficient of the input sound, a pitch frequency, power of the input sound, an SNR (Signal-Noise Ratio) and spectrum power, as the plurality of feature amounts of each of the plurality of voice sections (extracted audio features include, i.e. extracts...as the plurality of feature amounts, pitch metrics, i.e. pitch frequency, and the average energy of the speech, i.e. power of the input sound (12:35-60), where each frame is processed individually to generate speech feature data, i.e. each of the plurality of voice sections (12:09-17)).  
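As a hedged illustration of the kind of frame-wise processing characterized in the mappings above (windowing an audio signal into frames and computing the average energy of each frame), a minimal Python sketch follows. The function names, frame length, hop size, and sample values are assumptions made for illustration; they are not taken from Balasubramaniam or from the present application.

```python
def frame_signal(samples, frame_len, hop):
    """Split a sequence of samples into fixed-length frames (hypothetical helper).

    Trailing samples that do not fill a complete frame are dropped.
    """
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    last_start = len(samples) - frame_len
    return [samples[s:s + frame_len] for s in range(0, last_start + 1, hop)]


def average_energy(frame):
    """Mean of the squared sample values in one frame."""
    return sum(x * x for x in frame) / len(frame)


# Assumed toy signal; real audio would contain thousands of samples per second.
signal = [1, 2, 3, 4, 5, 6]
frames = frame_signal(signal, frame_len=4, hop=2)
energies = [average_energy(f) for f in frames]
```

Each frame yields one energy value, so a per-frame feature vector could be built by appending further per-frame measurements such as a pitch estimate.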

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 5 is/are rejected under 35 U.S.C. 103 as being unpatentable over Balasubramaniam, in view of Chen et al. (Increasing Accuracy of Bottom-Up Speaker Diarization by Cluster Selection), as found in the IDS, hereinafter Chen.

Regarding claim 5, Balasubramaniam teaches claim 1.
While Balasubramaniam provides the association of voice sections to specific speakers, Balasubramaniam does not specifically teach that the association is based on the results of clustering, and thus does not teach
associating the voice section that corresponds to the plurality of feature amounts with the speaker, based on a result of the clustering the plurality of feature amounts.  
Chen, however, teaches associating the voice section that corresponds to the plurality of feature amounts with the speaker, based on a result of the clustering the plurality of feature amounts (voice frames are clustered using features, i.e. a result of the clustering the plurality of feature amounts, and each cluster is associated with a speaker, i.e. associating the voice section that corresponds to the plurality of feature amounts with the speaker (Sec. 2, para. 1-4),(Sec. 5, para. 1)). 
Balasubramaniam and Chen are analogous art because they are from a similar field of endeavor in processing speech input associated with a particular speaker. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Balasubramaniam's teaching of associating voice sections with specific speakers such that each voice frame belongs to a cluster associated with a speaker, as taught by Chen. It would have been obvious to combine the references so that cluster merging in speaker diarization could be improved (Chen Sec. 1).
 
Claim(s) 8, 9, and 12 is/are rejected under 35 U.S.C. 103 as being unpatentable over Balasubramaniam, in view of Shrivastav et al. (U.S. PG Pub No. 2012/0116186), hereinafter Shrivastav.

Regarding claim 8, Balasubramaniam teaches claim 1. 
While Balasubramaniam provides the determination of emotions of a speaker, Balasubramaniam does not specifically teach the use of a facial image to determine emotion, and thus does not teach
determines the plurality of emotions of the speaker, based on a face image of the speaker.  
Shrivastav, however, teaches determines the plurality of emotions of the speaker, based on a face image of the speaker (the emotions of a speaker can be categorized from a list of different emotions, where the estimation is done in segments of the speech signal, i.e. determines the plurality of emotions of the speaker [0031],[0101], where physiological characteristics from an input device, such as the facial expression, can be used to determine a subject’s emotional state, and where the input device is a camera, i.e. based on a face image of the speaker Fig. 12, [0100],[0121:21-27]).  
Balasubramaniam and Shrivastav are analogous art because they are from a similar field of endeavor in identifying user emotions through user speech and other input. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Balasubramaniam's teaching of determining the emotions of a speaker with the use of a facial expression to determine emotions, as taught by Shrivastav. It would have been obvious to combine the references to enable the use of physiological characteristics to determine a subject’s emotional state (Shrivastav [0100]).

Regarding claim 9, Balasubramaniam teaches claim 1.
While Balasubramaniam provides the determination of emotions of a speaker, Balasubramaniam does not specifically teach the use of biological information to determine emotion, and thus does not teach
determines the plurality of emotions of the speaker, based on biological information of the speaker.  
Shrivastav, however, teaches determines the plurality of emotions of the speaker, based on biological information of the speaker (the emotions of a speaker can be categorized from a list of different emotions, where the estimation is done in segments of the speech signal, i.e. determines the plurality of emotions of the speaker [0031],[0101], where physiological characteristics, such as heartbeat, respiration, temperature, and galvanic skin response, i.e. biological information of the speaker, can be used to determine a subject’s emotional state Fig. 12,[0100]).  
Balasubramaniam and Shrivastav are analogous art because they are from a similar field of endeavor in identifying user emotions through user speech and other input. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Balasubramaniam's teaching of determining the emotions of a speaker with the use of physiological characteristics to determine emotions, as taught by Shrivastav. It would have been obvious to combine the references to enable the use of physiological characteristics to determine a subject’s emotional state (Shrivastav [0100]).

Regarding claim 12, Balasubramaniam teaches claim 1.
While Balasubramaniam provides the use of a gender-specific model to process audio data to extract features, Balasubramaniam does not specifically teach that the model is a learning model, and thus does not teach
calculates the plurality of feature amounts, based on a deep learning model learned using learning data that associate the information of each of the plurality of voice sections with the speaker.  
Shrivastav, however, teaches calculates the plurality of feature amounts, based on a deep learning model learned using learning data that associate the information of each of the plurality of voice sections with the speaker (an acoustic model, i.e. deep learning model, can be trained to select features to determine the acoustic features in segments of the speech signal that correspond to each dimension in the model, i.e. calculates the plurality of feature amounts, based on a…model [0031-2], where the training data can include one or more utterances spoken by the speaker, with characteristics associated with specific stimuli for the speaker, i.e. using learning data that associate the information of each of the plurality of voice sections with the speaker [0103-4],[0106]).  
Balasubramaniam and Shrivastav are analogous art because they are from a similar field of endeavor in identifying user emotions through user speech and other input. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Balasubramaniam's teaching of using a gender-specific model to process audio data to extract features with the use of an acoustic model to determine specific acoustic features, where the model can be trained using speaker data, as taught by Shrivastav. It would have been obvious to combine the references to enable monitoring of how emotion may change across a conversation relative to a specific user baseline (Shrivastav [0106]).
Allowable Subject Matter
Claims 2, 3, and 6 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims. More specifically, none of the prior art, either alone or in combination, teaches or makes obvious specifying a combination of clusters with a maximum similarity between directions of a plurality of change vectors. Further, none of the prior art, either alone or in combination, teaches or makes obvious correcting the feature amounts of voice sections determined as the second emotion, based on the change vector, and, in the clustering, clustering the feature amounts of the first voice section determined as the first emotion together with the plurality of corrected feature amounts.
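For illustration only, the subject matter summarized above can be sketched in Python as follows. This is one hypothetical reading and not Applicant's implementation. The names `best_pairing` and `correct`, the sample centroids, and the use of average pairwise cosine similarity as the measure of direction similarity are all assumptions. The sketch further assumes equal numbers of first and second clusters (at least two) and non-zero change vectors.

```python
import math
from itertools import permutations


def subtract(a, b):
    """Element-wise difference a - b of two equal-length vectors."""
    return [x - y for x, y in zip(a, b)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def best_pairing(first_centroids, second_centroids):
    """Pair each first cluster with a second cluster so that the resulting
    change vectors point in the most similar directions (assumed criterion)."""
    best = None
    for perm in permutations(range(len(second_centroids))):
        # Change vector: from a first-emotion centroid to a second-emotion centroid.
        vectors = [subtract(second_centroids[j], first_centroids[i])
                   for i, j in enumerate(perm)]
        sims = [cosine(vectors[a], vectors[b])
                for a in range(len(vectors))
                for b in range(a + 1, len(vectors))]
        score = sum(sims) / len(sims)
        if best is None or score > best[0]:
            best = (score, perm, vectors)
    return best[1], best[2]


def correct(features, change_vector):
    """Shift second-emotion feature amounts back by the change vector."""
    return [subtract(f, change_vector) for f in features]


# Assumed toy centroids for two speakers (neutral vs. non-neutral speech).
neutral = [[0.0, 0.0], [10.0, 0.0]]
emotional = [[1.0, 2.0], [11.0, 2.0]]
pairing, change_vectors = best_pairing(neutral, emotional)
corrected = correct([[1.5, 2.5]], change_vectors[0])
```

In this toy example the selected pairing keeps each speaker's clusters together, and the corrected feature amounts could then be clustered alongside the first-emotion feature amounts.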
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to NICOLE A K SCHMIEDER whose telephone number is (571)270-1474. The examiner can normally be reached 8:00 - 5:00 M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre-Louis Desir, can be reached at (571) 272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/NICOLE A K SCHMIEDER/Examiner, Art Unit 2659
/Paras D Shah/Primary Examiner, Art Unit 2659

08/08/2022