DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on August 13, 2021 has been entered.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 3-6, and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Divakaran et al. (US 2017/0160813 A1, “Divakaran”) in view of Oudeyer et al. (US 2002/0198717 A1, “Oudeyer”).
As to claims 1 and 19-20, Divakaran discloses an intelligent interactive method, comprising: 

performing an intention analysis according to a text content of the user voice message to obtain corresponding basic intention information (speech recognition 1212, para. 0164-0165; virtual personal assistant uses automatic speech recognition and natural language understanding to determine basic intent, para. 0328-0329, 0333-0334); and 
determining corresponding emotional intention information according to the emotion recognition result and the basic intention information (combined analysis of speech recognition and speech emotion detection, para. 0165, to determine what a person wants, para. 0039-0040), wherein the emotional intention information refers to intention information having emotional meaning, which can reflect the emotional needs of the user message while reflecting the basic intention (e.g., the user message is a request to know the meaning of a yellow light next to the speedometer, para. 0328, and anxiety is detected in the tone of voice, which reflects the user’s emotional need for reassurance, para. 0329); and 
determining a corresponding interactive instruction according to the emotional intention information, or determining the corresponding interactive instruction according to the emotional intention information and the basic intention information (virtual personal assistant responds in a reassuring manner, para. 0329, such as “I’m sorry. Did you try aisle six?” to alleviate frustration, para. 0333-0334).
Divakaran differs from claim 1 in that although it teaches outputting a voice broadcast in a reassuring manner according to the emotional intention information, i.e. request for information with anxiety (para. 0329), it does not specifically teach the reassuring manner as being a determined intonation and a speaking speed.

As to claim 3, Divakaran in view of Oudeyer discloses: determining a message type according to interactive scenario, the message type comprises one or more of following: facial expression (Divakaran: facial expressions, para. 0040, 0079, 0131, 0139), action posture (Divakaran: body language or gesture, para. 0079, 0139), voice (Divakaran: speech and voice, Fig. 12, para. 0156), and text (Divakaran: typed text, Fig. 14, para. 0156); and obtaining the user message corresponding to the message type (Divakaran: para. 0266).
As to claim 4, Divakaran in view of Oudeyer discloses: wherein the interactive instruction comprises one or more of the following sentiment presentation modes: a text output sentiment presentation mode, a music play sentiment presentation mode, a voice sentiment presentation mode, an image sentiment presentation mode, and a mechanical action sentiment presentation mode (Divakaran: output 106 includes vocalized output, display of text, graphics or video, action, para. 0051, 0081, 0143).
As to claim 5, Divakaran in view of Oudeyer discloses: wherein the emotional intention information comprises sentiment need information corresponding to the emotion recognition result (Divakaran: detection of anxiety indicates a need for reassurance, para. 0329); or the emotional intention information comprises the sentiment need information corresponding to the emotion recognition result and an association relationship between the emotion recognition 
As to claim 6, Divakaran in view of Oudeyer discloses: wherein the user message comprises at least a user voice message; and the obtaining an emotion recognition result according to an obtained user message comprises: obtaining the emotion recognition result according to the user voice message (Divakaran: detect frustration from user’s tone of voice, para. 0334).
Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over Divakaran in view of Oudeyer, as applied to claim 1 above, and further in view of Kalinli-Akbacak (US 2014/0112556 A1, “Kalinli-Akbacak ‘556”).
Divakaran in view of Oudeyer discloses: obtaining an audio emotion recognition result according to audio data of the user voice message (Divakaran: emotional state is identified from verbal cues, such as the manner in which words were spoken and/or verbalizations that were not words, para. 0139; speech emotion detection engine 1214, para. 0165), but differs from claim 8 in that it does not disclose: 
obtaining a text emotion recognition result according to the text content of the user voice message; 
obtaining an emotion recognition result according to the audio emotion recognition result and the text emotion recognition result, 
wherein the audio emotion recognition result and the text emotion recognition result respectively correspond to one coordinate point in a multi-dimensional emotion space, 
wherein each dimension in the multi-dimensional emotion space corresponds to a psychologically defined sentiment factor, and each of the emotion classifications comprises a plurality of emotion intensity levels; and 
the obtaining an emotion recognition result according to the audio emotion recognition result and the text emotion recognition result comprises: 

using the coordinate points as the emotion recognition result.
Kalinli-Akbacak ‘556 teaches determining an emotional state of a user from analysis of a combination of two or more different types of features, including acoustic and linguistic features (para. 0017, 0021, 0026, 0039); a three-dimensional valence-arousal-dominance model (para. 0031); a plurality of emotion intensity levels (para. 0031-0032, Table I); a fusing process that may take the average of probability scores for estimating emotion class, and weighting (para. 0043); and using the coordinate points as the emotion recognition result (para. 0031).  
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Divakaran in view of Oudeyer with the above teaching of Kalinli-Akbacak ‘556 in order to provide for reliable emotion recognition by fusing multi-modal inputs, as taught by Kalinli-Akbacak ‘556 (para. 0017).
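For illustration only, combining an audio result and a text result that are each expressed as a coordinate point can be sketched as below. The sketch assumes a three-dimensional valence-arousal-dominance space and a weighted average as the combining step. The function name, the axis order, the weight, and the sample values are hypothetical and are not taken from the claims or from Kalinli-Akbacak ‘556.

```python
from typing import Tuple

# Hypothetical axis order: (valence, arousal, dominance).
Point = Tuple[float, float, float]


def fuse_emotion_points(audio: Point, text: Point, audio_weight: float = 0.5) -> Point:
    """Weighted average of two coordinate points in the emotion space."""
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must be between 0 and 1")
    text_weight = 1.0 - audio_weight
    return tuple(audio_weight * a + text_weight * t for a, t in zip(audio, text))


# Illustrative values only.
audio_point = (-0.5, 0.75, -0.25)
text_point = (-0.25, 0.25, 0.25)
fused = fuse_emotion_points(audio_point, text_point)
```

With equal weights the fused point lies midway between the two inputs; a weight of 1.0 would rely on the audio result alone.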
Claims 9-10 and 14-16 are rejected under 35 U.S.C. 103 as being unpatentable over Divakaran in view of Oudeyer, as applied to claim 1 above, and further in view of Kalinli-Akbacak (US 2014/0114655 A1, “Kalinli-Akbacak ‘655”).
Divakaran in view of Oudeyer differs from claim 9 in that it does not disclose: wherein the obtaining an audio emotion recognition result according to audio data of the user voice message comprises: 
extracting an audio feature vector of the user voice message, wherein the user voice message corresponds to a segment of a to-be-identified audio; 
matching the audio feature vector of the user voice message with a plurality of emotional feature models, wherein the plurality of emotional feature models respectively correspond to one of a plurality of emotion classifications; and 

Kalinli-Akbacak ‘655 teaches extracting an audio feature vector for matching with a plurality of emotional feature models and classification (Fig. 1A, para. 0040-0044).  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Divakaran in view of Oudeyer with the above feature of Kalinli-Akbacak ‘655 in order to improve emotion recognition by analyzing only salient parts of a speech signal, as taught by Kalinli-Akbacak ‘655 (para. 0018).
As to claim 10, Divakaran in view of Oudeyer and Kalinli-Akbacak ‘655 discloses: wherein the plurality of emotional feature models are established by pre-learning respective audio feature vector sets of a plurality of preset voice segments comprising emotion classification labels corresponding to the plurality of emotion classifications (Kalinli-Akbacak ‘655: para. 0042-0044).
As to claims 14, 15, 16, Divakaran in view of Oudeyer and Kalinli-Akbacak ‘655 discloses: wherein the audio feature vector comprises one or more of the following audio features: an energy feature, a speech frame number feature, a pitch frequency feature, a formant feature, a harmonic to noise ratio feature, and a mel-frequency cepstral coefficient feature (Kalinli-Akbacak ‘655: para. 0053-0063).
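For illustration only, matching an audio feature vector against a plurality of emotional feature models can be sketched as a nearest-model lookup. The sketch assumes that each model is reduced to a single representative feature vector and that Euclidean distance is the matching criterion; trained models in practice would be more elaborate. The function name, the two-feature layout, and the values are hypothetical and are not taken from the claims or from Kalinli-Akbacak ‘655.

```python
import math
from typing import Dict, List


def classify_emotion(feature_vector: List[float], models: Dict[str, List[float]]) -> str:
    """Return the emotion classification whose model vector is closest (Euclidean distance)."""
    return min(models, key=lambda label: math.dist(feature_vector, models[label]))


# Hypothetical models: one representative vector per emotion classification,
# laid out as (normalized energy, normalized pitch frequency).
emotion_models = {
    "calm": [0.2, 0.3],
    "anxious": [0.7, 0.8],
    "angry": [0.9, 0.5],
}
label = classify_emotion([0.65, 0.75], emotion_models)
```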
Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Divakaran in view of Oudeyer and Kalinli-Akbacak ‘655, as applied to claim 10 above, and further in view of Kalinli-Akbacak et al. (US 2016/0027452 A1, “Kalinli-Akbacak ‘452”).
Divakaran in view of Oudeyer and Kalinli-Akbacak ‘655 differs from claim 11 in that it does not specifically disclose: 
performing clustering processing on the respective audio feature vector sets of the plurality of preset voice segments comprising the emotion classification labels corresponding to 
training, according to the clustering result, an audio feature vector set of the preset voice segment in each cluster to be one of the emotional feature models.
Kalinli-Akbacak ‘452 teaches the use of a clustering process for generating emotion recognition models (para. 0012-0018).  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Divakaran in view of Oudeyer and Kalinli-Akbacak ‘655 with the above teaching of Kalinli-Akbacak ‘452 in order to provide improved emotion recognition models adaptive to different speaking styles to maximize accuracy, as taught by Kalinli-Akbacak ‘452 (para. 0043-0045).
Claims 12-13 are rejected under 35 U.S.C. 103 as being unpatentable over Divakaran in view of Oudeyer and Kalinli-Akbacak ‘655, as applied to claim 9 above, and further in view of Tsiartas et al. (US 2017/0084295 A1, “Tsiartas”).
Divakaran in view of Oudeyer and Kalinli-Akbacak ‘655 differs from claim 12 in that it does not specifically disclose: determining a voice start frame and a voice end frame in the to-be-identified audio stream; and extracting an audio stream portion between the voice start frame and the voice end frame as the user voice message.
Tsiartas teaches determining a time window starting when the user first starts to speak (para. 0028, 0037), start and end times (para. 0048) and audio segmentation which identifies segments defined by start and end points of each portion of speech (para. 0088).  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Divakaran in view of Oudeyer and Kalinli-Akbacak ‘655 with the above teaching of Tsiartas in order to identify the audio input window which contains speech for subsequent analysis, which would have been recognized by one of ordinary skill in the art as a predictable result.
As to claim 13, Divakaran in view of Oudeyer, Kalinli-Akbacak ‘655 and Tsiartas discloses: 

after the voice end frame of a previous voice segment, or a first voice segment is not yet identified, and when a first preset quantity of voice frames are consecutively determined as speech frames, using the first voice frame of the first preset quantity of the voice frames as the voice start frame of a current voice segment (Tsiartas: para. 0088, 0116); and 
after the voice start frame of the current voice segment, and when a second preset quantity of voice frames are consecutively determined as non-speech frames, using the first voice frame of the second preset quantity of the voice frames as the voice end frame of the current voice segment (Tsiartas: para. 0116).
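For illustration only, the start-frame and end-frame determination recited above can be sketched as a scan over per-frame speech/non-speech decisions. The sketch assumes those decisions are already available as a list of booleans; the function name, the parameter names, and the preset quantities of 3 are hypothetical and are not taken from the claims or from Tsiartas.

```python
from typing import List, Optional, Tuple


def find_voice_segments(
    is_speech: List[bool], start_run: int = 3, end_run: int = 3
) -> List[Tuple[int, Optional[int]]]:
    """Return (start_frame, end_frame) pairs; end_frame is None if the audio ends mid-segment."""
    segments: List[Tuple[int, Optional[int]]] = []
    start: Optional[int] = None  # start frame of the currently open segment, if any
    run = 0                      # length of the run currently being counted
    for i, speech in enumerate(is_speech):
        if start is None:
            # No open segment: count consecutive speech frames.
            run = run + 1 if speech else 0
            if run == start_run:
                start = i - start_run + 1  # first frame of the speech run
                run = 0
        else:
            # Open segment: count consecutive non-speech frames.
            run = run + 1 if not speech else 0
            if run == end_run:
                segments.append((start, i - end_run + 1))  # first frame of the non-speech run
                start = None
                run = 0
    if start is not None:
        segments.append((start, None))
    return segments


frames = [False, True, True, True, True, False, True, False, False, False, True]
segments = find_voice_segments(frames)
```

In this example the single non-speech frame at index 5 does not close the segment, because fewer than the second preset quantity of consecutive non-speech frames has been observed at that point.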
Claims 17-18 are rejected under 35 U.S.C. 103 as being unpatentable over Divakaran in view of Oudeyer, as applied to claim 1 above, and further in view of Wang et al. (US 2018/0314689 A1, “Wang”).
Divakaran in view of Oudeyer differs from claim 17 in that it does not specifically disclose: 
matching the text content of the user voice message with a plurality of preset semantic templates in a semantic knowledge repository to determine a matched semantic template; and 
obtaining the basic intention information corresponding to the matched semantic template, 
wherein a correspondence between the semantic template and the basic intention information is pre-established in the semantic knowledge repository, and same intention information corresponds to one or more semantic templates.
Wang teaches applying semantic rules and/or models to determine intent associated with verbal input (para. 0003, 0169).  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Divakaran in view of Oudeyer with the above feature of Wang in order to more accurately identify user intent.
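For illustration only, matching text content against preset semantic templates, where the same intention corresponds to one or more templates, can be sketched as below. The sketch assumes the templates are regular expressions held in an in-memory mapping. The repository contents, the intention names, and the function name are hypothetical and are not taken from the claims or from Wang.

```python
import re
from typing import Optional

# Hypothetical repository: each basic intention maps to one or more templates.
SEMANTIC_TEMPLATES = {
    "ask_indicator_meaning": [
        r"what does the (?P<item>.+) mean",
        r"what is the (?P<item>.+) for",
    ],
    "locate_product": [
        r"where (?:is|are) the (?P<item>.+)",
    ],
}


def match_basic_intention(text: str) -> Optional[str]:
    """Return the intention of the first template that matches the text, else None."""
    normalized = text.lower().strip(" ?.!")
    for intention, templates in SEMANTIC_TEMPLATES.items():
        for template in templates:
            if re.fullmatch(template, normalized):
                return intention
    return None


intent = match_basic_intention("What does the yellow light mean?")
```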
Claims 21-22 are rejected under 35 U.S.C. 103 as being unpatentable over Divakaran in view of Oudeyer, as applied to claim 1 above, and further in view of Tsiartas.
Divakaran in view of Oudeyer differs from claim 21 in that it does not specifically disclose: calculating confidence of an emotion classification in the audio emotion recognition result and confidence of an emotion classification in the text emotion recognition result; obtaining the emotion recognition result according to the confidence in the audio emotion recognition result and the confidence in the text emotion recognition result.
Tsiartas teaches the calculation of a measure of degree or confidence in an emotion classification (para. 0027).  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Divakaran in view of Oudeyer with the above teaching of Tsiartas by applying the confidence calculation to both audio and text emotion recognition results in order to provide a more accurate indication of speaker state.
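For illustration only, obtaining an emotion recognition result from the confidences of an audio result and a text result can be sketched as below. The sketch assumes each result is a classification paired with a confidence between 0 and 1, and that the more confident result is kept when the two disagree. The selection rule, the threshold, and the function name are hypothetical and are not taken from the claims or from Tsiartas.

```python
from typing import Optional, Tuple

# (emotion classification, confidence in [0, 1])
Result = Tuple[str, float]


def combine_by_confidence(audio: Result, text: Result, threshold: float = 0.5) -> Optional[str]:
    """Pick an emotion classification from two results using their confidences."""
    audio_label, _ = audio
    text_label, _ = text
    if audio_label == text_label:
        return audio_label  # both modalities agree
    best_label, best_conf = max((audio, text), key=lambda r: r[1])
    # On disagreement, keep the more confident result only if it clears the threshold.
    return best_label if best_conf >= threshold else None


result = combine_by_confidence(("anxious", 0.8), ("neutral", 0.6))
```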
As to claim 22, Divakaran in view of Oudeyer and Tsiartas discloses: wherein the determining corresponding emotional intention information according to the emotion recognition result and the basic intention information comprises: 
determining corresponding emotional intention information according to the emotion recognition result and the basic intention information, in combination with an emotion recognition result and basic intention information of a previous user voice message and/or a subsequent user voice message (Tsiartas: para. 0047, 0086, 0098-0099).
Response to Arguments
Applicant's arguments filed August 13, 2021 have been fully considered but they are not persuasive.  
Applicant argues that “in Oudeyer, parameters used in voice synthesis are related to the environment and internal state” whereas “in claim 1, the intonation and the speaking speed of the voice broadcast are determined based on the emotional intention information.”
However, Divakaran was relied upon to teach adjusting the voice broadcast based on emotional state, i.e. by providing a voice response in a reassuring manner based on the detection of anxiety in a user request.  Divakaran differs from claim 1 in that it does not specify the reassuring manner as including a change in intonation and speaking speed.  Oudeyer teaches the desirability of outputting a synthesized voice using particular prosodic parameters in order to express calm and comfort.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to provide the reassuring voice response of Divakaran using adjusted prosodic parameters, as taught by Oudeyer, in order to express comfort and reassurance in a more natural, human-like manner.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.  McCord et al. (US 2018/0082679 A1) teach a method for emotion-enhanced natural speech audio generation.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to STELLA L WOO whose telephone number is (571)272-7512.  The examiner can normally be reached on Monday - Friday, 9 a.m. to 3 p.m.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is 
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ahmad Matar, can be reached on 571-272-7488.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/Stella L. Woo/Primary Examiner, Art Unit 2652