Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION

Response to Arguments
Applicant's arguments with respect to claims 1 and 8-11 have been considered but are moot in view of the new ground(s) of rejection. Applicant’s arguments are directed to the amended subject matter; new prior art citations from Phillips and Kim are provided below. 
Phillips and Kim have been maintained, and new citations are provided to show how prosodic non-linguistic features are present and how response history information is utilized in conjunction with user emotion. For instance, Phillips teaches a feature space as well as phonetics in the form of stored acoustic sounds or audio samples, where such data is acoustic and not linguistic; furthermore, response history is analogous to usage history, since usage comprises responses critical for updating intents with confidence/probabilistic scores (0151, 0160-0162, 0080, 0084, 0097-0100). One or more models are selected, and such one or more models are used for adaptation/learning, i.e. the plurality. Regarding the output of one of three outputs, both Phillips and Kim teach text-to-speech (TTS) output; however, Kim teaches responding with voice output in a conversation using the emotion of the user, as in 0018-0019, 0102, 0053. The voice output of Kim is a more advanced version of the TTS in Phillips, which also utilizes learning models but at the neural network level for learning. See rejection below.


Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-5 and 8-11 are rejected under 35 U.S.C. 103 as being unpatentable over Phillips; Michael S. et al. US 20110055256 A1 (hereinafter Phillips) in view of KIM; Dae Hoon et al. US 20180136615 A1 (hereinafter KIM).
Re claim 1, Phillips teaches
A voice interaction system that has a conversation with a user by using a voice, comprising: hardware, including at least one memory configured to store a computer program and at least one processor configured to execute the computer program; acquire user speech given by the user; (0062, 0099, 0117, 0142, 0145 with fig. 2-2b)
extract at least a feature of the acquired user speech comprising non-linguistic information including prosodic information on the user speech and response history information; (feature space as well as phonetics in the form of stored acoustic sounds or audio samples where such data is acoustic and not linguistic, furthermore response history is analogous to usage history since usage comprises responses critical for updating intents with confidence/probabilistic scores… and usage history with MFCC which is a prosodic non-linguistic frequency based analysis of audio, in this instance for authentication see 0151, 0160-0162, 0080, 0084, 0097-0100, 0089 0107 0165 0169… and 0071, 0062, 0099, 0117, 0142, 0145 with fig. 2-2b)
determine a response in accordance with the extracted feature using any one of a plurality of learning models generated in advance by machine learning, each of the plurality of learning models receiving the feature of the acquired user speech comprising the non-linguistic information and the response history information… (feature space as well as phonetics in the form of stored acoustic sounds or audio samples where such data is acoustic and not linguistic, furthermore response history is analogous to usage history since usage comprises responses critical for updating intents with confidence/probabilistic scores… One or more models is selected and such one or more models is used for adaptation/learning i.e. the plurality. and usage history with MFCC which is a prosodic non-linguistic frequency based analysis of audio, in this instance for authentication see 0151, 0160-0162, 0080, 0084, 0097-0100, 0089 0107 0165 0169… and 0071, 0062, 0099, 0117, 0142, 0145 with fig. 2-2b)
perform control in order to execute the determined response; (command and control, determine response from system based on input based on multiple models, feature space 0071, 0062, 0099, 0117, 0142, 0145 with fig. 2-2b)
select a learning model from the plurality of learning models in accordance with the detected user state, (one of multiple models selecting depending on context of input, user state as in the user actions, command and control, determine response from system based on input based on multiple models, feature space 0071, 0062, 0099, 0117, 0142, 0145 with fig. 2-2b)
wherein the response determination unit, implemented by the hardware, determines the response using the learning model selected by the learning model selection unit.  (contextual response based on one of multiple models selecting depending on context of input, user state as in the user actions, command and control, determine response from system based on input based on multiple models, feature space 0071, 0062, 0099, 0117, 0142, 0145 with fig. 2-2b)

However, Phillips fails to teach
detect an emotion of the user, and detect a user state based on the detected emotion of the user, which is a state of the user; (Kim emotions are extracted from at least face recognition expressly “recognizing an emotion of the identified user from the face image based on emotion learning data of emotions that correspond to a plurality of face images, learned by using the neural network model”, NN models for learning, facial recognition = face recognition 0019 0027 0045 0075 with fig. 2.)
…and outputting one of silent, nod and speech as the response based on emotions of the user; (Kim SVM to produce vectors, NN modeling and learning, an emotion that corresponds to the face image is recognized by applying a CNN model to the recognized face image by the emotion recognition unit 105 using a face recognition unit 103 (S105). For example, one of seven emotions such as anger, happiness, surprise, hatred, sorrow, fear, and neutrality learned in the exemplary embodiment of the present invention is recognized by applying the CNN model learned with respect to the face image, and a result of the emotion recognition is output (S107). In this case, the emotion recognition result is processed into natural language and a conversation sentence that corresponds to the natural language is generated by using the CNN model and then output by voice (S107), 0018-0019, 0102, 0053)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Phillips to incorporate the above claim limitations as taught by Kim to allow for an improvement of Phillips' existing voice output or TTS using an advanced neural network learning model for voice output in conversational context while using emotion thus improving Phillips' voice output to be faster and more accurate once combined as NN and also with emotion output thereby providing confirming speech in-context as in Phillips using a multi-modal input approach analogous to Phillips having text input as well as speech, wherein KIM provides emotion derived from facial recognition input using neural network learning models analogous to Phillips' adaptive learning models thereby using faster models, and further improving context in Phillips by adding a new context such as emotion, wherein such an addition would aid in predicting user actions e.g. reducing disambiguation by detecting excitement or frustration which could help distinguish intents, such would be derived from not only vocal tones but also identified facial expressions from face recognition combined together to produce an enhanced system response with more data (voice, context, face identification, and emotion thereof), which also includes improving Phillips' biometric embodiments for authentication purposes by utilizing face recognition.


Re claim 2, Phillips teaches
2. The voice interaction system according to Claim 1, wherein the user state detection unit detects a degree of activeness of the user in the conversation as the user state, and the learning model selection unit selects the learning model that corresponds to the degree of the activeness of the user.  (contextual response based on one of multiple models selecting depending on context of input, user state as in the user actions, command and control, determine response from system based on input based on multiple models, feature space 0071, 0062, 0099, 0117, 0142, 0145 with fig. 2-2b)


Re claim 3, Phillips teaches
3. The voice interaction system according to Claim 2, wherein the user state detection unit detects an amount of speech given by the user in a predetermined period or a percentage of time during which the user has made a speech with respect to a sum of time during which the voice interaction system has output a voice as a response and the time during which the user has made a speech in the predetermined period, and (time periods and phonemic durations as well as frequency of occurrence i.e. inputs within a time sample overall…contextual response based on one of multiple models selecting depending on context of input, user state as in the user actions, command and control, determine response from system based on input based on multiple models, feature space 0071, 0062, 0099, 0117, 0142, 0145 with fig. 2-2b)
the learning model selection unit selects the learning model that corresponds to the amount of speech given by the user or the percentage of the time during which the user has made a speech.  (correlated to the models, time periods and phonemic durations as well as frequency of occurrence i.e. inputs within a time sample overall…contextual response based on one of multiple models selecting depending on context of input, user state as in the user actions, command and control, determine response from system based on input based on multiple models, feature space 0071, 0062, 0099, 0117, 0142, 0145 with fig. 2-2b)

Re claim 4, Phillips teaches
4. The voice interaction system according to Claim 1, wherein the user state detection unit detects identification information on the user as the user state, and the learning model selection unit selects the learning model that corresponds to the identification information on the user.  (user profiles identity user, contextual response based on one of multiple models selecting depending on context of input, user state as in the user actions, command and control, determine response from system based on input based on multiple models, feature space 0071, 0062, 0099, 0117, 0142, 0145 with fig. 2-2b)

Re claim 5, Phillips fails to teach
The voice interaction system according to Claim 1, wherein the user state detection unit acquires a face image of the user and detects the emotion of the user based on the acquired face image and the learning model selection unit selects the learning model that corresponds to the emotion of the user.  
Kim teaches face recognition expressly “recognizing an emotion of the identified user from the face image based on emotion learning data of emotions that correspond to a plurality of face images, learned by using the neural network model”, NN models for learning, facial recognition = face recognition 0019 0027 0045 0075 with fig. 2.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Phillips to incorporate the above claim limitations as taught by Kim to allow for an improvement of Phillips by incorporating a multi-modal input approach analogous to Phillips having text input as well as speech, wherein KIM provides emotion derived from facial recognition input using neural network learning models analogous to Phillips' adaptive learning models thereby using faster models, and further improving context in Phillips by adding a new context such as emotion, wherein such an addition would aid in predicting user actions e.g. reducing disambiguation by detecting excitement or frustration which could help distinguish intents, such would be derived from not only vocal tones but also identified facial expressions from face recognition combined together to produce an enhanced system response with more data (voice, context, face identification, and emotion thereof), which also includes improving Phillips' biometric embodiments for authentication purposes by utilizing face recognition.



Re claims 8 and 9, Phillips teaches
8. A voice interaction method performed by a voice interaction system that has a conversation with a user by using a voice, the voice interaction method comprising:
acquiring user speech given by the user;  (user speech 0071, 0062, 0099, 0117, 0142, 0145 with fig. 2-2b)
extracting at least a feature of the acquired user speech comprising non-linguistic information including prosodic information on the user speech and response history information; … (feature space as well as phonetics in the form of stored acoustic sounds or audio samples where such data is acoustic and not linguistic, furthermore response history is analogous to usage history since usage comprises responses critical for updating intents with confidence/probabilistic scores… One or more models is selected and such one or more models is used for adaptation/learning i.e. the plurality. and usage history with MFCC which is a prosodic non-linguistic frequency based analysis of audio, in this instance for authentication see 0151, 0160-0162, 0080, 0084, 0097-0100, 0089 0107 0165 0169… and 0071, 0062, 0099, 0117, 0142, 0145 with fig. 2-2b)
determining a response in accordance with the extracted feature using any one of a plurality of learning models generated in advance by machine learning, each of the plurality of learning models receiving the feature of the acquired user speech comprising the non-linguistic information and the response history information… (feature space as well as phonetics in the form of stored acoustic sounds or audio samples where such data is acoustic and not linguistic, furthermore response history is analogous to usage history since usage comprises responses critical for updating intents with confidence/probabilistic scores… One or more models is selected and such one or more models is used for adaptation/learning i.e. the plurality. and usage history with MFCC which is a prosodic non-linguistic frequency based analysis of audio, in this instance for authentication see 0151, 0160-0162, 0080, 0084, 0097-0100, 0089 0107 0165 0169… and 0071, 0062, 0099, 0117, 0142, 0145 with fig. 2-2b)
performing control in order to execute the determined response; (command and control, determine response from system based on input based on multiple models, feature space 0071, 0062, 0099, 0117, 0142, 0145 with fig. 2-2b)
selecting a learning model from the plurality of learning models in accordance with the detected user state, (based on one of multiple models selecting depending on context of input, user state as in the user actions, command and control, determine response from system based on input based on multiple models, feature space 0071, 0062, 0099, 0117, 0142, 0145 with fig. 2-2b)
wherein the response is determined using the selected learning model.  (contextual response based on one of multiple models selecting depending on context of input, user state as in the user actions, command and control, determine response from system based on input based on multiple models, feature space 0071, 0062, 0099, 0117, 0142, 0145 with fig. 2-2b)

However, Phillips fails to teach
detecting an emotion of the user, and detecting a user state based on the detected emotion of the user, which is a state of the user; (Kim emotions are extracted from at least face recognition expressly “recognizing an emotion of the identified user from the face image based on emotion learning data of emotions that correspond to a plurality of face images, learned by using the neural network model”, NN models for learning, facial recognition = face recognition 0019 0027 0045 0075 with fig. 2.)
…and outputting one of silent, nod and speech as the response based on emotions of the user;  (Kim SVM to produce vectors, NN modeling and learning, an emotion that corresponds to the face image is recognized by applying a CNN model to the recognized face image by the emotion recognition unit 105 using a face recognition unit 103 (S105). For example, one of seven emotions such as anger, happiness, surprise, hatred, sorrow, fear, and neutrality learned in the exemplary embodiment of the present invention is recognized by applying the CNN model learned with respect to the face image, and a result of the emotion recognition is output (S107). In this case, the emotion recognition result is processed into natural language and a conversation sentence that corresponds to the natural language is generated by using the CNN model and then output by voice (S107), 0018-0019, 0102, 0053)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Phillips to incorporate the above claim limitations as taught by Kim to allow for an improvement of Phillips' existing voice output or TTS using an advanced neural network learning model for voice output in conversational context while using emotion thus improving Phillips' voice output to be faster and more accurate once combined as NN and also with emotion output thereby providing confirming speech in-context as in Phillips using a multi-modal input approach analogous to Phillips having text input as well as speech, wherein KIM provides emotion derived from facial recognition input using neural network learning models analogous to Phillips' adaptive learning models thereby using faster models, and further improving context in Phillips by adding a new context such as emotion, wherein such an addition would aid in predicting user actions e.g. reducing disambiguation by detecting excitement or frustration which could help distinguish intents, such would be derived from not only vocal tones but also identified facial expressions from face recognition combined together to produce an enhanced system response with more data (voice, context, face identification, and emotion thereof), which also includes improving Phillips' biometric embodiments for authentication purposes by utilizing face recognition.


Re claims 10 and 11, Phillips teaches
10. A learning model generation apparatus configured to generate a learning model used in a voice interaction system that has a conversation with a user by using a voice, the apparatus comprising: hardware, including at least one memory configured to store a computer program and at least one processor configured to execute the computer program; (contextual response based on one of multiple models selecting depending on context of input, user state as in the user actions, command and control, determine response from system based on input based on multiple models, feature space 0071, 0062, 0099, 0117, 0142, 0145 with fig. 2-2b)
acquire user speech, which is speech given by at least one desired user, by having a conversation with the desired user; (system interaction with user 0071, 0062, 0099, 0117, 0142, 0145 with fig. 2-2b)
extract a feature vector indicating at least a feature of the acquired user speech comprising non-linguistic information including prosodic information on the user speech and response history information; (feature space as well as phonetics in the form of stored acoustic sounds or audio samples where such data is acoustic and not linguistic, furthermore response history is analogous to usage history since usage comprises responses critical for updating intents with confidence/probabilistic scores… One or more models is selected and such one or more models is used for adaptation/learning i.e. the plurality. and usage history with MFCC which is a prosodic non-linguistic frequency based analysis of audio, in this instance for authentication see 0151, 0160-0162, 0080, 0084, 0097-0100, 0089 0107 0165 0169… and 0071, 0062, 0099, 0117, 0142, 0145 with fig. 2-2b)
generate sample data in which a correct label indicating a response to the user speech and the feature vector are associated with each other; (determine response from system based on input based on multiple models, feature space 0071, 0062, 0099, 0117, 0142, 0145 with fig. 2-2b)
classify the sample data for each of the user states; and (contextual response based on one of multiple models selecting depending on context of input, user state as in the user actions, command and control, determine response from system based on input based on multiple models, feature space 0071, 0062, 0099, 0117, 0142, 0145 with fig. 2-2b)
generate a plurality of learning models by machine learning for each of pieces of the classified sample data, each of the plurality of learning models receiving the feature vector and the response history information, and output one of silent, nod and speech as the response based on emotions of the user.  (feature space as well as phonetics in the form of stored acoustic sounds or audio samples where such data is acoustic and not linguistic, furthermore response history is analogous to usage history since usage comprises responses critical for updating intents with confidence/probabilistic scores… and usage history with MFCC which is a prosodic non-linguistic frequency based analysis of audio, in this instance for authentication see 0151, 0160-0162, 0080, 0084, 0097-0100, 0089 0107 0165 0169… and 0071, 0062, 0099, 0117, 0142, 0145 with fig. 2-2b)

However, Phillips fails to teach
detect an emotion of the user, and acquire a user state based on the detected emotion of the user, which is a state of the desired user when the user has made a speech, to associate the acquired user state with the sample data that corresponds to the user speech; (Kim SVM to produce vectors, NN modeling and learning, an emotion that corresponds to the face image is recognized by applying a CNN model to the recognized face image by the emotion recognition unit 105 using a face recognition unit 103 (S105). For example, one of seven emotions such as anger, happiness, surprise, hatred, sorrow, fear, and neutrality learned in the exemplary embodiment of the present invention is recognized by applying the CNN model learned with respect to the face image, and a result of the emotion recognition is output (S107). In this case, the emotion recognition result is processed into natural language and a conversation sentence that corresponds to the natural language is generated by using the CNN model and then output by voice (S107), 0018-0019, 0102, 0053… also when a user is speaking, emotions are extracted from at least a multimodal supplemental face recognition expressly “recognizing an emotion of the identified user from the face image based on emotion learning data of emotions that correspond to a plurality of face images, learned by using the neural network model”, NN models for learning, facial recognition = face recognition 0019 0024 0027 0045 0075 with fig. 2.)
and outputting one of silent, nod and speech as the response based on emotions of the user;  (Kim SVM to produce vectors, NN modeling and learning, an emotion that corresponds to the face image is recognized by applying a CNN model to the recognized face image by the emotion recognition unit 105 using a face recognition unit 103 (S105). For example, one of seven emotions such as anger, happiness, surprise, hatred, sorrow, fear, and neutrality learned in the exemplary embodiment of the present invention is recognized by applying the CNN model learned with respect to the face image, and a result of the emotion recognition is output (S107). In this case, the emotion recognition result is processed into natural language and a conversation sentence that corresponds to the natural language is generated by using the CNN model and then output by voice (S107), 0018-0019, 0102, 0053… also when a user is speaking, emotions are extracted from at least a multimodal supplemental face recognition expressly “recognizing an emotion of the identified user from the face image based on emotion learning data of emotions that correspond to a plurality of face images, learned by using the neural network model”, NN models for learning, facial recognition = face recognition 0019 0024 0027 0045 0075 with fig. 2.)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Phillips to incorporate the above claim limitations as taught by Kim to allow for an improvement of Phillips' existing voice output or TTS using an advanced neural network learning model for voice output in conversational context while using emotion thus improving Phillips' voice output to be faster and more accurate once combined as NN and also with emotion output thereby providing confirming speech in-context as in Phillips using a multi-modal input approach analogous to Phillips having text input as well as speech, wherein KIM provides emotion derived from facial recognition input using neural network learning models analogous to Phillips' adaptive learning models thereby using faster models, and further improving context in Phillips by adding a new context such as emotion, wherein such an addition would aid in predicting user actions e.g. reducing disambiguation by detecting excitement or frustration which could help distinguish intents, such would be derived from not only vocal tones but also identified facial expressions from face recognition combined together to produce an enhanced system response with more data (voice, context, face identification, and emotion thereof), which also includes improving Phillips' biometric embodiments for authentication purposes by utilizing face recognition.


Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over Phillips; Michael S. et al. US 20110055256 A1 (hereinafter Phillips) in view of KIM and further in view of HAN; Youngwoong et al. US 20170147753 A1 (hereinafter Han).
Re claim 6, Phillips fails to teach
6. The voice interaction system according to Claim 1, wherein the user state detection unit detects a health condition of the user as the user state, and the learning model selection unit selects the learning model that corresponds to the health condition of the user.
Han teaches learning in the context of health  0024, 0047.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Phillips to incorporate the above claim limitations as taught by Han to allow for an improvement of context in Phillips by adding a new context such as medical/health, wherein such an addition would aid in predicting user actions e.g. if the user has a health related application where contextual input can comprise commands related to logging health, scheduling appointments, conditions, etc. analogous with Phillips.


Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over Phillips; Michael S. et al. US 20110055256 A1 (hereinafter Phillips) in view of KIM and further in view of Biemer; Michael US 20150009010 A1 (hereinafter Biemer).
Re claim 7, Phillips fails to teach
7. The voice interaction system according to Claim 1, wherein the user state detection unit detects a degree of an awakening state of the user as the user state, and the learning model selection unit selects the learning model that corresponds to the degree of the awakening state of the user.
Biemer teaches an awakening state 0090.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Phillips to incorporate the above claim limitations as taught by Biemer to allow for an improvement of context in Phillips by adding a new context such as user alert/awake/absence status, wherein such an addition would aid in predicting user actions e.g. if the user is awake or not to preserve battery power of a device or to alert the user in the instance he/she is driving and falls asleep by analyzing gaze, thereby correlating user actions to associated models.




Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 



Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 

HERGENROEDER; Alex Lauren	US 20180277117 A1
	Voice/speech analysis.


Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHAEL COLUCCI whose telephone number is (571)270-1847.  The examiner can normally be reached on M-F 9 AM - 7 PM.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Flanders can be reached at (571)272-7516.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.


/MICHAEL COLUCCI/
Primary Examiner, Art Unit 2655
(571)-270-1847
Examiner FAX:  (571)-270-2847
Michael.Colucci@uspto.gov