DETAILED ACTION
1.	This communication is in response to the Application filed on 7/17/2019. Claims 1-19 are pending and have been examined.
Claim Rejections - 35 USC § 103
2.	Claims 1-2, 8-14, 17-19 are rejected under 35 U.S.C. 103 as being unpatentable over Eggen, et al. (US 20080235018; hereinafter EGGEN) in view of Senior, et al. (US 8527276; hereinafter SENIOR).
As per claim 1, EGGEN (Title: Method and System for Determining the Topic of a Conversation and Locating and Presenting Related Content) discloses “A learning device (EGGEN, [0003], the intelligent system would need to monitor the conversation and understand what topic(s) were being discussed <read on learning> without requiring explicit input from the participants) comprising: 
a voice recognition unit configured to perform voice recognition of speech voice of a plurality of users (EGGEN, [0022], The speech recognition system captures the conversation of one or more participants; [0022], The speech recognition system .. converts the audio information to text); 
an estimation unit configured to estimate a status when a speech is made (EGGEN, [0023], extracts keywords from the transcript of the audio track; [0003], the intelligent system would need to monitor the conversation and understand what topic(s) were being discussed <where ‘keywords’ and ‘topics’ read on ‘status’ of the speech/conversation which can be broadly interpreted>); and
a learning unit configured to learn, on a basis of data of the speech voice, a result of the voice recognition, and the status when the speech is made, [ voice synthesis data ] to be used for [ generation of synthesized voice ] according to a status upon voice synthesis (EGGEN, [0003], Based on the conversation, the system would search for and retrieve content and information; [0018], The supplemental content is then presented to the participants; [0018], the expert system presents the content in the form of audio information, including speech, sounds, and music).”
EGGEN does not expressly disclose “voice synthesis data .. generation of synthesized voice ..” However, the limitation is taught by SENIOR (Title: Speech synthesis using deep neural networks).
In the same field of endeavor, SENIOR teaches: [Abstract] “speech synthesis using deep neural networks.”
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of SENIOR in the system taught by EGGEN to generate synthesized speech for voice instructions or reminders to the participants in the conversation.
As per claim 2 (dependent on claim 1), EGGEN in view of SENIOR further discloses “wherein the estimation unit generates, on a basis of the voice recognition result, relationship data indicating a relationship between a speech user and a speech destination user included in the plurality of users (EGGEN, [0023], extracts keywords from the transcript of the audio track <where relationship is determined based on the keyword extracted from voice/speech such as ‘father’ or ‘mother’ and so on>).”
As per claim 8 (dependent on claim 1), EGGEN in view of SENIOR further discloses “wherein the learning unit generates, as the voice synthesis data, dictionary data obtained in such a manner that each of the speech voice data and the voice recognition result is classified according to the status when the speech is made (EGGEN, [0027], if the participants 105, 110 are discussing the weather, the system 200 may inspire the participants 105, 110 by presenting information on the weather forecast .. if they are discussing plans for a vacation in Australia, the system 200 may present photographs and nature sounds of Australia; and if they are simply discussing what to have for dinner, the system 200 may present pictures of entrees along with their recipes <read on classified recognition results and ‘dictionary data’ based on the recognized keywords/topics or ‘status’, where ‘status’ can be broadly interpreted>; SENIOR, Abstract, speech synthesis using deep neural networks).”
As per claim 9 (dependent on claim 1), EGGEN in view of SENIOR further discloses “wherein the learning unit generates, as the voice synthesis data, a neural network taking information regarding each of the voice recognition result and the status when the speech is made as input and taking the speech voice data as output (SENIOR, Abstract, speech synthesis using deep neural networks).”
Claim 10 (similar in scope to claim 1) is rejected under the same rationale as applied above for claim 1.  
As per claim 11, EGGEN discloses “A [ voice synthesis ] device comprising: 
an estimation unit configured to estimate a status (EGGEN, [0023], extracts keywords from the transcript of the audio track; [0003], the intelligent system would need to monitor the conversation and understand what topic(s) were being discussed <where ‘keywords’ and ‘topics’ read on ‘status’ of the speech/conversation which can be broadly interpreted>); and
a generation unit configured to use [ voice synthesis data ] generated by learning on a basis of data on speech voice of a plurality of users, a voice recognition result of the speech voice, and a status when a speech is made to [ generate synthesized voice ] indicating a content of predetermined text data and obtained according to the estimated status (EGGEN, [0001], a method and system for obtaining and presenting content <read on predetermined text data such as already existent on the Web> that is relevant to an ongoing conversation <where ‘conversation’ reads on a plurality of users>; [0003], the intelligent system would need to monitor the conversation and understand what topic(s) were being discussed <read on learning and where ‘topics’ read on ‘status’ which can be broadly interpreted> without requiring explicit input from the participants; [0022], The speech recognition system .. converts the audio information to text; [0018], the expert system presents the content in the form of audio information, including speech, sounds, and music).”
EGGEN does not expressly disclose “voice synthesis .. generate synthesized voice ..” However, the limitation is taught by SENIOR (Title: Speech synthesis using deep neural networks).
In the same field of endeavor, SENIOR teaches: [Abstract] “speech synthesis using deep neural networks.”
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of SENIOR in the system taught by EGGEN to generate synthesized speech for voice instructions or reminders to the participants in the conversation.
As per claim 12 (dependent on claim 11), EGGEN in view of SENIOR further discloses “wherein the generation unit generates the synthesized voice taking, as a speaker user, a user according to a speech destination user of the synthesized voice (SENIOR, Abstract, speech synthesis using deep neural networks <The applicant is requested to clarify the claim as the writing is hard to understand>).”
As per claim 13 (dependent on claim 12), EGGEN in view of SENIOR further discloses “a control unit configured to select a speech user on a basis of relationship data indicating a relationship between the speech user and a speech destination user included in the plurality of users, the relationship data being generated upon learning on a basis of the voice recognition result (<The applicant is requested to clarify the limitation, because to select a speech user, the system must know BOTH the relationship and the speech destination user, but the claim does not specify if the destination speech user is already known>; EGGEN, [0023], extracts keywords from the transcript of the audio track <where the relationship is determined based on the keyword extracted such as ‘father’ or ‘mother’ and so on>; [0022], The speech recognition system captures the conversation of one or more participants; [0022], The speech recognition system .. converts the audio information to text <for keyword extraction to determine relationship>).”
As per claim 14 (dependent on claim 13), EGGEN in view of SENIOR further discloses “wherein the control unit selects the speech destination user on a basis of the content of the text data (EGGEN, [0023], extracts keywords from the transcript of the audio track <where the speech destination user is determined based on the keyword extracted such as a name or ‘father’ or ‘mother’ from the content of the text data such as to determine relationship given the identity of the speech user>).”
Claims 17-18, 19 (similar in scope to claims 8-9, 11) are rejected under the same rationale as applied above for claims 8-9, 11.  
3.	Claims 3-5, 15 are rejected under 35 U.S.C. 103 as being unpatentable over EGGEN in view of SENIOR, and further in view of Nakadai, et al. (US 20040104702; hereinafter NAKADAI).
As per claim 3 (dependent on claim 2), EGGEN in view of SENIOR further discloses “an image recognition unit configured to analyze a captured image [ to recognize a face on the image ]; and
a voice signal processing unit configured [ to detect a sound source direction ] on a basis of a voice signal detected when the speech is made,
wherein the estimation unit [ specifies the speech user on a basis of the sound source direction and a direction of the face on the image ].”
EGGEN in view of SENIOR does not expressly disclose “to recognize a face on the image .. to detect a sound source direction .. specifies the speech user on a basis of the sound source direction and a direction of the face on the image ..” However, the limitation is taught by NAKADAI (Title: Robot audiovisual system).
In the same field of endeavor, NAKADAI teaches: [0010] “.. to track vision and audition for an object or target,” [0014] “the said audition module in response to sound signals from the said microphones is adapted to extract pitches therefrom, separate their sound sources from each other and locate sound sources such as to identify a sound source as at least one speaker .. the said vision module on the basis of an image taken by the camera is adapted to identify by face, and locate, each such speaker ..” and [0019] “to find and identify the direction in which each of the sound sources as individual speakers lies.” 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of NAKADAI in the system taught by EGGEN and SENIOR to identify a talker based on sound source direction and face recognition.
As per claim 4 (dependent on claim 3), EGGEN in view of SENIOR and NAKADAI further discloses “wherein the estimation unit specifies, as the speech destination user, a user making a speech subsequently to a speech made by the speech user (EGGEN, [0022], The speech recognition system captures the conversation of one or more participants <where conversation reads on speech and subsequent responsive speech>; NAKADAI, [0014], to identify a sound source as at least one speaker; [0019], to find and identify the direction in which each of the sound sources as individual speakers lies).”
As per claim 5 (dependent on claim 3), EGGEN in view of SENIOR and NAKADAI further discloses “wherein the voice signal processing unit extracts, as a noise component, components of other directions than the sound source direction of speech voice of the speech user from the voice signal (NAKADAI, [0014], the said audition module in response to sound signals from the said microphones <where multiple sound signals read on noise components> is adapted to extract pitches therefrom, separate their sound sources from each other and locate sound sources such as to identify a sound source as at least one speaker <read on target sound other than the noise>).”
Claim 15 (similar in scope to claims 3 and 5) is rejected under the same rationale as applied above for claims 3 and 5.  
4.	Claims 6-7, 16 are rejected under 35 U.S.C. 103 as being unpatentable over EGGEN in view of SENIOR and NAKADAI, and further in view of Shaburov, et al. (US 20150286858; hereinafter SHABUROV).
As per claim 6 (dependent on claim 5), EGGEN in view of SENIOR and NAKADAI further discloses “wherein [ the image recognition unit recognizes an emotion of the speech user whose face is on the image ].”
EGGEN in view of SENIOR and NAKADAI does not expressly disclose “the image recognition unit recognizes an emotion of the speech user whose face is on the image ..” However, the limitation is taught by SHABUROV (Title: Emotion recognition in video conferencing).
In the same field of endeavor, SHABUROV teaches: [0005] “.. for video conferencing, in which an emotional status of participating individuals can be recognized .. determining the emotional status by analyzing a video channel to detect facial emotions and/or an audio channel to detect speech emotions.”
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of SHABUROV in the system taught by EGGEN, SENIOR and NAKADAI to provide emotion estimation based on the talker’s speech or facial expression.
As per claim 7 (dependent on claim 6), EGGEN in view of SENIOR, NAKADAI and SHABUROV further discloses “wherein the estimation unit estimates, as the status when the speech is made, the emotion of the speech user and a noise level (SHABUROV, [0005], an emotional status of participating individuals can be recognized .. determining the emotional status by analyzing a video channel to detect facial emotions and/or an audio channel to detect speech emotions; NAKADAI, [0143], The levels of a background noise automatically measured may be displayed).”
As per claim 16 (dependent on claim 15), EGGEN in view of SENIOR and NAKADAI further discloses “wherein the estimation unit [ specifies an emotion of the speaker user on a basis of the content of the text data to estimate the emotion ] of the speaker user and a noise level as the status (EGGEN, [0023], extracts keywords from the transcript of the audio track <read on .” 
EGGEN in view of SENIOR and NAKADAI does not expressly disclose “specifies an emotion of the speaker user on a basis of the content of the text data to estimate the emotion ..” However, the limitation is taught by SHABUROV (Title: Emotion recognition in video conferencing).
In the same field of endeavor, SHABUROV teaches: [0005] “an emotional status of participating individuals can be recognized .. determining the emotional status by analyzing .. an audio channel to detect speech emotions” and [0012] “the recognizing of the speech emotion can comprise recognizing a speech in the audio stream.”
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of SHABUROV in the system taught by EGGEN, SENIOR and NAKADAI to provide emotion estimation based on the talker’s recognized spoken content.
 				Conclusion
5.	 Any inquiry concerning this communication or earlier communications from the examiner should be directed to FENG-TZER TZENG whose telephone number is (571)272-4609. The examiner can normally be reached on M-F (8:00-5:30). The fax phone number where this application or proceeding is assigned is 571-273-4609.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre-Louis Desir (SPE) can be reached on (571)272-7799.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications 

/FENG-TZER TZENG/	3/10/2021

Primary Examiner, Art Unit 2659