DETAILED ACTION
1.	This communication is in response to the Amendments and Arguments (RCE) filed on 8/30/2021. Claims 1-20 are pending and have been examined. Claim 20 is added.
Response to Amendments and Arguments
2.	  Applicant's arguments with respect to claim rejections under 35 U.S.C. 103 have been fully considered, but they are not persuasive. In particular, the applicant argues that the references do not teach most of the limitations of the independent claims, in particular: “generation of synthesized voice according to the estimated status .. the estimated status is based on context information .. the context information includes at least one of an identity of the speech user or an identify of a speech destination user included in the plurality of user ..” In response, the examiner respectfully disagrees.
Note that FADO teaches: [Abstract] “automatically adjusting volume of speech generated by a text-to-speech application <read on synthesized voice> can include measuring an ambient noise level of an audio environment <read on estimating a status which is not based on content of the speech voice>.” NAKADAI teaches: [0010] “to track vision and audition for an object or target” and [0014] “the said audition module in response to sound signals from the said microphones is adapted to extract pitches therefrom, separate their sound sources from each other and locate sound sources such as to identify a sound source as at least one speaker .. the said vision module on the basis of an image taken by the camera is adapted to identify by face, and locate, each such speaker ..” which teach a ready mechanism to “identify” any speaker in a conversation. MORIO teaches: [Abstract] “The multiple voice synthesis unit (16) generates a standard voice signal by means of waveform superimposition based on voice element data read from a voice element database (15) and prosodic information from a voice element selecting unit (14) <read on based on a user’s identity or any other criterion> .. Accordingly, a concurrent vocalization by multiple speakers based on the same text can be implemented.”
The applicant is advised to recite how the system obtains the “user identity” so as to distinguish the claims from the cited references for allowability consideration.
Claim Rejections - 35 USC § 103
3.	Claims 1-5, 8, 10-15, 17-20 are rejected under 35 U.S.C. 103 as being unpatentable over Eggen, et al. (US 20080235018; hereinafter EGGEN) in view of Fado, et al. (US 20040193422; hereinafter FADO), further in view of Nakadai, et al. (US 20040104702; hereinafter NAKADAI) and further in view of Morio, et al. (US 20040054537; hereinafter MORIO).
As per claim 1, EGGEN (Title: Method and System for Determining the Topic of a Conversation and Locating and Presenting Related Content) discloses “A learning device (EGGEN, [0003], the intelligent system would need to monitor the conversation and understand what topic(s) were being discussed <read on learning> without requiring explicit input from the participants) comprising: 
a voice recognition unit configured to perform voice recognition of speech voice of a plurality of users (EGGEN, [0022], The speech recognition system captures the conversation of one or more participants; [0022], The speech recognition system .. converts the audio information to text); 
[ an estimation unit configured to estimate a status ] when a speech is made by a speech user among the plurality of users (EGGEN, [0003], to monitor the conversation); and
a learning unit configured to learn, on a basis of data of the speech voice, a result of the voice recognition, and the estimated status when the speech is made, [ voice synthesis data ] to be used for [ generation of synthesized voice according to the estimated status ] upon voice synthesis (EGGEN, [0003], Based on the conversation, the system would search for and retrieve content and information; [0018], The supplemental content is then presented to the participants; [0018], the expert system presents the content in the form of audio information, including speech, sounds, and music);
wherein [ the estimated status is based on context information other than content of the speech voice ], wherein [ the context information includes at least one of an identity of the speech user or an identify of a speech destination user included in the plurality of user ], and wherein the voice recognition unit, the estimation unit, and the learning unit are each implemented via at least one processor (EGGEN, [0020], the processor 201 could be distributed or singular <to implement multiple functions on a single processor or separate processors is a system design choice>).”
EGGEN does not expressly disclose “an estimation unit configured to estimate a status .. voice synthesis data .. generation of synthesized voice according to the estimated status .. the estimated status is based on context information other than content of the speech voice ..” However, the limitation is taught by FADO (Title: Compensating for ambient noise levels in text-to-speech applications). Note that Specification [0012]: “The estimation unit can estimate, as the statuses when the speech is made, the emotion of the speech user and a noise level.”
In the same field of endeavor, FADO teaches: [Abstract] “automatically adjusting volume of speech generated by a text-to-speech application <read on voice synthesis> can include measuring an ambient noise level of an audio environment <read on estimating a status which is not based on content of the speech voice>. A target volume for speech output 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of FADO in the system taught by EGGEN to provide background noise measurement combined with speech synthesis for synthetic voice power control. Also see Claim 6 where SHABUROV teaches: [0005] “for video conferencing, in which an emotional status of participating individuals can be recognized .. determining the emotional status by analyzing a video channel to detect facial emotions and/or an audio channel to detect speech emotions.”
EGGEN in view of FADO does not expressly disclose “the context information includes at least one of an identity of the speech user or an identify of a speech destination user included in the plurality of user ..” However, the limitation is taught by NAKADAI (Title: Robot audiovisual system). Examiner’s Note: The applicant is requested to clarify how the system obtains the “user identity.” Otherwise, the limitation is subject to the broadest reasonable interpretation as to the process through which the user identity is obtained.
In the same field of endeavor, NAKADAI teaches: [0010] “to track vision and audition for an object or target” and [0014] “the said audition module in response to sound signals from the said microphones is adapted to extract pitches therefrom, separate their sound sources from each other and locate sound sources such as to identify a sound source as at least one speaker .. the said vision module on the basis of an image taken by the camera is adapted to identify by face, and locate, each such speaker ..” 

EGGEN in view of FADO and NAKADAI does not expressly disclose the overall combined limitation of “generation of synthesized voice according to the estimated status .. the estimated status is based on context information .. the context information includes at least one of an identity of the speech user or an identify of a speech destination user included in the plurality of user ..” However, the limitation is taught by MORIO (Title: Text voice synthesis device and program recording medium). 
In the same field of endeavor, MORIO teaches: [Abstract] “The multiple voice synthesis unit (16) generates a standard voice signal by means of waveform superimposition based on voice element data read from a voice element database (15) and prosodic information from a voice element selecting unit (14) <read on based on a user’s identity or any other criterion> .. Accordingly, a concurrent vocalization by multiple speakers based on the same text can be implemented ..” 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of MORIO in the system taught by EGGEN, FADO and NAKADAI to provide personalized synthetic speech based on the known speaker’s identity.
As per claim 2 (dependent on claim 1), EGGEN in view of FADO, NAKADAI and MORIO further discloses “wherein the estimation unit is further configured to generate, on a basis of the voice recognition result, relationship data indicating a relationship between the speech user and the speech destination user included in the plurality of users (EGGEN, [0023], extracts keywords from the transcript of the audio track; [0027], if they are discussing plans for a vacation in Australia, the system 200 may present photographs and nature sounds of Australia; and if they are simply discussing what to have for dinner, the system 200 may present pictures of entrees along with their recipes <where relationship is determined based on the keyword extracted>).”
As per claim 3 (dependent on claim 2), EGGEN in view of FADO, NAKADAI and MORIO further discloses “an image recognition unit configured to analyze a captured image to recognize a face on the image (NAKADAI, [0014], the said vision module on the basis of an image taken by the camera is adapted to identify by face); and
a voice signal processing unit configured to detect a sound source direction on a basis of a voice signal detected when the speech is made (NAKADAI, [0019], to find and identify the direction in which each of the sound sources as individual speakers lies);
wherein the estimation unit is further configured to specify the speech user on a basis of the sound source direction and a direction of the face on the image (NAKADAI, [0010], to track vision and audition for an object or target; [0014], the said audition module in response to sound signals from the said microphones is adapted to extract pitches therefrom, separate their sound sources from each other and locate sound sources such as to identify a sound source as at least one speaker .. the said vision module on the basis of an image taken by the camera is adapted to identify by face, and locate, each such speaker ..; [0019], to find and identify the direction in which each of the sound sources as individual speakers lies); and wherein the image recognition unit and the voice signal processing unit are each implemented via at least one processor (EGGEN, [0020], the processor 201 could be distributed or singular <to implement multiple functions on a single processor or separate processors is a system design choice>).”
As per claim 4 (dependent on claim 3), EGGEN in view of FADO, NAKADAI and MORIO further discloses “wherein the estimation unit is further configured to specify, as the speech destination user, a user making a speech subsequently to a speech made by the speech user (EGGEN, [0022], The speech recognition system captures the conversation of one or more participants <where conversation reads on speech and subsequent responsive speech>; NAKADAI, [0014], to identify a sound source as at least one speaker; [0019], to find and identify the direction in which each of the sound sources as individual speakers lies).” 
As per claim 5 (dependent on claim 3), EGGEN in view of FADO, NAKADAI and MORIO further discloses “wherein the voice signal processing unit is further configured to extract, as a noise component, components of other directions than the sound source direction of speech voice of the speech user from the voice signal (NAKADAI, [0014], the said audition module in response to sound signals from the said microphones <where multiple sound signals read on noise components> is adapted to extract pitches therefrom, separate their sound sources from each other and locate sound sources such as to identify a sound source as at least one speaker <read on target sound other than the noise>).”
As per claim 8 (dependent on claim 1), EGGEN in view of FADO, NAKADAI and MORIO further discloses “wherein the learning unit is further configured to generate, as the voice synthesis data, dictionary data obtained in such a manner that each of the speech voice data and the voice recognition result is classified according to the status when the speech is made (FADO, [Abstract], automatically adjusting volume of speech generated by a text-to-speech application can include measuring <read on classification> an ambient noise level of an audio environment; EGGEN, [0027], .. 105, 110 are discussing the weather, the system 200 may inspire the participants 105, 110 by presenting information on the weather forecast .. if they are discussing plans for a vacation in Australia, the system 200 may present photographs and nature sounds of Australia; and if they are simply discussing what to have for dinner, the system 200 may present pictures of entrees along with their recipes <read on ‘classified speech voice data and voice recognition result.’ The applicant must clarify ‘dictionary data’ as the term used here is ambiguous>).”
Claims 10 and 11 (similar in scope to claim 1) are rejected under the same rationale as applied above for claim 1.
As per claim 12 (dependent on claim 11), EGGEN in view of FADO, NAKADAI and MORIO further discloses “wherein the generation unit is further configured to generate, the synthesized voice taking, as a speaker user for the synthesized voice, a user determined according to the identity of the speech destination user of the synthesized voice (Examiner’s Note: The applicant is requested to clarify the wording of this limitation, as it is confusing. MORIO, [Abstract], The multiple voice synthesis unit (16) generates a standard voice signal by means of waveform superimposition based on voice element data read from a voice element database and prosodic information from a voice element selecting unit (14) <read on based on a user’s identity> .. Accordingly, a concurrent vocalization by multiple speakers based on the same text can be implemented).”
As per claim 13 (dependent on claim 12), EGGEN in view of FADO, NAKADAI and MORIO further discloses “a control unit configured to select a speech user on a basis of relationship data indicating a relationship between the speech user and the speech destination user included in the plurality of users, the relationship data being generated upon learning on a basis of the voice recognition result, wherein the estimation unit and the generation unit are each implemented via at least one processor (EGGEN, [0023], extracts keywords from the transcript of the audio track; [0027], if they are discussing plans for a vacation in Australia, the system 200 may present photographs and nature sounds of Australia; and if they are simply discussing what to have for dinner, the system 200 may present pictures of entrees along with their recipes <where relationship is determined based on the keyword extracted, and to select which speaker user after determining the relationship of the conversation speakers is a system design choice>; [0020], the processor 201 could be distributed or singular <to implement multiple functions on a single processor or separate processors is a system design choice>).”
As per claim 14 (dependent on claim 13), EGGEN in view of FADO, NAKADAI and MORIO further discloses “wherein the control unit selects the speech destination user on a basis of the content of the text data (EGGEN, [0023], extracts keywords from the transcript of the audio track <where the speech destination user is determined, as a system design choice, based on the keyword extracted from the content of the text data>).”
Claim 15 (similar in scope to claims 3 and 5) is rejected under the same rationale as applied above for claims 3 and 5.  
Claims 17-18, 19 (similar in scope to claims 8-9, 11) are rejected under the same rationale as applied above for claims 8-9, 11. 
As per claim 20 (dependent on claim 8), EGGEN in view of FADO, NAKADAI and MORIO further discloses “wherein the learning unit generates the dictionary data as the voice synthesis data based on the identity of the speech user and the identity of the speech destination user (MORIO, [Abstract], The multiple voice synthesis unit (16) generates a standard voice signal by means of waveform superimposition based on voice element data read from a voice element database and prosodic information from a voice element selecting unit (14) <read on based on a user’s identity and dictionary data which can be broadly interpreted> .. Accordingly, a concurrent vocalization by multiple speakers based on the same text can be implemented).”
4.	Claims 6-7, 9, 16 are rejected under 35 U.S.C. 103 as being unpatentable over EGGEN in view of FADO, NAKADAI and MORIO, and further in view of Shaburov, et al. (US 20150286858; hereinafter SHABUROV).
As per claim 6 (dependent on claim 5), EGGEN in view of FADO, NAKADAI and MORIO further discloses “wherein [ the image recognition unit recognizes an emotion of the speech user whose face is on the image ].”
EGGEN in view of FADO, NAKADAI and MORIO does not expressly disclose “the image recognition unit recognizes an emotion of the speech user whose face is on the image ..” However, the limitation is taught by SHABUROV (Title: Emotion recognition in video conferencing).
In the same field of endeavor, SHABUROV teaches: [0005] “.. for video conferencing, in which an emotional status of participating individuals can be recognized .. determining the emotional status by analyzing a video channel to detect facial emotions and/or an audio channel to detect speech emotions.”

As per claim 7 (dependent on claim 6), EGGEN in view of FADO, NAKADAI, MORIO and SHABUROV further discloses “wherein the estimation unit estimates, as the status when the speech is made, the emotion of the speech user and a noise level (SHABUROV, [0005], an emotional status of participating individuals can be recognized .. determining the emotional status by analyzing a video channel to detect facial emotions and/or an audio channel to detect speech emotions; NAKADAI, [0143], The levels of a background noise automatically measured may be displayed).”
As per claim 9 (dependent on claim 1), EGGEN in view of FADO, NAKADAI and MORIO further discloses “wherein the learning unit is further configured to generate, as the voice synthesis data, [ a neural network ] taking information regarding each of the voice recognition result and the status when the speech is made as input and taking the speech voice data as output (FADO, [Abstract], automatically adjusting volume of speech generated by a text-to-speech application can include measuring an ambient noise level <read on status> of an audio environment; EGGEN, [0003], Based on the conversation <read on ‘voice recognition result’>, the system would search for and retrieve content and information; [0018], the expert system presents the content in the form of audio information, including speech, sounds, and music).”
EGGEN in view of FADO, NAKADAI and MORIO does not expressly disclose “a neural network ..” However, the limitation is taught by SHABUROV (Title: Emotion recognition in video conferencing).

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of SHABUROV in the system taught by EGGEN, FADO, NAKADAI and MORIO to provide a ready mechanism for speech recognition and speech synthesis based on a neural network algorithm.
As per claim 16 (dependent on claim 15), EGGEN in view of FADO, NAKADAI and MORIO further discloses “wherein the estimation unit [ specifies an emotion of the speaker user on a basis of the content of the text data to estimate the emotion ] of the speaker user and a noise level as the status (EGGEN, [0023], extracts keywords from the transcript of the audio track <read on estimating any status including the speaker’s emotion based on the extracted keywords from the content of the text data>; NAKADAI, [0143], The levels of a background noise automatically measured may be displayed).”
EGGEN in view of FADO, NAKADAI and MORIO does not expressly disclose “specifies an emotion of the speaker user on a basis of the content of the text data to estimate the emotion ..” However, the limitation is taught by SHABUROV (Title: Emotion recognition in video conferencing).
In the same field of endeavor, SHABUROV teaches: [0005] “an emotional status of participating individuals can be recognized .. determining the emotional status by analyzing .. an audio channel to detect speech emotions” and [0012] “the recognizing of the speech emotion can comprise recognizing a speech in the audio stream.”

Conclusion
5.	 Any inquiry concerning this communication or earlier communications from the examiner should be directed to FENG-TZER TZENG whose telephone number is (571)272-4609. The examiner can normally be reached on M-F (8:00-5:30). The fax phone number where this application or proceeding is assigned is 571-273-4609.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre-Louis Desir (SPE) can be reached on (571)272-7799.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/FENG-TZER TZENG/	9/20/2021

Primary Examiner, Art Unit 2659