DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the
first inventor to file provisions of the AIA.

Response to Amendment
The amendment filed on September 16th, 2022 has been entered. Claims 1-19 remain
pending. Applicant’s amendments and changes have overcome all the 35 U.S.C. 112(b) rejections and specification objections previously set forth in the Non-Final Office Action mailed on May 13th, 2022. 

Response to Arguments
Applicant’s arguments filed on September 16th, 2022 have been fully considered but
they are not persuasive. Applicant’s arguments with respect to claims 1, 4-13, 16-17,
and 20 have been considered but are moot because the new grounds of rejection were necessitated by the amendments, which changed the scope of the claims.
	Applicant argues on pg. 10-11 and 14-15 that the cited references do not teach, “wherein the determining the speech style characteristic for the received plurality of sentences includes receiving setting information for modifying detailed characteristics of the speech style characteristic for a part of the plurality of sentences through the user interface and modifying the detailed characteristics of the speech style characteristic for the part of the plurality of sentences, and the determined speech style characteristic includes the modified detailed characteristics for the part of the plurality of sentences and original detailed characteristics for other part of the plurality of sentences” as recited in amended independent claim 1. Furthermore, Applicant relies on figures 3 and 5 of Kurz to argue that the voice models for all portions 52 and 54 are updated in the same manner simultaneously and that Kurz therefore fails to teach the amended limitation present in independent claim 1. Moreover, Applicant argues that Tam, Pore, and Yang do not teach the limitation either. 
	The limitation, “wherein the determining the speech style characteristic for the received plurality of sentences includes receiving setting information for modifying detailed characteristics of the speech style characteristic for a part of the plurality of sentences through the user interface and modifying the detailed characteristics of the speech style characteristic for the part of the plurality of sentences, and the determined speech style characteristic includes the modified detailed characteristics for the part of the plurality of sentences and original detailed characteristics for other part of the plurality of sentences” further limits the scope of the claim, as it adds detail to the earlier limitation of determining a speech style characteristic and further requires the use of the user interface. Please see below for the factual inquiries for establishing obviousness under 35 U.S.C. 103, which detail the rationale for obviousness of the features that have been amended. Applicant’s arguments with respect to independent claim 1 under 35 U.S.C. 102(a)(1) have been fully considered and are moot upon further consideration and in view of the new ground(s) of rejection made under AIA 35 U.S.C. 103 as being unpatentable over Chen et al. (US Pub. No. 2014/0025382 A1) hereinafter Chen in view of Kurzweil et al. (US Pub. No. 2019/0196666 A1) hereinafter Kurz. 

	Applicant argues on pg. 15, that support for new claims 17-19 can be found, for example, at paragraphs [0135] and [0151] of the present publication (US 2021/0142783); applicant further argues that the cited references do not teach the limitations present as recited in claims 17-19. 
	It is agreed that support for the new claims can be found in paras. 135 and 151 of the present publication. Please see below for the factual inquiries for establishing obviousness under 35 U.S.C. 103, which detail the rationale for obviousness of the features that have been added. Applicant’s arguments with respect to new claims 17-19 have been fully considered and are moot upon further consideration and in view of the ground(s) of rejection made under AIA 35 U.S.C. 103. Claims 17-18 are rejected as being unpatentable over Chen et al. (US Pub. No. 2014/0025382 A1) hereinafter Chen in view of Kurzweil et al. (US Pub. No. 2019/0196666 A1) hereinafter Kurz further in view of Mahyar (US Pat. No. 10,930,263). Claim 19 is rejected as being unpatentable over Chen in view of Kurz further in view of S. Yang, Z. Wu and L. Xie, "On the training of DNN-based average voice model for speech synthesis," 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2016, pp. 1-6, doi: 10.1109/APSIPA.2016.7820818.




Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35
U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness
rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


The factual inquiries for establishing a background for determining obviousness under
35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.


This application currently names joint inventors. In considering patentability of the
claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.


Claims 1-3, 6-9, 11, 12, and 15-16 are rejected under 35 U.S.C. 103 as being
unpatentable over Chen et al. (US Pub. No. 2014/0025382 A1) hereinafter Chen in view of Kurzweil et al. (US Pub. No. 2019/0196666 A1) hereinafter Kurz.	
Regarding claim 1, Chen teaches a method for generating a synthetic speech for text through a user interface (Para. 19-20, In an embodiment a text to speech method is provided, the method comprising: [0020] receiving input text; furthermore, Para. 54 indicates the input module 11 is connected to a text input 15. Text input 15 receives text. The text input 15 may be for example a keyboard, i.e. a user interface comprising a program, a keyboard, and a user), the method comprising:
receiving a plurality of sentences (Para. 54, text input 15 may be a means for receiving text data from an external storage medium or a network; furthermore, para. 73 indicates It is assumed that each utterance in the training data 251 contains unique expressive information. This unique expressive information can be determined from the speech data and can be read from the transcription of the speech, i.e. the text data as well. In the training data, the speech sentences and text sentences are synchronized as shown in FIG. 5 i.e. text data received are sentences which are plural);
determining a speech style characteristic for the plurality of sentences (Para. 79, the "expressive linguistic feature extraction" block 253 converts the text to be synthesized into a linguistic feature vector in linguistic feature space 255, then through the transformation block 261, the linguistic feature is mapped to a synthesis feature in expressive synthesis space 259. This synthesis feature vector contains the emotion information in original text data); and
outputting a synthetic speech for the plurality of sentences that reflects the determined speech style characteristic (Para. 79, can be used by the synthesizer 207 (FIG. 4) directly to synthesize the expressive speech from the sentences provided as input),
the plurality of sentences and the determined speech style characteristic are inputted to an artificial neural network text-to-speech synthesis model and the synthetic speech is generated based on speech data outputted from the artificial neural network text- to-speech synthesis model (Para. 80, In an embodiment, machine learning methods, e.g. neural network (NN) are used to provide the transformation block 261 and train the transformations from expressive linguistic space 255 to expressive synthesis space 259. For each sentence in the training data 251 as in various sentences from the plurality stated above, the speech data is used to generate an expressive synthesis feature vector in synthesis feature space 259 and the transcription of the speech data is used to generate an expressive linguistic feature in linguistic feature space 255.  Para 79, synthesis feature vector contains the emotion information in original text data and can be used by the synthesizer 207 (FIG. 4) directly to synthesize the expressive speech.  Using the linguistic features of the training data as the input of NN and the synthesis features of the training data as the target output, the parameters of the NN can be updated to learn the mapping from linguistic feature space to synthesis feature space i.e. linguistic features containing text and speech style characteristics are inputted into the neural network as to synthesize speech matching the features). 
However, Chen fails to explicitly disclose:
wherein the determining the speech style characteristic for the received plurality of sentences includes receiving setting information for modifying detailed characteristics of the speech style characteristic for a part of the plurality of sentences through the user interface and modifying the detailed characteristics of the speech style characteristic for the part of the plurality of sentences, and the determined speech style characteristic includes the modified detailed characteristics for the part of the plurality of sentences and original detailed characteristics for other part of the plurality of sentences,
In a related field of endeavor (e.g. narration of text, see para. 5) Kurz teaches text has some portions that have been associated with a particular character or voice model and others that have not. This is represented visually on the user interface as some portions exhibiting a visual indicium and others not exhibiting a visual indicium (e.g., the text includes some highlighted portions and some non-highlighted portions). A default voice model can be used to provide the narration for the portions that have not been associated with a particular character or voice model (e.g., all non-highlighted portions). For example, in a typical story much of the text relates to describing the scene and not to actual words spoken by characters in the story. Such non-dialog portions of the text may remain non-highlighted and not associated with a particular character or voice model. These portions can be read using the default voice (e.g., a narrator's voice) while the dialog portions may be associated with a particular character or voice model (and indicated by the highlighting) such that a different, unique voice is used for dialog spoken by each character in the story, see para. 31. For example, depending on the portions of the text and the determined text style characteristics, these portions are then rendered with their corresponding voice models, which are indicated in the user interface with highlighting; furthermore, para. 32 indicates a menu giving specifications of the various voice models. Furthermore, para. 36 indicates, The system 10 determines 112 if the user is making additional selections of portions of the text to associate with particular characters. If the user is making additional selections of portions of the text, the system returns to receiving 104 the user's selection of portions of the text, displays 106 the menu of available characters, receives a user selection and generates a visual indication to apply to a subsequent portion of text. 
Furthermore, parts of the plurality of sentences may be highlighted individually so as to choose a speaker model with slight alterations to a default model, which leads to a modification in the synthesis of the text. Moreover, Kurz teaches the narration software 30 permits the user to select and optionally modify a particular voice model which defines and controls aspects of the computer voice, including for example, the speaking speed and volume. The voice model includes the language of the computer voice. The voice model may be selected from a database that includes multiple voice models to apply to selected portions of the document. A voice model can have other parameters associated with it besides the voice itself and the language, speed and volume, including, for example, gender (male or female), age (e.g. child or adult), voice pitch, visual indication (such as a particular color of highlighting) of document text that is associated with this voice model, emotion (e.g. angry, sad, etc.), intensity (e.g. mumble, whisper, conversational, projecting voice as at a party, yell, shout). The user can select different voice models to apply to different portions of text such that when the system 10 reads the text the different portions are read using the different voice models. The system can also provide a visual indication, such as highlighting, of which portions are associated with which voice models in the electronic document, see para. 26 and figure 8 with underlined text. Therefore, detailed modifications may be made to a default model, and the coordinated highlighting may distinguish the modified model characteristics from the original ones. As such, part of a sentence may carry a detailed modification of the original characteristics (e.g. a whisper) while the remainder of the same sentence returns to the original characteristics. 
Modifying Chen to include the features disclosed by Kurz discloses:
wherein the determining the speech style characteristic for the received plurality of sentences includes receiving setting information for modifying detailed characteristics of the speech style characteristic for a part of the plurality of sentences through the user interface (e.g. Chen’s method of generating synthetic speech through an interface now also including the feature wherein the determining the speech style characteristic for the received plurality of sentences includes receiving setting information for modifying detailed characteristics of the speech style characteristic for a part of the plurality of sentences through the user interface as taught by Kurz, see para. 31-32) and modifying the detailed characteristics of the speech style characteristic for the part of the plurality of sentences (e.g. Chen’s method of generating synthetic speech through an interface now also including the feature of modifying the detailed characteristics of the speech style characteristic for the part of the plurality of sentences as taught by Kurz, see para. 26 and 32), and the determined speech style characteristic includes the modified detailed characteristics for the part of the plurality of sentences and original detailed characteristics for other part of the plurality of sentences (e.g. Chen’s method of generating synthetic speech through an interface now also including the feature wherein the determined speech style characteristic includes the modified detailed characteristics for the part of the plurality of sentences and original detailed characteristics for other part of the plurality of sentences).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to apply the teachings of Kurz to the method of Chen, given the similar field of endeavor (e.g. narration of text); furthermore, doing so would have provided the users of Chen with the added benefits recognized by Kurz, namely that pre-associating a voice model with a document and enabling a user to overwrite the voice model with a different voice model for selected portions of words in the document provides the advantage of enabling a user to select one or more portions of the text to be read using a voice model that is different from the narrator's voice model. It is believed that this can be advantageous because using multiple voices when reading the document can produce an audio output that is more interesting and engaging for a user. Enabling the user to overwrite the voice model that is preassociated with the document can also provide the advantage of allowing the user to associate different voice models with different portions of the document, and this may be done through the user interface. For example, if the document is a play script, each role in the play can be selected and associated with a different voice model while background information or other non-entity read parts can be read using the narrator's voice, see para. 7. Furthermore, para. 53 recognizes that the computer system reduces the amount of time necessary to select and associate voice models with different portions of the story.
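
For illustration only, the combined arrangement discussed above (a default style applied to all sentences, with user-supplied setting information overriding detailed characteristics for only part of them) can be sketched as follows. This is a minimal sketch and not a characterization of the actual implementation of Chen, Kurz, or the present application; the names SpeechStyle, determine_styles, and the override mapping are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SpeechStyle:
    # Hypothetical "detailed characteristics" of a speech style.
    speed: float = 1.0
    volume: float = 1.0
    emotion: str = "neutral"


def determine_styles(sentences, default_style, overrides):
    """Return one style per sentence.

    overrides maps a sentence index to a dict of changed fields
    (hypothetical stand-in for setting information received via a UI).
    Sentences without an override keep the original default style.
    """
    styles = []
    for index, _sentence in enumerate(sentences):
        changes = overrides.get(index)
        if changes:
            styles.append(replace(default_style, **changes))
        else:
            styles.append(default_style)
    return styles


sentences = ["The wolf knocked.", "Let me in!", "The pig refused."]
default = SpeechStyle()
styles = determine_styles(
    sentences, default, {1: {"emotion": "angry", "volume": 1.5}}
)
```

In this sketch only the second sentence carries modified characteristics, while the first and third retain the original ones.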

Regarding claim 2, Chen in view of Kurz teaches the method of claim 1 (see claim 1 above), in addition, Chen teaches:
further comprising outputting the plurality of sentences (Para. 79, can be used by the synthesizer 207 (FIG. 4) directly to synthesize the expressive speech i.e. synthesizes sentences that are inputted), wherein the determining the speech style characteristics of the received plurality of sentences includes changing setting information for at least a part of the outputted one or more sentences (Para. 81, the "linguistic feature extraction" block 253 converts the text data into a linguistic feature vector. This feature vector should contain the discriminative information, i.e. if two text data contains different emotion, their linguistic features should be distinguishable in the linguistic features space i.e. feature vectors are changed therefore parameters/settings of the synthesized speech are also changed; furthermore, under the broadest reasonable interpretation, changed setting information is anything that may change the speech style characteristics, an interpretation applied hereinafter in this Office action, and in this case, the feature vectors may change the speech style characteristics), the speech style characteristic applied to the at least part of the plurality of sentences is changed based on the changed setting information (Para. 81, the "linguistic feature extraction" block 253 converts the text data into a linguistic feature vector. This feature vector should contain the discriminative information, i.e. if two text data contains different emotion, their linguistic features should be distinguishable in the linguistic features space i.e. feature vectors are changed therefore parameters/settings of the synthesized speech are also changed; therefore, speech style applied is changed based on the settings changed when determining the speech style characteristics for the received one or more sentences), and 
the at least part of the plurality of sentences and the changed speech style characteristic are inputted to the artificial neural network text-to-speech synthesis model and the synthetic speech is changed based on speech data outputted from the artificial neural network text-to-speech synthesis model (Para. 80, In an embodiment, machine learning methods, e.g. neural network (NN) are used to provide the transformation block 261 and train the transformations from expressive linguistic space 255 to expressive synthesis space 259. For each sentence in the training data 251, the speech data is used to generate an expressive synthesis feature vector in synthesis feature space 259 and the transcription of the speech data is used to generate an expressive linguistic feature in linguistic feature space 255. Using the linguistic features of the training data as the input of NN and the synthesis features of the training data as the target output, the parameters of the NN can be updated to learn the mapping from linguistic feature space to synthesis feature space i.e. linguistic features containing text and speech style characteristics are inputted into the neural network as to synthesize speech matching the features; furthermore, para. 81 indicates the "linguistic feature extraction" block 253 converts the text data into a linguistic feature vector. This feature vector should contain the discriminative information, i.e. if two text data contains different emotion, their linguistic features should be distinguishable in the linguistic features space i.e. feature vectors are changed therefore parameters/settings of the synthesized speech are also changed; therefore, these individual feature vectors are inputted in the model to synthesize speech according to the indicated speech style characteristics).

Regarding claim 3, Chen in view of Kurz teaches the method of claim 2 (see claim 2 above);
However, the previous rationale for obviousness under independent claim 1 does not make obvious:
wherein the changing the setting information for the at least part of the outputted plurality of sentences includes changing setting information for visual representation of the part of the outputted plurality of sentences.
In a related field of endeavor (e.g. narration of text, see para. 5) Kurz teaches text has some portions that have been associated with a particular character or voice model and others that have not. This is represented visually on the user interface as some portions exhibiting a visual indicium and others not exhibiting a visual indicium (e.g., the text includes some highlighted portions and some non-highlighted portions). A default voice model can be used to provide the narration for the portions that have not been associated with a particular character or voice model (e.g., all non-highlighted portions). For example, in a typical story much of the text relates to describing the scene and not to actual words spoken by characters in the story. Such non-dialog portions of the text may remain non-highlighted and not associated with a particular character or voice model. These portions can be read using the default voice (e.g., a narrator's voice) while the dialog portions may be associated with a particular character or voice model (and indicated by the highlighting) such that a different, unique voice is used for dialog spoken by each character in the story, see para. 31. For example, depending on the portions of the text and the determined text style characteristics, these portions are then rendered with their corresponding voice models, which are indicated in the user interface with highlighting; furthermore, para. 32 indicates a menu giving specifications of the various voice models. Furthermore, para. 36 indicates, The system 10 determines 112 if the user is making additional selections of portions of the text to associate with particular characters. If the user is making additional selections of portions of the text, the system returns to receiving 104 the user's selection of portions of the text, displays 106 the menu of available characters, receives a user selection and generates a visual indication to apply to a subsequent portion of text.
Modifying Chen to include the features of Kurz discloses:
wherein the changing the setting information for the at least part of the outputted plurality of sentences includes changing setting information for visual representation of the part of the outputted plurality of sentences (e.g. Chen’s method of generating synthetic speech through an interface now also including the feature wherein changing the setting information for the at least part of the outputted plurality of sentences includes changing setting information for visual representation of the part of the outputted plurality of sentences, as taught by Kurz, see paras. 31-32 and 36).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to apply the teachings of Kurz to the method of Chen, given the similar field of endeavor (e.g. narration of text); furthermore, doing so would have provided the users of Chen with the added benefit of visually distinguishing portions of the text, with some portions exhibiting a visual indicium and others not, so as to represent how different portions of the text are being synthesized, as a person of ordinary skill in the art would recognize from the teachings of Kurz, see para. 31. Furthermore, para. 35 indicates, In order to make selection of the character more user friendly, the drop down menu 55 can include an image (e.g., images 57, 59, and 61) of the character. For example, one of the character voices can be similar to the voice of the Fox television cartoon character Homer Simpson (e.g., character 58), an image of Homer Simpson (e.g., image 59) could be included in the drop down menu 55. Inclusion of the images is believed to make selection of the desired voice model to apply to different portions of the text more user friendly.

Regarding claim 6, Chen in view of Kurz teaches the method of claim 1 (see claim 1 above), using the motivation for combination stated in independent claim 1, Kurz also teaches:
dividing the plurality of sentences into one or more sets of sentences (para. 54, the computer system can step through each of the non-highlighted or non-associated portions and ask the user which character to associate with the quotation. For example, the computer system could recognize that the first portion 202 of the text shown in FIG. 9 is spoken by the narrator because the portion is not enclosed in quotations. When reaching the first set of quotations including the text “Please man give me that straw to build me a house,” the computer system could request an input from the user of which character to associate with the quotation. Such a process could continue until the entire text had been associated with different characters; furthermore, it may be divided and analyzed with the use of natural language process, see para. 55) wherein the determining the speech style characteristic for the received plurality of sentences includes:
determining a role corresponding to the divided one or more sets of sentences (Para. 53, the computer system searches the text of a story 200 (in this case the story of the Three Little Pigs) to identify the portions spoken by the narrator (e.g., the non-dialog portions). The system associates all of the non-dialog portions with the voice model for the narrator as indicated by the highlighted portions 202, 206, and 210); and 
setting a predetermined speech style characteristic corresponding to the determined role (Para. 31 teaches, these portions can be read using the default voice (e.g., a narrator's voice) as stated above, the determined role is the narrator which has a default voice i.e. predetermined speech style characteristic).
The reasoned combination of Chen, as modified by the teachings of Kurz noted above thus makes obvious:
dividing the plurality of sentences into one or more sets of sentences (e.g. Chen’s method of generating synthetic speech through an interface now also including the feature where it divides the plurality of sentences into one or more sets of sentences as taught by Kurz, see paras. 54-55), wherein the determining the speech style characteristic for the received plurality of sentences includes:
determining a role corresponding to the divided one or more sets of sentences (e.g. Chen’s method of generating synthetic speech through an interface now also including the feature where it includes determining a role corresponding to the divided one or more sets of sentences as taught by Kurz, see para. 53); and 
setting a predetermined speech style characteristic corresponding to the determined role (e.g. Chen’s method of generating synthetic speech through an interface now also including the feature where it includes setting a predetermined speech style characteristic corresponding to the determined role as taught by Kurz, see para. 31).
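
For illustration only, the dividing and role-setting steps mapped above (non-dialog portions associated with a narrator role having a default voice, quoted portions left for later character association) can be sketched as follows. This is a minimal sketch under stated assumptions and does not reproduce Kurz's actual implementation; the function split_and_assign, the DEFAULT_VOICES table, and the use of straight double quotes as the dialog delimiter are hypothetical.

```python
import re

# Hypothetical table of predetermined voices per role.
DEFAULT_VOICES = {"narrator": "default_voice"}


def split_and_assign(text):
    """Divide text into quoted and non-quoted segments.

    Non-quoted segments are assigned the narrator role and its
    predetermined voice. Quoted segments are returned with role None,
    to be associated with a character later (e.g. by user input).
    """
    segments = []
    # The capturing group makes re.split keep the quoted spans.
    for part in re.split(r'("[^"]*")', text):
        part = part.strip()
        if not part:
            continue
        if part.startswith('"'):
            segments.append((part, None, None))
        else:
            segments.append((part, "narrator", DEFAULT_VOICES["narrator"]))
    return segments


story = (
    'The pig met a man. '
    '"Please man give me that straw to build me a house," '
    'said the pig.'
)
segments = split_and_assign(story)
```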

Regarding claim 7, Chen in view of Kurz teaches the method of claim 6 (see claim 6 above), using the motivation for combination stated in independent claim 1, Kurz also teaches:
wherein an analysis result is generated by analyzing the divided one or more sets of sentences using natural language processing (Para. 55, the system automatically selects a character to associate with each quotation based on the words of the text using a natural language process i.e. division of sets of one or more sentences undergo natural language processing analysis), and
the determining the role corresponding to the divided one or more sets of sentences includes:
outputting one or more role candidates recommended based on the analysis result of the one or more sets of sentences (Para. 55, Such a process could continue until the entire text had been associated with different characters; furthermore, it may be divided and analyzed with the use of natural language process. Kurz also teaches in para. 53, the computer system searches the text of a story 200 (in this case the story of the Three Little Pigs) to identify the portions spoken by the narrator (e.g., the non-dialog portions). The system associates all of the non-dialog portions with the voice model for the narrator as indicated by the highlighted portions 202, 206, and 210, i.e. role candidates are recommended based on the analysis result conducted through natural language processing in a manner of textual entity resolution); and
selecting at least a part of the outputted one or more role candidates (Para. 55, the system automatically selects a character to associate with each quotation based on the words of the text using a natural language process; furthermore, In the event that the computer system selects the incorrect character, the user can modify the character selection using one or more of techniques described herein i.e. selecting at least a part of the outputted one or more role candidates).
The reasoned combination of Chen, as modified by the teachings of Kurz noted above thus makes obvious:
wherein the divided one or more sets of sentences are analyzed using natural language processing (e.g. Chen’s method of generating synthetic speech through an interface now also including the feature wherein the divided one or more sets of sentences are analyzed using natural language processing as taught by Kurz, see para. 55), and 
the determining the role corresponding to the divided one or more sets of sentences includes:
outputting one or more role candidates recommended based on the analysis result of the one or more sets of sentences (e.g. Chen’s method of generating synthetic speech through an interface now also including the feature outputting one or more role candidates recommended based on the analysis result of the one or more sets of sentences as taught by Kurz, see paras. 53 and 55); and
selecting at least a part of the outputted one or more role candidates (e.g. Chen’s method of generating synthetic speech through an interface now also including the feature selecting at least a part of the outputted one or more role candidates as taught by Kurz, see para. 55).

Regarding claim 8, Chen in view of Kurz teaches the method of claim 7 (see claim 7 above), using the motivation for combination stated in independent claim 1, Kurz also teaches:
wherein the divided one or more sets of sentences are grouped based on the analysis result (Para. 55, the system automatically selects a character to associate with each quotation based on the words of the text using a natural language process i.e. division of sets of one or more sentences undergo natural language processing analysis and the result leads to a grouping that may be seen through the highlighting as indicated by para. 31), and
the determining the role corresponding to the divided one or more sets of sentences
includes:
outputting one or more role candidates corresponding to each of the grouped sets of sentences recommended based on the analysis result (Para. 55, Such a process could continue until the entire text had been associated with different characters; furthermore, it may be divided e.g. grouped sentences and analyzed with the use of natural language process. Kurz also teaches in para. 53, the computer system searches the text of a story 200 (in this case the story of the Three Little Pigs) to identify the portions spoken by the narrator (e.g., the non-dialog portions). The system associates all of the non-dialog portions with the voice model for the narrator as indicated by the highlighted portions 202, 206, and 210, i.e. role candidates are recommended based on the analysis result conducted through natural language processing in a manner of textual entity resolution; furthermore, the grouping is demonstrated by the highlighting as to distinguish the various portions corresponding to each speaker, see para. 31); and
selecting at least a part of the outputted one or more role candidates (Para. 55, the system automatically selects a character to associate with each quotation based on the words of the text using a natural language process; furthermore, in the event that the computer system selects the incorrect character, the user can modify the character selection using one or more of techniques described herein i.e. selecting at least a part of the outputted one or more role candidates).
The reasoned combination of Chen, as modified by the teachings of Kurz noted above thus makes obvious:
wherein the divided one or more sets of sentences are grouped based on the analysis result (e.g. Chen’s method of generating synthetic speech through an interface now also including the feature wherein the divided one or more sets of sentences are grouped based on the analysis result as taught by Kurz, see paras. 31 and 55), and 
the determining the role corresponding to the divided one or more sets of sentences includes:
outputting one or more role candidates corresponding to each of the grouped sets of sentences recommended based on the analysis result (e.g. Chen’s method of generating synthetic speech through an interface now also including the feature outputting one or more role candidates recommended corresponding to each of the grouped sets recommended based on the analysis result taught by Kurz, see paras. 31 and 55); and
selecting at least a part of the outputted one or more role candidates (e.g. Chen’s method of generating synthetic speech through an interface now also including the feature selecting at least a part of the outputted one or more role candidates as taught by Kurz, see para. 55).

Regarding claim 9, Chen in view of Kurz teaches the method of claim 7 (see claim 7 above), using the motivation for combination stated in independent claim 1, Kurz also teaches:
wherein the determining the speech style characteristic for the received plurality of sentences includes:
outputting one or more speech style characteristic candidates recommended based on the analysis result of the one or more sets of sentences (Para. 55, Such a process could continue until the entire text had been associated with different characters; furthermore, it may be divided e.g. grouped sentences and analyzed with the use of natural language process. Kurz also teaches in para. 53, the computer system searches the text of a story 200 (in this case the story of the Three Little Pigs) to identify the portions spoken by the narrator (e.g., the non-dialog portions). The system associates all of the non-dialog portions with the voice model for the narrator as indicated by the highlighted portions 202, 206, and 210, i.e. role candidates are recommended based on the analysis result conducted through natural language processing in a manner of textual entity resolution; furthermore, the grouping is demonstrated by the highlighting as to distinguish the various portions corresponding to each speaker, see para. 31 i.e. speech styles corresponding to each of the grouped sets of sentences are recommended based on the analysis result of the natural language processing, see para. 28 where A character can have multiple associated moods. “Mood attributes” can be various attributes of a character. For instance, one attribute can be “normal,” other attributes include “happy,” “sad,” “tired,” “energetic,” “fast talking,” “slow talking,” “native language,” “foreign language,” “hushed voice,” “loud voice,” etc. Mood attributes can include varying features such as speed of playback, volumes, pitch, etc. or can be the result of recording different voices corresponding to the different moods as mentioned in para. 27, the system 10 reads different portions of the text 50 using different voice models. 
For example, if the text includes multiple characters, a listener may find listening to the text more engaging if different voices are used for each of the characters in the text rather than using a single voice for the entire narration of the text. In another example, extremely important or key points could be emphasized by using a different voice model to recite those portions of the text); and
selecting at least a part of the outputted one or more speech style characteristic candidates (Para. 55, the system automatically selects a character to associate with each quotation based on the words of the text using a natural language process; furthermore, in the event that the computer system selects the incorrect character, the user can modify the character selection using one or more of techniques described herein i.e. selecting at least a part of the outputted one or more role candidates where para. 42 indicates the menu for modification where the menu contains speech style characteristics to be selected).
The reasoned combination of Chen, as modified by the teachings of Kurz noted above thus makes obvious:
wherein the determining the speech style characteristic for the received plurality of sentences includes:
outputting one or more speech style characteristic candidates recommended based on the analysis result of the one or more sets of sentences (e.g. Chen’s method of generating synthetic speech through an interface now also including the feature outputting one or more speech style characteristic candidates recommended based on the analysis result taught by Kurz, see paras. 27, 31 and 55); and
selecting at least a part of the outputted one or more speech style characteristic candidates (e.g. Chen’s method of generating synthetic speech through an interface now also including the feature selecting at least a part of the outputted one or more speech style characteristic candidates as taught by Kurz, see paras. 42 and 55).

Regarding claim 11, Chen in view of Kurz teaches the method of claim 1 (see claim 1 above), in addition, Chen teaches:
wherein an audio content including the synthetic speech is generated (Para. 23, indicates outputting said sequence of speech vectors as audio; furthermore, Para. 79, can be used by the synthesizer 207 (FIG. 4) directly to synthesize the expressive speech).


Regarding claim 12, Chen in view of Kurz teaches the method of claim 11 (see claim 11 above), in addition, Chen teaches:
further comprising, in response to a request to download the generated audio content, receiving the generated audio content (Connected to the output module 13 is output for audio 17. The audio output 17 is used for outputting a speech signal converted from text which is input into text input 15. The audio output 17 may be for example a direct audio output e.g. a speaker or an output for an audio data file which may be sent to a storage medium, networked etc. i.e. to use the text-to-speech system is a request to download the generated audio content as its purpose is to output synthesized speech; therefore, as a response the downloading is done where the audio file may be sent e.g. transmission of data to a storage medium, networked etc. or may be direct audio output).

Regarding claim 15, Chen in view of Kurz teaches the method of claim 1 (see claim 1 above), in addition, Chen teaches:
further comprising outputting the received plurality of sentences (Para. 79, can be used by the synthesizer 207 (FIG. 4) directly to synthesize the expressive speech i.e. synthesizes sentences that are inputted), 
wherein the determining the speech style characteristic for the received plurality of sentences includes:
the at least part of the plurality of sentences and the changed value indicative of the
speech style characteristic are inputted to the artificial neural network text-to-speech synthesis model and the synthetic speech is changed based on speech data outputted from the artificial neural network text-to-speech synthesis model (Para. 242, the processor 180 may adjust the speech style feature for a default TTS engine based on the speech style determined with respect to the default TTS engine stored in the memory 170, and may generate a speech corresponding to the text using the adjusted default TTS engine where repeated citation of claim 1, (Para. 80, In an embodiment, machine learning methods, e.g. neural network (NN) are used to provide the transformation block 261 and train the transformations from expressive linguistic space 255 to expressive synthesis space 259. For each sentence in the training data 251, the speech data is used to generate an expressive synthesis feature vector in synthesis feature space 259 and the transcription of the speech data is used to generate an expressive linguistic feature in linguistic feature space 255. Using the linguistic features of the training data as the input of NN and the synthesis features of the training data as the target output, the parameters of the NN can be updated to learn the mapping from linguistic feature space to synthesis feature space i.e. linguistic features containing text and speech style characteristics and changed values corresponding to the text input are extracted as feature vectors and then inputted into the neural network as to synthesize speech matching the features)),
The reasoned combination of Chen, as modified by the teachings of Kurz noted above thus makes obvious:
selecting at least a part of the outputted plurality of sentences (Para. 55, the system automatically selects a character to associate with each quotation based on the words of the text using a natural language process; furthermore, in the event that the computer system selects the incorrect character, the user can modify the character selection using one or more of techniques described herein i.e. selecting at least a part of the outputted plurality of sentences); 
outputting an interface for changing the speech style characteristic for the at least part of the selected plurality of sentences (Para. 55, the system automatically selects a character to associate with each quotation based on the words of the text using a natural language process; furthermore, in the event that the computer system selects the incorrect character, the user can modify the character selection using one or more of techniques described herein i.e. selecting at least a part of the outputted one or more role candidates where para. 42 indicates the menu for modification where the menu contains speech style characteristics to be selected i.e. outputting user interface for changing the speech style characteristic for the at least part of the selected plurality of sentences as can be seen through figure 5 in edit a cast member); and
changing a value indicative of the speech style characteristic for the at least part through the interface (Para. 42 indicates a sliding scale is presented and a user moves a slider on the sliding scale to indicate a relative increase or decrease in the volume of the narration by the corresponding character. In some additional examples, a drop down menu can include various volume options such as very soft, soft, normal, loud, very loud. The edit cast member window 136 also includes a portion 146 for selecting a reading speed for the character. The reading speed provides an average number of words per minute that the computer system will read at when the text is associated with the character. As such, the portion for selecting the reading speed modifies the speed at which the character reads. The edit cast member window 136 also includes a portion 138 for associating an image with the character. This image can be presented to the user when the user selects a portion of the text to associate with a character (e.g., as shown in FIG. 3). The edit cast member window 136 can also include an input for selecting the gender of the character (e.g., as shown in block 140) and an input for selecting the age of the character (e.g., as shown in block 142). Other attributes of the voice model can be modified in a similar manner),
Modifying Chen to include the features of Kurz discloses:
selecting at least a part of the outputted plurality of sentences (e.g. Chen’s method of generating synthetic speech through an interface now also including the feature selecting at least a part of the outputted plurality of sentences as taught by Kurz, see para. 55);
outputting an interface for changing the speech style characteristic for the at least part of the selected plurality of sentences (e.g. Chen’s method of generating synthetic speech through an interface now also including the feature outputting an interface for changing the speech style characteristic for the at least part of the selected plurality of sentences as taught by Kurz, see paras. 42 and 55); and
changing a value indicative of the speech style characteristic for the at least part through the interface (e.g. Chen’s method of generating synthetic speech through an interface now also including the feature of changing a value indicative of the speech style characteristic for the at least part through the interface as taught by Kurz, see para. 42).

Regarding claim 16, a computer program stored on a non-transitory computer-readable recording medium for executing, on a computer, a method for processing synthetic speech for text through a user interface according to claim 1 (Para. 52, implemented by software, some embodiments encompass computer code provided to a general purpose computer on any suitable carrier medium. The carrier medium can comprise any storage medium such as a floppy disk, a CD ROM, a magnetic device or a programmable memory device; examples of non-transitory computer-readable recording media given).

Claims 4 and 5 are rejected under 35 U.S.C. 103 as being unpatentable over Chen in
view of Kurz and further in view of Tanenblatt et al. (US Pat. No. 6,006,187) hereinafter Tan.
Regarding claim 4, Chen in view of Kurz teaches the method of claim 2 (see claim 2 above), in addition, Chen teaches: 
wherein the receiving the one or more sentences includes receiving a plurality of sentences (Para. 54, text input 15 may be a means for receiving text data from an external storage medium or a network; furthermore, para. 73 indicates It is assumed that each utterance in the training data 251 contains unique expressive information. This unique expressive information can be determined from the speech data and can be read from the transcription of the speech, i.e. the text data as well. In the training data, the speech sentences and text sentences are synchronized as shown in FIG. 5 i.e. text data received are sentences i.e. plural), 
However, Chen in view of Kurz fails to explicitly disclose:
the method further includes adding a visual representation indicative of characteristic of an effect to be inserted between the plurality of sentences, and
the synthetic speech includes a sound effect generated based on the characteristic of the effect included in the added visual representation.
In a related field of endeavor (e.g. text-to-speech with feature extraction, see abstract), Tan discloses the use of punctuation, depicted as element 24 in fig. 2, where a period represents a pause/silence within a sentence; furthermore, the duration of the pause/silence may be customized because the system is a TTS intonation editor, see lines 24-41 on col. 4.
Modifying Chen in view of Kurz to include the features of Tan discloses:
the method further includes adding a visual representation indicative of characteristic of an effect to be inserted between the plurality of sentences (e.g. Chen’s method of generating synthetic speech through an interface in view of Kurz now also including the feature where a visual representation is added indicative of an effect to be inserted between the plurality of sentences as taught by Tan, see lines 24-41 on col. 4), and 
the synthetic speech includes a sound effect generated based on the characteristic of the effect included in the added visual representation (e.g. Chen’s method of generating synthetic speech through an interface in view of Kurz now also including the feature where the synthetic speech includes a sound effect generated based on the characteristic in the added visual representation as taught by Tan, see lines 24-41 on col. 4).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to apply the teachings of Tan to the method of Chen in view of Kurz, given the similar field of endeavor (e.g. speech synthesizer with feature extraction); furthermore, doing so would have provided the users of Chen in view of Kurz, with the added benefits of providing a visual feel to the computer prosody user interface as recognized by Tan, see abstract. Furthermore, lines 16-30 on col. 2 indicates it provides a visual "feel" regarding the speech parameters being set or assigned by a user. In one embodiment, the presentation means are redimensionable to correspond to the speech parameters set using the speech parameter manipulation means. Preferably, the horizontal and vertical dimensions of the presentation means correspond to the speaking rate relative word duration dimension set by the duration control means and the word prominence set by the prominence control means, respectively. Additionally, the accent means and the phrase contour means are preferably visually coordinated with the presentation means--that is, assigning an accent or a phrase contour to a word, punctuation or text will cause a visual change to the corresponding presentation means. A person of ordinary skill in the art would recognize the teachings as allowing for customization in the synthesizer as silences may be customized.

Regarding claim 5, Chen in view of Kurz and Tan teaches the method of claim 4 (see claim 4 above), using the motivation for combination stated in dependent claim 4, Tan also teaches: 
wherein the effect to be inserted between the plurality of sentences includes a silence (lines 16-30 on col. 2, In the text as seen through fig. 2 there is a period which is a representation of a pause/silence), and
the adding the visual representation indicative of the characteristic of the effect to be inserted between the plurality of sentences includes adding a visual representation indicative of a time of the silence to be inserted between the plurality of sentences (lines 16-30 on col. 2, Each word and punctuation of the text is presented within its own word box 24. To modify the speaking rate relative word duration and/or word prominence of a word or punctuation, the user must first select one or more words or punctuations to modify by clicking on the appropriate word boxes with the computer mouse preferably causing the word boxes to be highlighted i.e. this may be seen in fig. 2 with the border box for 24 indicating the period’s time of silence to be inserted and may be adjusted by elements 28b and 28a for duration of the visual representation).
The reasoned combination of Chen, as modified by the teachings of Tan noted above thus makes obvious:
wherein the effect to be inserted between the plurality of sentences includes a silence (e.g. Chen’s method of generating synthetic speech through an interface now also including the feature wherein the effect to be inserted between the plurality of sentences includes a silence as taught by Tan, see lines 16-30 on col. 2, fig. 2), and
the adding the visual representation indicative of the characteristic of the effect to be inserted between the plurality of sentences includes adding a visual representation indicative of a time of the silence to be inserted between the plurality of sentences (e.g. Chen’s method of generating synthetic speech through an interface now also including the feature where the adding the visual representation indicative of the characteristic of the effect to be inserted between the plurality of sentences includes adding a visual representation indicative of the time of the silence to be inserted between the plurality of sentences taught by Tan, see lines 16-30 on col. 2, fig. 2).




Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Chen in
view of Kurz and further in view of Pore et al. (US Pub. No. 2019/0295527 A1) hereinafter Pore.
Regarding claim 10, Chen in view of Kurz teaches the method of claim 1 (see claim 1 above);
However, Chen in view of Kurz fails to explicitly disclose:
wherein an inspection result is generated by inspecting the synthetic speech for the plurality of sentences, and
the method further includes changing the speech style characteristic applied to the synthetic speech based on the inspection result.
In a related field of endeavor (e.g. text to speech, see abstract), Pore teaches that typographic errors due to phonemic spellings can be corrected by identifying and utilizing appropriate text-to-speech and then speech-to-text algorithms. If a word is unknown in the language in which the text-to-speech code is being run, most algorithms default to reading that word phonemically. In the audio file, the word now sounds the same as a known word in the given language. When the word is converted back to text using speech-to-text, it is transcribed with the correct spelling, and can be put through further natural language processing (NLP) systems to extract higher level features from the message such as key topic, sentiment, and other features to confirm or verify the spelling. Identifying the likely L1 language of the author, or the origin of a language, the system can be optimized by choosing the most suitable accent for text-to-speech and speech-to-text application programming interfaces (APIs) manually or automatically i.e. the process of converting it back to text using speech-to-text is the inspection; this may lead to natural language processing (NLP) systems to extract higher level features from the message such as key topic, sentiment, and other features to confirm or verify the spelling. Identifying the likely L1 language of the author, or the origin of a language, the system can be optimized by choosing the most suitable accent for text-to-speech and speech-to-text application programming interfaces (APIs) manually or automatically i.e. verifying and making necessary changes to correct the processing as the most suitable action is taken, see para. 16.
Modifying Chen in view of Kurz to use the techniques disclosed by Pore discloses:
wherein an inspection result is generated by inspecting the synthetic speech for the plurality of sentences (e.g. Chen’s method of generating synthetic speech through an interface in view of Kurz now also including the feature wherein the synthetic speech is inspected as taught by Pore, see para. 16), and
the method further includes changing the speech style characteristic applied to the synthetic speech based on the inspection result (e.g. Chen’s method of generating synthetic speech through an interface in view of Kurz now also including the feature changing the speech style characteristic applied to the synthetic speech based on the inspection result as taught by Pore, see para. 16). 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to apply the teachings of Pore to the method of Chen in view of Kurz, given the similar field of endeavor (e.g. speech synthesizer); furthermore, doing so would have provided the users of Chen in view of Kurz, with the added benefits of correcting typographic errors due to phonemic spellings and/or choosing suitable actions to correct mistakes as acknowledged by Pore, see para. 16. Furthermore, para. 25 indicates Using the speech-to-text and text-to-speech APIs that are built to recognize particular accents would improve the accuracy of the conversion of the texts to include the correct spelling. Furthermore, it may prove to be useful in other services using synthesizers as recognized by para. 17, being able to automatically correct such transcribed messages would allow for better use of those data streams by automated systems such as automatic cataloging or categorizing systems that can trigger automatic follow up actions. As other examples, the methodology of the present disclosure may be useful in services such as chatbots that interpret messages and interact with users, computer-implemented language interpreters or processors that review comments sections, surveys, and review sites.

Claims 13 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Chen in
view of Kurz and further in view of Yang et al. (US Pub. No. 2021/0366462 A1) hereinafter Yang.
Regarding claim 13, Chen in view of Kurz teaches the method of claim 11 (see claim 11 above);
However, Chen in view of Kurz fails to explicitly disclose:
further comprising, in response to a request to stream the generated audio content, playing back the generated audio content in real time.
In a related field of endeavor (e.g. classification in text-to-speech methods and device, see abstract), Yang discloses transmitting device 12 and the at least one receiving device 14 may further include slate PCs 22 and 32, a tablet PC, laptop computers 23 and 33, etc. The slate PCs 22 and 32 and the laptop computers 23 and 33 may be connected to the at least one network system 16 via wireless access points 25, see para. 54, where para. 79 indicates and may transmit the information in the voice form to the client device 50 i.e. download where it is a stream requested by the user as by using the text-to-speech service; furthermore, para. 94 indicates since the ASR module 73, the NLU module 75, and the TTS module 76 are included in the client device 70, communication with Cloud may not be necessary for a speech processing procedure such as speech recognition, speech synthesis, and the like, and thus, an instant real-time speech processing operation is possible.
Modifying Chen in view of Kurz to include the features disclosed by Yang discloses:
further comprising, in response to a request to stream the generated audio content, playing back the generated audio content in real time (e.g. Chen’s method of generating synthetic speech through an interface in view of Kurz now also including the feature of in response to stream the generated audio content i.e. output through transmission of data, playing back the generated audio content in real time as taught by Yang, see para. 54, 79, and 94). 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to apply the teachings of Yang to the method of Chen in view of Kurz, given the similar field of endeavor (e.g. speech synthesizer with classification); furthermore, doing so would have provided the users of Chen with the added benefits of providing results in real-time as a person of ordinary skill in the art would recognize from the teachings of Yang, see para. 94, real-time meaning faster processing times and providing results in real-time to the user. Furthermore, para. 134 indicates processing can be performed at a high speed.

Regarding claim 14, Chen in view of Kurz teaches the method of claim 11 (see claim 11 above);
However, Chen in view of Kurz fails to explicitly disclose:
further comprising mixing the generated audio content with a video content.
In a related field of endeavor (e.g. classification in text-to-speech methods and device, see abstract), Yang discloses a text-to-speech (TTS) method according to an embodiment of the present invention may be applied in various patterns. That is, the TTS method according to an embodiment of the present invention may be applied in various ways in addition to a case where a speech is synthesized by carrying emotion in a received message, see para. 228. For example, para. 230 indicates multimedia contents (movies, drama, animation dubbing, etc.) conventionally output script lines with synthesized speeches of the same tone, but, if an embodiment of the present invention is applied, speeches synthesized with various types of emotion according to a script line and a situation, and thus, a diversity of user immersive content experience may be provided i.e. video content mixed with the audio content from the speech synthesizer. 
Modifying Chen in view of Kurz to include the features disclosed by Yang discloses:
further comprising mixing the generated audio content with a video content (e.g. Chen’s method of generating synthetic speech through an interface in view of Kurz now also including the feature of further comprising mixing the generated audio content with a video content as taught by Yang, see para. 228 and 230).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to apply the teachings of Yang to the method of Chen in view of Kurz, given the similar field of endeavor (e.g. speech synthesizer with classification); furthermore, doing so would have provided the users of Chen in view of Kurz, with the added benefits of providing a diversity of immersive content to be experienced by the user as recognized by Yang, see para. 230. As another example recognized by Yang, see para. 231, Navigation devices provide video i.e. moving map according to the vehicle’s position; furthermore, it recognizes that diversity tones are spoken according to a driving situation, and thus, it is possible to appropriately call attention to situations such as distracted driving and alert occurrence.

Claims 17-18 are rejected under 35 U.S.C. 103 as being unpatentable over Chen
in view of Kurz and further in view of Mahyar (US Pat. No. 10,930,263).
Regarding claim 17, Chen in view of Kurz teaches the method of claim 1 (see claim 1 above);
However, Chen in view of Kurz fails to explicitly disclose:
wherein determining the speech style characteristic comprises determining an embedding vector indicative of the speech style characteristic of a single role, and
wherein synthetic speech for the plurality of sentences is generated using the embedding vector.
In a related field of endeavor (e.g. speech synthesis, see abstract), Mahyar teaches that the set of speaker characteristics, attributes, or patterns, also referred to as a speaker vector or speaker embedding (which may also include information such as I-vectors, D-vectors, etc.), also includes features learned by the neural network structure, such as features represented by the hidden states, which may have no analogous parameter in the domain of speech analysis by humans. For example, a neural network may learn to represent speech characteristics in the training data based on features unrelated to prosodic analysis of words, phonemes, etc., as used in speech analysis by humans, see lines 21-31 on col. 10. The determination occurs by the neural network learning to represent the speech characteristics as a speaker embedding, where an embedding is representative of a single role. Furthermore, Mahyar describes that the automatic vocalization system 100 generates a representation of the received text for facilitating input to the previously trained voice synthesis models; as such, the synthetic speech for the plurality of sentences is generated using the vector representation as the embedding vectors for the role representing the target speaker, see lines 40-59 on col. 10.
Modifying Chen in view of Kurz to include the features disclosed by Mahyar discloses:
wherein determining the speech style characteristic comprises determining an embedding vector indicative of the speech style characteristic of a single role (e.g. Chen’s method of generating synthetic speech through an interface in view of Kurz now also including the feature of wherein determining the speech style characteristic comprises determining an embedding vector indicative of the speech style characteristic of a single role as taught by Mahyar, see lines 21-31 on col. 10), and
wherein synthetic speech for the plurality of sentences is generated using the embedding vector (e.g. Chen’s method of generating synthetic speech through an interface in view of Kurz now also including the feature of wherein synthetic speech for the plurality of sentences is generated using the embedding vector as taught by Mahyar, see lines 40-59 on col. 10).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to apply the teachings of Mahyar to the method of Chen in view of Kurz, given the similar field of endeavor (e.g. speech synthesizer); furthermore, doing so would have provided the users of Chen in view of Kurz with the added benefit that the disclosed techniques improve the technology of automatic media content localization by enabling generation of localized voices with greater similarity to the actor's or actresses' speaking characteristics, and/or in a shorter amount of time (e.g., on the timescale of hours, rather than weeks/months) compared to video dubbing using voice actors or actresses speaking in the targeted localization language. The disclosed techniques also improve the technology of streaming media content by allowing more rapid introduction of media content to different geographic regions while also improving the customer experience, as taught by Mahyar, see col. 1, line 60 to col. 2, line 4. Furthermore, the generated representation of the speaker, which uses speaker embeddings for a role, can reduce the computation time for an iterative process that minimizes the error function associated with a comparison of predicted audio output by the voice model with speech samples of the target speaker, as taught by Mahyar, see lines 59-66 on col. 7.

Regarding claim 18, Chen in view of Kurz teaches the method of claim 1 (see claim 1 above). Using the motivation for combination stated for independent claim 1, Kurz also teaches:
displaying a list of roles on the user interface (Para. 33, FIG. 3 also shows a menu 55 used for selection of portions of a text to be read using different voice models);
receiving a selection of a single role among the list of roles from a user through the user interface (Para. 33, FIG. 3 also shows a menu 55 used for selection of portions of a text to be read using different voice models. A user selects a portion of the text by using an input device such as a keyboard or mouse to select a portion of the text, or, on devices with a touchscreen, a finger or stylus pointing device may be used to select text. Once the user has selected a portion of the text, a drop down menu 55 is generated that provides a list of the different available characters (e.g., characters 56, 58, and 60) that can be used for the narration); and
However, Chen in view of Kurz fails to explicitly disclose:
determining an embedding vector indicative of the speech style characteristic of the single role.
In a related field of endeavor (e.g. speech synthesis, see abstract), Mahyar teaches that the set of speaker characteristics, attributes, or patterns, also referred to as a speaker vector or speaker embedding (which may also include information such as I-vectors, D-vectors, etc.), also includes features learned by the neural network structure, such as features represented by the hidden states, which may have no analogous parameter in the domain of speech analysis by humans. For example, a neural network may learn to represent speech characteristics in the training data based on features unrelated to prosodic analysis of words, phonemes, etc., as used in speech analysis by humans, see lines 21-31 on col. 10. The determination occurs by the neural network learning to represent the speech characteristics as a speaker embedding, where an embedding is representative of a single role.
Modifying Chen in view of Kurz to include the features disclosed by Mahyar discloses:
determining an embedding vector indicative of the speech style characteristic of the single role (e.g. Chen’s method of generating synthetic speech through an interface in view of Kurz now also including the feature of determining an embedding vector indicative of the speech style characteristic of the single role as taught by Mahyar, see lines 21-31 on col. 10).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to apply the teachings of Mahyar to the method of Chen in view of Kurz, given the similar field of endeavor (e.g. speech synthesizer); furthermore, doing so would have provided the users of Chen in view of Kurz with the added benefit that the disclosed techniques improve the technology of automatic media content localization by enabling generation of localized voices with greater similarity to the actor's or actresses' speaking characteristics, and/or in a shorter amount of time (e.g., on the timescale of hours, rather than weeks/months) compared to video dubbing using voice actors or actresses speaking in the targeted localization language. The disclosed techniques also improve the technology of streaming media content by allowing more rapid introduction of media content to different geographic regions while also improving the customer experience, as taught by Mahyar, see col. 1, line 60 to col. 2, line 4. Furthermore, the generated representation of the speaker, which uses speaker embeddings for a role, can reduce the computation time for an iterative process that minimizes the error function associated with a comparison of predicted audio output by the voice model with speech samples of the target speaker, as taught by Mahyar, see lines 59-66 on col. 7.


Claim 19 is rejected under 35 U.S.C. 103 as being unpatentable over Chen in view of Kurz and further in view of S. Yang, Z. Wu and L. Xie, "On the training of DNN-based average voice model for speech synthesis," 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2016, pp. 1-6, doi: 10.1109/APSIPA.2016.7820818, hereinafter Yang.
Regarding claim 19,  Chen in view of Kurz teaches the method of claim 1 (see claim 1 above);
However, Chen in view of Kurz fails to explicitly disclose:
receiving a speaker ID vector; and
obtaining an embedding vector indicative of the speech style characteristic of a single role based on the speaker ID vector, 
wherein synthetic speech for the plurality of sentences is generated using the embedding vector.
In a related field of endeavor (e.g. speech synthesis, see lines 1-3 in section I. Introduction), Yang teaches that the i-vector is a low-dimensional vector representing speaker individuality and has been widely used in speaker recognition [19], see lines 19-21 on pg. 2; furthermore, to make the i-vector more robust and compact, linear discriminant analysis (LDA) [20] is usually adopted, see lines 25-27 on pg. 2. Moreover, the i-vector, which represents speaker identity [21], is used to control the network to produce the speaker’s voice. The framework is presented in Figure 1. In the framework, an i-vector and a gender code are appended with the speaker-independent linguistic features as the network input. The i-vector and gender code are used as speaker-dependent features to discriminate among different speakers, see lines 3-9 of section II. Average Voice Model Training on pg. 2. As such, the i-vector, which is representative of the speaker ID, is received, and the speaker embeddings are the speaker-dependent linguistic features as represented by the i-vector, gender code, and linguistic features (see figure 1), which are able to create a more robust and compact embedding vector representative of speech style characteristics per the linguistic features of the single role, as they are discriminated among different speakers. Furthermore, Yang states that, with the help of the speaker identity vector, the speech synthesis performance might be improved, see lines 27-29 on pg. 2, as the synthesized speech for the text is generated using the embedding vector; furthermore, the conclusion section on pg. 5 indicates that a systematic analysis of the multi-speaker average voice model for DNN-based speech synthesis was performed with the speaker embeddings.
Modifying Chen in view of Kurz to include the features disclosed by Yang discloses:
receiving a speaker ID vector (e.g. Chen’s method of generating synthetic speech through an interface in view of Kurz now also including the feature of receiving a speaker ID vector as taught by Yang, see lines 19-21 on pg. 2 in relation to figure 1); and
obtaining an embedding vector indicative of the speech style characteristic of a single role based on the speaker ID vector (e.g. Chen’s method of generating synthetic speech through an interface in view of Kurz now also including the feature of obtaining an embedding vector indicative of the speech style characteristic of a single role based on the speaker ID vector, wherein synthetic speech for the plurality of sentences is generated using the embedding vector as taught by Yang, see lines 25-27 on pg. 2 and lines 3-9 on section II. Average Voice Model Training pg. 2 in relation to figure 1),
wherein synthetic speech for the plurality of sentences is generated using the embedding vector (e.g. Chen’s method of generating synthetic speech through an interface in view of Kurz now also including the feature of wherein synthetic speech for the plurality of sentences is generated using the embedding vector as taught by Yang, see lines 27-29 on pg. 2 and conclusion section, pg. 5).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to apply the teachings of Yang to the method of Chen in view of Kurz, given the similar field of endeavor (e.g. speech synthesizer); furthermore, doing so would have provided the users of Chen in view of Kurz with the added benefit that both naturalness and similarity are significantly increased, and the subjective results are consistent with the objective results, thereby improving speech synthesis results as taught by Yang, see lines 36-42 on pg. 5.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure.
Morita (US Pub. No. 2020/0066250 A1) teaches that the speech synthesizing unit 10 receives input of text information, and generates a speech waveform of the synthetic speech using various models and rules stored in the speech synthesis model storing unit 20. At that time, if a speaker parameter value representing the values of the parameters related to the speaker individuality is also input from the speaker parameter control unit 40, then the speech synthesizing unit 10 generates a speech waveform while controlling the speaker individuality according to the input speaker parameter value. The speaker individuality represents the features of the voice unique to the speaker and, for example, has a plurality of factors such as age, brightness, hardness, and clarity. The speaker parameter value represents the set of values corresponding to such factors of the speaker individuality, see para. 18. The display/input control unit 30 visualizes and displays the speaker parameter value that is set in the speaker parameter control unit 40, and provides to the users a user interface that enables the users to change/input the parameter values of the speaker parameter value. When a user makes use of the user interface and changes/inputs the speaker parameter value, the display/input control unit 30 sends the speaker parameter value corresponding to the user operation to the speaker parameter control unit 40, see para. 32.

Mairano et al. (US Pub. No. 2017/0186418 A1), hereinafter Mairano, teaches that a text-to-speech (TTS) system includes components capable of supporting the generation of speech output in any of multiple styles, and may switch seamlessly from producing speech output in one style to producing speech output in another style. For example, a concatenative TTS system may include a speech base storing speech units associated with multiple speech styles, and a linguistic analysis component to generate a phonetic transcription specifying speech output in any of multiple styles. Text input may include a style indication associated with a particular segment of the input text. The linguistic analysis component may invoke encoded rules and/or components based upon the style indication, and generate a phonetic transcription specifying a speech style, which may be processed to generate output speech, see abstract. Furthermore, para. 32 indicates that, in representative system 300, speech base 325 stores speech units of multiple styles, with each speech unit having a particular style indication (e.g., a tag, such as a markup tag, or any other suitable indication). For example, demiphones from joyful recordings may each have an associated joyful style indication, demiphones from didactic recordings may each have an associated didactic style indication, demiphones from neutral recordings may each have an associated neutral style indication, and so on. The phonetic transcript may also specify pauses through punctuation, see para. 49.

Jeong (KR 20200056261 A) teaches an electronic device capable of implementing a more natural dialogue system. The electronic device of the present invention comprises: a memory including at least one command; and a processor executing the at least one command. When a text sentence is inputted, the processor obtains prosody information of the text sentence, divides the text sentence into a plurality of sentence elements, inputs the plurality of sentence elements and the prosody information of the text sentence into a text to speech (TTS) module to obtain a voice in which prosody information of each of the plurality of sentence elements in parallel is reflected, and merges the voice for the plurality of sentence elements obtained in parallel to obtain a voice for the text sentence, see abstract. Specifically, Jeong teaches the use of artificial neural networks, and the processor 120 may select a voice of speech based on feature information corresponding to a text sentence. At this time, the feature information corresponding to the text sentence may include emotion information and information about the speaker (e.g., gender, specific person, etc.). That is, even for a voice of the same length, a different voice may be selected based on the emotion information and the information on the speaker. After completing the output of the identified voice, the processor 120 synthesizes the text sentence.

Killalea et al. (US Pat. No. 8,150,695 B1) discloses that a method is provided for presenting a written work. A character identity is recognized within a written work. Presentation information for the written work, such as a graphical scheme or an electronic voice, is determined based on the character identity. The presentation information is provided to a user computing device. The user computing device renders the written work or a portion thereof using the presentation information, see abstract. Specifically, at lines 20-30 on col. 4, the processor 205 is also programmed to obtain presentation information for the identified portions. Presentation information may be obtained from within the written work, from elsewhere in the memory area 210, or via the communication interface 225 (e.g., from a remote device such as the server computing device 107). In some embodiments, the user 101 selects presentation information using the input device 220. The selected presentation information is associated with the written work, with a character identity, and/or with the identified portions and stored in the memory area 210.

Kaszczuk et al. (US Pub. No. 2014/0122079 A1) discloses that features are disclosed for generating text-to-speech (TTS) audio programs from textual content received from multiple sources. A TTS system may assemble an audio program from several individual audio presentations of user-selected network-accessible content. Users may configure the TTS system to retrieve personal content as well as publicly accessible content. The audio program may include segues, introductions, summaries, and the like. Voices may be selected for individual content items based on user selections or on characteristics of the content or content source, see abstract.

Luan et al. (US Pub. No. 2015/0243275 A1) discloses that multi-voice font interpolation is provided. A multi-voice font interpolation engine allows the production of computer generated speech with a wide variety of speaker characteristics and/or prosody by interpolating speaker characteristics and prosody from existing fonts. Using prediction models from multiple voice fonts, the multi-voice font interpolation engine predicts values for the parameters that influence speaker characteristics and/or prosody for the phoneme sequence obtained from the text to be spoken. For each parameter, additional parameter values are generated by a weighted interpolation from the predicted values. Modifying an existing voice font with the interpolated parameters changes the style and/or emotion of the speech while retaining the base sound qualities of the original voice. The multi-voice font interpolation engine allows the speaker characteristics and/or prosody to be transplanted from one voice font to another or entirely new speaker characteristics and/or prosody to be generated for an existing voice font, see abstract.

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to JONATHAN E AMAYA HERNANDEZ whose telephone number is (571)272-2484. The examiner can normally be reached Monday - Friday 9:30 am - 5:30 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Flanders, can be reached at 571-272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/J.E.A./ Examiner, Art Unit 2655
/JONATHAN C KIM/Primary Examiner, Art Unit 2655