Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Claims 1-20 are pending. Claims 1, 13, and 18 are independent and have been amended.  Claims 6, 17, and 20 are amended to overcome an objection.
This Application is published as 2022-0199086.
Apparent priority: 12/22/2020.

Applicant’s amendments and arguments are considered but are either unpersuasive or moot in view of the new grounds of rejection.

The independent claims have two aspects: (1) cross-lingual speech synthesis, which requires translating the speech of an original speaker from a source language to a target language and synthesizing the translated speech in the voice of the original speaker; and (2) cataloging/indexing the emotion of the original speaker when he speaks in the source language.  The claims do not state that the emotion of the original speaker is reflected in the synthesized speech.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 8/5/2022 has been entered.
Response to Amendments
Objection to Claims 6, 17, and 20 is withdrawn in view of the amendments.
Response to Arguments

	Note that Tischer continues to teach the added language. In addition to the Claim language, the supporting Specification of the instant Application (See Figure 7 and “[0123] …where the custom phoneme includes the performer's pronunciation of a sound in the translated word …”) is very similar to the operation of Tischer.  Note the modified rejection and Figures 3, 4, and 7 of Tischer, which show various Tables / Indexes / Data Structures that include and correlate the information required by the added language.  However, because at the time of the interview Examiner stated that neither the primary nor the secondary reference taught the suggested language, a reference is added.
Applicant’s arguments are directed to the material added by amendment, which is subject to new grounds of rejection.
1. A computer-implemented method comprising:  
converting, by a processor, an original audio signal to an original text string,
wherein the original audio signal is from a recording of the original text string being spoken by a specific person, and 
wherein the original text string includes a word in a source language; 
generating, by the processor, a translated text string by translating the original text string from the source language to a target language, 
wherein the translated text string includes, as a translated word, a translation of the word from the source language to a target language; 
assembling, by the processor, a standard phoneme sequence from a set of standard phonemes, wherein the standard phoneme sequence includes a standard pronunciation of the translated word; 
associating, by the processor, a custom phoneme with a standard phoneme of the standard phoneme sequence, wherein the custom phoneme includes the specific person's pronunciation of a sound in the translated word; and 
synthesizing, by the processor, the translated text string to a translated audio signal that includes the translated word pronounced using the custom phoneme; 
storing the translated audio signal in a data structure associated with a voice identifier that combines a performer identifier and an emotional state, wherein the performer identifier is indicative of the specific person and wherein the emotional state is indicative of an emotion expressed by the specific person in the original recording; and
indexing the original text string, wherein the indexing comprises using the voice identifier to associate the original text with the specific person.

	The claimed index would have the following structure:
610 | 604 | 608 | 616
Voice ID | Specific Person (Performer ID) | Emotional State (of Specific Person who has Performer ID) | Original Text (obtained from original speech of the Specific Person)
PETER/HAPPY | PETER | HAPPY | CAT


This structure corresponds to the following columns of Table 600 of Figure 6:
610 | 604 | 608 | 616
PETER/HAPPY | PETER | HAPPY | CAT
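
For illustration only, the claimed index may be sketched in Python as shown here. This is a minimal sketch under stated assumptions, not the Applicant's implementation: the names make_voice_id, Record, and VoiceIndex are hypothetical, and the "PERFORMER/EMOTION" format of the Voice ID is assumed from the PETER/HAPPY example, since the Specification only states that the Voice ID "combines" the two fields.

```python
from collections import defaultdict
from dataclasses import dataclass


def make_voice_id(performer_id: str, emotional_state: str) -> str:
    # Assumed format: the Specification only says the Voice ID (610)
    # "combines" the Performer ID (604) and the Emotional State (608).
    return f"{performer_id}/{emotional_state}"


@dataclass
class Record:
    performer_id: str               # field 604
    emotional_state: str            # field 608
    original_text: str              # field 616
    translated_audio: bytes = b""   # placeholder for the synthesized audio

    @property
    def voice_id(self) -> str:      # field 610
        return make_voice_id(self.performer_id, self.emotional_state)


class VoiceIndex:
    """Hypothetical index keyed by Voice ID."""

    def __init__(self) -> None:
        self._by_voice_id = defaultdict(list)

    def add(self, record: Record) -> None:
        # Store the record under its Voice ID.
        self._by_voice_id[record.voice_id].append(record)

    def texts_for(self, voice_id: str) -> list:
        # Return the original text strings associated with a Voice ID.
        return [r.original_text for r in self._by_voice_id.get(voice_id, [])]


index = VoiceIndex()
index.add(Record("PETER", "HAPPY", "CAT"))
```

In this sketch, looking up the Voice ID PETER/HAPPY returns the original text CAT, which is the association between the original text and the specific person that the last limitation recites.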


	Specification:
[0043] In some embodiments, the process converts the identified speech into text, for example into text subtitles. In some such embodiments, the process indexes the converted subtitle texts associated with the identified performers so as to be associated with the Voice ID of the recording from the original audio file. In some such embodiments, the process translates each indexed segment of the text associated with each identified performer, resulting in a translated text file in the target language. In some embodiments, the process extracts each identified performer's vocalic characteristics. …
“[0093] … As a non-limiting example, a data structure includes fields such as Video File ID, Performer ID, Script ID, Emotional State, Voice ID, First Language, Original Audio, Original Text, Secondary Language, Translated Text, Synthesized Audio, Started Time Stamp, and Ended Time Stamp….”  “[0106] …. The Emotional State 608 identifies an emotion expressed by the performer in the video file indicated by the Video File ID 602. The Voice ID 610 combines the performer indicated by the Performer ID 604 with the emotion expressed by the performer according to the Emotional State 608. ….”
[0123] In an embodiment, at block 714, the process performs text indexing. For example, in some embodiments, the process indexes the converted text file to be associated with the identified performer. In some embodiments, at block 716, the process performs a customized para-dubbing synthesis process that uses the indexed text and generates translated audio that includes one or more audio segments recorded by identified speaker as phonemes for constructing the synthesized translation audio output. For example, in some embodiments, the process assembles a standard phoneme sequence from a set of standard phonemes, where the standard phoneme sequence includes a standard pronunciation of the translated word, and associates a custom phoneme with a standard phoneme of the standard phoneme sequence, where the custom phoneme includes the performer's pronunciation of a sound in the translated word. Then, at block 718, the process outputs synthesized voice speaking the translated audio.
	“[0004] … The embodiment also includes assembling, by the processor, a standard phoneme sequence from a set of standard phonemes, wherein the standard phoneme sequence includes a standard pronunciation of the translated word. The embodiment also includes associating, by the processor, a custom phoneme with a standard phoneme of the standard phoneme sequence, wherein the custom phoneme includes the specific person's pronunciation of a sound in the translated word….”
Standard Phonemes:  In the target language: speaker-independent phonemes.
Custom Phonemes: In the target language: speaker-specific phonemes.
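
For illustration only, the relationship between standard and custom phonemes under the above interpretation may be sketched as follows. The phoneme labels, the "PETER:" prefix, and the function names associate and render are hypothetical and are not taken from the Specification or the references; the sketch merely assumes that a speaker-specific phoneme, when available, replaces the corresponding speaker-independent phoneme.

```python
def associate(standard_sequence, custom_phonemes):
    """Pair each standard phoneme with the speaker's custom phoneme, if one exists."""
    return [(p, custom_phonemes.get(p)) for p in standard_sequence]


def render(associated):
    """Use the custom phoneme where one is associated, else fall back to the standard one."""
    return [custom if custom is not None else standard
            for standard, custom in associated]


# Hypothetical standard (speaker-independent) phoneme sequence of a translated word.
standard_sequence = ["O", "L", "A"]

# Hypothetical custom (speaker-specific) phonemes for a performer named PETER.
custom_phonemes = {"O": "PETER:O", "A": "PETER:A"}

pairs = associate(standard_sequence, custom_phonemes)
output_sequence = render(pairs)
```

Here the phoneme "L" has no custom counterpart, so the sketch keeps the standard phoneme for that position.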

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1, 6-8, 13, 17-18 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Chen (U.S. 2010/0198577) in view of Tischer (U.S. 2006/0069567) and Isobe (U.S. 2011/0093272).


Regarding Claim 1, Chen teaches: 
1. A computer-implemented method comprising:  [Chen, Figure 11, “Processor 1102” and “Memory 1108.”]
converting, by a processor, an original audio signal to an original text string, wherein the original audio signal is from a recording of the original text string being spoken by a specific person, and wherein the original text string includes a word in a source language; [Chen, Figure 1, the word “Hello 108” in the Source Language = English is spoken by the “Speaker 102” and converted to text by the “Speech Recognition Module 1110” of Figure 11.  The input speech may be from a recording:  “[0031] FIG. 1 is a schematic diagram of speaker adaptation in an illustrative translation environment 100. A human speaker 102, or a recording or device reproducing human speech, is shown with a translation computer system using speaker adaptation with HMM state mapping 104 and a listener 106. Human speaker 102 produces speech 108 saying the word "Hello." The speaker's voice speaking the language of the speaker (LS) (in this example, English) (VSLS) 110 is input into the translation computer system 104 via an input device, such as the microphone depicted here….”]
generating, by the processor, a translated text string by translating the original text string from the source language to a target language, wherein the translated text string includes, as a translated word, a translation of the word from the source language to a target language; [Chen, Figures 1 or 11:  “Hola 112.”  “[0031] …After processing in the translation computer system 104, the translated word "Hola" is output 112 in Listener language LL, Spanish in this example. This output 112 is presented to listener 106 via an output device, such as the speaker depicted here. The output comprises synthesized voice output of the human speaker 102 uttering the listener's 106 language (VOLL). Thus, the listener 106 appears to hear the speaker 102 speaking the listener's language.”]
assembling, by the processor, a standard phoneme sequence from a set of standard phonemes, wherein the standard phoneme sequence includes a standard pronunciation of the translated word; [Chen, this means the pronunciation of the Spanish word “Hola,” in standard and speaker-agnostic phonemes of the Spanish language. Figure 5 shows sets of speaker language phonemes (English phonemes) and listener language phonemes (Spanish phonemes).  Figure 7 shows the Spanish word “Hola” in the phonemes of Spanish language: “LL phonemes 710” which means Listener Language phonemes.  In Figure 12, this corresponds to VALL Model 1212: Speaker Irrelevant Listener Language.  “[0064] At 1212, a HMM model of the voice of the auxiliary speaker speaking the language of the listener (VALL) is shown with VALL HMM states 1214….”]
associating, by the processor, a custom phoneme with a standard phoneme of the standard phoneme sequence, wherein the custom phoneme includes the specific person's pronunciation of a sound in the translated word; and [Chen, Figures 1, 11, and 12: the translated word “Hola” is output in Spanish but with the voice/ “specific person’s pronunciation” of the “speaker 102” who initially spoke the word in English.  Figure 11, “state mapping 1118” in the “speaker adaptation module 1114” causes the phonemes that are adapted to the voice of the speaker to be used in the “speech synthesis module 1120.”  Figure 12 shows Cross-Lingual Speaker Adaptation at 1224.  Figure 13:  “Synthesize speaker’s voice speaking listener’s language (VOLL) 1332.”]
synthesizing, by the processor, the translated text string to a translated audio signal that includes the translated word pronounced using the custom phoneme; [Chen, Figure 11, “Speech Synthesis Module 1120” generating “Hola (VoLL) 112” and outputting it from the speaker 1122.  Figure 13:  “Synthesize speaker’s voice speaking listener’s language (VoLL) 1332.”  “[0071] At 1332, the speaker's voice speaking the listener's language is synthesized (VOLL) using the TLL and VLLL model of 1330 which was modified by VSLS….”]
storing the translated audio signal in a data structure associated with a voice identifier that combines a performer identifier and an emotional state, wherein the performer identifier is indicative of the specific person and wherein the emotional state is indicative of an emotion expressed by the specific person in the original recording; and
indexing the original text string, wherein the indexing comprises using the voice identifier to associate the original text with the specific person.


Chen teaches phoneme mapping using samples of the speech of the speaker 102. Any emotion the speaker 102 might have had while producing the speech samples would be reflected in the sample prosody.  However, Chen does not expressly teach that the emotion of speaker 102 is recorded together with his speech samples.
Tischer teaches:
storing the translated audio signal in a data structure associated with a voice identifier that combines a performer identifier and an emotional state, wherein the performer identifier is indicative of the specific person and [Tischer, Figure 3 shows a “data structure” in which “Speech Samples A, B, …, n”/ “Translated Audio Signals” are associated with a “Known Speaker X” / “Specific Person” with identifier “X”/ “Performer Identifier” and stored.  The “Speech Samples” /”Translated Audio Signals” correspond to and include the “Phonemes” that are needed for speech synthesis.  [0044].  The different “Speech Samples” include different “emotions” of the “Speaker X” when he speaks the “Speech Sample.”  [0044] and [0059].  “[0045] As an example, FIG. 3 shows a voice file 101. The voice file 101 comprises speech samples A, B, . . . n of known speaker X (100). Speech samples A, B, . . . n are recorded using a conventional audio input interface 501. Speech sample A (110) comprises sounds A1, A2, A3, . . . An (111), which are recorded from sample words read by speaker X (100) from a pronouncing dictionary. Sounds A1, A2, A3, . . . . An (111) are correlated with phonemes A1, A2, A3, . . . . An (112), respectively. Each of phonemes A1, A2, A3, . . . An (112) is further assigned a standardized identifier A1, A2, A3, . . . An (113), respectively.”]  
wherein the emotional state is indicative of an emotion expressed by the specific person in the original recording; and [Tischer.  In Figure 3 each of the A1, A2, A3, …An is the same Sound/Phoneme except with some variation that can be due to different emotions.  The recordation of emotions in voice files for each speaker is shown in Figure 2.  Additionally, using the voice file of a speaker in a source language in the translated speech in the target language is taught at [0100].  Figure 2, “Record speech samples of plurality of speakers 20” to “organize speech samples and audio representations into separate collection for each speaker 40.”  “[0044] As shown in the embodiment in FIG. 2, sounds from speech samples and correlated audio representations are organized (40) into a collection and saved (50) as a single voice file for a speaker. Voice files comprise various formats, or structures. For example, a voice file can be stored as a matrix organized into a number of locations each inhabited by a unique voice sample, or linguistic representation. A voice file can also be stored as an array of voice samples. In a voice file, speech samples comprise sample sounds spoken by a particular speaker. In embodiments, speech samples include sample words spoken, or read aloud, by the speaker from a pronouncing dictionary. Sample words in a pronouncing dictionary are correlated with standardized phonetic units, such as phonemes. Samples of words spoken from a pronouncing dictionary contain a range of distinct phonetic units representative of sounds comprising most spoken words in a vocabulary. 
Samples of words read from such standardized sources provide representative samples of a speaker's natural intonations, inflections, pitch, accent, emphasis, speed, rhythm, pausing, and emotions such as happiness and anger.”  “[0059] … Other samples can be recorded for emphasis, including high and low pitched voicings, as well as to capture voice-modulating emotions such as joy and anger….”  “[0100] …The sender's V-card, however, could also include the sender's distinct sounds, auditory representations, and identifiers (earlier described as the sender's "voice" font). Any electronic communications from that sender could be translated to speech using the sender's voice font. The sender could also be authenticated using the voice font, as earlier described. The V-card could even specify that the sender wishes all their electronic communications to be not only translated to speech, but also translated into a different language. A service provider or network operator may, as earlier mentioned, provide this service.”  “[0014] … Phonemes can be labeled, or classified, with a standardized identifier such as a unique number…..” ]
indexing the original text string, wherein the indexing comprises using the voice identifier to associate the original text with the specific person. [Tischer, Figure 3 is a “data structure” which teaches the “indexing” of the Claim.  Additionally, the “Identifier A1” to “Identifier An” associate the “Speech Samples A, B, …, n” with the “Specific Person” / “Known Speaker X.”  Tischer, in Figure 7, shows another Table/Index/Matrix that establishes a relationship between the input text and the phonemes of a particular speaker.  “[0072] FIG. 7 is a schematic illustrating another exemplary embodiment.  … the TTS engine 507 uses the matrix 612 to retrieve the sequential string 614 of phonemes corresponding to the phrase 608. The TTS engine 507 then processes the sequential string 614 of phonemes when translating the textual sequence 604 to speech.”  “[0073] …  The TTS engine 507 parses the content 600 into as long of textual sequences that can be exactly found in the matrix 612. Using the previous example, if the TTS engine 507 can correlate the entire textual sequence "You are one lucky cricket" … to the same phrase in the matrix 612, then the TTS engine 507 retrieves the corresponding sequential string of phonemes: [0074] [Y UW . AA R . W AH N . L AH K IY . K R IH K AH T.”  See also Figure 8 and description.]
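
For illustration only, the matrix lookup described in the quoted paragraphs [0072]-[0074] of Tischer may be sketched as a greedy longest-match search. This is a hedged sketch and not Tischer's implementation: the function name longest_match_phonemes and the dictionary form of the matrix are assumptions, and the shorter "you are" entry is added here only to show the fallback to a shorter textual sequence.

```python
def longest_match_phonemes(text, matrix):
    """Greedily match the longest textual sequences found in the matrix.

    Returns one phoneme string per matched sequence, or None for a word
    that has no entry in the matrix.
    """
    words = text.lower().split()
    result = []
    i = 0
    while i < len(words):
        # Try the longest candidate starting at word i first.
        for j in range(len(words), i, -1):
            key = " ".join(words[i:j])
            if key in matrix:
                result.append(matrix[key])
                i = j
                break
        else:
            # No sequence starting at word i is in the matrix.
            result.append(None)
            i += 1
    return result


# Hypothetical stand-in for "matrix 612"; the full phrase and its phoneme
# string are taken from the passage quoted from Tischer.
matrix = {
    "you are one lucky cricket": "Y UW . AA R . W AH N . L AH K IY . K R IH K AH T",
    "you are": "Y UW . AA R",  # illustrative shorter entry (assumption)
}

full = longest_match_phonemes("You are one lucky cricket", matrix)
partial = longest_match_phonemes("You are here", matrix)
```

When the entire phrase is found, a single phoneme string is retrieved; otherwise the sketch falls back to the longest shorter sequence that is present.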
Chen and Tischer both pertain to speech synthesis and refer to machine translation using the voice font of the original speaker. It would have been obvious to combine the feature of Tischer, which stores the emotion associated with the speech samples of a user in his voice card (a data structure associated with a speaker), with the system of Chen, which collects and stores voice samples of various speakers in order to generate synthesized speech in the voice of the original speaker, so that the emotion of the speaker may also be reflected in the synthesized speech.  This combination falls under combining prior art elements according to known methods to yield predictable results or simple substitution of one known element for another to obtain predictable results. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.



	Tischer teaches a Data Structure / Matrix / Index that includes the Sounds that teach the “Translated Audio Signal” (or are combined to form the “translated audio signal” depending on what the translated audio signal is) and correspond to “Sample Words” of Figure 4.  Tischer also teaches that the data structure/index of Figure 3 associates the Known Speaker X with the Sounds he uttered with different Emotions and corresponding to the readout of Sample Words.  The Data Structure/ Index of Tischer teaches and correlates all of the information that is recited in the last two limitations of the Claim.
However, a reference that provides a speaker identifier that more expressly combines the person and emotion is added.
Isobe is directed to speech synthesis and more expressly teaches:
storing the translated audio signal in a data structure associated with a voice identifier that combines a performer identifier and an emotional state, wherein the performer identifier is indicative of the specific person and wherein the emotional state is indicative of an emotion expressed by the specific person in the original recording; and [Isobe Figure 4 shows a data structure that shows the ID of the terminal which corresponds to the ID of the Person and the different speech data associated with different emotions.  Here the “Voice Identifier” “combines a performer identifier and an emotional state.”   “A media process server apparatus has a speech synthesis data storage device for storing, after categorizing into emotions, data for speech synthesis in association with a user identifier,…”  Abstract.  “[0052] FIG. 4 is data managed at speech synthesis data storage device 305. The data is managed for each user in association with a user identifier such as a communication terminal ID, a mail address, a chat ID, or an IM ID. In an example of FIG. 4, a communication terminal ID is used as a user identifier, and data for communication terminal 10a 3051 is shown as an example. Data for communication terminal 10a 3051 is speech data of a user's own voice for communication terminal 10a, and is managed, as shown, separately in speech data 3051a in which speech data is registered without being categorized into emotions and data portion by emotion 3051b. Data portion by emotion 3051b has speech data 3052 categorized into emotions and parameter 3053 for each emotion.”]
indexing the original text string, wherein the indexing comprises using the voice identifier to associate the original text with the specific person. [Isobe converts a text message to speech by synthesis and therefore associates the “original text string” with the “user identifier.”   “… a text analyzer for determining, from a text message received from a message server apparatus, emotion of text, and a speech data synthesizer for generating speech data with emotional expression by synthesizing speech corresponding to the text, using data for speech synthesis that corresponds to the determined emotion and that is in association with a user identifier of a user who is a transmitter of the text message.”  Abstract.  See also Figure 3.  Instant Claim converts the initial speech to text and then back to speech.  This reference begins from text and thus dovetails with the Claim.]


Chen, Tischer, and Isobe pertain to speech synthesis. It would have been obvious to combine the data structure/index of Isobe, which shows a User ID with several versions of the user's speech according to different speaker emotions, with the system of the combination of Chen and Tischer, which includes the same information but is not expressly organized as such. Isobe also more expressly associates the input text to be synthesized with the user because it begins with text and derives the emotion of the user from the text of his message.  This combination falls under combining prior art elements according to known methods to yield predictable results or simple substitution of one known element for another to obtain predictable results. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.

Regarding Claim 6, Chen teaches and suggests:
6. The computer-implemented method of claim 1, further comprising associating, in memory, the custom phoneme with a matching standard phoneme, comprising: [Chen, Custom phonemes are the phonemes in the target language that sound like the speaker who speaks in the source language.  This step is from Claim 1 and here just adds that the association is stored in the memory.  See [0061]-[0062] which taken together teach that the “custom” / “mapped phonemes” are stored in the memory 1106 of Figure 11.]
identifying sample text in the source language associated with the custom phoneme; [Note that Figure 5 of the instant Application associates phonemes with letters so the “sample text” could be a letter.  Chen teaches about Phonemes:  “[0035] … Phonemes may comprise context dependent phones, that is, speech sounds where a relative position with other phones results in different speech sounds. For example, if phones "c ae t" of word "cat" are present, "c" is the left phone of "ae," and "t" is the right phone of "ae."”  This teaching is similar to the concept in Figure 5 of the instant Application with respect to finding phonemes for a target sound.  Thus, by identifying the “custom phonemes” which are obtained from the voice of the speaker (Figure 12, VSLS Samples 1202), the system is effectively identifying the “sample text” / Letter such as the “a” in “cat” that is taught by Chen to correspond to the phoneme.  This is suggested by the teachings of Chen. ] [Figure 2 shows the “sample text” as “Hello” in the source language (English) being associated with the custom (speaker specific) phonemes 204(A).]
identifying a matching standard phoneme associated with the sample text; and [Chen, Figure 12, mappings 1210 and 1216 take the process to the standard phonemes / VA: auxiliary voice in the LL language of the listener/target language.  The concept of “sample text” is again suggested by Chen’s teachings in [0035] regarding phonemes.] [Figure 8 shows establishing correspondence between speaker-independent phonemes of English and Spanish.]
associating the custom phoneme with the matching standard phoneme. [Chen, Figure 12, mapping 1222 takes the standard phonemes of the target language into the custom/speaker-specific phonemes in the target language.] [Figure 10 shows voice to voice mapping of phonemes (sub-phonemes) in the same language:  custom (speaker specific) to standard (speaker independent) mapping.]

Chen achieves the goal of obtaining custom phonemes (speaker-specific phonemes) but does not teach going through a “Sample Text,” which, depending on the definition of “sample text,” could be an implied or obvious variation.  (It is not inherent because the use of text is not necessary.)
Tischer teaches:
… associating, in memory, the custom phoneme with a matching standard phoneme, comprising:
identifying sample text in the source language associated with the custom phoneme; [Tischer, Abstract, claim 1 and Figure 4 showing the “Sample Word 103” / “Sample Text” which are associated with various phonemes.   The “Text” 140 associated with its speaker specific phonemes 142.  “[0057] FIG. 4 illustrates an example of translation of text using phonemes in a voice file. Embodiments of the voice file for the voice of a specific known speaker include all of the standardized phonemes as recorded by that speaker. In the example in FIG. 4, the voice file for known speaker X (100) includes recorded speech samples comprising the 39 standard phonemes in the Carnegie Mellon University (CMU) Pronouncing Dictionary ….  Sounds in sample words 103 recorded by known speaker X (100) are correlated with phonemes 112, 122, 132. The textual sequence 140, "You are one lucky cricket" (from the Disney movie "Mulan"), is converted to its constituent phoneme string using the CMU Phoneme Dictionary. Accordingly, the phoneme translation 142 of text 140 "You are one lucky cricket" is: Y UW. AA R. W AH N . L AH K IY. K R IH K AH T. When the voice file 101 is applied, the phoneme pronunciations 112, 122, 132 as recorded in the speech samples by known speaker X (100) are used to translate the text to sound like the voice of known speaker X (100).”]
identifying a matching standard phoneme associated with the sample text; and  [Tischer, Figure 4, The “Sample Word 103” teaches the “Sample Text” of the Claim and is associated with “Phonemes 112, 122, 132” which are shown also in Figure 3.  Figure 4 shows how the phonemes are matched with the “Sample Word 103” / “Sample Text.”]
associating the custom phoneme with the matching standard phoneme. [Tischer, Figures 3-4, The “Phonemes 112, 122, 132” have several versions corresponding to different Emotions of the “Known Speaker X” and each of the versions A1, A2, … An provides the “custom phonemes” of the Claim.]

Chen and Tischer pertain to TTS in various languages and it would have been obvious to modify the system of Chen which uses context dependent phonemes (corresponding to “sample text” of the Claim) but does not mention “text” with the system of Tischer that expressly shows conducting a text to phoneme mapping in order to arrive at a system where the letters (sample texts) are used as intermediaries for mapping.  This combination falls under combining prior art elements according to known methods to yield predictable results or simple substitution of one known element for another to obtain predictable results. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.

Regarding Claim 7, Chen suggests:
7. The computer-implemented method of claim 6, 
wherein the custom phoneme is one of a plurality of different custom phonemes, and [Chen teaches adapting the TTS acoustic model for a particular speaker.  This method could be extended to as many speakers as is desirable although the presence of several speakers is not expressly described in Chen.]
wherein the method further comprises associating, in memory, each of the plurality of custom phonemes with respective standard phonemes of the set of standard phonemes. [Chen teaches in Figure 11 that the “State Mapping Module 1118” and the other parts of the speech synthesizer are stored in “Memory 1108.”]

Regarding Claim 8, Chen teaches the “custom phonemes” of Claim 8 by showing the adaptation of the HMM model in Figure 4 for a particular speaker in a same language.  “[0038] FIG. 4 is a flow diagram illustrating speaker adaptation in a same language 400. At 402, sub-phonemic samples 206 as described above of a first voice, voice "X" or VX, are taken….”  These teach the “custom phonemes” of the Claim.  Chen does not mention the incidental feature of obtaining these phonemes from the recorded speech samples of person X.
Tischer expressly teaches:
8. The computer-implemented method of claim 1, wherein the custom phoneme is from a set of custom phonemes extracted from recorded utterances of the specific person. [Tischer, Figure 2, “Record speech samples of plurality of speakers 20” etc. during which process custom phonemes corresponding to a particular speaker are generated.  See Figure 3 showing the known speaker X speech samples generating the phoneme sets.  See Figure 4 for “known speaker X recorded phonemes.”  See [0042]-[0044].]
Chen and Tischer are directed to TTS.  It would have been obvious to combine the details of extracting the characteristics of the voice of the speaker from recorded samples of his speech from Tischer with the system of Chen, which leaves out the minutiae to focus on the main aspects of cross-lingual speaker adaptation of a TTS device.  This combination falls under combining prior art elements according to known methods to yield predictable results or use of a known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.

Claim 13 is a computer program product claim with limitations similar to the limitations of Claim 1 and is rejected under similar rationale.
13. A computer program product comprising one or more computer readable storage media, and program instructions collectively stored on the one or more computer readable storage media, the program instructions executable by a processor to cause the processor to perform operations comprising:  [Chen:  “[0061] Memory 1106 also resides within or is accessible by the translation computer system and comprises a computer readable storage medium and coupled to processor 1102….”  “[0076] … Moreover, the acts and processes described may be implemented by a computer, processor or other computing device based on instructions stored on memory, the memory comprising one or more computer-readable storage media (CRSM).”]
….

Claim 17 is a computer program product claim with limitations similar to the limitations of Claim 6 and is rejected under similar rationale.

Claim 18 is a system claim with limitations similar to the limitations of Claim 1 and is rejected under similar rationale.
18. A computer system comprising a processor and one or more computer readable storage media, and program instructions collectively stored on the one or more computer readable storage media, the program instructions executable by the processor to cause the processor to perform operations comprising: [Chen: “1. One or more computer-readable storage media storing instructions for cross-language speaker adaptation in speech-to-speech language translation that when executed instruct a processor to perform acts comprising ….”  See also [0076]-[0077].]
….

Claim 20 is a system claim with limitations similar to the limitations of Claim 6 and is rejected under similar rationale.

Claims 2, 9-12, and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Chen and Tischer and Isobe in view of Chun (U.S. 9,922,641).
Regarding Claim 2, Chen does not teach a multimedia situation, although the method of processing the audio is not impacted by the source of the audio.
Tischer teaches:  “[0071] Exemplary embodiments have many other useful applications. Embodiments can be used in a variety of computing platforms, ranging from computer network servers to handheld devices, including wireless telephones and personal digital assistants (PDAs). Customized text-to-speech translations, according to exemplary embodiments, can be utilized in any situation involving automated voice interfaces, devices, and systems. Such customized text-to-speech translations are particularly useful in radio and television advertising, in automobile computer systems providing driving directions, in educational programs such as teaching children to read and teaching people new languages, for books on tape, for speech service providers, in location-based services, and with video games.”  Tischer does not expressly mention a conferencing situation which is intended by the Claim.
Neither does Isobe.
Chun teaches:
2. The computer-implemented method of claim 1, further comprising: 
extracting the original audio signal from sequential data representative of sequentially matched multimedia content. [Chun teaches that its cross-lingual speaker adaptation method may be used for video conferencing situations which teach “sequentially matched multimedia content” of the Claim: “In some implementations, the speech recognition engine 205 can be configured to recognize human speech. For example, the speech recognition engine can be configured to produce speech data from audio and/or video data, for example, data captured by a user device (e.g., a computer system) though a microphone and/or a video camera. Such audio or video data can originate from, for example, a video or audio conference session between two or more users. In some implementations, each of the users can participate in the audio/video conference session using respective user devices.”  Col. 5, line 60 to Col. 6, line 3.]
Chen/Tischer/Isobe and Chun are directed to cross-language speaker adaptation of a speech synthesizer system and have considerable overlap.  It would have been obvious to combine the aspect of Chun which discusses the use of its system in a video conferencing service with the system of Chen/Tischer/Isobe, which is well suited for this purpose, as an application of the method and system of Chen.  This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.


Regarding Claim 9, Chen is described in the context of translating a source to a target language (English to Spanish) but there is no limitation on how many second/target languages there could be.
Tischer teaches:  “[0100] … The sender could also be authenticated using the voice font, as earlier described. The V-card could even specify that the sender wishes all their electronic communications to be not only translated to speech, but also translated into a different language. …”
Isobe mentions that other languages may be at work.  “[0053] … In a case in which a language that is an object of the speech synthesis service is a language other than Japanese, speech data should be registered by using a section unit suited for the language instead of bunsetsu, or a segment.”
Chun expressly teaches the multiple language configuration:
9. The computer-implemented method of claim 1, 
wherein the target language is one of a plurality of different target languages, and [Chun, “… A speech synthesizer can be configured to speak multiple languages with different voice characteristics. Speech can be synthesized in multiple languages using the voice characteristics of a particular individual, even though the individual may not actually speak or even know one or more of the multiple languages….”  Col. 2, lines 25-30.]
wherein the generating of the translated text string comprises generating a plurality of translated text strings, the generating of the plurality of translated text strings comprising translating the original text string from the source language to each of the plurality of target languages. [Chun, Figure 2, “Translation Engine.”  “In some implementations, a speech synthesizer can be configured to resemble the voice of a particular individual. For example, consider a video conference system where the participants speak different languages and the speech of a speaker in one language is translated into various other languages….”  Col. 3, lines 35-41.  Chun (like Chen) goes through a STT and MT and then TTS : “In some implementations, the translation engine 230 reads the text file (in a first language) output by the speech recognition engine 205, and uses the text file to generate a second text file in a pre-specified target language. For example, the translation engine 230 may read an English-language text file and generate a French-language text file based on the English-language text file. …”  Col. 6, lines 55-62.  See also claims 7-8.]
Chen/Tischer/Isobe and Chun are directed to cross-language speaker adaptation of a speech synthesizer system and have considerable overlap.  It would have been obvious to modify and extend Chen, which focuses on one example of cross-lingual translation with speaker adaptation, to translation into several languages using the system of Chun/Tischer/Isobe, which is express on this point.  The use of Chun is hardly necessary and Chun is added to Chen merely to expedite prosecution.  This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
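For illustration only, the limitation of Claim 9 (one original text string translated into each of a plurality of target languages) can be sketched as a simple fan-out over the target languages.  This is a minimal sketch and not the implementation of Chun or Chen; the function names, the language codes, and the toy lookup table standing in for a translation engine are all assumptions made for the example.

```python
from typing import Callable, Dict, List


def translate_to_all(
    text: str,
    source: str,
    targets: List[str],
    translate: Callable[[str, str, str], str],
) -> Dict[str, str]:
    """Return one translated text string per target language."""
    return {lang: translate(text, source, lang) for lang in targets}


# Toy stand-in for a machine translation engine (hypothetical, for illustration).
_TOY_TABLE = {
    ("en", "es", "hello"): "hola",
    ("en", "fr", "hello"): "bonjour",
}


def toy_translate(text: str, source: str, target: str) -> str:
    # Unknown entries are returned unchanged rather than guessed.
    return _TOY_TABLE.get((source, target, text.lower()), text)


strings = translate_to_all("Hello", "en", ["es", "fr"], toy_translate)
```

Each entry of `strings` would then be passed to the synthesis stage for its own target language.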

Regarding Claim 10, Chen uses the example of Hello to Hola which is translating a word.  “[0016] FIG. 7 is an illustration of HMM models for words in two different languages.”
Tischer too teaches the word-based pronunciation dictionaries:  “[0044] … In embodiments, speech samples include sample words spoken, or read aloud, by the speaker from a pronouncing dictionary….”  See [0044]-[0047].
Isobe teaches that emotions are detected from words of the text.  “[0040] …Text analyzer 302 recognizes an emotion class of text also from words expressing emotions such as "delightful", "sad", "happy", and the like.”  Isobe is not directed to translation.
Chun expressly teaches:
10. The computer-implemented method of claim 9, 
wherein the plurality of translated text strings include, as translated words, respective translations of the word from the source language to each of the plurality of target languages. [Chun is also word based and additionally teaches translation to a plurality of languages expressly:  “In some implementations, the speech recognition engine 205 may include a speech segmentation routine for breaking sounds into sub-parts and using those sub-parts to identify words, a word disambiguation routine for identifying meanings of words, a syntactic lexicon to identify sentence structure, parts-of-speech, etc., and a routine to compensate for regional or foreign accents….  Self-describing computing languages can be useful in this context because they enable tagging of words, sentences, paragraphs, and grammatical features in a way that is recognizable to other computer programs or modules in the system 200. For example, the translation engine 230 can be configured to read the text output from the speech recognition engine 205, identify, e.g., words, sentences, paragraphs, and grammatical features, and use that information as needed.”  Col. 6, lines 36-55.]
Chun was cited for teaching translation into a plurality of languages and this aspect which pertains to translation into a plurality of target languages also comes from Chun under the rationale provided for Claim 9.

Regarding Claim 11, Chen teaches:
11. The computer-implemented method of claim 10, 
wherein the assembling of the standard phoneme sequence comprises assembling standard phoneme sequences for respective target languages including standard pronunciations of the translated words. [Chen, Figure 12 which shows the stages of transformation of the phonemes goes through a VALL Model 1212 where the VALL HMM States 1214 correspond to the standard/ speaker-independent phonemes of the target language (LL: Language of Listener).]
Chen does not teach having a plurality of languages.
Tischer is not express on this point either, although applying the method to different languages would have been obvious.
Isobe mentions the possibility of using the system with other languages but is not directed to translation.
Chun teaches:
wherein the assembling of the standard phoneme sequence comprises assembling standard phoneme sequences for respective target languages including standard pronunciations of the translated words. [Chun, Figure 1, “speaker independent speech model for second language 135” includes an assembly of “standard phoneme sequences” for the second/target language.  Chun teaches translation into multiple languages:  “A speech synthesizer can be configured to speak multiple languages …. Speech can be synthesized in multiple languages ….”  Col. 2, lines 25-33.  Chun, Figure 1, also teaches the use of a “universal speech model 105” which is both speaker-independent and language-independent:   “The universal speech model can include a Gaussian mixture model that represents a plurality of speakers speaking one or more languages. The universal speech model can include a plurality of speech parameters estimated based on speech from the plurality of speakers. The speaker-independent speech synthesis model can include a plurality of hidden Markov models (HMMs).”  Col. 2, lines 3-12.]
Chen/Tischer/Isobe and Chun may be combined under the same rationale provided for Claim 9 as Chun was introduced for teaching a plurality of languages.

Regarding Claim 12, Chen teaches:
12. The computer-implemented method of claim 11, 
wherein the associating of the custom phoneme with the standard phoneme comprises associating custom phonemes with standard phonemes of the standard phoneme sequences, [Figure 10, shows the mapping from one voice to another such as standard voice (Auxiliary (A)) to a specific speaker’s voice (Voice of the Speaker Vs).]
wherein the custom phoneme includes the specific person's pronunciations of respective sounds in the translated words.
 (See also:  
Tischer, Figure 2, “correlate speech samples with standardized audio representations 30.”  This means speaker-independent phoneme set.  
Chun, Figure 1, “Speaker Independent Speech Model for Second Language 135” goes through “Modifying Speech Model for Second Language 140” to generate the “modified Speech Model for Second Language 145.”)

Regarding Claim 14, Chen does not discuss a distributed computing system where the programs or data are stored remotely and received via a network.  
Neither does Tischer.
Isobe, Figure 1, shows a distributed system but performs the synthesis itself remotely at the server.
Chun expressly teaches:
14. The computer program product of claim 13, 
wherein the stored program instructions are stored in a computer readable storage device in a data processing system, and [Chun, “In another aspect, the subject matter described in this specification can be embodied in a computer program product comprising computer readable instructions encoded on a storage device….”  Col. 1, lines 54-57.  Figure 4, “Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.”  Col. 11, lines 49-60.]
wherein the stored program instructions are transferred over a network from a remote data processing system. [Chun, Figure 2, “FIG. 2 is an example of a system 200 that uses cross-lingual speaker adaptation for multi-lingual speech synthesis such as described above with reference to FIG. 1. The system 200 can include, for example, a speech recognition engine 205, a training engine 210, a translation engine 230 and a speech synthesis engine 235. Two or more of these engines can be in communication with one another, possibly over a network such as a local area network (LAN), wide area network (WAN) or the internet….”  Col. 5, lines 46-59.]
Chen/Tischer/Isobe and Chun are directed to cross-language speaker adaptation of a speech synthesizer system and have considerable overlap.  It would have been obvious to place the system of Chen/Tischer/Isobe in the distributed computing setting that Chun discloses to allow for the most expedient and convenient locating of the resources.  This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.

Claims 3-5, 16, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Chen and Tischer and Isobe in view of Rifkin (U.S. 20040225498).
Regarding Claim 3, Chen demonstrates its method as a communication between a speaker and a listener and then mentions the situation of mapping the voices of several speakers to an auxiliary model.  See Figure 12 and [0063].  Chen does not discuss looking for the voice font/print/profile of a particular speaker.
Tischer teaches:
3. The computer-implemented method of claim 1, further comprising: 
generating a query voice print from the original audio signal; [Tischer, Figure 2, “select one of stored voice files 70.”]
searching a database for a matching voice print that matches the query voice print; and [Tischer, Figure 3 shows the database of voice files.  “[0045] As an example, FIG. 3 shows a voice file 101. The voice file 101 comprises speech samples A, B, . . . n of known speaker X (100)….”  Figures 13 and 14 show storage 620 of a plurality of voice files.  “[0080] … As the paragraphs above explained, there may be a plurality 620 of voice files, with each voice file 612 having the characteristics of a known speaker. Each speaker's voice file contains that speaker's distinct sounds, auditory representations, and identifiers. Each speaker's voice file uniquely characterizes that speaker's speech speed, emphasis, rhythm, pitch, and pausing….”]
associating the specific person with an identifier associated with the matching voice print. [Tischer, Figures 13 and 14 showing authentication of a voice.  Figure 17 showing the flowchart of an authentication process:  “compare speech to a speaker’s unique voice characteristics stored in a voice file 732” and  “authenticate speaker/sender 738.”  Authentication is a bit different from Identification.]
Chen and Tischer are directed to natural language processing of voice.  It would have been obvious to combine the speaker authentication of Tischer with the system of Chen in a tandem manner as a precursor to the method of Chen and for automatic identification of the speaker of Chen.  This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
Tischer does not teach Identification which is slightly different from Authentication.
Neither does Isobe since the identity is known.
Rifkin teaches:
3. The computer-implemented method of claim 1, further comprising: 
generating a query voice print from the original audio signal; [Rifkin receives the voice of a speaker and sends a feature vector (voice print) extracted from the voice to a database of enrolled speakers where a voice sample (including an extracted voice print) for each speaker is stored.  See Figure 2, “Voice Sample Database 275” and “Data Querying Module 242” which receives the input voice at the “Voice Processing Module 210.”  Figure 2 is an expansion of “Voice Recognition Module 155” of Figure 1.]
searching a database for a matching voice print that matches the query voice print; and [Rifkin, the “Voice Sample Database 275” is searched.  “[0023] …Generally, the speaker interface 110 sends voice samples of a speaker to the computing environment 105 for enrollment or recognition. During recognition, if the computing environment 105 outputs an identification if it matches a recognition voice sample with an enrollment voice sample. As used herein, the term "voice sample" refers to one or more words or partial words, phrases, numbers, codes, or any other vocal projection or utterance, of any language or dialect, by the speaker. A "feature vector," such as a voice print or speaker data point, refers to characteristics of the voice sample as indicated by metrics in the time or frequency domain, biometrics, statistics, and the like. ….”]
associating the specific person with an identifier associated with the matching voice print. [Rifkin, the “Identification Module 250” identifies the user based on the match of his input voice with the samples/ voiceprints in the databases 275.  “[0036] The system 100 identifies 320 an unidentified speaker using an unidentified voice sample to determine whether the speaker has enrolled. The unidentified speaker projects one or more voice samples towards the voice capture device 112. Using the reference information accumulated during enrollment 310, the system 100 is able to match the unidentified speaker to an enrolled speaker by unique characteristics, such as biometric parameters, of the speaker's voice. In one embodiment, the system 100 maps a feature vector extracted from the unidentified voice sample into the data structure and retrieves a certain number of approximate nearest neighbors for modeling….”]
Chen/Tischer/Isobe and Rifkin are directed to natural language processing of voice.  It would have been obvious to combine the speaker identification of Rifkin with the system of Chen/Tischer/Isobe in a tandem manner as a precursor to the method of Chen and for automatic identification of the speaker of Chen.  This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
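For illustration only, the three steps of Claim 3 (generate a query voice print, search a database for a match, and associate the speaker with the matching identifier) can be sketched as follows.  This is a minimal sketch under stated assumptions and is not Rifkin's method: Rifkin retrieves approximate nearest neighbors from a data structure, whereas the sketch substitutes an exhaustive cosine-similarity search.  The feature vectors, the speaker identifiers, and the 0.8 threshold are hypothetical.

```python
import math
from typing import Dict, List, Optional


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two feature vectors (voice prints)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify(
    query: List[float],
    enrolled: Dict[str, List[float]],
    threshold: float = 0.8,  # hypothetical acceptance threshold
) -> Optional[str]:
    """Return the identifier of the best-matching enrolled voice print, or None."""
    best_id, best_score = None, threshold
    for speaker_id, voice_print in enrolled.items():
        score = cosine(query, voice_print)
        if score >= best_score:
            best_id, best_score = speaker_id, score
    return best_id


# Hypothetical database of enrolled voice prints keyed by speaker identifier.
enrolled = {"speaker_x": [1.0, 0.0, 0.0], "speaker_y": [0.0, 1.0, 0.0]}
match = identify([0.9, 0.1, 0.0], enrolled)     # close to speaker_x
no_match = identify([0.0, 0.0, 1.0], enrolled)  # unlike any enrolled print
```

Returning `None` when no score reaches the threshold reflects the distinction noted above between identifying an unknown speaker and authenticating a claimed one.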

Regarding Claim 4, Chen is not about speaker identification, verification, or authentication.
Tischer teaches:
4. The computer-implemented method of claim 3, further comprising: 
associating the original text string with the identifier associated with the matching voice print. [Tischer does teach associating the input audio with a “textual sequence corresponding to the audio content.”  See abstract, claim 1 and Figure 4.  “[0057] …Sounds in sample words 103 recorded by known speaker X (100) are correlated with phonemes 112, 122, 132. The textual sequence 140, "You are one lucky cricket" (from the Disney movie "Mulan"), is converted to its constituent phoneme string using the CMU Phoneme Dictionary. Accordingly, the phoneme translation 142 of text 140 "You are one lucky cricket" is: Y UW. AA R. W AH N . L AH K IY. K R IH K AH T. When the voice file 101 is applied, the phoneme pronunciations 112, 122, 132 as recorded in the speech samples by known speaker X (100) are used to translate the text to sound like the voice of known speaker X (100).”]
Chen and Tischer are directed to natural language processing of voice and to TTS.  It would have been obvious to combine the association between a speaker and the text of his speech from Tischer with the system of Chen, again as a precursor and a pre-processor to the method of the combination, so that, in addition to the automatic identification of the speaker, the text of his speech is also associated with the speaker identity, which makes the record-keeping aspects easier for Chen.  This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
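For illustration only, the text-to-phoneme conversion quoted from Tischer [0057] can be sketched as a dictionary lookup followed by application of a speaker's voice file.  This is a minimal sketch and not Tischer's implementation.  The pronunciations are those quoted in [0057]; the file paths, the partial voice file, and the fallback to a standard unit when no custom recording exists are assumptions made for the example.

```python
from typing import Dict, List

# Pronunciations as quoted from Tischer [0057].
PRONUNCIATIONS: Dict[str, List[str]] = {
    "you": ["Y", "UW"],
    "are": ["AA", "R"],
    "one": ["W", "AH", "N"],
    "lucky": ["L", "AH", "K", "IY"],
    "cricket": ["K", "R", "IH", "K", "AH", "T"],
}


def to_phonemes(text: str) -> List[str]:
    """Convert a textual sequence to its constituent phoneme string."""
    phonemes: List[str] = []
    for word in text.lower().split():
        phonemes.extend(PRONUNCIATIONS[word.strip(".,!?")])
    return phonemes


def apply_voice_file(phonemes: List[str], voice_file: Dict[str, str]) -> List[str]:
    """Pick the speaker's recorded unit for each phoneme.

    Falling back to a standard unit is an assumption of this sketch.
    """
    return [voice_file.get(p, f"standard/{p}") for p in phonemes]


sequence = to_phonemes("You are one lucky cricket")

# Hypothetical, deliberately partial voice file for known speaker X.
voice_file_x = {"Y": "speaker_x/Y.wav", "UW": "speaker_x/UW.wav"}
units = apply_voice_file(sequence, voice_file_x)
```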

Regarding Claim 5, Chen discusses a single speaker scenario.
Tischer teaches:
5. The computer-implemented method of claim 4, further comprising: 
retrieving the custom phoneme from memory based on the custom phoneme being associated with the identifier associated with the specific person. [Tischer, Figure 2, “select one of the stored voice files 70.”  “[0042] … A TTS device may have any number of voice files stored for use in translating speech to text. A user of the TTS device selects (70) one of the stored voice files and applies (80) the selected voice file to a translation of text to speech using a TTS engine, such as TTS engine 507. In this manner, a text is translated to speech using the voice and speech patterns and attributes of a known speaker. …”  See Figure 3 for how the voice files associate the speaker identity and his speech with the phonemes.]
Rationale as provided for Claim 1. 

Claim 16 is a computer program product claim with limitations similar to the limitations of Claim 3 and is rejected under similar rationale.
Claim 19 is a system claim with limitations similar to the limitations of Claim 3 and is rejected under similar rationale.

Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over Chen and Tischer and Isobe in view of Erickson (U.S. 20170116185).
Regarding Claim 15, Chen does not discuss metering the translation service or charging for it.  
Neither do Tischer or Isobe.
Erickson teaches:
15. The computer program product of claim 13, 
wherein the stored program instructions are stored in a computer readable storage device in a server data processing system, and  [Erickson pertains to cloud computing including servers:  “[0027] Cloud computing is a model of service delivery for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, network bandwidth, servers, processing, memory, storage, applications, virtual machines, and services) that can be rapidly provisioned and released with minimal management effort or interaction with a provider of the service. This cloud model may include at least five characteristics, at least three service models, and at least four deployment models.”]
wherein the stored program instructions are downloaded in response to a request over a network to a remote data processing system for use in a computer readable storage device associated with the remote data processing system, further comprising:  [Erickson teaches the use of cloud computing for natural language processing including machine translation.  “[0002] The present disclosure relates in general to natural language processing (NLP) systems, specifically NLP systems that include natural language generation (NLG), which include natural language translation (NLT) systems, natural language processing question & answer (NLP Q&A) systems, natural language dialogue systems and the like. More specifically, the present disclosure relates to a NLP system designed to integrate disfluencies with natural language (NL) outputs, wherein the disfluencies are selected and applied in a manner that communicates in natural language a level of confidence in the NL outputs.”  “[0004] MT systems are computer-based tools that support the translation of speech and/or text from one human language to another. MT systems come in a variety of forms, including, for example, fully automated machine translation (FAMT) systems, human-assisted machine translation (HAMT) systems, machine-aided translation (MAT) systems, and the like.”  “[0005] NLT systems are a known type of MT system. NLT systems are computer-based tools that allow two or more individuals in more or less immediate interaction, typically through email or otherwise online, to communicate in different languages. For example, cross-linguistic communication systems allow two or more people who are not fluent in the same language to communicate with one another. Speech recognition and language translation technologies have improved sufficiently that cross linguistic communication can now be automatically supported by technology. 
As used in the present disclosure, references to a speaker and/or a hearer using a NLT system include scenarios in which the "speaker" produces communications by typing or writing, along with situations in which the "hearer" receives communications by reading text.”  “[0098] Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network….”]
program instructions to meter use of the program instructions associated with the request; and [Erickson teaches a metering capability:  “[0033] Measured service: cloud systems automatically control and optimize resource use by leveraging a metering capability at some level of abstraction appropriate to the type of service (e.g., storage, processing, bandwidth, and active user accounts). Resource usage can be monitored, controlled, and reported providing transparency for both the provider and consumer of the utilized service.”]
program instructions to generate an invoice based on the metered use. [Erickson also teaches an invoicing capability:  “[0056] In one example, management layer 80 may provide the functions described below. Resource provisioning 81 provides dynamic procurement of computing resources and other resources that are utilized to perform tasks within the cloud computing environment. Metering and Pricing 82 provide cost tracking as resources are utilized within the cloud computing environment, and billing or invoicing for consumption of these resources. In one example, these resources may comprise application software licenses. Security provides identity verification for cloud consumers and tasks, as well as protection for data and other resources. User portal 83 provides access to the cloud computing environment for consumers and system administrators. Service level management 84 provides cloud computing resource allocation and management such that required service levels are met. Service Level Agreement (SLA) planning and fulfillment 85 provide pre-arrangement for, and procurement of, cloud computing resources for which a future requirement is anticipated in accordance with an SLA…..”]

Chen/Tischer/Isobe and Erickson pertain to natural language processing including cross-lingual communication and machine translation, and it would have been obvious to add the measuring/metering of the provision of the services for the purpose of billing and invoicing from Erickson to the system of Chen/Tischer/Isobe in order to monetize the translation system of Chen/Tischer/Isobe.  This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
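For illustration only, the metering and invoicing limitations of Claim 15 can be sketched as a usage counter per account and an invoice computed from the metered use.  This is a minimal sketch and not Erickson's implementation; the class name, the account identifier, the usage unit, and the rate are hypothetical.

```python
from collections import defaultdict
from decimal import Decimal
from typing import DefaultDict


class UsageMeter:
    """Accumulates metered use per account and prices it on request."""

    def __init__(self) -> None:
        self._units: DefaultDict[str, int] = defaultdict(int)

    def record(self, account: str, units: int) -> None:
        # Meter the use associated with one request.
        self._units[account] += units

    def invoice(self, account: str, rate_per_unit: Decimal) -> Decimal:
        # Invoice amount is metered use times a (hypothetical) unit rate.
        return self._units[account] * rate_per_unit


meter = UsageMeter()
meter.record("acct-1", 120)  # e.g., units of translated speech
meter.record("acct-1", 30)
total = meter.invoice("acct-1", Decimal("0.02"))
```

`Decimal` is used instead of floating point so that the invoiced amount is exact.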
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 

Balasubramanian (U.S. 20150379989)



Freeland (U.S. 20030028380):
[0033] Preferably, there can be provided conversion from an original text-based message to a corresponding text-based message which involves a translation between two established human languages, such as French and English. Of course translation may involve either a source or a target language which is a constructed or devised language which is attributable to, associated with, or at least compatible with the character (for example, the Pokemon language). Translation between languages may be alternative or additional to substitution to an idiom of the character.

Use of Emotion in the voice font without translation:
Tischer (U.S. 20040111271)
Teegan (U.S. 2008/0291325)
Park (U.S. 20190385588)
Bhakta (U.S. 20080034044)

Claims 4-5, 8 and 15 are directed to applications/uses of the method in other systems.  Claim 6 corresponds to Figure 5 and needs further particularity in the definition of terms of the Claim.

For Claim 4 see also:
Lobazkov (U.S. 20100217600), Figure 9, steps 332 to 338 identify the sender of a communication (which may be a speaker in a phone call) and determine the text that is associated with this sender/user/speaker.  The identification is not via voice print.  “[0092] Reference is now made to FIG. 9 to describe steps in the method of text-to-speech conversion at the portable electronic device 100. A communication, such as a telephone call or electronic message in the form of an SMS, email, MMS, or Personal Identification Number (PIN) message, is received at the portable electronic device 100 (step 330). The originator of the communication is then determined by an identifier such as the phone number provided using caller identification in the case of a telephone call or by identifying the phone number for SMS and MMS messages, the email address for email messages, or PIN number for PIN messages (step 332). The identifier of the originator is then compared to the contact data listed in the appropriate category of the contact data records to match the identifier to one of the contacts in the address book (step 334)….”

For Claim 6, see:
Etezadi (U.S. 20090006097) deals with speaker-independent TTS (only standard phonemes, no custom phonemes), expressly shows words and letters as intermediaries for the mapping, and teaches:
… associating, in memory, the custom phoneme with a matching standard phoneme, comprising: [Etezadi, Figure 2, “Memory 262” includes the association of “standard phonemes” and words of lexicons of different locales (languages).  “[0020] A pronunciation correction system (PCS) 266 is operative to correct pronunciation of text-to-speech (TTS) systems and speech recognition systems between different spoken languages, as described herein. The PCS 266 may apply letter-to-speech (LTS) rules sets and call the services of a lexicon service (LS) 267, as described below with reference to FIGS. 3-5.”]
identifying sample text in the source language associated with the custom phoneme; [Etezadi, Figure 3, 320: mapping of German (target) phonemes to letters/ “sample text.” These are not custom phonemes.  Just standard phonemes.  “[0034] … Referring then to FIG. 3, a German language phoneme table 320 is illustrated for containing a mapping of phonemes in the target language, for example, German, that correspond to phonemes comprising the beginning or target language, for example English….”] [Etezadi does not have speaker-specific phonemes (custom phonemes) but it finds/identifies words that correspond to standard phonemes in a foreign/target language.  Etezadi, Figures 2 and 4, “begin word lookup 405” to “search lexicon 410” to “word found? 415” to NO to “Locale Matches TTS/SR? 425” to again NO to “Query Lexicon Service 430.”  “[0030] According to embodiments of the invention, when a word or phrase requires text-to-speech conversion or speech recognition, a search of a word lexicon associated with the TTS system 268A or speech recognition system 268B is conducted. … If a matching word is not found, locale data for the word requiring pronunciation is determined….”   “[0031] If the locale for the word requiring pronunciation is different from a locale of a TTS and/or speech recognition system in use, a lexicon service 267 is queried to obtain a mapping of the phonemes associated with the word requiring pronunciation to corresponding phonemes of the language associated with the TTS and/or speech recognition system responsible for translating the word from text-to-speech or for recognizing the word….”  “[0032] If a word or phrase fails to be found via the lexicon service 267, the TTS system or SR system will then apply the LTS rules, as described below….”]
identifying a matching standard phoneme associated with sample text; and [Etezadi, Figure 3, 320: mapping of German (target) phonemes to letters/ “sample text.” These are standard phonemes of German mapped to letters.] [Etezadi, Figure 4, “apply LTS rules 440.”  “[0048] …If the locale of the words not found in the word lexicon matches a locale for a the TTS and/or SR system in use, the method proceeds to operation 440, and a letter-to-speech (LTS) rules system is applied to the subject words for the target language, for example, German, and the resulting LTS output is passed to the TTS and/or SR systems for generating an audible presentation of the subject word or words or for recognizing the subject word or words.”  Figure 5 shows the generation “cross map phonemes 545”: “[0055] At operation 535, a phoneme mapping table 310 is generated for the incoming or starting words, for example, the words "The Beatles" according to the incoming or starting language, for example, English, as described above with reference to FIG. 3. At operation 540, a one-to-one mapping between starting language phonemes comprising the subject words is made to corresponding phonemes of the destination or target language, for example, German. At operation 545, a lookup table may be used for mapping phonemes comprising the subject words according to the starting or incoming language to corresponding phonemes of the target or destination language. For example, a lookup table may be generated, as described above, for mapping phonemes from any starting language to corresponding phonemes, if available, in a target or destination language. For example, referring to FIG. 3, the phoneme "th" 325 in the English phoneme mapping table 310 is mapped to the phoneme "z" 335 in the German phoneme mapping table 320 for the words "The Beatles."”]
associating the custom phoneme with the matching standard phoneme.
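For illustration only, the word lookup of Etezadi Figure 4 and the phoneme cross-mapping of Figure 5, as quoted above, may be sketched as follows. The sketch is not taken from the reference. The lexicon contents, the phoneme symbol "ax", the placeholder letter-to-speech rule (one phoneme per letter), and the choice to pass unmapped phonemes through unchanged are all assumptions; only the mapping of English "th" to German "z" comes from the quoted paragraph [0055].

```python
from typing import Dict, List, Tuple

TTS_LOCALE = "de"  # locale of the TTS system in use (assumed German)

# Hypothetical word lexicon of the TTS system.
WORD_LEXICON: Dict[str, List[str]] = {"hallo": ["h", "a", "l", "o"]}

# Hypothetical lexicon service: (locale, word) -> source-language phonemes.
LEXICON_SERVICE: Dict[Tuple[str, str], List[str]] = {
    ("en", "the"): ["th", "ax"],
}

# Lookup table from source-language to target-language phonemes.
# Only "th" -> "z" is taken from the quoted text; the rest is left empty.
CROSS_MAP: Dict[Tuple[str, str], Dict[str, str]] = {
    ("en", "de"): {"th": "z"},
}


def apply_lts_rules(word: str) -> List[str]:
    """Placeholder letter-to-speech rules: one phoneme per letter."""
    return list(word)


def pronounce(word: str, locale: str) -> List[str]:
    word = word.lower()
    # Search the word lexicon of the TTS system first.
    if word in WORD_LEXICON:
        return WORD_LEXICON[word]
    # If the word's locale differs from the TTS locale, query the lexicon service.
    if locale != TTS_LOCALE:
        source = LEXICON_SERVICE.get((locale, word))
        if source is not None:
            table = CROSS_MAP.get((locale, TTS_LOCALE), {})
            # One-to-one mapping; unmapped phonemes pass through (assumption).
            return [table.get(p, p) for p in source]
    # Locale matches, or the lexicon service had no entry: apply LTS rules.
    return apply_lts_rules(word)
```

Every phoneme in this sketch is a standard phoneme of some locale; nothing in it is keyed to a particular speaker, which is the sense in which Etezadi is characterized above as lacking custom phonemes.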

[Image: media_image16.png]

[Image: media_image17.png]


[Image: media_image18.png]


Gabryjelski (U.S. 2020/0058289) pertains to Dubbing and includes “sequentially matched multimedia content” comprising audio, video, and at times text.  See, e.g., Figure 2.

[Image: media_image19.png]


Nesvadba (U.S. 2006/0285654) is directed to translation of audio in a video stream and teaches the “sequentially matched multimedia content” of the Claim.  Title:  “… Automatic Dubbing on an Audio-Visual Stream.”  See Figures 1 and 2, which show the “audio-visual stream 2” entering the “audio-visual splitter 3” to yield the “audio stream 5.”
Any inquiry concerning this communication or earlier communications from the examiner should be directed to FARIBA SIRJANI whose telephone number is (571)270-1499. The examiner can normally be reached on 9 to 5, M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Desir can be reached on 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Fariba Sirjani/
Primary Examiner, Art Unit 2659