DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1 to 2, 4 to 5, 10 to 11, 13 to 14, 19 to 20, and 28 to 29 are rejected under 35 U.S.C. 103 as being unpatentable over Subramanian et al. (U.S. Patent Publication 2011/0184721) in view of Veeramani et al. (U.S. Patent Publication 20180188142).
Concerning independent claims 1, 10, and 19, Subramanian et al. discloses an apparatus, method, and computer program for communicating across voice and text channels with emotion preservation, comprising:
“receiving media comprising video and audio, wherein the audio comprises a first plurality of spoken words in a first language” – a voice communication is received (¶[0005]); a generic word recognition process for recognizing words in speech begins by receiving an audio communication channel with a stream of human speech (“wherein the audio comprises a first plurality of spoken words”) (¶[0033]: Figure 1A: Step 102); emotion communication architecture 200 can be incorporated into virtually any device including audio entertainment components (television) (¶[0040]: Figure 2); text and e.g., English; exemplary network topologies incorporating emotion preservation include content of data that may take the form of teleconferencing, multimedia entertainment, movies, television, cable programs, and videoconferencing (¶[0106]: Figure 10); configuring server 1062 at a media distribution center with an ability to markup text would aid in the enjoyment of media as entertainment media or film monologues (¶[0128]: Figure 10); Figure 10 illustrates monitors 1067, 1068, and 1069 that appear to be at least a television; “media comprising video and audio”, then, are included in the embodiments of multimedia entertainment, movies, television, and videoconferencing; 
“determining an emotional state expressed in the first plurality of spoken words based on a first set of non-linguistic characteristics associated with the first plurality of spoken words” – emotion recognition systems operate on the principle that emotions, or the emotional state of the speaker, can be distilled into an acoustic representation of sub-emotion units that make up speech, i.e., specific pitches, tones, cadences, and amplitudes (¶[0032]); unlike word recognition, emotional content of speech is evaluated from human voice patterns comprised of wide ranging pitches, tones, and amplitudes (¶[0035]: Figure 1B); speech communication is fed to voice analyzer 232, which performs two primary functions; it recognizes words, and it recognizes emotions from audio communication; emotion recognition may operate by matching concatenated chains of sub-emotion speech patterns extracted from the audio stream to pre-constructed emotion unit models; the voice patterns include specific pitches, tones, cadences, and amplitudes (¶[0052]: Figure 2); voice analyzer 232 recognizes emotion by extracting voice patterns from the verbal communication that are indicative of emotion, i.e., pitch, tone, cadence, and amplitude of the verbal delivery that characterizes the emotion (¶[0068]: Figure 2); here, voice analyzer 232 determines an emotional state from “a first set of non-linguistic characteristics” comprising pitch, tone, cadence, and amplitude (Compare Specification, ¶[0006] and ¶[0025], where non-linguistic characteristics include pitch, timbre, tone, accent, and rhythm); 
“identifying a person that utters the first plurality of spoken words based on metadata associated with the media” – context analyzer 230 assumes an identity of a speaker/user of a cell phone as the owner of the phone from connection information, e.g., phone number, instant message screen name, or email address (¶[0049]: Figure 2); here, information of a phone number, instant message screen name, or email address is equivalent to “metadata associated with the media”; that is, an identity of a speaker/user is determined from ‘metadata’ of information stored in a cell phone instead of from an identity determined from analysis of voice; an identity of a speaker may be determined by comparing voice patterns in the conversation with voice patterns from identified speakers; if voice analyzer 232 recognizes a speaker’s voice from voice 
“identifying vocal characteristics of the identified person that utters the plurality of spoken words” – a speaker profile specifies a speaker’s language, dialect, and geographic region, and personality attributes that define the uniqueness of the speaker’s communication (¶[0048]: Figure 2); voice analyzer 232 performs a set of functions including speaker voice analysis; voice analyzer 232 analyzes the voice for speaker voice pattern recognition; if a speech in the communication matches a voice pattern, voice analyzer 232 notifies context analyzer 230, which then sends a more complete context profile for the speaker (¶[0060]: Figure 2); here, a voice pattern of a speaker is “vocal characteristics of the identified person that utters the spoken words”; voice analyzer 232, then, “identifies vocal characteristics of the identified person”, and retrieves a context profile for the speaker using these vocal characteristics to provide more complete information about a particular speaker;  
“translating the first plurality of spoken words of the first language into a second plurality of spoken words in a second language” – emotion translation component 250 efficiently translates text and emotion markup metadata to voice communication; emotion translation component translates text and emotion metadata into another language (¶[0076]: Figure 5); emotion translation component 250 comprises two separate architectures: text and emotion translation architecture 272 and speech and emotion synthesis architecture 270; text and emotion translation architecture 272 translates text into a different language than the original communication (¶[0077]: Figure 5); text is forwarded to text translator 252; text-to-text definitions within text-to-text 
“generating a translated audio based on the second plurality of spoken words of the second language speech and the vocal characteristics of the person that utters the first plurality of spoken words, wherein the translated audio comprises a second set of non-linguistic characteristics associated with the determined emotional state” – text and emotion translation architecture 272 converts the emotion data from emotion metadata expressed in one culture to emotion metadata relevant to another culture using a set of emotion to emotion definitions in emotion to emotion dictionary 255; if voice is desired, the translated text and translated emotion metadata is fed into speech and emotion synthesis architecture 270 which modulates the text into audible word sounds and adjusts the delivery with emotion using the translated emotion metadata (¶[0077]: Figure 5); with regard to emotion synthesis architecture 270, text and emotion markup metadata are utilized for synthesizing human speech; voice synthesizer 258 receives input text or text that is adjusted for emotion from text translator 252; the synthesized voice is then received at voice emotion adjuster 260, which adjusts the pitch, tone, and amplitude of the voice and changes the frequency or cadence of the voice delivery based on the emotion information it receives (“a second set of non-linguistic characteristics”); emotion to voice pattern definitions are selected using the context profiles for the user (¶[0083] - ¶[0084]: Figure 5); profile information specifies speaker information including a language, dialect, and geographic region (¶[0096]: Figure 7); voice emotion adjuster 260 receives translated text for a second language, e.g., French, 
“generating for output the retrieved video and the translated audio, wherein the translated output is output with the video instead of the first plurality of words” – emotion translation component 250 efficiently translates text and emotion markup metadata to voice communication; emotion translation component translates text and emotion metadata into another language (¶[0076]: Figure 5); Figure 5 illustrates that voice emotion adjuster 260 provides modulated voice with emotion as output (OUT); after text is translated from an original language to the language of a user with a text to text dictionary, and it is determined that the text is to be synthesized into audio, the modulated voice is adjusted for emotion by altering the tone, camber, and frequency of synthesized voice, and voice with emotion is output (¶[0102]: Figures 8A to 8B: Step 836); exemplary network topologies incorporating emotion preservation include content of data that may take the form of teleconferencing, multimedia entertainment, movies, television, cable programs, and videoconferencing (¶[0106]: Figure 10); configuring 
Concerning independent claims 1, 10, and 19, Subramanian et al. discloses the concept of translating and synthesizing speech so that emotion is retained when it is translated from a first language to a second language, where media can include video for multimedia entertainment, movies, television, and videoconferencing.  However, Subramanian et al. omits a buffer for the audio and video in the limitations of “buffering the media in a buffer as it is received”, “retrieving audio from the buffer”, and “retrieving video from the buffer”.  Still, it is known in the prior art to use buffering to synchronize audio and video signals due to differential processing times, e.g., in videoconferencing, where a goal is to enable synchronization of audio and video components.  Moreover, Subramanian et al. additionally discloses identifying a person that utters the first plurality of spoken words “based on metadata associated with the media”.  Here, Subramanian et al. discloses that a context analyzer 230 assumes the identity of a speaker/user of a cell phone as the owner of the phone from information contained in connection information, e.g., phone number, instant message screen name, or email address.  (¶[0049]: Figure 2)  Even if Subramanian et al. does not expressly disclose “metadata associated with the media” for identifying a speaker/user, this phone number or screen name of a person is equivalent to “metadata” that is used for determining an identity of a speaker/user (“identifying a person that utters the first plurality of spoken words based on metadata associated with the media”).  

Veeramani et al. teaches real time closed captioning that includes an audio interceptor to intercept an audio portion of an audio/video output stream of a multi-media application and a speech recognizer to recognize speech within the audio portion to output closed captions to complement video content of the audio/video streams.  (Abstract)  Moreover, Veeramani et al. teaches providing closed captioning in a different language other than the original language of the speech in the audio/video stream 124.  (¶[0020]: Figure 1)  Audio interceptor 214 and video interceptor 234 may respectively intercept the audio and video portions, i.e., Audio 212 and Video 232 of an audio/video stream.  Audio interceptor 214 and video interceptor 234 may respectively include audio and video delay buffers 216 and 236 to facilitate delay outputs 220 and 240 for the audio and video portions for an amount of time to provide time for speech in the audio portion to be recognized, and closed captions corresponding to the recognized speech to be automatically generated to complement the video portion all in real time.  (¶[0023]: Figure 2)  Veeramani et al., then, teaches “receiving media comprising video and audio, wherein the audio comprises a first plurality of words spoken in a first language” and “buffering the media in a buffer as it is received”.  Transliteration engines 108 may be invoked to transliterate detected speech in a first language into a second language.  (¶[0024]: Figure 2)  Additionally, speech recognizer 222 may be configured to recognize speakers of the recognized speech.  One embodiment provides that identification of a speaker may be provided by applications 104, e.g., in the case of VOIP (voice over internet protocol) applications, the originating source/user of an audio/video stream may be known and provided to speech recognizer 222.  
(¶[0025]: Figure 2)  Veeramani et al., then, provides “identifying a person that utters the first plurality of spoken words based on metadata associated with the media”.  Recognized speech phrases in a recognized speech may be transliterated from the original language in the audio portion to one or more other languages.  (¶[0033]: Figure 3)  User interface 400 is particularly suitable for multi-media applications including online meeting applications and VOIP applications that support sessions, and include speaker identifiers (ID) 432 of the speakers whose speech is included.  (¶[0037]: Figure 4)  Main controller 202 may be configured to control the amount of time audio and/or video interceptors 214 and/or 234 are to respectively delay the audio and/or video to account for the amount of time needed to recognize the speech and generate the closed captions.  (¶[0026]: Figure 2)  Main controller 202, then, provides for “retrieving audio from the buffer” and “retrieving the video from the buffer” to accommodate any delays for recognizing and translating words from a first language to a second language so that closed captions can be generated that are overlaid at an appropriate point in the video.  An objective is to break away from a language barrier to support a truly universal audience for audio/video content.  (¶[0003])  It would have been obvious to one having ordinary skill in the art to buffer media in a buffer for language translation and identify a user based on metadata as taught by Veeramani et al. to communicate across voice channels with emotion preservation of Subramanian et al. for a purpose of breaking away from a language barrier to support a universal audience.
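
For illustration only, the delay-buffer arrangement discussed above (audio and video held for a set time so that speech processing can finish before both are released together) can be sketched as follows. This is a minimal sketch under stated assumptions and is not code from Veeramani et al. or Subramanian et al.; the class name `DelayBuffer`, the `delay` parameter, and the frame labels are hypothetical.

```python
from collections import deque


class DelayBuffer:
    """Hold (timestamp, frame) pairs and release them after a fixed delay."""

    def __init__(self, delay):
        self.delay = delay      # hypothetical delay, in the same units as the timestamps
        self._frames = deque()  # buffered frames, in arrival order

    def push(self, timestamp, frame):
        """Buffer a frame as it is received."""
        self._frames.append((timestamp, frame))

    def pop_ready(self, now):
        """Retrieve every frame whose delay has elapsed at time `now`."""
        ready = []
        while self._frames and now - self._frames[0][0] >= self.delay:
            ready.append(self._frames.popleft()[1])
        return ready


# Audio and video share the same delay, so they stay aligned when released.
audio, video = DelayBuffer(delay=2), DelayBuffer(delay=2)
for t in range(4):
    audio.push(t, f"a{t}")
    video.push(t, f"v{t}")

released = list(zip(audio.pop_ready(now=3), video.pop_ready(now=3)))
```

In this sketch, the processing step (recognition, translation, synthesis) would run on the buffered audio during the delay window; only the buffering and retrieval steps are shown.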

Subramanian et al. discloses:
“transcribing the first plurality of spoken words in the first language” – text content of a voice communication is realized using word recognition techniques (¶[0005]); speech communication is fed to voice analyzer 232, which recognizes words by word recognition for matching concatenated chains of linguistic phonemes extracted from the audio stream to pre-constructed phoneme word models, the results of which are sent to transcriber 234 (¶[0052]: Figure 2); transcriber 234 receives a word solution from voice analyzer 232, and transcribes them into a textual solution (¶[0057]: Figure 2);
“translating the transcribed words of the first language into words of the second language” – text is translated from a source language into a target language using text translation definitions (¶[0006]); text and emotion translation architecture 272 translates text (¶[0077]: Figure 5); text is forwarded to text translator 252, and text-to-text definitions within text-to-text dictionary 253 are selected for translating the text into the user’s language (¶[0078]: Figure 5); text is translated from the original language to the language of the user with a text to text dictionary (¶[0102]: Figure 8B: Step 818); 
“wherein synthesizing the speech comprises synthesizing the speech from the translated words of the second language” – translated text and emotion words are modulated into a synthesized voice (¶[0006]); if voice is desired, the translated text and translated emotion metadata is fed into speech and emotion synthesis architecture 270 which modulates the text into audible word sounds and adjusts the delivery with emotion using the translated emotion metadata (¶[0077]: Figure 5); a check is made to determine whether to synthesize the text into audio; the translated text is modulated, and the modulated voice is adjusted for emotion (¶[0102]: Figure 8B: Steps 826 to 828).
Subramanian et al. discloses an emotion-text/phrase dictionary 220 (¶[0044]: Figure 2); when a matching word or phrase is found in emotion-text/phrase dictionary 220, the emotion definition for the word provides an inference to the speaker’s emotional state (“determining the emotional state expressed by the first plurality of spoken words based on matches resulting from the comparison”) (¶[0063]: Figure 2); text/phrase analyzer 236 mines emotional-phrase dictionary 220 for the emotional state of the speaker based on words and phrases the speaker employs for conveying a message (¶[0068]: Figure 2); here, emotion-text/phrase dictionary 220 is “a database” that includes “a plurality of emotional identifiers” for “determining the emotional state expressed by the first plurality of spoken words”.
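
For illustration only, the dictionary comparison discussed above (matching transcribed words and phrases against stored emotion definitions) can be sketched as follows. This is a minimal sketch and not the implementation of Subramanian et al.; the dictionary entries, the emotion labels, and the function name `infer_emotions` are hypothetical, and simple substring matching is assumed.

```python
# Hypothetical phrase-to-emotion entries; not taken from the reference.
EMOTION_PHRASES = {
    "thrilled": "joy",
    "fed up": "anger",
    "so sorry": "sadness",
}


def infer_emotions(transcript):
    """Return the emotion labels whose phrases appear in the transcript."""
    text = transcript.lower()
    return sorted({emotion for phrase, emotion in EMOTION_PHRASES.items()
                   if phrase in text})


matches = infer_emotions("I am fed up with this, but thrilled you called.")
```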
Concerning claims 5 and 14, Subramanian et al. discloses emotion-voice pattern dictionary 222 contains emotion to voice pattern definitions for deducing emotions from voice patterns in a communication (¶[0044]: Figure 2); if voice is desired, translated text and translated emotion metadata is fed into speech and emotion synthesis architecture 270 which modulates the text into audible word sounds and adjusts the delivery with emotion using the translated emotion metadata; voice emotion adjuster 260 retrieves voice patterns corresponding to the emotion metadata from emotion-voice pattern dictionary 222 (¶[0084]: Figure 5); here, emotion-voice pattern dictionary 222 is “a particular set of non-linguistic characteristics stored in a language translation database”.
Concerning claim 28, Subramanian et al. discloses that text and emotion markup abstraction for a voice communication in a source language is translated into a target language and then voice synthesized and adjusted for emotion; the emotion metadata is translated into emotion metadata for a target language using emotion translation 
Concerning claim 29, Subramanian et al. discloses that emotion recognition may operate by matching concatenated chains of sub-emotion speech patterns extracted from the audio stream to pre-constructed emotion unit models; the voice patterns include specific pitches, tones, cadences, and amplitudes (¶[0052]: Figure 2); voice analyzer 232 recognizes emotion by extracting voice patterns from the verbal communication that are indicative of emotion, i.e., pitch (“pitch”), tone (“tone”), cadence (“rhythm”), and amplitude (“volume”) of the verbal delivery that characterizes the emotion (“wherein the set of non-linguistic characteristics includes at least one of pitch, timbre, tone, accent, rhythm, or volume”) (¶[0068]: Figure 2); here, voice analyzer 232 determines an emotional state from “the set of non-linguistic characteristics” comprising pitch, tone, cadence, and amplitude.

Claims 3 and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Subramanian et al. (U.S. Patent Publication 2011/0184721) in view of Veeramani et al. (U.S. Patent Publication 20180188142) as applied to claims 1 to 2 and 10 to 11 above, and further in view of Appleby (U.S. Patent No. 6,463,404).
Subramanian et al. discloses language translation from a first language to a second language, but omits “retrieving from a language translation database, rules of grammar of the first language”, “determining using the rules of grammar of the first language, parts of speech of the translated words of the first language”, “retrieving from the language translation database rules of grammar of the second language”, and “translating the translated words of the first language into words of the second language based on the parts of speech assigned to the transcribed words of the first language and rules of grammar of the second language”.  That is, Subramanian et al. omits using rules of grammar and parts of speech in the first and second languages to perform translation.  However, it is known in the prior art to apply grammars to perform machine translation.  Generally, Appleby teaches document translation that parses the document using grammar rules specific to a source language.  (Abstract)  Specifically, server 200 stores data for use by a parser and an abstractor in each language.  This data comprises, for each language, a grammar rules database 227, 237, and an abstraction rules database 228, 230.  Multilingual lexical database 240 stores an entry for each word in any language represented within a translator program, giving the type of lexical e.g., whether it is a noun, a verb, a pronoun, an adjective (“parts of speech”), etc.  The grammar rules stored within each grammar rules database 227, 237 represent, for a corresponding language, the ways in which words of that language may be combined.  English may include one rule that indicates that a verb of ‘to see’ requires an object and a subject, and that in the active form the subject is the active participant or agent (the person who sees) and the object is the passive participant (the thing that is seen).  
(Column 5, Lines 33 to 55: Figure 6)  Text is processed by a source language parser program, which, for each word, applies the rules within the grammar rules database 227, which are applicable to words of that type.  If the English text contains the phrase “the dog saw the cat”, the word “the” is a definite article, and a rule within the grammar rules database 227 indicates that it can be followed by a noun to which it refers.  (Column 7, Lines 46 to 65)  An operation of the generator is essentially the reverse of that of the parser, where it operates to look up applicable rules in the target language rules database 237, and assemble the corresponding words located from lexical database 240 into a string of text ordered in accordance with grammar rules.  (Column 11, Line 66 to Column 12, Line 9)  An objective is to use a language independent intermediate structure to provide machine translation from a source language to a target language.  (Column 1, Lines 35 to 47)  It would have been obvious to one having ordinary skill in the art to use rules of grammar for a source language and a target language based on parts of speech of a first language as taught by Appleby to communicate across voice channels with emotion preservation of Subramanian et al. for a purpose of obtaining machine translation with a language independent intermediate structure.
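
For illustration only, the lexicon lookup and grammar-rule check discussed above can be sketched with the phrase “the dog saw the cat”. This is a minimal sketch and not the parser of Appleby; the lexicon entries and function names are hypothetical, and the rule is simplified here to require that every article be immediately followed by a noun.

```python
# Hypothetical lexicon; entries are illustrative only.
LEXICON = {"the": "article", "dog": "noun", "cat": "noun", "saw": "verb"}


def tag(sentence):
    """Assign a part of speech to each word by looking it up in the lexicon."""
    return [(word, LEXICON[word]) for word in sentence.lower().split()]


def articles_precede_nouns(tagged):
    """Check the simplified rule that each article is followed by a noun."""
    for i, (_, pos) in enumerate(tagged):
        if pos == "article":
            if i + 1 >= len(tagged) or tagged[i + 1][1] != "noun":
                return False
    return True


tagged = tag("the dog saw the cat")
valid = articles_precede_nouns(tagged)
```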

Claims 6, 9, 15, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Subramanian et al. (U.S. Patent Publication 2011/0184721) in view of Veeramani et al. (U.S. Patent Publication 20180188142) as applied to claims 1, 5, 10, and 14 above, and further in view of Kawatake (U.S. Patent Publication 2020/00012724).
Concerning claims 6 and 15, Subramanian et al. discloses determining a pitch of speech, which is a ‘non-linguistic characteristic’ known in the art to be conventionally useful for determining a gender of a speaker, retrieving from a database a plurality of non-linguistic characteristics, and synthesizing speech using the second plurality of non-linguistic characteristics.  However, Subramanian et al. does not disclose determining, based on a plurality of non-linguistic characteristics, “a gender of a person that utters the first plurality of spoken words”, and retrieving from a language translation database a second plurality of non-linguistic characteristics “based on the gender of the person”.  Still, gender determination is well-known in speaker recognition.
Concerning claims 6 and 15, Kawatake generally teaches bidirectional speech translation using a combination of a speech recognition engine, a translation engine, and a speech synthesis engine.  (Abstract)  Specifically, Kawatake teaches a speech synthesizer unit that synthesizes speech in accordance with gender of the first speaker estimated based on a feature amount of speech entered by the first speaker.  (¶[0010] and ¶[0012])  Speech synthesis engines 34 have different specifications, e.g., tones or types of speech to be synthesized.  (¶[0072]: Figure 3)  A value indicating a gender of a speaker who performs a speech entry operation may be set as a value of gender data included in log data to be generated, and a value indicating emotion of a speaker who e.g., anger, joy, and calm.  (¶[0087] - ¶[0090]: Figure 3)  Speech synthesis engine 34 reproduces attributes of a speaker including gender, and translation engine 28 is capable of reproducing a speaker’s attributes of gender as indicated by gender data.  (¶[0113] - ¶[0116]: Figure 7)  Kawatake, then, teaches these limitations of determining a gender of a person based on the plurality of non-linguistic characteristics, performing language translation from a first language to a second language, and synthesizing translated speech using a set of non-linguistic characteristics.  An objective is to provide a two-way conversation between speakers speaking first and second languages in a smooth manner.  (¶[0006])  It would have been obvious to one having ordinary skill in the art to determine a gender of a person from a set of non-linguistic characteristics to synthesize translated speech as taught by Kawatake to communicate across voice channels with emotion preservation of Subramanian et al. for a purpose of providing a two way conversation between languages in a smooth manner.
Concerning claims 9 and 18, Subramanian et al. discloses “determining the emotional state, translating the first plurality of spoken words, synthesizing speech from the second plurality of spoken words” and “generating the retrieved video with the synthesized speech for display”.  Subramanian et al. does not expressly disclose “receiving an input to generate alternate audio for a media stream”.  However, this is an obvious expedient as it is as simple as a button to enable a control unit 110 to begin translating for Subramanian et al.  Specifically, Kawatake teaches that speech entry operations may include tapping operation part 12da by a first speaker, entering speech in the first language while the operation part 12da is being tapped, and releasing the tap state of operation part 12da.  Subsequently, a speech entry operation by a second speaker may be a series of operations including tapping operation part 12db by the second speaker, entering speech in the second language while operation part 12db is being tapped, and releasing the tap state of operation part 12db.

Claims 7, 9, 16, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Subramanian et al. (U.S. Patent Publication 2011/0184721) in view of Veeramani et al. (U.S. Patent Publication 20180188142) as applied to claims 1, 5 to 6, 10, and 14 to 15 above, and further in view of Dalce (U.S. Patent Publication 2006/0048508).
Concerning claims 7 and 16, Subramanian et al. discloses determining a pitch of speech, which is conventionally known in the prior art as a ‘non-linguistic characteristic’, retrieving from a database a plurality of non-linguistic characteristics, and synthesizing speech using the second plurality of non-linguistic characteristics.  Moreover, Subramanian et al. discloses that voice analyzer 232 attempts to identify a speaker by comparing voice patterns in the conversation with voice patterns from identified speakers.  However, Subramanian et al. does not disclose determining, based on a plurality of non-linguistic characteristics, “an ethnicity of a person that utters the first plurality of spoken words”, and retrieving a second plurality of non-linguistic characteristics “based on the determined ethnicity of the person”.  
Concerning claims 7 and 16, Dalce teaches a universal language translator that can automatically translate a spoken word or phrase between speakers, and can synthesize a speaker’s voice into the dialect of the other speaker so that each speaker sounds like they’re speaking the language of the other.  A dialect detector could automatically select target dialects by listening to aspects of each speaker’s phrases.  (Abstract)  A voice recognition module detects phonetic elements of a speaker’s voice in order to help mimic the user’s voice.  The voice recognition module could detect a speaker’s pitch, speed of talking, intonation, and/or average audio frequency (“determining, based on the plurality of non-linguistic characteristics”), and could synthesize speech that emulates one or more of these auditory attributes.  The term dialect is used as a method of pronouncing a language that is specific to a culture or a region.  Dialects of English can include Southern, Bostonian, British, Australian, and South African, and dialects of Chinese can include Mandarin, Cantonese, and Shanghainese.  (¶[0012] - ¶[0013])  Compare Applicants’ Specification, ¶[0008] and ¶[0076], which broadly defines ‘ethnicity’ as encompassing accents from people in different parts of the country including an English accent from Boston and an English accent from Texas.  This is useful for one-way communication when a user is translating input from a radio.  (¶[0030])  By interposing a universal language translator Dalce to communicate across voice channels with emotion preservation of Subramanian et al. for a purpose of translating a sentence so that it is modulated to sound like the voice of the party speaking on the other end.
Concerning claims 9 and 18, Subramanian et al. discloses “determining the emotional state, translating the first plurality of spoken words, synthesizing speech from the second plurality of spoken words” and “generating the retrieved video with the synthesized speech for display”.  Subramanian et al. does not expressly disclose “receiving an input to generate alternate audio for a media stream”.  However, this is an obvious expedient as it is as simple as a button to enable a control unit 110 to begin translating for Subramanian et al.  Specifically, Dalce expressly teaches that a user touches a button for “Start speaking” and changes to a button for “Finished speaking”.  (¶[0032])  Dalce’s buttons, then, are “an input to generate alternate audio for a media stream”.  
 
Claims 8 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Subramanian et al. (U.S. Patent Publication 2011/0184721) in view of Veeramani et al. (U.S. Patent Publication 20180188142) as applied to claims 1 and 10 above, and further in view of Zhu et al. (U.S. Patent Publication 2012/0069974).
Subramanian et al. discloses that voice analyzer 232 attempts to identify a speaker by comparing voice patterns in the conversation with voice patterns from identified speakers (“determining an identity of a person that utters the first plurality of spoken words”).  If voice analyzer 232 recognizes a speaker’s voice from the voice patterns, context analyzer 230 is notified which then selects a context profile for the speaker from profile database 212.  (¶[0050]: Figure 2)  Subramanian et al. does not disclose retrieving from a language translation database a plurality of voice samples of the person, calculating a vocal fingerprint of the person using the plurality of voice samples, and synthesizing speech based on the calculated vocal fingerprint.  However, it is known in the prior art to perform speech synthesis for an identified person using voice samples and a vocal fingerprint.  Specifically, Zhu et al. teaches text-to-multi-voice messaging, where an end user is able to select different voices (“a vocal fingerprint”) for translating different portions of a text message.  (Abstract)  A voice message is generated from input text and one or more voice samples associated with one or more contacts in a sender’s address book.  The sender is able to select different contacts’ voices which are to be used to translate portions of the input text.  (¶[0026])  The end user could input as text the dialogue between Little Red Riding Hood and the Wolf, and specify that Aunt Alice’s voice is to be used for translating the Little Red Riding Hood portion of the dialogue and that Uncle Bob’s voice be used for translating the Wolf’s portion.  (¶[0027])  A voice sample database 212 contains voice samples which can be used by translator 210 to synthesize one or more voice segments associated with text portions of a message.  Translator 210 verifies whether its voice sample database 212 contains samples of the voice of the requested owner of the voice.  Zhu et al. determines “an identity of a person that utters the first plurality of spoken words” at least because a sender is able to select an identity of a voice to be used to translate a portion of text.  Then, “a plurality of voice samples” are retrieved to translate a voice message.  Implicitly, “a vocal fingerprint of the person” is calculated when speech is synthesized.  That is, “a vocal fingerprint” is simply that collection of voice samples that define speech of a contact, Aunt Alice, or Uncle Bob.  An objective is to provide end users with interesting new communication services.  (¶[0007])  It would have been obvious to one having ordinary skill in the art to determine an identity of a person to utter a first plurality of words by retrieving a plurality of voice samples to synthesize speech with a vocal fingerprint as taught by Zhu et al. to communicate over voice channels with emotion preservation of Subramanian et al. for a purpose of providing end users with interesting new communication services.

Response to Arguments
Applicants’ arguments filed 03 March 2021 have been considered but are moot in view of new grounds of rejection, necessitated by amendment.
Applicants amend independent claims 1, 10, and 19 to set forth new limitations, and present arguments traversing the prior rejection of these independent claims as being obvious under 35 U.S.C. §103 over Subramanian et al. (U.S. Patent Publication 2011/0184721) and Rangarajan Sridhar et al. (U.S. Patent Publication 2017/0372693).  Applicants argue that Subramanian et al. and Rangarajan Sridhar et al. fail to disclose or teach the limitation of “identifying a person that utters the first plurality of spoken words based on metadata associated with the media”.  Applicants contend that Subramanian et al., at ¶[0050], discloses that the identity of the speaker may be determined from voice patterns in the communication, where voice analyzer 232 attempts to identify a speaker by comparing voice patterns in the conversation with voice patterns from identified speakers.  Applicants thus argue that Subramanian et al. only determines an identity of a speaker by comparing voice patterns, but does not identify a person based on ‘metadata associated with the media’, and contend that this is similarly not taught by Rangarajan Sridhar et al.
Generally, Applicants’ argument is not persuasive, but new grounds of rejection are necessitated by the amendment.  Here, Applicants’ independent claims are now rejected as being obvious under 35 U.S.C. §103 over Subramanian et al. (U.S. Patent Publication 2011/0184721) in view of Veeramani et al. (U.S. Patent Publication 20180188142).  The rejection no longer relies upon Rangarajan Sridhar et al.  Instead, Veeramani et al. is relied upon to teach the limitations of buffering media in a buffer as it is received, retrieving audio from the buffer, retrieving video from the buffer, and identifying a person that utters the first plurality of spoken words “based on metadata associated with the media”.  Applicants’ extensive amendments that change the direction of the claimed subject matter necessitate a new search.  The rejection of some of the dependent claims continues to rely upon Appleby (U.S. Patent No. 6,463,404), Kawatake (U.S. Patent Publication 2020/00012724), Dalce (U.S. Patent Publication 2006/0048508), and Zhu et al. (U.S. Patent Publication 2012/0069974).
However, Applicants’ arguments are not totally persuasive as directed against Subramanian et al.  Here, Applicants argue that Subramanian et al. only discloses identifying a speaker by comparing voice patterns, but does not identify a speaker “based on metadata associated with the media”.  But Subramanian et al. discloses an alternative embodiment that equivalently identifies a speaker “based on metadata associated with the media”.  Specifically, Subramanian et al., at ¶[0049], discloses an embodiment that includes a cell phone, where context analyzer 230 assumes the identity of a speaker/user as the owner of the phone from connection information, e.g., phone number, instant messaging screen name, or email address.  Here, a phone number, instant message screen name, or email address is “metadata associated with the media”.  That is, if a speaker is making a phone call with his/her cell phone, then a speaker can be identified from the speaker’s phone number, and if a user is sending an . . .  Subramanian et al. is not actually limited to only an embodiment of identifying a speaker by comparing a speaker’s voice to a voice pattern, but discloses an alternate embodiment of assuming an identity of a speaker from metadata of a phone number.  Admittedly, Subramanian et al. does not expressly describe the phone number of a speaker as being “metadata”, but one skilled in the art would understand that a phone number is metadata identifying the speaker, and that this metadata of a phone call is transmitted along with the audio of the phone call.  Actually, Subramanian et al. does include the terminology of ‘metadata’ in describing ‘emotion metadata’, but one skilled in the art could understand that ‘metadata’ can be used to describe connection information of a phone number being assumed to identify a speaker in Subramanian et al.
Moreover, this limitation of “identifying a person that utters the first plurality of spoken words based on metadata associated with the media” is taught in a similar way by Veeramani et al.  Here, Veeramani et al., at ¶[0025], teaches that speech recognizer 222 may be configured to recognize speakers of the recognized speech, where identification of speakers may be provided by applications 104, e.g., in the case of VOIP applications, the originating source/user of an audio/video stream may be known, and may thus be provided to speech recognizer 222 instead.  Veeramani et al., at ¶[0037]: Figure 4, then teaches and illustrates a user interface 400 having speaker identifiers (ID) 432.  Again, one skilled in the art would understand that an identification of a speaker provided by an application is “metadata associated with the media”.  Voice over Internet Protocol (VoIP) enables a user to make a telephone call over the Internet, and Subramanian et al.
Additionally, Veeramani et al., ¶[0023]: Figure 2, clearly teaches “receiving media comprising audio and video”, “buffering the media in a buffer as it is received”, “retrieving audio from the buffer”, and “retrieving video from the buffer”.  Here, Veeramani et al. describes an audio delay buffer 216 and a video delay buffer 236 in Figure 2.  Audio is retrieved from audio delay buffer 216 to perform speech recognition by speech recognizer 222, and video is retrieved from video delay buffer 236 to combine video with closed captioning.  Additionally, Veeramani et al. is directed to embodiments having a same feature of transliterating words in audio from a first language to a second language as provided by Subramanian et al.  Veeramani et al.’s transliterating engines 108 appear to operate to perform the same function of translating audio from a first language to a second language so that an audio/video stream can break away from a language barrier to support a universal audience.  (¶[0003])  
Subramanian et al. reasonably discloses the remaining limitations of the independent claims directed to “identifying vocal characteristics of the identified person that utters the first plurality of spoken words” and generating a translated audio based on “the vocal characteristics of the person that utters the first plurality of spoken words”.  Here, “vocal characteristics of the identified person” is being broadly construed.  Applicants’ Specification only provides a limited description of this term “vocal characteristics”.  The Specification, ¶[0024] - ¶[0025] and ¶[0078], describes these ‘vocal characteristics’ in relation to an actor, Tom Hanks.  However, “vocal characteristics” can be more broadly construed than corresponding to a particular actor.  Subramanian et al. discloses that a speaker profile specifies a speaker’s language, dialect, geographic region, and personality attributes that define the uniqueness of the speaker’s communication, including the speech patterns that the speaker uses to convey emotion.  (¶[0048])  Voice analyzer 232 performs speaker voice analysis, and context analyzer 230 passes speaker voice pattern information for each speaker profile contained in profile database 212, which then sends a more complete context profile for the speaker.  (¶[0061]: Figure 2)  Subramanian et al.’s ‘voice patterns’ for a speaker generated by a combination of voice analyzer 232 and information from a speaker profile database 212 are equivalent to “vocal characteristics of the identified person”.  That is, these “vocal characteristics” can simply be that a speaker’s unique style of communication is a dialect corresponding to a geographic region of the Southern United States, and does not necessarily require identification of a speaking style of a particular actor or celebrity, e.g., Tom Hanks.
During patent examination, the pending claims must be “given their broadest reasonable interpretation consistent with the specification.”  Phillips v. AWH Corp., 415 F.3d 1303, 1316, 75 USPQ2d 1321, 1329 (Fed. Cir. 2005).  Because applicant has the opportunity to amend the claims during prosecution, giving a claim its broadest reasonable interpretation will reduce the possibility that the claim, once issued, will be interpreted more broadly than is justified.  In re Yamamoto, 740 F.2d 1569, 1571 (Fed. Cir. 1984); In re Zletz, 893 F.2d 319, 321, 13 USPQ2d 1320, 1322 (Fed. Cir. 1989) ("During patent examination the pending claims must be interpreted as broadly as their terms reasonably allow."); In re Prater, 415 F.2d 1393, 1404-05, 162 USPQ 541, 550-51 (CCPA 1969).  See MPEP §2111.
Subramanian et al. discloses “generating a translated audio, based on . . . vocal characteristics of the person that utters the first plurality of spoken words”.  That is, Subramanian et al. does not merely disclose emotion translation, but uses context profiles of a user to generate a synthesized voice.  Here, Subramanian et al., ¶[0084], discloses that voice emotion adjuster 260 retrieves voice patterns corresponding to emotion to voice pattern dictionary 222 that are selected according to context profiles for the user.  Emotion translation, then, is produced with information specific to a particular user as determined from context profiles.  An emotion voice synthesizer 270 adjusts a voice pattern of an emotion according to profile information of a speaker.  If speech in a first language is translated to speech in a second language, then an emotion is preserved during the translation that takes into account voice patterns of a speaker determined by a speaker profile database 212.
These new grounds of rejection are necessitated by amendment.  This Office Action, then, is properly FINAL.

Conclusion
The prior art made of record and not relied upon is considered pertinent to Applicants’ disclosure.
Fuller et al., Arn et al., Sullivan et al., Amsterdam et al., and Hall et al. disclose related prior art.
Applicants’ amendment necessitated the new grounds of rejection presented in this Office Action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP 
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MARTIN LERNER whose telephone number is (571) 272-7608.  The examiner can normally be reached on Monday-Thursday 8:30 AM-6:00 PM.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel Washburn, can be reached on (571) 272-5551.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-






/MARTIN LERNER/Primary Examiner
Art Unit 2657
March 15, 2021