Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Claims 1-16 and 18-21 are pending.  Claims 1, 15, and 20 are independent.  Claim 17 was canceled by the most recent amendments, and new Claim 21, which depends from Claim 20, was added.  The independent Claims are amended to include the limitations of canceled Claim 17, and the dependent Claims are amended to include references to the first and second individuals.
This Application was published as US 20200211530.
Apparent priority: March 2019.

Claims 1-16 and 18-19 are allowed.  Claims 20-21 are rejected.

Applicant’s amendments and arguments are considered but are either unpersuasive or moot in view of the new grounds of rejection that, if presented, were necessitated by the amendments to the Claims.
This action is Final.

The now-canceled Claim 17 had been found allowable in the previous Office action by the previous Examiner:

    [Image: media_image1.png (not reproduced)]

This Claim depended from Claim 15 which is similar to Claim 1 but not to the other independent Claim 20.

Figures 21A, 21B and embodiment 15 discuss the ratios of volumes for two speakers.  See the published Application:
[0357] FIG. 21A is a flowchart of an example method 2100 for artificially generating a revoiced media stream in which a ratio of the volume levels between different characters in the revoiced media stream is substantially identical to a ratio of volume levels between the characters in the received media stream. …
[0360] … According to step 2108, the processing device may analyze the media stream to determine voice profiles for the first individual and the second individual, wherein the voice profiles are indicative of a ratio of volume levels between the first and second utterances as they were spoken in the media stream. …

[0367] FIG. 21B is a schematic illustration depicting an implementation of method 2100. In the figure, original media stream 110 includes individual 113 and individual 116 that speak in English. Consistent with disclosed embodiments, the system may determine that a voice profile for each of individual 113 and individual 116. The voice profile may be indicative of a ratio of volume levels between the utterances spoken by individual 113 and utterance spoken by individual 116 in the media stream. In the depicted example, the font size illustrates the volume level. Specifically, individual 113 speak loader than individual 116. The system may artificially generate a revoiced media stream in which the individual speaks the translated transcript. In revoiced media stream 150, the ratio of the volume levels between individual 113 and individual 116 in the revoiced media stream is substantially identical to the ratio of volume levels between the individual 113 and individual 116 in the original media stream.  
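The ratio preservation described in the quoted paragraphs can be illustrated with a minimal Python sketch. It assumes that a volume level is measured as the root-mean-square (RMS) amplitude of an utterance's samples, which the quoted paragraphs do not specify. The names `rms` and `match_ratio` and the sample values are hypothetical and are not taken from the Application.

```python
import math

def rms(samples):
    """Root-mean-square amplitude, used here as an assumed proxy for volume level."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def match_ratio(orig_first, orig_second, new_first, new_second):
    """Scale the second revoiced utterance so that the revoiced volume ratio
    (first : second) equals the ratio measured in the original media stream.
    Assumes none of the utterances is silent (non-zero RMS)."""
    target_ratio = rms(orig_first) / rms(orig_second)
    current_ratio = rms(new_first) / rms(new_second)
    gain = current_ratio / target_ratio  # gain applied to the second utterance
    return new_first, [s * gain for s in new_second]

# Hypothetical samples: individual 113 speaks louder than individual 116.
orig_113 = [0.8, -0.8, 0.8, -0.8]
orig_116 = [0.2, -0.2, 0.2, -0.2]
# Synthesized (revoiced) utterances that start out equally loud.
tts_113 = [0.5, -0.5, 0.5, -0.5]
tts_116 = [0.5, -0.5, 0.5, -0.5]

out_113, out_116 = match_ratio(orig_113, orig_116, tts_113, tts_116)
```

In this sketch only the second utterance is rescaled. Scaling the first utterance instead, or both, would preserve the ratio equally well.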
Allowable Subject Matter
Pending Claims 1-16 and 18-19 are allowed.
The following is an examiner’s statement of reasons for allowance: In view of each of the particular limitations of the independent Claims when considered in the order established by the Claim language and in the context of the language of the independent Claims when each Claim is considered as a whole, the independent Claims of this Application were not found in the prior art that was viewed.
In particular, the concept of actively calculating a ratio of the volumes of speech input by first and second speakers, preserving this calculated ratio in the voice profile associated with the input audio stream, and using the ratio to adjust the volume of the translated and synthesized speech corresponding to each of the first and second speakers, when considered in the context of the independent Claims as a whole and including each and every limitation of these Claims, was not found in the prior art.  The ratio is independently and initially calculated and saved as part of the voice profile and is not an inherent byproduct of conforming the volumes of the translated synthesized portions of speech to the input volumes of the voices of the original speakers.  The independent role of the ratio of volumes is highlighted in dependent Claims 11 and 12.  According to the supporting Specification, the voice profile for each speaker/user includes the ratio that is calculated between the loudness of his voice and the loudness of the voice of the other party to the conversation.  This is different from giving effect to the loudness of the input voices in the synthesized output, as shown by the applied reference below.
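The distinction drawn above can be sketched in Python. The sketch contrasts a profile that stores an explicitly calculated ratio with a record that stores only the detected per-speaker volumes. All class, field, and function names (`RatioVoiceProfile`, `VolumeOnlyRecord`, `volume_ratio`, and so on) are hypothetical and appear in neither the Application nor the applied reference.

```python
from dataclasses import dataclass

@dataclass
class RatioVoiceProfile:
    """Profile in which the ratio is calculated up front and stored."""
    volume_ratio: float  # first speaker volume / second speaker volume

@dataclass
class VolumeOnlyRecord:
    """Record that keeps only the detected per-speaker volumes (no ratio stored)."""
    first_volume: float
    second_volume: float

def build_profile(first_volume, second_volume):
    # Active, initial calculation of the ratio while analyzing the input.
    return RatioVoiceProfile(volume_ratio=first_volume / second_volume)

def output_from_profile(profile, second_output_volume):
    # The stored ratio drives the output volume of the first speaker.
    return profile.volume_ratio * second_output_volume, second_output_volume

def output_from_record(record):
    # Volumes are carried over directly; the ratio survives only as a byproduct.
    return record.first_volume, record.second_volume

profile = build_profile(60.0, 30.0)
print(output_from_profile(profile, 20.0))                  # (40.0, 20.0)
print(output_from_record(VolumeOnlyRecord(60.0, 30.0)))    # (60.0, 30.0)
```

Both paths yield an output ratio of 2, but only the first stores and uses the ratio itself, so the output volumes can differ from the input volumes while the ratio is kept.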
Any comments considered necessary by applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee. Such submissions should be clearly labeled “Comments on Statement of Reasons for Allowance.”
Close Art of Record
In addition to the art applied to the Claims during the prosecution of the instant Application, note the application of Gabryjelski (U.S. 2020/0058289) to the Claims.  Gabryjelski is the closest art and is very close.  The only deviation is the initial and active calculation of the ratio of the volumes of the portions of input speech from the first and second speakers and the use of this ratio (as opposed to the actual volumes) to obtain the output volumes.  Yun (US 2017/0255616) was considered the closest art to the feature of calculation of “ratios,” but Yun applies to calculation of a ratio of loudness/volume of utterances by a single user.
Note the application of Gabryjelski to the Claims and the points of shortcoming:
Regarding Claim 1, Gabryjelski teaches:
1. A computer program product for artificially generating a revoiced media stream, the computer program product embodied in a non-transitory computer-readable medium and including instructions for causing at least one processor to execute a method comprising:
receiving a media stream including a first individual and a second individual speaking in an origin language; [Gabryjelski, Figure 3 shows the receiving of the “bit stream” at the “decoding module 310.”  The bit stream includes “a media content may include a movie, a television program, a video clip, a video game, or any other recorded media content. …”  ([0035]).  A movie would include first and second individuals speaking.  “[0037] The audio signal may be processed at the speech separation module 3204 to obtain speeches from the audio signal.”  “[0046] At speech grouping module 3206, the speeches may be grouped according to different speakers or their voices.”  The system is for “dubbing” and therefore would include a source/origin language and a target language.]
obtaining a transcript of the media stream including a first utterance and a second utterance spoke in the original language; [Gabryjelski, Figure 3, “STT Module 3208” transcribes the received “speeches” into “text.”  “[0057] At speech to text (STT) module 3208, the speeches may be converted into texts. …. In addition, characteristics of the speech such as stress, tonality, speed, volume, inflection and so on may be detected from the speech at the STT module.”]
translating the transcript of the media stream to a target language, wherein the translated transcript includes a first set of words in the target language that corresponds with the first utterance and a second set of words in the target language that corresponds with the second utterance; [Gabryjelski, Figure 3, “MT Module 3210.”  “[0060] At a machine translation (MT) module 3210, the texts generated by STT module 3208 in a first language may be translated automatically to texts in a second language.”]
analyzing the media stream to determine at least one voice profile, [Gabryjelski, Figure 3, the “speeches” are also sent to “Voice Print Creating Module 3212” to generate voice profiles.  “[0063] At a voice print creating module 3212, a voice print model may be created for a voice based on the speeches of the voice….”]
wherein the at least one voice profile is indicative of a ratio of volume levels between the first utterance as spoken by the first individual and the second utterance as spoken by the second individual  in the media stream; [Not Taught by Gabryjelski.  This limitation requires that the ratio is calculated and stored in the Voice Profile at the time that the Voice Profile is being generated which is during the capture or analysis of the input audio.]
determining metadata information for the translated transcript, wherein the metadata information includes desired volume levels for each of the first and second sets of words that correspond with the first and second utterances; and  [Gabryjelski, Figure 3, volume information may be obtained from the input speech and provided to the speech synthesizer by metadata:  “[0066] … As mentioned above, the characteristics such as stress, tonality, speed, volume, inflection and so on may be detected from the speech at the STT process or may be obtained from the metadata.”  “[0075] In an embodiment, in the processing of the extracted speeches, the extracted speeches of the voice in a first language may be translated to the replacement speeches in a second language by utilizing the voice print model. The translated replacement speeches may be generated by further utilizing characteristics of the extracted speeches of the voice, where the characteristics includes at least one of a stress, a tonality, a speed, a volume and an inflection of the speeches, which may be contained in the metadata or may be detected from the speeches.”]
using the determined at least one voice profile, the translated transcript, and the metadata information to artificially generate a revoiced media stream in which the first individual and the second individual sound as they speak the translated transcript,  [Gabryjelski, Figure 3, “TTS Module 3214” converts the “Texts” into “Replacement Speeches” with input from the “Voice Print Creating Module 3212.”  The speeches of the individual speakers/actors are then combined in the “Combining Module 3216” to arrive at the output “Dubbed Audio.”  “[0066] At a text to speech (TTS) module 3214, the TTS conversion may be perform on the translated text in the second language based on the voice print model output by the voice print creating module 3212 to generate a speech in the second language and in the original actor's voice. In addition, the characteristics such as stress, tonality, speed, volume, inflection and so on may be applied during the TTS to generate the speech in the second language. As mentioned above, the characteristics such as stress, tonality, speed, volume, inflection and so on may be detected from the speech at the STT process or may be obtained from the metadata.”  “[0067] At a combining module 3216, the replacement speeches in the second language may be used to replace the corresponding speeches in the first language to obtain the dubbed audio….”]
wherein a ratio of the volume levels between the first and second sets of words in the revoiced media stream is substantially identical to the ratio of volume levels between the first and second utterances in the received media stream. [Gabryjelski, Figure 3, the output “Dubbed Speech,” which includes a combination of the translated utterances, utilizes the “stress, a tonality, a speed, a volume and an inflection” of the input utterances/speeches in the translated output.  (See claim 6 of Gabryjelski, e.g.).  Accordingly, this limitation is inherently met: if the output volumes stay the same as, or proportional to, the input volumes, their ratio remains unchanged.]
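The order of the Gabryjelski Figure 3 modules relied on in the mapping above can be summarized in a short Python sketch. The stage names follow the module names quoted in the mapping. Every function body is a hypothetical placeholder standing in for the corresponding module and is not Gabryjelski's implementation, and the `Speech` type is an assumption made for illustration. The input is assumed to have already passed through the speech separation and speech grouping modules.

```python
from dataclasses import dataclass

@dataclass
class Speech:
    speaker: str   # label assumed to come from the speech grouping stage
    text: str      # stands in for the audio content of the speech
    volume: float  # characteristic detected from the speech

def stt(speech):
    # "STT Module 3208": speech to text (placeholder).
    return speech.text

def mt(text):
    # "MT Module 3210": first language to second language (placeholder).
    return f"<translated:{text}>"

def voice_print(speaker):
    # "Voice Print Creating Module 3212": one model per voice (placeholder).
    return f"voiceprint:{speaker}"

def tts(text, print_model, volume):
    # "TTS Module 3214": the detected volume is carried over unchanged.
    return {"voice": print_model, "text": text, "volume": volume}

def dub(speeches):
    # "Combining Module 3216": collect the replacement speeches in order.
    return [tts(mt(stt(s)), voice_print(s.speaker), s.volume) for s in speeches]

dubbed = dub([Speech("A", "hello", 0.8), Speech("B", "hi", 0.2)])
```

Because each volume is passed through unchanged in this sketch, the ratio between the two speakers' volumes is the same at the output as at the input, without any ratio being calculated or stored.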

    [Image: media_image2.png (not reproduced)]

    [Image: media_image3.png (not reproduced)]

    [Image: media_image4.png (not reproduced)]


Regarding Claim 2, Gabryjelski teaches:
2.     The computer program product of claim 1, 
wherein the at least one voice profile is further indicative of intonation differences between the first utterance as spoken by the first individual and the second utterance as spoken by the second individual in the media stream and [Gabryjelski considers the “tonality” of speech of a speaker as one of the characteristics to be preserved and reflected in the synthesized output voice.  The intonation recited in the Claim is taught by any of “tonality,” “stress,” or “inflection” of Gabryjelski. “[0057] … In addition, characteristics of the speech such as stress, tonality, speed, volume, inflection and so on may be detected from the speech at the STT module.”  “[0066] …As mentioned above, the characteristics such as stress, tonality, speed, volume, inflection and so on may be detected from the speech at the STT process or may be obtained from the metadata.”  “[0068] In the illustrated embodiment of FIG. 3, the voice print creating module 3212 may generate a voice print model for a voice or an actor based on the actor's own voice. …”]
  the method further comprising:
generating a revoiced media stream in which intonation differences between the first and second sets of words in the revoiced media stream are substantially identical to the intonation differences between the first and second utterances in the received media stream. [Gabryjelski considers the “tonality” of speech of a speaker as one of the characteristics to be preserved and reflected in the synthesized output voice. “6. The method of claim 5, wherein the translating further comprises: generating the translated replacement speeches by further utilizing characteristics of the extracted speeches of the voice, wherein the characteristics includes at least one of a stress, a tonality, a speed, a volume and an inflection of the speeches.”]

Regarding Claim 3, Gabryjelski teaches:
3.     The computer program product of claim 1, wherein the method is further comprising:
determining that the first utterance as spoken by the first individual was pronounced as a question and that the second utterance as spoken by the second individual was pronounced as a statement; and [Gabryjelski.  Determining stress and inflection indicates whether an utterance is a question or a statement.  See the detection and preservation of “a stress, a tonality, a speed, a volume and an inflection of the speeches” to be reflected in the synthesized speech.]
generating a revoiced media stream in which the first set of words are pronounced as a question and the second set of words are pronounced as a statement. [Gabryjelski, “[0063] At a voice print creating module 3212, a voice print model may be created for a voice based on the speeches of the voice….”  “[0066] At a text to speech (TTS) module 3214, the TTS conversion may be perform on the translated text in the second language based on the voice print model output by the voice print creating module 3212 to generate a speech in the second language and in the original actor's voice. In addition, the characteristics such as stress, tonality, speed, volume, inflection and so on may be applied during the TTS to generate the speech in the second language. As mentioned above, the characteristics such as stress, tonality, speed, volume, inflection and so on may be detected from the speech at the STT process or may be obtained from the metadata.”]

Regarding Claim 4, Gabryjelski teaches and thus suggests:
4.     The computer program product of claim 1, wherein the at least one voice profile is further indicative of pitch differences between the first utterance as spoken by the first individual and the second utterance as spoken by the second individual and second utterances as they were spoken in the media stream and [Gabryjelski teaches that “pitch” is used as one of the parameters of the speech of the speaker:  “[0048] In an implementation, a voice clustering process may be used to clustering speeches to be associated with different speakers or their voices even in the case of lacking existing knowledge of the speaker's voice characteristics. The voice clustering process may utilize various parameters such as spectrum, pitch, tone and so on. …”  Gabryjelski does not expressly teach that the pitch of the voice of the speakers is detected and preserved in the “voice prints” but the teaching that “stress, tonality, speed, volume, inflection and so on” are preserved and the teaching that “pitch” is a parameter to be considered, taken together, suggest that “pitch” is a candidate parameter to be detected and preserved for use by the TTS.]
the method further comprising:
generating a revoiced media stream in which pitch differences between the first and second sets of words in the revoiced media stream are substantially identical to the pitch differences between the first and second utterances in the received media stream. [Gabryjelski, see [0048] above.  In the case that the actual voice of the speaker is not available, a voice model that is selected based on its pitch and tone is used for the TTS operation. See [0057], [0066], and [0068] and rejection of Claim 2.]

Regarding Claim 5, Gabryjelski teaches and thus suggests:
5.     The computer program product of claim 1, wherein the at least one voice profile is further indicative of accent differences between the first utterance as spoken by the first individual and the second utterance as spoken by the second individual  in the media stream and the method further comprising: generating a revoiced media stream in which accent differences between the first and second sets of words in the revoiced media stream are substantially identical to the accent differences between the first and second utterances in the received media stream. [Gabryjelski suggests this limitation by teaching detection and preservation of “a stress, a tonality, a speed, a volume and an inflection of the speeches” to be reflected in the synthesized speech.  “Stress” is another name for “accent.”]

Regarding Claim 6, Gabryjelski teaches:
6.     The computer program product of claim 1, wherein the method is further comprising:
determining that the first individual shouted the first utterance and that the second individual whispered the second utterance; and [Gabryjelski teaches detection and preservation of  “volume” / “loudness” as included in “a stress, a tonality, a speed, a volume and an inflection of the speeches” to be reflected in the synthesized speech.  See [0057], [0066], and [0068] and rejection of Claim 2.]
generating a revoiced media stream in which the that sounds as the first individual shouts the first set of words in the target language and the second individual whispers the second set of words in the target language. [Gabryjelski reflects the detected Volume of speech in the synthesized output speech.  See [0057], [0066], and [0068] and rejection of Claim 2.]

Regarding Claim 7, Gabryjelski teaches and thus suggests:
7.     The computer program product of claim 1, wherein the method is further comprising: 
determining that that the first individual spoke the first utterance in a cynical voice and that the second individual spoke the second utterance in a regular voice; and [Gabryjelski teaches detection and preservation of “a stress, a tonality, a speed, a volume and an inflection of the speeches” to be reflected in the synthesized speech.  See [0057], [0066], and [0068] and rejection of Claim 2.  Taken together, these characteristics indicate whether cynicism comes through in the voice.]
generating a revoiced media stream that sounds as the first individual pronounces the first set of words in the target language in a cynical voice and the second individual pronounces the second set of words in the target language in a regular voice. [Gabryjelski reflects the detected tone of speech in the synthesized output speech.  See [0057], [0066], and [0068] and rejection of Claim 2.]

Regarding Claim 8, Gabryjelski teaches:
8.     The computer program product of claim 1, wherein the method is further comprising:
analyzing the media stream to determine volume levels for the first utterance as spoken by the first individual and the second utterance as spoken by the second individual in the media stream; and [Gabryjelski teaches detection and preservation of  “volume” / “loudness” as included in “a stress, a tonality, a speed, a volume and an inflection of the speeches” to be reflected in the synthesized speech.  See [0057], [0066], and [0068] and rejection of Claim 2.]
generating a revoiced media stream in which the first and second sets of words are spoken in the target language at the determined levels of volume. [Gabryjelski.  “6. The method of claim 5, wherein the translating further comprises: generating the translated replacement speeches by further utilizing characteristics of the extracted speeches of the voice, wherein the characteristics includes at least one of a stress, a tonality, a speed, a volume and an inflection of the speeches.”]

Regarding Claim 9, Gabryjelski teaches:
9.     The computer program product of claim 1, wherein the method is further comprising:
analyzing the media stream to determine volume levels for the first utterance as spoken by the first individual and the second utterance as spoken by the second individual in the media stream; and [Gabryjelski, “[0066] … As mentioned above, the characteristics such as stress, tonality, speed, volume, inflection and so on may be detected from the speech at the STT process or may be obtained from the metadata….”]
generating a revoiced media stream in which the first and second sets of words are spoken in the target language at lower levels of volume than the determined level of volume in the media stream. [Gabryjelski has controls to set the volume:   “[0021] The media player 10 may include a user interface for interacting with a user. For example, the media player 10 may include a display window for displaying the played video, may include a volume bar for adjusting the volume of the played audio, and may include various menu items. For sake of simplicity, only the menu items related to dubbing are shown in FIG. 1, and the display window, the volume bar and other possible components are not shown in the FIG. 1.”]

Regarding Claim 10, Gabryjelski teaches:
10.     The computer program product of claim 1, wherein the method is further comprising:
analyzing the media stream to determine volume levels for the first utterance as spoken by the first individual and the second utterance as spoken by the second individual in the media stream; and [Gabryjelski, “[0066] … As mentioned above, the characteristics such as stress, tonality, speed, volume, inflection and so on may be detected from the speech at the STT process or may be obtained from the metadata….”]
generating a revoiced media stream in which the first and second sets of words are pronounced in the target language at higher levels of volume than the determined level of volume in the media stream. [Gabryjelski has controls to set the volume:   “[0021] …, may include a volume bar for adjusting the volume of the played audio, a….”]

Claim 11:
11.     The computer program product of claim 1, wherein the method is further comprising:
accessing user settings defining minimum and maximum volume levels in a revoiced media stream; and
determining to deviate from the ratio of volume levels between the first utterance as spoken by the first individual and the second utterance as spoken by the second individual in the received media stream based on user settings.
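One possible reading of Claim 11 is sketched below in Python. It assumes that the user settings are a pair of positive minimum and maximum volume levels and that the deviation arises from clamping each desired volume to those limits. The Claim does not state how the determination is made, and the names `apply_user_limits`, `min_volume`, and `max_volume` are hypothetical.

```python
def apply_user_limits(first_volume, second_volume, min_volume, max_volume):
    """Clamp both desired volumes to the user's limits and report whether
    the original ratio between them had to be abandoned.
    Assumes all volumes and limits are positive."""
    def clamp(volume):
        return max(min_volume, min(max_volume, volume))

    new_first, new_second = clamp(first_volume), clamp(second_volume)
    original_ratio = first_volume / second_volume
    deviates = abs(new_first / new_second - original_ratio) > 1e-9
    return new_first, new_second, deviates

# Original ratio 3:1 cannot be kept inside the limits [40, 80].
print(apply_user_limits(90.0, 30.0, 40.0, 80.0))  # (80.0, 40.0, True)
# Both volumes already lie inside the limits, so the ratio is kept.
print(apply_user_limits(60.0, 50.0, 40.0, 80.0))  # (60.0, 50.0, False)
```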

Claim 12:
12.     The computer program product of claim 1, wherein the method is further comprising:
accessing language settings associated with the target language; and
determining to deviate from the ratio of volume levels between the first utterance as spoken by the first individual and the second utterance as spoken by the second individual in the received media stream based on language settings.

Regarding Claim 13, Gabryjelski teaches or suggests:
13.     The computer program product of claim 1, 
wherein the received media stream is a real-time conversation between the first individual, the second individual, and a particular user associated with the target language, and [Gabryjelski suggests this feature by teaching that “[0073] … It should be appreciated that the automatic dubbing process may be performed in real time while the media content is being played….”  Gabryjelski takes media streams as input and does not say that it pertains to the real-time conversation of two people.  However, it can conduct the dubbing in real-time which makes it applicable to any real-time conversation that replaces a conversation of actors in a movie clip.]
the method further comprising:
maintaining the ratio of the volume levels between the first and second sets of words in the revoiced media stream substantially identical to the ratio of volume levels between the first and second utterances in the real-time conversation. [Gabryjelski as shown with respect to Claim 1 teaches that the output volumes follow the input volumes and as an inherent byproduct of this act the ratios remain constant.]

Regarding Claim 14, Gabryjelski teaches and suggests:
14.     The computer program product of claim 13, wherein the method is further comprising:
recognizing in real-time that the first utterance has no real meaning in the original language and determining to mute the first utterance. [Gabryjelski teaches muting portions of the input speech or changing the noise profile.  This suggests that any unintelligible input by the original speakers can be suppressed and replaced with silence.]

Claim 15 is a system claim with limitations similar to the limitations of Claim 1.
Regarding Claim 15, Gabryjelski teaches:
15.     A system for artificially generating a revoiced media stream, the system comprising:
at least one processor configured to:
receive a media stream including a first individual and a second individual speaking in an origin language;
obtain a transcript of the media stream including a first utterance and a second utterance spoke in the original language;
translate the transcript of the media stream to a target language, wherein the translated transcript includes a first set of words in the target language that corresponds with the first utterance and a second set of words in the target language that corresponds with the second utterance;
analyze the media stream to determine at least one voice profile, 
wherein the at least one voice profile is indicative of a ratio of volume levels between the first utterance as spoken by the first individual and the second utterance as spoken by the second individual  in the media stream; [Not Taught by Gabryjelski.  This limitation requires that the ratio is calculated and stored in the voice profile at the time that the voice profile is being generated.]
determine metadata information for the translated transcript, wherein the metadata information includes desired volume levels for each of the first and second sets of words that correspond with the first and second utterances; and 
use the determined at least one voice profile, the translated transcript, and the metadata information to artificially generate a revoiced media stream in which the first individual and the second individual sound as they speak the translated transcript, 
wherein a ratio of the volume levels between the first and second sets of words in the revoiced media stream is substantially identical to the ratio of volume levels between the first and second utterances in the received media stream.

Regarding Claim 16, Gabryjelski teaches:
16.  The system of claim 15, wherein the at least one processor is further configured to:
determine a first synthesized voice for a first virtual entity intended to dub the first individual, wherein the first synthesized voice has characteristics identical to the characteristics of particular voice of the first individual; [Gabryjelski, Figure 1B use of “voice actors.”  “[0026] … As shown in FIG. 1B, voice print models of a number of actors such as those famous actors may be predefined and provided in a database….”]
determine a second synthesized voice for a second virtual entity intended to dub the second individual, wherein the second synthesized voice has characteristics identical to the characteristics of a particular voice of the second individual; and [Gabryjelski, Figure 1A, the use of the voice print of the speaker himself.]
generate a revoiced media stream in which the translated transcript in the target language is spoken by the  first and second virtual entities. [Gabryjelski, Figures 1 and 3.  Any voice can be selected for synthesis of the output dubbed version of the input.]

17.    Cancelled.

Claim 18 is a system claim with limitations similar to the limitations of Claim 2.

Claim 19 is a system claim with limitations similar to the limitations of Claim 3.
19.     The system of claim 18, wherein the at least one processor is further configured to:
determine that the first utterance was pronounced by the first individual as a question and the second utterance was pronounced by the second individual as an answer; and
generate a revoiced media stream in which the first set of words in the target language are pronounced as a question and the second set of words in the target language are pronounced as an answer.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claim(s) 20-21 is/are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Gabryjelski (U.S. 2020/0058289).

Claim 21 is new and Claim 20 includes substantial amendments that justify the new grounds of rejection.
Regarding Claim 20, Gabryjelski teaches:
20.     A method for artificially generating a revoiced media stream, the method comprising: [See the mapping provided for Claim 1 which is more comprehensive.]
receiving a media stream including a first individual and a second individual speaking in an origin language, [Gabryjelski, Figure 3 shows the receiving of the “bit stream” at the “decoding module 310.”  The bit stream includes “a media content may include a movie, a television program, a video clip, a video game, or any other recorded media content. …”  ([0035]).]
wherein the first individual is associated with a first particular voice and the second individual is associated with a second particular voice; [Gabryjelski, Figure 3.  A movie would include first and second individuals speaking.  “[0037] The audio signal may be processed at the speech separation module 3204 to obtain speeches from the audio signal.”  “[0046] At speech grouping module 3206, the speeches may be grouped according to different speakers or their voices.”  The system is for “dubbing” and therefore would include a source/origin language and a target language.]
obtaining a transcript of the media stream including a first utterance and a second utterance spoke in the original language; [Gabryjelski, Figure 3, “STT Module 3208” transcribes the received “speeches” into “text” in the original language.  [0057].]
translating the transcript of the media stream to a target language, wherein the translated transcript includes a first set of words in the target language that corresponds with the first utterance and a second set of words in the target language that corresponds with the second utterance; [Gabryjelski, Figure 3, “MT Module 3210.”  [0060].]
analyzing the media stream to determine a first voice profile for the first individual and a second voice profile for the second individual, wherein the first voice profile includes characteristics of the first particular voice and the second voice profile includes characteristics of the second particular voice; [Gabryjelski, Figure 3, the “speeches” are also sent to “Voice Print Creating Module 3212” to generate voice profiles.  “[0063] At a voice print creating module 3212, a voice print model may be created for a voice based on the speeches of the voice….” Figure 1 shows the obtaining of the voice and Figure 1A shows different voice prints of different users.  “[0026] … A number of predefined voice print models may be provided in a database. For example, as shown in FIG. 1A, voice print models which are created for users as mentioned above may be provided in a database….”]
determining a first synthesized voice for a first virtual entity intended to dub the first individual and a second synthesized voice for a second virtual entity intended to dub the second individual, wherein the first synthesized voice has characteristics identical to the characteristics of the first particular voice and the second synthesized voice has characteristics identical to the characteristics of the second particular voice; [Gabryjelski, Figure 1A shows different voice prints of different users.  “[0026] … A number of predefined voice print models may be provided in a database. For example, as shown in FIG. 1A, voice print models which are created for users as mentioned above may be provided in a database….” ]
determining metadata information for the translated transcript, [Gabryjelski, regarding the use of metadata to include information about the media content, see [0035], [0038], [0045], [0047], [0050], and [0066].]
wherein the metadata information includes desired volume levels for each of the first and second sets of words that correspond with the first and second utterances,   [Gabryjelski preserves the volume of the speech of the original speakers in the media content (such as a movie clip) and encodes them in metadata:  “[0066] … In addition, the characteristics such as stress, tonality, speed, volume, inflection and so on may be applied during the TTS to generate the speech in the second language.  As mentioned above, the characteristics such as stress, tonality, speed, volume, inflection and so on may be detected from the speech at the STT process or may be obtained from the metadata.”  These are the “desired volume levels.”]
wherein the desired volume levels are indicative of a ratio of volume levels between the first and second utterances as they were spoken in the media stream; and  [Gabryjelski preserves the volume of the speech of the original speakers in the media content (such as a movie clip) and encodes them in metadata.  The original volumes are “indicative” of the “ratio” of the volumes.  This Claim language just says keep the ratio of the output volumes as it was in the input. Gabryjelski achieves this by keeping the volumes the same.  There is no calculation of the ratio or preserving the initial ratio as part of the voice profile in this limitation.  Accordingly, it is taught by Gabryjelski.]
generating a revoiced media stream in which the translated transcript in the target language is spoken by the first virtual entity and the second virtual entity, [Gabryjelski, Figure 3, “TTS Module 3214” converts the “Texts” into “Replacement Speeches” with input from the “Voice Print Creating Module 3212.”  The speeches of the individual speakers/actors are then combined in the “Combining Module 3216” to arrive at the output “Dubbed Audio.”  “[0066] At a text to speech (TTS) module 3214, the TTS conversion may be perform on the translated text in the second language based on the voice print model output by the voice print creating module 3212 to generate a speech in the second language and in the original actor's voice. In addition, the characteristics such as stress, tonality, speed, volume, inflection and so on may be applied during the TTS to generate the speech in the second language. As mentioned above, the characteristics such as stress, tonality, speed, volume, inflection and so on may be detected from the speech at the STT process or may be obtained from the metadata.”  “[0067] At a combining module 3216, the replacement speeches in the second language may be used to replace the corresponding speeches in the first language to obtain the dubbed audio….”]
wherein volume levels of the first and second sets of words in the revoiced media stream are associated with the desired volume levels. [Gabryjelski, Figure 3.  The output “Dubbed Audio” in Figure 3, which includes a combination of the translated utterances, utilizes the “stress, a tonality, a speed, a volume and an inflection” of the input utterances/speeches in the translated output.  (See claim 6 of Gabryjelski, e.g.).  The “desired volume levels” are set by the user using the “user interface” of the “media player 10” of Figure 1.  See “[0021] The media player 10 may include a user interface for interacting with a user.  For example, the media player 10 may include a display window for displaying the played video, may include a volume bar for adjusting the volume of the played audio, and may include various menu items….”]
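
For illustration only, the reasoning applied to the “ratio” limitation above can be sketched numerically.  The following is a minimal sketch and is not taken from Gabryjelski or from the Application; the function names and the sample levels are hypothetical.  It shows that when each original volume level is carried over unchanged to the corresponding set of translated words, the ratio between the two levels is necessarily unchanged, without any separate calculation or storage of the ratio.

```python
import math


def volume_ratio(first_level: float, second_level: float) -> float:
    """Ratio of the first utterance's volume level to the second's."""
    if second_level == 0:
        raise ValueError("second_level must be non-zero")
    return first_level / second_level


def carry_over_levels(original_levels):
    """Hypothetical helper: reuse each original level for the translated words."""
    return list(original_levels)


# Hypothetical levels for the first and second utterances (arbitrary units).
original_levels = [0.8, 0.4]
revoiced_levels = carry_over_levels(original_levels)

original_ratio = volume_ratio(*original_levels)
revoiced_ratio = volume_ratio(*revoiced_levels)
ratio_preserved = math.isclose(original_ratio, revoiced_ratio)
```

Under these assumptions the levels themselves are what is preserved, and the preserved ratio follows as a consequence.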

Regarding Claim 21, Gabryjelski teaches:
21. The method of claim 20, further comprising:
accessing user settings defining minimum and maximum volume levels in the revoiced media stream; and [Gabryjelski provides a user interface on its media player for adjusting the volume.  “[0021] The media player 10 may include a user interface for interacting with a user. For example, the media player 10 may include a display window for displaying the played video, may include a volume bar for adjusting the volume of the played audio, and may include various menu items….”]
determining to deviate from desired volume levels for each of the first and second sets of words that correspond with the first and second utterances based on user settings. [Gabryjelski.  The volume bar as manipulated by the user trumps any “desired volume” set by the input volumes or their ratio.]
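
For illustration only, the limitation of Claim 21 can be read as limiting the desired volume levels to a user-defined range.  The following is a minimal sketch under that assumed reading; the function name, the settings, and the sample levels are hypothetical and are not taken from Gabryjelski or from the Application.

```python
def apply_user_limits(desired_levels, minimum, maximum):
    """Clamp each desired level into the user-defined [minimum, maximum] range."""
    if minimum > maximum:
        raise ValueError("minimum must not exceed maximum")
    return [min(max(level, minimum), maximum) for level in desired_levels]


# Hypothetical desired levels for the first and second sets of words.
desired_levels = [0.9, 0.2]

# Hypothetical user settings defining minimum and maximum volume levels.
limited_levels = apply_user_limits(desired_levels, minimum=0.3, maximum=0.8)

# A level "deviates" when the user settings override the desired value.
deviates = [limited != desired
            for limited, desired in zip(limited_levels, desired_levels)]
```

In this sketch both desired levels fall outside the user range, so the user settings control and both output levels deviate from the desired levels.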
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
	
Note the previous grounds of rejection by the previous Examiner, which also refer to Gabryjelski:
Claim 20 is rejected under 35 U.S.C. 103 as being unpatentable over No-Name (EP 1928189 A1) in view of Subramanian (US 20070208569 A1) and Yun (US 20170255616) and in further view of Gabryjelski (US 20200058289 A1).
            As per independent claim 20, No-Name teaches a method for artificially generating a revoiced media stream, the method comprising:
receiving a media stream including an individual speaking in an origin language (see No-Name [0016], which notes a block-wise speech transmission will be employed from a first user terminal in the source language to a second user terminal or to a plurality of second user terminals in the target language, by means of Push-To-talk (PTT) terminals having a PTT-key button and a translation server. After the PTT-key button has been pressed at the first user terminal, the speech/origin language stream can begin; and see No-Name [0020], which notes at step 100 the PTT-key button has been activated at the first user terminal. The padding-notification-means then provides to the second user terminal a padding notification in the target language, informing that the translation is going to start (step 101)); 
obtaining input data of the media stream including a first utterance and a second utterance spoke in the original language (see No-Name [0020], which notes at step 102, the translation server receives the input data from the first user terminal and stores them in the data-storing-means/transcript (step 103); and see FIG. 4A which notes a first utterance starting at time t.sub.1 and a second utterance starting time T.sub.C); 
translating the media stream to a target language, wherein the translated transcript includes a first set of words in the target language that corresponds with the first utterance and a second set of words in the target language that corresponds with the second utterance (see No-Name [0020], which notes after the translation server has started storing data received from the first user terminal, a storing-status-detector begins to detect the storing status of the data-storing-means (step 106) and such a storing status is transmitted to the first user terminal in real time by a signalling means (step 107). At step 104 the translation process starts and develops. During this process, the stored data are transferred from the data-storing-means to the data-processing-means and are translated from the source to the target language.  Finally, the translation server outputs the processed data/translated transcript sending them to the second user terminal (step 105); and see No-Name FIG. 4A, which notes a first output/translated transcript starting at time T.sub.2 for the first utterance and which notes a second output/translated transcript starting at time T.sub.D for the second utterance).
No-Name shows in FIG. 5 a translated/revoiced output having variations in volume, but fails to specifically teach a method for artificially generating a revoiced media stream, wherein the individual is associated with particular voice; obtaining a transcript of the media stream and translating the transcription of the media stream to a target language, analyzing the media stream to determine a voice profile for the individual, wherein the voice profile is indicative of a ratio of volume levels between the first and second utterances as they were spoken in the media stream; determining a synthesized voice for a virtual entity intended to dub the individual, wherein the synthesized voice has characteristics identical to the characteristics of the particular voice; determining metadata information for the translated transcript, wherein the metadata information includes desired volume levels for each of the first and second sets of words that correspond with the first and second utterances; and using the determined voice profile, the translated transcript, and the metadata information to artificially generate a revoiced media stream in which the individual speaks the translated transcript, wherein a ratio of the volume levels between the first and second sets of words in the revoiced media stream is substantially identical to the ratio of volume levels between the first and second utterances in the received media stream.
However Subramanian does teach a computer program product for artificially generating a revoiced media stream, the computer program product embodied in a non-transitory computer-readable medium and including instructions for causing at least one processor to execute a method for (see Subramanian [0021], which notes any suitable computer readable medium may be utilized):
obtaining a transcript of the media stream and translating the transcription of the media stream to a target language (see Subramanian [0035], which notes Emotion markup component 210 receives a communication that includes emotion content (such as speech with speech emotion) and recognizes the words in the speech and transcribes the recognized words to text; and see Subramanian [0037], which notes emotion translation component 250 receives a communication, typically text with emotion markup metadata, and parses the emotion content. Emotion translation component 250 synthesizes the text into a natural language and adjusts the tone, cadence and amplitude of the voice delivery for emotion based the emotion metadata accompanying the text. Alternatively, prior to modulating the communication stream, emotion translation component 250 may translate the text and emotion metadata into the language of the listener);
analyzing the media stream to determine a voice profile for the individual, (see Subramanian [0051], which notes emotion recognition may operate similarly by matching concatenated chains of sub-emotion speech patterns extracted from the audio stream to pre-constructed emotion unit models (the results of which are sent directly to markup engine 238). Alternatively, a less computational intensive emotion extraction algorithm may be implemented that matches voice patterns in the audio stream to voice patterns for an emotion (rather than chaining sub-emotion voice pattern units). The voice patterns include specific/respective pitches, tones, cadences and amplitudes/respective volume levels, or combinations thereof, contained in the speech delivery; and see Subramanian [0096], which notes with the dictionaries, the communication stream is received (step 710) and voice recognition proceeds by extracting a word/first utterance from features in the digitized voice (step 712). Next, a check is made to determine if this portion of the speech, essentially just the translated word, has been selected for emotion analysis (step 714). If this portion has not been selected for emotion analysis, the text is output (step 728) and the communication checked for the end (step 730). If not, the process returns to step 710, more speech is received and voice recognized/second utterance for additional text (step 712)); 
determining metadata information for the translated transcript, wherein the metadata information includes desired volume levels for each of the first and second sets of words (see Subramanian [0043], which notes emotion is deduced from a communication by text pattern analysis and voice analysis. Emotion-voice pattern dictionary 222 contains emotion to voice pattern definitions for deducing emotion from voice patterns in a communication. The dictionary definitions can be generic and abstracted across speakers, or specific to a particular speaker, audience and circumstance of a communication; and see Subramanian [0045], which notes a father will choose particular words that convey his displeasure with a son who has committed some offense and alter his normal voice patterns of his delivery to reinforce his anger over the incident. However, for similar incident in the workplace, the same speaker would usually choose different words (and text patterns) and alter his voice patterns differently, from that used the familial circumstance, to convey his anger over an identical incident in the workplace; and see Subramanian [0051], which notes matching voice patterns in the audio stream to voice patterns for an emotion. The voice patterns include specific pitches, tones, cadences and amplitudes, or combinations thereof, contained in the speech delivery) that correspond with the first and second utterances (see Subramanian [0108], which notes a network device may also be configured with local or remote emotion processing capabilities. Recall that emotion communication architecture 200 comprises emotion markup component 210 and emotion translation component 250. Recall also that emotion markup component 210 receives a communication that includes emotion content (such as human speech with speech emotion) and recognizes the words and emotion in the speech and outputs text with emotion markup, thus the emotion in the original communication is preserved. Emotion translation component 250, on the other hand, receives a communication that typically includes text with emotion markup metadata, modifies and synthesizes the text into a natural language and adjusts the tone, cadence and amplitude/volume level of the voice delivery for emotion based on the emotion metadata accompanying the text); and 
generating a revoiced media stream (see Subramanian [0037], which notes emotion translation component 250 receives a communication, typically text with emotion markup metadata, and parses the emotion content. Emotion translation component 250 synthesizes the text into a natural language/transcript and adjusts the tone, cadence and amplitude of the voice delivery for emotion based the emotion metadata accompanying the text. Alternatively, prior to modulating the communication stream, emotion translation component 250 may translate the text and emotion metadata into the language of the listener; and see Subramanian [0083], which notes the synthesized voice is then received at voice emotion adjuster 260, which adjusts the pitch, tone and amplitude of the voice and changes the frequency, or cadence, of the voice delivery based on the emotion information it receives. The emotion information is in the form of emotion metadata that may be received from a source external to emotion translation component 250, such as an email or instant message, a search result, or may instead be translated emotion metadata from emotion translator 254. Voice emotion adjuster 260 retrieves voice patterns corresponding to the emotion metadata from emotion-voice pattern dictionary 222. Here again, the emotion to voice pattern definitions are selected using the context profiles/voice profile for the user).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed inventions to modify the systems and methods as taught by No-Name with the paralleled emotion analysis of Subramanian in order to select the more accurate analysis result (see Subramanian [0068], which notes the automated communication markup may also identify the most accurate type of emotion analysis for the specific communication and use it to the exclusion of the other. There, both emotion analyzers are initially allowed to reach an emotion result and the results checked for consistency and against each other. Once one emotion analysis is selected over the other, the communication is marked for analysis using the more accurate method).
The combination of No-Name and Subramanian includes predictable results, such as detecting an emotion of an input communication.
The combination of No-Name and Subramanian fails to specifically teach wherein the individual is associated with particular voice; wherein the voice profile includes characteristics of the particular voice and indicative of a ratio of volume levels between the first and second utterances as they were spoken in the media stream; determining a synthesized voice for a virtual entity intended to dub the individual, wherein the synthesized voice has characteristics identical to the characteristics of the particular voice; and generating a revoiced media stream in which the translated transcript in the target language is spoken by the virtual entity, wherein a ratio of the volume levels between the first and second sets of words in the revoiced media stream is substantially identical to the ratio of volume levels between the first and second utterances in the received media stream.
However, Yun does teach:
wherein the voice profile includes characteristics of the particular voice and indicative of a ratio of volume levels between the first and second utterances as they were spoken in the media stream (see Yun [0040], which notes FIG. 1 is a block diagram showing a configuration of an automatic interpretation system for generating a synthetic sound having characteristics similar to those of an original speaker's voice according to an exemplary embodiment of the present invention; and see Yun [0064], which notes at this point, when intensity higher or lower than a reference value preset for a particular word and intonation phrase is measured in the original speech, the corresponding word and intonation phrase in the generated synthetic sound are also assigned intensities…), and 
wherein a ratio of the volume levels between the first and second sets of words in the revoiced media stream is substantially identical to the ratio of volume levels between the first and second utterances in the received media stream (see Yun [0064], which notes even when assigning an intensity, the prosody processor 133 sets the gender of the original speaker as a basis and then assigns the intensity to the whole sentence without degrading the natural intensity characteristic that needs to be neutrally generated so that the sentence has the same ratio as that measured from the original speech. At this point, when intensity higher or lower than a reference value preset for a particular word and intonation phrase is measured in the original speech, the corresponding word and intonation phrase in the generated synthetic sound are also assigned intensities to have the same ratio with respect to the reference value, and intensities of remaining words and intonation phrases are adjusted together so that the original speech and the synthetic sound may have an overall intensity at the same level. In this way, it is possible to generate an interpreted synthetic sound with an intensity of an emotion and an intention similar to those of the original speech).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed inventions to modify the emotion detection of the systems and methods as taught by Subramanian with the emotion intensity detection of Yun in order to generate an interpreted sound that maintains the intention of the original speech (see Yun [0061], which notes the number of intonation phrases of the original speech corresponds to the number of intonation phrases of the synthesis-target translation, a cadence is assigned so that the translation has the same characteristic as an intonation phrase prosody structure of the original speech corresponding to the translation. For example, assuming that the sound of the phrase “Have you eaten?” is translated into the sentence “[Hangul text] ” in an automatic interpretation from English to Korean, when a cadence is assigned using text data alone, the intonation “L %” [low pitches only] is assigned. When the intonation “L %” is assigned, a meaning of the sentence “[Hangul text]” becomes “I have eaten” in English, which is a sentence having a meaning different from the meaning of the original sentence).
The combination of No-Name and Subramanian with Yun includes predictable results, such as generating an interpreted synthetic sound with an intensity of an emotion and an intention similar to those of the original speech.
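
For illustration only, the intensity assignment described in Yun [0064] can be sketched as scaling each word of the synthetic sound so that it has the same ratio to a reference value as the corresponding word of the original speech.  The following is a minimal sketch under that assumed reading and is not Yun's implementation; the function name, the reference values, and the sample intensities are hypothetical.

```python
def match_intensity_ratio(original_intensities, original_reference,
                          synthetic_reference):
    """Give each synthetic word the same ratio to its reference value
    as the corresponding original word has to the original reference."""
    if original_reference <= 0:
        raise ValueError("original_reference must be positive")
    return [synthetic_reference * (value / original_reference)
            for value in original_intensities]


# Hypothetical intensities for two words of the original speech, with a
# hypothetical reference value of 60 for the original speaker.
original = [60.0, 75.0]

# Hypothetical reference value of 50 for the synthetic voice.
synthetic = match_intensity_ratio(original, original_reference=60.0,
                                  synthetic_reference=50.0)

# The word-to-word ratio is the same before and after the assignment.
original_ratio = original[1] / original[0]
synthetic_ratio = synthetic[1] / synthetic[0]
```

In this sketch the absolute intensities change with the reference value while the ratio between the two words does not.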
Yun fails to specifically teach determining a synthesized voice for a virtual entity intended to dub the individual, wherein the synthesized voice has characteristics identical to the characteristics of the particular voice; and generating a revoiced media stream in which the translated transcript in the target language is spoken by the virtual entity.
However, Gabryjelski does teach a method comprising: 
determining a synthesized voice for a virtual entity intended to dub the individual (see Gabryjelski [0066], which notes at a text to speech (TTS) module 3214, the TTS conversion may be perform on the translated text in the second language based on the voice print model output by the voice print creating module 3212 to generate a speech in the second language and in the original actor's voice. In addition, the characteristics such as stress, tonality, speed, volume, inflection and so on may be applied during the TTS to generate the speech in the second language. As mentioned above, the characteristics such as stress, tonality, speed, volume, inflection and so on may be detected from the speech at the STT process or may be obtained from the metadata), wherein the synthesized voice has characteristics identical to the characteristics of the particular voice (see Gabryjelski [0064], which notes at the voice print creating module 3212, at least part of the extracted and grouped speeches of a speaker may be used as training data to train a voice print model of the speaker. Various voice print creating models or processes may be utilized at the module 3212 to create the voice print model. As an example, a seed voice print model, which is trained based on a huge amount of training data, may be used with the speeches of the speaker to train the voice print model of the speaker. By utilizing the seed voice print model, only a limited number of sample speeches or sentences are needed to train the voice print model of the speaker. The voice print model of the speaker may include phonemes that are in line with the speaker's voice, and thus may represent the speaker's voice irrespective of language. As another example, the voice print creating module 3212 may utilize automated speech-to-text process to assign probabilistic phonemes based on the speeches so as to generate the voice print model);
receiving a media stream including an individual speaking in an origin language (see Gabryjelski [0076], which notes the translating from the speeches in the first language to the replacement speeches in the second language may be performed by speech-to-text conversion, text-to-text translation and text-to-speech conversion. The speech-to-text conversion may be performed for the extracted speeches of a voice based on at least one of a closed caption, a subtitle, a script, a transcript and a lyric of the media content. The text-to-text translation/translated transcript for the converted text from the first language to the second language may be performed based on at least one of the characteristics of the speeches, a genre information of the media content, a scene knowledge. The text-to-speech conversion for the translated text may be performed based on the voice print model and the characteristics of the extracted speeches) generate a revoiced media stream in which the translated transcript in the target language (see Gabryjelski [0076], which notes the translating from the speeches in the first language to the replacement speeches in the second language may be performed by speech-to-text conversion, text-to-text translation and text-to-speech conversion. The speech-to-text conversion may be performed for the extracted speeches of a voice based on at least one of a closed caption, a subtitle, a script, a transcript and a lyric of the media content. The text-to-text translation/translated transcript for the converted text from the first language to the second language may be performed based on at least one of the characteristics of the speeches, a genre information of the media content, a scene knowledge. The text-to-speech conversion for the translated text may be performed based on the voice print model and the characteristics of the extracted speeches) is spoken by the virtual entity (see Gabryjelski [0069], which notes the voice print model may be chosen from a predefined set of voice print models. For example, a voice print model of a user or the user's favorite actor may be stored in the database as shown in FIG. 1A, and may be chosen by the user from the database for the above mentioned customized dubbing).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed inventions to modify the systems and methods as taught by the combination of No-Name, Subramanian, and Yun with the background noise reduction of Gabryjelski in order to obtain cleaner speeches/revoiced dubbing separated from background sound (see Gabryjelski [0039], which notes the media content may include different audio versions in different languages. In this case, the different audio versions may be utilized to obtain cleaner speeches separated from background sound; and see Gabryjelski [0040], which notes an example, in which there is an audio version in a first language such as English and an audio version in a second language such as French. Usually the background sounds of the two audio versions are substantially same while the speeches of the both are different and actually rarely overlapped to each other in time domain and/or frequency domain. By utilizing this fact, a subtracting operation between the two audio versions may effective eliminate the background sound. For example, if the English speeches are desired, for an audio channel or track, the French audio version may be subtracted from the English audio version. In this way, the background sound may be eliminated and the French speeches may be inverted, then the English speeches may be obtained by omitting the inverted signal. On the other hand, background sound may be obtained by subtracting the detected speeches from the original audio track).
The combination of No-Name, Subramanian, and Yun with Gabryjelski includes predictable results, such as generating low-background-noise, artificially generated translations of speech from mixed-content media.
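
For illustration only, the subtracting operation described in Gabryjelski [0040] can be sketched with two audio versions that share the same background sound and whose speeches do not overlap in time.  The following is a toy sketch and is not Gabryjelski's implementation; the function name and the integer sample values are hypothetical, and the values are treated as non-negative amplitude levels (an assumption made so that the inverted speech can be dropped by sign), whereas real audio samples are signed.

```python
def subtract_tracks(track_a, track_b):
    """Sample-by-sample difference of two equally long tracks."""
    if len(track_a) != len(track_b):
        raise ValueError("tracks must have the same length")
    return [a - b for a, b in zip(track_a, track_b)]


# Hypothetical amplitude levels; the two speeches never overlap in time.
background = [20, 10, 30, 20]
english_speech = [50, 0, 40, 0]
french_speech = [0, 60, 0, 30]

english_version = [b + s for b, s in zip(background, english_speech)]
french_version = [b + s for b, s in zip(background, french_speech)]

# The shared background cancels; French speech appears inverted (negative).
difference = subtract_tracks(english_version, french_version)

# Omitting the inverted (negative) part leaves the English speech.
english_only = [max(value, 0) for value in difference]

# Subtracting the detected speech from the original track gives the background.
recovered_background = subtract_tracks(english_version, english_only)
```

Under these assumptions the recovered English speech and the recovered background match the inputs exactly; with overlapping speeches or signed samples the separation would not be this clean.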

Any inquiry concerning this communication or earlier communications from the examiner should be directed to FARIBA SIRJANI whose telephone number is (571)270-1499.  The examiner can normally be reached on 9 to 5, M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Desir, can be reached on 571-272-7799.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




/FARIBA SIRJANI/Primary Examiner, Art Unit 2659