DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
Applicant's arguments with respect to the rejections of claims 1-28 under 35 U.S.C. 103 have been considered but are moot in view of the new grounds of rejection necessitated by the amendments. See the detailed rejection of claims 1 and 15 below in view of Kumar et al.
Claims 4 and 18 are canceled.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1-2, 8, 15-16 and 22 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Kumar et al. (US 2020/0111474).

Claims 1 and 15,
Kumar teaches a method comprising: receiving, at data processing hardware, an input text sequence in a first language, the input text sequence to be synthesized into speech in a second language different than the first language ([0022] [0055-0058] the media system translates the received voice of the actor Tom Hanks from the “Oscars” in the English language to Spanish (second or alternative) language; transcribing the first plurality of words in the first language using speech-to-text; translating the transcribed words of the first language into words of the second language); 
obtaining, by the data processing hardware, a speaker embedding, the speaker embedding specifying specific voice characteristics of a target speaker for synthesizing the input text sequence into speech that clones a voice of the target speaker, the target speaker comprising a native speaker of the second language ([0027-0029] baseline non-linguistic characteristics are modified based on the linguistic characteristics associated with the Spanish speaker; the language translation database stores the baseline non-linguistic characteristics for each type of voice); and 
generating, by the data processing hardware, using a multilingual text-to-speech (TTS) model configured to produce synthesized speech of a phrase in the second language from input text of the phrase in the first language, an output audio feature representation of the input text sequence by processing the input text sequence in the first language and the speaker embedding, the output audio feature representation representing synthesized speech in the second language that clones the voice of the target speaker based on having the voice characteristics of the target speaker specified by the speaker embedding ([0030] [0052-0053] [0076] outputting a synthesized speech of the transcribed words of the second language using text-to-speech software that corresponds to the modified baseline non-linguistic characteristics of the Spanish speaker; synthesis of speech in the alternate language includes assembling the translated spoken words with the non-linguistic characteristics; control circuitry assigns non-linguistic characteristics to the translated words based on the determined ethnicity/emotion/gender of the speaker to synthesize the translated speech; because the spoken words are being translated to Spanish, the media system synthesizes the speech using the non-linguistic characteristics of a native Spanish speaker; control circuitry 204 is able to modify the non-linguistic characteristics of the baritone determined to reflect non-linguistic characteristics of a Spanish man).

Claims 2 and 16,
Kumar further teaches the method of claim 1, further comprising: obtaining, by the data processing hardware, a language embedding, the language embedding specifying language-dependent information, wherein processing the input text and the speaker embedding further comprises processing the input text, the speaker embedding, and the language embedding to generate the output audio feature representation of the input text, the output audio feature representation further having the language-dependent information specified by the language embedding ([0029] the media system may determine an emotion state based on the spoken words; the media system is able to retrieve an emotion associated with the spoken words from the language translation database; the language translation database then retrieves non-linguistic characteristics associated with the emotion, which are then used to synthesize speech in the second language).

Claims 8 and 22,
Kumar further teaches the method of claim 1, wherein the output audio feature representation comprises mel-frequency spectrograms ([0026-0027] outputting the synthesized speech of the Spanish man based on the pitch).

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 3 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Kumar et al. (US 2020/0111474) and further in view of Qian et al. (US 2012/0253781).

Claims 3 and 17,
Kumar teaches all the limitations in claim 2. The difference between the prior art and the claimed invention is that Kumar does not explicitly teach wherein: the language-dependent information is associated with the second language of the target speaker; and the language embedding specifying the language-dependent information is obtained from training utterances spoken in the second language by one or more different speakers.
Qian teaches wherein: the language-dependent information is associated with the second language of the target speaker; and the language embedding specifying the language-dependent information is obtained from training utterances spoken in the second language by one or more different speakers ([Fig. 1] [0018] [0027] the target speaker speech corpus includes speech waveform of Mandarin Chinese as spoken by a second speaker; the speech synthesis engine 104 uses speech corpus as training data for HMM-based text-to-speech synthesis).
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Kumar with teachings of Qian by modifying the method and system for generating alternate audio for a media stream as taught by Kumar to include wherein: the language-dependent information is associated with the second language of the target speaker; and the language embedding specifying the language-dependent information is obtained from training utterances spoken in the second language by one or more different speakers as taught by Qian for the benefit of yielding synthesized speech that is natural sounding (Qian [0002]).

Claims 5-7, 12-13, 19-21 and 26-27 are rejected under 35 U.S.C. 103 as being unpatentable over Kumar et al. (US 2020/0111474) and further in view of Arik et al. (US 2019/0122651).

Claims 5 and 19,
Kumar teaches all the limitations in claim 1. The difference between the prior art and the claimed invention is that Kumar does not explicitly teach wherein generating the output audio feature representation of the input text comprises, for each of a plurality of time steps: processing, using an encoder neural network, a respective portion of the input text sequence for the time step to generate a corresponding text encoding for the time step; and processing, using a decoder neural network, the text encoding for the time step to generate a corresponding output audio feature representation for the time step.
Arik teaches wherein generating the output audio feature representation of the input text comprises, for each of a plurality of time steps: processing, using an encoder neural network, a respective portion of the input text sequence for the time step to generate a corresponding text encoding for the time step; and processing, using a decoder neural network, the text encoding for the time step to generate a corresponding output audio feature representation for the time step ([Figs. 1-2] [0060-0065] the encoder network (e.g., encoder 105/705) begins with an embedding layer, which converts characters or phonemes into trainable vector representations; the decoder network (e.g., decoder 130/730) generates audio in an autoregressive manner by predicting a group of r future audio frames conditioned on the past audio frames).
Kumar is analogous art with Arik because they both involve speech synthesis. Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Kumar with teachings of Arik by modifying the method and system for generating alternate audio for a media stream as taught by Kumar to include an encoder which converts characters or phonemes into trainable vector representations and a decoder to generate audio frames of the input text as taught by Arik for the benefit of improving speaker text-to-speech systems (Arik [0005]).

Claims 6 and 20,
Arik further teaches the method of claim 5, wherein the encoder neural network comprises a convolutional subnetwork and a bidirectional long short-term memory (LSTM) layer ([0061] encoder network includes an embedding layer and fully-connected layer).

Claims 7 and 21,
Arik further teaches the method of claim 5, wherein the decoder neural network comprises an autoregressive neural network comprising a long short-term memory (LSTM) subnetwork, a linear transform, and a convolutional subnetwork ([0063-0064] decoder network includes fully-connected layers with rectified linear unit (ReLU) nonlinearities to preprocess input Mel-spectrograms followed by a series of decoder blocks comprising a causal convolution block and an attention block).

Claims 12 and 26,
Kumar teaches all the limitations in claim 1. The difference between the prior art and the claimed invention is that Kumar does not explicitly teach wherein the input text sequence corresponds to a character input representation.
Arik further teaches wherein the input text sequence corresponds to a character input representation ([0035] textual features include character or phoneme representations of the input text).
Kumar is analogous art with Arik because they both involve speech synthesis. Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Kumar with teachings of Arik by modifying the method and system for generating alternate audio for a media stream as taught by Kumar to include wherein the input text sequence corresponds to a character input representation as taught by Arik for the benefit of improving speaker text-to-speech systems (Arik [0005]).

Claims 13 and 27,
Kumar teaches all the limitations in claim 1. The difference between the prior art and the claimed invention is that Kumar does not explicitly teach wherein the input text sequence corresponds to a phoneme input representation.
Arik further teaches wherein the input text sequence corresponds to a phoneme input representation ([0035] textual features include character or phoneme representations of the input text).
Kumar is analogous art with Arik because they both involve speech synthesis. Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Kumar with teachings of Arik by modifying the method and system for generating alternate audio for a media stream as taught by Kumar to include wherein the input text sequence corresponds to a phoneme input representation as taught by Arik for the benefit of improving speaker text-to-speech systems (Arik [0005]).

Claims 9 and 23 are rejected under 35 U.S.C. 103 as being unpatentable over Kumar et al. (US 2020/0111474) and further in view of Sharman (US 5,970,453).

Claims 9 and 23,
Kumar teaches all the limitations in claim 1. The difference between the prior art and the claimed invention is that Kumar does not explicitly teach inverting, by the data processing hardware, using a waveform synthesizer, the output audio feature representation into a time-domain waveform; and generating, by the data processing hardware, using the time-domain waveform, a synthesized speech representation of the input text sequence that clones the voice of the target speaker in the second language.
Sharman teaches inverting, by the data processing hardware, using a waveform synthesizer, the output audio feature representation into a time-domain waveform; and generating, by the data processing hardware, using the time-domain waveform, a synthesized speech representation of the input text sequence that clones the voice of the target speaker in the second language ([Fig. 6] [col. 7 lines 15-27] synthesizing speech includes receiving text to be synthesized and generating a sequence of phonemes derived from the text to be synthesized; a hidden Markov model is used at step 610 to determine an underlying sequence of fenemes which may give rise to the sequence of phonemes; each underlying sequence of fenemes is converted from the frequency domain into its time domain equivalent using an inverse Fourier transform at step 625; the sequence of time domain equivalents is concatenated to produce the synthesized speech at step 630).
Kumar is analogous art with Sharman because they both involve speech synthesis. Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Kumar with teachings of Sharman by modifying the method and system for generating alternate audio for a media stream as taught by Kumar to include converting each underlying sequence of fenemes from the frequency domain into their time domain equivalent using inverse Fourier transform as taught by Sharman for the benefit of precisely modelling the speech characteristics of a given human speaker and so achieve a more natural speech quality (Sharman [col. 1 lines 50-57]).

Claims 10-11 and 24-25 are rejected under 35 U.S.C. 103 as being unpatentable over Kumar et al. (US 2020/0111474) and further in view of GAO (US 2021/0020161).

Claims 10 and 24,
Kumar teaches all the limitations in claim 1. The difference between the prior art and the claimed invention is that Kumar does not explicitly teach wherein the TTS model is trained on: a first language training set comprising a plurality of utterances spoken in the first language and corresponding reference text; and a second language training set comprising a plurality of utterances spoken in the second language and corresponding reference text.
GAO teaches wherein the TTS model is trained on: a first language training set comprising a plurality of utterances spoken in the first language and corresponding reference text; and a second language training set comprising a plurality of utterances spoken in the second language and corresponding reference text ([0098] the training dataset comprises a plurality of speech signal segments from a plurality of speakers comprising a plurality of languages and text information corresponding to the speech signal segments).
Kumar is analogous art with GAO because they both involve speech synthesis. Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Kumar with teachings of GAO by modifying the method and system for generating alternate audio for a media stream as taught by Kumar to include wherein the TTS model is trained on: a first language training set comprising a plurality of utterances spoken in the first language and corresponding reference text; and a second language training set comprising a plurality of utterances spoken in the second language and corresponding reference text as taught by GAO for the benefit of improving spoken language translation systems (GAO [0003]).

Claims 11 and 25,
GAO further teaches the method of claim 10, wherein the TTS model is further trained on one or more additional language training sets, each additional language training set of the one or more additional language training sets comprising a plurality of utterances spoken in a respective language and corresponding reference text, the respective language of each additional language training set different than the respective language of each other additional language training set and different than the first and second languages ([0098] the training dataset comprises a plurality of speech signal segments from a plurality of speakers comprising a plurality of languages and text information corresponding to the speech signal segments).

Claims 14 and 28 are rejected under 35 U.S.C. 103 as being unpatentable over Kumar et al. (US 2020/0111474) and further in view of Toiyama et al. (US 2010/0191533).

Claims 14 and 28,
Kumar teaches all the limitations in claim 1. The difference between the prior art and the claimed invention is that Kumar does not explicitly teach wherein the input text sequence corresponds to an 8-bit Unicode Transformation Format (UTF-8) encoding sequence.
Toiyama teaches wherein the input text sequence corresponds to an 8-bit Unicode Transformation Format (UTF-8) encoding sequence ([0067] input text string includes an 8-bit character type).
Kumar is analogous art with Toiyama because they both involve speech synthesis. Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Kumar with teachings of Toiyama by modifying the method and system for generating alternate audio for a media stream as taught by Kumar to include an 8-bit character type representation of the input text as taught by Toiyama for the benefit of enabling various types of settings to be changed flexibly, and additionally text string buffer can be implemented at low cost (Toiyama [0067]).

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Vincent (US 2015/0288797) teaches a system that incorporates multi-language voice-to-voice, voice-to-text, text-to-text, and text-to-voice translation and voice recognition features, so that users can speak in their desired language or type in the word or phrase they want translated. The system transmits and receives in real time an audio response, a text response, or both, which is crucial in conversations between doctors who speak different languages and in conversations with patients who speak different languages.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHREYANS A PATEL whose telephone number is (571)270-0689. The examiner can normally be reached Monday-Friday 8am-5pm PST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached at 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

SHREYANS A. PATEL
Examiner
Art Unit 2657



/SHREYANS A PATEL/Examiner, Art Unit 2656