DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
Applicant's arguments regarding the rejection of claims 1-4, 8, 15-18 and 22 under 35 U.S.C. 102 have been considered but are moot in view of the new grounds of rejection necessitated by the amendments. See GAO (US 2021/0020161), paragraphs [0121-0125].

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1-4, 8, 10-11, 15-18, 22 and 24-25 are rejected under 35 U.S.C. 103 as being unpatentable over Qian et al. (US 2012/0253781) in view of GAO (US 2021/0020161).

Claims 1 and 15,
Qian teaches a method comprising: obtaining, by the data processing hardware, a speaker embedding, the speaker embedding specifying specific voice characteristics of a target speaker for synthesizing the input text sequence into speech that clones a voice of the target speaker, the target speaker comprising a native speaker of a second language different than the first language; and generating, by the data processing hardware, using a text-to-speech (TTS) model, an output audio feature representation of the input text sequence by processing the input text sequence and the speaker embedding, the output audio feature representation having the voice characteristics of the target speaker specified by the speaker embedding ([Fig. 1] [0017-0019] speech synthesis for cross-lingual voice transformation; the speech transformation engine transforms the voice characteristics of a speech corpus 108 provided by a target speaker in a target language (L2) based on voice characteristics of a speech corpus 110 provided by a source speaker in the source language (L1); the source speaker speech corpus 110 includes speech waveforms of North American-Style English as spoken by a first speaker, while the target speaker speech corpus 108 includes speech waveforms of Mandarin Chinese as spoken by a second speaker; speech waveforms are a repertoire of speech utterance units for a particular language; the speech synthesis engine 104 uses the transformed target speaker speech corpus 112 to generate synthesized speech 114 based on input text 116; the synthesized speech 114 has the voice characteristics of the source speaker who provided the speech corpus 110 in the source language).
The difference between the prior art and the claimed invention is that Qian does not explicitly teach receiving, at data processing hardware, an input text sequence in a first language, the input text sequence to be synthesized into speech in a second language different than the first language.
GAO teaches receiving, at data processing hardware, an input text sequence in a first language, the input text sequence to be synthesized into speech in a second language different than the first language ([Fig. 2(a)] [0121-0124] a segment of a first speech signal 110 comprising a source language A (second language) is inputted; the speech signal 110 is input to a speech recognition module 101 for producing text 111 in the source language A; generating a first text signal from a segment of the first speech signal, the first text signal comprising the source language A; the source language text 111 is then input to a text-to-text translation module 102, producing output text 112 in the target language B; generating a second text signal 112 from the first text signal 111, the second text signal comprising the target language).
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Qian with teachings of GAO by modifying the frame mapping approach for cross-lingual voice transformation as taught by Qian to include receiving, at data processing hardware, an input text sequence in a first language, the input text sequence to be synthesized into speech in a second language different than the first language as taught by GAO for the benefit of improving spoken language translation systems (GAO [0003]).

Claims 2 and 16,
Qian further teaches the method of claim 1, further comprising: obtaining, by the data processing hardware, a language embedding, the language embedding specifying language-dependent information, wherein processing the input text and the speaker embedding further comprises processing the input text, the speaker embedding, and the language embedding to generate the output audio feature representation of the input text, the output audio feature representation further having the language-dependent information specified by the language embedding ([Fig. 1] [0018] target speaker speech corpus 108 includes speech waveforms of Mandarin Chinese as spoken by a second speaker).

Claims 3 and 17,
Qian further teaches the method of claim 2, wherein: the language-dependent information is associated with the second language of the target speaker; and the language embedding specifying the language-dependent information is obtained from training utterances spoken in the second language by one or more different speakers ([Fig. 1] [0018] [0027] the target speaker speech corpus includes speech waveforms of Mandarin Chinese as spoken by a second speaker; the speech synthesis engine 104 uses the speech corpus as training data for HMM-based text-to-speech synthesis).

Claims 4 and 18,
Qian further teaches the method of claim 2, wherein: the language-dependent information is associated with the first language; and the language embedding specifying the language-dependent information is obtained from training utterances spoken in the first language by one or more different speakers ([Fig. 1] [0018] [0027] the source speaker speech corpus includes speech waveforms of North American-Style English as spoken by a first speaker; the speech synthesis engine 104 uses the speech corpus as training data for HMM-based text-to-speech synthesis).

Claims 8 and 22,
Qian further teaches the method of claim 1, wherein the output audio feature representation comprises Mel-frequency spectrograms ([0023] spectrum of the waveforms containing LPC spectrums and fundamental frequencies).  


Claims 10 and 24,
GAO further teaches the method of claim 1, wherein the TTS model is trained on: a first language training set comprising a plurality of utterances spoken in the first language and corresponding reference text; and a second language training set comprising a plurality of utterances spoken in the second language and corresponding reference text ([0098] the training dataset comprises a plurality of speech signal segments from a plurality of speakers comprising a plurality of languages and text information corresponding to the speech signal segments).

Claims 11 and 25,
GAO further teaches the method of claim 10, wherein the TTS model is further trained on one or more additional language training sets, each additional language training set of the one or more additional language training sets comprising a plurality of utterances spoken in a respective language and corresponding reference text, the respective language of each additional language training set different than the respective language of each other additional language training set and different than the first and second languages ([0098] the training dataset comprises a plurality of speech signal segments from a plurality of speakers comprising a plurality of languages and text information corresponding to the speech signal segments).

Claims 5-7, 12-13, 19-21 and 26-27 are rejected under 35 U.S.C. 103 as being unpatentable over Qian et al. (US 2012/0253781) in view of GAO (US 2021/0020161) and further in view of Arik et al. (US 2019/0122651).

Claims 5 and 19,
Qian and GAO teach all the limitations in claim 1. The difference between the prior art and the claimed invention is that Qian and GAO do not explicitly teach wherein generating the output audio feature representation of the input text comprises, for each of a plurality of time steps: processing, using an encoder neural network, a respective portion of the input text sequence for the time step to generate a corresponding text encoding for the time step; and processing, using a decoder neural network, the text encoding for the time step to generate a corresponding output audio feature representation for the time step.
Arik teaches wherein generating the output audio feature representation of the input text comprises, for each of a plurality of time steps: processing, using an encoder neural network, a respective portion of the input text sequence for the time step to generate a corresponding text encoding for the time step; and processing, using a decoder neural network, the text encoding for the time step to generate a corresponding output audio feature representation for the time step ([Figs. 1-2] [0060-0065] the encoder network (e.g., encoder 105/705) begins with an embedding layer, which converts characters or phonemes into trainable vector representations; the decoder network (e.g., decoder 130/730) generates audio in an autoregressive manner by predicting a group of r future audio frames conditioned on the past audio frames).
Qian is analogous art with Arik because they both involve speech synthesis. Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Qian with teachings of Arik by modifying the frame mapping approach for cross-lingual voice transformation as taught by Qian to include an encoder which converts characters or phonemes into trainable vector representations and a decoder to generate audio frames of the input text as taught by Arik for the benefit of improving speaker text-to-speech systems (Arik [0005]).
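For illustration only, the per-time-step encoder/decoder flow recited in claims 5 and 19 can be sketched as below. This is a minimal sketch and not Arik's or Applicant's implementation: the names `encode_step`, `decode_step`, `generate_features` and the size `EMBED_DIM` are hypothetical, and simple arithmetic stands in for the encoder and decoder neural networks.

```python
from typing import List

EMBED_DIM = 4  # hypothetical size of the text encoding and of each audio frame


def encode_step(symbol: str) -> List[float]:
    """Placeholder for the encoder: map one character to a fixed-length vector.

    The vector is just the low EMBED_DIM bits of the character code; a real
    encoder would use a trained embedding layer instead.
    """
    code = ord(symbol)
    return [float((code >> i) & 1) for i in range(EMBED_DIM)]


def decode_step(text_encoding: List[float], previous_frame: List[float]) -> List[float]:
    """Placeholder for an autoregressive decoder step.

    The new frame depends on the current text encoding and on the frame
    produced at the previous time step.
    """
    return [0.5 * p + e for p, e in zip(previous_frame, text_encoding)]


def generate_features(text: str) -> List[List[float]]:
    """Produce one output audio feature frame per time step (one per character)."""
    frames: List[List[float]] = []
    previous = [0.0] * EMBED_DIM
    for symbol in text:
        encoding = encode_step(symbol)            # text encoding for this time step
        previous = decode_step(encoding, previous)  # audio features for this time step
        frames.append(previous)
    return frames


frames = generate_features("ab")
```

The point of the sketch is only the control flow: each time step encodes a portion of the input text and then decodes that encoding, conditioned on earlier output, into an audio feature frame.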

Claims 6 and 20,
([0061] encoder network includes an embedding layer and fully-connected layer).  

Claims 7 and 21,
Arik further teaches the method of claim 5, wherein the decoder neural network comprises an autoregressive neural network comprising a long short-term memory (LSTM) subnetwork, a linear transform, and a convolutional subnetwork ([0063-0064] decoder network includes fully-connected layers with rectified linear unit (ReLU) nonlinearities to preprocess input Mel-spectrograms followed by a series of decoder blocks comprising a causal convolution block and an attention block).

Claims 12 and 26,
Qian and GAO teach all the limitations in claim 1. The difference between the prior art and the claimed invention is that Qian and GAO do not explicitly teach wherein the input text sequence corresponds to a character input representation.  
Arik teaches wherein the input text sequence corresponds to a character input representation ([0035] textual features include character or phoneme representations of the input text).
Qian is analogous art with Arik because they both involve speech synthesis. Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Qian with teachings of Arik by modifying the frame mapping approach for cross-lingual voice transformation as taught by Qian to include textual features which include a character representation of the input text as taught by Arik for the benefit of improving speaker text-to-speech systems (Arik [0005]).


Claims 13 and 27,
Qian and GAO teach all the limitations in claim 1. The difference between the prior art and the claimed invention is that Qian and GAO do not explicitly teach wherein the input text sequence corresponds to a phoneme input representation.
Arik teaches wherein the input text sequence corresponds to a phoneme input representation ([0035] textual features include character or phoneme representations of the input text).
Qian is analogous art with Arik because they both involve speech synthesis. Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Qian with teachings of Arik by modifying the frame mapping approach for cross-lingual voice transformation as taught by Qian to include textual features which include a phoneme representation of the input text as taught by Arik for the benefit of improving speaker text-to-speech systems (Arik [0005]).

Claims 9 and 23 are rejected under 35 U.S.C. 103 as being unpatentable over Qian et al. (US 2012/0253781) in view of GAO (US 2021/0020161) and further in view of Sharman (US 5,970,453).

Claims 9 and 23,
Qian and GAO teach all the limitations in claim 1. The difference between the prior art and the claimed invention is that Qian and GAO do not explicitly teach inverting, by the data processing hardware, using a waveform synthesizer, the output audio feature representation into a time-domain waveform; and generating, by the data processing hardware, using the time-domain waveform, a synthesized speech representation of the input text sequence that clones the voice of the target speaker in the second language.
Sharman teaches inverting, using a waveform synthesizer, the output audio feature representation into a time-domain waveform, and generating, using the time-domain waveform, a synthesized speech representation of the input text sequence ([Fig. 6] [col. 7 lines 15-27] synthesizing speech includes receiving text to be synthesized; generates a sequence of phonemes which have been derived from text to be synthesized; a hidden Markov model is used at step 610 to determine an underlying sequence of fenemes which may give rise to the sequence of phonemes; each underlying sequence of fenemes is converted from the frequency domain into their time domain equivalent using an inverse Fourier transform at step 625; the sequence of time domain equivalents are concatenated to produce the synthesized speech at step 630).
Qian is analogous art with Sharman because they both involve speech synthesis. Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Qian with teachings of Sharman by modifying the frame mapping approach for cross-lingual voice transformation as taught by Qian to include converting each underlying sequence of fenemes from the frequency domain into their time domain equivalent using inverse Fourier transform as taught by Sharman for the benefit of precisely modelling the speech characteristics of a given human speaker and so achieve a more natural speech quality (Sharman [col. 1 lines 50-57]).
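For illustration only, the frequency-domain to time-domain conversion and concatenation relied upon from Sharman (steps 625 and 630) can be sketched as below. This is a minimal sketch under stated assumptions, not Sharman's implementation: the function names `inverse_dft` and `synthesize` are hypothetical, each frame is assumed to be a full complex spectrum, and a naive inverse discrete Fourier transform stands in for whatever inverse transform a practical synthesizer would use.

```python
import cmath
from typing import List, Sequence


def inverse_dft(spectrum: Sequence[complex]) -> List[float]:
    """Naive inverse discrete Fourier transform of one frame.

    x[t] = (1/N) * sum_k X[k] * exp(2*pi*i*k*t/N); only the real part is kept,
    on the assumption that the frame represents a real-valued signal.
    """
    n = len(spectrum)
    samples = []
    for t in range(n):
        total = sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n))
        samples.append((total / n).real)
    return samples


def synthesize(frames: Sequence[Sequence[complex]]) -> List[float]:
    """Convert each frequency-domain frame to the time domain and concatenate."""
    waveform: List[float] = []
    for spectrum in frames:
        waveform.extend(inverse_dft(spectrum))
    return waveform


# Two 4-point frames: a flat spectrum (an impulse in time) and a DC-only spectrum.
waveform = synthesize([[1, 1, 1, 1], [4, 0, 0, 0]])
```

Plain concatenation is used here because that is what the cited passage describes; smoothing at frame boundaries is outside the scope of the sketch.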

Claims 14 and 28 are rejected under 35 U.S.C. 103 as being unpatentable over Qian et al. (US 2012/0253781) in view of GAO (US 2021/0020161) and further in view of Toiyama et al. (US 2010/0191533).

Claims 14 and 28,
Qian and GAO teach all the limitations in claim 1. The difference between the prior art and the claimed invention is that Qian and GAO do not explicitly teach wherein the input text sequence corresponds to an 8-bit Unicode Transformation Format (UTF-8) encoding sequence.

Toiyama teaches wherein the input text sequence corresponds to an 8-bit Unicode Transformation Format (UTF-8) encoding sequence ([0067] input text string includes an 8-bit character type).
Qian is analogous art with Toiyama because they both involve speech synthesis. Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Qian with teachings of Toiyama by modifying the frame mapping approach for cross-lingual voice transformation as taught by Qian to include an 8-bit character type representation of the input text as taught by Toiyama for the benefit of enabling various types of settings to be changed flexibly and, additionally, allowing the text string buffer to be implemented at low cost (Toiyama [0067]).
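For illustration only, an input text sequence expressed as a UTF-8 encoding sequence can be sketched as below. This is a minimal sketch, not the implementation of Toiyama or of Applicant: the function name `utf8_input_sequence` is hypothetical, and Python's built-in `str.encode` is used to obtain the byte values.

```python
from typing import List


def utf8_input_sequence(text: str) -> List[int]:
    """Return the UTF-8 encoding of the text as a list of 8-bit values (0-255)."""
    return list(text.encode("utf-8"))


english_ids = utf8_input_sequence("hi")    # one byte per ASCII character
mandarin_ids = utf8_input_sequence("你好")  # three bytes per character here
```

Because every value fits in 8 bits, the same 256-entry input vocabulary covers text in any language, at the cost of longer sequences for non-ASCII scripts.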

Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHREYANS A PATEL whose telephone number is (571)270-0689.  The examiner can normally be reached on Monday-Friday 8am-5pm PST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached at 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.


SHREYANS A. PATEL
Examiner
Art Unit 2657



/SHREYANS A PATEL/Examiner, Art Unit 2656