DETAILED ACTION
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 07/29/2022 has been entered.
This communication is in response to the Amendments and Arguments filed on 07/29/2022.
Claims 1-11 and 13-20 are pending and have been examined.
All previous objections/rejections not mentioned in this Office Action have been withdrawn by the examiner. 
Notice of Pre-AIA or AIA Status
The present application is being examined under the pre-AIA first to invent provisions.
Priority
Acknowledgment is made of applicant’s claim for foreign priority under 35 U.S.C. 119 (a)-(d). Receipt is acknowledged of some certified copies of papers required by 37 CFR 1.55, where the received documents are the certified copies of the foreign applications, and none of the copies are in English. The Examiner notes that there are multiple claims to foreign priority with differing priority dates. In evaluating the priority claims to application KR10-2018-0004047, filed 01/11/2018, to application KR10-2018-0036377, filed 03/29/2018, and to application KR10-2019-0004188, filed 01/11/2019, the drawings in the certified copies of the foreign application do not clearly support the claims of the instant application in their entirety. As such, neither the claim to priority date 01/11/2018 nor to 03/29/2018 will be recognized by the Examiner until a certified English translation of the foreign application(s) is/are filed, and the English translation(s) demonstrate(s) full support of all claim limitations presented by the Applicant. The priority date of 01/11/2019 will be recognized as pertaining to the instant application being a continuation of PCT/KR2019/000513.

Response to Arguments
Applicant's arguments filed 07/29/2022 have been fully considered but they are not persuasive. 
Applicant asserts on page 8 that the speaker vector in Agiomyrgiannakis is different from the speaker vector in the instant application because the speaker vector in Agiomyrgiannakis is parameterized speech sound and does not correspond to speaker information associated with the learning speech data. The Examiner respectfully disagrees with this assertion, as there is nothing to indicate in the claim language that speaker information could not be interpreted as parameterized speech sounds extracted from the speech of a particular speaker. Thus, Agiomyrgiannakis teaches speaker information associated with the learning speech data ([0038-9],[0054]).
Applicant further asserts on pages 9-10 that Agiomyrgiannakis does not teach training a multilingual neural network model using features corresponding to those recited in the claim. The Examiner respectfully disagrees with this assertion, as Agiomyrgiannakis teaches training a neural network to associate a transcribed form of text with parameterized speech using a set of speaker vectors [0050], where the set of speaker vectors is made using samples of speech recited by both a reference and colloquial speaker, i.e. learning speech data of the first and second language corresponding to the learning text of the first and second language, respectively, and reference text strings in a reference and colloquial language, i.e. learning text of the first and second language [0035-6],[0038-9],[0047-9], and where, as previously discussed, a speaker vector can be extracted from speech of the reference or colloquial speaker, i.e. first and second speaker information [0038-9]. It should also be noted that the claim recites that the training of the model is ‘based on’ the aforementioned parameters, which is not the same as requiring the parameters to be the precise inputs during training. Agiomyrgiannakis teaches training the model using an input speech signal, text string, and speaker vectors to generate labels. Because the training information is derived from the speaker vectors and enriched transcriptions previously presented, Agiomyrgiannakis teaches that the model is trained ‘based on’ the required parameters [0051-2],[0109-112]. Additionally, Agiomyrgiannakis teaches that the reference and colloquial languages need not be the same, which reads on a multilingual model [0037:1-3].
Applicant asserts on pages 11-13 that Gabryjelski in view of Agiomyrgiannakis does not teach generating output speech data by inputting the text of the second language and articulatory feature of the speaker regarding the first language into the neural network model. The Examiner respectfully disagrees with this assertion, beginning with the piecemeal analysis of references. In response to applicant's arguments against the references individually, one cannot show nonobviousness by attacking references individually where the rejections are based on combinations of references.  See In re Keller, 642 F.2d 413, 208 USPQ 871 (CCPA 1981); In re Merck & Co., 800 F.2d 1091, 231 USPQ 375 (Fed. Cir. 1986). Gabryjelski teaches a TTS module that uses the translated text in the second language, characteristics such as tonality detected in a speech in the original language, and a voice print model, to generate a speech in the second language in the original actor’s voice [0023],[0057],[0066]. What Gabryjelski does not teach is that this is specifically done using a neural network model. Agiomyrgiannakis, however, teaches that the TTS system, which uses a trained neural network [0028], receives an input text string to produce a spoken rendering of the input text string using the features of the reference speaker to synthesize the speech in the voice of the reference speaker [0045],[0059],[0092]. The TTS system does this by developing an enriched transcription, which is labeling text with temporal parameters, previously discussed as being parameterized speech features, or ‘articulatory features’ [0038-9],[0054]. Further, Applicant’s arguments characterize the cited portions of the office action as pertaining solely to training, which is not representative of the cited text, and Applicant further points to training-specific portions of the text that are not relied upon by the Examiner, which has no bearing on whether or not Agiomyrgiannakis teaches the claim language in question. 
The remaining arguments on page 14 with respect to the amendments to the independent claims have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument. 
Hence, Applicant’s arguments are not persuasive.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1, 2, 5-7, 9-11, 13, and 15-17 is/are rejected under 35 U.S.C. 103 as being unpatentable over Gabryjelski et al. (U.S. PG Pub No. 2020/0058289), hereinafter Gabryjelski, in view of Agiomyrgiannakis et al. (U.S. PG Pub No. 2016/0140951), hereinafter Agiomyrgiannakis, and further in view of Kim et al. (“Character-Aware Neural Language Models”, 2016), hereinafter Kim.

Regarding claims 1 and 7, Gabryjelski teaches
(claim 1) A speech translation method using a multilingual text-to-speech synthesis model (an automatic dubbing method [0004]), comprising:
(claim 7) A video translation method using a multilingual text-to-speech synthesis model (an automatic dubbing method [0004]), comprising:

generating a ... multilingual text-to-speech synthesis model trained based on a learning text of a first language, learning speech data of the first language corresponding to the learning text of the first language, first speaker information associated with the learning speech data of the first language ... (a voice print model, i.e. text-to-speech synthesis model, is created, i.e. generating, for a voice based on the speeches of the voice in a first language, and may be trained based on training data that includes the speeches of the speaker, i.e. learning speech data of the first language corresponding to the learning text of the first language, and the associated phonemes from a STT process, i.e. learning text of a first language [0060], [0063-4], where a voice print model may be associated with a particular speaker or extracted characteristics, such as male voice characteristics of the extracted speeches, i.e. first speaker information associated with the learning speech data of the first language [0070], and where the voice print model may be used to generate speech in a second language, i.e. multilingual TTS synthesis model [0066]);
(claim 1) receiving input speech data of a speaker's speech regarding the first language and an articulatory feature of the speaker regarding the first language (the audio processing module extracts the speech of a voice from an audio portion of media content, i.e. receiving input speech data of a speaker’s speech [0032], where the speech is in an original language, i.e. regarding the first language [0023], and characteristics of the speech such as tonality of the speech may be detected, i.e. articulatory feature of the speaker regarding the first language [0057]);
(claim 7) receiving video data including input speech data of a speaker’s speech regarding the first language, a text of the first language corresponding to the input speech data of the first language, and an articulatory feature of the speaker regarding the first language (the audio processing module extracts the speech of a voice from an audio portion of media content, where the content can be a movie, TV program, video clip, or video game, i.e. receiving video data including input speech data of a speaker’s speech [0032],[0035], where the speech is in an original language, i.e. regarding the first language [0023], a speech to text module converts the speech into text, i.e. a text of the first language corresponding to the input speech data of the first language, where the speech is in a first language and the text is also in the first language [0023],[0057],[0060], and characteristics of the speech such as tonality of the speech may be detected, i.e. articulatory feature of the speaker regarding the first language [0057]);
(claim 7) deleting the input speech data of the first language from the video data (the extracted speech of the voice from the media content, i.e. input speech data of the first language, is replaced with the generated replacement speeches, i.e. deleting…from the video data [0032]);
(claim 1) converting the input speech data of the first language into a text of the first language (a speech to text module converts the speech into text, i.e. converting the input speech data…into a text, where the speech is in a first language and the text is also in the first language [0023],[0057],[0060]);
converting the text of the first language into a text of the second language (a machine translation module translates the text in a first language into text in a second language [0060]); and
generating output speech data for the text of the second language that simulates the speaker's speech regarding the first language by inputting the text of the second language and the articulatory feature of the speaker regarding the first language to the ... multilingual text-to-speech synthesis model (the translated text in the second language is used by the TTS module, i.e. inputting the text of the second language, along with characteristics such as tonality detected in a speech in the original language, i.e. inputting …the articulatory feature of the speaker regarding the first language, and based on the voice print model, i.e. multilingual TTS synthesis model, to generate a speech in the second language in the original actor’s voice, i.e. generating output speech data for the text of the second language that simulates the speaker's speech regarding the first language [0023],[0057],[0066]).  
 (claim 7) combining the output speech data for the text of the second language with the video data (the extracted speech of the voice from the media content is replaced with, i.e. combining…with the video data, the generated replacement speeches in the second language, i.e. output speech data for the text of the second language [0032],[0066]).  
While Gabryjelski provides the use of a trained model and speech characteristics for the synthesis into speech of translated text, and a STT assigning phonemes to speech, Gabryjelski does not specifically teach that the model is a neural network, or that the training data is text divided into syllables, characters, or phonemes and embedded into vectors, and thus does not teach
... a single artificial neural network multilingual text-to-speech synthesis model trained based on a learning text of a first language, and learning speech data of the first language corresponding to the learning text of the first language, first speaker information associated with the learning speech data of the first language, a learning text of a second language, and learning speech data of the second language corresponding to the learning text of the second language, and second speaker information associated with the learning speech data of the second language;
generating output speech data for the text of the second language that simulates the speaker's speech regarding the first language by inputting the text of the second language and the articulatory feature of the speaker regarding the first language to the single artificial neural network multilingual text-to-speech synthesis model; and 
wherein each of the learning text of the first language and the learning text of the second language includes a plurality of text embedding vectors corresponding to text divided by units of a syllable, a character, or a phoneme.  
Agiomyrgiannakis, however, teaches generating a single artificial neural network multilingual text-to-speech synthesis model trained based on a learning text of a first language, learning speech data of the first language corresponding to the learning text of the first language, first speaker information associated with the learning speech data of the first language, a learning text of a second language, learning speech data of the second language corresponding to the learning text of the second language, and second speaker information associated with the learning speech data of the second language (a neural network may be used to generate speech parameters to synthesize speech in reference or colloquial languages, where the languages need not be the same, i.e. a single artificial neural network multilingual text-to-speech synthesis model, where the NN is trained, i.e. generate...trained [0028],[0037:1-6], to associate a transcribed form of text with parameterized speech using a set of speaker vectors [0050], where a speaker vector can be extracted from speech of the reference speaker, i.e. first speaker information associated with the learning speech data of the first language [0038], and the set of speaker vectors is made [0047-9], using samples of speech recited by a reference speaker, i.e. learning speech data of the first language corresponding to the learning text of the first language, and reference text strings in a reference language, i.e. learning text of a first language [0035],[0038], and samples of speech recited by a colloquial speaker, i.e. learning speech data of the second language corresponding to the learning text of the second language, and reference text strings in a colloquial language, i.e. learning text of a second language [0036],[0039], and where a speaker vector can be extracted from speech of the colloquial speaker, i.e. second speaker information associated with the learning speech data of the second language [0039], where the reference language and the colloquial language need not be the same [0037:1-3]);
generating output speech data for the text of the second language that simulates the speaker's speech regarding the first language by inputting the text of the second language and the articulatory feature of the speaker regarding the first language to the single artificial neural network multilingual text-to-speech synthesis model (TTS synthesis system, which may use a trained neural network, i.e. single artificial neural network multilingual text-to-speech synthesis model [0028], receives an input text string, i.e. inputting the text of the second language, to produce a spoken rendering of the input text string, i.e. generating output speech data for the text of the second language [0092], and the features of the reference speaker are used by the TTS system to synthesize speech in a voice of the reference speaker, i.e. simulates the speaker's speech regarding the first language by inputting…the articulatory feature of the speaker regarding the first language [0045],[0059]).
Where Gabryjelski teaches that the text has been translated into the second language [0066].
Gabryjelski and Agiomyrgiannakis are analogous art because they are from a similar field of endeavor in translating and synthesizing speech using particular voice characteristics. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the use of a trained model and speech characteristics for the synthesis into speech of translated text teachings of Gabryjelski with the use of a trained neural network as the model as taught by Agiomyrgiannakis. It would have been obvious to combine the references to use a NN to generate parametric representations of speech that can be used to alter or adjust characteristics of the synthesized voice based on different forms of statistical adaptation (Agiomyrgiannakis [0028]).
While Gabryjelski in view of Agiomyrgiannakis provides labeling text strings where each label identifies a phonetic speech unit, Gabryjelski in view of Agiomyrgiannakis does not specifically teach that the speech units are embedded into vectors, and thus does not teach
wherein each of the learning text of the first language and the learning text of the second language includes a plurality of text embedding vectors corresponding to text divided by units of a syllable, a character, or a phoneme.  
Kim, however, teaches wherein each of the learning text of the first language and the learning text of the second language includes a plurality of text embedding vectors corresponding to text divided by units of a syllable, a character, or a phoneme (the RNN language model, i.e. artificial neural network...model, receives a training corpus, i.e. learning text, where the input to the model is a set of character embeddings, i.e. plurality of text embedding vectors corresponding to text divided by units of ... a character...(p.2743, col.1), and the model is applied to various languages, i.e. each of the learning text of the first language and the learning text of the second language (p.2744 col.1)).  
Gabryjelski, Agiomyrgiannakis, and Kim are analogous art because they are from a similar field of endeavor in the processing of many languages. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the labeling text strings where each label identifies a phonetic speech unit teachings of Gabryjelski, as modified by Agiomyrgiannakis, with the use of character-level embeddings of different languages as taught by Kim. It would have been obvious to combine the references to enable a model with significantly fewer parameters for applications where model size may be an issue (Kim (p.2741 col.2)).

Regarding claim 2, Gabryjelski in view of Agiomyrgiannakis and Kim teaches claim 1, and Agiomyrgiannakis further teaches
the articulatory feature of the speaker regarding the first language is generated by extracting a feature vector from the input speech data of the first language (the speech features, i.e. articulatory feature of the speaker regarding the first language, are extracted from a plurality of recorded reference speech utterances of a reference speaker, i.e. generated by extracting … from the input speech data of the first language, to generate a set of reference-speaker vectors, i.e. feature vector [0045]).  
Where the motivation to combine is the same as previously presented.

Regarding claims 5 and 9, Gabryjelski in view of Agiomyrgiannakis and Kim teaches claims 1 and 7, and Gabryjelski further teaches
generating a prosody feature of the speaker regarding the first language from the input speech data of the first language (the STT module detects characteristics from the extracted speech of the media content, i.e. generating a…feature of the speaker regarding the first language from the input speech data of the first language [0032],[0057], where the characteristics include stress, tonality, speed, and inflection, i.e. prosody feature [0032]), 
wherein the generating the output speech data for the text of the second language that simulates the speaker's speech regarding the first language includes generating output speech data for the text of the second language that simulates the speaker's speech regarding the first language by inputting the text of the second language, the articulatory feature, and the prosody feature of the speaker regarding the first language to … text-to-speech multilingual synthesis model (the translated text in the second language is used by the TTS module, i.e. inputting the text of the second language, along with characteristics such as tonality, i.e. inputting …the articulatory feature, stress, tonality, speed, and inflection, i.e. inputting …the prosody feature of the speaker, and based on the voice print model, i.e. text-to-speech multilingual synthesis model, to generate a speech in the second language in the original actor’s voice, i.e. generating output speech data for the text of the second language that simulates the speaker's speech regarding the first language [0066]).
Where Agiomyrgiannakis teaches that the model is a trained neural network [0028], as previously cited, and the motivation to combine is the same as previously presented.

Regarding claim 6, Gabryjelski in view of Agiomyrgiannakis and Kim teaches claim 5, and Gabryjelski further teaches 
the prosody feature includes at least one of information on utterance speed, information on accentuation, information on voice pitch, and information on pause duration (the STT module detects characteristics from the extracted speech of the media content, i.e. prosody feature, and where the characteristics include, i.e. at least one of, stress, i.e. information on accentuation, tonality, i.e. information on voice pitch, speed, i.e. information on utterance speed, and inflection, i.e. information on accentuation [0032]).  

Regarding claim 10, Gabryjelski in view of Agiomyrgiannakis and Kim teaches claim 1, and Gabryjelski further teaches
A non-transitory computer readable storage medium having recorded thereon a program comprising instructions for performing the steps of the method (a computer system includes a computer readable storage medium on which are stored computer readable instructions, i.e. having recorded thereon a program comprising instructions, which can be executed by the one or more processors, i.e. performing the steps of the method [0064],[0087]).

Regarding claims 11 and 15, Gabryjelski in view of Agiomyrgiannakis and Kim teaches claims 1 and 7, and Gabryjelski further teaches
the articulatory feature of the speaker regarding the first language includes a speaker ID or a speaker embedding vector (voice characteristic data of speeches, i.e. articulatory feature of the speaker regarding the first language, may be stored in a database associated with a particular actor, and a speech may be given a speaker ID, i.e. includes a speaker ID [0047]).  

Regarding claims 13 and 17, Gabryjelski in view of Agiomyrgiannakis and Kim teaches claims 1 and 7, and Gabryjelski further teaches
obtaining a speaker ID of the speaker (voice characteristic data of speeches may be stored in a database associated with a particular actor, and a speech may be given a speaker ID, i.e. speaker ID of the speaker, which may be obtained from the metadata, i.e. obtaining [0047]); and
obtaining a speaker --voice characteristic-- for the speaker based on the speaker ID, wherein the articulatory feature of the speaker includes the speaker --voice characteristic-- (the voice characteristic data of an actor, i.e. speaker voice characteristic for the speaker, where voice characteristics may include spectrum, pitch, and tone, i.e. articulatory feature of the speaker includes the speaker voice characteristic, and may be stored in a database associated with a particular actor’s voice with a speaker ID, i.e. based on the speaker ID [0047-8]).  
And Agiomyrgiannakis further teaches a speaker embedding vector for the speaker ..., wherein the articulatory feature of the speaker includes the speaker embedding vector (the speech features are extracted from a plurality of recorded speech utterances of the reference and colloquial speakers, i.e. articulatory feature of the speaker, where the features are used to generate, respectively, reference-speaker vectors and colloquial-speaker vectors, i.e. speaker embedding vector for the speaker... articulatory feature of the speaker includes the speaker embedding vector [0045-6]).  
Where the motivation to combine is the same as previously presented.

Regarding claim 16, Gabryjelski in view of Agiomyrgiannakis and Kim teaches claim 7, and Kim further teaches
the learning text of the first language and learning text of the second language includes a plurality of text embedding vectors separated by letters or phonemes (the RNN language model receives a training corpus, i.e. learning text, where the input to the model is a set of character embeddings, i.e. plurality of text embedding vectors separated by letters...(p.2743, col.1), and the model is applied to various languages, i.e. learning text of the first language and the learning text of the second language (p.2744 col.1)).  
Where the motivation to combine is the same as previously presented.

Claim(s) 3, 4, and 8 is/are rejected under 35 U.S.C. 103 as being unpatentable over Gabryjelski, in view of Agiomyrgiannakis, in view of Kim, and further in view of Meng et al. (U.S. Patent No. 9342509), hereinafter Meng.

Regarding claims 3 and 8, Gabryjelski in view of Agiomyrgiannakis and Kim teaches claims 1 and 7, and Gabryjelski further teaches
wherein the generating the output speech data for the text of the second language that simulates the speaker's speech regarding the first language includes generating output speech data for the text of the second language that simulates the speaker's speech regarding the first language by inputting the text of the second language, the articulatory feature, and the emotion feature of the speaker regarding the first language to the … multilingual text-to-speech synthesis model (the translated text in the second language is used by the TTS module, i.e. inputting the text of the second language, along with characteristics such as tonality, i.e. inputting …the articulatory feature, stress, tonality, speed, and inflection, and stress and tonality are indicative of an emotion, such as anger or being haughty, where a voice model can be identified based on a particular emotion, i.e. emotion feature, and based on the voice print model, i.e. multilingual text-to-speech synthesis model, to generate a speech in the second language in the original actor’s voice, i.e. generating output speech data for the text of the second language that simulates the speaker's speech regarding the first language [0026],[0062-3],[0066]).
Where Agiomyrgiannakis teaches that the model is a trained neural network [0028], as previously cited, and the motivation to combine is the same as previously presented.
While Gabryjelski in view of Agiomyrgiannakis and Kim provides recognition that speech signals carry information indicative of emotion, and using the information to generate synthesized speech with specific emotions, Gabryjelski in view of Agiomyrgiannakis and Kim does not specifically teach the generation of an emotion feature, and thus does not teach
generating an emotion feature of the speaker regarding the first language from the input speech data of the first language.
Meng, however, teaches generating an emotion feature of the speaker regarding the first language from the input speech data of the first language (non-text information, such as emotional expressions, are extracted, i.e. generating an emotion feature, from the source speech of an original speaker in a language to be translated, i.e. speaker regarding the first language from the input speech data of the first language (1:37-40),(2:12-28)).
Gabryjelski, Agiomyrgiannakis, Kim, and Meng are analogous art because they are from a similar field of endeavor in translating and synthesizing speech using particular voice characteristics. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the recognition that speech signals carry information indicative of emotion, and using the information to generate synthesized speech with specific emotions teachings of Gabryjelski, as modified by Agiomyrgiannakis and Kim, with the specific extraction of emotional expressions as taught by Meng. It would have been obvious to combine the references to enable assisting in the understanding of the meaning of the original speaker by preserving emotional expressions (Meng (1:37-40)).

Regarding claim 4, Gabryjelski in view of Agiomyrgiannakis, Kim, and Meng teaches claim 3, and Meng further teaches
wherein the emotion feature includes information on emotions inherent in a content uttered by the speaker (emotional expressions, i.e. emotion feature, include laughter and sigh in the source speech, i.e. content uttered by the speaker (2:51-61), where the emotional expression identifies the real intention of the speech, i.e. emotions inherent in a content (4:4-13)).
Where the motivation to combine is the same as previously presented.

Claim(s) 14 and 18 is/are rejected under 35 U.S.C. 103 as being unpatentable over Gabryjelski, in view of Agiomyrgiannakis, in view of Kim, and further in view of Agiomyrgiannakis et al. (U.S. PG Pub No. 2016/0005403), hereinafter Agiomyrgiannakis 2.

Regarding claims 14 and 18, Gabryjelski in view of Agiomyrgiannakis and Kim teaches claims 1 and 7.
While Gabryjelski in view of Agiomyrgiannakis and Kim provides the simulation of speech using voice characteristics of different speakers, Gabryjelski in view of Agiomyrgiannakis and Kim does not specifically teach the use of the features of a second speaker rather than the first or original speaker, and thus does not teach
the output speech data for the text of the second language includes simulated articulatory feature of the speaker regarding the second language that corresponds to the articulatory feature of the speaker regarding the first language.  
Agiomyrgiannakis 2, however, teaches the output speech data for the text of the second language includes simulated articulatory feature of the speaker regarding the second language that corresponds to the articulatory feature of the speaker regarding the first language (a voice conversion system, such as a text-to-speech system, for speech synthesis, i.e. output speech data for the text...includes simulated articulatory feature, allows converting the first voice characteristics of recorded speech, i.e. articulatory feature of the speaker regarding the first language, to the corresponding second voice characteristics associated with the second voice, i.e. simulated articulatory feature of the speaker regarding the second language, where the first and second voice characteristics are associated based on a comparison between speech sounds associated with the two voices, i.e. simulated articulatory feature...corresponds [0017],[0019-20]).  
Where Gabryjelski further teaches that the output speech data is for text in the second language [0066].
And Agiomyrgiannakis further teaches that the speakers are associated with the first and second languages [0037:1-3],[0045-6].
Gabryjelski, Agiomyrgiannakis, Kim, and Agiomyrgiannakis 2 are analogous art because they are from a similar field of endeavor in synthesizing speech using particular voice characteristics, such as in speech translation. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Gabryjelski, as modified by Agiomyrgiannakis and Kim, on simulating speech using voice characteristics of different speakers, with the use of features of a second voice rather than the first voice as taught by Agiomyrgiannakis 2. It would have been obvious to combine the references to enable the speech synthesis to have the characteristics of a target voice instead of the source voice (Agiomyrgiannakis 2 [0019]).

Claims 19 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Gabryjelski, in view of Agiomyrgiannakis, in view of Kim, and further in view of Lee et al. (U.S. Patent No. 10867136), hereinafter Lee.

Regarding claim 19, Gabryjelski in view of Agiomyrgiannakis and Kim teaches claim 1, and Agiomyrgiannakis further teaches
the single artificial neural network multilingual text-to-speech synthesis model ... is configured to (a neural network may be used to generate speech parameters to synthesize speech, i.e. single artificial neural network multilingual text-to-speech synthesis model [0028]):
acquire an embedding vector of the speaker based on the articulatory feature of the speaker (feature vectors are derived or extracted from speech of a reference or colloquial speaker, i.e. acquire an embedding vector of the speaker [0038-9], where extracted speech features such as envelope parameters, fundamental frequencies, and voicing may be used to generate the feature vectors, i.e. based on the articulatory feature of the speaker [0054]); and
generate an output ... based on the embedding vector of the speaker (the TTS system uses the speaker vectors, i.e. based on the embedding vector of the speaker, to synthesize speech in the voice of the reference speaker, i.e. generate an output [0059]).
While Gabryjelski in view of Agiomyrgiannakis and Kim provides the functions of the decoder performed by an artificial neural network, Gabryjelski in view of Agiomyrgiannakis and Kim does not specifically teach that the neural network includes an encoder and a decoder, and thus does not teach
the single artificial neural network multilingual text-to-speech synthesis model includes an encoder and a decoder, and the decoder is configured to:
generate an output of the decoder based on the embedding vector of the speaker.  
Lee, however, teaches the single artificial neural network multilingual text-to-speech synthesis model includes an encoder and a decoder (the automated interpretation apparatus may have a neural network, i.e. the single artificial neural network multilingual text-to-speech synthesis model, implemented by encoders and decoders, i.e. includes an encoder and a decoder (4:23-40),(12:12-17),(17:45-47)), and the decoder is configured to:
generate an output of the decoder based on the embedding vector of the speaker (the recognition decoder may determine, i.e. generate an output of the decoder, a first language sentence based on the first feature vector, which is based on the voice signal, i.e. based on the embedding vector of the speaker (4:23-40)).  
Gabryjelski, Agiomyrgiannakis, Kim, and Lee are analogous art because they are from a similar field of endeavor in speech translation. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Gabryjelski, as modified by Agiomyrgiannakis and Kim, in which the functions of a decoder are performed by a neural network, with the specific recitation of an encoder and decoder as taught by Lee. It would have been obvious to combine the references to enable the different components of the system to have individualized training (Lee (18:59-63)).

Regarding claim 20, Gabryjelski in view of Agiomyrgiannakis, Kim, and Lee teaches claim 19, and Agiomyrgiannakis further teaches
the encoder includes a module that outputs a hidden state in response to receiving the embedding vector of the speaker, and the hidden state includes information indicating from which input text a speech is to be synthesized (the neural network can include a hidden layer, i.e. module that outputs a hidden state, and can map enriched transcriptions to parameterized speech, i.e. hidden state includes information indicating from which input text a speech is to be synthesized, where a text string, i.e. input text, is processed into a symbolic representation by adding a sequence of labels, such as temporal parameters that are typically referred to as a speaker feature vector, i.e. receiving the embedding vector of the speaker [0038],[0092],[0116]).
Where Lee teaches that the functionality of processing text is performed by an encoder (4:23-40),(12:12-17),(17:45-47).
And where the motivation to combine is the same as previously presented.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to NICOLE A K SCHMIEDER whose telephone number is (571)270-1474. The examiner can normally be reached 8:00 - 5:00 M-F.
Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre-Louis Desir, can be reached on (571) 272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/NICOLE A K SCHMIEDER/Examiner, Art Unit 2659

/PIERRE LOUIS DESIR/Supervisory Patent Examiner, Art Unit 2659