DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claims 1-10 are pending in this application.
Priority
Acknowledgment is made of applicant’s claim for foreign priority under 35 U.S.C. 119 (a)-(d). Receipt is acknowledged of some of the necessary certified copies of papers required by 37 CFR 1.55. A translation of said application has not been made of record in accordance with 37 CFR 1.55. See MPEP §§ 215 and 216.
Information Disclosure Statement
The information disclosure statements (IDS) submitted on 07/10/2020 and 02/26/2021 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.
Specification
The disclosure is objected to because of the following informalities: On pg. 23, line 26 of the as-filed specification, the text is described as being input into “encoder 910” as per Fig. 9. The associated figure, however, labels the encoder as element 810.  
Appropriate correction is required.
Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent. See In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets 
Claims 1-3 and 5-10 are rejected on the ground of nonstatutory double patenting as being unpatentable over copending Application No. 16/682390, which is allowed but not issued, in view of Gabryjelski (US 2020/0058289), and, specifically for claims 3 and 8, further in view of Meng (US 9342509). Each independent and dependent claim of the instant application (I) corresponds to a claim of the allowed application (A) as follows:
Claim 1 (I): Claim 1 (A)
Claim 2 (I): Claim 3 (A)
Claim 3 (I): Claim 4 (A)
Claim 5 (I): Claim 5 (A)
Claim 6 (I): Claim 6 (A)
Claim 7 (I): Claim 1 (A)
Claim 8 (I): Claim 4 (A)
Claim 9 (I): Claim 5 (A)
Claim 10 (I): Claim 10 (A)
Please see the following for more detail regarding the grounds of rejection.
Claims 1, 7, and 10 are provisionally rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1 and 10 of copending Application No. 16/682390, which is allowed, but not issued, in view of Gabryjelski (US 2020/0058289). 
The claims of the copending application are similar to those of the instant application except for the limitations that are left unbolded for the instant application as shown below. The missing limitations are “receiving input speech data of the first language …; converting the input speech data of the first language into a text of the first language; converting the text of the first language into a text of the second language”. However, the Examiner notes that such limitations are well known in the art. Gabryjelski teaches the receipt of 
Additionally, claim 7 has the missing limitations “receiving video data…; deleting the input speech data of the first language from the video data; …and combining the output speech data with the video data”, which are also noted to be well known in the art. Gabryjelski teaches the receipt of video media content that includes audio speech, and the replacement of extracted speech from the media content with generated replacement speech (see [0032]). Please see the claim mappings below.
Therefore, it would have been obvious to one of ordinary skill in the art to have modified the copending application with the teachings of Gabryjelski for the purpose of enabling a user to customize dubbing, such as applying a translating function with a particular actor’s voice for automatic cross-language dubbing (see Gabryjelski [0030]).

Instant Application: 16925888
Claim 1: A speech translation method using a multilingual text-to-speech synthesis model, comprising:
acquiring a single artificial neural network text-to-speech synthesis model trained based on a learning text of a first language and learning speech data of the first language corresponding to the learning text of the first language, and a learning text of a second language and learning speech data of the second language corresponding to the learning text of the second language;
receiving input speech data of the first language and an articulatory feature of a speaker regarding the first language;
converting the text of the first language into a text of the second language; and
generating output speech data for the text of the second language that simulates the speaker's speech by inputting the text of the second language and the articulatory feature of the speaker to the single artificial neural network text-to-speech synthesis model.

Copending Application No. 16/682390
receiving first learning data including a learning text of a first language and learning speech data of the first language corresponding to the learning text of the first language;
receiving second learning data including a learning text of a second language and learning speech data of the second language corresponding to the learning text of the second language;
generating a single artificial neural network text-to-speech synthesis model by learning similarity information between phonemes of the first language and phonemes of the second language based on the first learning data and the second learning data;
receiving an articulatory feature of a speaker regarding the first language;
receiving an input text of the second language; and
generating output speech data for the input text of the second language that simulates the speaker's speech by inputting the input text of the second language and the articulatory feature of the speaker regarding the first language to the single artificial neural network text-to-speech synthesis model.

As to claims 5 and 9, these claims are rejected over copending Application, in view of Gabryjelski, which teaches the missing limitation of “generating a prosody feature…” (see Gabryjelski [0032],[0057] where the STT module detects characteristics from the extracted speech of the media content including stress, tonality, speed, and inflection). The motivation to combine is the same as previously presented.
As to claims 3 and 8, these claims are rejected over copending Application No. 16/682390, in view of Gabryjelski, and further in view of Meng, which teaches the missing limitation of “generating an emotion feature…” (see Meng (1:37-40),(2:12-28), where non-text information, such as emotional expressions, is extracted). Therefore, it would have been obvious to one of ordinary skill in the art to have modified the copending application and Gabryjelski with the teachings of Meng for the purpose of assisting in the understanding of the meaning of the original speaker by preserving emotional expressions (Meng (1:37-40)).
Claim 1 is further provisionally rejected on the ground of nonstatutory double patenting as being unpatentable over claim 7 of copending Application No. 16/682390, which is allowed, but not issued. Although the claims as allowed are not identical, they are not patentably distinct from each other because the claims of the copending application anticipate the claim of the instant application. Please see below for the mapping in the table, where the bolded limitations indicate the corresponding limitations between the copending application and instant application.
Instant Application: 16925888
Claim 1: A speech translation method using a multilingual text-to-speech synthesis model, comprising:
acquiring a single artificial neural network text-to-speech synthesis model trained based on a learning text of a first language and learning speech data of the first language corresponding to the learning text of the first language, and a learning text of a second language and learning speech data of the second language corresponding to the learning text of the second language;
receiving input speech data of the first language and an articulatory feature of a speaker regarding the first language;
converting the input speech data of the first language into a text of the first language;
converting the text of the first language into a text of the second language; and
generating output speech data for the text of the second language that simulates the speaker's speech by inputting the text of the second language and the articulatory feature of the speaker to the single artificial neural network text-to-speech synthesis model.

Copending Application No. 16/682390
receiving first learning data including a learning text of a first language and learning speech data of the first language corresponding to the learning text of the first language;
receiving second learning data including a learning text of a second language and learning speech data of the second language corresponding to the learning text of the second language;
generating a single artificial neural network text-to-speech synthesis model by learning similarity information between phonemes of the first language and phonemes of the second language based on the first learning data and the second learning data;
receiving an input speech of the first language; extracting a feature vector from the input speech of the first language to generate an articulatory feature of a speaker regarding the first language;
converting the input speech of the first language into an input text of the first language;
converting the input text of the first language into an input text of the second language; and
generating output speech data of the second language for the input text of the second language that simulates the speaker's speech by inputting the input text of the second language and the articulatory feature of the speaker regarding the first language to the single artificial neural network text-to-speech synthesis model.

This is a provisional nonstatutory double patenting rejection.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any 
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1, 2, 5-7, 9, and 10 is/are rejected under 35 U.S.C. 103 as being unpatentable over Gabryjelski et al. (U.S. PG Pub No. 2020/0058289), hereinafter Gabryjelski, in view of Agiomyrgiannakis et al. (U.S. PG Pub No. 2016/0140951), hereinafter Agiomyrgiannakis.

Regarding claims 1 and 7, Gabryjelski teaches
(claim 1) A speech translation method using a multilingual text-to-speech synthesis model (an automatic dubbing method [0004]), comprising:
(claim 7) A video translation method using a multilingual text-to-speech synthesis model (an automatic dubbing method [0004]), comprising:

acquiring a … text-to-speech synthesis model trained based on a learning text … and learning speech data … (a voice print model, i.e. text-to-speech synthesis model, is created for a voice based on the speeches of the voice, i.e. acquiring, and may be trained based on training data that includes the speeches of the speaker, i.e. ;
(claim 1) receiving input speech data of the first language and an articulatory feature of a speaker regarding the first language (the audio processing module extracts the speech of a voice from an audio portion of media content, i.e. input speech data [0032], where the speech is in an original language, i.e. first language [0023], and characteristics of the speech such as tonality of the speech may be detected, i.e. articulatory feature of a speaker regarding the first language [0057]);
(claim 7) receiving video data including input speech data of the first language, a text of the first language corresponding to the input speech data of the first language, and an articulatory feature of a speaker regarding the first language (the audio processing module extracts the speech of a voice from an audio portion of media content, where the content can be a movie, TV program, video clip, or video game, i.e. receiving video data including input speech data [0032],[0035], where the speech is in an original language, i.e. first language [0023], and characteristics of the speech such as tonality of the speech may be detected, i.e. articulatory feature of a speaker regarding the first language [0057]);
(claim 7) deleting the input speech data of the first language from the video data (the extracted speech of the voice from the media content, i.e. input speech data of the first language, is replaced with the generated replacement speeches, i.e. deleting…from the video data [0032]);
converting the input speech data of the first language into a text of the first language (a speech to text module converts the speech, into text, i.e. converting the ;
converting the text of the first language into a text of the second language (a machine translation module translates the text in a first language into text in a second language [0060]); and
generating output speech data for the text of the second language that simulates the speaker's speech by inputting the text of the second language and the articulatory feature of the speaker to the … model (the translated text in the second language is used by the TTS module, i.e. inputting the text of the second language, along with characteristics such as tonality, i.e. inputting …the articulatory feature of the speaker, and based on the voice print model, i.e. model, to generate a speech in the second language in the original actor’s voice, i.e. generating output speech data for the text of the second language that simulates the speaker's speech [0066]); and
(claim 7) combining the output speech data with the video data (the extracted speech of the voice from the media content is replaced with, i.e. combining…with the video data, the generated replacement speeches, i.e. output speech data [0032]).  
While Gabryjelski provides the use of a trained model and speech characteristics for the synthesis into speech of translated text, Gabryjelski does not specifically teach that the model is a neural network, and thus does not teach
acquiring a single artificial neural network text-to-speech synthesis model trained based on a learning text of a first language and learning speech data of the first language corresponding to the learning text of the first language, and a learning text of a second language and learning speech data of the second language corresponding to the learning text of the second language;
generating output speech data for the text of the second language that simulates the speaker's speech by inputting the text of the second language and the articulatory feature of the speaker to the single artificial neural network text-to-speech synthesis model.
Agiomyrgiannakis, however, teaches acquiring a single artificial neural network text-to-speech synthesis model trained based on a learning text of a first language and learning speech data of the first language corresponding to the learning text of the first language, and a learning text of a second language and learning speech data of the second language corresponding to the learning text of the second language (a neural network may be used to generate speech parameters to synthesize speech, i.e. a single artificial neural network text-to-speech synthesis model, where the NN is trained, i.e. trained [0028], to associate a transcribed form of text with parameterized speech using a set of speaker vectors [0050], and the set of speaker vectors is made [0047-9], using samples of speech recited by a reference speaker, i.e. learning speech data of the first language corresponding to the learning text of the first language, and reference text strings in a reference language, i.e. learning text of a first language [0035],[0038], and samples of speech recited by a colloquial speaker, i.e. learning speech data of the second language corresponding to the learning text of the second language, and reference text strings in a colloquial ;
generating output speech data for the text of the second language that simulates the speaker's speech by inputting the text of the second language and the articulatory feature of the speaker to the single artificial neural network text-to-speech synthesis model (TTS synthesis system, which may use a trained neural network, i.e. single artificial neural network text-to-speech synthesis model [0028], receives an input text string, i.e. inputting the text of the second language, to produce a spoken rendering of the input text string, i.e. generating output speech data for the text of the second language [0092], and the features of the reference speaker are used by the TTS system to synthesize speech in a voice of the reference speaker, i.e. inputting…the articulatory feature of the speaker [0045],[0059]).
Gabryjelski and Agiomyrgiannakis are analogous art because they are from a similar field of endeavor in translating and synthesizing speech using particular voice characteristics. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Gabryjelski, which use a trained model and speech characteristics to synthesize translated text into speech, with the use of a trained neural network as the model as taught by Agiomyrgiannakis. The motivation to do so would have been to achieve a predictable result of generating parametric representations of speech that can be used to alter or adjust characteristics of the synthesized voice (Agiomyrgiannakis [0028]).


Regarding claim 2, Gabryjelski in view of Agiomyrgiannakis teaches claim 1, and Agiomyrgiannakis further teaches
the articulatory feature of the speaker regarding the first language is generated by extracting a feature vector from speech data uttered by the speaker in the first language (the speech features, i.e. articulatory feature of the speaker regarding the first language, are extracted from a plurality of recorded reference speech utterances of a reference speaker, i.e. generated by extracting … from speech data uttered by the speaker in the first language, to generate a set of reference-speaker vectors, i.e. feature vector [0045]).
Where the motivation to combine is the same as previously presented.

Regarding claims 5 and 9, Gabryjelski in view of Agiomyrgiannakis teaches claims 1 and 7, and Gabryjelski further teaches
generating a prosody feature of the speaker regarding the first language from the input speech data of the first language (the STT module detects characteristics from the extracted speech of the media content, i.e. generating a…feature of the speaker regarding the first language from the input speech data of the first language [0032],[0057], where the characteristics include stress, tonality, speed, and inflection, i.e. prosody feature [0032]), 
wherein the generating the output speech data for the text of the second language that simulates the speaker's speech includes generating output speech data for the text of the second language that simulates the speaker's speech by inputting the text of the second language, the articulatory feature, and the prosody feature of the speaker regarding the first language to … text-to-speech synthesis model (the translated text in the second language is used by the TTS module, i.e. inputting the text of the second language, along with characteristics such as tonality, i.e. inputting …the articulatory feature, stress, tonality, speed, and inflection, i.e. inputting …the prosody feature of the speaker, and based on the voice print model, i.e. model, to generate a speech in the second language in the original actor’s voice, i.e. generating output speech data for the text of the second language that simulates the speaker's speech [0066]).
Where Agiomyrgiannakis teaches that the model is a trained neural network [0028], as previously cited, and the motivation to combine is the same as previously presented.

Regarding claim 6, Gabryjelski in view of Agiomyrgiannakis teaches claim 5, and Gabryjelski further teaches 
the prosody feature includes at least one of information on utterance speed, information on accentuation, information on voice pitch, and information on pause duration (the STT module detects characteristics from the extracted speech of the media content, i.e. prosody feature, and where the characteristics include, i.e. at least one of, stress, i.e. information on accentuation, tonality, i.e. information on voice pitch, speed, i.e. information on utterance speed, and inflection, i.e. information on accentuation [0032]).  


Regarding claim 10, Gabryjelski in view of Agiomyrgiannakis teaches
A non-transitory computer readable storage medium having recorded thereon a program comprising instructions for performing the steps of the method (a computer system includes a computer readable storage medium on which are stored computer readable instructions, i.e. having recorded thereon a program comprising instructions, which can be executed by the one or more processors, i.e. performing the steps of the method [0064],[0087]).

Claim(s) 3, 4, and 8 is/are rejected under 35 U.S.C. 103 as being unpatentable over Gabryjelski, in view of Agiomyrgiannakis, and further in view of Meng et al. (U.S. Patent No. 9342509), hereinafter Meng.

Regarding claims 3 and 8, Gabryjelski in view of Agiomyrgiannakis teaches claims 1 and 7, and Gabryjelski further teaches
wherein the generating the output speech data for the text of the second language that simulates the speaker's speech includes generating output speech data for the text of the second language that simulates the speaker's speech by inputting the text of the second language, the articulatory feature, and the emotion feature of the speaker regarding the first language to the … text-to-speech synthesis model (the translated text in the second language is used by the TTS module, i.e. inputting the text of the second language, along with characteristics such as tonality, i.e. inputting …the articulatory feature, stress, tonality, speed, and .
Where Agiomyrgiannakis teaches that the model is a trained neural network [0028], as previously cited, and the motivation to combine is the same as previously presented.
While Gabryjelski in view of Agiomyrgiannakis provides recognition that speech signals carry information indicative of emotion, and using the information to generate synthesized speech with specific emotions, Gabryjelski in view of Agiomyrgiannakis does not specifically teach the generation of an emotion feature, and thus does not teach
generating an emotion feature of the speaker regarding the first language from the input speech data of the first language.
Meng, however, teaches generating an emotion feature of the speaker regarding the first language from the input speech data of the first language (non-text information, such as emotional expressions, are extracted, i.e. generating an emotion feature, from the source speech of an original speaker in a language to be translated, i.e. speaker regarding the first language from the input speech data of the first language (1:37-40),(2:12-28)).


Regarding claim 4, Gabryjelski in view of Agiomyrgiannakis and Meng teaches claim 3, and Meng further teaches
wherein the emotion feature includes information on emotions inherent in a content uttered by the speaker (emotional expressions, i.e. emotional feature, include laughter and sigh in the source speech, i.e. content uttered by the speaker (2:51-61), where the emotional expression identifies the real intention of the speech, i.e. emotions inherent in a content (4:4-13)).  
Where the motivation to combine is the same as previously presented.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: 
Liu et al. (U.S. PG Pub No. 2009/0037179): Automatically converting a voice using source voice and text information.
Chun et al. (U.S. Patent No. 9922641): Speech models for generating speech data in a second language different from a first language.
Kent (U.S. PG Pub No. 2011/0238407): Speech-to-speech translation system.
Rossano et al. (U.S. Patent No. 9552807): Automatic video dubbing using prosody evaluation.	
Any inquiry concerning this communication or earlier communications from the examiner should be directed to NICOLE A K SCHMIEDER whose telephone number is (571)270-1474. The examiner can normally be reached 8:00 - 5:00 M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre-Louis Desir, can be reached at (571) 272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-





/NICOLE A K SCHMIEDER/Examiner, Art Unit 2659

/PIERRE LOUIS DESIR/Supervisory Patent Examiner, Art Unit 2659