DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Introduction
This office action is in response to the amendment filed on 06/15/2022. Claims 1-16 are pending, and Claims 1-16 have been examined.

Response to Amendment
Amendment filed 06/15/2022 has been considered by Examiner. Amendments to the Specification have been considered by Examiner and the objections to the Specification have been withdrawn. Amendments to the Claims have been considered by Examiner and the objections to Claims 4, 8, 9, and 12 have been withdrawn. Amendments to the Claims have been considered by Examiner and the rejections of Claims 4, 5, 8 and 10, under 35 U.S.C. 112(b) have been withdrawn, with the exception of the rejection of Claim 4, “the speaker’s voice” on Ln 4-5.

Response to Arguments
Applicant's arguments filed 06/15/2022, with regard to the rejections of Claims 1-16 under 35 U.S.C. 103, have been fully considered but they are not persuasive.
Applicant argues that in Claims 1 and 7, while Kano discloses training data using TTS, none of the references alone or in combination discloses a multi-lingual system with these training attributes.
Examiner argues that the combination of references and citations of Claim 1 (Weiss: Abstract Ln 1-3; Abstract Ln 1; Abstract Ln 9; 4. Experiments Ln 4-5; Abstract Ln 11; Abstract Ln 1-6. Johnson: Abstract Ln 10; 1 Introduction Zero-shot translation; 4.1 Datasets. Kano: 5.1 Experimental Set-Up Ln 1-6. Gao: [0023] Ln 3-4) and Claim 7 (Weiss: Abstract Ln 1-3; Abstract Ln 1; Abstract Ln 9; Abstract Ln 11; Abstract Ln 1-6. Johnson: Abstract Ln 10; 1 Introduction Zero-shot translation; 4.1 Datasets. Kano: 5.1 Experimental Set-Up Ln 1-6. Gao: [0026] 5th Line from end) in combination teaches the claimed limitations, as described in the rejections to the claims.
Applicant does not explain why the citations provided by the examiner fail to teach the claimed limitations. Therefore Examiner believes the claimed limitations are taught by their respective cited references, and likewise dependent Claims 2-6, 8 and 9.

Applicant argues that in Claim 10, none of the references, including Patel, teach at least configuring a text to speech module to generate speech based on the text stream output and punctuation prediction information, speaker diarization and voice characteristics meta-information.
Examiner argues that the combination of references and citations of Claim 10 (Gao: Abstract Ln 1; [0026] Ln 1-5; [Abstract] Ln 1-4; [0026]; [0026] Ln 11-13, Ln 16-18; [0026] Ln 21-24; [0020] Ln 9-10; [0026] last 6 lines; [0026]; [0026] last 6 lines; [0020] Ln 9-10. Weiss: Abstract Ln 1-3. Johnson: Abstract Ln 10; Abstract Ln 1-2. Patel: [0027] Ln 9-12; [0040] Ln 14-16; [0079] Ln 3-11; [0040] Ln 14-16; [0079] Ln 3-11; [0003] last 3 lines) in combination teaches the claimed limitations, as described in the rejections to the claims.
Applicant does not explain why the citations provided by the examiner fail to teach the claimed limitations. Therefore Examiner believes the claimed limitations are taught by their respective cited references, and likewise dependent Claims 11 and 12.

Applicant argues that in Claim 13, none of the references teaches multilingual parallel data with different source languages used to create training data for direct speech translation models using multilingual encoder-decoder architectures with attention mechanisms.
Examiner argues that the combination of references and citations of Claim 13 (Kano: 5. Experimental setup; Conclusion Ln 1-2; Conclusion Ln 3-4; 5. Experimental set-up, Ln 2; 5. Experimental set-up, Ln 1-7; 5. Experimental set-up, Ln 1-7. Johnson: 4 Experiments and Results, Ln 1; 4 Experiments and Results, Ln 1; Abstract Ln 4; Abstract Ln 10. Gao: [0023] Ln 3-4; [0026] 5th Line from end; [0023] Ln 3-4; [0024] Last 6 Lines; [0038] Ln 1-7; [0033] Ln 1-5; FIG 1) in combination teaches the claimed limitations, as described in the rejections to the claims.
Applicant does not explain why the citations provided by the examiner fail to teach the claimed limitations. Therefore Examiner believes the claimed limitations are taught by their respective cited references, and likewise dependent Claims 14-16.

Therefore Examiner believes Claims 1-16 are taught by their respective combinations of references as shown in the previous rejections.

Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.  The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. 
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier.  Such claim limitation(s) is/are: voice activity module, direct multi-lingual speech translation module, text to speech module, in claim 10, and subtitle segmentation module in claim 11.
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may:  (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claim 4 is rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA  35 U.S.C. 112, the applicant), regards as the invention.
Claim 4 recites the limitation "the speaker’s voice" in Ln 4-5.  There is insufficient antecedent basis for this limitation in the claim.


Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-3 and 7 is/are rejected under 35 U.S.C. 103 as being unpatentable over Ron J. Weiss et al. “Sequence-to-Sequence Models Can Directly Translate Foreign Speech” hereinafter Weiss, and further in view of Melvin Johnson et al. “Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation” hereinafter Johnson, and further in view of Takatomo Kano et al. “Structured-based Curriculum Learning for End-to-end English-Japanese Speech Translation” hereinafter Kano, and further in view of Qin Gao et al. (US 20160147740 A1) hereinafter Gao.

Regarding Claim 1:
Weiss teaches a system for translating speech associated with source languages into another target language(Abstract Ln 1-3, speech in one language into text in another), 
comprising: performing direct speech translation for language pairs(Abstract Ln 1-3, speech in one language into text in another) using models having been trained using (i) encoder-decoder(Abstract Ln 1, encoder-decoder) architectures with attention mechanisms(Abstract Ln 9, attention-based models),
process an input audio file having speech therein spoken in at least one language(4. Experiments Ln 4-5, Spanish transcriptions) to create text output in a target language(Abstract Ln 11, Spanish-English & Abstract Ln 1-6, Speech in source, to text in target, does not use text in source). 
Weiss does not specifically teach a multi-lingual system for translating speech associated with at least two source languages into another target language…..for performing direct speech translation for more than two language pairs using……. parallel text training data in more than two different languages.
In the same field of sequence-to-sequence model machine translation, Johnson teaches a multi-lingual system for translating speech associated with at least two source languages into another target language (Abstract Ln 10, French to English, German to English)
…..for performing speech translation for more than two language pairs using(1 Introduction Zero-shot translation, Portuguese to English, English to Spanish, Portuguese to Spanish)
……. parallel text training data in more than two different languages(4.1 Datasets, English to Japanese, English to Korean, English to Spanish, and English to Portuguese).
It would have been obvious for one skilled in the art at the effective time of filing to modify Weiss with the multilingual approach of Johnson, as Weiss explicitly suggests doing this(Conclusion Ln 20-24, multilingual speech translation system following [34]) and the multilingual system of Johnson allows for zero-shot translation(Johnson, 1 Introduction, Zero-shot translation, Lines 1-2, A surprising benefit).
The combination of Weiss and Johnson does not specifically teach training data using TTS.
In the same field of direct speech to text translation, Kano teaches training data using TTS(5.1 Experimental Set-Up Ln 1-6, Google text-to-speech).
It would have been obvious for one skilled in the art, at the effective time of filing, to modify the combination of Weiss and Johnson, with the text to speech of Kano, as it allows them to use training data with a text source language(5.1 Experimental Set-Up Ln 1-6, speech utterances.. are unavailable).
The combination of Weiss, Johnson and Kano does not specifically teach a memory, including program instructions….. a processor coupled to the memory for executing the program instructions to process.
In the same field of speech to speech translation systems, Gao teaches a memory, including program instructions([0023] Ln 3-4, memory, instructions)….. a processor coupled to the memory for executing the program instructions to process([0023] Ln 3-4, processor, instructions).
It would have been obvious for one skilled in the art, at the effective time of filing, to modify the combination of Weiss, Johnson and Kano, with the memory, processor, and instructions of Gao, as it would allow the functions and operations of the system to be performed([0024] Last 6 Lines, perform various functions and/or operations).

Regarding Claim 2:
The combination of Weiss, Johnson, Kano and Gao teaches the system according to claim 1, but does not teach wherein the processor is configured to further execute program instructions to convert the text output into speech in the target language using TTS.
In the same field of speech to speech translation systems, Gao teaches wherein the processor is configured to further execute program instructions(Instructions have been already modified in the combination of Gao in Claim 1) to convert the text output into speech in the target language using TTS([0026] 5th Line from end, TTS engine).
It would have been obvious for one skilled in the art, at the effective time of filing, to modify the combination of Weiss, Johnson, Kano and Gao, with the text to speech engine of Gao, as it would allow the translated output to be rendered through a speaker([0026] Last 2 lines, outputted as audio via speaker).

Regarding Claim 3:
The combination of Weiss, Johnson, Kano and Gao teaches the system according to claim 2, but does not teach wherein the processor is configured to further execute program instructions to generate the target language text output with punctuation marks and casing.
In the same field of speech to speech translation systems, Gao teaches wherein the processor is configured to further execute program instructions(Instructions have been already modified in the combination of Gao in Claim 1) 
to generate the target language text output with punctuation marks and casing(casing, punctuation, these are generated as shown in [0020] Ln 9. All of paragraphs [0003]-[0006] discuss training the system to deal with the “damages” of input speech, of which missing punctuation and casing are part of as shown in [0004] Ln 10-11).
It would have been obvious for one skilled in the art, at the effective time of filing, to modify the combination of Weiss, Johnson, Kano and Gao, with the output including punctuation and casing, of Gao, because it lowers discontinuities between ASR output and formal written text([0004] Ln 1-4).

Regarding Claim 7:
Weiss teaches a method for translating speech into another target language (Abstract Ln 1-3, speech in one language into text in another), 
comprising: training direct speech to text translation models using (i) encoder-decoder(Abstract Ln 1, encoder-decoder) architectures with attention mechanisms(Abstract Ln 9, attention-based models); 
processing an input speech signal to output a stream of text in a target language(Abstract Ln 11, Spanish-English & Abstract Ln 1-6, Speech in source, to text in target, does not use text in source).
Weiss does not specifically teach translating speech from at least two source languages….. training multi-lingual models for more than two language pairs….. parallel text training data in the more than two different language pairs…… processing an input speech signal in at least one of the languages among the at least two language pairs.
In the same field of sequence-to-sequence model machine translation, Johnson teaches translating speech from at least two source languages(Abstract Ln 10, French to English, German to English)
…… training multi-lingual models for more than two language pairs(1 Introduction Zero-shot translation, Portuguese to English, English to Spanish, Portuguese to Spanish)
….. parallel text training data in the more than two different language pairs(4.1 Datasets, English to Japanese, English to Korean, English to Spanish, and English to Portuguese)
….. processing an input speech signal in at least one of the languages among the at least two language pairs(4.1 Datasets, English to Japanese, English to Korean, English to Spanish, and English to Portuguese).
It would have been obvious for one skilled in the art at the effective time of filing to modify Weiss with the multilingual approach of Johnson, as Weiss explicitly suggests doing this(Conclusion Ln 20-24, multilingual speech translation system following [34]) and the multilingual system of Johnson allows for zero-shot translation(Johnson, 1 Introduction, Zero-shot translation, Lines 1-2, A surprising benefit).
Weiss and Johnson do not specifically teach training data using TTS.
In the same field of direct speech to text translation, Kano teaches training data using TTS(5.1 Experimental Set-Up Ln 1-6, Google text-to-speech).
It would have been obvious for one skilled in the art, at the effective time of filing, to modify the combination of Weiss and Johnson, with the text to speech of Kano, as it allows them to use training data with a text source language(5.1 Experimental Set-Up Ln 1-6, speech utterances.. are unavailable).
The combination of Weiss, Johnson and Kano does not teach converting the text output into speech in the target language using TTS.
In the same field of speech to speech translation systems, Gao teaches converting the text output into speech in the target language using TTS([0026] 5th Line from end, TTS engine).
It would have been obvious for one skilled in the art, at the effective time of filing, to modify the combination of Weiss, Johnson, Kano and Gao, with the text to speech engine of Gao, as it would allow the translated output to be rendered through a speaker([0026] Last 2 lines, outputted as audio via speaker).

Claims 4-6, 8 and 9 is/are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Weiss, Johnson, Kano and Gao, and further in view of Rupal Patel et al. (US 20160379622 A1), hereinafter Patel.

Regarding Claim 4:
The combination of Weiss, Johnson, Kano and Gao teaches the system according to claim 2, but does not teach wherein the program instructions stored in the memory further include program instructions for receiving meta-information characteristics of speech in an input stream and adjusting characteristics of the TTS speech output based on the characteristics of speech in the input stream to mimic the speaker's voice in the input speech.
In the same field of text to speech, Patel teaches wherein the program instructions stored in the memory further include program instructions(Instructions have been already modified in the combination of Gao in Claim 1) 
for receiving meta-information characteristics of speech in an input stream([0079] Ln 3-11, donor vocal tract information)
and adjusting characteristics of the TTS speech output based on the characteristics of speech in the input stream to mimic the speaker's voice in the input speech([0079] Ln 3-11, voicing source of recipient combined with donor vocal tract information. [0003] last 3 lines, adequately represent all the sounds in speech).
It would have been obvious for one skilled in the art, at the effective time of filing, to modify the combination of Weiss, Johnson, Kano and Gao, with the combining of vocal tract information of Patel, in order to produce morphed speech([0079] Ln 8-11).

Regarding Claim 5:
The combination of Weiss, Johnson, Kano and Gao teaches the system according to claim 2, but does not teach wherein the program instructions stored in the memory further include program instructions for receiving sentiment characteristics of speech in an input stream and adjusting prosody characteristics of the TTS speech output based on the sentiment characteristics.
In the same field of text to speech, Patel teaches wherein the program instructions stored in the memory further include program instructions(Instructions have been already modified in the combination of Gao in Claim 1) 
for receiving sentiment characteristics of speech in an input stream([0079] Ln 3-11, donor vocal tract information) 
and adjusting prosody characteristics of the TTS speech output([0079] Ln 3-11, voicing source of recipient) based on the sentiment characteristics([0079] Ln 3-11, voicing source of recipient combined with donor vocal tract information. Types of information: [0038] Ln 7-8, prosodic… etc., [0039] Ln 1-3, emotions, which includes prosody(step 360). As the instant application’s specification does not define sentiment characteristics in any way that would exclude it from being interpreted as part of emotion, and because this is related to text to speech and not binary sentiment classification, sentiment characteristics are interpreted as part of emotion. [0039] Last 2 lines, voices created with different emotions).
It would have been obvious for one skilled in the art, at the effective time of filing, to modify the combination of Weiss, Johnson, Kano and Gao, with the combining of vocal tract information of Patel, in order to produce morphed speech([0079] Ln 8-11).

Regarding Claim 6:
The combination of Weiss, Johnson, Kano and Gao teaches the system according to claim 4, but does not teach wherein the meta-information includes pitch, accent, speaker diarization, and language identification.
In the same field of text to speech, Patel teaches wherein the meta-information includes pitch([0064] Ln 3-10, pitch is used), 
accent([0057] Page 6, first 3 lines, accent is identified and recorded), 
speaker diarization([0040] Ln 14-16, determining speaker using diarization), 
and language identification([0027] Ln 9-12, Language is identified and recorded)(Items are recorded information for TTS [0003]).
It would have been obvious for one skilled in the art, at the effective time of filing, to modify the combination of Weiss, Johnson, Kano and Gao, with input stream characteristics information of Patel, as obtaining voice data can improve TTS quality([0032] Last 4 Lines).

Regarding Claim 8:
The combination of Weiss, Johnson, Kano and Gao teaches the method according to claim 4 but does not teach further comprising: receiving prosody characteristics associated with the speech in the input stream and adjusting prosody characteristics of the TTS speech output based on the prosody characteristics associated with the input speech.
In the same field of text to speech, Patel teaches receiving prosody characteristics associated with the speech in the input stream([0038] Ln 5-7, features from audio including prosodic, [0003] data is recorded for TTS) 
and adjusting prosody characteristics of the TTS speech output based on the prosody characteristics associated with the input speech([0077] Last 2 lines, Gives example of changing emotion, this would include prosody characteristics as [0039] first 5 lines, states step 360([0038]) is used to create emotion).
It would have been obvious for one skilled in the art, at the effective time of filing, to modify the combination of Weiss, Johnson, Kano, and Gao, with the use of the prosody characteristics of Patel, in order to create a specific emotion in the output speech([0077] last 6 lines), which would help in creating morphed speech([0079] Ln 8-11).

Regarding Claim 9:
The combination of Weiss, Johnson, Kano and Gao teaches the method according to claim 4 but does not teach further comprising: receiving sentiment characteristics associated with the speech in the input stream and adjusting the sentiment characteristics of the TTS speech output based on the sentiment characteristics.
In the same field of text to speech, Patel teaches receiving sentiment characteristics associated with the speech in the input stream([0039] Ln 1-3, obtaining speech with different emotions) 
and adjusting the sentiment characteristics of the TTS speech output based on the sentiment characteristics([0039] Last 2 lines, generate speech with different emotions. As the instant application’s specification does not define sentiment characteristics in any way that would exclude it from being interpreted as part of emotion, and because this is related to text to speech and not binary sentiment classification, sentiment characteristics will be interpreted as part of emotion).
It would have been obvious for one skilled in the art, at the effective time of filing, to modify the combination of Weiss, Johnson, Kano and Gao, with the use of the emotion characteristics of Patel, in order to create a specific emotion in the output speech([0077] last 6 lines), which would help in creating morphed speech([0079] Ln 8-11).

Claims 10-12 is/are rejected under 35 U.S.C. 103 as being unpatentable over Gao, and further in view of Weiss, and further in view of Johnson, and further in view of Patel.

Regarding Claim 10:
Gao teaches a system for translating speech into another target language(Abstract Ln 1, A speech-to-speech translation system),
comprising: a voice activity module, coupled to a source of speech signals([0026] Ln 1-5, audio signal), that is configured to receive and process source speech signals([0026] Ln 1-5, convert received audio signal);
a speech translation module([Abstract] Ln 1-4, MT engine), coupled to the source of speech signals and the voice activity module([0026] shows all elements being coupled), configured to receive and process the source speech signals and the voice activity module output([0026] Ln 11-13, receive audio, Ln 16-18, include features extracted) and generate a text stream output in a target language([0026] Ln 21-24, translate into text) with punctuation prediction information([0020] Ln 9-10, shows punctuation being part of MT output); 
and a text to speech module([0026] last 6 lines, TTS engine), coupled to the source of speech signals, the voice activity module and the speech translation module([0026] shows elements are coupled), configured to generate speech, based on the text stream output and punctuation prediction information([0026] last 6 lines, synthesize speech output from MT translation output. [0020] Ln 9-10, shows punctuation being part of MT output).
Gao does not specifically teach a direct speech translation module.
In the same field of NMT, Weiss teaches a direct speech translation module(Abstract Ln 1-3, speech in one language into text in another).
It would have been obvious for one skilled in the art, at the effective time of filing, to modify Gao with the direct speech to text translation of Weiss, to avoid compounding error and improve latency(1. Introduction, Para 4, Ln 6-11).
The combination of Gao and Weiss does not specifically teach translating speech associated with at least two source languages into another target language……. a multi-lingual speech translation module.
In the same field of NMT, Johnson teaches translating speech associated with at least two source languages into another target language(Abstract Ln 10, French to English, German to English)……. a multi-lingual speech translation module(Abstract Ln 1-2).
It would have been obvious for one skilled in the art, at the effective time of filing, to modify the combination of Gao and Weiss with the multilingual approach of Johnson, as the multilingual system of Johnson allows for zero-shot translation(Johnson, 1 Introduction, Zero-shot translation, Lines 1-2, A surprising benefit) and Weiss explicitly suggests combining its direct speech translation approach(previously modified to Gao) with Johnson’s(Conclusion Ln 20-24, multilingual speech translation system following [34]).
The combination of Gao, Weiss and Johnson does not teach a voice activity module, that is configured to receive and process source speech signals and output language labels, speaker diarization and voice characteristics meta-information associated with the speech signals…….generate speech, based on.. speaker diarization and voice characteristics meta-information, that mimics in the speech translated to the target language a speaker's voice reflected in the source speech.
In the same field of text to speech, Patel teaches a voice activity module, that is configured to receive and process source speech signals and output language labels([0027] Ln 9-12, Language is identified and recorded), speaker diarization([0040] Ln 14-16, determining speaker using diarization) and voice characteristics meta-information([0079] Ln 3-11, donor vocal tract information) associated with the speech signals
…….generate speech, based on.. speaker diarization([0040] Ln 14-16, determining speaker using diarization) and voice characteristics meta-information([0079] Ln 3-11, donor vocal tract information),
that mimics in the speech translated to the target language a speaker's voice reflected in the source speech(Items are recorded information for TTS [0003] last 3 lines, adequately represent all the sounds in speech).
It would have been obvious for one skilled in the art, at the effective time of filing, to modify the combination of Gao, Weiss and Johnson, with the use of voice data of Patel, as they help recreate the speech accurately in the output([0003] last 3 lines, adequately represent all the sounds in speech).

Regarding Claim 11:
The combination of Gao, Weiss, Johnson and Patel teaches the system of claim 10 and Gao teaches further comprising: a subtitle segmentation module([0020] Ln 10-13, displaying text of translated output),
coupled to the direct multi-lingual speech translation module and the voice activity module that is configured to generate subtitles in the target language corresponding to the source speech([0020] Ln 10-13, displaying text of translated output).

Regarding Claim 12:
The combination of Gao, Weiss, Johnson and Patel teaches the system of
claim 10, and Gao teaches where the direct multi-lingual speech translation module is
configured to generate full-sentence target language translation based on the predicted
sentence boundaries(these are generated as shown in [0020] Ln 9).
The combination of Gao, Weiss, Johnson and Patel does not teach where the
direct multi-lingual speech translation module is configured to determine predicted sentence boundaries based on the speaker diarization and language labeling.
In the same field of NLP, Patel teaches determine predicted sentence
boundaries based on the speaker diarization([0040] Ln 14-16, determining speaker
would include sentence boundaries, using diarization) and language labeling([0040] Ln
14-17, determining speaker would include sentence boundaries, using phonemes per
second, and [0028] Ln 1-5 phonemes are based on language).
It would have been obvious for one skilled in the art, at the effective time of filing,
to modify the combination of Gao, Weiss, Johnson and Patel, with the methods for
determining the speaker and sentence boundaries of Patel, because it helps to
determine if someone is being interrupted([0040] Ln 10-15).

Claims 13-16 are rejected under 35 U.S.C. 103 as being unpatentable over
Kano, and further in view of Johnson, and further in view of Gao.

	Regarding Claim 13:
Kano teaches a system for training(5. Experimental setup) a direct speech to
text translation module to translate speech into another target language(Conclusion Ln
1-2, end-to-end translation without ASR error), 
comprising training direct speech translation models using encoder-decoder architectures with attention mechanisms(Conclusion Ln 3-4, attentional-based ST systems); 
a source of speech in a source language for translation to the target language(5. Experimental set-up, Ln 2, The BTEC English-Japanese parallel corpus), 
produce training data using a TTS system and the source language side of parallel text training data(5. Experimental set-up, Ln 1-7, English-Japanese parallel corpus, text-to-speech to generate speech corpus), 
and wherein the produced training data includes the TTS generated speech signal generated from the parallel data(5. Experimental set-up, Ln 1-7, English-Japanese parallel corpus, text-to-speech to generate speech corpus).
Kano does not specifically teach a multi-lingual direct speech to text
translation module… models using multilingual encoder-decoder architectures with
attention mechanisms… including multilingual parallel data with different source
languages.
In the same field of sequence-to-sequence model machine translation,
Johnson teaches a multi-lingual speech to text translation module(4 Experiments and
Results, Ln 1, multilingual models)
… models using multilingual encoder-decoder architectures with attention mechanisms(4 Experiments and Results, Ln 1, multilingual models. Abstract Ln 4, encoder, decoder and attention)
… including multilingual parallel data with different source languages(Abstract Ln 10, French to English, German to English).
It would have been obvious for one skilled in the art, at the effective time of filing,
to modify Kano with the multilingual approach of Johnson, as the multilingual system of
Johnson allows for zero-shot translation(Johnson, 1 Introduction, Zero-shot translation,
Lines 1-2, A surprising benefit).
The combination of Kano and Johnson does not specifically teach a
memory, including program instructions for training… a text to speech (TTS) system associated with the target language that generates speech from source text in the target
language… a processor coupled to the memory… the processor configured to
execute the program instructions, to produce training data using the TTS system and
the source language(specifically using the same TTS system).
In the same field of speech to speech translation systems, Gao teaches a
memory, including program instructions for training([0023] Ln 3-4, memory,
instructions)
… a text to speech (TTS) system associated with the target language that generates speech from source text in the target language([0026] 5th Line from end, TTS
engine)
… a processor coupled to the memory([0023] Ln 3-4, processor,
instructions)… the processor configured to execute the program instructions, to
produce training data([0024] Last 6 Lines, perform various functions and/or operations)
using the TTS system and the source language([0038] Ln 1-7, parallel text data through
TTS to produce phonemes. [0033] Ln 1-5, TTS 212 same as TTS engine 134 of FIG 1).
It would have been obvious for one skilled in the art, at the effective time of filing,
to modify the combination of Kano and Johnson, with the memory, processor,
instructions and text to speech engine of Gao, as it would allow the functions and
operations of the system to be performed([0024] Last 6 Lines, perform various functions
and/or operations) and it would allow the translated output to be rendered through a
speaker([0026] Last 2 lines, outputted as audio via speaker).

Regarding Claim 14:
The combination of Kano, Johnson and Gao teaches the system according to
claim 13, but does not teach wherein the processor is further configured to execute
program instructions to process multi-lingual parallel training data to train an end-to-end
multi-lingual speech-to-speech system.
In the same field of sequence-to-sequence model machine translation,
Johnson teaches wherein the processor is further configured to execute program
instructions(Instructions have been already modified in the combination of Gao in Claim 13)
to process(4 Experiments and Results, Ln 1, train multilingual models) multi-lingual parallel training data(Abstract Ln 10, French to English, German to English) to train an end-to-end multi-lingual(4 Experiments and Results, Ln 1, multilingual models) speech-to-speech system(Direct speech to text from Kano Claim 13, Last step of text to speech was combined from Gao in Claim 13).
It would have been obvious for one skilled in the art, at the effective time of filing,
to modify the combination of Kano, Johnson and Gao, with the training of the
multilingual speech translation system of Johnson, as it would allow the model to
translate between pairs of languages(4 Experiments and Results, Para 3, Ln 3, model
learns to translate between pairs of languages).

	Regarding Claim 15:
The combination of Kano, Johnson and Gao teaches the system according to
claim 14, but does not teach wherein the processor is further configured to execute
program instructions to perform multilingual, multi-objective training to enhance the
model training multi-lingual parallel training data to train the end-to-end multi-lingual
speech-to-speech system.
In the same field of sequence-to-sequence model machine translation,
Johnson teaches wherein the processor is further configured to execute program
instructions(Instructions have been already modified in the combination of Gao in Claim
13) 
to perform multilingual(multilingual already modified in combination of Johnson
Claim 13), multi-objective training(4.4 Many to Many, objectives: Ln 1, multiple source
multiple target translation, Ln 6 keep accuracy high. 4.4 Many to Many, Data related
objectives: Para 3, Ln 1-3, oversampling had tradeoffs between languages) 
to enhance the model training multi-lingual parallel training data(4.4 Many to Many, Para 3, Ln 2-3, helps smaller language pairs at the expense of larger) to train the end-to-end multi-lingual speech-to-speech system(Speech to text taught in claim 13, Final step of text to speech is combined using Gao in Claim 13, multilingual from Johnson as in Claim 13).
It would have been obvious for one skilled in the art, at the effective time of filing,
to modify the combination of Kano, Johnson and Gao, with the multilingual multi-
objective training of Johnson, as it helps find a medium between possibly contradicting
objectives(4.4 Many to Many, Para 3, Ln 2-3, helps smaller language pairs at the
expense of larger).

Regarding Claim 16:
The combination of Kano, Johnson and Gao teaches the system according to
claim 13, but does not teach wherein the processor is further configured to execute
program instructions to execute multilingual, multi-task training to enhance the model
training.
In the same field of sequence-to-sequence model machine translation,
Johnson teaches wherein the processor is further configured to execute program
instructions(Instructions have been already modified in the combination of Gao in Claim
13) 
to execute multilingual(multilingual already modified in combination of Johnson
Claim 13), multi-task training(2. Related Work, Para 4, Ln 1, Our approach…multitask
learning)
to enhance the model training(4.1 Datasets, Training.., Last 4 Lines, Preventing catastrophic forgetting).
It would have been obvious for one skilled in the art, at the effective time of filing, to modify the combination of Kano, Johnson and Gao, with the multitask learning of
Johnson, as it helps prevent forgetting(4.1 Datasets, Training.., Last 4 Lines, Preventing
catastrophic forgetting).

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Fructuoso et al. (US 9195656 B2)
Text to speech using prosody information.
	Akagi et al. “Emotional Speech Recognition and Synthesis in Multiple Languages toward Affective Speech-to-Speech Translation System”
Speech to Speech translation with emotion information.
	Waibel et al. (US 20150254238 A1)
Speech to speech translation.

THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ALEXANDER G MARLOW whose telephone number is (571)272-4536. The examiner can normally be reached Monday - Thursday 10:00 am - 8:00 pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Richmond Dorvil, can be reached on (571)272-7602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




/ALEXANDER G MARLOW/Assistant Examiner, Art Unit 2658
/DOUGLAS GODBOLD/Primary Examiner, Art Unit 2655