DETAILED ACTION

Notice of Pre-AIA  or AIA  Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 112
The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.  The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. 
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
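The three-prong test and the two rebuttable presumptions described above can be sketched as a small decision function. This is only an illustrative sketch: the `ClaimLimitation` record, its field names, and the function names are hypothetical, and each boolean stands for a finding the examiner makes by reading the claim and specification, not something that can be computed automatically. The two example records use the findings this Office action states for claim 29.

```python
from dataclasses import dataclass


@dataclass
class ClaimLimitation:
    """Hypothetical record of the findings made for one claim limitation."""
    text: str
    uses_means_or_step: bool            # literally recites "means" or "step"
    uses_generic_placeholder: bool      # nonce term substituting for "means"
    has_functional_language: bool       # e.g. "for", "configured to", "so that"
    recites_sufficient_structure: bool  # structure, material, or acts for the function


def presumed_112f(lim: ClaimLimitation) -> bool:
    """Starting presumption: "means"/"step" plus functional language."""
    return lim.uses_means_or_step and lim.has_functional_language


def invokes_112f(lim: ClaimLimitation) -> bool:
    """Outcome after applying prongs (A), (B), and (C) of MPEP 2181, subsection I."""
    prong_a = lim.uses_means_or_step or lim.uses_generic_placeholder
    prong_b = lim.has_functional_language
    prong_c = not lim.recites_sufficient_structure
    return prong_a and prong_b and prong_c


# Findings as stated in this Office action for claim 29.
receiving = ClaimLimitation(
    text="means for receiving one or more control parameters",
    uses_means_or_step=True,
    uses_generic_placeholder=False,
    has_functional_language=True,
    recites_sufficient_structure=False,
)
processing = ClaimLimitation(
    text="means for processing, using a multi-encoder",
    uses_means_or_step=True,
    uses_generic_placeholder=False,
    has_functional_language=True,
    recites_sufficient_structure=True,
)

for lim in (receiving, processing):
    rebutted = presumed_112f(lim) != invokes_112f(lim)
    print(f"{lim.text!r}: invokes 112(f)={invokes_112f(lim)}, presumption rebutted={rebutted}")
```

In this sketch a presumption counts as rebutted whenever the outcome of the three-prong test differs from the starting presumption, which mirrors how the paragraphs above describe the effect of reciting (or failing to recite) sufficient structure.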
This application includes one or more claim limitations that use the word “means” and are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, because the claim limitation(s) couple(s) the word “means” with functional language without reciting sufficient structure to perform the recited function.  Such claim limitation(s) is/are: “means for receiving one or more control parameters” in claim 29.
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may:  (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph.
This application includes one or more claim limitations that use the word “means” or “step” but are nonetheless not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph because the claim limitation(s) recite(s) sufficient structure, materials, or acts to entirely perform the recited function.  Such claim limitation(s) is/are: “means for processing, using a multi-encoder” in claim 29.
Because this/these claim limitation(s) is/are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, it/they is/are not being interpreted to cover only the corresponding structure, material, or acts described in the specification as performing the claimed function, and equivalents thereof.
If applicant intends to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may:  (1) amend the claim limitation(s) to remove the structure, materials, or acts that performs the claimed function; or (2) present a sufficient showing that the claim limitation(s) does/do not recite sufficient structure, materials, or acts to perform the claimed function.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1-30 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Trueba et al. (US 2020/0394997).

Claims 1, 15, 27 and 29,
Trueba teaches a device for speech generation comprising: one or more processors configured to: receive one or more control parameters indicating target speech characteristics; and process, using a multi-encoder, an input representation of speech based on the one or more control parameters to generate encoded data corresponding to an audio signal that represents a version of the speech based on the target speech characteristics ([0023] [0025] [0028] to generate audio output that resembles the style, tone, language, or other vocal attribute of a particular speaker using training data from one or more human speakers; a spectrogram estimator estimates a spectrogram corresponding to input text data using a sequence-to-sequence (seq2seq) model; the seq2seq model includes a plurality of encoders; each encoder receives one of a plurality of different types of acoustic data (control parameters) corresponding to input text data; generating, using a speech model and based at least in part on the estimated spectrogram data, output speech data to the attributes of a particular speaker).
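For illustration only, the arrangement summarized above, in which each of a plurality of encoders receives one of several different types of acoustic data, can be sketched as follows. This is a toy sketch under stated assumptions, not the implementation of the reference or of the application: the names `scale_encoder` and `multi_encode` and the input-type labels "phoneme" and "prosody" are hypothetical, and a fixed multiplication stands in for a trained encoder network.

```python
from typing import Callable, Dict, List

Vector = List[float]
Encoder = Callable[[Vector], Vector]


def scale_encoder(factor: float) -> Encoder:
    """Hypothetical toy encoder: multiplies each input value by a fixed factor."""
    def encode(features: Vector) -> Vector:
        return [factor * value for value in features]
    return encode


def multi_encode(inputs: Dict[str, Vector], encoders: Dict[str, Encoder]) -> Dict[str, Vector]:
    """Route each type of acoustic data to the encoder registered for that type."""
    missing = set(inputs) - set(encoders)
    if missing:
        raise KeyError(f"no encoder for input types: {sorted(missing)}")
    return {name: encoders[name](features) for name, features in inputs.items()}


# Two hypothetical input types, each handled by its own encoder.
encoders = {"phoneme": scale_encoder(1.0), "prosody": scale_encoder(0.5)}
inputs = {"phoneme": [2.0, 4.0], "prosody": [2.0, 4.0]}
encoded = multi_encode(inputs, encoders)
print(encoded)
```

The encoded outputs stay keyed by input type so that a later stage (for example, per-encoder attention followed by a decoder) can treat each one separately.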

Claims 2 and 16,
Trueba further teaches the device of claim 1, wherein the control parameters indicate a target person whose speech characteristics are to be used, a target emotion, a target rate of speech, or a combination thereof ([0025] emotions, speech characteristics).

Claims 3 and 17,
Trueba further teaches the device of claim 1, wherein the one or more processors are further configured to generate merged style data based on the one or more control parameters, and wherein the merged style data is used by the multi-encoder during processing of the input representation ([0028] [0067] generating based at least in part on the first weighted feature vector and the second weighted feature vector, estimated spectrogram data corresponding to the input text data; each encoder 702a, 702b, . . . 702N and/or corresponding attention network 706a, 706b, . . . 706N corresponds to a merged or combined speaking style corresponding to multiple speaking styles, types of person, and/or particular person).

Claims 4 and 18,
Trueba further teaches the device of claim 1, wherein the multi-encoder includes: a first encoder configured to encode the input representation independently of the one or more control parameters to generate first encoded data; and one or more second encoders configured to encode the input representation based on the one or more control parameters to generate second encoded data, wherein the encoded data includes the first encoded data and the second encoded data ([0028] receives (130) first acoustic-feature data corresponding to a first segment of input text data; the server(s) 120 receives (132) second acoustic-feature data corresponding to a second segment of the input text data larger than the first segment of input text data to generate estimated spectrogram data).

Claims 5 and 19,
Trueba further teaches the device of claim 4, wherein the one or more processors are further configured to: process, at a speech characteristic encoder, the input representation based on at least one of the one or more control parameters to generate an encoded input speech representation; generate, at an encoder pre-network, merged style data based at least in part on the encoded input speech representation; provide the input representation to the first encoder to generate the first encoded data; and provide the merged style data to the one or more second encoders to generate the second encoded data ([Figs. 6-7] [0055-0063] [0064-0069] see Figs. 6-7; the spectrogram estimator 238 includes one or more encoders 602 for encoding one or more types of acoustic-feature data 502 into one or more feature vectors; the encoders 602 receive the acoustic-feature data 502 and/or input text data 210 and generate character embeddings 608 based thereon. The character embeddings 608 may represent the acoustic-feature data 502 and/or input text data 210 as a defined list of characters).

Claims 6 and 20,
Trueba further teaches the device of claim 4, further comprising a multi-encoder transformer including the multi-encoder and a decoder, wherein the first encoder includes a first attention network, wherein each of the one or more second encoders includes a second attention network, and wherein the decoder includes a decoder attention network ([Figs. 6-7] [0062] [0064-0069] encoder 602 and 702a-N; decoder 604 and decoder 708; attention network 706a-N).

Claims 7 and 21,
Trueba further teaches the device of claim 6, wherein: the first encoder comprises: a first layer including the first attention network, wherein the first attention network corresponds to a first multi-head attention network; and a second layer including a first neural network, and each of the one or more second encoders comprises: a first layer including the second attention network, wherein the second attention network corresponds to a second multi-head attention network; and a second layer including a second neural network ([0056] a spectrogram estimator 238 in accordance with embodiments of the present invention; the spectrogram estimator 238 includes N encoders 702a, 702b, . . . 702N and attention layers 704 that include N attention networks 706a, 706b, . . . 706N; with reference to FIG. 6, each encoder 702a, 702b, . . . 702N may include character embeddings that transform input acoustic-feature data 701a, 701b, . . . 701N into one or more corresponding vectors, may include one or more convolution layer(s), which may apply one or more convolution operations to the vectors corresponding to the character embeddings, and/or may include a bidirectional LSTM layer to generate encodings corresponding to the acoustic-feature data 701a, 701b, . . . 701N; the attention network may be an RNN, DNN, or other network discussed herein, and may include nodes having weights and cost functions arranged into one or more layers).

Claims 8 and 22,
Trueba further teaches the device of claim 4, further comprising: a decoder coupled to the multi-encoder, the decoder including a decoder network that is configured to generate output spectral data based on the first encoded data and the second encoded data; and a speech synthesizer configured to generate, based on the output spectral data, the audio signal that represents the version of the speech based on the target speech characteristics ([Fig. 7] [0064] each encoder 702a, 702b, . . . 702N may include character embeddings that transform input acoustic-feature data 701a, 701b, . . . 701N into one or more corresponding vectors, may include one or more convolution layer(s), which may apply one or more convolution operations to the vectors corresponding to the character embeddings, and/or may include a bidirectional LSTM layer to generate encodings corresponding to the acoustic-feature data 701a, 701b, . . . 701N).

Claim 9,
Trueba further teaches the device of claim 8, wherein the decoder network includes a decoder attention network comprising: a first multi-head attention network configured to process the first encoded data; one or more second multi-head attention networks configured to process the second encoded data; and a combiner configured to combine outputs of the first multi-head attention network and the one or more second multi-head attention networks ([0059-0062] the attention network 614 may, for example, weight certain portions of the context vector by increasing their value and may weight other portions of the context vector by decreasing their value; the increased values may correspond to acoustic features to which more attention should be paid by the decoder 604 and the decreased values may correspond to acoustic features to which less attention should be paid by the decoder 604).
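For illustration only, the weighting described in the cited passage (some portions of a context vector increased in value, others decreased) and a combiner over several attention outputs can be sketched as below. The sketch rests on assumptions that are not drawn from the reference: attention weights are taken to be a softmax over hypothetical scores, rescaled so that the average weight is 1 (so above-average scores increase a value and below-average scores decrease it), and the combiner is taken to be an elementwise sum. The function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax; the result sums to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(context: Vector, scores: Vector) -> Vector:
    """Weight each portion of the context vector.

    Weights are rescaled to average 1 (an assumption of this sketch), so
    portions with above-average scores grow and the others shrink.
    """
    n = len(context)
    weights = [n * w for w in softmax(scores)]
    return [w * c for w, c in zip(weights, context)]


def combine(outputs: List[Vector]) -> Vector:
    """Hypothetical combiner: elementwise sum of the attention outputs."""
    return [sum(values) for values in zip(*outputs)]


first = attend([2.0, 4.0], [0.0, 0.0])   # equal scores leave values unchanged
second = attend([2.0, 4.0], [1.0, 0.0])  # first portion emphasized
combined = combine([first, second])
print(first, second, combined)
```

Other combiners, such as concatenation or a learned projection, would fit the same interface; the sum is used here only because it is the simplest to verify.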

Claim 10,
Trueba further teaches the device of claim 9, wherein the decoder further comprises: a masked multi-head attention network coupled to an input of the decoder attention network; and a decoder neural network coupled to an output of the decoder attention network ([0060] the attention network 614 may allow the decoder 604 to “attend” to different parts of the acoustic-feature data 502 at each step of output generation. The attention network 614 may allow the encoder 602 and/or decoder 604 to learn what to attend to based on the acoustic-feature data 502 and/or produced spectrogram data 606).

Claims 11 and 24,
Trueba further teaches the device of claim 1, wherein the one or more processors are further configured to: generate one or more estimated control parameters from the audio signal; and based on a comparison of the one or more control parameters and the one or more estimated control parameters, train one or more neural network weights of the multi-encoder, one or more speech characteristic encoders, an encoder pre-network, a decoder network, or a combination thereof ([0059] the spectrogram estimator 238 includes an attention network 614 that summarizes the full encoded sequence output by the bidirectional LSTM layer 612 as fixed-length context vectors corresponding to each output step of the decoder 604; the attention network 614 includes nodes having weights and/or cost functions arranged into one or more layers; the attention network 614 weights certain values of the context vector before sending them to the decoder 604; the attention network 614 may, for example, weight certain portions of the context vector by increasing their value and may weight other portions of the context vector by decreasing their value; the increased values may correspond to acoustic features to which more attention should be paid by the decoder 604 and the decreased values may correspond to acoustic features to which less attention should be paid by the decoder 604).

Claims 12 and 25,
Trueba further teaches the device of claim 1, wherein the one or more processors are further configured to: receive an input speech signal; and generate the input representation based on the input speech signal ([0059] the user inputs audio containing the user's own speech; a speech-to-text system generates text based on the user's speech, and the spectrogram estimator 238).

Claims 13 and 26,
Trueba further teaches the device of claim 1, wherein the one or more processors are further configured to receive the input representation ([0059] the user inputs audio containing the user's own speech; a speech-to-text system generates text based on the user's speech, and the spectrogram estimator 238).

Claims 14 and 28,
Trueba further teaches the device of claim 1, wherein the input representation includes text, mel-scale spectrograms, fundamental frequency (F0) features, or a combination thereof ([0054] receives input text).

Claim 23,
Trueba further teaches the method of claim 22, further comprising: processing the first encoded data at a first multi-head attention network of a decoder attention network, wherein the decoder network includes the decoder attention network; processing the second encoded data at one or more second multi-head attention networks of the decoder attention network; and combining, at a combiner, outputs of the first multi-head attention network and the one or more second multi-head attention networks (claim 23 contains subject matter similar to claims 9-10 and is thus rejected under a similar rationale).

Claim 30,
Trueba further teaches the apparatus of claim 29, wherein the means for receiving and the means for processing are integrated into at least one of a virtual assistant, a home appliance, a smart device, an internet of things (IoT) device, a communication device, a headset, a vehicle, a computer, a display device, a television, a gaming console, a music player, a radio, a video player, an entertainment unit, a personal media player, a digital video player, a camera, or a navigation device ([Fig. 1]  device 110).

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Kim et al. (KR 20210052921) teaches speech synthesis in a noisy environment. A speech synthesis method according to an embodiment of that specification may generate synthesized speech to which the Lombard effect is applied by using a feature vector generated from a speech feature. The speech synthesis method and apparatus can be associated with an artificial intelligence module, a drone (unmanned aerial vehicle, UAV), a robot, an augmented reality (AR) device, a virtual reality (VR) device, a 5G service, and the like. The speech synthesis method includes the steps of: generating a feature vector; applying parameters to a text-to-speech synthesis model; and generating synthesized speech data.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHREYANS A PATEL whose telephone number is (571)270-0689. The examiner can normally be reached Monday-Friday 8am-5pm PST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached at 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

SHREYANS A. PATEL
Examiner
Art Unit 2657



/SHREYANS A PATEL/               Examiner, Art Unit 2656