Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION
This office action is in response to the correspondence of 08/04/22 regarding application 17/037,023, in which claims 1, 2, 4, 5, 7-14, and 16-19 were amended and new claim 21 was added. Claims 1-21 are pending and have been considered.

Response to Arguments
Amended claim 13 overcomes the objection for a minor informality, and so the objection is withdrawn.
Amended claim 13 overcomes the 35 U.S.C. 112(b) rejections of claims 13-16 as being indefinite, and so the rejections are withdrawn.
Applicant’s arguments on page 10 with regard to Claim Interpretation have been fully considered and are persuasive. Applicant argues that because the dependent claims include sufficient structure similar to that of the base independent claim, the dependent claims should not invoke 35 U.S.C. 112(f). In other words, the “memory; and at least one processor coupled to the memory” in claim 1 provide sufficient structure to implement “a reference encoder sub-module” in claim 2, “an attention sub-module” in claim 4, “a variational autoencoder sub-module” in claim 5, “an audio encoder sub-module” in claim 7, “the audio encoder sub-module” in claim 8, “a text encoder sub-module” in claim 12, “a guided attention sub-module” in claim 13, “an audio decoder sub-module” in claim 14, and “the interpolation and extrapolation module” in claim 18. The examiner agrees, and these limitations are no longer interpreted as invoking 35 U.S.C. 112(f).
Applicant’s arguments regarding the 35 U.S.C. 102(a)(2) rejections based on Semenov and the 35 U.S.C. 103 rejections based on Semenov, Battenberg, Diamos, Prabhavalkar, Yang, Valvin, and Bardino have been fully considered and are not persuasive.
First, on page 12 Applicant argues that Semenov teaches that only audio is input, and does not disclose an input text and a reference audio style file. In response, paragraph 32 of Semenov discloses “audio of a speaker(s) can be passed through a number of style subnetworks (e.g. speaker, prosody, etc.) to generate tokens” and “style tokens, along with a set of text, can be passed into an audio data generation engine”. Since the input audio of a speaker is processed by a computer implementing the subnetworks to generate the style tokens, it is fairly considered a “reference audio style file”.
Next, on page 12 Applicant argues that Semenov does not disclose “an expressivity characterization module” which outputs the expression vector using an expressive acoustic model. In response, Semenov, Paragraph 32, lines 4-8, discloses that style tokens (i.e. expression vectors) are produced (i.e. generated) from an input audio (i.e. reference audio style file) using a prosody subnetwork (i.e. expressivity characterization module). It should be noted that, in regards to expression vectors, Applicant discloses: “each expression vector is a representation of prosodic information in a reference audio style file” (Abstract); meanwhile, Semenov notes: “(e.g. tokens) that reflect particular features of the input” (Paragraph 38); also “generates… tokens using… subnetworks…” (Paragraph 83); also “subnetworks… include… prosody networks for identifying various characteristics of audio input” (Paragraph 83). Thus, the expression vector, as disclosed, is a representation of the same information as the disclosed style tokens of Semenov, i.e. prosodic information. Further, Semenov in Paragraph 45 teaches autoregressively generating frames of audio based on a set of text features, style tokens, which are expressive features, and previously generated frames, including spectrograms, which are acoustic features.
Finally, the arguments on pages 13-14 regarding independent claim 19, dependent claims 7-9, 12-14, and 20, the 35 U.S.C. 103 rejections of dependent claims 2-6, 10, 11, and 15-18, and new claim 21 are similar to those addressed above, and are not persuasive for similar reasons.


Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1, 7-9, 12-14, and 19-21 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Semenov et al. (U.S. Patent Application Publication 2020/0402497 A1, hereinafter "Semenov").
In regards to claim 1, Semenov teaches:
A system for synthesizing expressive speech, the system comprising (Paragraph 32, lines 1-6: the system modifies the prosody or style of its generated audio; that is, it makes the synthesized speech expressive):
an interface (Paragraph 32, lines 12-16: a set of text (i.e. input) can be passed into an audio data generation engine (i.e. received for conversion to speech));
a memory (Paragraph 94; see also Fig. 13, element 1320);
and at least one processor coupled to the memory and configured to (Paragraph 93; see also Fig. 13, element 1305):
control the interface to receive an input text for conversion to speech (Paragraph 32, lines 12-16: a set of text (i.e. input) can be passed into an audio data generation engine (i.e. received for conversion to speech));
receive a reference audio style file using an expressivity characterization module (Paragraph 32, lines 4-8: style tokens (i.e. expression vectors) are produced (i.e. generated) from an input audio (i.e. reference audio style file); Since the input audio of a speaker is processed by a computer implementing the subnetworks to generate the style tokens, it is fairly considered a “reference audio style file”);
generate, using the expressivity characterization module, a plurality of expression vectors based on prosodic information in the received reference audio style file, wherein each expression vector is a representation of prosodic information in the reference audio style file (Paragraph 32, lines 4-8: style tokens (i.e. expression vectors) are produced (i.e. generated) from an input audio (i.e. reference audio style file) using a prosody subnetwork (i.e. expressivity characterization module); It should be noted that, in regards to expression vectors, applicant discloses: “each expression vector is a representation of prosodic information in a reference audio style file” (Abstract); meanwhile, Semenov notes: “(e.g. tokens) that reflect particular features of the input” (Paragraph 38); also “generates… tokens using… subnetworks…”  (Paragraph 83); also “subnetworks… include… prosody networks for identifying various characteristics of audio input” (Paragraph 83); Thus, the expression vector, as disclosed, is a representation of the same information as the disclosed style tokens of Semenov, i.e. prosodic information); and
obtain pre-recorded or pre-synthesized speech features using an expressive acoustic model (Paragraph 45: autoregressively generating frames of audio based on a set of text features, style tokens, which are expressive features, and previously generated frames, including spectrograms, which are acoustic features);
synthesize expressive speech from the input text based on the at least one expression vector of the plurality of expression vectors and the obtained pre-recorded or pre-synthesized speech features using the expressive acoustic model, wherein the expressive acoustic model comprises a deep convolutional neural network that is conditioned by at least one expression vector of the plurality of expression vectors (Paragraph 32, lines 12-17: a convolutional neural network is used to generate audio data; Paragraph 45: autoregressively generating frames of audio based on a set of text features, style tokens, which are expressive features, and previously generated frames, including spectrograms, which are acoustic features; Paragraph 51: notably, the convolutional neural network may be a deep convolutional neural network as taught by Tachibana (2014), which has also been incorporated by reference into the applicant’s disclosure; Paragraph 32, lines 1-3: the generated audio may include speech; Paragraph 34: audio is generated with the characteristics of (i.e. conditioned by) the tokens (i.e. expression vectors)).
In regards to claim 7, Semenov further teaches:
The system as claimed in claim 1, wherein the expressive acoustic model comprises an audio encoder (paragraph 44; also see Fig. 3, element 310) sub-module configured to: 
receive the pre-recorded or pre-synthesized speech features and the at least one expression vector (Paragraph 45: a spectrogram of previous frames of audio, and style tokens), and
generate a vector corresponding to the received speech features based on the at least one expression vector (Paragraph 45; generating an audio frame using the style token);
wherein the expressive acoustic model further comprises an audio decoder sub-module (Fig. 3, element 320) configured to:
receive the at least one expression vector used by the audio encoder sub-module (paragraphs 44 and 45; style token at the audio decoder), and
generate acoustic features based on the received at least one expression vector (Paragraph 50: audio decoders generate audio data one frame at a time; Paragraph 45, frames of audio are generated based on, among other things, style tokens (e.g. expression vectors)).

In regards to claim 8, Semenov further teaches:
The system as claimed in claim 7, wherein the audio encoder sub-module is further configured to:
generate a vector corresponding to the received speech features, conditioned by the received at least one expression vector (Fig. 1, Elements 145, 150, and 115: the prosody and speaker tokens 145 and 150 (expression vectors) are inputs to the audio data generation engine 115; Fig. 3 shows that the audio data generation engine 115 may include an audio encoder; see also Paragraph 32, lines 11-15; Paragraph 45 notes that audio data generation engines generate frames of audio based on (i.e. conditioned by), among other things, style tokens (i.e. expression vectors); these generated frames are fed through an audio encoder, which generates encodings of the received audio (i.e. speech)). 
In regards to claim 9, Semenov further teaches:
The system as claimed in claim 8, wherein the at least one expression vector of the plurality of expression vectors comprises a user-selected expression vector (Paragraph 33: Semenov notes that known methods may have a user “turn a knob” to select a voice, then condition inference on that token (i.e. expression vector)).
In regards to claim 12, Semenov incorporates by reference (Paragraph 51) the teachings of Tachibana (2014, “Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention”). Applicant notes in Paragraph 45 of the specification that the disclosed text encoder, as well as its generation of the first and second matrices, is explained in this same reference.
Thus, Semenov teaches:
The system as claimed in claim 7, wherein the expressive acoustic model comprises a text encoder (Semenov, Paragraph 46; see also Tachibana, Section 3.1, line 3) sub-module configured to: 
receive phonemes or graphemes corresponding to the received input text (Semenov, Paragraph 76; see also Tachibana, Section 3.1, lines 4-6: the sentence consists of phonemes and graphemes); 
generate a first matrix V representing a value of each phoneme or grapheme in the received input text (Semenov, Paragraphs 46 and 76; see also Tachibana, Section 3.1, line 6); 
and generate a second matrix K representing a unique key associated with each value (Semenov, Paragraphs 46 and 76; see also Tachibana, Section 3.1, line 6).
In regards to claim 13, Applicant notes that the disclosed guided attention sub-module is unchanged from the original DC-TTS described by Tachibana (2014). Thus, Semenov further teaches:
The system as claimed in claim 12, wherein the expressive acoustic model further comprises a guided attention (Semenov, Fig. 3, Element 315) sub-module configured to: 
compare a generated matrix Q and the generated first and second matrices (Semenov, Paragraph 76; see also Tachibana, section 3.1, lines 6-9); 
and determine a similarity between each character in the received input text with a sound represented in the generated matrix (Semenov, Paragraph 76; see also Tachibana, section 3.1, lines 10-11).
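For illustration only, the comparison of the matrix Q with the matrices K and V recited above can be sketched as a scaled dot-product attention of the general kind described in Tachibana. The sketch is not taken from Semenov, Tachibana, or the application; the function names, the toy values, and the row-per-item matrix layout are assumptions made for this example.

```python
import math


def softmax(scores):
    """Normalize a list of scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(K, V, Q):
    """Compare each query row of Q with every key row of K.

    K, V: one row per character of the input text (hypothetical layout).
    Q:    one row per audio frame (hypothetical layout).
    Returns (A, R): A holds the similarity weights between each frame and
    each character; R holds the weighted sums of the value rows.
    """
    d = len(K[0])
    A, R = [], []
    for q in Q:
        scores = [sum(k_i * q_i for k_i, q_i in zip(k, q)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        A.append(weights)
        R.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return A, R


# Toy data: two characters, one audio frame that resembles the first key.
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
Q = [[5.0, 0.0]]
A, R = attend(K, V, Q)
```

In this toy case the single frame is weighted toward the first character, which corresponds to the claimed determination of a similarity between each character and a sound.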
In regards to claim 14, Semenov further teaches:
The system as claimed in claim 13, wherein the audio decoder (Fig. 3, element 320) sub-module is configured to: 
generate acoustic features corresponding to an output of the guided attention sub-module, conditioned by the received expression vector (Paragraph 50: audio decoders generate audio data one frame at a time; Paragraph 45, frames of audio are generated based on, among other things, style tokens (e.g. expression vectors)).
In regards to claim 19, claim 19 is a method claim analogous to the system of claim 1. Thus, it is rejected on similar grounds.
In regards to claim 20, Semenov further teaches:
A non-transitory computer-readable storage medium storing a code which, when executed by a processor, causes the processor to execute the method of claim 19 (Paragraph 16).
In regards to claim 21, Semenov further teaches:
The method as claimed in claim 19, wherein the obtaining of the pre-recorded or pre-synthesized speech features using the expressive acoustic model comprises receiving, by an audio encoder sub-module, the pre-recorded or pre-synthesized speech features and the at least one expression vector (Paragraph 45: a spectrogram of previous frames of audio, and style tokens), wherein the method further comprises:
generating, by the audio encoder sub-module, a vector corresponding to the received speech features based on the at least one expression vector (Paragraph 45; generating an audio frame using the style token);
receiving, by an audio decoder sub-module (Fig. 3, element 320), the at least one expression vector used by the audio encoder sub-module (paragraphs 44 and 45; style token at the audio decoder), and
generating, by the audio decoder sub-module, acoustic features based on the received at least one expression vector (Paragraph 50: audio decoders generate audio data one frame at a time; Paragraph 45, frames of audio are generated based on, among other things, style tokens (e.g. expression vectors)).

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 2 and 5 are rejected under 35 U.S.C. 103 as being unpatentable over Semenov as applied to claim 1 above, and further in view of Battenberg et al. (U.S. Patent Application Publication 2020/0372897 A1, hereinafter "Battenberg"), and further in view of Gournay et al. (U.S. Patent Application Publication 2002/0065655 A1, hereinafter “Gournay”).
In regards to claim 2, Semenov further teaches:
The system as claimed in claim 1, wherein the expressivity characterization module comprises:
an interface configured to receive the reference audio style file (Paragraph 33: audio is passed in through an encoder);
and a reference encoder sub-module configured to receive prosodic information of the received reference audio style file into a vector (Paragraph 33: Semenov describes passing audio through an encoder to generate a style token (i.e. expression vector)).
However, Semenov fails to teach compressing the reference audio style file into a fixed-length vector.
In a related art, Battenberg teaches a system for expressive speech synthesis (Paragraphs 26 and 27). Notably, Battenberg teaches that their system may include a reference encoder that generates a fixed-length prosody embedding (i.e. vector) from the reference audio signal (Paragraph 8, lines 7-10; also Paragraph 62). Battenberg further notes that the reference encoder they describe was originally disclosed in earlier non patent literature that has also been disclosed by the applicant (Skerry-Ryan et al., 2018). Battenberg teaches that their system may capture characteristics of the reference audio signal independent of phonetic information and idiosyncratic speaker traits (Paragraph 62).
It would have been obvious to one of ordinary skill in the art at the time of filing to modify Semenov to incorporate the teachings of Battenberg to include the reference encoder in their expressive text to speech system. Doing so would have helped to capture characteristics of the reference audio signal independent of phonetic information and idiosyncratic speaker traits, as taught by Battenberg.
Thus, the combination of Semenov and Battenberg teaches:
The system as claimed in claim 1, wherein the expressivity characterization module comprises:
an interface configured to receive the reference audio style file (Semenov, Paragraph 33: audio is passed in through an encoder);
and a reference encoder sub-module configured to receive prosodic information of the received reference audio style file into a fixed-length vector (Battenberg, Paragraph 62: Battenberg teaches that their system may contain a reference encoder that generates a fixed-length prosody embedding (i.e. vector) from the reference audio signal (i.e. reference audio style file)).
However, Semenov and Battenberg fail to teach compressing the prosodic information.
In a related art, Gournay teaches compressing prosodic information (Paragraph 42: compress the prosody at very low bit rates).
It would have been obvious to one of ordinary skill in the art at the time of filing to modify Semenov and Battenberg to incorporate the teachings of Gournay to include compressing the prosody information. Doing so would have helped to provide a “complete encoder in this field of application” of encoding speech at very low bit rates, as taught by Gournay (see Paragraphs 32 and 33).
Thus, the combination of Semenov, Battenberg, and Gournay teaches:
The system as claimed in claim 1, wherein the expressivity characterization module comprises:
an interface configured to receive the reference audio style file (Semenov, Paragraph 33: audio is passed in through an encoder);
and a reference encoder sub-module configured to compress prosodic information (Gournay, paragraph 42) of the received reference audio style file into a fixed-length vector (Battenberg, Paragraph 62: Battenberg teaches that their system may contain a reference encoder that generates a fixed-length prosody embedding (i.e. vector) from the reference audio signal (i.e. reference audio style file)).
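As a purely illustrative aside, the notion of a fixed-length vector produced from a reference audio signal of arbitrary duration can be sketched as follows. The sketch uses simple mean pooling over time, which is an assumption chosen for brevity and is not the reference encoder architecture of Semenov, Battenberg, Gournay, or the application; the function name and the toy frame features are hypothetical.

```python
def fixed_length_embedding(frames):
    """Pool a variable number of per-frame feature rows into one vector.

    frames: list of equal-length feature rows, one per audio frame.
    The output length depends only on the feature dimension, not on the
    number of frames (mean pooling is an assumption for this sketch).
    """
    if not frames:
        raise ValueError("at least one frame is required")
    dim = len(frames[0])
    count = len(frames)
    return [sum(frame[j] for frame in frames) / count for j in range(dim)]


# Two hypothetical clips of different durations, three features per frame.
short_clip = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
long_clip = [[0.0, 1.0, 2.0]] * 5

short_vec = fixed_length_embedding(short_clip)
long_vec = fixed_length_embedding(long_clip)
```

Both clips yield a vector of the same length even though they contain different numbers of frames, which is the property relied upon in the mapping of the fixed-length vector limitation.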


In regards to claim 5, Battenberg further teaches:
The system as claimed in claim 2, wherein the expressivity characterization module further comprises a variational autoencoder sub-module comprising a plurality of fully-connected layers configured to (Paragraph 56: the transfer model includes a variational autoencoder; Paragraph 66: Battenberg describes a six-layer convolutional layer network with a fully connected layer, and describes how it is integrated with a multilayer perceptron in order to generate the variational embedding; that is, the convolutional layer network’s fully connected layer is a part of the variational autoencoder; furthermore, a multilayer perceptron itself inherently consists of a plurality of fully-connected layers): 
receive the fixed-length vector from the reference encoder sub-module (Paragraph 62: the reference encoder generates a fixed-length vector); 
generate a latent space corresponding to the prosodic information of the received reference audio style file (Paragraph 56: the variational autoencoder network determines a variational embedding (i.e. latent space) for the reference audio signal as output. See also Fig. 4, elements 410 and 420: the reference encoder output is required to generate the latent space); 
and output an expression vector for the reference audio style file (Paragraph 56: the variational embedding (i.e. expression vector) enables the synthesized speech produced by the TTS model to sound like the reference audio signal input to the reference encoder; see also Paragraph 59; it should be noted that, in regards to claim 1, “style token” was construed as a possible expression vector in view of Semenov; given its broadest reasonable interpretation in view of claim 1, the expression vector may still be taught by the style token of Semenov, though such a construal is insufficient in regards to the limitations of claim 5).
It should be additionally noted that Semenov teaches using an audio encoder that maps input text features to features in latent space (Paragraph 47), though Semenov does not specifically teach a variational autoencoder.
Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over Semenov, Battenberg, and Gournay as applied to claim 2 above, and further in view of Diamos et al. (U.S. Patent Application Publication 2018/0061439 A1, hereinafter "Diamos").
In regards to claim 3, Semenov, Battenberg, and Gournay fail to teach the reference encoder comprising a plurality of max pooling layers and residual connections for generating the fixed length vector. Notably, however, Battenberg does teach that their reference encoder comprises a plurality of 2D convolutional layers (Fig. 5, element 504: “Conv2D” indicates a 2 Dimensional Convolution; see also Paragraph 63). In addition, Battenberg also teaches using a plurality of convolutional layers, max pooling layers, and residual connections in a post-processing neural network for generating a waveform synthesizer input (Paragraph 39). This may have rendered the claimed invention obvious as a use of a known technique to improve similar devices in the same way; however, Battenberg fails to teach specifically 2-dimensional convolutional layers in combination with max pooling layers and residual connections.
In a related art, Diamos teaches a system that extracts features from a raw audio signal. Notably, Diamos teaches using 2D convolutional layers (Paragraph 38), max pooling layers (Paragraph 38), and later, residual connections over individual layers (Paragraph 41) in their system. Diamos suggests that their particular architecture may have greater performance compared to other architectures (Paragraph 36: 2D convolution layers provide greater performance compared with fully connected or recurrent layers), though, as evidenced by Battenberg, a similar architecture is already known to be favorable in the field of audio processing with neural networks.
It would have been obvious to one of ordinary skill in the art at the time of filing to modify Semenov, Battenberg, and Gournay to incorporate a plurality of 2D convolutional layers, max pooling layers, and residual connections in the reference encoder. Doing so would have been use of a known technique to improve similar devices in the same way (MPEP 2143(C)).
Battenberg teaches a reference encoder which uses a convolutional neural network comprising a plurality of 2D convolutional layers to receive prosodic information into a fixed-length vector. (“Base” device)
Diamos teaches a system that extracts features from a raw audio signal using a plurality of 2D convolutional layers, max pooling layers, and residual connections over individual layers. (“Comparable” device)
One of ordinary skill in the art could have applied the technique taught by Diamos to the device taught by Battenberg as they are both devices that utilize convolutional neural networks in the art of processing audio signals.
Thus, Semenov, Battenberg, Gournay, and Diamos together teach:
The system as claimed in claim 2, wherein the reference encoder sub-module (Battenberg, Paragraph 62) comprises a plurality of two-dimensional convolutional layers (Diamos, Paragraph 38), max pooling layers (Diamos, Paragraph 38) and residual connections (Diamos, Paragraph 41) for generating the fixed-length vector (Battenberg, Paragraph 62).
Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Semenov, Battenberg, and Gournay as applied to claim 2 above, and further in view of Prabhavalkar et al. (U.S. Patent Application Publication 2020/0027444 A1, hereinafter "Prabhavalkar").
In regards to claim 4, Battenberg teaches the use of an attention module that receives a fixed-length prosody embedding output from the reference encoder (Paragraph 68, lines 15-19: the prosody embedding PE is the output from the reference encoder and is used as an input to the attention module; also Paragraph 62, lines 8-9: the prosody embedding may be of fixed-length) and in turn outputs a set of weights corresponding to the prosodic information of received audio (Paragraph 68, lines 24-27: the set of combination weights represent the contribution of each style token to the encoded prosody embedding; also Paragraph 65: the encoded prosody embedding is from a reference audio signal). However, Battenberg does not explicitly teach the use of multi-headed attention.
In a related art, Prabhavalkar teaches a system that receives audio data, inputs audio data into an encoder, and utilizes a multi-headed attention mechanism (Paragraph 14) to output, from a fixed-sized sequence of outputs from the encoder (i.e. fixed length vector from the encoder), an attention distribution (i.e. a set of weights) corresponding to the input audio (Paragraph 42). Prabhavalkar suggests that their model has improved performance due to its use of multi-headed attention architecture (Paragraph 6).
It would have been obvious to one of ordinary skill in the art at the time of filing to modify Semenov, Battenberg, and Gournay to incorporate the teachings of Prabhavalkar to utilize a multi-headed attention module. Doing so would have been use of a known technique to improve similar devices in the same way (MPEP 2143(C)).
Battenberg teaches an attention module which receives a fixed-length vector and in turn outputs a set of weights corresponding to the prosodic information of received audio. (“Base” device)
Prabhavalkar teaches an attention module for processing input audio data that performs a similar function, but utilizes specifically a multi-headed attention module. (“Comparable” device)
One of ordinary skill in the art could have applied the technique taught by Prabhavalkar to the device taught by Battenberg as they are both attention modules acting upon a speech input. In addition, Prabhavalkar suggests that the use of specifically multi-headed attention may lead to improved performance.
Thus, Semenov, Battenberg, Gournay, and Prabhavalkar together teach:
The system as claimed in claim 2, wherein the expressivity characterization module further comprises an attention sub-module configured to: 
receive the fixed-length vector from the reference encoder sub-module (Battenberg, Paragraph 68, lines 15-19: the prosody embedding PE is the output from the reference encoder and is used as an input to the attention module; also Paragraph 62, lines 8-9: the prosody embedding may be of fixed-length); 
generate a set of weights corresponding to the prosodic information of the received reference audio style file (Battenberg, Paragraph 68, lines 24-27: the set of combination weights represent the contribution of each style token to the encoded prosody embedding; also Paragraph 65: the encoded prosody embedding is from a reference audio signal); 
and output an expression vector comprising the set of weights (Battenberg, Paragraph 68, lines 24-27: the set of combination weights is output; see also Fig. 6, Element 616, 616a-n; it should be noted that, in regards to claim 1, “style token” was construed as a possible expression vector in view of Semenov; given its broadest reasonable interpretation in view of claim 1, the expression vector may still be taught by the style token of Semenov, though such a construal is insufficient in regards to the limitations of claim 4), for the reference audio style file, wherein the attention sub-module is a multi-head attention sub-module (Prabhavalkar, Paragraph 14).
Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over Semenov as applied to claim 1 above, and further in view of Battenberg et al. (U.S. Patent Application Publication 2020/0372897 A1, hereinafter "Battenberg").
In regards to claim 6, Semenov does not explicitly teach a storage configured to store expression vectors for reference audio style files.
In a related art, Battenberg teaches a system for expressive speech synthesis (Paragraphs 26 and 27). Notably, Battenberg teaches that their system may include a reference encoder that generates a fixed-length prosody embedding (i.e. vector) from the reference audio signal (Paragraph 8, lines 7-10; also Paragraph 62). Battenberg further notes that the reference encoder they describe was originally disclosed in earlier non patent literature that has also been disclosed by the applicant (Skerry-Ryan et al., 2018). Furthermore, Battenberg notes that the reference encoder permits sampling of variational embeddings (i.e. expression vectors) previously produced by the encoder so that a greater variety of prosodic and style information is capable of representing the input text (Paragraph 60). Battenberg teaches that their system may capture characteristics of the reference audio signal independent of phonetic information and idiosyncratic speaker traits (Paragraph 62), while also additionally noting that sampling previously produced variational embeddings may remove the need for a reference audio signal.
It would have been obvious to one of ordinary skill in the art at the time of filing to modify Semenov to incorporate the teachings of Battenberg to include the reference encoder in their expressive text to speech system. Doing so would have helped to capture characteristics of the reference audio signal independent of phonetic information and idiosyncratic speaker traits, while also possibly removing the need for a reference audio signal as taught by Battenberg.
Thus, Semenov and Battenberg together teach:
The system as claimed in claim 1, further comprising: a storage configured to store expression vectors for reference audio style files (Battenberg, Paragraph 60, lines 4-9: Battenberg notes that their system permits sampling of variational embeddings (i.e. expression vectors) that were previously produced. Thus, these previously produced variational embeddings must inherently be stored somewhere so that they may be resampled later).
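For illustration only, the reading above (previously produced expression vectors are kept so that they may be retrieved or resampled later) can be sketched as follows. This is a minimal sketch under stated assumptions and does not reproduce the system of Battenberg or the claimed storage: the class name ExpressionVectorStore, its methods, and the file names are hypothetical, and "sampling" is simplified here to picking one stored vector at random.

```python
import random
from typing import Dict, List, Optional


class ExpressionVectorStore:
    """Hypothetical in-memory storage keyed by reference audio style file name."""

    def __init__(self) -> None:
        self._vectors: Dict[str, List[float]] = {}

    def put(self, style_file: str, vector: List[float]) -> None:
        # Keep a copy so later changes by the caller do not alter the stored vector.
        self._vectors[style_file] = list(vector)

    def get(self, style_file: str) -> List[float]:
        # Return a copy of the vector previously stored for this style file.
        return list(self._vectors[style_file])

    def sample(self, rng: Optional[random.Random] = None) -> List[float]:
        # Simplified "resampling": pick one previously stored vector at random,
        # so no new reference audio signal is needed at synthesis time.
        if not self._vectors:
            raise LookupError("no expression vectors have been stored")
        rng = rng or random.Random()
        return self.get(rng.choice(sorted(self._vectors)))


store = ExpressionVectorStore()
store.put("calm.wav", [0.1, 0.2, 0.3])      # hypothetical file names and values
store.put("excited.wav", [0.9, 0.8, 0.7])
picked = store.sample(random.Random(0))
```

The sketch shows only why resampling presupposes some form of storage; how the vectors are produced is outside its scope.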
Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Semenov as applied to claim 8 above, and further in view of Yang et al. (U.S. Patent Application Publication 2020/0035216 A1, hereinafter "Yang").
In regards to claim 10, Semenov fails to teach selection of an expression vector to suit a context from which the input text was obtained.
In a related art, Yang teaches a speech synthesis system based on emotion information (Abstract). Notably, Yang teaches analyzing the context information of a scenario to realize speech synthesis containing emotions (Paragraph 7) by calculating an emotion vector (i.e. expression vector) based on at least that context analysis, and suggests that utilizing the context information helps the system further clarify, complement, or additionally define information included in recognized text (Paragraph 217) which is used for inferring the emotion that will be synthesized with the text.
It would have been obvious to one of ordinary skill in the art at the time of filing to modify Semenov to incorporate the teachings of Yang to analyze context information to infer the emotion to be synthesized with the text. Doing so would be a simple substitution of one known element for another to obtain predictable results (MPEP 2143(B)).
Semenov teaches generation of a plurality of style tokens (i.e. expression vectors), without consideration of the context of the input text.
Yang teaches generation of an emotion vector (i.e. expression vector) on the basis of context analysis on the input text.
One of ordinary skill in the art could have substituted the context analyzing element of Yang into the system of Semenov in order to generate an emotion vector through context analysis, since both systems are directed towards expressive text to speech.
Thus, Semenov and Yang together teach:
The system as claimed in claim 8, wherein the at least one of the plurality of expression vectors comprises an expression vector selected to suit a context from which the received input text is obtained (Yang, Paragraphs 24 and 217).
Claims 11 and 15-16 are rejected under 35 U.S.C. 103 as being unpatentable over Semenov as applied to claims 7 and 14 above, and further in view of Valin et al. (2019, "LPCNet: Improving Neural Speech Synthesis Through Linear Prediction", hereinafter "Valin"). 
In regards to claim 11, while Semenov does teach that the speech received by the audio encoder comprises a plurality of audio frames (Paragraph 45), Semenov does not explicitly teach that these audio frames comprise twenty Bark-based cepstrum features, a period feature and a correlation feature.
Valin teaches a speech synthesis model called “LPCNet” that takes, as input, 18 Bark-scale cepstral coefficients, a period feature, and a correlation feature (Section 3, lines 6-7). Valin notes that they are “limiting” the input of the synthesis to just this many features, suggesting that the size of the input may be a parameter for a user to tune (Section 3, lines 6-7). Valin teaches that their speech synthesis model may significantly improve the efficiency of speech synthesis and make it easier to deploy neural synthesis applications on lower-power devices (Abstract).
It would have been obvious to one of ordinary skill in the art at the time of filing to modify Semenov to incorporate the teachings of Valin to utilize LPCNet in their speech synthesis system. Semenov’s system generates audio data, but is not directed to the generation of the actual speech (Paragraph 4, etc.: Semenov teaches generation of audio data, but makes no teaching on the actual generation of voice); LPCNet utilizes audio data to generate actual speech. Thus, the combination of the two would be a logical step for utilization of Semenov’s system in practice. One would then have to modify the audio data created by Semenov’s system in order to make it usable by the system of Valin. Doing so may have allowed one to take advantage of Valin’s efficient speech synthesis and made it easier to deploy the neural synthesis application on lower-power devices, as taught by Valin. In addition, it would have been obvious to modify the audio frame to contain 20 Bark-scale cepstral coefficients instead of 18. Valin, by noting that they limited the input of the synthesis to just 20 features, suggests that the number of features may be tuned by a future user of LPCNet. A change from 18 to 20 Bark-scale cepstral coefficients is a relatively small change, and the resulting effect on the performance and efficiency of the system would be fairly predictable.
Thus, Semenov and Valin together teach:
The system as claimed in claim 7, wherein the speech received by the audio encoder sub-module comprises a plurality of audio frames (Semenov, Paragraph 45), each audio frame comprising twenty Bark-based cepstrum features, a period feature and a correlation feature (Valin, Section 3, lines 6-7).
In regards to claim 15, Semenov and Valin further teach:
The system as claimed in claim 14, wherein the acoustic features generated by the audio decoder sub-module represent a plurality of audio frames (Semenov, Paragraph 50), each audio frame comprising twenty Bark-based cepstrum features, a period feature and a correlation feature (Valin, Section 3, lines 6-7).
In regards to claim 16, Semenov and Valin further teach:
The system as claimed in claim 14, further comprising a vocoder for synthesizing speech using the acoustic features generated by the audio decoder sub-module, wherein the vocoder comprises an LPCNet model (Valin: Valin’s model is referred to as LPCNet).
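For illustration only, the difference between the frame layout described by Valin (18 Bark-scale cepstral coefficients plus a period feature and a correlation feature) and the layout recited in claims 11 and 15 (twenty Bark-based cepstrum features plus the same two features) can be sketched as follows. This is a minimal sketch under stated assumptions: the AudioFrame type and the make_frame helper are hypothetical, the feature values are zero placeholders, and nothing here is taken from the LPCNet implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AudioFrame:
    """Hypothetical per-frame feature container."""
    cepstrum: List[float]   # Bark-scale cepstral coefficients
    period: float           # pitch period feature
    correlation: float      # pitch correlation feature

    def as_vector(self) -> List[float]:
        # Flatten the frame into one feature vector: cepstrum, then period, then correlation.
        return [*self.cepstrum, self.period, self.correlation]


def make_frame(num_cepstral: int) -> AudioFrame:
    # Placeholder values only; a real front end would compute these from audio.
    return AudioFrame(cepstrum=[0.0] * num_cepstral, period=0.0, correlation=0.0)


valin_frame = make_frame(18)     # layout described by Valin
claimed_frame = make_frame(20)   # layout recited in claims 11 and 15

print(len(valin_frame.as_vector()))    # 20
print(len(claimed_frame.as_vector()))  # 22
```

Under these assumptions the two layouts differ only in the number of cepstral entries, so each frame grows from 20 to 22 values.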
Claims 17 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Semenov as applied to claim 1 above, and further in view of Bardino et al. (U.S. Patent Application Publication 2010/0235166 A1, hereinafter “Bardino”).
In regards to claim 17, Semenov teaches that known methods may have a “knob to turn” that can select a style for a synthesized voice. That is, known methods may interpolate between styles to select a particular expression. However, Semenov does not explicitly teach an interpolation and extrapolation module.
In a related art, Bardino teaches a method of audio processing for transforming audio characteristics of an audio recording (Abstract). Notably, Bardino teaches interpolating and extrapolating between “metadata sets” (i.e. expression vectors) to modify audio recordings. Bardino teaches that this interpolation and extrapolation may better approximate the target emotional level when the available metadata sets do not cover the desired emotional scale value exactly (Paragraph 120).
It would have been obvious to one of ordinary skill in the art at the time of filing to modify Semenov to incorporate the teachings of Bardino to include the described interpolation and extrapolation of metadata sets. Doing so would have helped to better approximate target emotional levels in audio outputs when the available data does not cover the exact desired emotion, as taught by Bardino.
Thus, Semenov and Bardino together teach:
The system as claimed in claim 1, wherein the at least one processor is further configured to: 
generate, using an interpolation and extrapolation module, a user-defined expression vector (Bardino, Paragraph 120: metadata sets may be interpolated and extrapolated to generate the desired emotional scale value; Abstract, lines 4-6: metadata sets (i.e. expression vectors) comprise transformation profiles; Paragraph 98: Transformation profiles may contain user-definable values; see also Paragraph 105 and accompanying Table (unlabeled)) for use by the expressive acoustic model to generate expressive speech (Bardino, Paragraph 120: a metadata set may make a voice sound slightly tired (i.e. expressive)) from the input text (Semenov, Paragraph 32: Semenov modifies input text to add expression).
In regards to claim 18, Semenov and Bardino further teach:
The system as claimed in claim 17, wherein the interpolation and extrapolation module is configured to: 
obtain, from a storage, a first expression vector and a second expression vector, each of the first expression vector and the second expression vector representing a distinct style (Bardino, Paragraph 120: parameters may be interpolated between metadata sets; i.e. between a first and second expression vector); 
perform a linear interpolation or extrapolation between the first expression vector and the second expression vector, using a user-defined scalar value (Bardino, Paragraph 98: Transformation profiles (and by extension, metadata sets) may contain user-definable values; see also Paragraph 105 and accompanying table (unlabeled)); 
and generate the user-defined expression vector, wherein the user-defined expression vector and received input text is input into the expressive acoustic model to generate expressive speech from the received input text (Semenov, Paragraph 32).
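For illustration only, the linear interpolation or extrapolation recited in claim 18 can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Semenov, Bardino, or the application: the function name blend_expression_vectors, the parameter alpha (standing in for the user-defined scalar value), and the example vectors are hypothetical.

```python
from typing import List, Sequence


def blend_expression_vectors(
    first: Sequence[float], second: Sequence[float], alpha: float
) -> List[float]:
    """Return first + alpha * (second - first), computed element by element.

    alpha between 0 and 1 interpolates between the two vectors;
    alpha below 0 or above 1 extrapolates beyond them.
    """
    if len(first) != len(second):
        raise ValueError("expression vectors must have the same length")
    return [a + alpha * (b - a) for a, b in zip(first, second)]


# Hypothetical stored expression vectors for two distinct styles.
neutral = [0.0, 1.0, 2.0]
happy = [2.0, 1.0, 0.0]

halfway = blend_expression_vectors(neutral, happy, 0.5)  # [1.0, 1.0, 1.0]
beyond = blend_expression_vectors(neutral, happy, 1.5)   # [3.0, 1.0, -1.0]
```

In this sketch a single scalar covers both cases, which is one way to read "interpolation or extrapolation ... using a user-defined scalar value"; the resulting vector would then be supplied, together with the input text, to whatever acoustic model is in use.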

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Jesse Pullias whose telephone number is 571/270-5135. The examiner can normally be reached M-F 8:30 AM - 4:30 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Flanders, can be reached at (571) 272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




/JESSE S PULLIAS/
Primary Examiner, Art Unit 2655
11/03/22