Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Independent claims 1 and 13 recite “generating a plurality of style tokens from a set of audio inputs; generating an input feature vector based on the plurality of style tokens and a set of text features; and generating audio data based on the input feature vector”.
The limitation of “generating audio data”, as drafted, covers a human organizing activities (mathematical algorithm). More specifically, the system takes in audio input and generates audio output based on the input. The steps are mathematical in nature.
The judicial exception is not integrated into a practical application. In particular, claim 13 recites an additional element of a “processor” as per the independent claim (claim 1 comprises no additional limitations). For example, in [0022] of the filed specification, there is a description of using a general purpose computing environment. Accordingly, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the integration of the abstract idea into a practical application, the additional element of using a computer is .
With respect to claims 2 and 14, the claims relate to generating speaker and prosody tokens using their respective subnetworks. This reads on a system generating content based on input. No additional limitations are present.
With respect to claim 3, the claim relates to one of the subnetworks being pre-trained. This relates to an insignificant extra-solution activity of outputting the information. The subnetwork being pre-trained is an additional limitation but serves the purpose of outputting the information which has already been determined.
With respect to claims 4 and 15, the claims relate to the audio inputs containing desired characteristics and the audio outputs reflecting these desired characteristics. This relates to an insignificant extra-solution activity of outputting the information.
With respect to claims 5 and 16, the claims relate to generating an input feature vector with the use of at least one of averaging, concatenating, and adding a subset of the plurality of style tokens. This reads on a system generating content based on input. This relates to an insignificant extra-solution activity of outputting the information.
With respect to claims 6 and 17, the claims relate to the set of text features comprising at least one of raw text, audio data, parts of speech, and phonemes. This reads on computing the output based on features of the input. This relates to an insignificant extra-solution activity of outputting the information.
With respect to claims 7 and 18, the claims relate to utilizing a convolutional neural network to generate a spectrogram. This relates to an insignificant extra-solution activity of outputting the information.
With respect to claim 8, the claim relates to utilizing teacher and student networks to generate the audio data. This relates to an insignificant extra-solution activity of outputting the information.
With
Claim Rejections - 35 USC § 102
	The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale or otherwise available to the public before the effective filing date of the claimed invention.

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
	

Regarding claim 1, McDuff discloses a method for generating audio data, the method comprising ([0149] - A method comprising: … generating a response dialogue based on the content of the speech):
generating a plurality of style tokens from a set of audio inputs ([0030] - The speech recognizer 206 outputs the recognized speech of the user 102, as text or in another format, to a neural dialogue generation 210, a linguistic style extractor 212, and a custom intent recognizer 214. [0032] - The linguistic style extractor 212 identifies non prosodic components of the user's conversational style that may be referred to as “content variables”);
generating an input feature vector based on the plurality of style tokens and a set of text features ([0035] - The custom intent recognizer 214 recognizes intents in the speech identified by the speech recognizer 206. If the speech recognizer 206 outputs text, then the custom intent recognizer 214 acts on the text rather than on audio or another representation of the user's speech 104. Intent recognition identifies one or more intents in natural language using machine learning techniques trained from a labeled dataset. The labeled dataset may be a collection of text labeled with intent data. An intent recognizer may be created by training a neural network (either deep or shallow) or using any other machine learning techniques such as Naïve Bayes, Support Vector Machines (SVM), and Maximum Entropy with n-gram features);
and generating audio data based on the input feature vector ([0037] - The dialogue manager 216 captures input from the linguistic style extractor 212 and the custom intent recognizer 214 to generate for dialogue that will be produced by the conversational agent).

Regarding claim 3, McDuff discloses the method of claim 2, wherein at least one of the speaker subnetwork and the prosody subnetwork is a pre-trained network ([0035] - An intent recognizer may be created by training a neural network (either deep or shallow) or using any other machine learning techniques such as Naïve Bayes, Support Vector Machines (SVM), and Maximum Entropy with n-gram features. [0037] - domain-specific scripted dialogue from the custom intent recognizer 214).
Regarding claim 4, McDuff discloses the method of claim 1, wherein the set of audio inputs comprises a set of samples with a desired characteristic, wherein the generated audio data reflects the desired characteristic ([0046] - Thus, the speech synthesizer 220 will generate synthetic speech which not only provides appropriate response content in response to an utterance of the user 102 but also is modified based on the content variables and acoustic variables identified in the user's utterance).
Regarding claim 5, McDuff discloses the method of claim 1, wherein generating the input feature vector comprises at least one of averaging, concatenating, and adding a subset of the plurality of style tokens ([0035] - Intent recognition identifies one or more intents in natural language using machine learning techniques trained from a labeled dataset. An intent may be the 
Regarding claim 6, McDuff discloses the method of claim 1, wherein the set of text features comprises at least one of raw text, audio data, parts of speech, and phonemes ([0035] - The labeled dataset may be a collection of text labeled with intent data).
Regarding claim 13, McDuff discloses a non-transitory machine readable medium containing processor instructions for generating audio data, where execution of the instructions by a processor causes the processor to perform a process that comprises ([0022] - The local computing device 106 may include one or more processor(s) 112, a memory 114, and one or more communication interface(s) 116. [0116] - Computer-readable storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, compact disc read-only memory (CD ROM)… or any other non transmission medium that can be used to store information for access by a computing device):
generating a plurality of style tokens from a set of audio inputs ([0030] - The speech recognizer 206 outputs the recognized speech of the user 102, as text or in another format, to a neural dialogue generation 210, a linguistic style extractor 212, and a custom intent recognizer 214. [0032] - The linguistic style extractor 212 identifies non prosodic components of the user's conversational style that may be referred to as “content variables”);
generating an input feature vector based on the plurality of style tokens and a set of text features ([0035] - The custom intent recognizer 214 recognizes intents in the speech identified by the speech recognizer 206. If the speech recognizer 206 outputs text, then the custom intent recognizer 214 acts on the text rather than on audio or another representation of the user's speech 104. Intent recognition identifies one or more intents in natural language using machine learning 
and generating audio data based on the input feature vector ([0037] - The dialogue manager 216 captures input from the linguistic style extractor 212 and the custom intent recognizer 214 to generate for dialogue that will be produced by the conversational agent).
Regarding claim 14, McDuff discloses the non-transitory machine readable medium of claim 13, wherein generating the plurality of style tokens comprises: generating a speaker token using a speaker subnetwork; and generating a prosody token using a prosody subnetwork ([0027] - A voice activity recognizer 204 processes the microphone input 202 to extract voiced segments. [0028] - The microphone input 202 that corresponds to voice activity is passed to the speech recognizer 206. [0029] - Output from the voice activity recognizer 204 is also provided to a prosody recognizer 208 that performs paralinguistic parameter recognition on the audio segments that contain voice activity).
Regarding claim 15, McDuff discloses the non-transitory machine readable medium of claim 13, wherein the set of audio inputs comprises a set of samples with a desired characteristic, wherein the generated audio data reflects the desired characteristic ([0046] - Thus, the speech synthesizer 220 will generate synthetic speech which not only provides appropriate response content in response to an utterance of the user 102 but also is modified based on the content variables and acoustic variables identified in the user's utterance).
Regarding claim 16, McDuff discloses the non-transitory machine readable medium of claim 13, wherein generating the input feature vector comprises at least one of averaging, 
Regarding claim 17, McDuff discloses the non-transitory machine readable medium of claim 13, wherein the set of text features comprises at least one of raw text, audio data, parts of speech, and phonemes ([0035] - The labeled dataset may be a collection of text labeled with intent data).
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 7, 11-12, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over McDuff (U.S. Publication No. 20200279553) in view of Arik (U.S. Publication No. 20190355347).
Regarding claim 7, McDuff discloses all of the limitations as in claim 1, above.
However, McDuff does not disclose the method of claim 1, wherein generating the audio data comprises utilizing a convolution neural network (CNN) to generate a spectrogram.
Arik does teach the method of claim 1, wherein generating the audio data comprises utilizing a convolution neural network (CNN) to generate a spectrogram ([0012] - FIG. 3 depicts a general methodology for training a convolution neural network (CNN) that may be used to 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified McDuff to incorporate the teachings of Arik in order to implement the method of claim 1, wherein generating the audio data comprises utilizing a convolution neural network (CNN) to generate a spectrogram. Doing so allows for efficient neural network architecture that achieves a high compute intensity and fast inference (Arik [0003]).
Regarding claim 11, McDuff discloses all of the limitations as in claim 1, above.
However, McDuff does not disclose the method of claim 1, wherein the generated audio data is a mel spectrogram.
Arik does teach the method of claim 1, wherein the generated audio data is a mel spectrogram ([0047] - Deep neural networks have recently demonstrated excellent results in generative audio applications. A major one is text-to-speech. For example, the WaveNet architecture has been proposed, which synthesizes speech in an autoregressive way conditioned on the linguistic features. WaveNet has been approximated by a parallelizable architecture learned via distillation. On the other hand, many successful approaches for text-to-speech are based on separating the problem into text-to-spectrogram (or text-to-mel spectrogram) conversion, combined with a spectrogram inversion technique).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified McDuff to incorporate the teachings of Arik in order to implement the method of claim 1, wherein the generated audio data is a mel 
Regarding claim 12, McDuff in view of Arik teaches all of the limitations as in claim 11, above.
However, McDuff does not disclose the method of claim 11, wherein the method further comprises generating audio waveforms from the generated spectrogram.
Arik does teach the method of claim 11, wherein the method further comprises generating audio waveforms from the generated spectrogram ([0007] - One common use case of spectrograms is the audio domain. Autoregressive modeling of waveforms, in particular for audio, is a common approach).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified McDuff to incorporate the teachings of Arik in order to implement the method of claim 11, wherein the method further comprises generating audio waveforms from the generated spectrogram. Doing so allows for efficient neural network architecture that achieves a high compute intensity and fast inference (Arik [0003]).
Regarding claim 18, McDuff discloses all of the limitations as in claim 13, above.
However, McDuff does not disclose the non-transitory machine readable medium of claim 13, wherein generating the audio data comprises utilizing a convolution neural network (CNN) to generate a spectrogram.
Arik does teach the non-transitory machine readable medium of claim 13, wherein generating the audio data comprises utilizing a convolution neural network (CNN) to generate a spectrogram ([0012] - FIG. 3 depicts a general methodology for training a convolution neural network (CNN) that may be used to generate a synthesized waveform from an input 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified McDuff to incorporate the teachings of Arik in order to implement the non-transitory machine readable medium of claim 13, wherein generating the audio data comprises utilizing a convolution neural network (CNN) to generate a spectrogram. Doing so allows for efficient neural network architecture that achieves a high compute intensity and fast inference (Arik [0003]).
Claims 8-10 and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over McDuff (U.S. Publication No. 20200279553) in view of Ping (U.S. Publication No. 20190180732).
Regarding claim 8, McDuff discloses all of the limitations as in claim 1, above.
However, McDuff does not disclose the method of claim 1, wherein generating the audio data comprises utilizing teacher and student networks to generate the audio data.
Ping does teach the method of claim 1, wherein generating the audio data comprises utilizing teacher and student networks to generate the audio data ([0003] - Typically, these systems first transform text into a compact audio representation, and then convert this representation into audio using an audio waveform synthesis method called a vocoder. [0030] - Most recently, Oord et al. (Parallel WaveNet: Fast high-fidelity speech synthesis, ICML, 2018)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified McDuff to incorporate the teachings of Ping in order to implement the method of claim 1, wherein generating the audio data comprises utilizing teacher and student networks to generate the audio data. Doing so allows for the enablement of fast inference and end-to-end training for improved performance (Ping [0005]).
Regarding claim 9, McDuff in view of Ping teaches all of the limitations as in claim 8, above.
However, McDuff does not disclose the method of claim 8, wherein generating the audio data comprises:
training the teacher network to generate audio data in an autoregressive manner;
and training the student network to learn from the teacher network to generate audio data in a non-autoregressive manner.
Ping does teach the method of claim 8, wherein generating the audio data comprises:
training the teacher network to generate audio data in an autoregressive manner (Figure 11 - 1125 - Distill a Gaussian inverse autoregressive flow from the autoregressive teacher-net into a non-autoregressive student-net using a linear combination of a regularized KL divergence and a frame-level loss. [0003] - Typically, these systems first transform text into a compact audio representation, and then convert this representation into audio using an audio waveform synthesis method called a vocoder);
and training the student network to learn from the teacher network to generate audio data in a non-autoregressive manner (Figure 11 - 1125 - Distill a Gaussian inverse autoregressive 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified McDuff to incorporate the teachings of Ping in order to implement the method of claim 8, wherein generating the audio data comprises: training the teacher network to generate audio data in an autoregressive manner; and training the student network to learn from the teacher network to generate audio data in a non-autoregressive manner. Doing so allows for the enablement of fast inference and end-to-end training for improved performance (Ping [0005]).
Regarding claim 10, McDuff in view of Ping teaches all of the limitations as in claim 9, above.
However, McDuff does not disclose the method of claim 9, wherein training the student network comprises training the student network to learn to predict attention from the set of audio inputs, wherein the student network generates the audio data using the predicted attention.
Ping does teach the method of claim 9, wherein training the student network comprises training the student network to learn to predict attention from the set of audio inputs, wherein the student network generates the audio data using the predicted attention ([0039] - However, they depend on a traditional vocoder, the Griffin-Lim algorithm, or a separately trained neural vocoder to convert the predicted spectrogram to raw audio. [0087] – the hidden representations learned from the attention mechanism may be directly fed to the neural vocoder through some
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified McDuff to incorporate the teachings of Ping in order to implement the method of claim 9, wherein training the student network comprises training the student network to learn to predict attention from the set of audio inputs, wherein the student network generates the audio data using the predicted attention. Doing so allows for the enablement of fast inference and end-to-end training for improved performance (Ping [0005]).
Regarding claim 19, McDuff discloses all of the limitations as in claim 13, above.
However, McDuff does not disclose the non-transitory machine readable medium of claim 13, 
wherein generating the audio data comprises utilizing teacher and student networks to generate the audio data, wherein generating the audio data comprises:
training the teacher network to generate audio data in an autoregressive manner, and training the student network to learn from the teacher network to generate audio data in a non-autoregressive manner.
Ping does teach the non-transitory machine readable medium of claim 13, 
wherein generating the audio data comprises utilizing teacher and student networks to generate the audio data, wherein generating the audio data comprises ([0003] - Typically, these systems first transform text into a compact audio representation, and then convert this representation into audio using an audio waveform synthesis method called a vocoder. [0030] - Most recently, Oord et al. (Parallel WaveNet: Fast high-fidelity speech synthesis, ICML, 2018) 
training the teacher network to generate audio data in an autoregressive manner, and training the student network to learn from the teacher network to generate audio data in a non-autoregressive manner (Figure 11 - 1125 - Distill a Gaussian inverse autoregressive flow from the autoregressive teacher-net into a non-autoregressive student-net using a linear combination of a regularized KL divergence and a frame-level loss. [0003] - Typically, these systems first transform text into a compact audio representation, and then convert this representation into audio using an audio waveform synthesis method called a vocoder).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified McDuff to incorporate the teachings of Ping in order to implement the non-transitory machine readable medium of claim 13, wherein generating the audio data comprises utilizing teacher and student networks to generate the audio data, wherein generating the audio data comprises: training the teacher network to generate audio data in an autoregressive manner, and training the student network to learn from the teacher network to generate audio data in a non-autoregressive manner. Doing so allows for the enablement of fast inference and end-to-end training for improved performance (Ping [0005]).
Regarding claim 20, McDuff in view of Ping teaches all of the limitations as in claim 19, above.
However, McDuff does not disclose the non-transitory machine readable medium of claim 19, wherein training the student network comprises training the student network to learn to
Ping does teach the non-transitory machine readable medium of claim 19, wherein training the student network comprises training the student network to learn to predict attention from the set of audio inputs, wherein the student network generates the audio data using the predicted attention ([0039] - However, they depend on a traditional vocoder, the Griffin-Lim algorithm, or a separately trained neural vocoder to convert the predicted spectrogram to raw audio. [0087] – the hidden representations learned from the attention mechanism may be directly fed to the neural vocoder through some intermediate processing, and the whole model from scratch may be trained in an end-to-end manner).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified McDuff to incorporate the teachings of Ping in order to implement the non-transitory machine readable medium of claim 19, wherein training the student network comprises training the student network to learn to predict attention from the set of audio inputs, wherein the student network generates the audio data using the predicted attention. Doing so allows for the enablement of fast inference and end-to-end training for improved performance (Ping [0005]).
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Li (U.S. Patent No. 10885900) teaches domain adaptation in speech recognition via teacher-student learning. Kato (U.S. Publication No. 20090234652) teaches a voice synthesis device. Zhou (U.S. Publication No. 20200410976) teaches speech style transfer.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Richemond Dorvil, can be reached at (571) 272-7602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/ETHAN DANIEL KIM/
Examiner, Art Unit 2658

/RICHEMOND DORVIL/
Supervisory Patent Examiner, Art Unit 2658