DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claim(s) 1, 2, 4, 13, 14 and 16 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Zhao U.S. PAP 2015/0364128.

Regarding claim 1 Zhao teaches a system (technology relates to a system for converting text to speech, see par. [0007]) comprising: 
a context encoder configured (one processor and memory encoding computer executable instructions, see par. [0007]) to:
receive one or more context features associated with current input text to be synthesized into expressive speech, each context feature derived from a text source of the current input text (receiving text input and receiving two or more properties from the group consisting of: part-of-speech properties, phonemes, linguistic prosody properties, contextual properties, and semantic properties, see par. [0007]);
and process the one or more context features to generate a context embedding associated with the current input text (two or more properties are determined by a recurrent neural network module, see par. [0007]); 
a text-prediction network in communication with the context encoder and configured to: receive the current input text from the text source, the text source comprising sequences of text to be synthesized into expressive speech (receiving text input, see par. [0007]);
receive the context embedding associated with the current input text from the context encoder (determining phonetic properties for the text input based on the received two or more properties, see par. [0007]);
and process the current input text and the context embedding associated with the current input text to predict, as output, a style embedding for the current input text, the style embedding specifying a specific prosody and/or style for synthesizing the current input text into expressive speech (generating a generation sequence, wherein generating the generation sequence utilizes a unified recurrent neural network decoder, see par. [0007]; the linguistic prosody tagger (LPT) RNN module 108 determines linguistic prosody properties for letters, words, or groups of words from the input 102, see par. [0038]);
and a text-to-speech model in communication with the text-prediction network and configured to: 
receive the current input text from the text source (receive input text, see par. [0003]); 
receive the style embedding predicted by the text-prediction network (determining phonetic properties for the text input based on the received two or more properties, see par. [0007]);
and process the current input text and the style embedding to generate an output audio signal of expressive speech of the current input text, the output audio signal having the specific prosody and/or style specified by the style embedding (synthesizing the generation sequence into audible speech, see par. [0004]).
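For illustration only, the data flow recited in claim 1 (context encoder, then text-prediction network, then text-to-speech model) can be sketched as below. The sketch is not drawn from Zhao or from the application: every function name, the embedding size, and the toy arithmetic are hypothetical stand-ins chosen only to show which component consumes which input.

```python
import math
from typing import List

# Hypothetical stand-ins for the three claimed components. None of these
# names or computations come from Zhao or from the application.

def context_encoder(context_features: List[str], dim: int = 4) -> List[float]:
    """Map context features (e.g. previous text) to a fixed-size context embedding."""
    vec = [0.0] * dim
    for feature in context_features:
        for i, ch in enumerate(feature):
            vec[i % dim] += ord(ch) / 1000.0
    return vec

def text_prediction_network(text: str, context_embedding: List[float]) -> List[float]:
    """Predict a style embedding from the current text and the context embedding."""
    text_signal = len(text) / 100.0
    return [math.tanh(c + text_signal) for c in context_embedding]

def tts_model(text: str, style_embedding: List[float]) -> List[float]:
    """Produce a toy 'audio' signal: one sample per character, scaled by the style."""
    gain = 1.0 + sum(style_embedding) / len(style_embedding)
    return [gain * (ord(ch) % 32) / 32.0 for ch in text]

def synthesize(text: str, context_features: List[str]) -> List[float]:
    """Chain the three components in the order recited in claim 1."""
    context_embedding = context_encoder(context_features)
    style_embedding = text_prediction_network(text, context_embedding)
    return tts_model(text, style_embedding)
```

In this sketch the text-to-speech stage receives both the current input text and the predicted style embedding, which mirrors the two "receive" limitations of the text-to-speech model.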
Regarding claim 2 Zhao teaches the system of claim 1, wherein the one or more context features associated with the current input text comprise at least one of: the current input text (input text 102, see fig 1); previous text from the text source that precedes the current input text; previous speech synthesized from the previous text; upcoming text from the text source that follows the current input text; or a previous style embedding predicted by the text-prediction network based on the previous text and a previous context embedding associated with the previous text (the processing may be based on the history of the letters previously analyzed, encoded as S.sub.0, and the future letters, see par. [0059]).
Regarding claim 4 Zhao teaches the system of claim 1, wherein: 
the text source comprises a dialogue transcript (the input text may be in the form of a single word, a letter of a word, or a group of words, such as a sentence, paragraph, or dialogue, see par. [0041]);
the current input text corresponds to a current turn in the dialogue transcript (Examples of the contextual information include emotional style, dialogue state, see par. [0040]); and the one or more context features associated with the current input text comprises at least one of:
previous text in the dialogue transcript that corresponds to a previous turn in the dialogue transcript; or upcoming text in the dialogue transcript that corresponds to a next turn in the dialogue transcript (it is desirable to identify a likeliest phonetic property sequence for text in the sequence of text given all text in such sequence, “future” text may be desirably employed as input when determining the semantic label for word w(t), see par. [0051]).

Regarding claim 13 Zhao teaches a method for generating an output audio signal of expressive synthesized speech (the technology relates to a method for converting text to speech, see par. [0003]), the method comprising:
receiving, at data processing hardware, current input text from a text source, the current input text to be synthesized into expressive speech by a text-to-speech (TTS) model (receiving text input and receiving two or more properties from a group consisting of part-of-speech properties, see par. [0003]); 
generating, by the data processing hardware, using a context model, a context embedding associated with current input text by processing one or more context features derived from the text source (receiving two or more properties from a group consisting of part-of-speech properties, phonemes, linguistic prosody properties, contextual properties, and semantic properties. The two or more phonetic properties are determined by a recurrent neural network (RNN) module, see par. [0003]); 
predicting, by the data processing hardware, using a text-prediction network, a style embedding for the current input text by processing the current input text and the context embedding associated with the current input text, the style embedding specifying a specific prosody and/or style for synthesizing the current input text into expressive speech (determining phonetic properties for the text input based on the received two or more properties and generating a generation sequence. In one embodiment, the two or more properties received are the part-of-speech properties and phonemes. In another embodiment, the two or more properties received are the linguistic prosody properties, the contextual properties, and the semantic properties, see par. [0003]); 
and generating, by the data processing hardware, using the TTS model, the output audio signal of expressive speech of the current input text by processing the style embedding and the current input text, the output audio signal having the specific prosody and/or style specified by the style embedding (synthesizing the generation sequence into audible speech, see par. [0004]).
Regarding claim 14 Zhao teaches the method of claim 13, wherein the one or more context features associated with the current input text comprise at least one of: the current input text (input text 102, see fig 1); previous text from the text source that precedes the current input text; previous speech synthesized from the previous text; upcoming text from the text source that follows the current input text; or a previous style embedding predicted by the text-prediction network based on the previous text and a previous context embedding associated with the previous text (the processing may be based on the history of the letters previously analyzed, encoded as S.sub.0, and the future letters, see par. [0059]).
Regarding claim 16 Zhao teaches the method of claim 13, wherein: the text source comprises a dialogue transcript (the input text may be in the form of a single word, a letter of a word, or a group of words, such as a sentence, paragraph, or dialogue, see par. [0041]);
the current input text corresponds to a current turn in the dialogue transcript (Examples of the contextual information include emotional style, dialogue state, see par. [0040]); and the one or more context features associated with the current input text comprises at least one of:
previous text in the dialogue transcript that corresponds to a previous turn in the dialogue transcript; or upcoming text in the dialogue transcript that corresponds to a next turn in the dialogue transcript (it is desirable to identify a likeliest phonetic property sequence for text in the sequence of text given all text in such sequence, “future” text may be desirably employed as input when determining the semantic label for word w(t), see par. [0051]).
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 3, 11, 12, 15, 23 and 24 is/are rejected under 35 U.S.C. 103 as being unpatentable over Zhao U.S. PAP 2015/0364128 in view of Chicote U.S. Patent No. 10,475,438 B1.
Regarding claim 3 Zhao does not teach the system of claim 1, wherein: the text source comprises a text document, and the one or more context features associated with the current input text comprise at least one of: a title of the text document; a title of a chapter in the text document; a title of a section in the text document; a headline in the text document; one or more bullet points in the text document; entities from a concept graph extracted from the text document; or one or more structured answer representations extracted from the text document.
In the same field of endeavor Chicote teaches a text-to-speech (TTS) system that is capable of considering characteristics of various portions of text data in order to create continuity between segments of synthesized speech, see abstract. One issue with performing TTS processing on large textual works, such as eBooks, is how to segment a book or other work into portions in order to provide context so that the portions of the book are smaller and include contextual information to improve the efficiency of the TTS system. Other textual characteristics, such as indications of dialog, chapter breaks, etc., if known to a TTS system, may allow the TTS system to create more natural sounding output audio. Thus, it may be beneficial for a system to take raw text from an electronic file that is unstructured (e.g., has limited data markings/offsets around the text) and identify such markings/offsets (for example, the chapters, paragraphs, etc.) as well as to identify certain characteristics about the text (e.g., a paragraph's nature and content, etc.). Then TTS processing can be performed on the segmented portions using the additional information to provide natural and pleasant audio data to a user device, see col. 2 lines 33-67.
It would have been obvious to one of ordinary skill in the art to combine the Zhao invention with the teachings of Chicote for the benefit of providing natural and pleasant audio data to a user device, see col. 2 lines 33-67.

Regarding claim 11 Chicote teaches the system of claim 1, wherein the context model, the text-prediction model, and the text-to-speech model are trained jointly (The encoder 220 and/or TTS models may be trained jointly or separately).
Regarding claim 12 Chicote teaches the system of claim 1, wherein a two-step training procedure trains the text-to-speech model during a first step of the training procedure and separately trains the context model and the text-prediction model jointly during a second step of the training procedure (The encoder 220 and/or TTS models may be trained jointly or separately).
Regarding claim 15 Zhao does not teach the method of claim 13, wherein: the text source comprises a text document, and the one or more context features associated with the current input text comprise at least one of: a title of the text document; a title of a chapter in the text document; a title of a section in the text document; a headline in the text document; one or more bullet points in the text document; entities from a concept graph extracted from the text document; or one or more structured answer representations extracted from the text document.
In the same field of endeavor Chicote teaches a text-to-speech (TTS) system that is capable of considering characteristics of various portions of text data in order to create continuity between segments of synthesized speech, see abstract. One issue with performing TTS processing on large textual works, such as eBooks, is how to segment a book or other work into portions in order to provide context so that the portions of the book are smaller and include contextual information to improve the efficiency of the TTS system. Other textual characteristics, such as indications of dialog, chapter breaks, etc., if known to a TTS system, may allow the TTS system to create more natural sounding output audio. Thus, it may be beneficial for a system to take raw text from an electronic file that is unstructured (e.g., has limited data markings/offsets around the text) and identify such markings/offsets (for example, the chapters, paragraphs, etc.) as well as to identify certain characteristics about the text (e.g., a paragraph's nature and content, etc.). Then TTS processing can be performed on the segmented portions using the additional information to provide natural and pleasant audio data to a user device, see col. 2 lines 33-67.
It would have been obvious to one of ordinary skill in the art to combine the Zhao invention with the teachings of Chicote for the benefit of providing natural and pleasant audio data to a user device, see col. 2 lines 33-67.
Regarding claim 23 Chicote teaches the method of claim 13, wherein the context model, the text-prediction model, and the text-to-speech model are trained jointly (The encoder 220 and/or TTS models may be trained jointly or separately).
Regarding claim 24 Chicote teaches the method of claim 13, wherein a two-step training procedure trains the text-to-speech model during a first step of the training procedure and separately trains the context model and the text-prediction model jointly during a second step of the training procedure (The encoder 220 and/or TTS models may be trained jointly or separately).

Claim(s) 5 and 17 is/are rejected under 35 U.S.C. 103 as being unpatentable over Zhao U.S. PAP 2015/0364128 in view of Fructuoso U.S. PAP 2015/0186359 A1.
Regarding claim 5 Zhao does not teach the system of claim 1, wherein: the text source comprises a query-response system; the current input text corresponds to a response to a current query received at the query-response system; and the one or more context features associated with the current input text comprises at least one of: text associated with the current query or text associated with a sequence of queries received at the query-response system, the sequence of queries comprising the current query and one or more queries preceding the current query; or audio features associated with the current query or audio features associated with the sequence of queries received at the query-response system.
In the same field of endeavor Fructuoso teaches text-to-speech systems can be used to artificially generate an audible representation of a text. Text-to-speech systems typically attempt to approximate various characteristics of human speech, such as the sounds produced, rhythm of speech, and intonation, see par. [0002]. In the example of FIG. 1, the computing system 120 obtains a text 121 for which synthesized speech should be generated. The text 121 may be provided by any appropriate source. For example, the client device 110 may provide the text 121 over the network 130 and request an audio representation. Alternatively, the text 121 may be generated by the computing system 120, accessed from storage, received from another computing system, or obtained from another source. Examples of texts for which synthesized speech may be desired include text of an answer to a voice query, text in web pages, short message service (SMS) text messages, e-mail messages, social media content, user notifications from an application or device, and media playlist information, see par. [0030].
It would have been obvious to one of ordinary skill in the art to combine the Zhao invention with the teachings of Fructuoso for the benefit of approximating human speech to produce more natural synthesized speech, see par. [0002].
Regarding claim 17 Zhao does not teach the method of claim 13, wherein: the text source comprises a query-response system; the current input text corresponds to a response to a current query received at the query-response system; and the one or more context features associated with the current input text comprises at least one of: text associated with the current query or text associated with a sequence of queries received at the query-response system, the sequence of queries comprising the current query and one or more queries preceding the current query; or audio features associated with the current query or audio features associated with the sequence of queries received at the query-response system.
In the same field of endeavor Fructuoso teaches text-to-speech systems can be used to artificially generate an audible representation of a text. Text-to-speech systems typically attempt to approximate various characteristics of human speech, such as the sounds produced, rhythm of speech, and intonation, see par. [0002]. In the example of FIG. 1, the computing system 120 obtains a text 121 for which synthesized speech should be generated. The text 121 may be provided by any appropriate source. For example, the client device 110 may provide the text 121 over the network 130 and request an audio representation. Alternatively, the text 121 may be generated by the computing system 120, accessed from storage, received from another computing system, or obtained from another source. Examples of texts for which synthesized speech may be desired include text of an answer to a voice query, text in web pages, short message service (SMS) text messages, e-mail messages, social media content, user notifications from an application or device, and media playlist information, see par. [0030].
It would have been obvious to one of ordinary skill in the art to combine the Zhao invention with the teachings of Fructuoso for the benefit of approximating human speech to produce more natural synthesized speech, see par. [0002].
Claim(s) 6-10 and 18-22 is/are rejected under 35 U.S.C. 103 as being unpatentable over Zhao U.S. PAP 2015/0364128 in view of Wang “Tacotron: A Fully End-To-End Text-To-Speech Synthesis Model”.
Regarding claim 6 Zhao teaches the system of claim 1, wherein the text-to-speech model comprises: 
an encoder neural network (recurrent neural network, see par. [0007]) configured to: 
receive the current input text from the text source (receive text, see par. [0007]); and process the current input text to generate a respective encoded sequence of the current input text (to retain text ordering information, representations may be concatenated in sequence in a given context window, see par. [0052]).
However Zhao does not teach a concatenator configured to: receive the respective encoded sequence of the current input text from the encoder neural network; receive the style embedding predicted by the textual-prediction network; and generate a concatenation between the respective encoded sequence of the current input text and the style embedding; and an attention-based decoder recurrent neural network configured to: receive a sequence of decoder inputs; and for each decoder input in the sequence, process the corresponding decoder input and the concatenation between the respective encoded sequence of the current input text and the style embedding to generate r frames of the output audio signal, wherein r comprises an integer greater than one.
In the same field of endeavor Wang teaches an end-to-end generative text-to-speech model that synthesizes speech directly from characters. Given <text, audio> pairs, the model can be trained completely from scratch with random initialization, see abstract.
Wang further teaches a concatenator configured to: receive the respective encoded sequence of the current input text from the encoder neural network (extract robust sequential representations from text, see section 3.2); receive the style embedding predicted by the textual-prediction network (apply pre-net to each embedding, see section 3.2); and generate a concatenation between the respective encoded sequence of the current input text and the style embedding (concatenate the context vector a and output from encoder, see section 3.3); and an attention-based decoder recurrent neural network (attention decoder, see section 3.3) configured to: receive a sequence of decoder inputs (receive output from attention RNN, see section 3.3); and for each decoder input in the sequence, process the corresponding decoder input and the concatenation between the respective encoded sequence of the current input text and the style embedding to generate r frames of the output audio signal, wherein r comprises an integer greater than one (concatenate the context vector and the attention RNN cell output, see section 3.3).
It would have been obvious to one of ordinary skill in the art to combine the Zhao invention with the teachings of Wang for the benefit of training a model completely from scratch with random initialization, see abstract.
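For illustration only, the concatenator and the "r frames per decoder input" limitations recited in claim 6 can be sketched as below. The sketch is not taken from Zhao, Wang, or the application. It assumes one possible reading in which the same style embedding is appended to every encoder time step, and it replaces the attention mechanism with a uniform average; the function names and the toy frame arithmetic are hypothetical.

```python
from typing import List

Vector = List[float]

def concatenate_style(encoded: List[Vector], style: Vector) -> List[Vector]:
    """Append the same style embedding to every encoder time step (assumed reading)."""
    return [step + style for step in encoded]

def decode(conditioned: List[Vector], num_steps: int, r: int) -> List[Vector]:
    """Toy decoder: every decoder step emits r output frames.

    A uniform average over the conditioned sequence stands in for attention;
    this is a simplification, not an attention-based RNN.
    """
    width = len(conditioned[0])
    context = [
        sum(step[i] for step in conditioned) / len(conditioned)
        for i in range(width)
    ]
    frames: List[Vector] = []
    for t in range(num_steps):
        for k in range(r):
            # Scale the context so successive frames differ; purely illustrative.
            frames.append([c * (t * r + k + 1) for c in context])
    return frames
```

With r greater than one, the number of output frames is the number of decoder steps multiplied by r, which is the relationship the claim language describes.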
Regarding claim 7 Wang teaches the system of claim 6, wherein the encoder neural network comprises: 
an encoder pre-net neural network (pre-net, see table 1) configured to: 
receive a respective embedding of each character in a sequence of characters of the current input text (character embedding, see table 1); 
and for each character, process the respective embedding to generate a respective transformed embedding of the character (apply pre-net to each character embedding, see section 3.2); 
and an encoder CBHG neural network configured to: 
receive the transformed embeddings generated by the encoder pre-net neural network (A CBHG module transforms the prenet outputs into the final encoder representation used by the attention module, see section 3.2); 
and process the transformed embeddings to generate the respective encoded sequence of the current input text (A CBHG module transforms the prenet outputs into the final encoder representation used by the attention module, see section 3.2).
Regarding claim 8 Wang teaches the system of claim 7, wherein the encoder CBHG neural network comprises a bank of 1-D convolutional filters, followed by a highway network, and followed by a bidirectional recurrent neural network (Conv 1D bank, highway net, bidirectional GRU, see table 1).
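For illustration only, the ordering recited in claim 8 (a bank of 1-D convolutional filters, then a highway network, then a bidirectional recurrent network) can be sketched as below. This is a heavily simplified sketch and not Wang's implementation: the averaging kernels, the scalar highway gate, and the plain tanh recurrence used in place of a GRU are all assumptions, and other parts of a full CBHG module are omitted.

```python
import math
from typing import List

def conv1d_same(seq: List[float], kernel: List[float]) -> List[float]:
    """1-D convolution with zero padding so output length equals input length."""
    left = (len(kernel) - 1) // 2
    out = []
    for t in range(len(seq)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t + j - left
            if 0 <= idx < len(seq):
                acc += w * seq[idx]
        out.append(acc)
    return out

def conv_bank(seq: List[float], max_width: int) -> List[List[float]]:
    """Apply averaging filters of widths 1..max_width and stack them per time step."""
    outputs = [conv1d_same(seq, [1.0 / k] * k) for k in range(1, max_width + 1)]
    return [[o[t] for o in outputs] for t in range(len(seq))]

def highway(x: List[float], gate_bias: float = -1.0) -> List[float]:
    """Highway layer y = T * H(x) + (1 - T) * x with toy H = ReLU and sigmoid gate T."""
    out = []
    for v in x:
        gate = 1.0 / (1.0 + math.exp(-(v + gate_bias)))
        out.append(gate * max(0.0, v) + (1.0 - gate) * v)
    return out

def bidirectional_rnn(steps: List[List[float]], decay: float = 0.5) -> List[List[float]]:
    """Simplified bidirectional recurrence (a tanh RNN, not a GRU)."""
    def run(seq: List[List[float]]) -> List[List[float]]:
        state = [0.0] * len(seq[0])
        outs = []
        for x in seq:
            state = [math.tanh(decay * s + xi) for s, xi in zip(state, x)]
            outs.append(state)
        return outs
    forward = run(steps)
    backward = run(steps[::-1])[::-1]
    # Concatenate forward and backward states at each time step.
    return [f + b for f, b in zip(forward, backward)]

def cbhg(seq: List[float], max_width: int = 3) -> List[List[float]]:
    """Conv bank, then highway, then bidirectional recurrence, in that order."""
    banked = conv_bank(seq, max_width)
    gated = [highway(step) for step in banked]
    return bidirectional_rnn(gated)
```

The output keeps one vector per input time step, and each vector is twice the bank width because the forward and backward states are concatenated.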
Regarding claim 9 Wang teaches the system of claim 1, wherein the text-prediction network comprises: a time-aggregating gated recurrent unit (GRU) recurrent neural network (RNN) configured to: receive the context embedding associated with the current input text and an encoded sequence of the current input text; and generate a fixed-length feature vector by processing the context embedding and the encoded sequence; and one or more fully-connected layers configured to predict the style embedding by processing the fixed-length feature vector (we stack a bidirectional GRU RNN on top to extract sequential features from both forward and backward context, see section 3.1; we stack a bidirectional GRU RNN on top to extract sequential features from both forward and backward context, see section 3.3).
Regarding claim 10 Wang teaches the system of claim 9, wherein the one or more fully-connected layers comprise one or more hidden fully-connected layers using ReLU activations and an output layer that uses tanh activation to emit the predicted style embedding (tanh attention decoder, ReLU, see table 1).
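For illustration only, the layer arrangement recited in claim 10 (hidden fully-connected layers with ReLU activations followed by a tanh output layer that emits the style embedding) can be sketched as below. The sketch is not taken from Wang or the application; the layer sizes, the random initialization, and all function names are hypothetical.

```python
import math
import random
from typing import List, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weights, bias)

def init_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    """Create a fully-connected layer with small random weights and zero bias."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def dense(x: List[float], layer: Layer) -> List[float]:
    """Affine transform: weights @ x + bias."""
    weights, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def predict_style_embedding(
    feature_vector: List[float], hidden_layers: List[Layer], output_layer: Layer
) -> List[float]:
    """Hidden layers use ReLU; the output layer uses tanh, bounding values to [-1, 1]."""
    h = feature_vector
    for layer in hidden_layers:
        h = [max(0.0, v) for v in dense(h, layer)]
    return [math.tanh(v) for v in dense(h, output_layer)]
```

A usage sketch under the same assumptions: with `rng = random.Random(0)`, two hidden layers built by `init_layer(8, 16, rng)` and `init_layer(16, 16, rng)`, and an output layer built by `init_layer(16, 4, rng)`, calling `predict_style_embedding([0.1] * 8, hidden, out)` returns a four-element embedding whose entries lie in the tanh range.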
Regarding claim 18 Zhao teaches the method of claim 13, wherein generating the output audio signal comprises: receiving, at an encoder neural network of the text-to-speech model, the current input text from the text source (receive text, see par. [0007]); generating, using the encoder neural network, a respective encoded sequence of the current input text (to retain text ordering information, representations may be concatenated in sequence in a given context window, see par. [0052]).
However Zhao does not teach generating, using a concatenator of the text-to-speech model, a concatenation between the respective encoded sequence of the current input text and the style embedding; receiving, at an attention-based decoder recurrent neural network of the text-to-speech model, a sequence of decoder inputs; and for each decoder input in the sequence of decoder inputs, processing, using the attention-based decoder recurrent neural network, the corresponding decoder input and the concatenation between the respective encoded sequence of the current input text and the style embedding to generate r frames of the output audio signal, wherein r comprises an integer greater than one.
In the same field of endeavor Wang teaches an end-to-end generative text-to-speech model that synthesizes speech directly from characters. Given <text, audio> pairs, the model can be trained completely from scratch with random initialization, see abstract.
Wang further teaches a concatenator configured to: receive the respective encoded sequence of the current input text from the encoder neural network (extract robust sequential representations from text, see section 3.2); receive the style embedding predicted by the textual-prediction network (apply pre-net to each embedding, see section 3.2); and generate a concatenation between the respective encoded sequence of the current input text and the style embedding (concatenate the context vector a and output from encoder, see section 3.3); and an attention-based decoder recurrent neural network (attention decoder, see section 3.3) configured to: receive a sequence of decoder inputs (receive output from attention RNN, see section 3.3); and for each decoder input in the sequence, process the corresponding decoder input and the concatenation between the respective encoded sequence of the current input text and the style embedding to generate r frames of the output audio signal, wherein r comprises an integer greater than one (concatenate the context vector and the attention RNN cell output, see section 3.3).
It would have been obvious to one of ordinary skill in the art to combine the Zhao invention with the teachings of Wang for the benefit of training a model completely from scratch with random initialization, see abstract.
Regarding claim 19 Wang teaches the method of claim 18, wherein generating the respective encoded sequence of the current input text comprises: receiving, at an encoder pre-net neural network of the encoder neural network, a respective embedding of each character in a sequence of characters of the current input text (encoder pre-net, see section 3.2); for each character in the sequence of characters, processing, using the encoder pre-net neural network, the respective embedding to generate a respective transformed embedding of the character, and generating, using an encoder CBHG neural network of the encoder neural network, respective encoded sequence of the current input text by processing the transformed embeddings (A CBHG module transforms the prenet outputs into the final encoder representation used by the attention module, see section 3.2).
Regarding claim 20 Wang teaches the method of claim 19, wherein the encoder CBHG neural network comprises a bank of 1-D convolutional filters, followed by a highway network, and followed by a bidirectional recurrent neural network (Conv 1D bank, highway net, bidirectional GRU, see table 1).
Regarding claim 21 Wang teaches the method of claim 13, wherein predicting the style embedding for the current input text comprises: generating, using a time-aggregating gated recurrent unit (GRU) recurrent neural network (RNN) of the text-prediction model, a fixed-length feature vector by processing the context embedding associated with the current input text and an encoded sequence of the current input text (we stack a bidirectional GRU RNN on top to extract sequential features from both forward and backward context, see section 3.1); and predicting, using one or more fully-connected layers of the text-prediction model that follow the GRU-RNN, the style embedding by processing the fixed-length feature vector (we stack a bidirectional GRU RNN on top to extract sequential features from both forward and backward context, see section 3.3).
Regarding claim 22 Wang teaches the method of claim 21, wherein the one or more fully-connected layers comprise one or more hidden fully-connected layers using ReLU activations and an output layer that uses tanh activation to emit the predicted style embedding (tanh attention decoder, ReLU, see table 1).
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Pertinent prior art available on form 892.
Pollet ‘677 teaches a speech synthesis device which uses a recurrent neural network to determine embedded data to select speech units based on the data to generate output speech, see abstract.
Jeon ‘259 teaches a speech to text synthesizer using concatenation-sensitive neural networks, see abstract.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Michael Ortiz-Sanchez whose telephone number is (571)270-3711. The examiner can normally be reached Monday-Friday 9AM-6PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached on 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/MICHAEL ORTIZ-SANCHEZ/Primary Examiner, Art Unit 2656