DETAILED ACTION
Introduction
1.	This office action is in response to Applicant’s submission filed July 14, 2022.  Claims 1-24 are pending in the application.  As such, Claims 1-24 have been examined.

Notice of Pre-AIA or AIA Status
2.	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
3. 	Applicant’s arguments and amendments in the Amendment filed July 14, 2022 (herein “Amendment”) with respect to rejections of Claims 1-24 under 35 U.S.C. 103 have been fully considered, but are moot in view of the new ground of rejection based on U.S. Patent App. Pub. No. 20210312905 (Zhao et al., hereinafter “Zhao”).
	With regard to Claims 1 and 13, the Amendment appears to argue that the claimed invention selects/uses only the initial wordpiece in some cases, but the claims do not reflect this.  Since the cited references describe using entire words (as noted on page 12 of the Amendment), the cited references describe using an initial wordpiece (along with all the rest of the wordpieces in each word).  Accordingly, the pending claims, under their broadest reasonable interpretation, are rendered obvious by the cited references.
	With regard to Claims 10 and 22, paragraph 54 of Kuo describes employing a training procedure that optimizes two separate loss terms.  The first loss term corresponds to a composite cross-entropy intent classification loss derived by using the text embeddings and the acoustic embeddings (cited as “predicted frames” and “reference frames”).  As Claims 10 and 22 still recite a loss between predicted and reference frames, Claims 10 and 22, under their broadest reasonable interpretation, are rendered obvious by the cited references.

Claim Rejections - 35 USC § 103
4.	In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

6.	Claims 1-9, 12-21, and 24 are rejected under 35 U.S.C. 103 as being unpatentable over U.S. Patent App. Pub. No. 20190348020 (Clark et al. hereinafter “Clark”) in view of “Pre-trained Text Embeddings for Enhanced Text-to-Speech Synthesis” (Hayashi et al. hereinafter “Hayashi”) (Cited in IDS) and U.S. Patent App. Pub. No. 20210312905 (Zhao et al., hereinafter “Zhao”).
	With regard to Claim 1, Clark describes:
“A method comprising:
receiving, at data processing hardware, a text utterance having one or more words, each word having at least one syllable, each syllable having at least one phoneme; (Claim 1, lines 1-4)
selecting, by the data processing hardware, an utterance embedding for the text utterance, the utterance embedding representing an intended prosody; (Claim 1, lines 5-7)
for each syllable, using the selected utterance embedding and a prosody model (Claim 1, line 8 and paragraph 33.  Paragraph 33 describes that the prosody model 300 may select an utterance embedding 260 for the text utterance 320.) [[that incorporates the BERT model]]:
when the word that includes the syllable includes a plurality of wordpiece units from the sequence of wordpiece units, selecting the wordpiece embedding from the sequence of wordpiece embeddings that corresponds to the initial wordpiece unit of the plurality of wordpiece units from the sequence of wordpiece units to represent the word that includes the syllable; (Paragraph 41 describes that the variational autoencoder 300 is configured to produce a plurality of fixed-length prosodic syllable embeddings 245 for an utterance that includes words with the syllables.  The entire word is used, which includes the initial wordpiece.)
generating, by the data processing hardware, a corresponding prosodic syllable embedding for the syllable [[based on the wordpiece embedding]] corresponding to the initial wordpiece unit of the plurality of wordpiece units from the sequence of wordpiece units that includes the syllable that is selected to represent the word that includes the syllable; (Paragraph 41 describes that the variational autoencoder 300 is configured to produce a plurality of fixed-length prosodic syllable embeddings 245 for an utterance that includes words with the syllables.  The entire word is used, which includes the initial wordpiece.) and
predicting, by the data processing hardware, a duration of the syllable by encoding linguistic features of each phoneme of the syllable with the corresponding prosodic syllable embedding for the syllable; (Claim 1, lines 9-12) and
generating, by the data processing hardware, using the prosody model, a prosodic representation for the text utterance based on the predicted durations of the syllables. (Paragraph 47 describes that prosody model 300 generates a prosodic representation for the text utterance based on the predicted durations of the syllables)
Clark does not explicitly describe:
“generating, by the data processing hardware, using a Bidirectional Encoder Representations from Transformers (BERT) model, a sequence of wordpiece embeddings, each wordpiece embedding corresponding to one of the wordpiece units that are associated with one of the one or more words of the text utterance;
[[using the selected utterance embedding and a prosody model]] that incorporates the BERT model; and
[[generating]] ... based on the wordpiece embedding;” and
“splitting, by the data processing hardware, using a tokenizer, the text utterance into a sequence of wordpiece units.”
However, page 4431, Section 2.2 of Hayashi describes using a BERT model to generate a deep representation based on input wordpieces.  Section 2.1 (crossing pages 4430 and 4431) describes the output of the BERT model as “contextual encodings.”  It would have been obvious to input this representation generated by the BERT model (cited as “wordpiece embeddings”) into the autoencoder of Clark.  Further, it would have been obvious to generate a corresponding prosodic syllable embedding for the syllable based on the representation generated by the BERT model, as this representation is a “contextual encoding” as described by Hayashi.

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the BERT-derived encodings of Hayashi in the system of Clark to allow for training of the autoencoder without needing a variety of speech samples, as described in the last paragraph of Section 1, on page 4430 of Hayashi.

	Clark in view of Hayashi does not explicitly describe “splitting, by the data processing hardware, using a tokenizer, the text utterance into a sequence of wordpiece units.”
However, paragraph 40 of Zhao describes a device that uses a tokenizer to divide input words into wordpieces.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the wordpieces of Zhao into the system of Clark in view of Hayashi to reduce training error rates, as described in paragraph 59 of Zhao.
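For illustration only, the following minimal sketch shows how a tokenizer may split words into wordpiece units and how the initial wordpiece unit of a multi-piece word could be identified and used to pick one embedding per word.  The greedy longest-match-first strategy and the "##" continuation prefix follow the commonly published WordPiece convention.  The toy vocabulary, the placeholder embeddings, and the names wordpiece_split, tokenize, and select_initial_embeddings are hypothetical and are not drawn from Clark, Hayashi, or Zhao.

```python
from typing import List, Set, Tuple

# Hypothetical toy vocabulary; "##" marks a piece that continues a word.
TOY_VOCAB: Set[str] = {
    "the", "un", "##break", "##able", "play", "##ing", "##ed", "[UNK]",
}


def wordpiece_split(word: str, vocab: Set[str]) -> List[str]:
    """Greedy longest-match-first split of one word into wordpiece units."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece matches: treat the whole word as unknown
        pieces.append(match)
        start = end
    return pieces


def tokenize(text: str, vocab: Set[str]) -> Tuple[List[str], List[int]]:
    """Return the wordpiece sequence and the index of each word's initial piece."""
    pieces: List[str] = []
    initial_indices: List[int] = []
    for word in text.lower().split():
        initial_indices.append(len(pieces))
        pieces.extend(wordpiece_split(word, vocab))
    return pieces, initial_indices


def select_initial_embeddings(
    embeddings: List[List[float]], initial_indices: List[int]
) -> List[List[float]]:
    """Keep one embedding per word: the one at the word's initial wordpiece unit."""
    return [embeddings[i] for i in initial_indices]


pieces, initial = tokenize("the unbreakable playing", TOY_VOCAB)
# Placeholder one-dimensional "embeddings" (not produced by any BERT model).
embeddings = [[float(i)] for i in range(len(pieces))]
per_word = select_initial_embeddings(embeddings, initial)
```

In this sketch a word that maps to a single wordpiece unit is represented by that unit, while a multi-piece word is represented only by its first unit; using the entire word would instead retain every entry of the wordpiece sequence.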
With regard to Claim 2, Clark describes “for each syllable, using the selected utterance embedding and the prosody model: (Claim 1, line 8 and paragraph 33.  Paragraph 33 describes that the prosody model 300 may select an utterance embedding 260 for the text utterance 320.)
predicting, by the data processing hardware, a pitch contour of the syllable based on the predicted duration for the syllable; (Claim 1, lines 13-15) and
generating, by the data processing hardware, a plurality of fixed-length predicted pitch frames based on the predicted duration for the syllable, each fixed-length predicted pitch frame representing part of the predicted pitch contour of the syllable, wherein generating the prosodic representation for the utterance is based on the plurality of fixed-length predicted pitch frames generated for each syllable.” (Claim 1, lines 16-20 and paragraph 44.  Paragraph 44 describes that the autoencoder 300 predicts a prosodic representation for a given text utterance 320 during inference by jointly predicting durations of phonemes 230 and pitch and/or energy contours (cited as “the plurality of fixed-length predicted pitch frames”) for each syllable 240 of the given text utterance 320.  Paragraph 47 describes that each syllable in the syllable level 240 may be associated with a corresponding LSTM processing cell that outputs a corresponding syllable embedding to the faster-clocking phoneme level 230 for decoding the individual fixed-length predicted pitch (F0) frames and for decoding the individual fixed-length predicted energy (C0) frames in parallel.)
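As a rough illustration of how a predicted duration can determine a number of fixed-length frames, the sketch below divides a syllable duration into equal-length frames.  The 5 ms frame length is borrowed from the reference frames of paragraph 35 of Clark as cited in this action; applying that length to predicted frames, rounding up a partial frame, and the function names are assumptions made only for this example.

```python
import math
from typing import List, Tuple

FRAME_MS = 5.0  # assumed frame length; cited in this action for reference frames


def num_fixed_length_frames(duration_ms: float, frame_ms: float = FRAME_MS) -> int:
    """Number of fixed-length frames needed to cover a predicted duration."""
    if duration_ms < 0 or frame_ms <= 0:
        raise ValueError("duration must be non-negative and frame length positive")
    # Assumption: a trailing partial frame is rounded up to a full frame.
    return math.ceil(duration_ms / frame_ms)


def frame_spans(duration_ms: float, frame_ms: float = FRAME_MS) -> List[Tuple[float, float]]:
    """Start and end time (in ms) of each fixed-length frame for one syllable."""
    count = num_fixed_length_frames(duration_ms, frame_ms)
    return [(i * frame_ms, (i + 1) * frame_ms) for i in range(count)]
```

Under these assumptions, each span would carry one value of the predicted pitch contour, so a longer predicted duration yields proportionally more pitch frames.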
	With regard to Claim 3, Clark describes:
“for each syllable, using the selected utterance embedding and the prosody model: (Claim 6)
predicting, by the data processing hardware, an energy contour of each phoneme in the syllable based on a predicted duration for the phoneme; (Claim 6) and
for each phoneme associated with the syllable, generating, by the data processing hardware, a plurality of fixed-length predicted energy frames based on the predicted duration for the corresponding phoneme, each fixed-length energy frame representing the predicted energy contour of the corresponding phoneme, (Claim 6)
wherein generating the prosodic representation for the utterance is further based on the plurality of fixed-length predicted energy frames generated for each phoneme associated with each syllable.” (Paragraph 44 describes that the autoencoder 300 predicts a prosodic representation for a given text utterance 320 during inference by jointly predicting durations of phonemes 230 and pitch and/or energy contours (cited as “the plurality of fixed-length predicted energy frames”) for each syllable 240 of the given text utterance 320.  Paragraph 47 describes that each syllable in the syllable level 240 may be associated with a corresponding LSTM processing cell that outputs a corresponding syllable embedding to the faster-clocking phoneme level 230 for decoding the individual fixed-length predicted pitch (F0) frames and for decoding the individual fixed-length predicted energy (C0) frames in parallel.)
	With regard to Claim 4, Clark describes “a hierarchical linguistic structure represents the text utterance, (Claim 7) the hierarchical linguistic structure comprising:
a first level including each syllable of the text utterance; (Claim 7)
a second level including each phoneme of the text utterance; (Claim 7)
a third level including each fixed-length predicted pitch frame for each syllable of the text utterance; (Claim 7) and
a fourth level parallel to the third level and including each fixed-length predicted energy frame for each phoneme of the text utterance.” (Claim 7)
	With regard to Claim 5, Clark describes:
the first level of the hierarchical linguistic structure comprises a long short-term memory (LSTM) processing cell representing each syllable of the text utterance; (Claim 8)
the second level of the hierarchical linguistic structure comprises a LSTM processing cell representing each phoneme of the text utterance, the LSTM processing cells of the second level clocking relative to and faster than the LSTM processing cells of the first level; (Claim 8)
the third level of the hierarchical linguistic structure comprises a LSTM processing cell representing each fixed-length predicted pitch frame, the LSTM processing cells of the third level clocking relative to and faster than the LSTM processing cells of the second level; (Claim 8) and
the fourth level of the hierarchical linguistic structure comprises a LSTM processing cell representing each fixed-length predicted energy frame, the LSTM processing cells of the fourth level clocking at the same speed as the LSTM processing cells of the third level and clocking relative to and faster than the LSTM processing cells of the second level. (Claim 8)
	With regard to Claim 6, Clark describes “the lengths of the fixed-length predicted energy frames and the fixed-length predicted pitch frames are the same.”  (Claim 10)
	With regard to Claim 7, Clark describes “a total number of fixed-length predicted energy frames generated for each phoneme of the received text utterance is equal to a total number of the fixed-length predicted pitch frames generated for each syllable of the received text utterance.” (Claim 11)
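The four-level hierarchical linguistic structure and the frame-count relationship recited in Claims 4 and 7 can be pictured with the following minimal sketch.  The class names, field names, and numeric values are hypothetical and serve only to show how pitch frames attach to syllables while energy frames attach to phonemes; the sketch does not model the LSTM processing cells or their clocking.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Phoneme:
    symbol: str
    # Fourth level: fixed-length predicted energy (C0) frames for this phoneme.
    energy_frames: List[float] = field(default_factory=list)


@dataclass
class Syllable:
    # Second level: the phonemes contained in this syllable.
    phonemes: List[Phoneme]
    # Third level: fixed-length predicted pitch (F0) frames for this syllable.
    pitch_frames: List[float] = field(default_factory=list)


@dataclass
class Utterance:
    # First level: every syllable of the text utterance.
    syllables: List[Syllable]

    def total_pitch_frames(self) -> int:
        return sum(len(s.pitch_frames) for s in self.syllables)

    def total_energy_frames(self) -> int:
        return sum(len(p.energy_frames) for s in self.syllables for p in s.phonemes)

    def frame_counts_match(self) -> bool:
        """True when total energy frames equal total pitch frames."""
        return self.total_pitch_frames() == self.total_energy_frames()


# Hypothetical one-syllable utterance: two phonemes with 2 + 3 energy frames
# and 5 pitch frames for the syllable (values are placeholders).
example = Utterance(
    syllables=[
        Syllable(
            phonemes=[
                Phoneme("k", energy_frames=[0.1, 0.2]),
                Phoneme("ae", energy_frames=[0.3, 0.4, 0.5]),
            ],
            pitch_frames=[110.0, 112.0, 115.0, 113.0, 111.0],
        )
    ]
)
```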
	With regard to Claim 8, Clark describes:
receiving, at the data processing hardware, training data including a plurality of reference audio signals and corresponding transcripts, each reference audio signal comprising a spoken utterance of human speech and having a corresponding prosody, each transcript comprising a textual representation of the corresponding reference audio signal; (Paragraph 32 describes that deep neural network 200 may store each fixed-length utterance embedding 260 in an utterance embedding storage 180 (e.g., on the memory hardware 124 of the computing system 120) along with a corresponding transcript 261 of the reference audio signal 222 associated with the utterance embedding 260.) and
training, by the data processing hardware, using a deep neural network [[that incorporates the BERT model]], the prosody model by encoding each reference audio signal into a corresponding fixed-length utterance embedding representing the corresponding prosody of the reference audio signal.  (Paragraph 32 describes that deep neural network 200 is configured to encode/compress the prosodic representation associated with each reference audio signal 222 into a corresponding fixed-length utterance embedding 260.)
Clark does not explicitly describe:
“obtaining, by the data processing hardware, the BERT model, the BERT model pre-trained on a text-only language modeling task; and
[[using a deep neural network]] that incorporates the BERT model.”
However, page 4431, Section 2.2 of Hayashi describes using a deep neural network that is a BERT model, and that is pre-trained and where the input is text.  

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the pre-trained BERT model of Hayashi as the deep neural network of Clark to allow for training of the autoencoder without needing a variety of speech samples, as described in the last paragraph of Section 1, on page 4430 of Hayashi.

	With regard to Claim 9, Clark describes:
“encoding each reference audio signal into a corresponding fixed-length utterance embedding comprises: (Paragraph 32 describes that deep neural network 200 is configured to encode/compress the prosodic representation associated with each reference audio signal 222 into a corresponding fixed-length utterance embedding 260.)
generating, [[using the BERT model,]] a sequence of wordpiece embeddings from the transcript of the corresponding reference audio signal; (Paragraph 39 describes that linguistic features may be extracted from transcripts 261 and stored for use in conditioning the training of the hierarchical linguistic structure 200.  The linguistic features may include, without limitation, individual sounds for each phoneme, whether each syllable is stressed or un-stressed, the type of each word (e.g., noun/adjective/verb) and/or the position of the word in the utterance, and whether the utterance is a question or phrase.  The element “wordpiece embeddings” is interpreted to be a possible “linguistic feature.”)
sampling, from the corresponding reference audio signal, a sequence of fixed-length reference frames providing a duration, pitch contour, and/or energy contour that represents the corresponding prosody of the reference audio signal; (Paragraph 34 describes that the autoencoder 300 includes an encoder portion 302 (FIG. 2A) that encodes a plurality of fixed-length reference frames 220 sampled from a reference audio signal 222 into a fixed-length utterance embedding 260.  Paragraph 35 describes that the reference frames 220 may each include a duration of 5 milliseconds (ms) and represent one of a contour of pitch (F0) or a contour of energy (C0) for the reference audio signal 222.) and
for each syllable in the reference audio signal:
encoding phone-level linguistic features associated with each phoneme in the syllable into a phone feature-based syllable embedding; (Paragraph 12 describes that predicting the pitch contour of the syllable based on the predicted duration for the syllable may include combining the corresponding prosodic syllable embedding for the syllable with each encoding of the corresponding prosodic syllable embedding and the phone-level linguistic features of each corresponding phoneme associated with the syllable.)
encoding the fixed-length reference frames associated with the syllable into a frame-based syllable embedding, the frame-based syllable embedding indicative of a duration, pitch, and/or energy associated with the corresponding syllable; (Paragraphs 38 and 47 describe that each syllable 240Aa-240Cb in the level of syllables 240 may correspond to a respective syllable embedding (e.g., a numerical vector) that indicates a duration, pitch (F0), and/or energy (C0) associated with the corresponding syllable 240.) and
encoding, into a corresponding prosodic syllable embedding for the syllable, the phoneme feature-based and frame-based syllable embeddings with syllable-level linguistic features associated with the syllable (Paragraph 46 describes that at the syllable level 240 of LSTM processing cells, the autoencoder 300 is configured to produce/output a corresponding syllable embedding 245Aa, 245Ab, 245Ba, 245Ca, 245Cb for each syllable 240 from the following inputs: the fixed-length utterance embedding 260; utterance-level linguistic features 262 associated with the text utterance 320; word-level linguistic features 252 associated with the word 250 that contains the syllable 240; and syllable-level linguistic features 242 for the syllable 240.), sentence-level linguistic features associated with the reference audio signal (utterance-level linguistic features 262 are cited as “sentence-level linguistic features”), and a wordpiece embedding from the sequence of wordpiece embeddings [[generated by the BERT model]] that is associated with a word that includes the corresponding syllable (word-level linguistic features 252 are cited as “a wordpiece embedding … associated with a word that includes the corresponding syllable”).
Clark does not describe “using the BERT model; and
[[a wordpiece embedding from the sequence of wordpiece embeddings]] generated by the BERT model.”
 However, page 4431, Section 2.2 of Hayashi describes using a BERT model to generate a deep representation based on input wordpieces, and thus using those wordpiece embeddings in further processing.  Section 2.1 (crossing pages 4430 and 4431) describes the output of the BERT model as “contextual encodings.” This representation generated by the BERT model (cited as “wordpiece embeddings”) could then be input into the autoencoder of Clark. 

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the pre-trained BERT model of Hayashi as the deep neural network of Clark to allow for training of the autoencoder without needing a variety of speech samples, as described in the last paragraph of Section 1, on page 4430 of Hayashi.
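The sampling of fixed-length reference frames discussed with regard to Claim 9 can be sketched as follows, for illustration only.  The 5 ms frame length comes from paragraph 35 of Clark as cited above.  The 16 kHz sample rate, the use of mean squared amplitude as a stand-in for an energy (C0) value, the dropping of a trailing partial frame, and the function name are assumptions of this example rather than details of any cited reference.

```python
from typing import List, Sequence


def reference_frame_energies(
    samples: Sequence[float],
    sample_rate_hz: int = 16000,  # assumed sample rate
    frame_ms: float = 5.0,        # frame length cited for reference frames
) -> List[float]:
    """Split audio into fixed-length frames and return one energy value per frame."""
    frame_len = int(sample_rate_hz * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame length must cover at least one sample")
    energies: List[float] = []
    # Assumption: a trailing partial frame is dropped rather than padded.
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


# 160 samples of amplitude 1.0 followed by 40 samples of silence:
# at 16 kHz, each 5 ms frame holds 80 samples.
signal = [1.0] * 160 + [0.0] * 40
frames = reference_frame_energies(signal)
```

Because every frame has the same length, the number of frames sampled from a reference audio signal varies with its duration, which is why a separate encoding step is needed to reach a fixed-length utterance embedding.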

	With regard to Claim 12, Clark describes “the utterance embedding comprises a fixed-length numerical vector.” (Claim 12)
	With regard to Claim 13, Clark describes:
“A system comprising:
data processing hardware; (Paragraph 31, data processing hardware 122) and
memory hardware (Paragraph 31, memory hardware 124) in communication with the data processing hardware, the memory hardware storing instructions that when executed on the data processing hardware cause the data processing hardware to perform operations comprising:
receiving a text utterance having one or more words, each word having at least one syllable, each syllable having at least one phoneme; (Claim 1, lines 1-4)
selecting an utterance embedding for the text utterance, the utterance embedding representing an intended prosody; (Claim 1, lines 5-7)
for each syllable, using the selected utterance embedding and a prosody model (Claim 1, line 8 and paragraph 33.  Paragraph 33 describes that the prosody model 300 may select an utterance embedding 260 for the text utterance 320.) [[that incorporates the BERT model]]:
when the word that includes the syllable includes a plurality of wordpiece units from the sequence of wordpiece units, selecting the wordpiece embedding from the sequence of wordpiece embeddings that corresponds to the initial wordpiece unit of the plurality of wordpiece units from the sequence of wordpiece units to represent the word that includes the syllable; (Paragraph 41 describes that the variational autoencoder 300 is configured to produce a plurality of fixed-length prosodic syllable embeddings 245 for an utterance that includes words with the syllables.  The entire word is used, which includes the initial wordpiece.)
generating a corresponding prosodic syllable embedding for the syllable [[based on the wordpiece embedding]] corresponding to the initial wordpiece unit of the plurality of wordpiece units from the sequence of wordpiece units that includes the syllable that is selected to represent the word that includes the syllable; (Paragraph 41 describes that the variational autoencoder 300 is configured to produce a plurality of fixed-length prosodic syllable embeddings 245 for an utterance that includes words with the syllables.  The entire word is used, which includes the initial wordpiece.) and
predicting a duration of the syllable by encoding linguistic features of each phoneme of the syllable with the corresponding prosodic syllable embedding for the syllable; (Claim 1, lines 9-12) and
generating, using the prosody model, a prosodic representation for the text utterance based on the predicted durations of the syllables. (Paragraph 47 describes that prosody model 300 generates a prosodic representation for the text utterance based on the predicted durations of the syllables)
Clark does not explicitly describe:
“generating, using a Bidirectional Encoder Representations from Transformers (BERT) model, a sequence of wordpiece embeddings, each wordpiece embedding corresponding to one of the wordpiece units that are associated with one of the one or more words of the text utterance;
[[a prosody model]] that incorporates the BERT model; and
[[generating]] ... based on the wordpiece embedding;” and
“splitting, by the data processing hardware, using a tokenizer, the text utterance into a sequence of wordpiece units.”
However, page 4431, Section 2.2 of Hayashi describes using a BERT model to generate a deep representation based on input wordpieces.  Section 2.1 (crossing pages 4430 and 4431) describes the output of the BERT model as “contextual encodings.”  It would have been obvious to input this representation generated by the BERT model (cited as “wordpiece embeddings”) into the autoencoder of Clark.  Further, it would have been obvious to generate a corresponding prosodic syllable embedding for the syllable based on the representation generated by the BERT model, as this representation is a “contextual encoding” as described by Hayashi.

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the BERT-derived encodings of Hayashi in the system of Clark to allow for training of the autoencoder without needing a variety of speech samples, as described in the last paragraph of Section 1, on page 4430 of Hayashi.

	Clark in view of Hayashi does not explicitly describe “splitting, by the data processing hardware, using a tokenizer, the text utterance into a sequence of wordpiece units.”
However, paragraph 40 of Zhao describes a device that uses a tokenizer to divide input words into wordpieces.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the wordpieces of Zhao into the system of Clark in view of Hayashi to reduce training error rates, as described in paragraph 59 of Zhao.
With regard to Claim 14, Clark describes:
the operations further comprise, for each syllable, using the selected utterance embedding and the prosody model: (Claim 1, line 8 and paragraph 33.  Paragraph 33 describes that the prosody model 300 may select an utterance embedding 260 for the text utterance 320.)
predicting a pitch contour of the syllable based on the predicted duration for the syllable; (Claim 1, lines 13-15) and
generating a plurality of fixed-length predicted pitch frames based on the predicted duration for the syllable, each fixed-length predicted pitch frame representing part of the predicted pitch contour of the syllable, wherein generating the prosodic representation for the utterance is based on the plurality of fixed-length predicted pitch frames generated for each syllable. (Claim 1, lines 16-20 and paragraph 44.  Paragraph 44 describes that the autoencoder 300 predicts a prosodic representation for a given text utterance 320 during inference by jointly predicting durations of phonemes 230 and pitch and/or energy contours (cited as “the plurality of fixed-length predicted pitch frames”) for each syllable 240 of the given text utterance 320.  Paragraph 47 describes that each syllable in the syllable level 240 may be associated with a corresponding LSTM processing cell that outputs a corresponding syllable embedding to the faster-clocking phoneme level 230 for decoding the individual fixed-length predicted pitch (F0) frames and for decoding the individual fixed-length predicted energy (C0) frames in parallel.)
	With regard to Claim 15, Clark describes:
“the operations further comprise, for each syllable, using the selected utterance embedding and the prosody model: (Claim 6)
predicting an energy contour of each phoneme in the syllable based on a predicted duration for the phoneme; (Claim 6) and
for each phoneme associated with the syllable, generating a plurality of fixed-length predicted energy frames based on the predicted duration for the corresponding phoneme, each fixed-length energy frame representing the predicted energy contour of the corresponding phoneme, (Claim 6)
wherein generating the prosodic representation for the utterance is further based on the plurality of fixed-length predicted energy frames generated for each phoneme associated with each syllable.” (Paragraph 44 describes that the autoencoder 300 predicts a prosodic representation for a given text utterance 320 during inference by jointly predicting durations of phonemes 230 and pitch and/or energy contours (cited as “the plurality of fixed-length predicted energy frames”) for each syllable 240 of the given text utterance 320.  Paragraph 47 describes that each syllable in the syllable level 240 may be associated with a corresponding LSTM processing cell that outputs a corresponding syllable embedding to the faster-clocking phoneme level 230 for decoding the individual fixed-length predicted pitch (F0) frames and for decoding the individual fixed-length predicted energy (C0) frames in parallel.)
	With regard to Claim 16, Clark describes:
“a hierarchical linguistic structure represents the text utterance, (Claim 7)  the hierarchical linguistic structure comprising:
a first level including each syllable of the text utterance; (Claim 7)
a second level including each phoneme of the text utterance; (Claim 7)
a third level including each fixed-length predicted pitch frame for each syllable of the text utterance; (Claim 7) and
a fourth level parallel to the third level and including each fixed-length predicted energy frame for each phoneme of the text utterance.” (Claim 7)
	With regard to Claim 17, Clark describes:
“the first level of the hierarchical linguistic structure comprises a long short-term memory (LSTM) processing cell representing each syllable of the text utterance; (Claim 8)
the second level of the hierarchical linguistic structure comprises a LSTM processing cell representing each phoneme of the text utterance, the LSTM processing cells of the second level clocking relative to and faster than the LSTM processing cells of the first level; (Claim 8)
the third level of the hierarchical linguistic structure comprises a LSTM processing cell representing each fixed-length predicted pitch frame, the LSTM processing cells of the third level clocking relative to and faster than the LSTM processing cells of the second level; (Claim 8) and
the fourth level of the hierarchical linguistic structure comprises a LSTM processing cell representing each fixed-length predicted energy frame, the LSTM processing cells of the fourth level clocking at the same speed as the LSTM processing cells of the third level and clocking relative to and faster than the LSTM processing cells of the second level. (Claim 8)
	With regard to Claim 18, Clark describes “the lengths of the fixed-length predicted energy frames and the fixed-length predicted pitch frames are the same.” (Claim 10)
	With regard to Claim 19, Clark describes “a total number of fixed-length predicted energy frames generated for each phoneme of the received text utterance is equal to a total number of the fixed-length predicted pitch frames generated for each syllable of the received text utterance.” (Claim 11)
	With regard to Claim 20, Clark describes:
receiving training data including a plurality of reference audio signals and corresponding transcripts, each reference audio signal comprising a spoken utterance of human speech and having a corresponding prosody, each transcript comprising a textual representation of the corresponding reference audio signal; (Paragraph 32 describes that deep neural network 200 may store each fixed-length utterance embedding 260 in an utterance embedding storage 180 (e.g., on the memory hardware 124 of the computing system 120) along with a corresponding transcript 261 of the reference audio signal 222 associated with the utterance embedding 260.) and
training, using a deep neural network [[that incorporates the BERT model]], the prosody model by encoding each reference audio signal into a corresponding fixed-length utterance embedding representing the corresponding prosody of the reference audio signal. (Paragraph 32 describes that deep neural network 200 is configured to encode/compress the prosodic representation associated with each reference audio signal 222 into a corresponding fixed-length utterance embedding 260.)
Clark does not explicitly describe:
“obtaining the BERT model, the BERT model pre-trained on a text-only language modeling task; and
[[a deep neural network]] that incorporates the BERT model”
However, page 4431, Section 2.2 of Hayashi describes using a deep neural network that is a pre-trained BERT model and whose input is text.

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the pre-trained BERT model of Hayashi as the deep neural network of Clark to allow for training of the autoencoder without needing a variety of speech samples, as described in the last paragraph of Section 1, on page 4430 of Hayashi.

	With regard to Claim 21, Clark describes:
“encoding each reference audio signal into a corresponding fixed-length utterance embedding comprises: (Paragraph 32 describes that deep neural network 200 is configured to encode/compress the prosodic representation associated with each reference audio signal 222 into a corresponding fixed-length utterance embedding 260.)
generating, [[using the BERT model,]] a sequence of wordpiece embeddings from the transcript of the corresponding reference audio signal; (Paragraph 39 describes that linguistic features may be extracted from transcripts 261 and stored for use in conditioning the training of the hierarchical linguistic structure 200.  The linguistic features may include, without limitation, individual sounds for each phoneme, whether each syllable is stressed or un-stressed, the type of each word (e.g., noun/adjective/verb) and/or the position of the word in the utterance, and whether the utterance is a question or phrase.  The element “wordpiece embeddings” is interpreted to be a possible “linguistic feature.”)
sampling, from the corresponding reference audio signal, a sequence of fixed-length reference frames providing a duration, pitch contour, and/or energy contour that represents the corresponding prosody of the reference audio signal; (Paragraph 34 describes that the autoencoder 300 includes an encoder portion 302 (FIG. 2A) that encodes a plurality of fixed-length reference frames 220 sampled from a reference audio signal 222 into a fixed-length utterance embedding 260.  Paragraph 35 describes that the reference frames 220 may each include a duration of 5 milliseconds (ms) and represent one of a contour of pitch (F0) or a contour of energy (C0) for the reference audio signal 222.) and
for each syllable in the reference audio signal:
encoding phone-level linguistic features associated with each phoneme in the syllable into a phone feature-based syllable embedding; (Paragraph 12 describes that predicting the pitch contour of the syllable based on the predicted duration for the syllable may include combining the corresponding prosodic syllable embedding for the syllable with each encoding of the corresponding prosodic syllable embedding and the phone-level linguistic features of each corresponding phoneme associated with the syllable.)
encoding the fixed-length reference frames associated with the syllable into a frame-based syllable embedding, the frame-based syllable embedding indicative of a duration, pitch, and/or energy associated with the corresponding syllable; (Paragraph 38 describes that each syllable 240Aa-240Cb in the level of syllables 240 may correspond to a respective syllable embedding (e.g., a numerical vector) that indicates a duration, pitch (F0), and/or energy (C0) associated with the corresponding syllable 240.) and
encoding, into a corresponding prosodic syllable embedding for the syllable, the phoneme feature-based and frame-based syllable embeddings with syllable-level linguistic features associated with the syllable (Paragraph 46 describes that at the syllable level 240 of LSTM processing cells, the autoencoder 300 is configured to produce/output a corresponding syllable embedding 245Aa, 245Ab, 245Ba, 245Ca, 245Cb for each syllable 240 from the following inputs: the fixed-length utterance embedding 260; utterance-level linguistic features 262 associated with the text utterance 320; word-level linguistic features 252 associated with the word 250 that contains the syllable 240; and syllable-level linguistic features 242 for the syllable 240.), sentence-level linguistic features associated with the reference audio signal (utterance-level linguistic features 262 are cited as “sentence-level linguistic features”), and a wordpiece embedding from the sequence of wordpiece embeddings [[generated by the BERT model]] that is associated with a word that includes the corresponding syllable. (word-level linguistic features 252 are cited as “a wordpiece embedding … associated with a word that includes the corresponding syllable”).
Clark does not describe “using the BERT model; and
[[a wordpiece embedding from the sequence of wordpiece embeddings]] generated by the BERT model”
However, page 4431, Section 2.2 of Hayashi describes using a BERT model to generate a deep representation based on input wordpieces, and thus using those wordpiece embeddings in further processing.  Section 2.1 (crossing pages 4430 and 4431) describes the output of the BERT model as “contextual encodings.” This representation generated by the BERT model (cited as “wordpiece embeddings”) could then be input into the autoencoder of Clark. 


It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the pre-trained BERT model of Hayashi as the deep neural network of Clark to allow for training of the autoencoder without needing a variety of speech samples, as described in the last paragraph of Section 1, on page 4430 of Hayashi.
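For illustration only, the sampling of fixed-length reference frames discussed above with respect to Paragraphs 34 and 35 of Clark may be sketched as follows.  The sketch is not taken from any cited reference.  The function name, the one-value-per-millisecond contour, the per-frame averaging, and the dropping of a trailing partial frame are all assumptions made for this example; only the 5 ms frame duration comes from Paragraph 35 of Clark.

```python
from typing import List

FRAME_MS = 5  # frame duration cited from Paragraph 35 of Clark


def sample_fixed_length_frames(contour: List[float], samples_per_ms: int = 1) -> List[float]:
    """Split a contour into 5 ms frames and return one mean value per frame.

    Assumption (hypothetical helper): a trailing partial frame is dropped.
    """
    frame_len = FRAME_MS * samples_per_ms
    n_frames = len(contour) // frame_len
    frames = []
    for i in range(n_frames):
        chunk = contour[i * frame_len:(i + 1) * frame_len]
        frames.append(sum(chunk) / frame_len)
    return frames


# Hypothetical pitch (F0) values, one per millisecond: 13 ms of audio.
pitch_contour = [100.0] * 5 + [110.0] * 5 + [120.0] * 3
pitch_frames = sample_fixed_length_frames(pitch_contour)
```

Under these assumptions, every frame covers the same duration, so a pitch contour and an energy contour sampled from the same signal would yield the same number of frames.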

	With regard to Claim 24, Clark describes “the utterance embedding comprises a fixed-length numerical vector.” (Claim 12)

5.	Claims 10, 11, 22, and 23 are rejected under 35 U.S.C. 103 as being unpatentable over Clark in view of Hayashi and Zhao and further in view of U.S. Patent App. Pub. No. 20210312906 (Kuo et al., hereinafter “Kuo”).
	With regard to Claim 10, Clark describes:
“training the prosody model further comprises, for each reference audio signal:
sampling, from the corresponding reference audio signal, a sequence of fixed-length reference frames providing a duration, pitch contour, and/or energy contour that represents the corresponding prosody of the reference audio signal; (Paragraph 34 describes that the autoencoder 300 includes an encoder portion 302 (FIG. 2A) that encodes a plurality of fixed-length reference frames 220 sampled from a reference audio signal 222 into a fixed-length utterance embedding 260.  Paragraph 35 describes that the reference frames 220 may each include a duration of 5 milliseconds (ms) and represent one of a contour of pitch (F0) or a contour of energy (C0) for the reference audio signal 222.)
decoding, using the transcript of the corresponding reference audio signal, the corresponding fixed-length utterance embedding into a sequence of fixed-length predicted frames representing a prosodic representation of the transcript; (Paragraph 60 describes that the data processing hardware 122 may first query the data storage 180 to locate utterance embeddings 260 having transcripts 261 that closely match the text utterance 320 and then select the utterance embeddings 260 to predict the prosodic representation 322 for the given text utterance 320. In some examples, the fixed-length utterance embedding 260 is selected by picking a specific point in a latent space of embeddings 260 that likely represents particular semantics and pragmatics for a target prosody.)
Clark in view of Hayashi and Zhao does not explicitly describe:
“generating gradients/losses between the sequence of fixed-length predicted frames decoded from the corresponding fixed-length utterance embedding and the sequence of fixed-length reference frames sampled; and
back-propagating the gradients/losses through the prosody model.”  
However, Kuo describes a BERT-based classifier 723 that performs:
“generating gradients/losses between the sequence of fixed-length predicted frames decoded from the corresponding fixed-length utterance embedding and the sequence of fixed-length reference frames sampled; (Paragraph 54 describes that a training procedure that optimizes two separate loss terms is employed. The first loss term corresponds to a composite cross-entropy intent classification loss derived by using the text embeddings and the acoustic embeddings (cited as “predicted frames” and “reference frames”) (“gradients/losses” is interpreted to be “gradients or losses”)) and
back-propagating the gradients/losses through the prosody model.” (Paragraph 54 describes that the gradients from the combined classification loss are propagated back to both the text and acoustic embedding networks.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the back propagation of gradients/losses of Kuo into the combination of Clark in view of Hayashi and Zhao to achieve the goal of better training, as described in paragraphs 52 and 54 of Kuo.
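For illustration only, the claimed steps of generating a loss between predicted frames and reference frames and propagating the resulting gradients back through a model may be sketched as follows.  The sketch is not taken from Clark, Hayashi, Zhao, or Kuo.  The single linear layer standing in for the decoder, the mean squared error loss (Kuo's Paragraph 54 instead describes a cross-entropy classification loss), the learning rate, and all names and values are assumptions made for this example.  The gradient is derived by hand with the chain rule, which is the single-layer case of back-propagation.

```python
def decode(weights, embedding):
    """Hypothetical linear decoder: one predicted frame per weight row."""
    return [sum(w * e for w, e in zip(row, embedding)) for row in weights]


def mse_loss(predicted, reference):
    """Mean squared error between predicted frames and reference frames."""
    return sum((p - r) ** 2 for p, r in zip(predicted, reference)) / len(reference)


def train_step(weights, embedding, reference, lr=0.1):
    """One gradient-descent update of the decoder weights."""
    predicted = decode(weights, embedding)
    n = len(reference)
    updated = []
    for row, p, r in zip(weights, predicted, reference):
        d_pred = 2.0 * (p - r) / n  # dLoss/dPredictedFrame
        # Chain rule: dLoss/dWeight = dLoss/dPredictedFrame * embedding value
        updated.append([w - lr * d_pred * e for w, e in zip(row, embedding)])
    return updated


embedding = [1.0, 0.5]              # hypothetical fixed-length utterance embedding
reference_frames = [0.2, 0.4, 0.6]  # hypothetical reference frames
weights = [[0.0, 0.0] for _ in reference_frames]

loss_before = mse_loss(decode(weights, embedding), reference_frames)
for _ in range(50):
    weights = train_step(weights, embedding, reference_frames)
loss_after = mse_loss(decode(weights, embedding), reference_frames)
```

In this toy setting the loss decreases with each update, which is the effect that repeatedly propagating the gradients back through the model is intended to achieve.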
With regard to Claim 11, Clark in view of Hayashi and Zhao does not explicitly describe “back-propagating the gradients/losses through the prosody model comprises fine-tuning the pre-trained BERT model by updating parameters of the pre-trained BERT model based on the gradients/losses back-propagating through the prosody model.”
However, Kuo describes “back-propagating the gradients/losses through the prosody model comprises fine-tuning the pre-trained BERT model by updating parameters of the pre-trained BERT model based on the gradients/losses back-propagating through the prosody model.” (Paragraph 54 describes that the gradients from the combined classification loss are propagated back to both the text and acoustic embedding networks, which would update the parameters of these BERT-based deep neural networks.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the back propagation of gradients/losses of Kuo into the combination of Clark in view of Hayashi and Zhao to achieve the goal of better training, as described in paragraphs 52 and 54 of Kuo.
	With regard to Claim 22, Clark describes:
“training the prosody model further comprises, for each reference audio signal:
sampling, from the corresponding reference audio signal, a sequence of fixed-length reference frames providing a duration, pitch contour, and/or energy contour that represents the corresponding prosody of the reference audio signal; (Paragraph 34 describes that the autoencoder 300 includes an encoder portion 302 (FIG. 2A) that encodes a plurality of fixed-length reference frames 220 sampled from a reference audio signal 222 into a fixed-length utterance embedding 260.  Paragraph 35 describes that the reference frames 220 may each include a duration of 5 milliseconds (ms) and represent one of a contour of pitch (F0) or a contour of energy (C0) for the reference audio signal 222.)
decoding, using the transcript of the corresponding reference audio signal, the corresponding fixed-length utterance embedding into a sequence of fixed-length predicted frames representing a prosodic representation of the transcript; (Paragraph 60 describes that the data processing hardware 122 may first query the data storage 180 to locate utterance embeddings 260 having transcripts 261 that closely match the text utterance 320 and then select the utterance embeddings 260 to predict the prosodic representation 322 for the given text utterance 320. In some examples, the fixed-length utterance embedding 260 is selected by picking a specific point in a latent space of embeddings 260 that likely represents particular semantics and pragmatics for a target prosody.)
Clark in view of Hayashi and Zhao does not explicitly describe:
“generating gradients/losses between the sequence of fixed-length predicted frames decoded from the corresponding fixed-length utterance embedding and the sequence of fixed-length reference frames sampled; and
back-propagating the gradients/losses through the prosody model.”  
However, Kuo describes a BERT-based classifier 723 that performs:
“generating gradients/losses between the sequence of fixed-length predicted frames decoded from the corresponding fixed-length utterance embedding and the sequence of fixed-length reference frames sampled; (Paragraph 54 describes that a training procedure that optimizes two separate loss terms is employed. The first loss term corresponds to a composite cross-entropy intent classification loss derived by using the text embeddings and the acoustic embeddings (cited as “predicted frames” and “reference frames”) (“gradients/losses” is interpreted to be “gradients or losses”)) and
back-propagating the gradients/losses through the prosody model.” (Paragraph 54 describes that the gradients from the combined classification loss are propagated back to both the text and acoustic embedding networks.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the back propagation of gradients/losses of Kuo into the combination of Clark in view of Hayashi and Zhao to achieve the goal of better training, as described in paragraphs 52 and 54 of Kuo.
With regard to Claim 23, Clark in view of Hayashi and Zhao does not explicitly describe “back-propagating the gradients/losses through the prosody model comprises fine-tuning the pre-trained BERT model by updating parameters of the pre-trained BERT model based on the gradients/losses back-propagating through the prosody model.”
However, Kuo describes “back-propagating the gradients/losses through the prosody model comprises fine-tuning the pre-trained BERT model by updating parameters of the pre-trained BERT model based on the gradients/losses back-propagating through the prosody model.” (Paragraph 54 describes that the gradients from the combined classification loss are propagated back to both the text and acoustic embedding networks, which would update the parameters of these BERT-based deep neural networks.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the back propagation of gradients/losses of Kuo into the combination of Clark in view of Hayashi and Zhao to achieve the goal of better training, as described in paragraphs 52 and 54 of Kuo.

Conclusion
6.	The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
U.S. Patent App. Pub. No. 20210286947 (Pajak) describes a device that also divides words into word pieces using a tokenizer.
7.	Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to EDWARD TRACY whose telephone number is (571)272-8332. The examiner can normally be reached Monday-Friday, 9 AM-5 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached at 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/EDWARD TRACY JR./Examiner, Art Unit 2656
/BHAVESH M MEHTA/Supervisory Patent Examiner, Art Unit 2656