Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Claims 1-12 are pending and Claims 1 and 3-5 and 7-12 are independent.  Claims 11-12 are new.  Claims have been amended.  Claims 1 and 5 are amended to include a normalizing step: “normalize an acoustic feature amount of speech data corresponding to the learning text data.”
This Application is published as U.S. 20190362703.
Apparent priority February 15, 2017.

Claim 1 is directed to a “word vectorization model learning device.”  
Claim 2 depends from Claim 1.
Claim 3 is directed to a “word vectorization device” that uses the model learned in Claims 1 or 2.
Claim 4 is directed to a “speech synthesis device” that uses the word vectorization device of Claim 3 (and thus includes the limitations of Claim 1 or Claim 2).
Claim 5 is directed to a “word vectorization model learning method” with limitations similar to those of Claim 1.  Claim 6 depends from Claim 5.
Claim 7 is directed to a “speech synthesis method” using the word vectorization device of Claim 6.
Claim 8 is a CRM claim referring to Claims 1 or 2.
Claim 9 is a CRM claim referring to Claim 3.
Claim 10 is a CRM claim referring to Claim 4.
Claim 11 is directed to a “word vectorization device.”  
Claim 12 is directed to a “speech synthesis device” that uses the “word vectorization device” of Claim 11.  

Applicant’s amendments and arguments are considered but are either unpersuasive or moot in view of the new grounds of rejection.

Please see Example Claim language below.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 4/6/2022 has been entered.

Response to Amendments
Objection to Figure 1, Figure 2, and Figures 3A and 3B is withdrawn in view of the amendments that add the legend “Related Art” to the Drawings.
Objection to Claim 2 because of informalities is withdrawn in view of the amendments to this Claim.
Example Claim Language

[Image: media_image1.png]

Please note from the Specification:  “[0031] The major difference from a conventional word vectorization model learning device 91 (see FIG. 2) is that the word vectorization model learning device 91 uses only text data as learning data of the word vectorization model, whereas this embodiment uses speech data and its text data.”  
This appears key to the concept of this Application.  Speech unfolds over time; unlike text, it is not an instantaneous data point.  The concept of time as an input therefore has to be present in the Claim.  In the instant Application this concept is represented by the “word segmentation information segL,s(t),” which is an input to the training/learning part of Figure 4, either directly or indirectly through an “alignment” processing.
See also: “[0037] In addition, word segmentation information segL,s(t) (see FIG. 7) indicating when the word yL,s(t) in the speech data was spoken is also given. …  This word segmentation information may be given manually, or may be automatically given from speech data and text data by using a speech recognizer or the like. In this embodiment, information xL(t) and word segmentation information segL,s(t) based on the speech data are input to the word vectorization model learning device 110. However, a configuration may be adopted in which only the information xL(t) based on the speech data is input to the word vectorization model learning device 110 and the word boundary of each word is given by forced alignment in the word vectorization model learning device 110, thereby obtaining word segmentation information segL,s(t).”
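For illustration only, the role of the word segmentation information can be sketched as follows. The sketch assumes frame-level acoustic features at a fixed frame shift and assumes that the per-word feature is the mean of the frames inside the word's interval; the Specification does not state how frames are pooled, and the names `word_features`, `frame_shift` and `segments` are hypothetical.

```python
from typing import List, Tuple


def word_features(
    frames: List[List[float]],
    frame_shift: float,
    segments: List[Tuple[float, float]],
) -> List[List[float]]:
    """Pool frame-level features into one vector per word.

    frames      : one acoustic feature vector per frame (assumed layout)
    frame_shift : seconds between consecutive frames (assumed constant)
    segments    : (start, end) time of each word in seconds, i.e. the
                  word segmentation information; assumed to lie inside
                  the span covered by `frames`
    """
    pooled = []
    for start, end in segments:
        first = int(round(start / frame_shift))
        # Keep at least one frame even for very short words.
        last = max(first + 1, int(round(end / frame_shift)))
        chunk = frames[first:last]
        dim = len(chunk[0])
        # Mean pooling is an assumption made for this sketch.
        pooled.append([sum(f[d] for f in chunk) / len(chunk) for d in range(dim)])
    return pooled
```

With four one-dimensional frames and two words of two frames each, the function returns one averaged vector per word. Without the segmentation input there is no way to decide which frames belong to which word.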

Consider the following Claim as an example.  The Claim needs to set forth the inputs clearly and then state what the output will be.  Here, the inputs are to the training/learning module and the initial output is the trained model.  The Claim also needs to reiterate what the trained model takes as its input and what it can generate as output.
Example Claim. A word vectorization model learning device comprising: 
processing circuitry configured to:
receive learning text data ( textL ),  wherein the learning text data is a speech recognition corpus comprising speech data ( XL ) and corresponding transcribed text data as words ( yL,s(t) );
receive word segmentation information ( segL,s(t) ) indicating when each word in the speech data was spoken;
convert the words in the learning text data to word vectors ( wL,s(t) );
obtain an acoustic feature amount vector ( af L,s(t) ) for each word from the speech data XL and the word segmentation information;
provide the word vectors as a first input to a neural network in training;
provide the acoustic feature amount for each word as a second input to the neural network; and
train the neural network to learn the word vectorization model by using the word vectors indicating the words included in learning text data, and the acoustic feature amount that corresponds to the words,
wherein the trained word vectorization model receives a word vector wL representing a first word and outputs an acoustic feature amount vector af(wL) representing an acoustic feature amount of the speech data corresponding to the first word, and
wherein the neural network uses an output value from any intermediate layer as a word vector. 

Examiner makes no representation as to allowability of the Example Claim.  The Instant Application trains a neural network model to generate acoustic information/feature data for an input word.  The neural network model is trained on a corpus of text and associated acoustic information.  This concept is not novel.  Nonobviousness, if any, would depend on particulars of the claimed method.
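For illustration only, the input/output relationship described above can be sketched as a small feed-forward network in which the activation of an intermediate layer is read out as the word vector. This is a sketch under stated assumptions and not the Applicant's implementation: the single hidden layer, the tanh activation, the layer sizes, and the names `WordVectorizationModel`, `hidden` and `predict_acoustic` are all hypothetical, and the weights are random placeholders because training is omitted.

```python
import math
import random


def one_hot(index, size):
    """Vector indicating a word by its (hypothetical) vocabulary index."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def make_layer(n_in, n_out, rng):
    """Random placeholder weights; a real model would learn these."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]


def apply_layer(weights, x, activation):
    return [activation(sum(w * v for w, v in zip(row, x))) for row in weights]


class WordVectorizationModel:
    def __init__(self, vocab_size, hidden_size, acoustic_dim, seed=0):
        rng = random.Random(seed)
        self.w_hidden = make_layer(vocab_size, hidden_size, rng)
        self.w_out = make_layer(hidden_size, acoustic_dim, rng)

    def hidden(self, w):
        """Intermediate-layer output, used as the word vector."""
        return apply_layer(self.w_hidden, w, math.tanh)

    def predict_acoustic(self, w):
        """Network output: acoustic feature amount for the input word."""
        return apply_layer(self.w_out, self.hidden(w), lambda z: z)


model = WordVectorizationModel(vocab_size=5, hidden_size=3, acoustic_dim=2)
word_vector = model.hidden(one_hot(1, 5))             # length 3
acoustic = model.predict_acoustic(one_hot(1, 5))      # length 2
```

The point of the sketch is only the data flow: the training target is acoustic, while the quantity kept for later use is the hidden activation.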

Response to Arguments

Arguments are moot in view of the new grounds of rejection.

Claim 1 provides:
1. A word vectorization model learning device comprising: 
processing circuitry configured to:
normalize an acoustic feature amount of speech data corresponding to the learning text data, and
learn a word vectorization model by using a vector wL indicating a word yL included in learning text data, and an acoustic feature amount that is the normalized acoustic feature amount and that corresponds to the word yL, 
wherein the word vectorization model includes 
a neural network that receives a vector indicating a word as an input and outputs the acoustic feature amount of speech data corresponding to the word, and 
the word vectorization model is a model that uses an output value from any intermediate layer as a word vector. 

	Support for the added normalization step is found at:
Speech Data Normalization Part 214
[0061] The speech data normalization part 214 receives an acoustic feature amount, which is information xL based on speech data, normalizes the acoustic feature amount of the speech data corresponding to the learning text data of the same speaker (S121), and outputs it.
[0062] In the case where the information about the speaker of each sentence is given in the acoustic feature amount, a way of normalization is, for example, to determine the mean and the variance from the acoustic feature amount of the same speaker, and determine the z-score. For example, in the case where no information about the speaker is given, it is assumed that the speakers of the sentences are different, and the mean and the variance are determined for each sentence from the acoustic feature amount, and the z-score is determined. Subsequently, the z-score is used as a normalized acoustic feature amount.
	Published Application.
	“Normalization” can have varied meanings in different contexts.  The instant Application defines normalizing as obtaining a Z-score.  However, the Claim broadly refers to normalizing the acoustic feature amount without providing particularity regarding how this normalizing is performed.  In general, a “normalization” step, without particularity, is quite “normal” in many processes using data with variability that must be wiped out or moderated before use. 
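For illustration only, the z-score normalization described in [0062] can be sketched as follows. The sketch assumes a scalar acoustic feature per word and uses the population standard deviation; both are assumptions, as are the names `z_score_normalize`, `values` and `groups`. The group key is the speaker when speaker information is given and the sentence otherwise, following [0062].

```python
from collections import defaultdict
from statistics import fmean, pstdev


def z_score_normalize(values, groups):
    """Return (value - group mean) / group std for each value.

    values : one scalar acoustic feature per word (assumed scalar)
    groups : speaker id per value, or sentence id when the speaker
             is unknown
    """
    by_group = defaultdict(list)
    for value, group in zip(values, groups):
        by_group[group].append(value)

    # Population standard deviation is an assumption of this sketch.
    stats = {g: (fmean(vs), pstdev(vs)) for g, vs in by_group.items()}

    normalized = []
    for value, group in zip(values, groups):
        mean, std = stats[group]
        # A constant group has zero spread; map it to 0.0 instead of dividing by zero.
        normalized.append((value - mean) / std if std > 0 else 0.0)
    return normalized
```

Two speakers with very different raw ranges are mapped onto the same scale, which is the speaker variability the normalization is meant to remove. A claim reciting only “normalize” would equally cover other schemes, such as min-max scaling.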

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 3, 5-6, 8-9 and 11-12 are rejected under 35 U.S.C. 103 as being unpatentable over Pollet (U.S. 20180096677) in view of Matsuda (U.S. 20160260428).
Regarding Claim 1, Pollet teaches:
1. A word vectorization model learning device comprising: 
processing circuitry configured to: [Pollet, Figure 1, “processor 111.”]
normalize an acoustic feature amount of speech data corresponding to the learning text data, and [ Pollet teaches normalizing the linguistic features and not the acoustic features.  “[0112] … The lingcode function extracts and normalizes a 367 dimensional vector as the linguistic features. …”]
learn a word vectorization model by using [Pollet.  Figures 3 and 10.  Figure 3 shows the training of the speech synthesis model which teaches the word vectorization model of this Claim.  “[0045] A training process configures the model of a LSTM-RNN. FIG. 3 depicts an example method for training an LSTM-RNN….”    “[0049] At step 305, the computing device may perform model training on the LSTM-RNN. …”  Figures 5-9 show various embodiments of model training.  Figure 10, “Train each layer of LSTM-RNN 1001.”]
a vector wL indicating a word yL  [Pollet, the word vector is taught by the one-hot vector of “Linguistic Features.”  The word vectors or vector of linguistic features may be determined in different ways ([0056]-[0058]) including: “[0058] … The linguistic encoding function may annotate, or otherwise associate, each sub-word unit with orthographic, syntactic, prosodic and/or phonetic information. … Word representations, which may be encoded as one-hot vectors or word embeddings, may be repeated for all sub-word units.”  For various types of linguistic features, which are comprehensive, see:  “[0055] … Linguistic features may include orthographic words or identifiers to a dictionary of orthographic words, or other encodings of word data. In some variations there may be more than 50 different types of linguistic features. For example, different types of the linguistic features may include phoneme, syllable, word, phrase, phone identity, part-of-speech, phrase tag, stress, number of words in a syllable/word/phrase, compounds, document class, other utterance-level features, etc. …”]
included in learning text data, and [ Pollet, Figures 3 and 10-12.  The “Training Set” teaches the “Learning text data” of the Claim.  “Configure Training Set  … 301.”  Training set, as shown in Figure 12, includes “linguistic features 901” / text and corresponding speech/ “Aligned acoustic Data 1210” / “acoustic feature amount” of the Claim.  “[0107] … a set of ground truth speech units or other type of speech data….”]
an acoustic feature amount that is the normalized acoustic feature amount and that corresponds to the word yL, [ Pollet,  ““[0062] In some variations, the speech data may include acoustic parameters, a sequence of speech unit identifiers, or other speech descriptors. The sequence of speech unit identifiers may be the result of a search of a speech unit database. In some arrangements, the acoustic parameters may include acoustic frame-sized features including, for example, high resolution, dimensional spectral parameters (e.g., mel regularized cepstral coefficients (MRCC)); high resolution fundamental frequency; duration; high resolution excitation (e.g., maximum voicing frequency); one or more other prosodic parameters (e.g., log-fundamental frequency (log-f0) or pitch); one or more other spectral parameters (e.g., mel cepstral vectors with first and second order derivatives, parameters for providing a high detailed multi-representational spectral structure); or any other parameter encoded by one of the LSTM-RNNs. Indeed, different embodiments may include different configurations of the acoustic parameters (e.g., a set based on the parameters supported by or expected by a particular vocoder). Further details of the types of speech data that may result from the determination at step 405 will be discussed in connection with FIGS. 5-9.”]
wherein the word vectorization model includes a neural network that [Pollet, Figure 4 shows the use of the trained TTS to synthesize speech from text.  Figure 2 shows the trained model to be a “LSTM-RNN 202.”  Figure 4, “Using one or more LSTM-RNNs … 405.”]
receives a vector indicating a word as an input and [ Pollet, Figure 4, “Receive Text Input 401” and “Determine Linguistic Features Based on the Input Text 403.”  “[0054] At step 401, a computing device may receive text input. The text input may, in some variations, be received following user interaction with an interface device or generated as part of an application's execution. For example, a user may enter a string of characters by pressing keys on a keyboard and the text input may include the string of characters. ….”  Note [0056]-[0058] regarding encoding the word representations as one-hot vectors or vector embeddings and that words are a type of “linguistic feature” of Pollet as taught in [0055].]
outputs the acoustic feature amount of speech data corresponding to the word, and [ Pollet, Figure 4, ““Using one or more LSTM-RNNs, Determine speech data based on the linguistic features 405.”  “Speech data” includes “acoustic features.”  “[0025] … recurrent neural networks (RNNs) may be used to determine speech data based on text input. …”  See [0062] for the types of “acoustic feature” of “speech data” that are considered by the RNN models both during training and as output.]
the word vectorization model is a model that uses an output value from any intermediate layer as a word vector. [ Pollet, Figure 4, “Using One or More LSTM-RNNs … 405.”  Pollet teaches that outputs of the hidden layers/ intermediate layers are used.  SUE= Speech Embedding Unit which teaches the “word vector” of the Claim.  “[0028] In the embodiments described throughout this disclosure, the embedded data extracted from one or more hidden layers of an LSTM-RNN may be used as a basis for performing a speech unit search. In particular, the embedded data may include one or more SUEs, which may be organized as a vector (e.g., an SUE vector). For example, an SUE vector may be extracted or otherwise determined from activations of a hidden layer from an LSTM-RNN. With respect to the embodiments described herein, an SUE vector may encode both symbolic and acoustic information associated with speech units. For example, the vector of SUE may encode dynamic information that describes both future state information and past state information for a speech unit and may encode mapping information between linguistic features to acoustic features. … Due to the SUE vector being extracted from an intermediate layer in the BLSTM-RNN, the exact content of the SUE may be unknown and, thus, the SUE vector may be considered information that is both deep in time and space. ….”]
Pollet does not teach normalizing the acoustic feature data.
Matsuda teaches:
normalize an acoustic feature amount of speech data corresponding to the learning text data, and [ Matsuda teaches acoustic feature normalization in order to smooth out speaker variability in the training data:   “[0011] Speech characteristics differ as the sex and age of speakers differ. ….”  “[0012] To solve this problem, in a conventional speech recognition technique using an acoustic model based on HMM (Hidden Markov Model), a method of speaker adaptation referred to as SAT (Speaker Adaptive Training) has been successfully applied. … SAT is a training scheme that normalizes speaker-dependent acoustic variability in speech signals and optimizes recognizing parameters including GMMs to realize speaker-adaptation of the acoustic models and to achieve high recognition accuracy. HMM of this type is referred to as SAT-HMM.”]
Pollet and Matsuda pertain to training of neural networks for speech synthesis.  Pollet teaches the training and use of a neural network model for speech synthesis but does not teach normalizing the acoustic feature data of the corpus used for the training.  It would have been obvious to modify the system of Pollet with the teachings of Matsuda, which teaches this feature, in order to smooth out the variability that is not to be trained into the model.  This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.

Regarding Claim 3, Pollet teaches:
3. A word vectorization device that uses a word vectorization model learned in the word vectorization model learning device according to claim 1 or 2, the word vectorization device comprising:
processing circuitry configured to:
convert a vector wo_1 indicating a word yo included in text data to be vectorized to a word vector wo_2 by using the word vectorization model. [See rejection of Claim 1. Pollet. The word vector is taught by the one-hot vector of “Linguistic Features.”  Input text is first converted to a vector before being input to the neural network and then at each layer of the LSTM-RNN the input word vector is converted to another vector.  The word vectors or vector of linguistic features may be determined in different ways ([0056]-[0058]) including: “[0058] … The linguistic encoding function may annotate, or otherwise associate, each sub-word unit with orthographic, syntactic, prosodic and/or phonetic information. … Word representations, which may be encoded as one-hot vectors or word embeddings, may be repeated for all sub-word units.”]

Claim 5 is a method Claim with limitations similar to the limitations of Claim 1 which are rejected under similar rationale.
5. A word vectorization model learning method to be executed by a word vectorization model learning device that includes processing circuitry, the word vectorization model learning method comprising: 
normalizing step in which the processing circuitry normalizes an acoustic feature amount of speech data corresponding to the learning text data, and
a learning step in which the processing circuitry learns a word vectorization model by using a vector wL indicating a word yL included in learning text data, and an acoustic feature amount that is the normalized acoustic feature amount and that corresponds to the word yL, 
wherein the word vectorization model includes a neural network that receives a vector indicating a word as an input and outputs the acoustic feature amount of speech data corresponding to the word, and the word vectorization model is a model that uses an output value from any intermediate layer as a word vector. 

Claim 6 is a method Claim which obtains its limitations from Claim 5 and with additional limitations similar to the limitations of Claim 3 which are rejected under similar rationale.
6. A word vectorizing method to be executed by a word vectorization device, the word vectorizing method using a word vectorization model learned by the word vectorization model learning method according to claim 5, the word vectorizing method comprising: 
a word vector converting step in which the processing circuitry converts a vector wo_1 indicating a word yo included in text data to be vectorized to a word vector wo_2 by using the word vectorization model. 

Claim 8 is a CRM Claim that receives its steps from the limitations of Claim 1 or Claim 2, which are rejected under similar rationale.  Further:
8. A non-transitory computer-readable medium having recorded thereon a program for causing a computer to function as the word vectorization model learning device according to claim 1 or 2. [Pollet:  Figure 1, Memory 121 storing programs and data.  “15. One or more computer-readable media storing executable instructions that, when executed cause an apparatus to:….”]

Claim 9 is a CRM Claim that receives its steps from the limitations of Claim 3, which are rejected under similar rationale.
9. A non-transitory computer-readable medium having recorded thereon a program for causing a computer to function as the word vectorization device according to claim 3. 

Claim 11 is a device Claim with limitations similar to the limitations of Claim 1 and Claim 3 which are rejected under similar rationale.
11. A word vectorization device that uses a word vectorization model learned in a word vectorization model learning device, wherein 
the word vectorization model learning device comprises processing circuitry configured to learn a word vectorization model by using a vector wL indicating a word yL included in learning text data, and an acoustic feature amount that is an acoustic feature amount of speech data corresponding to the learning text data and that corresponds to the word yL, [Claim 1.]
the word vectorization model includes a neural network that receives a vector indicating a word as an input and outputs the acoustic feature amount of speech data corresponding to the word, and the word vectorization model is a model that uses an output value from any intermediate layer as a word vector, and [Claim 1.]
the word vectorization device inputs a vector wo_1 indicating a word yo included in text data to be vectorized to the word vectorization model, and outputs an output value from any intermediate layer in the word vectorization model as a word vector wo_2 of the word yo.  [Claim 3.]

Claim 12 is a device Claim that receives its steps from the limitations of Claim 11, which are rejected under similar rationale, and additionally:
Regarding Claim 12, Pollet teaches: 
12. A speech synthesis device that generates synthesized speech data by using a word vector vectorized using the word vectorization device according to Claim 11, [Pollet, Figure 4 showing the process of speech synthesis.  Figure 1 showing the device.]
comprising: 
processing circuitry configured to: [Pollet, Figure 1 showing the hardware including “processor 111.”]
receive phonemic information on a certain word and a word vector corresponding to the word as inputs, and [Pollet teaches that the LSTM-RNN may be trained at a phonemic level and the generated speech units may be phonemes.  “[0069] The predictions of the first LSTM-RNN 503 may be at the unit-level, which may be smaller than a word (e.g., a diphone, phone, half-phone or demi-phone). …”]
generate synthesized speech data. [Pollet, Figure 4, “generate a waveform based on the speech data 407” and output of the waveform 411.]

Claims 2, 4, 7 and 10 are rejected under 35 U.S.C. 103 as being unpatentable over Pollet and Matsuda in view of Raitio (U.S. 20170345411).
Regarding Claim 2, Pollet teaches:
2. The word vectorization model learning device according to claim 1, wherein the processing circuitry is further configured to:
convert the word yL included in the learning text data to a first vector wL,1 indicating the word yL, and [Pollet teaches that one-hot vectors of Linguistic Features / words are created by the LSTM-RNN.  See Claim 1.  See Figure 2, input of “Linguistic Features 201” to the “input layer” and the output from the “input layer.”]
convert the first vector wL,1 to the vector wL by using a second word vectorization model, wherein the second word vectorization model is a model that includes a neural network learned based on language information without use of the acoustic feature amount of speech data. 
Pollet, Figure 5, teaches a “First LSTM-RNN 503” to be followed by a “Second LSTM-RNN 511.”  The inputs to and outputs from the layers are vectors.  However, the second model of Pollet operates on speech units and the next limitation of the Claim states that the operation must be on words.  “[0073] The selected speech units 509 are input into a second LSTM-RNN 511. The second LSTM-RNN 511 predicts a second subset of the speech data 513 based on the selected speech units 509….”  “[0074] The prediction of the second LSTM-RNN 511 may be at the frame-level. A duration parameter in the selected data 509 may determine the extrapolation of the selected speech units 509 from unit to frame resolution by appending frame positional features relative to unit duration. Within the second LSTM-RNN 511, several hidden layers may be stacked to create computationally deep model.”
Matsuda teaches the use of n-gram language models.  See [0010] and [0078].  Thus, Matsuda inherently teaches the use of n-grams to predict a word from the previous word.
Raitio expressly teaches:
convert the first vector wL,1 to the vector wL by using a second word vectorization model, wherein the second word vectorization model is a model that includes a neural network learned based on language information without use of the acoustic feature amount of speech data. [Raitio is also directed to STT and uses Language Models to guess the next word based on the previous words: “[0167] Language model generation module 602 is configured to receive a corpus of text and generate a language model. The generated language model is configured to predict a current word given a context of previous words. For example, the generated language model is an n-gram language model. In some examples, the generate language model is a statistical language model or a neural network based language model.”]
Pollet/Matsuda and Raitio pertain to training of neural networks for speech synthesis and it would have been obvious to modify the system of the combination, which does include the use of language models, with Raitio, which expressly teaches the use of an n-gram language model to predict a word from the previous words.  This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
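For illustration only, a language model that is learned from text alone and predicts a word from the previous word can be sketched as a bigram counter. This is the n-gram variant named in the quoted passage, not the neural network variant recited in Claim 2, and it is not taken from any cited reference; the names `train_bigram` and `predict_next` and the whitespace tokenization are assumptions.

```python
from collections import Counter, defaultdict


def train_bigram(sentences):
    """Count how often each word follows each previous word (text only)."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()  # whitespace tokenization is an assumption
        for prev, cur in zip(words, words[1:]):
            counts[prev][cur] += 1
    return counts


def predict_next(counts, prev):
    """Most frequent follower of `prev`, or None if `prev` was never seen."""
    if prev not in counts:
        return None
    return counts[prev].most_common(1)[0][0]


corpus = ["the cat sat", "the cat ran", "the dog sat"]
model = train_bigram(corpus)
```

No acoustic data enters this model at any point, which is the property the limitation “without use of the acoustic feature amount of speech data” is directed to.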

Regarding Claim 4, Pollet teaches: 
4. A speech synthesis device that generates synthesized speech data by using a word vector vectorized using the word vectorization device according to claim 3, [Pollet teaches that its LSTM-RNN trained models are used for speech synthesis.  See Figure 4 and generation of waveforms 411.  “[0029] Taken together, the four illustrative embodiments may provide for aspects of a general framework for synthesizing speech using a neural network graph in which one or more LSTM-RNNs are present. Based on the general framework, other embodiments and variations will be apparent. FIG. 4 depicts a method that relates to the general framework. In particular FIG. 4 illustrates an example method that synthesizes speech based on one or more LSTM-RNNs. Each of the four embodiments described throughout this disclosure provides additional details to the example method of FIG. 4.”]
the speech synthesis device comprising: 
processing circuitry configured to: [Pollet, Figure 1, “processor 111.”]
generate synthesized speech data through a speech synthesis model including a neural network that receives phonemic information on a certain word and a word vector corresponding to the word as inputs and outputs information for generating synthesized speech data related to the word, by using phonemic information on the word yo and the word vector wo, [Pollet teaches speech synthesis in Figure 4 and teaches that the model may be trained on speech units that are ”phonemes.”  “[0030] … In one more particular example, a fixed window of features may be used and include features such as previous phoneme, previous word prominence level and current word prominence level….”  Pollet teaches that the LSTM-RNN may be trained at a phonemic level and the generated speech units may be phonemes.  “[0069] The predictions of the first LSTM-RNN 503 may be at the unit-level, which may be smaller than a word (e.g., a diphone, phone, half-phone or demi-phone). …”]
wherein the word vectorization model is obtained by re-learning a word vectorization model learned using the vector wL and the acoustic feature amount, the re-learning using a vector indicating a word and an acoustic feature amount of speech data for speech synthesis that is speech data corresponding to the word. [Pollet teaches training of the LSTM-RNN model in Figure 3 and Figures 10-12 and training includes “re-learning.”  Also, the “validation” operation of Pollet can teach the re-training of the Claim.  “[0050] At step 307, the computing device may perform model validation on the LSTM-RNN. In some variations, the validation set may be applied to the trained model and heuristics may be tracked. The heuristics may be compared to one or more stop conditions. If a stop condition is satisfied, the training process may end. If none of the one or more stop conditions are satisfied, the model training of step 305 and validation of step 307 may be repeated. Each iteration of steps 305 and 307 may be referred to as a training epoch. Some of the heuristics that may be tracked include sum squared errors (SSE), weighted sum squared error (WSSE), regression heuristics, or number of training epochs. ….”  “[0076] …To avoid overfitting, the training iterations were stopped after 20 iterations without improvement in the performance of the validation set. ….”]
The training/learning uses a corpus of words and their associated speech (acoustic features).  Training/learning is iterative until optimum weights for the neural network model are obtained.  Retraining or re-learning is special only if the training corpus is changed; for example, if a generally trained model is being retrained for a specific person or a specific emotion (Hirose).
While the Claim is broadly stated and Pollet is likely sufficient to teach the language of the Claim, a more express reference is provided.
Matsuda does not address re-training expressly beyond the inherent iterative nature of adaptation or training.
Raitio expressly teaches the retraining:
wherein the word vectorization model is obtained by re-learning a word vectorization model learned using the vector wL and the acoustic feature amount, the re-learning using a vector indicating a word and an acoustic feature amount of speech data for speech synthesis that is speech data corresponding to the word. [ Ratio“[0187] Mixture density network 900 is trained based on data that includes recorded speech and a corresponding corpus of text. In some examples, mixture density network 900 is trained in parallel using multiple CPUs. The parallel training scheme can search for an optimal weight space and provide a model faster than sequential training. This model is further retrained on the whole of the data to obtain the final mixture density network that is used at block 706 to determine the predicted statistical parameters for each of a plurality of acoustic features associated with a respective target unit.”]
Pollet/Matsuda and Raitio pertain to training of neural networks for speech synthesis, and it would have been obvious to modify the system of the combination, which does include the use of language models, with Raitio, which expressly teaches the re-training of the trained model for a specific purpose, in order to achieve specialized models.  This combination falls under combining prior art elements according to known methods to yield predictable results or use of a known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.

Claim 7 is a method Claim which obtains its limitations from Claim 6, with additional limitations similar to the limitations of Claim 4, which are rejected under similar rationale.
7. A speech synthesis method to be executed by a speech synthesis device, the speech synthesis method generating synthesized speech data by using a word vector vectorized using the word vectorization device according to claim 6, the speech synthesis method comprising: 
a synthesized speech generating step in which the processing circuitry generates synthesized speech data through a speech synthesis model including a neural network that receives phonemic information on a certain word and a word vector corresponding to the word as inputs and outputs information for generating synthesized speech data related to the word, by using phonemic information on the word yo and the word vector wo_2, 
wherein the word vectorization model is obtained by re-learning a word vectorization model learned using the vector wL and the acoustic feature amount, the re-learning using a vector indicating a word and an acoustic feature amount of speech data for speech synthesis that is speech data corresponding to the word. 

Claim 10 is a CRM Claim that receives its steps from the limitations of Claim 4, which are rejected under similar rationale.
10. A non-transitory computer-readable medium having recorded thereon a program for causing a computer to function as the speech synthesis device according to claim 4.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Hirose (U.S. 20090281807) teaches:
normalize an acoustic feature amount of speech data corresponding to the learning text data, and [Hirose, Figure 1, “Learning Text” is input and on the right hand side, the “Acoustic Analysis Unit 2” extracts the “Time series of acoustic features” which is input to the “Phoneme-based Duration Extending/Shortening Unit 6” which performs a normalization on the acoustic feature data:  “[0012] The phoneme-based duration extending/shortening unit 6 temporally normalizes a time series of feature parameters of the speech with emotion to match the speech without emotion, according to the temporal extending/shortening rate for each phoneme generated by the spectrum DP matching unit 4.”  (The instant Application normalizes the acoustic features from speech of various speakers to smooth out the variability; Hirose smooths out the variability in the speech of the same person.  Same idea: smoothing out variability.  Note that Claim is broadly stated.)]

	PLEASE NOTE THE FOLLOWING FROM THE DISCLOSURE OF THE INSTANT APPLICATION:
	The Specification acknowledges the existence of word vectorization methods such as Word2Vec.  Rather, it sets forth a supposedly inventive method of word vectorization that takes in the word vector generated by some program such as Word2Vec and outputs a new type of word vectorization model that takes into account the “acoustic feature amount” (such as pitch) of the word in the vector that it generates.  To do this it has to input a speech recognition corpus that includes speech data as well as corresponding text and word segmentation data that indicates when each word was spoken.  See the following:
[0005] An object of the present invention is to provide a word vectorization device for converting a word into a word vector considering acoustic features of the word, a word vectorization model learning device for learning a word vectorization model used in the word vectorization device, a speech synthesis device for generating synthesized speech data using a word vector, a method thereof, and a program.
…
[0027] In recent years, a large amount of speech data and its transcribed text (hereinafter also referred to as a speech recognition corpus) have been prepared as learning data for speech recognition and the like. In this embodiment, speech data is used as learning data of a word vectorization model (word (morpheme) notation) in addition to text, which is conventionally used. For example, a model that estimates the acoustic feature amount (spectrum, pitch parameter, and the like) of a word and its temporal variations from an input word (text data) is learned using a large amount of speech data and text, and this model is used as a word vectorization model. 

[0028] Learning a model in this manner allows a vector considering the similarity of pronunciation or other features between words to be extracted. Further, use of word vectors considering similarity of pronunciation or other features can improve the performance of speech processing techniques, such as speech synthesis and speech recognition.
…
[0036] For example, a corpus (a speech recognition corpus) consisting of a large amount of speech data and its transcribed text data can be used as learning text data tex.sub.L and speech data corresponding to the learning text data tex.sub.L. In other words, it consists of a large amount of speech (speech data) spoken by a person and sentences (text data) added to the speech (each has S sentences). For this speech data, only speech data spoken by one speaker or a mixture of speech data spoken by various speakers may be used. 

[0037] In addition, word segmentation information seg.sub.L,s(t) (see FIG. 7) indicating when the word y.sub.L,s(t) in the speech data was spoken is also given. Although the start time and the end time of each word are used as word segmentation information in the example shown in FIG. 7, other information may be used. For example, when the end time of a word coincides with the start time of the next word, either one of the start time and the end time may be used as word segmentation information. Alternatively, the start time of the sentence may be designated and only the speaking time may be used as the word segmentation information. For example, with settings in which "pause"=350, "This"=250, "is"=80, . . . , the start time and end time of each word can be specified. In short, the word segmentation information may be any information that can indicate when the word y.sub.L,s(t) was spoken. This word segmentation information may be given manually, or may be automatically given from speech data and text data by using a speech recognizer or the like. In this embodiment, information x.sub.L(t) and word segmentation information seg.sub.L,s(t) based on the speech data are input to the word vectorization model learning device 110. However, a configuration may be adopted in which only the information x.sub.L(t) based on the speech data is input to the word vectorization model learning device 110 and the word boundary of each word is given by forced alignment in the word vectorization model learning device 110, thereby obtaining word segmentation information seg.sub.L,s(t). 
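For illustration only, and not as part of the quoted Specification, the alternative described in [0037], in which only the speaking time of each word is given and the start and end times are derived, can be sketched as follows. The function name is hypothetical, the sentence start time is assumed to be 0, and the time unit is left unspecified because the Specification does not state it.

```python
def durations_to_segments(durations, sentence_start=0):
    """Convert ordered (word, duration) pairs into (word, start, end) triples."""
    segments = []
    start = sentence_start
    for word, duration in durations:
        end = start + duration
        segments.append((word, start, end))
        start = end  # the next word begins where this one ends
    return segments


# Settings taken from the example in [0037].
segments = durations_to_segments([("pause", 350), ("This", 250), ("is", 80)])
# segments == [("pause", 0, 350), ("This", 350, 600), ("is", 600, 680)]
```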

[0038] In addition, although normal text data includes no words expressing silence during speech (such as short pause), this embodiment uses the word "pause" expressing silence in order to ensure consistency with speech data. 

[0039] Information x.sub.L based on speech data may be actual speech data or an acoustic feature amount that can be acquired from the speech data. In this embodiment, it is assumed to be an acoustic feature amount (spectrum parameter and pitch parameter (F0)) extracted from speech data. It is also possible to use the spectrum, the pitch parameter, or both as an acoustic feature amount. Alternatively, it is also possible to use an acoustic feature amount (for example, mel-cepstrum, aperiodicity index, log F0, or voiced/unvoiced flag) that can be extracted from speech data by signal processing or the like. In the case where the information x.sub.L based on the speech data is actual speech data, a configuration for extracting the acoustic feature amount from the speech data may be provided.
…
[0070] In this embodiment, as in the first embodiment, a word is first converted to a one hot expression. As the number of dimensions N at this time, the first embodiment uses the number of types of words in the learning text data tex.sub.L, whereas this embodiment uses the number of types of words in the learning text data that was used for learning of the word vectorization model based on language information. Next, for the obtained vector of the one hot expression of each word, a vector w.sub.L,s(t) is obtained by using the word vectorization model based on language information. Although the vector conversion method varies depending on the word vectorization model based on language information, in the case of Word2Vec, as in the present invention, forward propagation processing is performed to extract the output vector of the intermediate layer (bottleneck layer), thereby obtaining the vector w.sub.L,s(t).
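For illustration only, and not as part of the quoted Specification, the "one hot expression" of [0070] can be sketched as follows: each word is mapped to a vector of N dimensions, where N is the number of word types, with a single 1 at the index of that word. The function names are hypothetical, and assigning indices in order of first appearance is an assumption; the Specification does not state an ordering.

```python
def build_vocabulary(words):
    """Assign each distinct word an index in order of first appearance."""
    vocab = {}
    for word in words:
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab


def one_hot(word, vocab):
    """Return an N-dimensional vector with a 1 at the word's index."""
    vector = [0] * len(vocab)
    vector[vocab[word]] = 1
    return vector


vocab = build_vocabulary(["pause", "This", "is", "This"])
vector = one_hot("is", vocab)  # N = 3 word types
```

In the second embodiment, such a vector would then be passed through a language-information model such as Word2Vec to obtain the vector w.sub.L,s(t); that step is not sketched here.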
…
[0091] The word vectorization model f.sub.w.fwdarw.af used three layers of Bidirectional LSTM (BLSTM) as an intermediate layer, and the output of the second intermediate layer as a bottleneck layer. The number of units of each layer except the bottleneck layer was 256, and Rectified Linear Unit (ReLU) was used as an activation function. In order to verify performance changes due to the number of dimensions of the word vector, five models with different numbers (16, 32, 64, 128, and 256) of units in a bottleneck layer were learned. In order to support unknown words, all words that appear at a frequency of twice or less in the learning data are regarded as unknown words ("UNK") and are regarded as one word. Besides, since unlike text data, speech data contains silence (a pause) inserted in the beginning of a sentence, in the middle of a sentence, and at the end of a sentence, a pause is also treated as a word ("PAUSE") in this simulation. As a result, a total of 26,663 dimensions including "UNK" and "PAUSE" were taken as inputs to the word vectorization model f.sub.w.fwdarw.af. F0 of each word was resampled to a fixed length (32 samples) and the first to fifth dimensions of the DCT value were used as the output of the word vectorization model f.sub.w.fwdarw.af. For learning, 1% randomly selected from all data was used as development data for cross validation (early stopping), and other data was used as learning data. At the time of re-learning using a speech synthesis corpus, like in the speech synthesis model which will be described later, 4,400 sentences and 100 sentences were used as learning and development data, respectively. For comparison with the proposed method, as in conventional methods (see References 1 and 2), an 80-dimensional word vector (Reference 3) consisting of 82,390 words was used as a word vector learned from only text data.
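For illustration only, and not as part of the quoted Specification, the output target described in [0091] (F0 of each word resampled to 32 samples, then the first five DCT dimensions) can be sketched as follows. The function names are hypothetical, and several details are assumptions because the Specification does not state them: linear interpolation is used for resampling, the DCT is an unnormalized DCT-II, and "first to fifth dimensions" is read as coefficients 0 through 4. The input contour is assumed to be non-empty and the target length at least 2.

```python
import math


def resample_linear(values, length=32):
    """Linearly interpolate a contour to a fixed number of samples."""
    if len(values) == 1:
        return [float(values[0])] * length
    resampled = []
    for i in range(length):
        # Position of output sample i on the input index axis.
        pos = i * (len(values) - 1) / (length - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, len(values) - 1)
        frac = pos - lo
        resampled.append(values[lo] * (1 - frac) + values[hi] * frac)
    return resampled


def dct2(values, n_coeffs=5):
    """Return the first n_coeffs coefficients of an unnormalized DCT-II."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (2 * j + 1) / (2 * n))
            for j, v in enumerate(values))
        for k in range(n_coeffs)
    ]


def f0_word_features(f0_contour, length=32, n_coeffs=5):
    """Fixed-size summary of one word's F0 contour."""
    return dct2(resample_linear(f0_contour, length), n_coeffs)
```

A fixed-length summary of this kind gives every word an output of the same size regardless of how long the word was spoken, which is what allows it to serve as a regression target for the model.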
	

[Figure image: media_image1.png]
 
[Figure image: media_image2.png]


[Figure image: media_image3.png]

	See all the types of inputs that go into the learning of the new vectorization model.  At least, both word information and speech information are used as inputs in Figures 4 and 6 above.

Compare Figure 4 against the prior art in Figures 3A and 3B which include only “word information” and no “speech information” as their input.

[Figure image: media_image4.png]


[Figure image: media_image5.png]

Any inquiry concerning this communication or earlier communications from the examiner should be directed to FARIBA SIRJANI whose telephone number is (571)270-1499. The examiner can normally be reached 9 to 5, M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Desir, can be reached at 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/FARIBA SIRJANI/Primary Examiner, Art Unit 2659