DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

                                    Request for Continued Examination
          A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 2/23/22 has been entered.

Response to Amendment
               The prior double patenting rejection of the claims is hereby withdrawn in light of the filed (2/23/22) and approved (2/24/22) Terminal Disclaimer linking the instant application with Copending Application 17/004,460.

Response to Arguments
Applicant's arguments filed 2/23/22 have been fully considered but they are not persuasive.
Regarding the prior rejection of Claims 1 and 10 with reference to Wang, Applicant argues that the combined neural network prosodic labeling step described in Wang does not disclose the limitation “wherein a learned artificial neural network articulatory feature extraction model outputs the embedding vector indicative of the articulatory feature of the speaker in response to receiving a speech sample of the speaker as an input” (Amendment, pg. 7, fourth para. – pg. 10, second para.; pg. 12, second para.).
Examiner respectfully disagrees. As provided in the final Office Action (10/25/21, pg. 10), the extracted prosodic feature vectors of the speech from speaker f2b in a secondary corpus correspond to the claimed embedding vector indicative of articulatory features of a speaker. Wang discloses the use of a small speech secondary corpus (fig. 1; sec. 3.2; sec. 4.1) that includes received speech of a speaker f2b (pg. 2858, sec. 4.1), where a prosodic feature vector (i.e., the claimed articulatory feature) is extracted for each spoken word by the speaker f2b (sec. 3.2.1; sec. 4.1) using a hybrid neural network model that includes a Deep neural network (DNN) as well as a Convolutional neural network (CNN), corresponding to the limitation “receiving an embedding vector indicative of articulatory feature of a speaker” as well as the argued limitation “wherein a learned artificial neural network articulatory feature extraction model outputs the embedding vector indicative of the articulatory feature of the speaker in response to receiving a speech sample of the speaker as an input”.
Furthermore, in response to applicant's argument that the references fail to show certain features of applicant’s invention, it is noted that the features upon which applicant relies (i.e., generating speech that mimics the prosody features of a target speaker as such process requires acquiring a lot of word and speech training data obtained by a new speaker) are not recited in the rejected claim(s).  Although the claims are interpreted in light of the specification, limitations from the specification are not read into the claims.  See In re Van Geuns, 988 F.2d 1181, 26 USPQ2d 1057 (Fed. Cir. 1993).
Applicant further argues that the Office Action unreasonably combines the steps of training a model and using the model in Wang and admits that the prosodic feature vector of speaker f2b is only extracted in the training process of Wang, and must be distinguished from the process of using the model since the claimed process of receiving the embedding vector in now-amended claim 1 requires input of the vector during use (Amendment, pg. 10, third para. – pg. 11, third para.).
Examiner respectfully disagrees, as the claim language does not call for a distinction between the argued training and use steps. The claim language includes steps of generating an artificial neural network text-to-speech synthesis model and receiving an embedding vector. Nothing in the claim language requires distinguishing training or use steps. Wang discloses (see fig. 1, right portion) the use of a Neural network acoustic model (i.e. an artificial neural network text-to-speech synthesis model)
Regarding the 35 U.S.C. 103 rejection of dependent Claims 6-10 with reference to Wang and additional references Graham and Edrenkin, Applicant argues that the references do not disclose the limitations argued above, including the limitation “receiving an embedding vector indicative of articulatory feature of a speaker” as well as the argued limitation “wherein a learned artificial neural network articulatory feature extraction model outputs the embedding vector indicative of the articulatory feature of the speaker in response to receiving a speech sample of the speaker as an input” as recited in independent claim 1, and as such cannot disclose limitations recited in dependent claims 6-11 (Amendment, pg. 11, fourth para. – pg. 12, third para.).
Examiner respectfully disagrees as presented above and as provided in the rejection below. Also, absent any argument as to why the cited portions of the references fail to disclose limitations recited in the dependent Claims 6-11, Examiner maintains that the rejections of the claims are appropriate.
Applicant’s arguments with respect to claim 1, that Wang and additional reference Edrenkin do not disclose the limitation “generating output speech data for the input text reflecting the articulatory feature of the speaker by inputting the input text and the embedding vector indicative of the articulatory feature of the speaker directly to the artificial neural network text-to-speech synthesis model, the embedding vector being generated independently of the input text” (Amendment, pg. 6, fourth para. – pg. 7, third para.; pg. 11, fourth para. – pg. 12, third para.), have been fully considered but they are not persuasive. Edrenkin discloses the limitation (fig. 3; fig. 4; para. [0046]; para. [0051]; .
                                      
                                   Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

1.        Claims 1, 3-7 and 10 are rejected under 35 U.S.C. 103 as being unpatentable over Wang et al., “Enhance the word vector with prosodic information for the recurrent neural network based TTS system” (“Wang”) in view of Edrenkin, US PGPUB 2017/0092258 A1 (“Edrenkin”).
           Per Claim 1, Wang discloses a text-to-speech synthesis method using machine learning, the text-to-speech synthesis method comprising:
            generating an artificial neural network text-to-speech synthesis model by performing machine learning based on a plurality of learning texts and speech data corresponding to the plurality of learning texts (Secondary speech corpus, DNN part, BURNC Secondary speech corpus as including training/learning texts and speech data corresponding to the plurality of training/learning texts, output of corpus subject to machine learning using prosodic labeling model of CNN and DNN);           
            receiving an embedding vector indicative of articulatory feature of a speaker (Secondary speech corpus, Prosodic features, Enhanced word vector, fig. 1; In the post-filter training stage, a prosodic feature vector is extract from a prosodic labeling model for each word in the speech corpus with annotated prosodic tags…we call the small corpus the secondary corpus…, sec. 3.2; The Boston University Radio News Corpus…was used as the secondary corpus to train the post-filter. The speech data of speaker f2b was used…, sec. 4.1, extracted prosodic feature vector of speaker f2b as embedding vector indicative of articulatory feature of a speaker); and
             wherein a learned artificial neural network articulatory feature extraction model outputs the embedding vector indicative of the articulatory feature of the speaker in response to receiving a speech sample of the speaker as an input (Secondary speech corpus, fig. 1; sec. 3.2; After training the prosodic labeling model, we extracted the feature vectors for all words in f2b…, sec. 4.1, BURNC Secondary speech corpus as including received speech sample of speaker f2b, prosodic feature vector extracted from combined neural network prosodic labeling model of CNN and DNN).
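            Wang does not explicitly disclose receiving an input text, or generating output speech data for the input text reflecting the articulatory feature of the speaker by inputting the input text and the embedding vector indicative of the articulatory feature of the speaker directly to the artificial neural network text-to-speech synthesis model, the embedding vector being generated independently of the input text.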
            However, these features are taught by Edrenkin:
            receiving an input text (para. [0110]);
           generating output speech data for the input text reflecting the articulatory feature of the speaker by inputting the input text and the embedding vector indicative of the articulatory feature of the speaker directly to the artificial neural network text-to-speech synthesis model, the embedding vector being generated independently of the input text (fig. 3; fig. 4; para. [0046]; para. [0051]; the input into the dnn 330 is the training data (not depicted), and the output from the dnn 330 is the acoustic space model 340…, para. [0107]; One or more speech attribute 420 may be selected and received. Speech attribute 420 is not particularly limited and may correspond, for example, to an emotion (angry, happy, sad, etc.), the gender of the speaker, an accent, an intonation, a dynamic, a speaker identity, a speaking style…, para. [0113]; Each speech attribute 326 has a selected attribute weight…The selected attribute weight defines the weight of the speech attribute desired in the synthetic speech 440. The weight is applied for each speech attribute 326, the outputted synthetic speech 440 having a weighted sum of speech attributes…, para. [0114]-[0115]; the text 410 and the one or more speech attribute 420 are inputted into the acoustic space model 340…, para. [0117]),        
directly to the artificial neural network text-to-speech synthesis model, the embedding vector being generated independently of the input text”, because such combination would have resulted in synthesizing new audio output based on existing audio samples and desired voice characteristics (Edrenkin, para. [0023]; para. [0125]).
          Per Claim 3, Wang in view of Edrenkin discloses the text-to-speech synthesis method of claim 1, 
             Wang discloses wherein the embedding vector indicative of the articulatory feature of the speaker includes a prosody sub-embedding vector indicative of a prosody feature of the speaker, wherein the prosody feature includes at least one of information on utterance speed, information on accentuation, information on pause duration, or information on voice pitch (fig. 1; sec. 3.2; The utilized prosodic labeling model predicts prosodic tags at the word level. The output targets are the pitch accents…, sec. 3.2.1; sec. 4.1), and
            Edrenkin discloses: generating the output speech data for the input text reflecting the articulatory feature of the speaker comprises generating output speech data for the input text reflecting the prosody feature of the speaker by inputting the prosody sub-embedding vector indicative of the prosody feature of the speaker to the artificial neural network text-to-speech synthesis model (fig. 3; fig. 4; para. [0046]; para. 
           Per Claim 4, Wang in view of Edrenkin discloses the text-to-speech synthesis method of claim 1, 
             Wang discloses wherein the embedding vector indicative of the articulatory feature of the speaker includes an emotion sub-embedding vector indicative of an emotion feature of the speaker, wherein the emotion feature includes information on an emotion implied in what the speaker utters (sec. 2; sec. 3.2; Among possible ways to extract the prosodic features…The utilized prosodic labeling model predicts prosodic tags at the word level… This prosodic feature vector can be extracted for each word in the secondary speech corpus…, sec. 3.2.1; sec. 4.1, extracted features of multiple words as implying multiple features, prosody as modelling/reflecting emotion in speech); and
             Edrenkin discloses: generating the output speech data for the input text reflecting the articulatory feature of the speaker comprises generating output speech data for the input text reflecting the emotion feature of the speaker by inputting the emotion sub-embedding vector indicative of the emotion feature of the speaker to the artificial neural network text-to-speech synthesis model (fig. 3; fig. 4; para. [0046]; para. [0051]; para. [0107]; One or more speech attribute 420 may be selected and received. 
           Per Claim 5, Wang in view of Edrenkin discloses the text-to-speech synthesis method of claim 1, 
               Wang discloses wherein the embedding vector indicative of the articulatory feature of the speaker includes a voice tone and pitch sub-embedding vector indicative of a feature related to a voice tone and pitch of the speaker (sec. 2; sec. 3.2; Among possible ways to extract the prosodic features…The utilized prosodic labeling model predicts prosodic tags at the word level. The output targets are the pitch accents defined in Tones and Break Indices (ToBI)…This prosodic feature vector can be extracted for each word in the secondary speech corpus…, sec. 3.2.1; The acoustic matrices of the central word and its two neighbors were fed as inputs into the prosodic labeling model…, sec. 4.1, extracted features of multiple words as implying multiple features), and
          Edrenkin discloses: generating the output speech data for the input text reflecting the articulatory feature of the speaker comprises generating output speech data for the input text reflecting the feature related to the voice tone and pitch of the speaker by inputting the voice tone and pitch sub-embedding vector indicative of the feature related to the voice tone and pitch of the speaker to the artificial neural network text-to-speech synthesis model (fig. 3; fig. 4; para. [0046]; para. [0051]; para. [0107]; One or more intonation as representing variation in pitch).
        Per Claim 6, Wang in view of Edrenkin discloses the text-to-speech synthesis method of claim 1, 
              Edrenkin discloses:
             receiving an additional input for the output speech data (Abstract; para. [0012]; a vocoder can be used to synthesize a new audio output based on an existing audio sample by adding the characteristic elements to the existing audio sample…“Vocoder features” refer to the characteristic elements of an audio sample…, para. [0023]; para. [0110]; para. [0113]; para. [0116]; The synthetic speech 440 has perceivable characteristics 430. The perceivable characteristics 430 correspond to vocoder or audio features of the synthetic speech 440 that are perceived as corresponding to the selected speech attribute(s)…, para. [0117]);
             modifying the embedding vector indicative of the articulatory feature of the speaker based on the additional input (para. [0116]; para. [0117], adding the perceivable/vocoder characteristics corresponding to the speech attributes to the audio that includes the initial speech attributes as suggesting limitation); and
            converting the output speech data into speech data for the input text reflecting information included in the additional input by inputting the modified embedding vector 
           Per Claim 7, Wang in view of Edrenkin discloses the text-to-speech synthesis method of claim 6, 
                Edrenkin discloses wherein the information included in the additional input for the output speech data comprises at least one of gender information, age information, regional accent information, articulation speed information, voice pitch information, or articulation level information (para. [0012]; para. [0023]; para. [0059]; para. [0117]).
         Per Claim 10, Wang in view of Edrenkin discloses the machine learning of claim 1 
              Wang discloses performing operations of the text-to-speech synthesis method using the machine learning of claim 1 (sec. 1; fig. 1; sec. 3.2; sec. 4.1; sec. 4.2)
              Wang does not explicitly disclose a non-transitory computer-readable storage medium having a program recorded thereon, the program comprising instructions of performing the operations 
             However, it would have been obvious to one of ordinary skill in the art before the effective filing of the invention to implement the use of a non-transitory computer-readable storage medium having a program recorded thereon, the program comprising instructions of performing the operations with the suggestion/motivation of preventing a complete overhaul/update of the existing system when new data is available for .       

2.      Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over Wang in view of Edrenkin as applied to claim 1 above, and further in view of Graham, US PGPUB 2016/0140952 A1 (“Graham”).
           Per Claim 8, Wang in view of Edrenkin discloses the text-to-speech synthesis method of claim 1, 
               Wang discloses wherein receiving the embedding vector indicative of the articulatory feature of the speaker comprises receiving a speech input from the speaker (sec. 4.1) and
               extracting the embedding vector indicative of the articulatory feature of the speaker from the speech sample of the speaker (fig. 1; sec. 3.2; sec. 4.1)
              Wang in view of Edrenkin does not explicitly disclose wherein receiving the speech sample comprises receiving a speech input from the speaker within a predetermined time period as the speech sample of the speaker in real time
              However, this feature is taught by Graham (para. [0016]; para. [0023]; para. [0035])
            It would have been obvious to one of ordinary skill in the art before the effective filing of the invention to combine the teachings of Graham in the method of Wang in view of Edrenkin in arriving at “wherein receiving the speech sample comprises receiving a speech input from the speaker within a predetermined time period as the .

3.      Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Wang in view of Edrenkin as applied to claim 1 above, and further in view of Graham and Spais et al., “An enhanced pitch modeling supporting a Greek Text to Speech system” (“Spais”).
          Per Claim 9, Wang in view of Edrenkin discloses the text-to-speech synthesis method of claim 1, 
              Wang discloses: receiving a speech input from the speaker (sec. 3.2.1; sec. 4.1)
              extracting, by the learned artificial neural network articulatory feature extraction model, the embedding vector indicative of the articulatory feature of the speaker from the speech sample of the speaker (fig. 1; sec. 3.2; sec. 4.1)
              Wang does not explicitly disclose receiving a speech input from the speaker within a predetermined time period as the speech sample of the speaker in real time
              However, this feature is taught by Graham (para. [0016]; para. [0023]; para. [0035])
              Wang in view of Graham does not explicitly disclose storing the extracted embedding vector in a database or receiving the embedding vector indicative of the articulatory feature of the speaker comprises receiving the embedding vector indicative of the articulatory feature of the speaker from the database
            However, these features are taught by Spais:

           receiving the embedding vector indicative of the articulatory feature of the speaker comprises receiving the embedding vector indicative of the articulatory feature of the speaker from the database (sec. 3; sec. 3.1; In order to embody speech with the appropriate audio characteristics (such as pitch), the system activates the model…The next step, which is the most crucial, is to select prosody vector P(i) from the training database which is most similar to the input vector…, sec. 3.3)
          It would have been obvious to one of ordinary skill in the art before the effective filing of the invention to combine the teachings of Graham in the method of Wang in arriving at “receiving a speech input from the speaker within a predetermined time period as the speech sample of the speaker in real time”, as well as to combine the teachings of Spais with the method of Wang in arriving at the missing features of Wang, because such combinations would have resulted in simplifying collection of speech data while improving the realism of the synthesized speech for a better user experience (Graham, para. [0005]; para. [0016]) as well as correlating the speech corpus (Wang’s BURNC) with the prosodic features, thereby describing the variability of pitch contours for each spoken sentence, while embodying synthesized speech with the appropriate audio characteristics (Spais, sec. 3.1; sec. 3.3).

4.      Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Wang in view of Edrenkin as applied to claim 1 above, and further in view of Ostendorf “The 
             Per Claim 11, Wang in view of Edrenkin discloses the text-to-speech synthesis method of claim 1, further comprising:
               Wang discloses: receiving a selection of the speaker among a plurality of speakers (The Boston University Radio News Corpus (BURNC) [31] was used as the secondary corpus to train the post-filter. The speech data of speaker f2b was used…, sec. 4.1, BURNC corpus as including a plurality of speakers); and
            receiving the embedding vector indicative of the articulatory feature of the speaker (fig. 1; sec. 3.2; sec. 4.1).
           Wang in view of Edrenkin does not explicitly disclose each of the plurality of speakers having different articulatory features
           However, this feature is taught by Ostendorf, which discloses the Boston University Radio News Corpus (BURNC) that includes speakers F1A, F2B, F3A, M1B, M2B, M3B and M4B with varying prosodic patterns (sec. 2, Table 1; we have recorded six of the announcers reading the same four type-B news stories in our laboratory, referred to as the lab news portion of the corpus. The multiple versions of each story provide insight into the amount of variability in prosodic patterns across speakers that is acceptable for a given sentence…, pg. 5, sec. 2).
           Wang in view of Edrenkin and Ostendorf does not explicitly disclose receiving the embedding vector indicative of the articulatory feature of the speaker in response to the selection of the speaker
           However, this feature is taught by Sakai, which describes receiving the data indicative of the articulatory feature of the speaker in response to the selection of the speaker (para. [0017]; para. [0051]).
         It would have been obvious to one of ordinary skill in the art to combine the teachings of Ostendorf with the method of Wang in arriving at the missing feature, because such combination would have resulted in providing a useful way of analyzing the relationship between duration changes and different prosodic markers in prosodic labeling (Ostendorf, pg. 6, sec. 2). Furthermore, it would have been obvious to substitute the prosody data of Sakai with the embedding vector indicative of the articulatory feature as described by Wang in arriving at “receiving the embedding vector indicative of the articulatory feature of the speaker in response to the selection of the speaker” for the predictable result of providing a way to facilitate processing and analysis of the data, as well as providing speech data in the voice of a desired speaker (Sakai, para. [0009]).

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. See PTO 892 form.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to OLUJIMI A ADESANYA whose telephone number is (571)270-3307.  The examiner can normally be reached on Monday-Friday 8:30-5:00pm.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Richemond Dorvil can be reached on 571-272-7602.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.


/OLUJIMI A ADESANYA/Primary Examiner, Art Unit 2658