DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Drawings
The drawings are objected to because the flow of Figure 14 is incorrect: at step S1130, where the method determines whether correction of the second emotion information vector is needed, the flow should proceed to step S1140, where the vector is corrected, not to step S1170, where the first synthesized speech is output. Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. The figure or figure number of an amended drawing should not be labeled as “amended.” If a drawing figure is to be canceled, the appropriate figure must be removed from the replacement sheet, and where necessary, the remaining figures must be renumbered and appropriate changes made to the brief description of the several views of the drawings for consistency. Additional replacement sheets may be necessary to show the renumbering of the remaining figures. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.


Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claim(s) 1-4, 8-11 and 15 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Bao U.S. PAP 2013/0054244 A1.
Regarding claim 1 Bao teaches a method for generating synthesized speech (method and system for achieving emotional text to speech, see abstract), the method comprising: 
generating first synthesized speech by using text and a first emotion vector configured for the text (generating emotion tag for the text data by a rhythm piece, and achieving TTS to the text data corresponding to the emotion tag, where the emotion tags are expressed as a set of emotion vectors, see abstract; Initial emotion score of the rhythm piece is obtained at step 201, see par. [0032]); 
extracting a second emotion vector included in the first synthesized speech (Final emotion score and final emotion category of the rhythm piece are determined at step 203, see par. [0037]); 
determining whether correction of the second emotion information vector is needed by comparing a loss value calculated by using the first emotion information vector and the second emotion information vector with a preconfigured threshold (flowchart in FIG. 4B corresponds to FIG. 2B: where initial emotion score of the rhythm piece is obtained at step 411; the initial emotion score is adjusted based on context semantic of the rhythm piece at step 413; and the adjusted initial emotion score is returned at step 415. The content of steps 411, 413 are similar to steps 211, 213. In the embodiment shown in FIG. 3, the step of performing emotion smoothing on the text data based on emotion tag of the rhythm piece is with the step of determining final emotion score and final emotion category of the rhythm piece. In step 415, the initial emotion score in adjusted emotion vector of the rhythm piece i.e. a set of emotion score is returned, rather than using the initial emotion score to determine final emotion score and final emotion category for TTS, see par. [0048]); 
based on the loss value calculated by using the first emotion information vector and the second emotion information vector exceeding a preconfigured threshold, generating a third emotion information vector by correcting the second emotion information vector based on the first emotion information vector and generating second synthesized speech by using the third emotion information vector (Final emotion path of the text data is determined based on the adjacent probability and emotion scores of respective emotion categories at step 503. For example, for sentence "Don't feel embarrassed about crying as it helps you release these sad emotions and become happy", assuming Table 1 has listed emotion tag of that sentence marked in step 303, a total of 6.sup.16 emotion paths can be described based on all adjacent probabilities obtained in step 501. The path with the highest sum of adjacent probability and the highest sum of emotion score can be selected from these emotion paths at step 503 as final emotion path, see par. [0059]); 
and outputting the second synthesized speech, wherein a loss value calculated by using the first emotion information vector and an emotion information vector included in the second synthesized speech is less than the preconfigured threshold (TTS to the text data is achieved according to the emotion tag at step 105. The present invention will use one emotion category for each rhythm piece, instead of using a unified emotion category for one sentence to perform synthesis. When achieving TTS, the present invention considers a degree of each rhythm on each emotion category. The present invention considers the emotion score under each emotion category, in order to realize TTS that is closer to create an actual speech effect. The detailed content will be described below in detail, see par. [0031]).
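For orientation only, the control flow recited in claim 1 (together with the branch of claim 2) can be sketched as follows. This is a minimal sketch under stated assumptions, not the applicant's or Bao's implementation: the callables synthesize, extract and correct, the absolute-difference loss, and the handling of a loss exactly equal to the threshold are all hypothetical, since the claim language does not fix any of them.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def abs_difference_loss(a: Vector, b: Vector) -> float:
    """Assumed loss form: sum of absolute element-wise differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def synthesize_with_correction(
    text: str,
    first_vec: Vector,
    synthesize: Callable[[str, Vector], object],   # hypothetical TTS engine
    extract: Callable[[object], Vector],           # hypothetical emotion extractor
    correct: Callable[[Vector, Vector], Vector],   # hypothetical correction rule
    threshold: float,
    loss: Callable[[Vector, Vector], float] = abs_difference_loss,
) -> Tuple[object, float]:
    """Return the speech that is output and the loss associated with it."""
    first_speech = synthesize(text, first_vec)
    second_vec = extract(first_speech)
    first_loss = loss(first_vec, second_vec)

    # Claim 2 branch: no correction needed, output the first synthesized speech.
    # Treating equality as "no correction" is an assumption of this sketch.
    if first_loss <= threshold:
        return first_speech, first_loss

    # Claim 1 branch: correct the second vector using the first, then resynthesize.
    third_vec = correct(second_vec, first_vec)
    second_speech = synthesize(text, third_vec)
    second_loss = loss(first_vec, extract(second_speech))
    return second_speech, second_loss
```

The sketch performs a single correction pass and does not by itself guarantee that the second loss falls below the threshold; that property depends entirely on the correction rule supplied.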
Regarding claim 2 Bao teaches the method of claim 1, further comprising outputting the first synthesized speech based on the loss value calculated by using the first emotion information vector and the second emotion information vector being less than the preconfigured threshold (highest value in the multiple initial emotion scores can be determined as final emotion score, and emotion category represented by the final emotion score can be taken as final emotion category, see par. [0037]).
Regarding claim 3 Bao teaches the method of claim 1, wherein the loss value calculated by using the first emotion information vector and the second emotion information vector is a value calculated based on a difference between the first emotion information vector and the second emotion information vector (emotion vector adjustment training data is raised from 0.20 to 0.40, and emotion scores of other emotion categories are correspondingly adjusted, see par. [0043]); 
and the loss value calculated by using the first emotion information vector and an emotion information vector included in the second synthesized speech is a value calculated based on a difference between the first emotion information vector and an emotion information vector included in the second synthesized speech (In addition to using the emotion vector adjustment decision tree to adjust the emotion score, the original emotion score can also be adjusted according to a classifier based on the emotion vector adjustment training data, see par. [0044]).
Regarding claim 4 Bao teaches the method of claim 3, wherein the loss value calculated by using the first emotion information vector and an emotion information vector included in the second synthesized speech is 0 (TABLE 3 Friday neutral 0.00 happy 0.90 sad 0.00 moved 0.00 angry 0.10 uneasiness 0.00, see table 3).
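As an illustration of the passage of Bao cited for claim 2 (par. [0037]: the highest initial emotion score is taken as the final emotion score, and its category as the final emotion category), the selection can be sketched using the Table 3 values quoted above. The function name final_emotion and the dictionary layout are assumptions made for this sketch and do not appear in Bao.

```python
from typing import Dict, Tuple


def final_emotion(scores: Dict[str, float]) -> Tuple[str, float]:
    """Pick the category with the highest emotion score (hypothetical helper)."""
    category = max(scores, key=lambda name: scores[name])
    return category, scores[category]


# Emotion vector for the rhythm piece "Friday", as quoted from Bao, Table 3.
table3_friday = {
    "neutral": 0.00,
    "happy": 0.90,
    "sad": 0.00,
    "moved": 0.00,
    "angry": 0.10,
    "uneasiness": 0.00,
}

print(final_emotion(table3_friday))  # prints ('happy', 0.9)
```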

Regarding claim 8 Bao teaches the apparatus for generating synthesized speech (method and system for achieving emotional text to speech, see abstract), the apparatus comprising: 
an input unit receiving text and a first emotion information vector configured for the text (receiving text data, see abstract); 
an output unit outputting synthesized speech (achieving TTS to the text data corresponding to the emotion tag, where the emotion tags are expressed as a set of emotion vectors, see abstract); 
and a processor functionally connected to the input unit and the output unit, wherein the processor is configured to generate first synthesized speech by using text and a first emotion vector configured for the text (generating emotion tag for the text data by a rhythm piece, and achieving TTS to the text data corresponding to the emotion tag, where the emotion tags are expressed as a set of emotion vectors, see abstract; Initial emotion score of the rhythm piece is obtained at step 201, see par. [0032]); 
extract a second emotion vector included in the first synthesized speech (Final emotion score and final emotion category of the rhythm piece are determined at step 203, see par. [0037]); 
determine whether correction of the second emotion information vector is needed by comparing a loss value calculated by using the first emotion information vector and the second emotion information vector with a preconfigured threshold (flowchart in FIG. 4B corresponds to FIG. 2B: where initial emotion score of the rhythm piece is obtained at step 411; the initial emotion score is adjusted based on context semantic of the rhythm piece at step 413; and the adjusted initial emotion score is returned at step 415. The content of steps 411, 413 are similar to steps 211, 213. In the embodiment shown in FIG. 3, the step of performing emotion smoothing on the text data based on emotion tag of the rhythm piece is with the step of determining final emotion score and final emotion category of the rhythm piece. In step 415, the initial emotion score in adjusted emotion vector of the rhythm piece i.e. a set of emotion score is returned, rather than using the initial emotion score to determine final emotion score and final emotion category for TTS, see par. [0048]); 
generate a third emotion information vector by correcting the second emotion information vector based on the first emotion information vector based on the loss value calculated by using the first emotion information vector and the second emotion information vector exceeding a preconfigured threshold (Final emotion path of the text data is determined based on the adjacent probability and emotion scores of respective emotion categories at step 503. For example, for sentence "Don't feel embarrassed about crying as it helps you release these sad emotions and become happy", assuming Table 1 has listed emotion tag of that sentence marked in step 303, a total of 6.sup.16 emotion paths can be described based on all adjacent probabilities obtained in step 501. The path with the highest sum of adjacent probability and the highest sum of emotion score can be selected from these emotion paths at step 503 as final emotion path, see par. [0059]); 
and generate second synthesized speech by using the third emotion information vector, wherein a loss value calculated by using the first emotion information vector and an emotion information vector included in the second synthesized speech is less than the preconfigured threshold, and the synthesized speech is the second synthesized speech (TTS to the text data is achieved according to the emotion tag at step 105. The present invention will use one emotion category for each rhythm piece, instead of using a unified emotion category for one sentence to perform synthesis. When achieving TTS, the present invention considers a degree of each rhythm on each emotion category. The present invention considers the emotion score under each emotion category, in order to realize TTS that is closer to create an actual speech effect. The detailed content will be described below in detail, see par. [0031]).
Regarding claim 9 Bao teaches the apparatus of claim 8, wherein, based on the loss value calculated by using the first emotion information vector and the second emotion information vector being less than the preconfigured threshold, the synthesized speech is the first synthesized speech (highest value in the multiple initial emotion scores can be determined as final emotion score, and emotion category represented by the final emotion score can be taken as final emotion category, see par. [0037]).
Regarding claim 10 Bao teaches the apparatus of claim 8, wherein the loss value calculated by using the first emotion information vector and the second emotion information vector is a value calculated based on a difference between the first emotion information vector and the second emotion information vector (emotion vector adjustment training data is raised from 0.20 to 0.40, and emotion scores of other emotion categories are correspondingly adjusted, see par. [0043]); 
and the loss value calculated by using the first emotion information vector and an emotion information vector included in the second synthesized speech is a value calculated based on a difference between the first emotion information vector and an emotion information vector included in the second synthesized speech (In addition to using the emotion vector adjustment decision tree to adjust the emotion score, the original emotion score can also be adjusted according to a classifier based on the emotion vector adjustment training data, see par. [0044]).
Regarding claim 11 Bao teaches the apparatus of claim 10, wherein the loss value calculated by using the first emotion information vector and an emotion information vector included in the second synthesized speech is 0 (TABLE 3 Friday neutral 0.00 happy 0.90 sad 0.00 moved 0.00 angry 0.10 uneasiness 0.00, see table 3).
Regarding claim 15 Bao teaches an electronic device comprising: 
one or more processors (processor); a memory (computer readable medium); and one or more programs configured to be stored in the memory and to be executed by the one or more processors, the one or more programs including commands for performing the method of claim 1 (computer program instructions may be provided to a processor of a general purpose computer…These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, see par. [0096-0097]).
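The emotion path selection quoted from Bao par. [0059] in the mappings of claims 1 and 8 (choosing, among 6^16 candidate paths, the path with the highest sum of adjacent probability and emotion score) can be illustrated with a dynamic-programming sketch that avoids enumerating every path. This is an assumption-laden illustration rather than Bao's disclosed procedure: combining the two sums into a single additive objective, the function name best_emotion_path, and the data layout are all hypothetical.

```python
from typing import Dict, List, Tuple


def best_emotion_path(
    scores: List[Dict[str, float]],
    adjacency: Dict[Tuple[str, str], float],
) -> List[str]:
    """Return one emotion category per rhythm piece maximizing
    (sum of emotion scores) + (sum of adjacent probabilities).

    scores[i][c] is the score of category c for rhythm piece i;
    adjacency[(p, c)] is the probability of category c following p
    (missing pairs are treated as 0.0, an assumption of this sketch).
    """
    if not scores:
        return []

    best = dict(scores[0])                 # best total ending in each category
    back: List[Dict[str, str]] = []        # back-pointers per rhythm piece

    for piece in scores[1:]:
        new_best: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for cur, score in piece.items():
            prev = max(best, key=lambda p: best[p] + adjacency.get((p, cur), 0.0))
            new_best[cur] = best[prev] + adjacency.get((prev, cur), 0.0) + score
            pointers[cur] = prev
        best = new_best
        back.append(pointers)

    # Trace the best path backwards from the best final category.
    path = [max(best, key=lambda c: best[c])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path
```

With six categories and sixteen rhythm pieces, this visits on the order of 16 × 6 × 6 transitions instead of all 6^16 paths.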
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 5 and 12 is/are rejected under 35 U.S.C. 103 as being unpatentable over Bao U.S. PAP 2013/0054244 A1 in view of Jansche U.S. Patent No. 8,321,225 B1.
Regarding claim 5 Bao does not teach the method of claim 1, wherein the loss value calculated by using the first emotion information vector and the second emotion information vector is a value calculated based on the square of a difference between the first emotion information vector and the second emotion information vector; and the loss value calculated by using the first emotion information vector and an emotion information vector included in the second synthesized speech is a value calculated based on the square of a difference between the first emotion information vector and an emotion information vector included in the second synthesized speech.
In the same field of endeavor Jansche teaches a computer-implemented method including receiving text to be synthesized as a spoken utterance. The method includes analyzing the received text to determine attributes of the received text and selecting one or more utterances from a database based on a comparison between the attributes of the received text and attributes of text representing the stored utterances, see abstract. Human speech uses prosody in such varied communicative acts as indicating paralinguistic qualities such as emotion. To make synthesized speech as powerful a communication tool as human speech, synthesized speech should at least endeavor to approach human-like prosodic assignment, see col. 1 lines 11-28. Extracting the contours from the utterances can include generating for each contour time-value pairs that each include a measurement of a contour value and a time at which the contour value occurs. The extracted contours can include fundamental frequencies, pitches, energy measurements, gain measurements, duration measurements, intensity measurements, measurements of rate of speech, or spectral tilt measurements. The distances between the contours can be calculated using a root mean square difference calculation. The method can include selecting, based on estimated distances between a plurality of determined contours and an unknown contour of text to be synthesized, a final determined contour associated with a smallest distance. The method can include generating a contour for the text to be synthesized using the final determined contour. The method can include outputting the generated contour and the text to be synthesized to a speech-to-text engine for speech synthesis.
It would have been obvious to one of ordinary skill in the art to combine the Bao invention with the teachings of Jansche for the benefit of synthesizing speech to approach human-like prosodic assignment, see col. 1 lines 11-28.
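To make the two quantities discussed for claim 5 concrete, the following sketch computes a loss based on the square of the difference between two vectors and the root mean square difference mentioned in the Jansche passage. The function names and the choice to sum the squared element-wise differences are assumptions of this sketch; neither the claim nor Jansche, as quoted here, fixes the exact formula.

```python
import math
from typing import Sequence


def squared_difference_loss(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of squared element-wise differences (assumed reading of claim 5)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rms_difference(a: Sequence[float], b: Sequence[float]) -> float:
    """Root mean square difference between two equal-length sequences."""
    if not a:
        raise ValueError("vectors must be non-empty")
    return math.sqrt(squared_difference_loss(a, b) / len(a))


first_vec = [0.5, 0.0]
second_vec = [0.0, 0.5]
print(squared_difference_loss(first_vec, second_vec))  # prints 0.5
print(rms_difference(first_vec, second_vec))           # prints 0.5
```

The two measures differ only by averaging and a square root, so comparing either one against a threshold ranks vector pairs in the same order when the vector length is fixed.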
Regarding claim 12 Bao does not teach the apparatus of claim 8, wherein the loss value calculated by using the first emotion information vector and the second emotion information vector is a value calculated based on the square of a difference between the first emotion information vector and the second emotion information vector; and the loss value calculated by using the first emotion information vector and an emotion information vector included in the second synthesized speech is a value calculated based on the square of a difference between the first emotion information vector and an emotion information vector included in the second synthesized speech.
In the same field of endeavor Jansche teaches a computer-implemented method including receiving text to be synthesized as a spoken utterance. The method includes analyzing the received text to determine attributes of the received text and selecting one or more utterances from a database based on a comparison between the attributes of the received text and attributes of text representing the stored utterances, see abstract. Human speech uses prosody in such varied communicative acts as indicating paralinguistic qualities such as emotion. To make synthesized speech as powerful a communication tool as human speech, synthesized speech should at least endeavor to approach human-like prosodic assignment, see col. 1 lines 11-28. Extracting the contours from the utterances can include generating for each contour time-value pairs that each include a measurement of a contour value and a time at which the contour value occurs. The extracted contours can include fundamental frequencies, pitches, energy measurements, gain measurements, duration measurements, intensity measurements, measurements of rate of speech, or spectral tilt measurements. The distances between the contours can be calculated using a root mean square difference calculation. The method can include selecting, based on estimated distances between a plurality of determined contours and an unknown contour of text to be synthesized, a final determined contour associated with a smallest distance. The method can include generating a contour for the text to be synthesized using the final determined contour. The method can include outputting the generated contour and the text to be synthesized to a speech-to-text engine for speech synthesis.
It would have been obvious to one of ordinary skill in the art to combine the Bao invention with the teachings of Jansche for the benefit of synthesizing speech to approach human-like prosodic assignment, see col. 1 lines 11-28.
Claim(s) 6-7 and 13-14 is/are rejected under 35 U.S.C. 103 as being unpatentable over Bao U.S. PAP 2013/0054244 A1 in view of Bromand U.S. PAP 2019/0318722 A1.
Regarding claim 6 Bao does not teach the method of claim 1, wherein a third emotion information vector is generated by using a deep learning model.
In a similar field of endeavor Bromand teaches systems, methods, and devices for training and testing utterance-based frameworks. The training and testing can be conducted using synthetic utterance samples in addition to natural utterance samples. The synthetic utterance samples can be generated based on a vector space representation of natural utterances. The synthetic voice sample is provided to the utterance-based framework as at least one of a testing or training sample, see abstract. The utterance-based framework 104 includes one or more software or hardware modules that take an action based on an utterance or other sound input. The utterance-based framework 104 is able to take a variety of different forms. In an example, the utterance-based framework 104 detects one or more aspects within speech. For instance, the utterance-based framework 104 is a framework that detects an activation trigger (e.g., “ahoy computer”) within audio input. In other examples, the utterance-based framework 104 provides text-to-speech services, speech-to-text services, speaker identification, intent recognition, emotion detection, or other services. The utterance-based framework 104 is configurable in a variety of ways. In many examples, the utterance-based framework 104 is a machine-learning framework, such as one or more deep-learning frameworks (e.g., neural networks), decision trees, or heuristic-based models, among others, see par. [0039].
It would have been obvious to one of ordinary skill in the art to combine the Bao invention with the teachings of Bromand for the benefit of testing the system using synthetic samples, see abstract.
Regarding claim 7 Bromand teaches the method of claim 6, wherein the deep learning model is a model performing deep learning by using the first emotion information vector, second emotion information vector, and third emotion information vector (synthetic samples are, in turn, stored in synthetic samples store 108 as weight vectors or feature vectors, see par. [0043]).
Regarding claim 13 Bao does not teach the apparatus of claim 8, wherein the third emotion information vector is generated by using a deep learning model.
In a similar field of endeavor Bromand teaches systems, methods, and devices for training and testing utterance-based frameworks. The training and testing can be conducted using synthetic utterance samples in addition to natural utterance samples. The synthetic utterance samples can be generated based on a vector space representation of natural utterances. The synthetic voice sample is provided to the utterance-based framework as at least one of a testing or training sample, see abstract. The utterance-based framework 104 includes one or more software or hardware modules that take an action based on an utterance or other sound input. The utterance-based framework 104 is able to take a variety of different forms. In an example, the utterance-based framework 104 detects one or more aspects within speech. For instance, the utterance-based framework 104 is a framework that detects an activation trigger (e.g., “ahoy computer”) within audio input. In other examples, the utterance-based framework 104 provides text-to-speech services, speech-to-text services, speaker identification, intent recognition, emotion detection, or other services. The utterance-based framework 104 is configurable in a variety of ways. In many examples, the utterance-based framework 104 is a machine-learning framework, such as one or more deep-learning frameworks (e.g., neural networks), decision trees, or heuristic-based models, among others, see par. [0033, 0039].
It would have been obvious to one of ordinary skill in the art to combine the Bao invention with the teachings of Bromand for the benefit of testing the system using synthetic samples, see abstract.

Regarding claim 14 Bromand teaches the apparatus of claim 13, wherein the deep learning model is a model performing deep learning by using the first emotion information vector, second emotion information vector, and third emotion information vector (synthetic samples are, in turn, stored in synthetic samples store 108 as weight vectors or feature vectors, see par. [0043]).
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Pertinent prior art available on form 892.
Takano ‘781 teaches examining communication data of a first human user to return one or more sentiment attribute of the communication data; processing the communication data to return sentiment neutral adapted communication data, the processing being in dependence on the one or more sentiment attribute, see abstract.
Deyle ‘095 teaches receiving input data including at least one of text or speech input data during a given period of time. In response, the sender device may use one or more of the emotion detection modules to analyze input data received during the same period of time to detect emotional information in the input data, which corresponds to the textual or speech input received during the given period of time. The sender device may generate a message data stream that includes both: text generated from the textual or speech input during the given period of time, and emotion data providing emotional information for the same period of time, see abstract.
Matsumoto teaches generating emotional sounds similar to speech by obtaining phonetic features from a speech database and re-training the model using emotional speech, see abstract.
Reddy teaches using Sable markup language for emotional speech story telling, see abstract.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Michael Ortiz-Sanchez whose telephone number is (571)270-3711. The examiner can normally be reached Monday-Friday, 9AM-6PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta can be reached on 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




/MICHAEL ORTIZ-SANCHEZ/Primary Examiner, Art Unit 2656