DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 7 and 8 are rejected under 35 U.S.C. 101 because the claims are drawn to a "program" per se as recited in the preamble ("A program for causing a computer to function as" can be communication media as defined in the disclosure) and as such are directed to non-statutory subject matter. See MPEP § 2106.IV.B.1.a. Data structures not claimed as embodied in computer readable media are descriptive material per se and are not statutory because they are not capable of causing functional change in the computer. See, e.g., Warmerdam, 33 F.3d at 1361, 31 USPQ2d at 1760 (claim to a data structure per se held nonstatutory). Such claimed data structures do not define any structural and functional interrelationships between the data structure and other claimed aspects of the invention, which permit the data structure's functionality to be realized. In contrast, a claimed computer readable medium encoded with a data structure defines structural and functional interrelationships between the data structure and the computer software and hardware components which permit the data structure's functionality to be realized, and is thus statutory. Similarly, computer programs claimed as computer listings per se, i.e., the descriptions or expressions of the programs, are not physical "things." They are neither computer components nor statutory processes, as they are not "acts" being performed. Such claimed computer programs do not define any structural and functional interrelationships between the computer program and other claimed elements of a computer, which permit the computer program's functionality to be realized.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1, 3-5 and 7 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Yamagishi et al. (WO 2019044401 A1).

Claims 1 and 7,
Yamagishi teaches an acoustic model learning device for obtaining an acoustic model used to synthesize voice signals with intonation, comprising ([pgs. 3 & 11] a device for outputting a synthesized speech of an unknown speaker corresponding to an input text by using an acoustic model of multiple speakers represented by a deep neural network): a first learning unit that learns the acoustic model to estimate a plurality of synthetic acoustic feature values using a voice determination model ([pgs. 3 & 11] generating synthesized acoustic features using a deep neural network) and a speaker determination model based on a plurality of acoustic feature values of a plurality of speakers, a plurality of language feature values corresponding to the plurality of acoustic feature values and a plurality of speaker data items; a second learning unit that learns the voice determination model to determine whether the synthetic acoustic feature value is a predetermined acoustic feature value or not based on the plurality of acoustic feature values and the plurality of synthetic acoustic feature values; and a third learning unit that learns the speaker determination model to determine whether the speaker of the synthetic acoustic feature value is a predetermined speaker or not based on the plurality of acoustic feature values and the plurality of synthetic acoustic feature values ([pgs. 3 & 11] the synthesis part 200 uses the multi-speaker acoustic model (DNN) 230 to determine the unknown corresponding to the input text according to the input speaker information of the unknown speaker; it functions as a speech synthesizer that changes the speaker's synthesized speech; the text analysis unit 210 generates the language feature of the input text by analyzing the input text; the linguistic feature quantities of the input text generated by the text analysis unit 210 are input to the synthetic acoustic feature quantity generation unit 220; the speaker information of the unknown speaker is input to the synthetic acoustic feature quantity generation unit 220; the input speaker information of the unknown speaker includes a speaker code representing, in probability, the similarity between the distribution of the acoustic feature of the unknown speaker and the distribution of the acoustic features of each of the plurality of known speakers (determining whether the speaker is a predetermined speaker or not); the input speaker information of the unknown speaker is estimated; the synthetic acoustic feature quantity generation unit 220 generates a synthetic acoustic feature quantity of the unknown speaker based on the input language feature quantity of the text and the input speaker information of the unknown speaker; the learning of the multi-speaker acoustic model (DNN) 230 includes learning of speaker information of the known speaker and/or learning of acoustic features of the known speaker).

Claim 3,
Yamagishi further teaches the acoustic model learning device according to claim, wherein the voice determination model and the speaker determination model are optimized simultaneously ([pg. 3] the similarity vector is defined such that the kth element of the similarity vector represents the similarity between the distribution of the acoustic feature of the unknown speaker and the distribution of the acoustic feature of the kth known speaker (where k = 1, 2, 3, 4, 5); for example, the similarity vector is expressed as (0.8, 0.05, 0.05, 0.05, 0.05); examples of the acoustic feature value include, but are not limited to, mel frequency cepstrum coefficients (MFCC) and/or voice height (fundamental frequency)).

Claim 4,
Yamagishi further teaches the acoustic model learning device according to claim 1, further comprising a data amount control unit that makes differences in data amounts generated among the plurality of speakers uniform ([pg. 3] the probability of expressing the similarity between the distribution of the acoustic feature of the unknown speaker and the distribution of the acoustic features of the five known speakers is calculated, and the five-dimensional similarity vector is used as the speaker code).
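
For illustration only, a five-dimensional speaker code of the kind described in the cited passage may be sketched as follows. The cited passage does not state how the probabilities are computed, so the scoring rule here (a softmax over negative squared distances between mean feature vectors) is an assumption, and the names speaker_code, unknown_mean and known_means are hypothetical.

```python
import math


def speaker_code(unknown_mean, known_means):
    """Return a probability vector with one element per known speaker.

    ASSUMPTION: similarity is scored as the negative squared distance
    between mean acoustic feature vectors and normalized with a softmax.
    The cited reference does not specify this computation.
    """
    scores = [
        -sum((u - k) ** 2 for u, k in zip(unknown_mean, known_mean))
        for known_mean in known_means
    ]
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical mean feature vectors for five known speakers.
known = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
code = speaker_code([0.1, 0.0], known)
```

Under this assumption the elements of the vector sum to one, and the largest element corresponds to the known speaker whose mean features are closest to those of the unknown speaker.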

Claim 5,
Yamagishi teaches a voice synthesis device that synthesizes voice data including features of a desired speaker, comprising: a synthesis unit that synthesizes, from text data and speaker data representing the desired speaker, synthetic voice data, which is voice data corresponding to the text data and including the features of the desired speaker, wherein the synthetic voice data is determined by a predetermined determiner to be a natural sound and a voice uttered by the desired speaker ([pgs. 3 & 11] a device for outputting a synthesized speech of an unknown speaker corresponding to an input text by using an acoustic model of multiple speakers represented by a deep neural network; generating synthesized acoustic features using a deep neural network; the synthesis part 200 uses the multi-speaker acoustic model (DNN) 230 to determine the unknown corresponding to the input text according to the input speaker information of the unknown speaker; it functions as a speech synthesizer that changes the speaker's synthesized speech; the text analysis unit 210 generates the language feature of the input text by analyzing the input text; the linguistic feature quantities of the input text generated by the text analysis unit 210 are input to the synthetic acoustic feature quantity generation unit 220; the speaker information of the unknown speaker is input to the synthetic acoustic feature quantity generation unit 220; the input speaker information of the unknown speaker includes a speaker code representing, in probability, the similarity between the distribution of the acoustic feature of the unknown speaker and the distribution of the acoustic features of each of the plurality of known speakers (determining whether the speaker is a predetermined speaker or not); the input speaker information of the unknown speaker is estimated; the synthetic acoustic feature quantity generation unit 220 generates a synthetic acoustic feature quantity of the unknown speaker based on the input language feature quantity of the text and the input speaker information of the unknown speaker; the learning of the multi-speaker acoustic model (DNN) 230 includes learning of speaker information of the known speaker and/or learning of acoustic features of the known speaker).

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim 2 is rejected under 35 U.S.C. 103 as being unpatentable over Yamagishi et al. (WO 2019044401 A1) and further in view of Zhou (CN 107945786 B).

Claim 2,
Yamagishi teaches all the limitations in claim 1. The difference between the prior art and the claimed invention is that Yamagishi does not explicitly teach wherein the first learning unit learns the acoustic model so as to minimize a loss function of the acoustic model, the second learning unit learns the voice determination model so as to minimize a loss function of the voice determination model, and the third learning unit learns the speaker determination model so as to minimize a loss function of the speaker determination model.
Zhou teaches wherein the first learning unit learns the acoustic model so as to minimize a loss function of the acoustic model, the second learning unit learns the voice determination model so as to minimize a loss function of the voice determination model, and the third learning unit learns the speaker determination model so as to minimize a loss function of the speaker determination model ([pgs. 6-7 last para. & 8-9 last para.] the electronic device can pre-store a cost function, wherein the cost function can include a target cost function and a connection cost function; the target cost function can be used for characterizing the matching degree of the voice waveform unit and the acoustic feature; the connection cost function can be used for characterizing the degree of continuity of adjacent voice waveform units; the target cost function and the connection cost function can be established based on the Euclidean distance function; the smaller the value of the target cost function, the better the voice waveform unit matches the acoustic feature; the smaller the value of the connection cost function, the higher the degree of continuity of the adjacent voice waveform units).
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Yamagishi with teachings of Zhou by modifying the unsupervised speaker adaptation of DNN speech synthesis as taught by Yamagishi to include wherein the first learning unit learns the acoustic model so as to minimize a loss function of the acoustic model, the second learning unit learns the voice determination model so as to minimize a loss function of the voice determination model, and the third learning unit learns the speaker determination model so as to minimize a loss function of the speaker determination model as taught by Zhou for the benefit of improving the voice synthesis effect and voice synthesis efficiency (Zhou [pg. 4]).
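
For illustration only, the target cost and connection cost described in the cited passage of Zhou, both established on the Euclidean distance, may be sketched as follows. The function names, the dictionary keys ("features", "head", "tail") and the unweighted sum in sequence_cost are hypothetical and are not taken from Zhou.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def target_cost(unit_features, acoustic_features):
    """Smaller value: the waveform unit better matches the acoustic feature."""
    return euclidean(unit_features, acoustic_features)


def connection_cost(previous_tail, next_head):
    """Smaller value: adjacent waveform units are more continuous."""
    return euclidean(previous_tail, next_head)


def sequence_cost(units, targets):
    """ASSUMPTION: total cost is an unweighted sum of both cost types."""
    total = sum(target_cost(u["features"], t) for u, t in zip(units, targets))
    for previous, following in zip(units, units[1:]):
        total += connection_cost(previous["tail"], following["head"])
    return total


# Hypothetical candidate units and target acoustic features.
units = [
    {"features": [0.0, 0.0], "head": [0.0], "tail": [1.0]},
    {"features": [3.0, 4.0], "head": [1.0], "tail": [2.0]},
]
targets = [[0.0, 0.0], [0.0, 0.0]]
cost = sequence_cost(units, targets)
```

In this sketch a candidate sequence with a lower total cost would be preferred, consistent with the cited statement that smaller cost values indicate a better match and higher continuity.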

Claims 6 and 8 are rejected under 35 U.S.C. 103 as being unpatentable over Yamagishi et al. (WO 2019044401 A1) and further in view of Zhu (CN 108615524 A).

Claims 6 and 8,
Yamagishi teaches an acoustic model learning device for obtaining an acoustic model used to synthesize voice signals with intonation, comprising: a first learning unit that learns the acoustic model to estimate a plurality of synthetic acoustic feature values using a voice determination model and a speaker determination model based on a plurality of acoustic feature values, a plurality of language feature values corresponding to the plurality of acoustic feature values and a plurality of speaker data items; a second learning unit that learns the voice determination model to determine whether the synthetic acoustic feature value is a predetermined acoustic feature value or not based on the plurality of acoustic feature values and the plurality of synthetic acoustic feature values; and a third learning unit that learns the speaker determination model to determine whether the speaker of the synthetic acoustic feature value is an acoustic feature value representing a predetermined speaker or not based on the plurality of acoustic feature values and the plurality of synthetic acoustic feature values ([pgs. 3 & 11] a device for outputting a synthesized speech of an unknown speaker corresponding to an input text by using an acoustic model of multiple speakers represented by a deep neural network; generating synthesized acoustic features using a deep neural network; the synthesis part 200 uses the multi-speaker acoustic model (DNN) 230 to determine the unknown corresponding to the input text according to the input speaker information of the unknown speaker; it functions as a speech synthesizer that changes the speaker's synthesized speech; the text analysis unit 210 generates the language feature of the input text by analyzing the input text; the linguistic feature quantities of the input text generated by the text analysis unit 210 are input to the synthetic acoustic feature quantity generation unit 220; the speaker information of the unknown speaker is input to the synthetic acoustic feature quantity generation unit 220; the input speaker information of the unknown speaker includes a speaker code representing, in probability, the similarity between the distribution of the acoustic feature of the unknown speaker and the distribution of the acoustic features of each of the plurality of known speakers (determining whether the speaker is a predetermined speaker or not); the input speaker information of the unknown speaker is estimated; the synthetic acoustic feature quantity generation unit 220 generates a synthetic acoustic feature quantity of the unknown speaker based on the input language feature quantity of the text and the input speaker information of the unknown speaker; the learning of the multi-speaker acoustic model (DNN) 230 includes learning of speaker information of the known speaker and/or learning of acoustic features of the known speaker).
The difference between the prior art and the claimed invention is that Yamagishi does not explicitly teach an emotional determination model.
Zhu teaches an emotional determination model ([Fig. 5] [pgs. 8-9] emotion analysis module 101 for obtaining text data and the clause extracting mood characteristic words, and analyzing each sentence according to the tone characteristic words of emotional attribute; a speech synthesis module 102 for basic voice data synthesis each sentence according to the emotional attribute of each sentence based on the preset voice database and preset voice pronunciation model).
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Yamagishi with teachings of Zhu by modifying the unsupervised speaker adaptation of DNN speech synthesis as taught by Yamagishi to include an emotional determination model as taught by Zhu for the benefit of improving the quality of voice synthesis data (Zhu [Abstract]).

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
CN 104112444 B - The invention relates to a text message based waveform concatenation speech synthesis method. The text message based waveform concatenation speech synthesis method includes steps of S1, extracting acoustic parameters and text parameters of all elements in an original voice frequency through segment cutting, and training a duration prediction model and a weight prediction model according to extracted parameters; S2, using a layered pre-selection method to primarily pre-select the elements in a corpus to obtain candidate elements by means of a target element of text analysis and a duration predicted by the duration prediction model; S3, calculating the target element, the candidate elements, and weight information predicted by the weight prediction model to obtain a target cost; calculating integrating degrees of two adjacent elements to obtain a concatenation cost; using a Viterbi searching method to search the target cost and the concatenation cost to obtain a minimum cost path so as to further obtain an optimum element and obtain synthesis speeches through smooth concatenation.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHREYANS A PATEL whose telephone number is (571)270-0689. The examiner can normally be reached Monday-Friday 8am-5pm PST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Bhavesh Mehta, can be reached at 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

SHREYANS A. PATEL
Examiner
Art Unit 2657



/SHREYANS A PATEL/Examiner, Art Unit 2656