DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
Applicant's arguments with respect to the 35 U.S.C. 112 rejection of claim 1 have been considered and are persuasive in view of the amendments; the rejection has been withdrawn.
Applicant's arguments with respect to the 35 U.S.C. 102 rejection of claims 1-20 have been considered but are moot in view of the new grounds of rejection necessitated by the amendments. See the detailed rejection below.
New claims 21-22 are rejected under 35 U.S.C. 103 as being unpatentable over Iwase et al. (US 2020/0051545) in view of Chicote et al. (US 2021/0097976). See the detailed rejection below.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-16, 18-19 and 21-22 are rejected under 35 U.S.C. 103 as being unpatentable over Iwase et al. (US 2020/0051545) in view of Chicote et al. (US 2021/0097976).

Claim 1,
Iwase teaches a method comprising: receiving, by an encoder that comprises a neural network, a plurality of audio samples of a user; generating, by the encoder according to the plurality of audio samples, a sequence of values corresponding to speech features of the plurality of audio samples; receiving, by a decoder from the encoder, the sequence of values; and establishing, by the decoder using the sequence of values and one or more speaker embeddings of the user, a voice model for generating a synthetic audio output resembling a vocal output of the user ([Fig. 2] [0065-0068] [0075-0080] the learning device detects speech voice obtained by a daily family conversation or speech voice emitted from a family to the learning device, and on the basis of the detected speech voice, learns voice synthesis data for generating voice resembling the voice of each user by voice synthesis; by learning on the basis of the speech voice of the family, each of voice synthesis data for generating the voice of the father, mother, and child is generated; learning of the voice synthesis data is performed using user speech voice waveform data as data on the speech voice, user speech text obtained by voice recognition of the speech voice and context information indicating a status sensing result when a speech is made; generating a voice synthesis dictionary for each family member; voice synthesis network including a neural network).
The difference between the prior art and the claimed invention is that Iwase does not explicitly teach an encoder and decoder used for speech synthesis.
Chicote teaches an encoder and decoder used for speech synthesis ([Fig. 1] [0019] speech model that includes encoder and decoder for speech synthesis).
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the learning device and voice synthesis method taught by Iwase to include an encoder and decoder for speech synthesis as taught by Chicote, for the benefit of higher audio quality in speech synthesis and improved human-computer interaction (Chicote [0001] [0016]).

Claims 10 and 18 contain subject matter similar to claim 1, and thus are rejected under similar rationale.

Claim 2,
Iwase further teaches the method of claim 1, further comprising: establishing the voice model as a machine learning model; and generating, using the voice model and input text, the synthetic audio output resembling the user reciting the input text ([0065] [0077] a learning device 1 detects speech voice of each member of a family as a speech user, and sequentially and automatically learns the voice synthesis dictionary by means of user speech voice waveform data and user speech text as a voice recognition result; the learning device learns voice synthesis data for generating voice resembling the voice of each user by voice synthesis).

Claims 11 and 19 contain subject matter similar to claim 2, and thus are rejected under similar rationale.

Claim 3,
Iwase further teaches the method of claim 1, further comprising: receiving input text from the user; determining the user that provided the input text; identifying the voice model for the determined user and at least one of the one or more speaker embeddings of the determined user; and converting, using the voice model and the at least one of the one or more speaker embeddings, the input text to the synthetic audio output resembling the user reciting the input text ([0252-0253] at step S121, the voice synthesis control unit 57 performs the natural language processing and the semantic analysis processing, and analyzes the system speech text; at step S122, the voice synthesis control unit 57 determines the speaker ID to be used for the system speech (i.e., determines the user as the speaker); the determination of the speaker ID is, for example, performed using the contents of the context information, the contents of the system speech text, and the user relationship data).

Claim 12 contains subject matter similar to claim 3, and thus is rejected under similar rationale.

Claim 4,
Iwase further teaches the method of claim 1, further comprising training, by the decoder, the voice model for the user using the one or more speaker embeddings and one or more subsequent audio samples of the user ([0078] the learning device 1 uses a sensing result obtained by sensing of surrounding statuses to specify which member of the family is the speech user, and for each user, generates voice synthesis dictionaries for voice with different voice qualities and tones; the learning device 1 uses the sensing result to detect the statuses such as a speech user's emotion, noise, and a speech destination, and for each status, generates dictionaries for voice with different voice qualities and tones).

Claim 13 contains subject matter similar to claim 4, and thus is rejected under similar rationale.

Claim 5,
Iwase further teaches the method of claim 1, further comprising: applying, by the decoder, the one or more speaker embeddings of the user to the sequence of values to generate a conditioning signal; and providing, by the decoder, the conditioning signal to a plurality of residual layers to establish the voice model ([0079] a plurality of dictionaries storing data on prosody and a phoneme piece of voice of each user in each status is generated as the voice synthesis dictionaries).

Claim 14 contains subject matter similar to claim 5, and thus is rejected under similar rationale.

Claim 6,
Chicote further teaches the method of claim 1, further comprising modifying, by the encoder, a sampling rate of the plurality of audio samples via at least one convolutional layer of at least one convolutional block of the neural network of the encoder ([0016] the speech model is probabilistic and/or autoregressive; the predictive distribution of each audio sample may be conditioned on previous audio samples; the speech model uses causal convolutions to predict output audio; in some embodiments, the model uses dilated convolutions to generate an output sample using a greater area of input samples than would otherwise be possible; the speech model is trained using a conditioning network that conditions hidden layers of the network using linguistic context features, such as phoneme data).
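For illustration only, the limitation of modifying a sampling rate via a convolutional layer can be pictured as a strided one-dimensional convolution: with a stride of 2, the layer emits one output value for every two input samples. The sketch below is not taken from Chicote or from the application. The function name strided_conv1d, the averaging kernel, and the stride value are hypothetical choices made for the example, and the operation shown is a plain strided convolution rather than the causal or dilated convolutions described in Chicote [0016].

```python
def strided_conv1d(samples, kernel, stride):
    """Unpadded 1-D convolution (cross-correlation form) with a stride.

    A stride greater than 1 yields fewer output values than input
    samples, i.e. a lower effective sampling rate.
    """
    k = len(kernel)
    out = []
    for start in range(0, len(samples) - k + 1, stride):
        window = samples[start:start + k]
        out.append(sum(w * x for w, x in zip(kernel, window)))
    return out


# Hypothetical input: 16 audio samples, averaged pairwise with stride 2.
audio = [float(i) for i in range(16)]
kernel = [0.5, 0.5]
downsampled = strided_conv1d(audio, kernel, stride=2)
```

Here the 16 input samples produce 8 output values, so the output sequence has half the sampling rate of the input.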

Claim 15 contains subject matter similar to claim 6, and thus is rejected under similar rationale.

Claim 7,
Chicote further teaches the method of claim 1, further comprising: providing one or more subsequent audio samples of the user to a first residual layer and a second residual layer of a neural network of the decoder; and providing the one or more speaker embeddings to the first residual layer and the second residual layer of the neural network of the decoder, wherein an output from the first residual layer is provided to an input of the second residual layer to train the voice model for the user ([Fig. 8] [0059-0060] decoder 106 receiving second speech input; multi-layer hidden neural network; the neural network includes an input layer, hidden layers, and output layers; each node in a hidden layer 804 may connect to each node in the next higher layer and next lower layer; each node of the input layer 802 represents a potential input to the neural network and each node of the output layer represents a potential output of the neural network; each connection from one node to another node in the next layer may be associated with a weight or score).

Claim 16 contains subject matter similar to claim 7, and thus is rejected under similar rationale.

Claim 8,
Chicote further teaches the method of claim 1, wherein the decoder includes a neural network that includes at least two fully connected residual layers and a normalization function ([0057-0058] pre-net layers; two fully connected layers of hidden units and normalization layer).
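For illustration only, the claimed arrangement of at least two fully connected residual layers followed by a normalization function might be sketched as below. This is a minimal sketch under stated assumptions, not the structure disclosed in Chicote or in the application: the helper names (dense, residual_fc, layer_norm, decoder_block), the ReLU activation, and the use of layer normalization as the normalization function are all hypothetical choices made for the example.

```python
import math


def dense(x, weights, bias):
    """Fully connected layer: one weight row and one bias per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def residual_fc(x, weights, bias):
    """Fully connected layer with a skip connection (output = x + f(x))."""
    h = relu(dense(x, weights, bias))
    return [xi + hi for xi, hi in zip(x, h)]


def layer_norm(x, eps=1e-5):
    """Assumed normalization function: zero mean, unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def decoder_block(x, params):
    """Two residual fully connected layers, then normalization."""
    (w1, b1), (w2, b2) = params
    h = residual_fc(x, w1, b1)   # first residual layer
    h = residual_fc(h, w2, b2)   # second residual layer takes the first's output
    return layer_norm(h)


# Hypothetical parameters: identity weights and zero biases.
identity = [[1.0, 0.0], [0.0, 1.0]]
zero_bias = [0.0, 0.0]
out = decoder_block([1.0, 3.0], [(identity, zero_bias), (identity, zero_bias)])
```

With these parameters each residual layer doubles its input, and the normalization then rescales the result to approximately [-1, 1].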

Claim 9,
Chicote further teaches the method of claim 1, further comprising applying, by the decoder, a normalization function to an output of a plurality of residual layers of a neural network of the decoder to establish the voice model ([0057-0058] normalizing the post-net hidden layers of the decoder to create a speech model for the user).

Claim 21,
Iwase further teaches the method of claim 1, wherein the speech features include at least one of temporal aspects, rhythm, pitch, tone, or a rate of speech ([0078-0079] emotions, tones, intonation, rhythm).

Claim 22,
Iwase further teaches the method of claim 1, wherein a speaker embedding of the one or more speaker embeddings of the user includes a portion of an audio sample of the plurality of audio samples ([0072] learning of the voice synthesis data in the learning device and voice synthesis in the voice synthesis device are performed considering the statuses at each timing).

Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHREYANS A PATEL whose telephone number is (571)270-0689. The examiner can normally be reached Monday-Friday 8am-5pm PST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached at 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

SHREYANS A. PATEL
Examiner
Art Unit 2657



/SHREYANS A PATEL/               Examiner, Art Unit 2656