DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
Applicant's arguments filed 12/09/2021 have been fully considered but they are not persuasive. Regarding the arguments on pages 16-18, the Examiner notes that Nakashika (page 582, section III, first paragraph) teaches encoding and decoding, and the subsequent paragraphs explain the processes involved. The RTRBMs are trained and then used for encoding and decoding based on the input vectors and the source/target information.

Claim Objections
Claim 9 is objected to because of the following informalities:  line 21 reads “according to on a target” which should read “based on a target” or “according to a target”; the third to last line reads “based the” which should read “based on the”.  Appropriate correction is required.
Claims 16 and 23 are objected to because of the following informalities:  the third to last line reads “based the” which should read “based on the”. Further, the verbs of the last 5 limitations end in “ing” when they should not.  Appropriate correction is required.

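Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.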
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 9-28 are rejected under 35 U.S.C. 103 as being unpatentable over Nakashika et al. (Nakashika, T., Takiguchi, T., & Ariki, Y. (2014). Voice conversion using RNN pre-trained by recurrent temporal restricted Boltzmann machines. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(3), 580-587.), hereinafter referred to as Nakashika, in view of Arik et al. (Arik, S. O., Chen, J., Peng, K., Ping, W., & Zhou, Y. (2018). Neural voice cloning with a few samples. arXiv preprint arXiv:1802.06006.), hereinafter referred to as Arik.

Regarding claim 9, Nakashika teaches:
A computer-implemented method for converting aspects of voice, the method comprising: 
receiving conversation source data, wherein the conversation source data includes a plurality of utterances, and wherein each utterance includes a plurality of frames of voice data (Page 584 section IV-A first paragraph, where source and target speaker data is obtained, the training data containing multiple frames); 
generating a series of sound feature vectors based on the received conversation source data, wherein the series of sound feature vectors includes a sound feature vector corresponding to the utterance in the conversation source data (page 582, Section III, first two paragraphs, where feature vectors of MFCCs are determined from the source and target speakers); 
identifying a series of latent vectors based on the received conversation source data, wherein the series of latent vectors includes a latent vector corresponding to the utterance in the conversation source data (page 582-583 Section III, paragraph spanning pages, where latent features are obtained for source and target speakers); 

determining an attribution label associated with the utterance in the conversation source data (page 582 section III second paragraph, where the models are trained for each speaker);
generating an encoder based on training, wherein the encoder upon being trained determines a series of output latent vectors based on a series of input sound vectors and an input attribution label associated with an input utterance (page 582-583 section III first three paragraphs, where SD-RTRBMs are trained for encoding depending on the source and target identities, where the input vectors are projected into latent space), and wherein the training is based on the determined attribution label and first parallel data between the identified series of latent vectors and the generated series of sound feature vectors (page 582-583 section III first three paragraphs, where SD-RTRBMs are trained for encoding depending on the source and target identities, where the input vectors are projected into latent space, indicating parallel data); 
generating a decoder based on training, wherein the decoder upon being trained reconstructs the series of input sound feature vectors of the input utterance according to on a target attribution label (page 582-583 section III first three paragraphs, where SD-RTRBMs are trained for decoding, where target acoustic features are determined from the latent features depending on the source and target identities), wherein the training is based on the determined attribution label and the first parallel data between the identified series of latent vectors and the series of sound feature vectors (page 582-583 section III first three paragraphs, where SD-RTRBMs are trained for decoding, where target acoustic features are determined from the latent features and parallel data, depending on the source and target identities);
receiving the input utterance (Page 584 section IV-A first paragraph, where source and target speaker data is obtained);

receiving the target attribution label (Page 584 section IV-A first paragraph, where the target speaker is selected);
reconstructing, based on a combination of the trained encoder and the trained decoder, the series of input sound feature vectors of the input utterance according to the target attribution label (page 582-583 section III first three paragraphs, where the input voice is converted to a target voice); and
generating a target utterance based the reconstructed series of input sound feature vectors (page 582-583 section III first three paragraphs, where the input voice is converted to a target voice); and
providing the target utterance with the target attribution label (page 582-583 section III first three paragraphs, where the input voice is converted to a target voice).  
Nakashika does not explicitly teach an attribution label.
Arik teaches:
an attribution label (page 17 section E first paragraph, where labels for gender are used)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Nakashika by using the labels of Arik (Arik page 17 section E first paragraph) in the encoding of Nakashika (Nakashika page 582-583 Section III) in order to achieve a high discriminative accuracy using the speaker embeddings (Arik page 17 section E first paragraph).
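
For illustration only, the data flow recited in claim 9 (encode with an input attribution label, decode with a target attribution label) can be sketched as follows. This is a minimal sketch under stated assumptions: the label-dependent offsets, the names LABEL_OFFSETS, encode, decode, and convert, and the two-dimensional feature vectors are all hypothetical stand-ins, and the sketch is neither the SD-RTRBM model of Nakashika nor the model of the present application.

```python
from typing import Dict, List

Vector = List[float]

# Hypothetical per-label offsets standing in for label-dependent model parameters.
LABEL_OFFSETS: Dict[int, Vector] = {
    0: [0.0, 0.0],
    1: [1.0, -0.5],
}


def encode(frames: List[Vector], label: int) -> List[Vector]:
    """Toy encoder: remove the label-dependent offset from each feature vector."""
    offset = LABEL_OFFSETS[label]
    return [[x - o for x, o in zip(frame, offset)] for frame in frames]


def decode(latents: List[Vector], label: int) -> List[Vector]:
    """Toy decoder: re-apply the offset of the requested attribution label."""
    offset = LABEL_OFFSETS[label]
    return [[z + o for z, o in zip(latent, offset)] for latent in latents]


def convert(frames: List[Vector], source_label: int, target_label: int) -> List[Vector]:
    """Reconstruct the input features according to the target attribution label."""
    return decode(encode(frames, source_label), target_label)
```

In this sketch, decoding with the source label returns the input unchanged, while decoding with a different label shifts every frame toward the target attribute.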

Regarding claim 10, Nakashika in view of Arik teaches:
The computer-implemented method of claim 9, the method further comprising: 
generating the encoder and the decoder based on maximizing a value of an objective function (Nakashika page 582 Section III first paragraph, where the parameters are estimated to maximize a probability), 

wherein the objective function at least relates to an error between the reconstructed series of input sound feature vectors and the generated series of sound feature vector as second parallel data (Nakashika page 582-583 section III paragraph spanning pages, where error between the latent vectors is minimized, and where the latent vectors and input are used in determining the output), and 
wherein the objective function further relates to a distance between the determined series of output latent vectors and the identified series of latent vectors as third parallel data (Nakashika page 582-583 section III paragraph spanning pages, where error between the latent vectors is minimized).  
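
For illustration only, an objective function of the kind recited in claim 10 can be sketched as the negative of a reconstruction error plus a weighted latent-vector distance, so that maximizing the objective minimizes both terms. This is a minimal sketch under stated assumptions: the squared-error measure, the latent_weight parameter, and the function names are hypothetical and are not taken from Nakashika, Arik, or the present application.

```python
from typing import List

Vector = List[float]


def squared_error(a: List[Vector], b: List[Vector]) -> float:
    """Sum of squared element-wise differences between two vector series."""
    return sum((x - y) ** 2 for u, v in zip(a, b) for x, y in zip(u, v))


def objective(
    reconstructed: List[Vector],
    features: List[Vector],
    output_latents: List[Vector],
    identified_latents: List[Vector],
    latent_weight: float = 1.0,  # hypothetical weighting between the two terms
) -> float:
    """Value to be maximized: negative reconstruction error and latent distance."""
    reconstruction_term = squared_error(reconstructed, features)
    latent_term = squared_error(output_latents, identified_latents)
    return -(reconstruction_term + latent_weight * latent_term)
```

The objective reaches its maximum of zero when the reconstructed features match the generated features and the output latent vectors match the identified latent vectors.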

Regarding claim 11, Nakashika in view of Arik teaches:
The computer-implemented method of claim 9, wherein each of the encoder and the decoder is configured using one of: a convolutional network or a recurrent network (Nakashika page 583 second column, paragraph after eq. 24, where an RNN is used).  

Regarding claim 12, Nakashika in view of Arik teaches:
The computer-implemented method of claim 9, wherein the attribution label of the utterance in the conversation source data includes one or more of: 
gender of a speaker (Arik page 17 section E first paragraph, where the labels include gender of the speaker), 
a status of the speaker being a native speaker of a language used (Arik page 17 section E first paragraph, where the labels include region of accent of the speaker, indicating a native speaker), 
a type of utterance mood of the speaker (where another limitation is chosen), or 
a style of utterance in lecture or non-lecture (where another limitation is chosen).  


Regarding claim 13, Nakashika in view of Arik teaches:
The computer-implemented method of claim 9, wherein each sound features vector is based on one of: 
a logarithmic amplitude spectrum (where another limitation is chosen); 
a mel-cepstrum coefficient (Nakashika page 582, Section III, first two paragraphs, where feature vectors of MFCCs are used); 
a linear predictive coefficient (where another limitation is chosen); 
a Partial Correlation (PARCOR) coefficient (where another limitation is chosen); or 
a Line Spectral Pair (LSP) parameter (where another limitation is chosen).  

Regarding claim 14, Nakashika in view of Arik teaches:
The computer-implemented method of claim 9, the method further comprising: 
receiving a voice data for a conversion of sound quality (Nakashika Page 584 section IV-A first paragraph, where source and target speaker data is obtained); 
extracting the series of input sound feature vectors from the voice data (Nakashika page 582, Section III, first two paragraphs, where feature vectors of MFCCs are used); 
estimating, using the generated encoder, the series of output latent vectors (Nakashika page 582-583 Section III, paragraph spanning pages, where latent features are obtained); 
estimating, using the generated decoder, a series of target sound feature vectors for a target voice data based at least on the estimated series of output latent vectors and the target attribution label (Nakashika Fig. 2, page 582-583 Section III, paragraph spanning pages, where latent features are used to calculate the output features); 
generating the target voice data based on the estimated series of target sound feature vectors (Nakashika page 583 first column, where the output target vector is generated, and Arik page 3 section 3 last paragraph, where the target audio is generated); and 

providing the generated target voice data as a converted voice data (Nakashika page 583 first column, where the output target vector is generated, and Arik page 3 section 3 last paragraph, where the target audio is generated).  

Regarding claim 15, Nakashika in view of Arik teaches:
The computer-implemented method of claim 14, wherein the generated target voice data relates to converting non-language aspects of the voice data while maintaining utterance sentences in the voice data, and wherein the non-language aspects of the voice include one or more of individuality and an utterance style of a speaker (Nakashika page 580 introduction first paragraph, where voice conversion transforms specific information about speakers while retaining linguistic information, and page 583 first column, where the output target vector is generated, and Arik page 3 section 3 last paragraph, where the target audio is generated).  

Regarding claim 16, Nakashika teaches:
A system for converting aspects of voice, the system comprises:
receive conversation source data, wherein the conversation source data includes a plurality of utterances, and wherein each utterance includes a plurality of frames of voice data (Page 584 section IV-A first paragraph, where source and target speaker data is obtained, the training data containing multiple frames); 
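generate a series of sound feature vectors based on the received conversation source data, wherein the series of sound feature vectors includes a sound feature vector corresponding to the utterance in the conversation source data (page 582, Section III, first two paragraphs, where feature vectors of MFCCs are determined from the source and target speakers); 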
identify a series of latent vectors based on the received conversation source data, wherein the series of latent vectors includes a latent vector corresponding to the utterance in the conversation source data (page 582-583 Section III, paragraph spanning pages, where latent features are obtained for source and target speakers); 
determine an attribution label associated with the utterance in the conversation source data (page 582 section III second paragraph, where the models are trained for each speaker);
generate an encoder based on training, wherein the encoder upon being trained determines a series of output latent vectors based on a series of input sound vectors and an input attribution label associated with an input utterance (page 582-583 section III first three paragraphs, where SD-RTRBMs are trained for encoding depending on the source and target identities, where the input vectors are projected into latent space), and wherein the training is based on the determined attribution label and first parallel data between the identified series of latent vectors and the generated series of sound feature vectors (page 582-583 section III first three paragraphs, where SD-RTRBMs are trained for encoding depending on the source and target identities, where the input vectors are projected into latent space, indicating parallel data); 
generate an decoder based on training, wherein the decoder upon being trained reconstructs the series of input sound feature vectors of the input utterance based on a target attribution label (page 582-583 section III first three paragraphs, where SD-RTRBMs are trained for decoding, where target acoustic features are determined from the latent features depending on the source and target identities), wherein the training is based on the determined attribution label and the first parallel data between the series of latent vectors and the series of sound feature vectors (page 582-583 section III first three paragraphs, where SD-RTRBMs are trained for decoding, where target acoustic features are determined from the latent features and parallel data, depending on the source and target identities);
receiving the input utterance (Page 584 section IV-A first paragraph, where source and target speaker data is obtained);
receiving the target attribution label (Page 584 section IV-A first paragraph, where the target speaker is selected);
reconstructing, based on a combination of the trained encoder and the trained decoder, the series of input sound feature vectors of the input utterance according to the target attribution label (page 582-583 section III first three paragraphs, where the input voice is converted to a target voice); and
generating a target utterance based the reconstructed series of input sound feature vectors (page 582-583 section III first three paragraphs, where the input voice is converted to a target voice); and
providing the target utterance with the target attribution label (page 582-583 section III first three paragraphs, where the input voice is converted to a target voice).  
Nakashika does not explicitly teach an attribution label, a processor, or memory.
Arik teaches:
a processor (Appendix A, where encoding includes processing the data, indicating a processor); and 
a memory (Abstract, where memory is required) storing computer-executable instructions that when executed by the processor cause the system to: 
an attribution label (page 17 section E first paragraph, where labels for gender are used)
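It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Nakashika by using the labels of Arik (Arik page 17 section E first paragraph) in the encoding of Nakashika (Nakashika page 582-583 Section III) in order to achieve a high discriminative accuracy using the speaker embeddings (Arik page 17 section E first paragraph).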

Regarding claim 17, Nakashika in view of Arik teaches:
The system of claim 16, the computer-executable instructions when executed further causing the system to: 
generate the encoder and the decoder based on maximizing a value of an objective function (Nakashika page 582 Section III first paragraph, where the parameters are estimated to maximize a probability), 
wherein the objective function at least relates to an error between the reconstructed series of input sound feature vectors and the generated series of sound feature vector as second parallel data (Nakashika page 582-583 section III paragraph spanning pages, where error between the latent vectors is minimized, and where the latent vectors and input are used in determining the output), and 
wherein the objective function further relates to a distance between the determined series of output latent vectors and the identified series of latent vectors as third parallel data (Nakashika page 582-583 section III paragraph spanning pages, where error between the latent vectors is minimized).  

Regarding claim 18, Nakashika in view of Arik teaches:
The system of claim 16, wherein each of the encoder and the decoder is configured using one of: a convolutional network or a recurrent network (Nakashika page 583 second column, paragraph after eq. 24, where an RNN is used).  

Regarding claim 19, Nakashika in view of Arik teaches:

gender of a speaker (Arik page 17 section E first paragraph, where the labels include gender of the speaker), 
a status of the speaker being a native speaker of a language used (Arik page 17 section E first paragraph, where the labels include region of accent of the speaker, indicating a native speaker), 
a type of utterance mood of the speaker (where another limitation is chosen), or 
a style of utterance in lecture or non-lecture (where another limitation is chosen).  

Regarding claim 20, Nakashika in view of Arik teaches:
The system of claim 16, wherein each sound features vector is based on one of: 
a logarithmic amplitude spectrum (where another limitation is chosen); 
a mel-cepstrum coefficient (Nakashika page 582, Section III, first two paragraphs, where feature vectors of MFCCs are used); 
a linear predictive coefficient (where another limitation is chosen); 
a Partial Correlation (PARCOR) coefficient (where another limitation is chosen); or 
a Line Spectral Pair (LSP) parameter (where another limitation is chosen).  

Regarding claim 21, Nakashika in view of Arik teaches:
The system of claim 16, the computer-executable instructions when executed further causing the system to: 
receive a voice data for conversion (Nakashika Page 584 section IV-A first paragraph, where source and target speaker data is obtained); 
extract the series of input sound feature vectors from the voice data (Nakashika page 582, Section III, first two paragraphs, where feature vectors of MFCCs are used); 

estimate, using the generated encoder, the series of output latent vectors (Nakashika page 582-583 Section III, paragraph spanning pages, where latent features are obtained); 
estimate, using the generated decoder, a series of target sound feature vectors for a target voice data based at least on the estimated series of output latent vectors and the target attribution label (Nakashika Fig. 2, page 582-583 Section III, paragraph spanning pages, where latent features are used to calculate the output features); 
generate the target voice data based on the estimated series of target sound feature vectors (Nakashika page 583 first column, where the output target vector is generated, and Arik page 3 section 3 last paragraph, where the target audio is generated); and 
provide the generated target voice data as a converted voice data (Nakashika page 583 first column, where the output target vector is generated, and Arik page 3 section 3 last paragraph, where the target audio is generated).  

Regarding claim 22, Nakashika in view of Arik teaches:
The system of claim 21, wherein the generated target voice data relates to converting non-language aspects of the voice data while maintaining utterance sentences in the voice data, and wherein the non-language aspects of the voice include one or more of individuality and an utterance style of a speaker (Nakashika page 580 introduction first paragraph, where voice conversion transforms specific information about speakers while retaining linguistic information, and page 583 first column, where the output target vector is generated, and Arik page 3 section 3 last paragraph, where the target audio is generated).  

Regarding claim 23, Nakashika teaches:
receive conversation source data, wherein the conversation source data includes a plurality of utterances, and wherein each utterance includes a plurality of frames of voice data (Page 584 section IV-A first paragraph, where source and target speaker data is obtained, the training data containing multiple frames); 
generate a series of sound feature vectors based on the received conversation source data, wherein the series of sound feature vectors includes a sound feature vector corresponding to the utterance in the conversation source data (page 582, Section III, first two paragraphs, where feature vectors of MFCCs are determined from the source and target speakers); 
identify a series of latent vectors based on the received conversation source data, wherein the series of latent vectors includes a latent vector corresponding to the utterance in the conversation source data (page 582-583 Section III, paragraph spanning pages, where latent features are obtained for source and target speakers); 
determine an attribution label associated with the utterance in the conversation source data (page 582 section III second paragraph, where the models are trained for each speaker);
generate an encoder based on training, wherein the encoder upon being trained determines a series of output latent vectors based on a series of input sound vectors and an input attribution label associated with an input utterance (page 582-583 section III first three paragraphs, where SD-RTRBMs are trained for encoding depending on the source and target identities, where the input vectors are projected into latent space), and wherein the training is based on the determined attribution label and first parallel data between the identified series of latent vectors and the generated series of sound feature vectors (page 582-583 section III first three paragraphs, where SD-RTRBMs are trained for encoding depending on the source and target identities, where the input vectors are projected into latent space, indicating parallel data); 

receiving the input utterance (Page 584 section IV-A first paragraph, where source and target speaker data is obtained);
receiving the target attribution label (Page 584 section IV-A first paragraph, where the target speaker is selected);
reconstructing, based on a combination of the trained encoder and the trained decoder, the series of input sound feature vectors of the input utterance according to the target attribution label (page 582-583 section III first three paragraphs, where the input voice is converted to a target voice); and
generating a target utterance based the reconstructed series of input sound feature vectors (page 582-583 section III first three paragraphs, where the input voice is converted to a target voice); and
providing the target utterance with the target attribution label (page 582-583 section III first three paragraphs, where the input voice is converted to a target voice).  
Nakashika does not explicitly teach a computer readable medium, nor attribution labels.
Arik teaches:

 and an attribution label (page 17 section E first paragraph, where labels for gender are used)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Nakashika by using the labels of Arik (Arik page 17 section E first paragraph) in the encoding of Nakashika (Nakashika page 582-583 Section III) in order to achieve a high discriminative accuracy using the speaker embeddings (Arik page 17 section E first paragraph).

Regarding claim 24, Nakashika in view of Arik teaches:
The computer-readable non-transitory recording medium of claim 23, the computer-executable instructions when executed further causing the system to: 
generate the encoder and the decoder based on maximizing a value of an objective function (Nakashika page 582 Section III first paragraph, where the parameters are estimated to maximize a probability), 
wherein the objective function at least relates to an error between the reconstructed series of input sound feature vectors and the generated series of sound feature vector as second parallel data (Nakashika page 582-583 section III paragraph spanning pages, where error between the latent vectors is minimized, and where the latent vectors and input are used in determining the output), and 
wherein the objective function further relates to a distance between the determined series of output latent vectors and the identified series of latent vectors as third parallel data (Nakashika page 582-583 section III paragraph spanning pages, where error between the latent vectors is minimized).  

Regarding claim 25, Nakashika in view of Arik teaches:
The computer-readable non-transitory recording medium of claim 23, wherein each of the encoder and the decoder is configured using one of: a convolutional network or a recurrent network (Nakashika page 583 second column, paragraph after eq. 24, where an RNN is used).  


Regarding claim 26, Nakashika in view of Arik teaches:
The computer-readable non-transitory recording medium of claim 23, wherein the attribution of the conversation source data includes one or more of: 
gender of a speaker (Arik page 17 section E first paragraph, where the labels include gender of the speaker), 
a status of the speaker being a native speaker of a language used (Arik page 17 section E first paragraph, where the labels include region of accent of the speaker, indicating a native speaker), 
a type of utterance mood of the speaker (where another limitation is chosen), or 
a style of utterance in lecture or non-lecture (where another limitation is chosen).  

Regarding claim 27, Nakashika in view of Arik teaches:
The computer-readable non-transitory recording medium of claim 23, the computer-executable instructions when executed further causing the system to: 
receive a voice data for conversion (Nakashika Page 584 section IV-A first paragraph, where source and target speaker data is obtained); 
extract the series of input sound feature vectors from the voice data (Nakashika page 582, Section III, first two paragraphs, where feature vectors of MFCCs are used); 
estimate, using the generated encoder, the series of output latent vectors (Nakashika page 582-583 Section III, paragraph spanning pages, where latent features are obtained); 
estimate, using the generated decoder, a series of target sound feature vectors for a target voice data based at least on the estimated series of output latent vectors and the target attribution label (Nakashika Fig. 2, page 582-583 Section III, paragraph spanning pages, where latent features are used to calculate the output features); 

generate the target voice data based on the estimated series of target sound feature vectors (Nakashika page 583 first column, where the output target vector is generated, and Arik page 3 section 3 last paragraph, where the target audio is generated); and 
provide the generated target voice data as a converted voice data (Nakashika page 583 first column, where the output target vector is generated, and Arik page 3 section 3 last paragraph, where the target audio is generated).  

Regarding claim 28, Nakashika in view of Arik teaches:
The computer-readable non-transitory recording medium of claim 27, 
wherein the generated target voice data relates to converting non-language aspects of the voice data while maintaining utterance sentences in the voice data, and wherein the non-language aspects of the voice include one or more of individuality and an utterance style of a speaker (Nakashika page 580 introduction first paragraph, where voice conversion transforms specific information about speakers while retaining linguistic information, and page 583 first column, where the output target vector is generated, and Arik page 3 section 3 last paragraph, where the target audio is generated).

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Hsu et al. (Hsu, C. C., Hwang, H. T., Wu, Y. C., Tsao, Y., & Wang, H. M. (2016, December). Voice conversion from non-parallel corpora using variational auto-encoder. In 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA) (pp. 1-6). IEEE.) Abstract teaches performing voice conversion using an encoder and decoder.
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to BRYAN S BLANKENAGEL whose telephone number is (571)270-0685. The examiner can normally be reached 8:00am-5:30pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Richemond Dorvil can be reached on 571-272-7602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/BRYAN S BLANKENAGEL/Primary Examiner, Art Unit 2658