Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
1.	This action is responsive to Application No. 17/095,751.  All claims have been examined and are currently pending.
Information Disclosure Statement
2.	The information disclosure statement (IDS) submitted is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.
EXAMINER’S AMENDMENT
3.	An examiner’s amendment to the record appears below. Should the changes and/or additions be unacceptable to applicant, an amendment may be filed as provided by 37 CFR 1.312. To ensure consideration of such an amendment, it MUST be submitted no later than the payment of the issue fee.

	Please amend claim 1 to read:
inputting the second source speaker voice data group and the personalized voice data group 
Please amend claim 9 to read:
Instructions for inputting the second source speaker voice data group and the personalized voice data group 


A non-transitory computer-readable storage medium storing one or more computer programs executable on a processor to implement a voice conversion training method, wherein the one or more computer programs comprise

Instructions for inputting the second source speaker voice data group and the personalized voice data group 


Allowable Subject Matter
4.	Claims 1-20 are allowed.
5.	The following is an examiner’s statement of reasons for allowance: the claims are allowed as they further teach

Regarding claim 1: A voice conversion training method, comprising steps of: 
forming a first training data set, wherein the first training data set comprises a plurality of training voice data groups; 
selecting two of the training voice data groups from the first training data set to input into a voice conversion neural network for training; 
forming a second training data set, wherein the second training set comprises the first training data set and a first source speaker voice data group; 

forming a third training data set, wherein the third training data set comprises a second source speaker voice data group and a personalized voice data group, the second source speaker voice data group comprises a second quantity of second source speaker voice data and corresponds to a same speaker with the first source speaker voice data group, and the personalized voice data group comprises the second quantity of personalized voice data; 
wherein the personalized voice data group is obtained from a terminal device; 
inputting the second source speaker voice data group and the personalized voice data group into the voice conversion neural network for training; 
obtaining to-be-converted voice data, wherein the to-be-converted voice data corresponds to a same speaker with the personalized voice data group; and 
inputting the to-be-converted voice data into the voice conversion neural network, and obtaining target voice data based on an output of the voice conversion neural network.  
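
For illustration only, the staged training recited above can be sketched as follows. This is a minimal sketch, not the applicant's implementation: `VoiceDataGroup`, `StubNetwork`, and `staged_training` are hypothetical names, the network is a stub that only records which groups were paired, stage 1 is assumed to visit every pair of training groups, and the pairing used in stage 2 is an assumption because the text above does not recite it.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import List, Tuple


@dataclass
class VoiceDataGroup:
    speaker: str
    utterances: List[str]


@dataclass
class StubNetwork:
    # Hypothetical stand-in for the voice conversion neural network:
    # it only records which groups were paired at each stage.
    log: List[Tuple[str, str, str]] = field(default_factory=list)

    def train(self, stage: str, a: VoiceDataGroup, b: VoiceDataGroup) -> None:
        self.log.append((stage, a.speaker, b.speaker))


def staged_training(net, first_set, first_source, second_source, personalized):
    # Stage 1: two training voice data groups at a time from the first set.
    # ASSUMPTION: every pair is visited; the claim only says two are selected.
    for a, b in combinations(first_set, 2):
        net.train("stage1", a, b)

    # Stage 2: the second training data set is the first set plus the first
    # source speaker group. ASSUMPTION: the source group is paired with each
    # training group; the pairing rule is not recited in the text above.
    second_set = list(first_set) + [first_source]
    for group in first_set:
        net.train("stage2", first_source, group)

    # Stage 3: the third set holds the second source speaker group and the
    # personalized group, with equal quantities and the same source speaker.
    if second_source.speaker != first_source.speaker:
        raise ValueError("source groups must come from the same speaker")
    if len(second_source.utterances) != len(personalized.utterances):
        raise ValueError("third-set groups must hold the same quantity of data")
    net.train("stage3", second_source, personalized)
    return second_set
```

The two checks in stage 3 mirror the claim's requirements that both source speaker groups correspond to the same speaker and that the third-set groups hold the same quantity of data.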

According to the application at hand:
[0098] It can be seen from the above-mentioned description that, in the smart device of this embodiment, the voice conversion neural network is trained through the two training voice data groups in the first training data set first, so that the learning of the voice conversion neural network covers a large number of corpus to learn a 

Closely related references teach:
Turk et al (2006/0129399)
Abstract: The conversion of speech can be used to transform an utterance by a source speaker to match the speech characteristic of a target speaker. During a training phase, utterances corresponding to the same sentences by both the target speaker and source speaker can be force aligned according to the phonemes within the sentences. A target codebook and source codebook as well as a transformation between the two can be trained. After the completion of a training phase, a source utterance can be divided into entries in the source codebook and transformed into entries in the target codebook. During the transformation, the situation arises where a single source codebook entry can have several target codebook entries. The number of entries can be reduced with the application of confidence measures.
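
For illustration only, reducing several target codebook entries per source entry with a confidence measure might look like the sketch below. The abstract does not say how confidence is computed or applied, so the threshold rule, the fallback to the single best candidate, and the names `prune_by_confidence` and `min_confidence` are all assumptions.

```python
from typing import Dict, List, Tuple

# Maps a source codebook entry to candidate (target entry, confidence) pairs.
Candidates = Dict[str, List[Tuple[str, float]]]


def prune_by_confidence(mapping: Candidates, min_confidence: float) -> Candidates:
    pruned: Candidates = {}
    for source_entry, targets in mapping.items():
        # Keep only candidates at or above the (assumed) confidence threshold.
        kept = [(t, c) for t, c in targets if c >= min_confidence]
        # ASSUMPTION: keep the single best candidate if none passes,
        # so that every source entry remains mapped.
        if not kept and targets:
            kept = [max(targets, key=lambda tc: tc[1])]
        pruned[source_entry] = kept
    return pruned
```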

Nurminen et al (2008/0082333)
[0026] As indicated above, codebook 80 is created using training material that is spoken by source and target voices. The spoken training material is segmented into syllables, and a pitch analysis is performed to generate a pitch contour (a set of pitch values at different times) for each syllable. Pitch analysis can be performed prior to segmentation. Pitch contours can be generated in various manners. In some embodiments, a spectral analysis for input speech (or a TTS analysis of input text) undergoing conversion outputs pitch values (F0) for each syllable. As part of such an analysis, a duration of the analyzed speech (and/or segments thereof) is also provided or is readily calculable from the output. For example, FIG. 3A shows a source pitch contour 81 for syllable j spoken by a source. In the example of FIG. 3A, the contour is for the word "is" spoken by a first speaker. Contour 81 includes values for pitch at each of times n=1 through n=N. The duration of pitch contour 81 (and thus of the source-spoken version of that syllable) is calculable from the number of pitch samples and the known time between samples. As explained in more detail below, a lower-case "z" represents a pitch contour or a value in a pitch contour (e.g., z_j^{SRC}(n) as shown on the vertical axis in FIG. 3A); an upper-case "Z" represents a transform of a pitch contour. FIG. 3B shows a target pitch contour 82 (also shown as z_j^{TGT}(n) on the vertical axis) for the same syllable ("is") as spoken by a second speaker. Target pitch contour 82 also includes values for pitch at each of times n=1 through n=N'. In the examples of FIGS. 3A and 3B, and as will often be the case, N ≠ N'.
[0027] Returning to FIG. 2, the source and target pitch contours for each syllable are stored in codebook 80 using transformed representations.
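
For illustration only, the duration calculation described in paragraph [0026] can be sketched as below. The function name `contour_duration_ms`, the 10 ms spacing, and the pitch values are hypothetical, and the sketch assumes each pitch sample covers one full sampling interval (so the duration is N times the spacing rather than N-1 times).

```python
from typing import Sequence


def contour_duration_ms(pitch_values: Sequence[float], sample_spacing_ms: int) -> int:
    # ASSUMPTION: each of the N pitch samples spans one sampling interval.
    return len(pitch_values) * sample_spacing_ms


# Hypothetical contours for the same syllable from two speakers (N != N').
source_contour = [118.0, 120.5, 123.0, 121.0]
target_contour = [201.0, 205.5, 209.0, 211.0, 208.0, 204.0]

source_ms = contour_duration_ms(source_contour, 10)
target_ms = contour_duration_ms(target_contour, 10)
```

Because the source and target contours generally differ in length, their durations differ as well, which is consistent with the reference storing them through transformed representations.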


Sun et al (2018/0012613)
[0023] FIG. 3 shows a diagram of a PPGs-based voice conversion approach 300 with non-parallel training data, according to some embodiments of the present disclosure. With non-parallel training data, the target speech and source speech may not have any overlapped portion, or at least may not have significant overlapping portions. In some embodiments, the target speech and the source speech may be identical. The PPGs-based approach 300 solves many of the limitations of the DBLSTM-based approach 200, and is partially based on the assumption that PPGs obtained from an SI-ASR system can bridge across speakers (SI stands for speaker-independent). As illustrated in FIG. 3, the PPGs-based approach 300 is divided into three stages: a first training stage 302 (labeled as “Training Stage 1”), a second training stage 304 (labeled as “Training Stage 2”), and a conversion stage 306. The role of the SI-ASR model is to obtain a PPGs representation of the input speech. The second training stage 304 models the relationships between the PPGs and MCEPs features of the target speaker for speech parameter generation and performs a DBLSTM model training 308. The conversion stage 306 drives a trained DBLSTM model 310 with PPGs of the source speech (obtained from the same SI-ASR) for voice conversion.

Huffman et al (2018/0342256)
Abstract: A method of building a speech conversion system uses target information from a target voice and source speech data. The method receives the source speech data and the target timbre data, which is within a timbre space. A generator produces first candidate data as a function of the source speech data and the target timbre data. A discriminator compares the first candidate data to the target timbre data with reference to timbre data of a plurality of different voices. The discriminator determines inconsistencies between the first candidate data and the target timbre data. The discriminator produces an inconsistency message containing information relating to the inconsistencies. The inconsistency message is fed back to the generator, and the generator produces a second candidate data. The target timbre data in the timbre space is refined using information produced by the generator and/or discriminator as a result of the feeding back.

	Aryal – 
	Abstract: A voice conversion system for generating realistic, natural-sounding target speech is disclosed. The voice conversion system preferably comprises a neural network for converting the source speech data to estimated target speech data; a global variance correction module; a modulation spectrum correction module; and a waveform generator. The global variance correction module is configured to scale and shift (or normalize and de-normalize) the estimated target speech based on (i) a mean and standard deviation of the source speech data, and further based on (ii) a mean and standard deviation of the estimated target speech data. The modulation spectrum correction module is configured to apply a plurality of filters to the estimated target speech data after it has been scaled and shifted by the global variance correction module. Each filter is designed to correct the trajectory representing the curve of one MCEP coefficient over time. Collectively, the plurality of filters are designed to correct the trajectories of each of the MCEP coefficients in the target voice data being generated from the source speech data. Once the MCEP coefficients are corrected, they are then provided to a waveform generator configured to generate the target voice signal that can then be played to the user via a speaker.
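
For illustration only, a scale-and-shift (normalize and de-normalize) step of the kind the abstract attributes to the global variance correction module can be sketched as below. The abstract does not give the formula, so this is an assumption: the estimate is normalized with its own mean and standard deviation and de-normalized with the statistics of a reference sequence. The name `scale_and_shift` is hypothetical.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def scale_and_shift(estimated: Sequence[float], reference: Sequence[float]) -> List[float]:
    est_mean, est_std = fmean(estimated), pstdev(estimated)
    ref_mean, ref_std = fmean(reference), pstdev(reference)
    if est_std == 0:
        # A constant estimate has no spread to rescale; only shift it.
        return [ref_mean for _ in estimated]
    # Normalize with the estimate's statistics, then de-normalize with
    # the reference statistics (ASSUMED form of the correction).
    return [(x - est_mean) / est_std * ref_std + ref_mean for x in estimated]
```

In a full system this would presumably be applied to each MCEP coefficient trajectory separately; the sketch handles a single trajectory.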

	Zhang 
Abstract: A method (and structure and computer product) to permit zero-shot voice conversion with non-parallel data includes receiving source speaker speech data as input data into a content encoder of a style transfer autoencoder system, the content encoder providing a source speaker disentanglement of the source speaker speech data by reducing speaker style information of the input source speech data while retaining content information and receiving target speaker input speech as input data into a target speaker encoder. The output of the content encoder and the target speaker encoder are combined in a decoder of the style transfer autoencoder, and the output of the decoder provides the content information of the input source speech data in a style of the target speaker speech information.

[0007] In accordance with yet another exemplary embodiment, also disclosed herein is a method (and apparatus and computer product) for transferring a style of voice utterances, as capable of a zero-shot voice conversion with non-parallel data, including preliminarily training a first neural network in a target speaker encoder, using speech information of a target speaker. The first neural network is trained to maximize an embedding similarity among different utterances of the target speaker and minimize similarities with other speakers. An autoencoder system is operated first in a training mode, the autoencoder system including a content encoder having a second neural network that compresses original input data from an input layer into a shorter code and a decoder having a third neural network that learns to un-compress the shorter code to closely match the original input data. The training mode implements a self-reconstruction training using speech inputs from a source speaker into the content encoder and into the target speaker encoder that has been preliminarily trained using target speaker speech information. The self-reconstruction training thereby trains the second neural network and the third neural network to adapt to a style of the target speaker. After the training mode, the autoencoder system can be operated in a conversion mode in which utterances of a source speaker provide source speech utterances in a style of the target speaker.

However, the closest art of record does not teach or make obvious the limitations of the claim.

The additional independent claims are allowed for similar rationale and reasoning as claim 1.
The dependent claims are allowed as they further limit the parent claims.

6.	Any comments considered necessary by applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee.  Such submissions should be clearly labeled “Comments on Statement of Reasons for Allowance.”
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHAUN A ROBERTS whose telephone number is (571)270-7541.  The examiner can normally be reached Monday-Friday 9-5 EST.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/SHAUN ROBERTS/
Primary Examiner, Art Unit 2655