Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Claims 1-20 are pending. Claims 1, 14, and 19 are independent and have been amended.
This Application is published as U.S. 20220148561.
Apparent priority: 10 November 2020.
Pending Claims are allowed.
Response to Amendments
Objection to the Specification is withdrawn in view of the amendments to the Specification.
Objection to Claim 13 is withdrawn in view of the amendments to Claim 13.
Response to Arguments
The Independent Claims are amended and now include:
1. A method, comprising:
receiving, by a computer system, an audio stream comprising human speech; 
determining one or more features of the audio stream;
generating, based on the one or more features of the audio stream, a pipeline affinity vector, wherein each pipeline affinity vector element of the pipeline affinity vector reflects a degree of suitability of the audio stream for training an audio asset synthesizing pipeline identified by an index of the pipeline affinity vector element;
selecting, an audio asset synthesizing pipeline identified by a pipeline affinity vector element corresponding to a maximum value of the degree of suitability;
training, using the audio stream, one or more audio asset synthesizing models implementing respective stages of the selected audio asset synthesizing pipeline; and 
responsive to determining that a quality metric of the audio asset synthesizing pipeline satisfies a predetermined quality condition, synthesizing one or more audio assets by the selected audio asset synthesizing pipeline.

This Claim finds the “audio asset synthesizing pipeline” which is most suitable for a particular “audio stream” as determined by “features of the audio stream.”
The method by which “suitability” is determined is through the formation of the “pipeline affinity vector” where each element of this vector shows the affinity/suitability of the particular “audio stream” for training a particular “synthesizing pipeline.”  The vector element with the maximum value indicates which “synthesizing pipeline” is best suited to this “audio stream” and should be trained with this particular “audio stream.”
The “features” which form the “elements” of the vector include gender, style of speech, and the language (i.e. French vs. Spanish) that is used in the “audio stream.”
A key feature of the Claim is the recitation of “a degree of suitability” of the audio stream for training a particular synthesizer, based on which that synthesizer is “selected” to be trained with the particular input “audio stream.”
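The sequence of steps recited in Claim 1 can be sketched as follows. This is a minimal illustration under stated assumptions and is not taken from the Application: the names select_pipeline, run_claimed_flow, score_affinity, train_and_evaluate, and quality_threshold are hypothetical, and the "predetermined quality condition" is assumed here to be a simple threshold comparison.

```python
from typing import Callable, Sequence, Tuple


def select_pipeline(affinity: Sequence[float]) -> int:
    """Return the 0-based index of the element with the maximum suitability."""
    if not affinity:
        raise ValueError("pipeline affinity vector is empty")
    # max() over indices keeps the first index when two elements tie.
    return max(range(len(affinity)), key=lambda i: affinity[i])


def run_claimed_flow(
    features: Sequence[float],
    score_affinity: Callable[[Sequence[float]], Sequence[float]],
    train_and_evaluate: Callable[[int], float],
    quality_threshold: float,
) -> Tuple[int, bool]:
    """Select a pipeline, train it, and report whether synthesis may proceed."""
    affinity = score_affinity(features)        # generate the affinity vector
    chosen = select_pipeline(affinity)         # select the maximum element
    quality = train_and_evaluate(chosen)       # train, then measure quality
    # Assumed form of the "predetermined quality condition": a threshold.
    may_synthesize = quality >= quality_threshold
    return chosen, may_synthesize


# Hypothetical stand-ins for the classifier and the training step.
chosen, may_synthesize = run_claimed_flow(
    features=[0.2, 0.7],
    score_affinity=lambda f: [2.0, 8.5, 4.0],
    train_and_evaluate=lambda index: 0.9,
    quality_threshold=0.8,
)
```

In this sketch the second pipeline (index 1) is chosen because its element holds the largest value, and synthesis is permitted only because the assumed quality score meets the assumed threshold.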

Applicant argues that “Richards has no teachings … indicating that the ‘voice property values’ are utilized for selecting a model. …. [or that] a model is selected or chosen.”  Response 10.
In Reply, as Figures 10 and 11 of Richards readily show, the synthesizer models of Richards are trained for audio features similar to those recited in the Specification of the instant Application.
Further, “selecting” was not bolded in the mapping to Richards, which is the Examiner’s way of signaling that this word or phrase is not being mapped to the reference, and the Office action provided:  “Richards impliedly selects the proper model but is not express.”  As a result, Min was cited for the aspect of “selecting” a model to be further trained by the particular “audio stream.”
Min, as applied to the Claim, expressly teaches: “selecting the pre-trained model to learn the target voice data, from among one or more pre-trained models stored in a memory, based on the data features of the target voice data,” and then “training a pre-trained model … by using the target voice data as training data.”

Applicant admits that Min teaches “selecting a pre-trained model” but argues that the “pipeline affinity vector” added by amendment is not taught by Min.
The amendment states that each “audio stream” is represented by a vector such as V = [S1, S2, S3, S4, …, Sn], where each S, according to the Specification, is a number between 0 and 10.  For example, V = [9, 7, 6, 1, 5] means that this “audio stream” has audio features that are best suited (suitability = 9) for training the models of synthesizer number 1 and least suited (suitability = 1) for training the models of synthesizer number 4.  (“Training, using the audio stream, one or more audio asset synthesizing models implementing respective stages of the selected audio asset synthesizing pipeline.”)
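The example vector V = [9, 7, 6, 1, 5] can be worked through in a few lines. The sketch below is illustrative only; the helper name rank_synthesizers is hypothetical, and synthesizers are numbered from 1 to match the numbering used in this discussion.

```python
from typing import List, Sequence


def rank_synthesizers(affinity: Sequence[float]) -> List[int]:
    """Return 1-based synthesizer numbers ordered from most to least suitable."""
    order = sorted(range(len(affinity)), key=lambda i: affinity[i], reverse=True)
    return [i + 1 for i in order]


affinity_vector = [9, 7, 6, 1, 5]  # example values from the discussion
ranking = rank_synthesizers(affinity_vector)
best, worst = ranking[0], ranking[-1]
```

Here best is synthesizer number 1 and worst is synthesizer number 4, which agrees with the reading of the vector given in the text.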
Min does not mention the word “vector” but the format it provides for “feature value of the target voice data” is a vector:  “[0159] According to an embodiment of the disclosure, the electronic device 100 may calculate 1001111011001111 (see reference numeral 620) as a data feature value of the target voice data WAV2 with reference to the prestored table of FIG. 6.”  See also [0158].  The “target voice data” of Min teaches the “audio stream” of the Claim.  Further, the selection of one of the “pre-trained models” for further training is done “based on the data features of the target voice data.”

[Figure: media_image1.png, Min Figure 6]

	Min, in Figure 10, teaches that one method of “selection” of the pre-trained model to be further trained is “based on data features of the target voice data.  S1001.”
	For WAV2 of Figure 6 above, where the Gender is Female Child: F_Kid, the corresponding “data feature value” is 11 (615 in Figure 6) which is also the ID number of the Pre-Trained Model for Gender = F_Kid in Figure 12.  

[Figure: media_image2.png]


See Min:  “[0206] …FIG. 12 shows an example of data stored according to an identifier (ID) of each pre-trained model.”  “[0207] For example, according to the ID of each pre-trained model, data on acoustic features (e.g., FS and BW), speaker features (e.g., Gend, Lang, FO/Pitch, and Tempo), and the number of pre-learned steps of voice data pre-learned by the pre-trained model may be stored.”
	Min looks for the TYPE of data feature to match the ID of the pre-trained model.  See the leftmost columns in Figures 6 and 12.  Min does not teach the claimed “degree of suitability,” which requires a variable value.  The “suitability” in Min is binary, 0 or 1, with no degrees.  Additionally, and as a minor difference, the vector of the Claim and the vector of Min are organized differently.  The vector of Min is a list of IDs of the synthesizers in an order corresponding to the order of acoustic features:  for acoustic feature 1, synthesizer 10 is suitable; for acoustic feature 2, synthesizer 01 is most suitable; and so on.  The vector of the instant Application and of the Claim is a list of suitability degrees arranged in the order of the synthesizers:  the first element of the vector is the suitability degree of the input speech for synthesizer number 1, the second element is the suitability degree of this same input for synthesizer number 2, and so on.
	Accordingly, Min does not teach the independent Claims as amended.
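The difference in organization between the two vectors can be illustrated as follows. This sketch rests on assumptions that are not stated in Min: the data feature value 1001111011001111 quoted from paragraph [0159] is assumed to be a concatenation of fixed-width two-bit codes, one per feature, which is consistent with the reading above that feature 1 corresponds to 10 and feature 2 to 01. The helper names split_codes and binary_match and the list of model IDs are hypothetical.

```python
from typing import List


def split_codes(value: str, width: int = 2) -> List[str]:
    """Split a concatenated feature value into fixed-width codes (assumed width)."""
    if len(value) % width != 0:
        raise ValueError("value length is not a multiple of the code width")
    return [value[i:i + width] for i in range(0, len(value), width)]


def binary_match(feature_code: str, model_id: str) -> int:
    """Exact-match test: 1 if the code equals the model ID, otherwise 0."""
    return 1 if feature_code == model_id else 0


# Min-style vector: indexed by feature, each entry is a model ID code.
min_codes = split_codes("1001111011001111")

# Matching one code against hypothetical model IDs yields only 0 or 1.
model_ids = ["00", "01", "10", "11"]
match_flags = [binary_match("11", model_id) for model_id in model_ids]

# Claim-style vector: indexed by synthesizer, each entry is a graded degree.
claim_vector = [9, 7, 6, 1, 5]
distinct_degrees = sorted(set(claim_vector))
```

The exact-match test can only return 0 or 1, whereas the claim-style vector carries a range of values from which a maximum can be taken.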

SPECIFICATION
Note the supporting Specification:
[0017] The feature extraction functional module 115 analyzes the input audio stream to extract various features 120A-120K representing the audio stream properties, parameters, and/or characteristics. In an illustrative example, the audio stream features 120A-120K include the size of the audio stream or its portion, the sampling rate of the audio stream, the style of the speech (e.g., sports announcer style, dramatic, neutral), the perceived gender of the speaker, the natural language utilized by the speaker, the pitch, etc. The extracted features may be represented by a vector, every element of which represents a corresponding feature value.

 [0018] A vector of the extracted features 120A-120K is fed to the pipeline selection functional module 125, which applies one or more trainable models and/or rule engines to the extracted features 120A-120K in order to select the audio asset synthesizing pipeline 130 that is best suitable for processing the audio stream 110 for model training. In an illustrative example, the pipeline selection functional module 125 may employ a trainable classifier that processes the set of extracted features 120A-120K and produces a pipeline affinity vector, such that each element of the pipeline affinity vector is indicative of a degree of suitability of an audio stream characterized by the particular set of extracted features for training the audio asset synthesizing pipeline identified by the index of the vector element. Thus, the element Si of the numeric vector produced by the trainable classifier would store a number that is indicative of the degree of suitability of an audio stream characterized by the set of extracted features for training the i-th audio asset synthesizing pipeline. In an illustrative example, the suitability degrees may be provided by real or integer numbers selected from a predefined range (e.g., 0 to 10), such that a smaller number would indicate a lower suitability degree, while a larger number would indicate a larger suitability degree. Accordingly, the pipeline selection functional module 125 may select the audio asset synthesizing pipeline that is associated with the maximum value of the degree of suitability specified by the pipeline affinity vector. 

[0019] As schematically illustrated by FIG. 2, in some implementations, the pipeline selection functional module 125 may comprise a neural network 210. Training the neural network 210 may involve determining or adjusting the values of various network parameters 215 by processing a training data set 215 comprising a plurality of training samples 220A-220N. In an illustrative example, the network parameters may include a set of edge weights, which increase or attenuate the signals being transmitted through respective edges connecting artificial neurons. Each training sample 220 may include an input feature set 221 labeled with a vector of suitability values 222, such that each vector element would store a number that is indicative of the degree of suitability of an audio stream characterized by the input feature set for training the audio asset synthesizing pipeline identified by the index of the vector element. Accordingly, the supervised training process may involve determining a set of neural network parameters 215 that minimizes a fitness function 230 reflecting the difference between the pipeline suitability vector 240 produced by the trainable classifier processing a given input feature set 220N from the training data set and the pipeline affinity vector 222N associated with the input feature set. In some implementations, the labels (i.e., the pipeline affinity vectors 222A-222N) for the training data set 215 may be produced by the quality evaluation functional module 145. The pipeline training workflow 100 may be utilized for simultaneously or sequentially training multiple pipelines. The quality evaluation functional module 145 may associate each pipeline with a pass/fail label or a degree of suitability of the processed audio stream to the pipeline, based on the result of performing the quality evaluation of the trained pipeline.
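The comparison described in paragraph [0019], between the vector produced by the classifier and the labeled vector for the same input feature set, can be sketched as below. The Specification does not state the form of the fitness function 230; mean squared difference is assumed here purely for illustration, and the function name fitness and the sample values are hypothetical.

```python
from typing import Sequence


def fitness(predicted: Sequence[float], label: Sequence[float]) -> float:
    """Mean squared difference between a predicted vector and its label vector."""
    if len(predicted) != len(label) or len(label) == 0:
        raise ValueError("vectors must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(predicted, label)) / len(label)


predicted_vector = [8.0, 7.0, 5.0]  # stands in for suitability vector 240
label_vector = [9.0, 5.0, 5.0]      # stands in for affinity vector 222N
loss = fitness(predicted_vector, label_vector)
```

Under this assumed measure the value is zero only when the two vectors agree element by element, so a training procedure that lowers it moves the classifier output toward the labels.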

Examiner’s Amendments
Authorization for this examiner’s amendment was granted in an interview with Mr. Dmitry Andreev on 7/13/2022.
Amend paragraph [0019] on pp. 7-8 of the Specification as filed as follows:
[0019] As schematically illustrated by FIG. 2, in some implementations, the pipeline selection functional module 125 may comprise a neural network 210. Training the neural network 210 may involve determining or adjusting the values of various network parameters 215 by processing a training data set 215 comprising a plurality of training samples 220A-220N. In an illustrative example, the network parameters may include a set of edge weights, which increase or attenuate the signals being transmitted through respective edges connecting artificial neurons. Each training sample 220 may include an input feature set 221 labeled with a vector of suitability values 222, such that each vector element would store a number that is indicative of the degree of suitability of an audio stream characterized by the input feature set for training the audio asset synthesizing pipeline identified by the index of the vector element. Accordingly, the supervised training process may involve determining a set of neural network parameters 215 that minimizes a fitness function 230 reflecting the difference between the pipeline suitability vector 240 produced by the trainable classifier processing a given input feature set 220N from the training data set and the pipeline affinity vector 222N associated with the input feature set. In some implementations, the labels (i.e., the pipeline affinity vectors 222A-222N) for the training data set 215 may be produced by the quality evaluation functional module 145. The pipeline training workflow 100 may be utilized for simultaneously or sequentially training multiple pipelines. The quality evaluation functional module 145 may associate each pipeline with a pass/fail label or a degree of suitability of the processed audio stream to the pipeline, based on the result of performing the quality evaluation of the trained pipeline.

As support, note Figure 2 and “Suitability Vector 240” shown on Figure 2:

[Figure: media_image3.png, Figure 2 of the instant Application]

Allowable Subject Matter
Pending Claims 1-20 are allowed.
The following is an examiner’s statement of reasons for allowance: In view of each of the particular limitations of the independent Claims when considered in the order established by the Claim language and in the context of the language of the independent Claims when each Claim is considered as a whole, the independent Claims of this Application were not found in the prior art that was viewed.
In particular, the features added by amendment, when considered in the context of the language of the independent Claims as a whole and considering all of the limitations of these Claims, were not found in the prior art.  Note the Response to Arguments section.  The Claim receives an “audio stream” which is “human speech” and uses this “audio stream” for training a speech synthesizer (“audio asset synthesizing pipeline”) selected from among a number of speech synthesizers that are being trained.  The trained speech synthesizer is then used for synthesizing speech if the “quality metric” of the synthesizer satisfies a “predetermined condition” indicating that the synthesizer is well-trained; if the synthesizer fails this “quality metric,” it is looped back into further training.  When an “audio stream” / “speech sample” comes in, the synthesizer to be trained with this particular “audio stream” is selected according to the extracted features of the incoming “audio stream,” which are placed into a vector format (“pipeline affinity vector”).  The elements of the vector indicate a degree of suitability of the audio stream for training a particular synthesizer, and the synthesizer with the maximum value of the degree of suitability is selected.  The degree of suitability is a number that could be from 0 to 10.
Any comments considered necessary by applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee. Such submissions should be clearly labeled “Comments on Statement of Reasons for Allowance.”
Close Art of Record
In addition to the art applied to the Claims during the prosecution of the instant Application, note the references cited on the Notice of References Cited PTO 892.
Min (U.S. 20210134269) remains the closest reference that was found.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to FARIBA SIRJANI whose telephone number is (571)270-1499. The examiner can normally be reached on 9 to 5, M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Desir can be reached on 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Fariba Sirjani/
Primary Examiner, Art Unit 2659