DETAILED ACTION

Introduction
1.         This office action is in response to Applicant’s submission filed on 09/11/2019.  Claims 1-20 are pending in the application. As such, Claims 1-20 have been examined.

Notice of Pre-AIA or AIA Status
2.	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Drawings
3.	The drawings filed on 09/11/2019 have been accepted and considered by the Examiner.

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

4.	Claims 1-4, 7, 9, 10, 12-15, and 17-19 are rejected under 35 U.S.C. 102(a)(1) and/or 102(b)(1) as being anticipated by Barbulescu et al., (A. Barbulescu, R. Ronfard and G. Bailly, "A Generative Audio-Visual Prosodic Model for Virtual Actors," in IEEE Computer Graphics and Applications, vol. 37, no. 6, pp. 40-51, November/December 2017), hereinafter referred to as BARBULESCU.
	With respect to Claim 1, BARBULESCU discloses:
1. A computer system, comprising: a computation device; memory configured to store program instructions, wherein, when executed by the computation device, the program instructions cause the computer system to perform one or more operations comprising: 
receiving an input associated with a type of interaction (See e.g., “…as input, the neural networks receive a set of linear ramps that give the absolute position (which count the distance toward the beginning and end of the sentence) and relative position (which describe the position of the syllable relative to the end of the sentence) of the current syllable…output is represented by the prosodic characteristics (stylized contour) for the current syllable…we use the term contour generator to denominate a neural network trained for a specific attitude and actor. Figure 3 illustrates a contour generator with inputs (ramps) and outputs (stylized contours)…” See e.g., BARBULESCU pp. 41-45, Figs. 2, 3); and generating, using a voice synthesis engine, output speech corresponding to an individual based at least in part on the input (See e.g., “…generating prosodic contours based on the direct link between phonetic forms and prosodic functions (such as attitude, emphasis, segmentation, and dependency relation) acting at different scales (such as utterance, phrase, word, syllable, and phone)…” See e.g., BARBULESCU pp. 41-45, Fig. 2), wherein the voice synthesis engine is configured to predict positions and duration of a prosodic characteristic of speech by the individual (See e.g., “…Generating audio-visual speaking styles…we use the phonotactic information to predict prosodic feature contours. The predicted rhythm is used to compute phoneme durations. The expressive speech is synthesized with a vocoder that uses the neutral utterance, predicted rhythm, energy, and voice pitch, and the facial animation parameters are obtained by adding the warped neutral motion to the reconstructed and warped predicted motion contours…” See e.g., BARBULESCU pp. 41-46, Fig. 2), and to selectively add the prosodic characteristic of the speech by the individual in the output speech based at least in part on the prediction (See e.g., “…choose neural networks for carrying this type of nonlinear mapping between phonotactic information and the stylized contour values…the model should also be able to extrapolate in the case of new phonotactic information—that is, when we want to generate contours for an utterance with a number of syllables different from the ones seen in the training set. Expressive modeling is carried out separately for each feature (melody, rhythm, energy, and motion) by training a feed-forward neural network with a hidden layer of 17 neurons and a logistic activation function…” See e.g., BARBULESCU pp. 41-45, Fig. 2), and wherein the prosodic characteristic comprises one of: pauses in the speech by the individual, or disfluences in the speech by the individual (See e.g., “…learning audio-visual speaking styles…extract audio and visual prosodic features from the training example and learn SFC models and GV equalization parameters for all dramatic attitudes, resulting in a database of audio-visual prosodic contours, including melody, rhythm, and differential motion…” See e.g., BARBULESCU pp. 41-45, Fig. 2).

With respect to Claim 2, BARBULESCU discloses:
2. The computer system of claim 1, wherein the input comprises one of: text; or speech of a second individual, who is different from the individual (See e.g., “…an end-to-end system for learning generative prosodic models of attitudes from paired examples of neutral and expressive sentences performed by semiprofessional actors…system requires the following input: a neutral version of the audiovisual speech and the label of the desired attitude, which we refer to as didascalia, in the context of a dramatic work…” See e.g., BARBULESCU pp. 41-45, Fig. 2).

With respect to Claim 3, BARBULESCU discloses:
3. The computer system of claim 1, wherein one or more operations comprise generating, using a rendering engine, video of a visual representation corresponding to the individual based at least in part on the output speech (See e.g., video generation in Fig. 6 in agreement with “…generating audio-visual speaking styles. Given a neutral sentence, we use the phonotactic information to predict prosodic feature contours…predicted rhythm is used to compute phoneme durations. The expressive speech is synthesized with a vocoder that uses the neutral utterance, predicted rhythm, energy, and voice pitch, and the facial animation parameters are obtained by adding the warped neutral motion to the reconstructed and warped predicted motion contours…” See e.g., BARBULESCU pp. 41-47, Figs. 2, 5, 6); and wherein the video of the visual representation comprises facial and lip movements corresponding to and synchronized with the generated output speech (See e.g., “…verbal and nonverbal motion… facial expressions are further processed by splitting them into two main groups: upper face (eyebrow motion, blinking, squinting…) and lower face (smiling, mouth opening, lips protrusion…” See e.g., BARBULESCU pp. 41-47, Figs. 2, 5, 6).

With respect to Claim 4, BARBULESCU discloses:
4. The computer system of claim 3, wherein one or more operations comprise providing the video of the visual representation and the output speech (See e.g., video generation in Fig. 6 in agreement with “…generating audio-visual speaking styles. Given a neutral sentence, we use the phonotactic information to predict prosodic feature contours…predicted rhythm is used to compute phoneme durations. The expressive speech is synthesized with a vocoder that uses the neutral utterance, predicted rhythm, energy, and voice pitch, and the facial animation parameters are obtained by adding the warped neutral motion to the reconstructed and warped predicted motion contours…” See e.g., BARBULESCU pp. 41-47, Figs. 2, 5, 6).

With respect to Claim 7, BARBULESCU discloses:
7. The computer system of claim 1, wherein one or more operations comprise determining, using a natural language processing engine, a response based at least in part on the input (See e.g., “…utterances were automatically aligned with their phonetic transcription obtained by an automatic TTS phonetizer. The linguistic analysis (part-of-speech tagging and syllabation)…” See e.g., BARBULESCU pp. 41-47, Figs. 2, 5, 6).

With respect to Claim 9, BARBULESCU discloses:
9. The computer system of claim 1, wherein the output speech is generated based at least in part on a gender of the individual, an ethnicity of the individual or a demographic attribute associated with the individual (See e.g., “…gathered user information, such as age, gender, and native language. We performed likelihood ratio tests comparing the combined multinomial model selected attitude (ground-truth attitude + condition + gender + language + age) with the reduced models obtained by eliminating one factor until all remaining factors significantly contributed to the model…” See e.g., BARBULESCU pp. 47-49, Figs. 2, 5, 6).

With respect to Claim 10, BARBULESCU discloses:
10. The computer system of claim 1, wherein the voice synthesis engine is configured to selectively add the prosodic characteristic based at least in part on the type of interaction (See e.g., “…as input, the neural networks receive a set of linear ramps that give the absolute position (which count the distance toward the beginning and end of the sentence) and relative position (which describe the position of the syllable relative to the end of the sentence) of the current syllable…output is represented by the prosodic characteristics (stylized contour) for the current syllable…we use the term contour generator to denominate a neural network trained for a specific attitude and actor. Figure 3 illustrates a contour generator with inputs (ramps) and outputs (stylized contours)…” See e.g., BARBULESCU pp. 41-45, Figs. 2, 3).

With respect to Claim 12, BARBULESCU discloses:
12. A non-transitory computer-readable storage medium for use in conjunction with a computer system, the computer-readable storage medium configured to store program instructions 
receiving an input associated with a type of interaction (See e.g., “…as input, the neural networks receive a set of linear ramps that give the absolute position (which count the distance toward the beginning and end of the sentence) and relative position (which describe the position of the syllable relative to the end of the sentence) of the current syllable…output is represented by the prosodic characteristics (stylized contour) for the current syllable…we use the term contour generator to denominate a neural network trained for a specific attitude and actor. Figure 3 illustrates a contour generator with inputs (ramps) and outputs (stylized contours)…” See e.g., BARBULESCU pp. 41-45, Figs. 2, 3); and generating, using a voice synthesis engine, output speech corresponding to an individual based at least in part on the input (See e.g., “…generating prosodic contours based on the direct link between phonetic forms and prosodic functions (such as attitude, emphasis, segmentation, and dependency relation) acting at different scales (such as utterance, phrase, word, syllable, and phone)…” See e.g., BARBULESCU pp. 41-45, Fig. 2), wherein the voice synthesis engine is configured to predict positions and duration of a prosodic characteristic of speech by the individual (See e.g., “…Generating audio-visual speaking styles…we use the phonotactic information to predict prosodic feature contours. The predicted rhythm is used to compute phoneme durations. The expressive speech is synthesized with a vocoder that uses the neutral utterance, predicted rhythm, energy, and voice pitch, and the facial animation parameters are obtained by adding the warped neutral motion to the reconstructed and warped predicted motion contours…” See e.g., BARBULESCU pp. 41-46, Fig. 2), and to selectively add the prosodic characteristic of the speech by the individual in the output speech based at least in part on the prediction (See e.g., “…choose neural networks for carrying this type of nonlinear mapping between phonotactic information and the stylized contour values…the model should also be able to extrapolate in the case of new phonotactic information—that is, when we want to generate contours for an utterance with a number of syllables different from the ones seen in the training set. Expressive modeling is carried out separately for each feature (melody, rhythm, energy, and motion) by training a feed-forward neural network with a hidden layer of 17 neurons and a logistic activation function…” See e.g., BARBULESCU pp. 41-45, Fig. 2), and wherein the prosodic characteristic comprises one of: pauses in the speech by the individual, or disfluences in the speech by the individual (See e.g., “…learning audio-visual speaking styles…extract audio and visual prosodic features from the training example and learn SFC models and GV equalization parameters for all dramatic attitudes, resulting in a database of audio-visual prosodic contours, including melody, rhythm, and differential motion…” See e.g., BARBULESCU pp. 41-45, Fig. 2).

With respect to Claim 13, BARBULESCU discloses:
13. The computer-readable storage medium of claim 12, wherein the input comprises one of: text; or speech of a second individual, who is different from the individual (See e.g., “…an end-to-end system for learning generative prosodic models of attitudes from paired examples of neutral and expressive sentences performed by semiprofessional actors…system requires the following input: a neutral version of the audiovisual speech and the label of the desired attitude, which we refer to as didascalia, in the context of a dramatic work…” See e.g., BARBULESCU pp. 41-45, Fig. 2).

With respect to Claim 14, BARBULESCU discloses:
14. The computer-readable storage medium of claim 12, wherein one or more operations comprise generating, using a rendering engine, video of a visual representation corresponding to the individual based at least in part on the output speech (See e.g., video generation in Fig. 6 in agreement with “…generating audio-visual speaking styles. Given a neutral sentence, we use the phonotactic information to predict prosodic feature contours…predicted rhythm is used to compute phoneme durations. The expressive speech is synthesized with a vocoder that uses the neutral utterance, predicted rhythm, energy, and voice pitch, and the facial animation parameters are obtained by adding the warped neutral motion to the reconstructed and warped predicted motion contours…” See e.g., BARBULESCU pp. 41-47, Figs. 2, 5, 6); and wherein the video of the visual representation comprises facial and lip movements corresponding to and synchronized with the generated output speech (See e.g., “…verbal and nonverbal motion… facial expressions are further processed by splitting them into two main groups: upper face (eyebrow motion, blinking, squinting…) and lower face (smiling, mouth opening, lips protrusion…” See e.g., BARBULESCU pp. 41-47, Figs. 2, 5, 6). 

With respect to Claim 15, BARBULESCU discloses:
15. The computer-readable storage medium of claim 14, wherein one or more operations comprise providing the video of the visual representation and the output speech (See e.g., video generation in Fig. 6 in agreement with “…generating audio-visual speaking styles. Given a neutral sentence, we use the phonotactic information to predict prosodic feature contours…predicted rhythm is used to compute phoneme durations. The expressive speech is synthesized with a vocoder that uses the neutral utterance, predicted rhythm, energy, and voice pitch, and the facial animation parameters are obtained by adding the warped neutral motion to the reconstructed and warped predicted motion contours…” See e.g., BARBULESCU pp. 41-47, Figs. 2, 5, 6).

With respect to Claim 17, BARBULESCU discloses:
17. The computer-readable storage medium of claim 12, wherein the output speech is generated based at least in part on a gender of the individual, an ethnicity of the individual or a demographic attribute associated with the individual (See e.g., “…gathered user information, such as age, gender, and native language. We performed likelihood ratio tests comparing the combined multinomial model selected attitude (ground-truth attitude + condition + gender + language + age) with the reduced models obtained by eliminating one factor until all remaining factors significantly contributed to the model…” See e.g., BARBULESCU pp. 47-49, Figs. 2, 5, 6).

With respect to Claim 18, BARBULESCU discloses:
18. The computer-readable storage medium of claim 12, wherein the voice synthesis engine is configured to selectively add the prosodic characteristic based at least in part on the type of interaction (See e.g., “…as input, the neural networks receive a set of linear ramps that give the absolute position (which count the distance toward the beginning and end of the sentence) and relative position (which describe the position of the syllable relative to the end of the sentence) of the current syllable…output is represented by the prosodic characteristics (stylized contour) for the current syllable…we use the term contour generator to denominate a neural network trained for a specific attitude and actor. Figure 3 illustrates a contour generator with inputs (ramps) and outputs (stylized contours)…” See e.g., BARBULESCU pp. 41-45, Figs. 2, 3).

With respect to Claim 19, BARBULESCU discloses:
19. A method for generating output speech, wherein the method comprises: by a computer system: receiving an input associated with a type of interaction (See e.g., “…as input, the neural networks receive a set of linear ramps that give the absolute position (which count the distance toward the beginning and end of the sentence) and relative position (which describe the position of the syllable relative to the end of the sentence) of the current syllable…output is represented by the prosodic characteristics (stylized contour) for the current syllable…we use the term contour generator to denominate a neural network trained for a specific attitude and actor. Figure 3 illustrates a contour generator with inputs (ramps) and outputs (stylized contours)…” See e.g., BARBULESCU pp. 41-45, Figs. 2, 3); and generating, using a voice synthesis engine, output speech corresponding to an individual based at least in part on the input (See e.g., “…generating prosodic contours based on the direct link between phonetic forms and prosodic functions (such as attitude, emphasis, segmentation, and dependency relation) acting at different scales (such as utterance, phrase, word, syllable, and phone)…” See e.g., BARBULESCU pp. 41-45, Fig. 2), wherein the voice synthesis engine is configured to predict positions and duration of a prosodic characteristic of speech by the individual (See e.g., “…Generating audio-visual speaking styles…we use the phonotactic information to predict prosodic feature contours. The predicted rhythm is used to compute phoneme durations. The expressive speech is synthesized with a vocoder that uses the neutral utterance, predicted rhythm, energy, and voice pitch, and the facial animation parameters are obtained by adding the warped neutral motion to the reconstructed and warped predicted motion contours…” See e.g., BARBULESCU pp. 41-46, Fig. 2), and to selectively add the prosodic characteristic of the speech by the individual in the output speech based at least in part on the prediction (See e.g., “…choose neural networks for carrying this type of nonlinear mapping between phonotactic information and the stylized contour values…the model should also be able to extrapolate in the case of new phonotactic information—that is, when we want to generate contours for an utterance with a number of syllables different from the ones seen in the training set. Expressive modeling is carried out separately for each feature (melody, rhythm, energy, and motion) by training a feed-forward neural network with a hidden layer of 17 neurons and a logistic activation function…” See e.g., BARBULESCU pp. 41-45, Fig. 2), and wherein the prosodic characteristic comprises one of: pauses in the speech by the individual, or disfluences in the speech by the individual (See e.g., “…learning audio-visual speaking styles…extract audio and visual prosodic features from the training example and learn SFC models and GV equalization parameters for all dramatic attitudes, resulting in a database of audio-visual prosodic contours, including melody, rhythm, and differential motion…” See e.g., BARBULESCU pp. 41-45, Fig. 2).

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

5.	Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over Barbulescu et al., (A. Barbulescu, R. Ronfard and G. Bailly, "A Generative Audio-Visual Prosodic Model for Virtual Actors," in IEEE Computer Graphics and Applications, vol. 37, no. 6, pp. 40-51, November/December 2017), in view of Pham et al., (H. X. Pham, S. Cheung and V. Pavlovic, "Speech-Driven 3D Facial Animation with Implicit Emotional Awareness: A Deep Learning Approach," 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017, pp. 2328-2336), hereinafter referred to as BARBULESCU and PHAM, respectively.
With respect to Claim 6, BARBULESCU does not explicitly disclose, but PHAM discloses: 6. The computer system of claim 1, wherein the voice synthesis engine comprises a long short-term memory model using a recurrent neural network architecture (See e.g., how a voice synthesis engine can use “…a long short-term memory recurrent neural network (LSTM-RNN) approach for real-time facial animation, which automatically estimates head rotation and facial action unit activations of a speaker from just her speech…time-varying contextual non-linear mapping between audio stream and visual facial movements is realized by training a LSTM neural network on a large audio-visual data corpus…” See e.g., PHAM, Abstract, § 5).
BARBULESCU and PHAM are analogous art because they are from a similar field of endeavor in speech processing techniques and applications. Thus, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of BARBULESCU with the functionalities and capabilities taught by PHAM (see e.g., how a voice synthesis engine can use “…a long short-term memory recurrent neural network (LSTM-RNN) approach for real-time facial animation…”) in order to advantageously provide the applicable and/or extendable capabilities of “…Recurrent Neural Networks (RNNs) have the ability to memorize past inputs in internal states… to…” (See e.g., PHAM, Abstract, § 5).

Allowable Subject Matter
6.	Claims 5, 8, 11, 16, and 20 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Conclusion
7.       The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure.  
Sadoughi et al., (N. Sadoughi and C. Busso, "Expressive Speech-Driven Lip Movements with Multitask Learning," 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), 2018, pp. 409-415), discloses, see e.g., “…a conditional generative adversarial network, called conditional sequential GAN (CSG), which learns the relationship between emotion and lexical content in a principled manner. This model uses a set of articulatory and emotional features directly extracted from the speech signal as conditioning inputs, generating realistic movements…to create emotionally dependent models by either adapting the base model with the target emotional data (CSG-Emo-Adapted), or adding emotional conditions as the input of the model (CSG-Emo-Aware). Objective evaluations of these models show improvements for…” (Sadoughi et al., Abstract, § 4, Fig. 4).
Please see the additional references in form PTO-892 for more details.
8.	Any inquiry concerning this communication or earlier communications from the examiner should be directed to Edgar Guerra-Erazo, whose telephone number is (571) 270-3708. The examiner can normally be reached M-F, 7:30 a.m.-5:00 p.m. EST. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Bhavesh Mehta, can be reached at (571) 272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov.  Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. 
/EDGAR X GUERRA-ERAZO/
Primary Examiner, Art Unit 2656