DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This Office Action is in response to the submission filed August 13, 2019.  Claims 1-20 are pending.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1-5, and 7-11 are rejected under 35 U.S.C. 102(a)(1)/(a)(2) as being anticipated by Brand (US Patent No. 6,735,566).
Regarding claim 1, Brand teaches a computing system for generating image data representing a speaker's face (col. 1, lines 5-8; col. 4, lines 24-31 -- a system for providing control parameters for animation using hidden Markov models or HMMs), the computing system comprising: a detection device configured to route data representing a voice signal to one or more processors that generate a response to the voice signal (col. 5, lines 11-12 -- during run time, new audio 90 is put through acoustic analysis 92 which then provides new vocal data 94; col. 5, lines 47-53 -- to obtain a useful vocal representation, a mix of LPC and RASTA-PLP features are calculated); and a data processing device comprising the one or more processors, the data processing device configured to generate a representation of a speaker that generated the voice signal in response to receiving the voice signal (Fig. 1; col. 4, lines 24-31 -- a system for providing control parameters for animation using hidden Markov models or HMMs; col. 5, lines 24-26 -- components of the computer; col. 8, lines 29-36 -- trajectory of control points is used to drive a 3D animated head model or a 2D face image) by performing operations comprising: executing a voice embedding function to generate a feature vector from the voice signal representing one or more signal features of the voice signal (col. 5, lines 47-53 -- a mix of LPC and RASTA-PLP features are calculated); mapping a signal feature of the feature vector to a visual feature of the speaker by a modality transfer function specifying a relationship between the visual feature of the speaker and the signal feature of the feature vector (col. 4, lines 31-36; col. 4, lines 50-53 continuing to col. 5, line 10; col. 5, lines 55-57 -- ultimate goal is to learn a mapping from the vocal features in a given time frame to simultaneous facial features); and generating a visual representation of at least a portion of the speaker based on the mapping (col. 8, lines 29-36 -- trajectory of control points is used to drive a 3D animated head model or a 2D face image).
Regarding claim 2, Brand teaches parameters of the voice embedding function that specify which of the one or more signal features of the voice signal are included in the feature vector are trained with one or more covariate classifiers that receive image data and voice signals (col. 4, line 32 continuing to col. 5, line 26 -- The link between the video and the audio is provided by a facial state sequence 82 which is a series of facial images in a facial space. Combining the facial state sequence 82 with the vocal data 78 produces a set of vocal mappings 86, one for each facial state. The facial mappings, the facial dynamics, and the vocal mappings constitute the learned model from which a vocal/facial hidden Markov model 84 is produced; col. 5, line 55 continuing to col. 7, line 6).
Regarding claim 3, Brand teaches generating an inference of a value for the visual feature based on a known correlation of the one or more signal features of the voice signal to the visual feature of the speaker (col. 4, line 32 continuing to col. 5, line 26 -- The link between the video and the audio is provided by a facial state sequence 82 which is a series of facial images in a facial space. Combining the facial state sequence 82 with the vocal data 78 produces a set of vocal mappings 86, one for each facial state. The facial mappings, the facial dynamics, and the vocal mappings constitute the learned model from which a vocal/facial hidden Markov model 84 is produced; col. 5, line 55 continuing to col. 7, line 6).

Regarding claim 4, Brand teaches the value for the visual feature comprises a size or relative proportions of articulators and vocal chambers of the speaker (col. 5, lines 28-44 – corners of the speaker’s mouth). 
Regarding claim 5, Brand teaches the visual representation comprises a reconstructed representation of a face of the speaker (col. 8, lines 29-36 -- trajectory of control points is used to drive a 3D animated head model or a 2D face image).
Regarding claim 6, Brand teaches wherein at least one of the one or more signal features of the feature vector comprises a voice quality feature, wherein the voice quality feature is related deterministically to measurements of a vocal tract of the speaker (a mix of LPC and RASTA-PLP features are calculated as discussed by H. Hermansky and N. Morgan, Rasta processing of speech, IEEE Transactions on Speech and Audio Processing, 2(4):578-589, October 1994 – where the LPC and PLP are representative of vocal tract information of speech), wherein the measurements of the vocal tract are related to measurements of a face of the speaker (col. 4, line 32 continuing to col. 5, line 26 -- The link between the video and the audio is provided by a facial state sequence 82 which is a series of facial images in a facial space. Combining the facial state sequence 82 with the vocal data 78 produces a set of vocal mappings 86, one for each facial state. The facial mappings, the facial dynamics, and the vocal mappings constitute the learned model from which a vocal/facial hidden Markov model 84 is produced; col. 5, line 55 continuing to col. 7, line 6), and wherein the data processing device is configured to recreate a geometry and of the face of the speaker based on determining the voice quality feature (col. 8, lines 29-36 -- trajectory of control points is used to drive a 3D animated head model or a 2D face image).
Regarding claim 7, Brand teaches receiving, from the detection device, data comprising a template face, and modifying the data comprising the template face to incorporate the visual feature (col. 8, lines 29-36 – 2D face image).
Regarding claim 8, Brand teaches the visual feature comprises one or more of a skull structure, a gender of the speaker, an ethnicity of the speaker, a facial landmark of the speaker, a nose structure, or a mouth shape of the speaker (col. 5, lines 27-45 -- obtain facial articulation data, a computer vision system is used to simultaneously track several individual features on the face, such as the corners of the mouth). 
Regarding claim 9, Brand teaches generating a facial image of the speaker in two or three dimensions independent of receiving data comprising a template image (col. 8, lines 29-36 -- trajectory of control points is used to drive a 3D animated head model or a 2D face image).
Regarding claim 10, Brand teaches the voice embedding function comprises a regression function configured to enable the data processing device to generate a statistically plausible face that incorporates the visual feature (col. 5, line 55 continuing to col. 7, line 6 – vocal/facial HMM model…. Using entropic training, one estimates a facial dynamical model from the poses and velocities output by the vision system. One then uses a dynamic programming analysis to find the most probable sequence of hidden states given the training video).
Regarding claim 11, Brand teaches the voice embedding function comprises a generative model configured to enable the data processing device to generate a statistically plausible face that incorporates the visual feature (col. 5, line 55 continuing to col. 7, line 6 – vocal/facial HMM model).


Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 12-20 are rejected under 35 U.S.C. 103 as being unpatentable over Brand in view of Yang et al (US Patent Application Publication No. 2020/0043465), hereinafter Yang.
Regarding claim 16, Brand teaches a computing system comprising (col. 1, lines 5-8; col. 4, lines 24-31 -- a system for providing control parameters for animation using hidden Markov models or HMMs): a detection device configured to route data representing an image data representing a speaker to one or more processors that generate a response to the image data (col. 4, line 45 continuing to col. 5, line 46 – facial state sequence); and a data processing device comprising the one or more processors (col. 5, lines 24-26 – components of the computer), executing a face embedding function to generate a feature vector from the image data representing visual features of the facial state sequence; mapping a feature of the feature vector to a signal feature of the voice signal by a modality transfer function specifying a relationship between the visual features of the image data and the signal feature of the voice signal (col. 4, lines 31-36; col. 4, lines 50-53 continuing to col. 5, line 10; col. 5, lines 55-57 -- ultimate goal is to learn a mapping from the vocal features in a given time frame to simultaneous facial features). Brand fails to teach the data processing device configured to generate a simulation of a voice signal in response to receiving the image data by performing operations comprising generating, based on the mapping, the voice signal to simulate a voice of the speaker, the voice signal comprising the signal feature. In a similar field of endeavor, Yang teaches a method for audio synthesis adapted to video characteristics, wherein the video characteristics are based on mouth shape characteristics of a speaker that are extracted from an input video (Fig. 2; para 0008-0014; 0030-0039; 0043). Yang suggests the invention is advantageous in providing dubbing for video without requiring manual operations (para 0004) and for providing speech more adapted to the video characteristics to be added (para 0020).
One having ordinary skill in the art would have recognized the advantages of implementing the audio synthesis processing techniques of Yang in the animation system of Brand, for the purpose of enhancing the animation display by providing dubbing features to the animation and/or providing audio more adapted to the animation, as suggested by Yang. 
Regarding claim 12, Brand fails to teach the data processing device is configured to receive auxiliary data about the speaker comprising an age, a height, a gender, an ethnicity, or a body-mass index (BMI) value.   Yang teaches the sex and age of the 
Regarding claim 13, Brand fails to teach, but Yang teaches, the data processing device is configured to estimate one or more body indices of the speaker based on the auxiliary data, wherein the visual representation of the speaker comprises a full-body representation based on the one or more body indices [para 0044 – where providing Yang’s body image data to represent full-body representation is an obvious step requiring routine skill in the art]. One having ordinary skill in the art would have recognized the advantages of implementing the specific speaker characteristics from a body image, as suggested by Yang, for the purpose of enhancing the user experience with the system by generating personalized animation/synthetic representations.
Regarding claim 14, Brand fails to teach, but Yang teaches where the body indices are represented by a vector that includes a number of linear and volumetric characterizations of a body of the speaker [Yang’s AI model, where utilizing a vector of various characterizations of data for training and using an AI model so as to determine an optimum function/model for a system is an obvious step requiring only routine skill in the art].  One having ordinary skill in the art would have recognized the advantages of 
Regarding claim 15, Brand fails to teach, but Yang teaches, a relation between visual features and the body indices is modeled by a neural network that is trained from training data comprising at least one of image data representing faces of speakers and voice signals [Yang’s AI model, where utilizing a neural network of various characterizations of data for training and using the neural network so as to determine an optimum function/model for a system is an obvious step requiring only routine skill in the art]. One having ordinary skill in the art would have recognized the advantages of implementing the specific speaker characteristics from a body image, as suggested by Yang, for the purpose of enhancing the user experience with the system by generating personalized animation/synthetic representations.
Regarding claim 17, the combination of Brand and Yang teaches wherein mapping comprises: determining, by voice quality generation logic, a voice quality of the voice signal comprising one or more spectral features (a mix of LPC and RASTA-PLP features are calculated as discussed by H. Hermansky and N. Morgan, Rasta processing of speech, IEEE Transactions on Speech and Audio Processing, 2(4):578-589, October 1994 – where the LPC and PLP are representative of vocal tract information of speech); and determining, by content generator logic, a style of the voice signal, a language of the voice signal, or an accent for the voice signal that includes the one or more spectral features (Yang’s gender and age characteristics, since gender and age affect speech characteristics and therefore provide a form of style of the voice
Regarding claim 18, the combination of Brand and Yang teaches where the voice quality generator logic is configured to map visual features derived from facial images to estimates of one or more subcomponents of voice quality (Brand at col. 4, lines 31-36; col. 4, lines 50-53 continuing to col. 5, line 10; col. 5, lines 55-57 -- ultimate goal is to learn a mapping from the vocal features in a given time frame to simultaneous facial features).
Regarding claim 19, the combination of Brand and Yang teaches wherein the voice quality generation logic determines the voice quality based on training data comprising facial image-voice quality pairs (Brand at col. 4, lines 31-36; col. 4, lines 50-53 continuing to col. 5, line 10; col. 5, lines 55-57 -- ultimate goal is to learn a mapping from the vocal features in a given time frame to simultaneous facial features).
Regarding claim 20, the combination of Brand and Yang teaches wherein the voice quality generation logic determines the voice quality based on a known relationship between visual features, the known relationship being derived from a plurality of images and voice qualities data (Brand at col. 4, line 32 continuing to col. 5, line 26 -- The link between the video and the audio is provided by a facial state sequence 82 which is a series of facial images in a facial space. Combining the facial state sequence 82 with the vocal data 78 produces a set of vocal mappings 86, one for each facial state. The facial mappings, the facial dynamics, and the vocal mappings constitute the learned model from which a vocal/facial hidden Markov model 84 is produced; col. 5, line 55 continuing to col. 7, line 6 – vocal/facial HMM model…. Using entropic training, one estimates a facial dynamical model from the poses and velocities output by the vision system. One then uses a dynamic programming analysis to find the most probable sequence of hidden states given the training video).

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Cao et al (US Patent Application Publication No. 2019/0130628) discloses a joint audio-video facial animation system.
Lu et al (US Patent Application Publication No. 2011/0227931) discloses a method and apparatus for changing lip shape and obtaining lip animation in voice-driven animation.





Any inquiry concerning this communication or earlier communications from the examiner should be directed to ANGELA A ARMSTRONG whose telephone number is (571)272-7598.  The examiner can normally be reached on M,T,TH,F 11:30-8:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Desir, can be reached at 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.


ANGELA A. ARMSTRONG
Primary Examiner
Art Unit 2659



/ANGELA A ARMSTRONG/Primary Examiner, Art Unit 2659