Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claim 12 recites the limitation "the parameters" in “…wherein the operation of the neural network is based at least in part on the parameters…”.  There is insufficient antecedent basis for this limitation in the claim.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains.  Patentability shall not be negated by the manner in which the invention was made.


Claims 1-6, 10, 12 are rejected under 35 U.S.C. 103 as being unpatentable over Santos et al (US20200302667) in view of Karras et al ("Audio-driven facial animation by joint end-to-end learning of pose and emotion." ACM Transactions on Graphics (TOG) 36.4 (2017): 1-12).

Regarding Claim 1. Santos teaches A non-transitory computer readable medium storing a computer program, wherein when the computer program is executed by one or more processors of a computing device, the computer program performs a method for generating facial animation (Santos, abstract, the invention describes a computer-implemented method for generating a machine learned
model to generate facial position data based on audio data comprising training a conditional variational autoencoder having an encoder and decoder. The training comprises receiving a set of training data items, each training data item comprising a facial position descriptor and an audio descriptor; processing one or more of the training data items using the encoder to obtain distribution parameters; sampling a latent vector from a latent space distribution based on the distribution parameters; processing the latent vector and the audio descriptor using the decoder to obtain a facial position output; calculating a loss value based at least in part on a comparison of the facial position output and the facial position descriptor of at least one of the one or more training data items; and updating parameters of the conditional variational autoencoder based at least in part on the calculated loss value.
[0013] FIG. 1 is a flow diagram illustrating an example method 100 of training a conditional variational autoencoder for use in generating a machine-learned model to generate facial position data based on audio data. The method is performed by executing computer-readable instructions using one or more processors of one or more computing devices.), and the method includes:
inputting two or more training input data into a facial animation generation model (Santos, [0014] As shown in FIG. 1, step 110, a set of training data items is received. Each of the training data items may include a facial position descriptor and one or more audio descriptors. Each of the training data items may also include additional descriptors and/or parameters. For example, one or more descriptors indicating an actor associated with the one or more audio descriptors and/or one or more descriptors indicating a character associated with the facial positions may be included.);

Santos fails to explicitly teach, however, Karras teaches training a common feature of the two or more training input data using a first network function included in the facial animation generation model (Karras, abstract, the paper describes a machine learning technique for driving 3D facial animation by audio input in real time and with low latency. The deep neural network learns a mapping from input waveforms to the 3D vertex coordinates of a face model, and simultaneously discovers a compact, latent code that disambiguates the variations in facial expression that cannot be explained by the audio alone. During inference, the latent code can be used as an intuitive control for the emotional state of the face puppet. Even though our primary goal is to model the speaking style of a single actor, the model yields reasonable results even when driven with audio from other speakers with different gender, accent, or language,
Page 94:4, col 1, par 1-2, the deep neural network consists of one special-purpose layer, 10 convolutional layers, and 2 fully-connected layers. We divide it in three conceptual parts, illustrated in Figure 1 and Table 1. The system starts by feeding the input audio window to a formant analysis network to produce a time-varying sequence of speech features that will subsequently drive articulation. The network first extracts raw formant information using fixed-function autocorrelation analysis (Section 3.2) and then refines it with 5 convolutional layers. Through training, the convolutional layers learn to extract short term features that are relevant for facial animation,
Page 94:4, col 1, par 3, Next, the result is fed to an articulation network that consists of 5 further convolutional layers that analyze the temporal evolution of the features and eventually decide on a single abstract feature vector that describes the facial pose at the center of the audio window. As a secondary input, the articulation network accepts a (learned) description of emotional state to disambiguate between different facial expressions and speaking styles (Section 3.3). The emotional state is represented as an E-dimensional vector that is concatenated directly onto the output of each layer.
Page 94:5, col 1, par 1-2, Inferring facial animation from speech is an inherently ambiguous problem, because the same sound can be produced with very different facial expressions. This is especially true with the eyes and eyebrows, since they have no direct causal relationship with sound production. Such ambiguities are also problematic for deep neural networks, because the training data will inevitably contain cases where nearly identical audio inputs are expected to produce very different output poses. Our approach for resolving these ambiguities is to introduce a secondary input to the network. We associate a small amount of additional, latent data with each training sample, so that the network has enough information to unambiguously infer the correct output pose. Informally, we wish the secondary input to represent the emotional state of the actor. Besides resolving ambiguities in the training data, the secondary input is also highly useful for inference: it allows us to mix and match different emotional states with a given vocal track to provide powerful control over the resulting animation.
Therefore, for different videos with the same audio input, the facial expressions may differ. The audio is thus the common feature across the varying facial animations.); and
training an independent feature of each of the two or more training input data using a second network function to cause the facial animation generation model to generate a facial animation according to input data (Karras, page 94:8, col 1, par 6, when inferring the facial pose for novel audio, we need to supply the network with an emotional state vector as a secondary input. As part of training, the network has learned a latent E-dimensional vector for each training sample, and our strategy is to mine this emotion database for robust emotion vectors that can be used during inference.
Page 94:9, col 1, par 2, We then examine the output of the network for several novel audio clips with every remaining emotion vector, and assign a semantic meaning (e.g., "neutral", "amused", "surprised", etc.) to each of them, depending on the emotional state they convey (Figure 5). Which semantic emotions remain depends entirely on the training material, and it will not be possible to extract, e.g., a "happy" emotion if the training data does not contain enough such material to be generalizable to novel audio. Figure 6 shows inferred facial poses for Character 1 using the same audio window but different emotion vectors. As can be seen, even after removing all but the best performing emotion vectors there is still substantial variation to choose from.).
Santos and Karras are analogous art because both teach methods of creating facial animation based on image and/or audio input data. Karras further teaches adding an emotional state vector when processing the same audio input with different facial expressions. Therefore, it would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to modify the facial animation creation method taught in Santos to further use the emotional state vector taught in Karras to refine the facial animation, so as to disambiguate the variations in facial expression that cannot be explained by the audio alone (Karras, abstract).

Regarding Claim 2. The combination of Santos and Karras further teaches The non-transitory computer readable medium according to claim 1, wherein the first network function includes a first sub-network function and a second sub-network function composed of one or more dimension reduction layers, the first sub-network function computes two or more input data to output a common voice feature of the two or more input data, and the second sub-network function computes the common voice feature and outputs a feature vector about a common facial pose corresponding to the two or more input data (Karras, page 94:4, col 1, par 3, Next, the result is fed to an articulation network that consists of 5 further convolutional layers that analyze the temporal evolution of the features and eventually decide on a single abstract feature vector that describes the facial pose at the center of the audio window. As a secondary input, the articulation network accepts a (learned) description of emotional state to disambiguate between different facial expressions and speaking styles (Section 3.3). The emotional state is represented as an E-dimensional vector that is concatenated directly onto the output of each layer.
Page 94:4, col 1, par 4, Each layer l outputs Fl x Wl x Hl activations, where Fl is the number of abstract feature maps, Wl is the dimension of the time axis, and Hl is the dimension of the formant axis. We use strided 1x3 convolutions in the formant analysis network to gradually reduce Hl while increasing Fl, i.e., to push raw formant information to the abstract features, until we have Hl = 1 and Fl = 256 at the end. Similarly, we use 3x1 convolutions in the articulation network to decrease Wl, i.e., to subsample the time axis by combining information from the temporal neighborhood.
Page 94:4, col 1, par 5, The articulation network outputs a set of 256+E abstract features that together represent the desired facial pose. We feed these features to an output network to produce the final 3D positions of 5022 control vertices in our tracking mesh. The output network is implemented as a pair of fully-connected layers that perform a simple linear transformation on the data. The first layer maps the set of input features to the weights of a linear basis, and the second layer calculates the final vertex positions as a weighted sum over the corresponding basis vectors.).
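For illustration, the output stage quoted from Karras above (256+E features, a first fully-connected layer producing basis weights, and a second layer forming vertex positions as a weighted sum of basis vectors) can be sketched as follows. This is a minimal sketch under stated assumptions, not the reference's implementation: the emotion-vector length E, the number of basis vectors, the random weights, and the names linear and output_network are hypothetical; only the 256 features and the 5022 vertices come from the quoted passage.

```python
import random

E = 16                # assumed length of the emotional-state vector
NUM_BASIS = 8         # assumed number of linear basis vectors
NUM_VERTICES = 5022   # control vertices in the tracking mesh (quoted passage)


def linear(x, weight, bias):
    """Fully-connected layer; weight holds one row per output element."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def output_network(features, w1, b1, w2, b2):
    """Two linear layers: features -> basis weights -> vertex positions."""
    basis_weights = linear(features, w1, b1)   # layer 1: weights of the basis
    flat = linear(basis_weights, w2, b2)       # layer 2: weighted sum of basis vectors
    # Group the flat output into one (x, y, z) triple per vertex.
    return [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]


random.seed(0)
# Stand-ins for the articulation network output and the emotion vector.
articulation_features = [random.gauss(0, 1) for _ in range(256)]
emotion = [random.gauss(0, 1) for _ in range(E)]
features = articulation_features + emotion     # 256 + E values

w1 = [[random.gauss(0, 0.01) for _ in range(256 + E)] for _ in range(NUM_BASIS)]
b1 = [0.0] * NUM_BASIS
# Column j of w2 plays the role of basis vector j (3 coordinates per vertex).
w2 = [[random.gauss(0, 1) for _ in range(NUM_BASIS)] for _ in range(NUM_VERTICES * 3)]
b2 = [0.0] * (NUM_VERTICES * 3)

vertices = output_network(features, w1, b1, w2, b2)
print(len(features), len(vertices), len(vertices[0]))
```

The sketch only demonstrates the dimensionality of the mapping; it does not bear on how the weights are initialized or trained.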

Regarding Claim 3. The combination of Santos and Karras further teaches The non-transitory computer readable medium according to claim 1, wherein each of the two or more training input data is matched with facial feature data (Karras, abstract, the paper describes a machine learning technique for driving 3D facial animation by audio input in real time and with low latency. The deep neural network learns a mapping from input waveforms to the 3D vertex coordinates of a face model, and simultaneously discovers a compact, latent code that disambiguates the variations in facial expression that cannot be explained by the audio alone.
Page 94:4, col 1, par 1-2, the deep neural network consists of one special-purpose layer, 10 convolutional layers, and 2 fully-connected layers. We divide it in
three conceptual parts, illustrated in Figure 1 and Table 1. The system starts by feeding the input audio window to a formant analysis network to produce a time-varying sequence of speech features that will subsequently drive articulation. The network first extracts raw formant information using fixed-function autocorrelation analysis (Section 3.2) and then refines it with 5 convolutional layers. Through training, the convolutional layers learn to extract short term features that are relevant for facial animation,), and the second network function is composed of one or more dimension expand layers and includes two or more parallel third sub-network functions associated with the facial feature data (Karras, page 94:4, col 1, par 3, Next, the result is fed to an articulation network that consists of 5 further convolutional layers that analyze the temporal evolution of the features and eventually decide on a single abstract feature vector that describes the facial pose at the center of the audio window. As a secondary input, the articulation network accepts a (learned) description of emotional state to disambiguate between different facial expressions and speaking styles (Section 3.3). The emotional state is represented as an E-dimensional vector that is concatenated directly onto the output of each layer.
Each layer l outputs Fl x Wl x Hl activations, where Fl is the number of abstract feature maps, Wl is the dimension of the time axis, and Hl is the dimension of the formant axis. We use strided 1x3 convolutions in the formant analysis network to gradually reduce Hl while increasing Fl, i.e., to push raw formant information to the abstract features, until we have Hl = 1 and Fl = 256 at the end. Similarly, we use 3x1 convolutions in the articulation network to decrease Wl, i.e., to subsample the time axis by combining information from the temporal neighborhood.
Therefore, for each layer, at least a time axis and a temporal neighborhood axis are considered together with the feature axis.).

Regarding Claim 4. The combination of Santos and Karras further teaches The non-transitory computer readable medium according to claim 3, wherein the third sub-network function performs a computation based on a one-hot vector, which is a representation of the facial feature data associated with the third subnetwork function, and determines a location of two or more vertex included in facial animation (Santos, [0027] In step 140, the latent vector and the audio descriptor are processed using the decoder to obtain a facial position output. The latent vector may be input to the first layer of the decoder. The audio descriptor may be input to one or more subsequent layers of the decoder. Alternatively, the latent vector and audio descriptor, or a part thereof, may be concatenated and input into the first layer of the decoder as a combined vector.
[0028] In one example, a audio descriptors in matrix form, the audio descriptors representing slightly overlapping windows of audio; an n-dimensional latent vector; and an actor descriptor, in the form of a m-dimensional one-hot encoded vector, C, where m corresponds to the number of actors in the training set, are processed by the decoder. The latent vector and the actor descriptor are concatenated to give a (n+m)-dimensional vector, ZC. The a audio descriptors are passed through convolutional layers that increase their number of channels, i.e. their depth, but decrease their width and height. For each of the a audio descriptors, an l-dimensional vector is output by the convolutional layers. These a vectors are processed using one or more recurrent layers to output a single l-dimensional vector, A, representing the audio descriptors. The vectors ZC and A are concatenated into a (n+m+l)-dimensional vector, ZCA. The vector ZCA is processed by one or more fully connected layers which map it to a k-dimensional vector, the number of parameters in the face descriptor vector, each of the elements of the vector corresponding to a parameter of the face descriptor input to the network.
The m-dimensional one-hot vectors are for a plurality of actors and they map to a number of face descriptor vectors.).
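The concatenation described in Santos [0028] can be illustrated with a minimal sketch, assuming small example sizes for n, m and l and a precomputed audio summary vector A (the convolutional and recurrent layers that would produce A are omitted). The function names one_hot and build_decoder_input are hypothetical and do not appear in the reference.

```python
def one_hot(index, size):
    """Return a one-hot vector of length `size` with a 1.0 at `index`."""
    if not 0 <= index < size:
        raise ValueError("actor index out of range")
    return [1.0 if i == index else 0.0 for i in range(size)]


def build_decoder_input(latent, actor_index, num_actors, audio_summary):
    """Concatenate Z (n values), C (m values) and A (l values) into ZCA."""
    c = one_hot(actor_index, num_actors)   # m-dimensional actor descriptor C
    zc = list(latent) + c                  # (n+m)-dimensional vector ZC
    return zc + list(audio_summary)        # (n+m+l)-dimensional vector ZCA


# Assumed example sizes; the reference does not fix these values.
n, m, l = 4, 3, 5
latent = [0.1] * n            # stand-in for the sampled latent vector
audio_summary = [0.5] * l     # stand-in for the audio vector A

zca = build_decoder_input(latent, actor_index=1, num_actors=m,
                          audio_summary=audio_summary)
print(len(zca))
```

In the reference, ZCA is then mapped by fully connected layers to the k-dimensional face descriptor vector; that step is not shown here.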

Regarding Claim 5. The combination of Santos and Karras further teaches The non-transitory computer readable medium according to claim 3, wherein initial weights of at least one layer included in the third sub-network function are determined based on principal component analysis data of training data included in the training data subset associated with the facial feature data associated with the third sub-network function (Karras, page 94:4, col 1, par 5, the articulation network outputs a set of 256+E abstract features that together represent the desired facial pose. We feed these features to an output network to produce the final 3D positions of 5022 control vertices in our tracking mesh. The output network is implemented as a pair of fully-connected layers that perform a simple linear transformation on the data. The first layer maps the set of input features to the weights of a linear basis, and the second layer calculates the final vertex positions as a weighted sum over the corresponding basis vectors.).

Regarding Claim 6. The combination of Santos and Karras further teaches The non-transitory computer readable medium according to claim 1, wherein the method further includes: inputting emotional state data matched to the input data into at least one layer of the first network function or the second network function (Karras, page 94:5, col 1, par 1-2, Inferring facial animation from speech is an inherently ambiguous problem, because the same sound can be produced with very different facial expressions. This is especially true with the eyes and eyebrows, since they have no direct causal relationship with sound production. Such ambiguities are also problematic for deep neural networks, because the training data will inevitably contain cases where nearly identical audio inputs are expected to produce very different output poses. Our approach for resolving these ambiguities is to introduce a secondary input to the network. We associate a small amount of additional, latent data with each training sample, so that the network has enough information to unambiguously infer the correct output pose. Informally, we wish the secondary input to represent the emotional state of the actor. Besides resolving ambiguities in the training data, the secondary input is also highly useful for inference: it allows us to mix and match different emotional states with a given vocal track to provide powerful control over the resulting animation.
Page 94:8, col 1, par 6, when inferring the facial pose for novel audio, we need to supply the network with an emotional state vector as a secondary input. As part of training, the network has learned a latent E-dimensional vector for each training sample, and our strategy is to mine this emotion database for robust emotion vectors that can be used during inference.
Page 94:9, col 1, par 2, We then examine the output of the network for several novel audio clips with every remaining emotion vector, and assign a semantic meaning (e.g., "neutral", "amused", "surprised", etc.) to each of them, depending on the emotional state they convey (Figure 5). Which semantic emotions remain depends entirely on the training material, and it will not be possible to extract, e.g., a "happy" emotion if the training data does not contain enough such material to be generalizable to novel audio. Figure 6 shows inferred facial poses for Character 1 using the same audio window but different emotion vectors. As can be seen, even after removing all but the best performing emotion vectors there is still substantial variation to choose from.).

Claim 10 is similar in scope to Claim 1 and is thus rejected under the same rationale.

Claim 12 is similar in scope to Claim 1 and is thus rejected under the same rationale. Claim 12 further requires:
A non-transitory computer readable medium storing data structure
corresponding to weights of a neural network, at least one of the weights being updated during the training process, wherein the operation of the neural network is based at least in part on the parameters (Santos, [0035] In step 160, the parameters of the conditional variational autoencoder are updated based at least in part on the calculated loss value. The updates to the parameters may be calculated using backpropagation. In backpropagation, the calculated loss value, or a value derived from it, are backpropagated through the network to calculate derivatives of the loss with respect to a given network parameter of the conditional variational autoencoder, e.g. network weights. The parameters of the conditional variational autoencoder may then be updated by gradient descent using the calculated derivatives. As discussed above, the 'reparametization trick' may facilitate the use of gradient descent to train the network. With the reparametization trick, the layer of the network calculating the latent vector may be backpropagated through to the encoder via deterministic latent space distribution parameters, with the stochastic element contained in the sampled vector, ε. 
[0088] The parameters of the encoder 530 and decoder 570, e.g. their neural network weights and biases, may be updated based on the loss value calculated using the loss calculator 590. The updates to the parameters may be calculated using backpropagation. In backpropagation, the calculated loss value, or a value derived from it, are backpropagated through the network to calculate derivatives of the loss with respect to a given network parameter of the encoder 530 or decoder 570. The parameters of the autoencoder may then be updated by gradient descent using the calculated derivatives.).
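The two mechanisms quoted from Santos [0035] and [0088], the reparameterization of the latent sample and the gradient descent update of the weights, can be illustrated with a minimal sketch. It assumes the distribution parameters are a mean and a log-variance (a common convention that the quoted passages do not state) and uses a toy quadratic loss in place of the reference's loss; the function names are hypothetical.

```python
import math


def reparameterize(mu, log_var, eps):
    """z = mu + sigma * eps; the randomness lives only in eps.

    Assumes the encoder outputs a mean and a log-variance per dimension.
    """
    return [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mu, log_var, eps)]


def gradient_descent_step(params, grads, learning_rate):
    """Move each parameter against its loss derivative."""
    return [p - learning_rate * g for p, g in zip(params, grads)]


# With eps = 0 the sample collapses to the mean, showing that mu and log_var
# enter deterministically and can therefore be backpropagated through.
z = reparameterize(mu=[1.0, -2.0], log_var=[0.0, 0.0], eps=[0.0, 0.0])

# Toy loss (w - 3)^2 with derivative 2 * (w - 3), standing in for the
# backpropagated derivative of the loss with respect to a network weight.
weights = [0.0]
grads = [2.0 * (weights[0] - 3.0)]
weights = gradient_descent_step(weights, grads, learning_rate=0.1)
print(z, weights)
```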

Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over Santos et al in view of Karras et al further in view of Asadiabadi et al ("Multimodal speech driven facial shape animation using deep neural networks.", IEEE, 2018).

Regarding Claim 7. The combination of Santos and Karras further teaches The non-transitory computer readable medium according to claim 6, wherein the inputting emotional state data matched to the input data into at least one layer of the first network function or the second network function (Karras, page 94:4, col 1, par 3, Next, the result is fed to an articulation network that consists of 5 further convolutional layers that analyze the temporal evolution of the features and eventually decide on a single abstract feature vector that describes the facial pose at the center of the audio window. As a secondary input, the articulation network accepts a (learned) description of emotional state to disambiguate between different facial expressions and speaking styles (Section 3.3). The emotional state is represented as an E-dimensional vector that is concatenated directly onto the output of each layer.) includes: 

The combination of Santos and Karras fails to explicitly teach, however, Asadiabadi teaches inputting the emotional state data into at least one layer except a last layer of the first network function (Asadiabadi, abstract, the paper describes a deep learning multimodal approach for speech driven generation of face animations. The method utilizes both acoustic features and phoneme label features to generate natural looking speaker independent lip animations synchronized with affective speech. A phoneme-based model qualifies generation of speaker independent

Page 4, col 1, par 5, col 2, par 1, The proposed multimodal approach is a combination of the text-based and speech based networks, hence gaining the advantage of text features for a speaker independent model and benefiting from the speech features for discriminating different affective content. A fusion strategy is utilized to update the output layer’s weights according to the merged hidden neurons of the two modalities, during optimization. Text and speech features are fed separately to the network and later concatenated in last layer of the network (before the output layer). The fused neurons are then connected to the fully connected output regression layer as shown in Figure 3. Similar deep MLP and CNN structures were employed, as described in sections II-D1 and II-D2, for text and speech inputs, respectively.
Therefore, feature (emotional state) data is attached to the network layers before the output layer (last layer).).
Santos, Karras and Asadiabadi are analogous art because they all teach methods of creating facial animation based on image and/or audio input data. Asadiabadi further teaches using multimodal features (acoustic features and phoneme label features) to create animation. Therefore, it would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to modify the facial .

Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Santos et al in view of Karras et al further in view of Bhat et al (US10169905).

Claim 11 is similar in scope to Claim 1 and is thus rejected under the same rationale. Claim 11 further requires:
The combination of Santos and Karras fails to explicitly teach, however, Bhat teaches A server for generating facial animation, comprising: a processor including one or more cores; and a memory (Bhat, abstract, the invention describes system and methods for computer animations of 3D models of heads generated from images of faces is disclosed. A 2D captured image that includes an image of a face can be received and used to generate a static 3D model of a head. A rig can be fit to the static 3D model to generate an animation-ready 3D generative model. Sets of rigs can be parameters that each map to particular sounds. These mappings can be used to generate a playlists of sets of rig parameters based upon received audio content. The playlist may be played in synchronization with an audio rendition of the audio content.
Col 6, lines 1-18, in several embodiments, the system can also animate the rigged or orientation-ready 3D model of the head mapping to rig parameters audio samples and/or video data. In accordance with some other embodiments, the processes are performed by a "cloud" server system, a user device, and/or a combination of devices local and/or remote from a user.);
Santos, Karras and Bhat are analogous art because they all teach methods of creating facial animation based on image and/or audio input data. Bhat further teaches implementing the facial animation creation method on a cloud server system. Therefore, it would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to modify the facial animation creation method taught in Santos and Karras to be implemented on a cloud server system as taught in Bhat, so as to allow an online game player to easily use the facial animation creation method to animate an avatar (Bhat, col 26, lines 1-8).

Allowable Subject Matter
Claims 8-9 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Regarding Claim 8, the following is a statement of reason for the indication of allowable subject matter: The prior art of record either alone or in combination fails to teach or suggest: “…wherein the facial animation generation model is trained by differently setting an update rate of weights for the first network function and remaining layers except at least one layer of the second network function and an update rate of weights for the at least one layer of the second network function, during a predetermined epoch”, in the context of claim 8.
Therefore, Claim 8 is allowable.
Regarding Claim 9, the following is a statement of reason for the indication of allowable subject matter: The prior art of record either alone or in combination fails to teach or suggest: “…wherein the facial animation generation model is trained by updating weights only for the first network function and remaining layers except at least one layer of the second network function, except for the at least one layer of the second network function, during a predetermined epoch, when a back propagation is performed based on an error of an output obtained by computing the two or more training input data as inputs of the face animation generation model and two or more training facial animations”, in the context of claim 9.
Therefore, Claim 9 is allowable.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to XIN SHENG whose telephone number is (571)272-
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kee Tung, can be reached at (571)272-7794. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/Xin Sheng/Primary Examiner, Art Unit 2611