Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment
Applicant’s amendments and remarks submitted 02/28/2022 have been entered and considered. Claims 7, 9, 17, and 20 are amended. Claims 6 and 8 are cancelled. Claims 21-22 are new. This action is made final.

Response to Arguments
Applicant’s arguments filed on 02/28/2022 have been fully considered but are not persuasive.
Applicant argues “Ganin teaches away from combination with Saragih. For example, on page 4, Ganin describes that "[less] related to our approach are methods that aim to solve the gaze problem in videoconferencing via synthesizing 3D rotated views of either the entire scene or of the face (that is subsequently blended into the unrotated head)." (See page 4, lines 8-11 of Ganin). As Saragih discloses a method wherein 3D rotated views of the face are blended into the unrotated head, Ganin explicitly teaches away from combination with Saragih…Ganin describes that a general problem with applying the approach disclosed by Ganin to methods such as Saragih that blend 3D rotated views into the unrotated head "is how to fill disoccluded regions." (Id. at lines 12-13). Moreover, Ganin further warns that for 3D rotated views that are specifically of the face (as opposed to the entire scene), such as that used by Saragih, "there is also a danger of distorting head proportions characteristic to a person." (Id. at lines 13-15). Accordingly, Ganin teaches away from combination with Saragih… Furthermore, Applicant submits that the combination of Saragih and Ganin is improper at least because modifying the system of Saragih with Ganin would render the system of Saragih inoperable for its intended purpose of accurately mimicking the facial expressions of the user's face.”.
However, the abstract of Saragih describes that a computing system may access a plurality of first captured images that are captured in a first spectral domain, generate, using a first machine-learning model, a plurality of first domain-transferred images based on the first captured images, wherein the first domain transferred images are in a second spectral domain, render, based on a first avatar, a plurality of first rendered images comprising views of the first avatar, and update the first machine-learning model based on comparisons between the first domain-transferred images and the first rendered images, wherein the first machine-learning model is configured to translate images in the first spectral domain to the second spectral domain. The system may also generate, using a second machine-learning model, the first avatar based on the first captured images. The first avatar may be rendered using a parametric face model based on a plurality of avatar parameters.
Therefore, Saragih teaches using an input partial image of the device wearer to generate, by means of a machine-learning model, an avatar that mimics the facial expressions of the user’s face.
The abstract of Ganin describes a method of generating highly realistic images of a given face with a redirected gaze. We treat this problem as a specific instance of conditional image generation and suggest a new deep architecture that can handle this task very well as revealed by numerical comparison with prior art and a user study. Our deep architecture performs coarse-to-fine warping with an additional intensity correction of individual pixels. All these operations are performed in a feed-forward manner, and the parameters associated with different operations are learned jointly in the end-to-end fashion. After learning, the resulting neural network can synthesize images with manipulated gaze, while the redirection angle can be selected arbitrarily from a certain range and provided as an input to the network.
Therefore, Ganin teaches using machine learning to transform input images into new photorealistic images that differ from the input, for example, in viewing angle.
Saragih and Ganin are analogous art because both teach methods of using machine learning to render a 3D model with different variations. Ganin further teaches rendering a 3D model at different angles and with greater photorealism. Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the machine learning method for 3D model rendering (taught in Saragih) to further use the warping modules to map the input image to final photorealistic images with different angles (taught in Ganin), so as to provide a method for digital alteration of images in a real-time or near real-time manner (Ganin, page 2, par 2).
There is no description in Saragih indicating “blend 3D rotated views into the unrotated head”. Ganin at page 4, par 2, describing related studies regarding solving 

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1-5, 7, 10-13, and 15-19 are rejected under 35 U.S.C. 103 as being unpatentable over Saragih et al. (US20200402284) in view of Ganin et al. ("Deepwarp: Photorealistic image resynthesis for gaze manipulation", European Conference on Computer Vision, Springer, Cham, 2016).

Regarding Claim 1. Saragih teaches An apparatus comprising:
at least one processor;
a memory storing instructions that, when executed by the at least one processor, perform a method for computing an image depicting a face of a wearer of a head mounted display (HMD), as if the wearer was not wearing the HMD,
(Saragih, abstract, describes that a computing system may access a plurality of first captured images that are captured in a first spectral domain, generate, using a first machine-learning model, a plurality of first domain-transferred images based on the first captured images, wherein the first domain transferred images are in a second spectral domain, render, based on a first avatar, a plurality of first rendered images comprising views of the first avatar, and update the first machine-learning model based on comparisons between the first domain-transferred images and the first rendered images, wherein the first machine-learning model is configured to translate images in the first spectral domain to the second spectral domain. The system may also generate, using a second machine-learning model, the first avatar based on the first captured images. The first avatar may be rendered using a parametric face model based on a plurality of avatar parameters.
[0003] In particular embodiments, a system may automatically map images captured by a head-mounted capture device to status (e.g., facial expressions) of a user's avatar. The headset has IR cameras that provide partial images of the face. An avatar that mimics the facial expressions of the user's face may be animated based on the images from the IR cameras mounted in the headset. The avatar may be displayed by headsets of the user and other users during a telepresence session, for example. The avatar may be animated by determining a set of parameters that represent the 3D shape of a face and providing the parameters to an avatar generator that generates the 3D shape.)

comprising:
accessing an input image depicting a partial view of the wearer's face captured from at least one face facing capture device in the HMD (Saragih, [0031] FIG. 1A illustrates an example method for creating an avatar based on images captured in the visible-light spectral domain. One or more visible-light ("RGB") cameras 108 may capture one or more RGB images 110 of a user 102 based on visible light.
[0032] FIG. 1B illustrates an example method for training a domain-transfer machine-learning (ML) model 114 to transfer images between different spectral domains. Spectral domains may include infrared, visible light, or other domains in which images may be captured by cameras. A headset 104 worn by a user 102 may have one or more infrared (IR) cameras, which may capture one or more IR images 106. A domain-transfer ML model training process 112 may train the domain-transfer ML model 114 based on the IR images 106 and rendered images 116. The rendered images 116 may be rendered by a renderer 115 based on the parametric face model 109 and facial expression code 113 generated from the RGB images 110. The domain-transfer ML model 114 may be trained on each frame received from the IR cameras of the headset 104.);
accessing an expression system comprising a machine learning model which has been trained to compute expression parameters from the input image,
accessing a three-dimensional (3D) face model that has expressions parameters (Saragih, [0036] FIG. 2 illustrates an example method for training an avatar parameter-extraction machine-learning (ML) model 206 to extract avatar parameters from images 106 in the infrared spectral domain. An avatar parameter extraction model training process 204 may train the avatar parameter extraction model 206 using the IR images 106, the parametric face model 109, and the domain transfer ML model 114. The domain transfer ML model 114 may be used to generate domain-transferred RGB images 118 based on IR images 106, and the avatar parameter extraction model training process 204 may learn avatar parameters that cause the parametric face model 109 to render an avatar having an appearance similar to the domain-transferred RGB images 118. The avatar parameters may include a facial expression code 113 and a head pose (not shown).);

Saragih fails to explicitly teach, but Ganin teaches, accessing a photorealiser being a machine learning model trained to map images rendered from the 3D face model to photorealistic images (Ganin, abstract, the paper describes a method of generating highly realistic images of a given face with a redirected gaze. We treat this problem as a specific instance of conditional image generation and suggest a new deep architecture that can handle this task very well as revealed by numerical comparison with prior art and a user study. Our deep architecture performs coarse-to-fine warping with an additional intensity correction of individual pixels. All these operations are performed in a feed-forward manner, and the parameters associated with different operations are learned jointly in the end-to-end fashion. After learning, the resulting neural network can synthesize images with manipulated gaze, while the redirection angle can be selected arbitrarily from a certain range and provided as an input to the network.
Page 2, par 3, all of these scenarios put very high demands on the realism of the result of the digital alteration, and some of them also require real-time or near real-time operation. To meet these challenges, we develop a new deep feed-forward architecture that combines several principles of operation (coarse-to-fine processing, image warping, intensity correction). The architecture is trained end-to-end in a supervised way using a specially collected dataset that depicts the change of the appearance under gaze redirection in real life.
Page 4, par 3-4, In this section, we discuss the architecture of our deep model for re-synthesis. The model is trained on pairs of images corresponding to eye appearance before and after the redirection. The redirection angle serves as an additional input parameter that is provided both during training and at test time. As in [16], the bulk of gaze redirection is accomplished via warping the input image (Figure 2). The task of the network is therefore the prediction of the warping field. This field is predicted in two stages in a coarse-to-fine manner, where the decisions at the fine scale are being informed by the result of the coarse stage. Beyond coarse-to-fine warping, the photorealism of the result is improved by performing pixel-wise correction of the brightness where the amount of correction is again predicted by the network. All operations outlined above are implemented in a single feed-forward architecture and are trained jointly end-to-end.
Page 5, par 3-4, each of the two warping modules takes as an input the image, the position of the feature points, and the redirection angle. All inputs are expressed as maps as discussed below, and the architecture of the warping modules is thus “fully-convolutional”, including several convolutional layers interleaved with Batch Normalization layers [11] and ReLU non-linearities (the actual configuration is shown in the Appendix).
Coarse warping. The last convolutional layer of the first (half-scale) warping module produces a pixel-flow field (a two-channel map), which is then upsampled Dcoarse(I, α) and applied to warp the input image by means of a bilinear sampler S [12,21] that finds the coarse estimate. 
Page 6, par 1, Fine warping. In the fine warping module, the rough image estimate Ocoarse and the upsampled low-resolution flow Dcoarse(I, α) are concatenated with the input data (the image, the angle encoding, and the feature point encoding) at the original scale and sent to the 1X-scale network which predicts another two-channel flow Dres that amends the half-scale pixel-flow. The amended flow is used to obtain the final output (again, via bilinear sampler).
Page 6, par 3, As discussed above, alongside the raw input image, the warping modules also receive the information about the desired redirection angle and feature points also encoded as image-sized feature maps.);
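For illustration only, the coarse-to-fine warping quoted above from Ganin can be sketched as follows. This is a minimal NumPy sketch under stated assumptions, not Ganin's implementation: the function names (bilinear_sample, upsample_flow, coarse_to_fine_warp) are hypothetical, the flow fields are supplied by the caller rather than predicted by a trained network, the half-scale flow is upsampled by nearest-neighbour repetition with its displacements doubled, and the "amended" flow is assumed to be the sum of the upsampled coarse flow and the residual flow. The pixel-wise brightness correction is omitted.

```python
import numpy as np


def bilinear_sample(image, flow):
    """Warp a 2D grayscale image with a two-channel pixel-flow field.

    flow[y, x, 0] is the x displacement and flow[y, x, 1] the y displacement;
    output[y, x] is the input sampled at (y + dy, x + dx), clamped to the border.
    """
    h, w = image.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    sx = np.clip(xs + flow[..., 0], 0, w - 1)
    sy = np.clip(ys + flow[..., 1], 0, h - 1)
    x0 = np.floor(sx).astype(int)
    y0 = np.floor(sy).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = sx - x0
    wy = sy - y0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = (1 - wx) * image[y0, x0] + wx * image[y0, x1]
    bottom = (1 - wx) * image[y1, x0] + wx * image[y1, x1]
    return (1 - wy) * top + wy * bottom


def upsample_flow(flow, factor=2):
    """Assumed upsampling: repeat each flow vector and rescale to full-res pixels."""
    up = np.repeat(np.repeat(flow, factor, axis=0), factor, axis=1)
    return up * factor


def coarse_to_fine_warp(image, coarse_flow_half, residual_flow):
    """Return (coarse estimate, final output) for caller-supplied flows."""
    d_coarse = upsample_flow(coarse_flow_half)   # upsampled half-scale flow
    o_coarse = bilinear_sample(image, d_coarse)  # coarse estimate
    # Assumption: the residual flow amends the coarse flow additively.
    o_final = bilinear_sample(image, d_coarse + residual_flow)
    return o_coarse, o_final
```

In Ganin, both flows are outputs of the warping modules, conditioned on the image, the feature points, and the redirection angle; in this sketch they are plain arrays, which keeps the example self-contained.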
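Saragih and Ganin are analogous art because both teach methods of using machine learning to render a 3D model with different variations. Ganin further teaches rendering a 3D model at different angles and with greater photorealism. Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the machine learning method for 3D model rendering (taught in Saragih) to further use the warping modules to map the input image to final photorealistic images with different angles (taught in Ganin), so as to provide a method for digital alteration of images in a real-time or near real-time manner (Ganin, page 2, par 2).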


The combination of Saragih and Ganin further teaches computing expression parameter values from the image using the expression system (Saragih, [0036] FIG. 2 illustrates an example method for training an avatar parameter-extraction machine-learning (ML) model 206 to extract avatar parameters from images 106 in the infrared spectral domain. An avatar parameter extraction model training process 204 may train the avatar parameter extraction model 206 using the IR images 106, the parametric face model 109, and the domain transfer ML model 114. The domain transfer ML model 114 may be used to generate domain-transferred RGB images 118 based on IR images 106, and the avatar parameter extraction model training process 204 may learn avatar parameters that cause the parametric face model 109 to render an avatar having an appearance similar to the domain-transferred RGB images 118. The avatar parameters may include a facial expression code 113 and a head pose (not shown).);
driving the 3D face model with the expression parameter values to produce a 3D model of the face of the wearer (Saragih, [0028] In particular embodiments, a system may automatically map images captured by a head-mounted capture device to status (e.g., facial expressions) of a user's avatar. The headset has IR cameras that provide partial images of the face. An avatar that mimics the facial expressions of the user's face may be generated based on the images from the IR cameras mounted in the headset. The avatar may be displayed by headsets of the user and other people during a telepresence session, for example. The avatar may be constructed by determining a set of parameters that represent the 3D shape of a face and providing the parameters to an avatar generator that generates the 3D shape.
[0054] In particular embodiments, for the distribution over headset pose P(v), the 3D geometry of the face model D 406 may be fitted to detected landmarks on headset images 500 by collecting 2D landmark annotations and training landmark detectors.);
rendering the 3D model from a specified viewpoint to compute a rendered image (Ganin, page 8, par 1, For each person we record 2-10 sequences, changing the head pose and light conditions between different sequences. Training pairs are collected, taking two images with different gaze directions from one sequence. We manually exclude bad shots, where a person is blinking or where she is not changing gaze direction monotonically as anticipated. Most of the experiments were done on the dataset of 33 persons and 98 sequences. Unless noted otherwise, we train the model for vertical gaze redirection in the range between -30° and 30°.); and
upgrading the rendered image to a photorealistic image using the photorealiser (Saragih, [0072] In particular embodiments, rather than minimizing L2-loss in the latent space of z', loss may be measured in a way that encourages the network to spend capacity on the most visually sensitive parts, such as subtle lip shape and gaze direction. Additionally, the error in geometry and texture map, particularly in eye and mouth regions, may be minimized, because the avatar may have insufficient geometry detail, and may thus rely on view-dependent texture to be photorealistic in these regions.
[0076] FIG. 6A illustrates an example training headset 602 having intrusive and non-intrusive infrared cameras. The training head-mounted display (HMD) 602 includes augmented cameras 606a-f and standard cameras 604a-c. The HMD 602 may be used for collecting data to help establish better correspondence between HMD images and avatar parameters. Particular embodiments use two versions of the same headset design: a non-intrusive, consumer-friendly design with a minimally intrusive camera configuration, and a training design with an augmented camera set having more accommodating viewpoints to support finding correspondences. The augmented training headset 602 may be used to collect data and build a mapping between the non-intrusive headset camera configuration and the user's facial expressions. Specifically, the non-intrusive cameras 604 may include a VGA-resolution camera for each of the mouth 604b, left-eye 604c, and right-eye 604a. The six augmented cameras 606 add an additional view 606a,b to each eye as well as four additional views 606c-f of the mouth.
Ganin, page 4, par 3-4, the task of the network is therefore the prediction of the warping field. This field is predicted in two stages in a coarse-to-fine manner, where the decisions at the fine scale are being informed by the result of the coarse stage. Beyond coarse-to-fine warping, the photorealism of the result is improved by performing pixel-wise correction of the brightness where the amount of correction is again predicted by the network.).

Regarding Claim 2. The combination of Saragih and Ganin further teaches The apparatus of claim 1 wherein the method further comprises inserting the photorealistic image into a virtual web cam stream (Saragih, [0003] In particular embodiments, a system may automatically map images captured by a head-mounted capture device to status (e.g., facial expressions) of a user's avatar. The headset has IR cameras that provide partial images of the face. An avatar that mimics the facial expressions of the user's face may be animated based on the images from the IR cameras mounted in the headset. The avatar may be displayed by headsets of the user and other users during a telepresence session, for example. The avatar may be animated by determining a set of parameters that represent the 3D shape of a face and providing the parameters to an avatar generator that generates the 3D shape.
[0072], Additionally, the error in geometry and texture map, particularly in eye and mouth regions, may be minimized, because the avatar may have insufficient geometry detail, and may thus rely on view-dependent texture to be photorealistic in these regions.
It is common to use a web cam in an online chat room or conference to capture the image of a participant and present it in real time. Thus, using a trained avatar in a telepresence session is equivalent to using a virtual web cam in an online conference to present a participant.).

Regarding Claim 3. The combination of Saragih and Ganin further teaches The apparatus of claim 1 wherein the method further comprises one or more of: using the photorealistic image in a video conferencing application, using the photorealistic image to animate an avatar in a telepresence application (Saragih, [0003] In particular embodiments, a system may automatically map images captured by a head-mounted capture device to status (e.g., facial expressions) of a user's avatar. The headset has IR cameras that provide partial images of the face. An avatar that mimics the facial expressions of the user's face may be animated based on the images from the IR cameras mounted in the headset. The avatar may be displayed by headsets of the user and other users during a telepresence session, for example. The avatar may be animated by determining a set of parameters that represent the 3D shape of a face and providing the parameters to an avatar generator that generates the 3D shape.
[0072], Additionally, the error in geometry and texture map, particularly in eye and mouth regions, may be minimized, because the avatar may have insufficient geometry detail, and may thus rely on view-dependent texture to be photorealistic in these regions.).

Regarding Claim 4. The combination of Saragih and Ganin further teaches The apparatus of claim 1 wherein the method comprises accessing a plurality of input images depicting different partial views of the wearer's face and using the plurality of input images to compute the expression parameter values (Saragih, [0003] In particular embodiments, a system may automatically map images captured by a head-mounted capture device to status (e.g., facial expressions) of a user's avatar. The headset has IR cameras that provide partial images of the face. An avatar that mimics the facial expressions of the user's face may be animated based on the images from the IR cameras mounted in the headset. The avatar may be displayed by headsets of the user and other users during a telepresence session, for example. The avatar may be animated by determining a set of parameters that represent the 3D shape of a face and providing the parameters to an avatar generator that generates the 3D shape.
[0028] In particular embodiments, a system may automatically map images captured by a head-mounted capture device to status (e.g., facial expressions) of a user's avatar. The headset has IR cameras that provide partial images of the face. An avatar that mimics the facial expressions of the user's face may be generated based on the images from the IR cameras mounted in the headset. The avatar may be displayed by headsets of the user and other people during a telepresence session, for example. The avatar may be constructed by determining a set of parameters that represent the 3D shape of a face and providing the parameters to an avatar generator that generates the 3D shape).

Regarding Claim 5. The combination of Saragih and Ganin further teaches The apparatus of claim 4 wherein the plurality of input images comprise a first image depicting a first eye, a second image depicting a second eye, and a third image depicting a mouth (Saragih, [0076] FIG. 6A illustrates an example training headset 602 having intrusive and non-intrusive infrared cameras. The training head-mounted display (HMD) 602 includes augmented cameras 606a-f and standard cameras 604a-c. The HMD 602 may be used for collecting data to help establish better correspondence between HMD images and avatar parameters. Particular embodiments use two versions of the same headset design: a non-intrusive, consumer-friendly design with a minimally intrusive camera configuration, and a training design with an augmented camera set having more accommodating viewpoints to support finding correspondences. The augmented training headset 602 may be used to collect data and build a mapping between the non-intrusive headset camera configuration and the user's facial expressions. Specifically, the non-intrusive cameras 604 may include a VGA-resolution camera for each of the mouth 604b, left-eye 604c, and right-eye 604a. The six augmented cameras 606 add an additional view 606a,b to each eye as well as four additional views 606c-f of the mouth.).

Regarding Claim 7. The combination of Saragih and Ganin further teaches The apparatus of claim 1 wherein the expression system comprises a neural network that has been trained using synthetic images depicting partial views of a face of an HMD wearer, the synthetic images associated with known expression parameters (Saragih, [0041] FIG. 5 illustrates an example pipeline for establishing correspondence between infrared images and avatar parameters by training domain-transfer and parameter-extraction machine-learning models. In particular embodiments, the pipeline may use a pre-trained personalized parametric face model D 406, which may be understood as an avatar that can be configured based on avatar parameters such as an estimated facial expression 113 and an estimated pose 510. The parametric face model D 406 may be a deep appearance model, e.g., a deep deconvolutional neural network, which generates a representation of the avatar, including geometry and a texture, based on the avatar parameters.).

Regarding Claim 10. The combination of Saragih and Ganin further teaches The apparatus of claim 1 wherein the photorealiser comprises a neural network having been trained with pairs of 3D scans and frontal views of faces (Ganin, page 5, par 1-2, at training time, our dataset allows us to mine pairs of images containing eyes of the same person looking in two different directions separated by a known angle α. The head pose, the lighting, and all other nuisance parameters are (approximately) the same between the two images in the pair.).
The reasoning for combination is the same as in Claim 1.

Regarding Claim 11. The combination of Saragih and Ganin further teaches The apparatus of claim 10 wherein the photorealiser has been fine-tuned with 2D views of a particular individual not wearing the HMD (Ganin, page 5, par 3-4, each of the two warping modules takes as an input the image, the position of the feature points, and the redirection angle. All inputs are expressed as maps as discussed below, and the architecture of the warping modules is thus “fully-convolutional”, including several convolutional layers interleaved with Batch Normalization layers [11] and ReLU non-linearities (the actual configuration is shown in the Appendix).
Page 6, par 1, Fine warping. In the fine warping module, the rough image estimate Ocoarse and the upsampled low-resolution flow Dcoarse(I, α) are concatenated with the input data (the image, the angle encoding, and the feature point encoding) at the original scale and sent to the 1X-scale network which predicts another two-channel flow Dres that amends the half-scale pixel-flow. The amended flow is used to obtain the final output (again, via bilinear sampler).
Page 6, par 2, The purpose of coarse-to-fine processing is two-fold. The half-scale (coarse) module effectively increases the receptive field of the model resulting in a flow that moves larger structures in a more coherent way. Secondly, the coarse module gives a rough estimate of how a redirected eye would look like. This is useful for locating problematic regions which can only be fixed by a neural network operating at a finer scale.
As shown in Fig. 4, the user is not wearing an HMD. Her images are used in the training process, including the fine warping process.).
The reasoning for combination is the same as in Claim 1.

Regarding Claim 12. The combination of Saragih and Ganin further teaches The apparatus of claim 1 wherein the viewpoint is selected according to user input (Ganin, abstract, in this work, we consider the task of generating highly realistic images of a given face with a redirected gaze. We treat this problem as a specific instance of conditional image generation and suggest a new deep architecture that can handle this task very well as revealed by numerical comparison with prior art and a user study. Our deep architecture performs coarse-to-fine warping with an additional intensity correction of individual pixels. All these operations are performed in a feed-forward manner, and the parameters associated with different operations are learned jointly in the end-to-end fashion. After learning, the resulting neural network can synthesize images with manipulated gaze, while the redirection angle can be selected arbitrarily from a certain range and provided as an input to the network.).
The reasoning for combination is the same as in Claim 1.

Regarding Claim 13. The combination of Saragih and Ganin further teaches The apparatus of claim 1 wherein the 3D model comprises a polygon mesh with higher density of vertices in eye and mouth regions of the polygon mesh than in other regions of the polygon mesh (Saragih, [0041] FIG. 5 illustrates an example pipeline for establishing correspondence between infrared images and avatar parameters by training domain-transfer and parameter-extraction machine-learning models. In particular embodiments, the pipeline may use a pre-trained personalized parametric face model D 406, which may be understood as an avatar that can be configured based on avatar parameters such as an estimated facial expression 113 and an estimated pose 510. The parametric face model D 406 may be a deep appearance model, e.g., a deep deconvolutional neural network, which generates a representation of the avatar, including geometry and a texture, based on the avatar parameters. The estimated facial expression 113 may be an l-dimensional latent facial expression code z ∈ ℝ^l 113. The estimated pose 510 may be a 6-DOF rigid pose transform v ∈ ℝ^6 510 from the avatar's reference frame to the headset (represented by a reference camera). The estimated pose 510 may be a view vector, represented as the vector pointing from the head of the user to the camera (e.g., relative to a head orientation that may be estimated from a tracking algorithm). The geometry, which may be a mesh M 514, and a texture T 516 may be generated based on the facial expression code z 113 and pose v 510 using the parametric face model D 406.
[0042] In particular embodiments, the mesh M ∈ ℝ^(n×3) 514 represents the facial shape comprising n-vertices, and the texture T ∈ ℝ^(w×h) 516 is the generated texture. A rendered image R 506 can be generated from this shape and texture through rasterization by a renderer R 115 based on the mesh M 514, the texture T 516, and the camera's projection function A 511.
[0054-0055] In particular embodiments, for the distribution over headset pose P(v), the 3D geometry of the face model D 406 may be fitted to detected landmarks on headset images 500 by collecting 2D landmark annotations and training landmark detectors. Example landmarks are shown in FIG. 7. One of the challenges in fitting a 3D mesh to 2D detections is defining correspondence between mesh vertices and detected landmarks. To address this problem, while fitting individual meshes, particular embodiments may simultaneously solve for each landmark's mesh correspondence (e.g., used across all frames) in the texture's uv-space {u_j ∈ ℝ^(1×3)}_(j=1)^n, where m is the number of available landmarks. To project each landmark m on rendered images of every view, particular embodiments may calculate a row vector of the barycentric-coordinates b_j ∈ ℝ^(1×3) of the current u_j in its enclosing triangle, with vertices indexed by a_j ∈ ℕ^3, and then linearly interpolate projections of the enclosing triangle's 3D vertices, M_(a_j) ∈ ℝ^(3×3), where M is the mesh 514 (representing the facial shape) from Equation (1).
The vertices on the 3D mesh correspond to landmarks of the original image. Regions such as the eyes and mouth comprise more landmarks than other regions, so as 

Regarding Claim 15. The combination of Saragih and Ganin further teaches The apparatus of claim 1 integral with an HMD (Saragih, [0003] In particular embodiments, a system may automatically map images captured by a head-mounted capture device to status (e.g., facial expressions) of a user's avatar. The headset has IR cameras that provide partial images of the face. An avatar that mimics the facial expressions of the user's face may be animated based on the images from the IR cameras mounted in the headset.).

Claim 16 is similar in scope to Claim 1, and thus is rejected under the same rationale.
Claim 17 is similar in scope to Claim 2/3, and thus is rejected under the same rationale.
Claim 18 is similar in scope to Claim 4, and thus is rejected under the same rationale.
Claim 19 is similar in scope to Claim 5, and thus is rejected under the same rationale.

Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Saragih et al. in view of Ganin et al., further in view of Bouaziz et al. (US20200160582).

Regarding Claim 9. The combination of Saragih and Ganin fails to explicitly teach, however, Bouaziz teaches  The apparatus of claim 1 wherein the 3D face model has a generic identity and generic texture (Bouaziz, abstract, the invention describes a method for real-time facial animation, and a processing device for real-time facial animation. The method includes providing a dynamic expression model, receiving tracking data corresponding to a facial expression of a user, estimating tracking parameters based on the dynamic expression model and the tracking data, and refining the dynamic expression model based on the tracking data and estimated tracking parameters. The method may further include generating a graphical representation corresponding to the facial expression of the user based on the tracking parameters. Embodiments pertain to a real-time facial animation system.
[0019] According to one embodiment, the plurality of blendshapes at least includes a blendshape b0 representing a neutral facial expression and the dynamic expression model further includes an identity principal component analysis (PCA) model, the method further including matching the blendshape b0 representing the neutral facial expression to the neutral expression of the user based on the tracking data and the identity PCA model. The identity PCA model may represent variations of face geometries across different users and may be used to initialize the plurality of blendshapes including the blendshape b0 to the face geometry of the user.
[0063] For example, an online avatar of the user can be directly created based on the refined blendshapes, since the user-specific dynamic expression model 316 that was built automatically during model refinement 310 constitutes a fully rigged geometric avatar of the user. The online avatar may further include a reconstruction of texture and other facial features such as hair in order to allow for a complete digital online avatar that can directly be integrated into online applications or communication applications and tools.
Therefore, the basic neutral model comprises generic identity and texture, which will be reconstructed based on individual features for different users.).
Saragih, Ganin and Bouaziz are analogous art, because they all teach methods of using machine learning to render a 3D model with different variations. Bouaziz further teaches training a basic/generic 3D model with images of different users' moving and/or talking expressions. Therefore, it would have been obvious to a person with ordinary skill in the art before the effective filing date of the claimed invention, to modify the machine learning method for 3D model rendering (taught in Saragih and Ganin), to further use the expression training for a basic/neutral 3D model (taught in Bouaziz), so as to provide a real-time facial animation method (Bouaziz, [0007]).

Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Saragih et al. in view of Ganin et al., further in view of Kawai et al. ("Automatic Generation of Photorealistic 3D Inner Mouth Animation only from Frontal Images." Journal of Information Processing 23.5 (2015): 693-703).

Regarding Claim 14. The combination of Saragih and Ganin further teaches The apparatus of claim 1 wherein the 3D model comprises a representation of eye balls (Saragih, [0035] In particular embodiments, the ML models, such as the domain-transfer ML model 114, that are used in generating an avatar for a particular user 102 
[0073] and K is the crop on texture maps focusing on eye and mouth area (shown in FIG. 7), and v0 is a fixed frontal view of the avatar.
[0076] FIG. 6A illustrates an example training headset 602 having intrusive and non-intrusive infrared cameras. The training head-mounted display (HMD) 602 includes augmented cameras 606a-f and standard cameras 604a-c. The augmented training headset 602 may be used to collect data and build a mapping between the non-intrusive headset camera configuration and the user's facial expressions. Specifically, the non-intrusive cameras 604 may include a VGA-resolution camera for each of the mouth 604b, left-eye 604c, and right-eye 604a. The six augmented cameras 606 add an additional view 606a,b to each eye as well as four additional views 606c-f of the mouth,), 

The combination of Saragih and Ganin fails to explicitly teach, however, Kawai teaches a representation of teeth and tongue (Kawai, abstract, the paper describes a novel method to generate highly photorealistic three-dimensional (3D) inner mouth animation that is well-fitted to an original ready-made speech animation using only frontal captured images and small-size databases. The algorithms are composed of quasi-3D model reconstruction and motion control of teeth and the tongue, and final compositing of photorealistic speech animation synthesis tailored to the original. The method automatically generates 3D inner mouth appearances, improving photorealism with only three inputs: an original tailor-made lip-sync animation, a single image of the speaker’s teeth, and a syllabic decomposition of the desired speech. The key idea of our proposed method is to combine 3D reconstruction and simulation with two-dimensional (2D) image processing using only the above three inputs, as well as a tongue database and mouth database. The satisfactory performance of our proposed method is illustrated by the significant improvement in picture quality of several tailor-made animations to a degree nearly equivalent to that of camera-captured videos.).
Saragih, Ganin and Kawai are analogous art, because they all teach methods of using machine learning to render a 3D model with different variations. Kawai further teaches training a 3D model with movement of the teeth and tongue. Therefore, it would have been obvious to a person with ordinary skill in the art before the effective filing date of the claimed invention, to modify the machine learning method for 3D model rendering (taught in Saragih and Ganin), to further use the expression training of teeth and tongue for a basic/neutral 3D model (taught in Kawai), so as to provide a method for significantly improving picture quality to a degree nearly equivalent to that of camera-captured videos (Kawai, abstract).

Claims 20-21 are rejected under 35 U.S.C. 103 as being unpatentable over Saragih et al. in view of Ganin et al., further in view of Peng et al. (US20200312043).

Claim 20 is similar in scope to Claims 1 and 9, and thus is rejected under the same rationale. Claim 20 further requires:
One or more device-readable media (Saragih, [0100] Herein, a computer-readable non-transitory storage medium or media may include one or more semiconductor based or other integrated circuits (ICs) (such, as for example, field-programmable gate arrays (FPGAs) or application-specific ICs (ASICs)), hard disk drives (HDDs), hybrid hard drives (HHDs), optical discs, optical disc drives (ODDs), magneto-optical discs, magneto-optical drives, floppy diskettes, floppy disk drives (FDDs), magnetic tapes, solid-state drives (SSDs), RAM-drives, SECURE DIGITAL cards or drives, any other suitable computer-readable non-transitory storage media, or any suitable combination of two or more of these, where appropriate.). 

The combination of Saragih and Ganin fails to explicitly teach, however, Peng teaches driving the 3D face model with identity parameter values and the expression parameter values to produce a 3D model of the face of the wearer (Peng, abstract, the invention describes a face model processing method performed at an electronic device. The method includes the following steps: obtaining a three-dimensional face model corresponding to a user picture, and selecting a sample oral cavity model in an oral cavity model library for the three-dimensional face model; registering the sample oral cavity model into the three-dimensional face model by using performing form adjustment on an oral cavity form of the registered sample oral cavity model by using an expression parameter of the three-dimensional face model to generate a target oral cavity model; and generating, based on the three-dimensional face model and the target oral cavity model, a three-dimensional face expression model corresponding to the user picture. 
[0057] The obtained three-dimensional face model may be described by using a formula [formula images: media_image1.png, media_image2.png], where S is a three-dimensional face model, S̄ is a three-dimensional model of an average face, U_i and V_i are respectively a spatial basis vector matrix of a face identity and a spatial basis vector matrix of a face expression obtained by training a face three-dimensional model data set, and w_i and v_i represent an identity parameter and an expression parameter of a corresponding three-dimensional human model. [symbol image: media_image3.png], U_i, and V_i are known numbers. If w_i and v_i are known, a corresponding three-dimensional face model S may be calculated according to the foregoing formula. Correspondingly, if the three-dimensional face model S is synthesized, w_i and v_i corresponding to the three-dimensional face model may also be obtained through detection. Generally, a presentation form of the three-dimensional face model S is changed by changing values of w_i and v_i. The identity parameter w_i of the three-dimensional face model remains unchanged to control the change of v_i, so that geometric models of the same face with different expressions may be obtained. For example, FIG. 2 shows a generated three-dimensional expression model.).
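For illustration only, and not as part of the cited reference: the formula in Peng [0057] is reproduced in this action only as an image, so the following Python sketch assumes the common linear form in which the face model is the average face plus weighted identity bases plus weighted expression bases. The function name drive_face_model, the flat-list vertex layout, and all numeric values are hypothetical.

```python
def drive_face_model(mean_shape, identity_basis, expression_basis, w, v):
    """Assumed linear form: S = mean + sum_i w[i]*U[i] + sum_i v[i]*V[i]."""
    shape = list(mean_shape)
    # Add the identity contribution (parameters w, bases U).
    for weight, basis in zip(w, identity_basis):
        for k, value in enumerate(basis):
            shape[k] += weight * value
    # Add the expression contribution (parameters v, bases V).
    for weight, basis in zip(v, expression_basis):
        for k, value in enumerate(basis):
            shape[k] += weight * value
    return shape


# Hypothetical toy model with three coordinates, one identity basis
# and two expression bases.
mean = [0.0, 0.0, 0.0]
U = [[1.0, 0.0, 0.0]]
V = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Same identity parameter, two different expression parameter vectors.
neutral = drive_face_model(mean, U, V, w=[2.0], v=[0.0, 0.0])
smiling = drive_face_model(mean, U, V, w=[2.0], v=[0.5, -1.0])
```

Holding w fixed while varying v changes only the expression-driven coordinates, which mirrors the statement in the quoted passage that the identity parameter remains unchanged while the expression parameter is varied.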


Regarding Claim 21. The combination of Saragih, Ganin and Peng further teaches The apparatus of claim 1, wherein driving the 3D face model with the expression parameter values comprises driving the 3D face model with identity parameter values to produce the 3D model of the face of the wearer (Peng, [0057] The obtained three-dimensional face model may be described by using a formula 
[formula images: media_image1.png, media_image2.png], where S is a three-dimensional face model, S̄ is a three-dimensional model of an average face, U_i and V_i are respectively a spatial basis vector matrix of a face identity and a spatial basis vector matrix of a face expression obtained by training a face three-dimensional model data set, and w_i and v_i represent an identity parameter and an expression parameter of a corresponding three-dimensional human model. [symbol image: media_image3.png], U_i, and V_i are known numbers. If w_i and v_i are known, a corresponding three-dimensional face model S may be calculated according to the foregoing formula. Correspondingly, if the three-dimensional face model S so that geometric models of the same face with different expressions may be obtained. For example, FIG. 2 shows a generated three-dimensional expression model.).
The reasoning for combination of Saragih, Ganin and Peng is the same as described in Claim 20.

Claim 22 is rejected under 35 U.S.C. 103 as being unpatentable over Saragih et al. in view of Ganin et al. and Peng et al., further in view of Szeto et al. (US20180137366).

Regarding Claim 22. The combination of Saragih, Ganin and Peng further teaches The apparatus of claim 1, wherein driving the 3D face model with the expression parameter values comprises driving the 3D face model with identity parameter values determined (Peng, [0057] The obtained three-dimensional face model may be described by using a formula 
[formula images: media_image1.png, media_image2.png], where S is a three-dimensional face model, S̄ is a three-dimensional model of an average face, U_i and V_i are respectively a spatial basis vector matrix of a face identity and a spatial basis vector matrix of a face expression obtained by training a face three-dimensional model data set, and w_i and v_i represent an identity parameter and an expression parameter of a corresponding three-dimensional human model. [symbol image: media_image3.png], U_i, and V_i are known numbers. If w_i and v_i are known, a corresponding three-dimensional face model S may be calculated according to the foregoing formula. Correspondingly, if the three-dimensional face model S is synthesized, w_i and v_i corresponding to the three-dimensional face model may also be obtained through detection. Generally, a presentation form of the three-dimensional face model S is changed by changing values of w_i and v_i. The identity parameter w_i of the three-dimensional face model remains unchanged to control the change of v_i, so that geometric models of the same face with different expressions may be obtained. For example, FIG. 2 shows a generated three-dimensional expression model.)
The reasoning for combination of Saragih, Ganin and Peng is the same as described in Claim 20.

The combination of Saragih, Ganin and Peng fails to explicitly teach, however, Szeto teaches via an offline process (Szeto, abstract, the invention describes a method for training an object detection algorithm. The method comprises: acquiring, from a camera, a video sequence of a real object; deriving a pose of the real object included in at least one image frame using a 3D model corresponding to the real object in the case where the at least one image frame is selected from the video sequence; tracking or deriving the pose of the real object included in image frames in the video sequence in forward and/or backward directions from the at least one image frame; and

[0054] FIG. 9 is a diagram illustrating a flow of the offline creation process. In the offline creation process, first, the CPU 110 performs acquisition of a video sequence (step S31). In the performed acquisition of the video sequence, a user images the real object OB1 with the imaging section 40 in advance. At this time, the imaging section 40 is relatively moved so that poses of the real object OB1 relative to the imaging section 40 correspond to all spatial relationships represented by the dots in FIG. 7 or 8.
[0055] Next, the CPU 110 acquires a reference image frame (step S33). A pose of the real object OB1 imaged in each of the image frames in the preceding and succeeding of the time axis with respect to the selected image frame is tracked (step S35). In this case, bundle adjustment is locally or entirely applied to the reference image frame, and thus estimation of the pose of the real object OB1 is refined with respect to each image frame. The appearance data of the real object OB1 is acquired and is recorded at a predetermined timing. Training data is created in which the acquired appearance data is associated with the pose, and "2D model data" of the 2D model obtained by projecting the 3D model in the pose (step S37), is stored as a template, and the offline creation process is finished.)
Saragih, Ganin, Peng and Szeto are analogous art, because they all teach methods of using machine learning to render a 3D model with different variations. Szeto .


Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to XIN SHENG whose telephone number is (571)272-5734. The examiner can normally be reached M-F 9:30AM-3:30PM 6:00PM-8:30PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kee Tung can be reached on 5712727794. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/Xin Sheng/Primary Examiner, Art Unit 2611