DETAILED ACTION
This Office Action is in response to Applicant's communication filed on April 6, 2022, which amends independent claim 1 and presents arguments. The communication is hereby acknowledged. Claims 1-20 are currently pending and have been examined.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment
Applicant’s arguments filed on April 6, 2022, have been fully considered.
Applicant argues that independent claim 1 has been amended to add the new limitation “generate, by the neural network, a delta image representing the makeup from the reference face image” in order to overcome the rejections under 35 U.S.C. 103.
Examiner replies that the amended claims, with the new limitation, may overcome the cited portions of the prior art. However, a newly found reference, D'Alessandro et al. (US 20180276883 A1), teaches generating, by the neural network, a delta image representing the makeup from the reference face image (See D'Alessandro: Figs. 7-8, and [0064], “Similarly, a delta can be computed using the color model between the expected color at the desired age and the expected color at the individual's actual age. This delta can likewise be added to the image to produce the predicted image. It should be understood that before adding the delta, the delta image must be warped to the predicted landmark points in order to line up with the warping of the individual's image from the shape model”).
Applicant argues that independent claim 1 has been amended to add the new limitation “a transfer network that is trained to transfer the makeup from the facial components of the reference face image to the facial components of the target face image and to maintain a skin tone and a lighting environment of the target face image using the delta image”, and that the prior art of record does not teach the limitation of “a lighting environment”:
“II.	The cited art does not teach maintaining a skin tone and a lighting environment of the target face image using the delta image as recited in claim 1”.
Examiner replies that the amended claims, with the new limitation, may overcome the cited portions of the prior art. However, the secondary reference cited in the FAOM, Fu et al. (US 20190014884 A1), does teach maintaining a skin tone and a lighting environment of the target face image using the delta image (See Fu: Fig. 3, and [0116], “While various regions and fiducial points may be used in the method and system herein, for purposes of explaining a preferred embodiment illustrating a first and/or second region to be extracted and one of such regions intrinsically decomposed, the following example illustrates such steps using the eye and mouth regions as follows. For the eye region 1040A, for example, an intrinsic image decomposition technique is utilized in Step 1045 to recover the shading and reflectance channels of the eye region. Then, in Step 1050A, the shading channel and reflectance channel are fed into histogram matching separately to get an image with the makeup removed in the eye region. For the lip/mouth region 1040B, for example, an image is first transferred, i.e., converted, to HSV color channels, and different histogram matching procedures are applied to the H, S, V channels separately with regard to different lighting conditions. For lip color removal, specific reference histograms of “saturation” and “value” were learned from a collected dataset of facial images without makeup on the lips. With those predefined lip histograms, an input lip makeup could be removed by matching the detected lip histogram to a corresponding one having no makeup. For the lip channel, the “hue” channel is used as the lip region which usually has only one value so that one need not use a histogram to represent it, and the procedure for “hue” channel is set as the value of the “hue” channel for each pixel compared to a pre-trained color value”). The skin color is mapped to the claimed element “skin tone”.
Applicant argues that the prior art of record does not teach the limitation “a blending network” recited in independent claim 1:
“III.	The cited art does not teach a blending network that is configured to blend the facial components of the target face having the makeup onto the target face image as recited in claim 1”.
Examiner replies that the limitation recited in independent claim 1 is “a blending network that is configured to blend the facial components of the target face having the makeup onto the target face image”, which specifies neither how the facial components are blended onto the target face image nor what “the facial components” are. In the Office action, the facial expressions of the primary reference, Chandran et al. (US 20210279956 A1), are mapped to “the facial components”, and the blending of those facial expressions into the target face images is mapped to the claimed limitation at issue. Further, the secondary reference of record, Fu et al. (US 20190014884 A1), also teaches a blending network that is configured to blend the facial components of the target face having the makeup onto the target face image (See Fu: Fig. 7, and [0156], “The first image having the first output effect and/or the additional images with their respective output effects are combined and blended with the original facial image of the user in step 2040 to create a resultant image in step 2050 having each of the output effects combined on the facial image of the user”). Thus, Applicant's arguments on this matter are not persuasive.
Applicant argues, lastly, that the prior art of record does not teach the limitation “train the neural network using the paired data using supervised learning; and train the neural network using the unpaired data using unsupervised learning” recited in independent claim 11:
“IV.	The cited art does not teach training the neural network using the paired data using supervised learning and then training the neural network using the unpaired data using unsupervised learning as recited in claim 11”.
Examiner replies that the limitations recited in independent claim 11 are “train the neural network using the paired data using supervised learning; and train the neural network using the unpaired data using unsupervised learning”, which specify neither what the paired or unpaired data are nor what supervised or unsupervised learning is. The training data set of neutral face 3D meshes and meshes for the same facial identifiers may be the paired data, while the neutral face 3D meshes and the predetermined facial expressions may be the unpaired data. Thus, the primary reference of record, Chandran, teaches training the neural network using the paired data using supervised learning (See Chandran: Fig. 10, and [0051], “As discussed in greater detail below in conjunction with FIG. 10, in some embodiments the identity encoder 152, the expression encoder 154, and the decoder 156 are trained in an end-to-end and fully supervised manner using a L1 loss function, with the identity and expression latent spaces being constrained using Kullback-Leibler (KL) divergence losses, a fixed learning rate, and the Adaptive Moment Estimation (ADAM) optimizer”). Further, when the input data are not paired, the training of the neural network may be regarded as unsupervised learning, as detailed in the secondary reference of record, Fu, which teaches training the neural network using the unpaired data using unsupervised learning (See Fu: Figs. 9-10, and [0204], “Thus, one can collect various images with makeup on them and instead of having to significant numbers of images with makeup off, the makeup removal method may be used to generate numbers of images with no makeup applied that are used as input data for training in step 4030”; [0206], “FIGS. 10A-10D shows more detailed output examples of the makeup annotation system 5000 in accordance with an embodiment of the present disclosure. Through the makeup annotation system 5000, digitalized makeup information can be generated and this information may be used as input data of the deep learning training in step 4045”; and [0205], “For model training, a deep learning framework 4035 such as Caffe™, Caffe2™ or Pytorch™ is used to support many different types of deep learning architectures for image classification and image segmentation. Such a framework supports a variety of neural network patterns, as well as fully connected neural network designs”. Note that data 4020 and 5000 are unpaired). Thus, Applicant’s argument in this matter is not persuasive.
Examiner respectfully further replies that Applicant's arguments have been fully considered and that new grounds of rejection have been made, as set forth below. Since the new grounds of rejection are necessitated by Applicant's amendments to the claims, the present action is made final.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-9 and 11-20 are rejected under 35 U.S.C. 103 as being unpatentable over Chandran et al. (US 20210279956 A1) in view of D'Alessandro et al. (US 20180276883 A1), further in view of Fu et al. (US 20190014884 A1).
Regarding claim 1, Chandran teaches a non-transitory computer readable medium comprising instructions that, when executed by at least one processor, cause a computer device (See Chandran: Fig. 1, and [0031], “Fig. 1 illustrates a system 100 configured to implement one or more aspects of the various embodiments. As shown, the system 100 includes a machine learning server 110, a data store 120, and a computing device 140 in communication over a network 130, which may be a wide area network (WAN) such as the Internet, a local area network (LAN), or any other suitable network”) to:
receive, at a neural network, a target face image and a reference face image, the reference face image including makeup (See Chandran: Fig. 1, and [0035], “The model trainer 116 is configured to train machine learning models, including a non-linear model for generating faces 150, which is also referred to herein as a “face model.” As shown, the face model 150 includes an identity encoder 152, an expression encoder 154, and a decoder 156. Any technically feasible types of encoders and decoders may be used. In some embodiments, each of the identity encoder 152 and the expression encoder 154 may include a deep neural network, such as an encoder from a variational autoencoder (VAE). Similarly, the decoder 156 may also include a deep neural network in some embodiments. Operation(s) performed to encode representations of facial identities using the identity encoder 152, or to encode representations of facial expressions using the expression encoder 154 (or another mapping), are also referred to herein as “encoding operation(s).” Operation(s) performed to generate a representation of a face using the decoder 156, based on an encoded representation of a facial identity and an encoded representation of a facial expression, are also referred to herein as “decoding operation(s).””; Fig. 15 and [0096], “As shown, a method 1500 begins at step 1502, where the application 146 receives an image of a face. For example, the image could be a standalone image or one of multiple frames of a video. In the case of a video, steps of the method 1500 may be repeated for each frame in the video”; and [0101], “At step 1508, the application 146 receives a representation of a facial identity. Similar to step 1202 of the method 1200 described above in conjunction with FIG. 12, the facial identity may be represented in any technically feasible manner, such as an identity code, a neutral face mesh that can be converted to an identity code using the identity encoder 152, an image or video frame from which a neutral face mesh can be determined, etc.”);
generate, by the neural network, a delta image representing the makeup from the reference face image; and
transfer, by the neural network, the makeup from the reference face image to the target face image by combining the target face image with the delta image, wherein the neural network includes (See Chandran: Fig. 1, and [0043], “As shown, the application 146 includes the face model 150, which itself includes the identity encoder 152, the expression encoder 154, and the decoder 156. In other embodiments, the application 146 may include a landmark model in addition to, or in lieu of, the face model 150. The face model 150 and/or the landmark model may be employed in any technically feasible use cases, including face fitting (e.g., fitting to a facial identity while constraining to the neutral expression, or fitting to a facial expression once a facial identity is known), performance animation (e.g., modifying only the expression space of the face model 150), and performance transfer or retargeting (e.g., modifying only the identity space of the face model 150). For example, the application 146 could use the decoder 156 to generate novel faces by sampling from identities represented by meshes in the data set that is used to train the face model 150, which are also referred to herein as “known identities,” or adding random noise to an identity code associated with a known identity. As another example, the application 146 could receive a new identity that is not one of the known identities and use the face model 150 to generate a face having the new identity and a target expression. As another example, the application 146 could perform blendweight retargeting in which the face model 150 is used to transfer facial expression(s) from an image or video to a new facial identity by determining blendweights associated with the facial expression(s) in the image or video, inputting the blendweights into the expression encoder 154, and inputting a representation of the new facial identity into the identity encoder 152. 
As a further example, the application 146 could perform 2D landmark-based capture and retargeting by determining 2D facial landmarks from a facial performance in a video, mapping the facial landmarks to expression codes that are then input, along with an identity code associated with a new identity, into the decoder 156 to generate faces having the new identity and the expressions in the facial performance. As used herein, a “facial performance” refers to a series of facial expressions, such as the facial expressions in successive frames of a video”):
a face parsing network that is configured to parse the target face image and the reference face image into multiple facial components (See Chandran: Figs. 1-2, and [0048], “where the subscript exp is used for facial expression components, the subscript T refers to a target expression shape, b.sup.T is a blendweight vector that corresponds to a target expression shape T that is input into the VAE encoder, which is denoted by E.sub.exp, and μ.sub.exp and σ.sub.exp are the mean and standard deviation, respectively, output by the VAE encoder. As described, in some embodiment, the re-parameterization may be performed during training and omitted (or not) thereafter. Blendweights are used to condition the decoder for two reasons”),
a transfer network that is trained to transfer the makeup from the facial components of the reference face image to the facial components of the target face image (See Chandran: Fig. 7, and [0073], “FIG. 7 illustrates an exemplar retargeting of a facial performance from one facial identity to another facial identity using the face model 150 of FIG. 1, according to various embodiments. As shown, facial expressions 700, 702, 704, 706, and 708 that are associated with one facial identity are retargeted to the same facial expressions 710, 712, 714, 716, and 718 for a new facial identity in a natural-looking, nonlinear manner. As used herein, “retargeting” refers to transferring the facial expressions associated with one facial identity, which may be represented as blendweights, onto another facial identity. In some embodiments, retargeting is performed by inputting an identity code associated with the new identity and expression codes associated with the facial expressions 700, 702, 704, 706, and 708 into the decoder 156 to generate vertex displacements for deforming a reference mesh into meshes of faces having the new facial identity and the same expressions 710, 712, 714, 716, and 718. As described, the identity code for the new identity may be manually entered by a user, generated by adding random noise to the identity code associated with a known identity, generated by inputting a neutral face mesh associated with the new identity minus the reference mesh into the identity encoder 152, or in any other technically feasible manner. As described, the expression code may also be manually entered by a user, generated by inputting user-specified or automatically-determined blendweights into the expression encoder 154, or in any other technically feasible manner”) and to maintain a skin tone and a lighting environment of the target face image using the delta image, and
a blending network that is configured to blend the facial components of the target face having the makeup onto the target face image; and output for display the makeup on the target face image (See Chandran: Figs. 5A-B, and [0059], “FIG. 5A illustrates an exemplar superimposing of facial expressions generated using a linear blending technique, according to the prior art. Experience has shown that conventional linear-based models can be used to superimpose expressions that are non-conflicting (ideally orthogonal), but such models produce poor results for many other shape combinations. As shown, an unrealistic-looking facial expression 504 is produced by superimposing a mouth-right expression 500 and a mouth-left expression 502 using a linear-based model”).
However, Chandran fails to explicitly disclose the following limitations: generate, by the neural network, a delta image representing the makeup from the reference face image; transfer, by the neural network, the makeup from the reference face image to the target face image by combining the target face image with the delta image; and to maintain a skin tone and a lighting environment of the target face image using the delta image.
However, D'Alessandro teaches generating, by the neural network, a delta image representing the makeup from the reference face image (See D'Alessandro: Figs. 7-8, and [0064], “Similarly, a delta can be computed using the color model between the expected color at the desired age and the expected color at the individual's actual age. This delta can likewise be added to the image to produce the predicted image. It should be understood that before adding the delta, the delta image must be warped to the predicted landmark points in order to line up with the warping of the individual's image from the shape model”).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Chandran to generate, by the neural network, a delta image representing the makeup from the reference face image, as taught by D'Alessandro, in order to simulate expected results to help a user make an informed choice of a suitable skin care treatment (See D'Alessandro: [0034], “The systems and methods for age appearance simulation herein may be configured to provide a user with an aging and/or de-aging prediction and/or simulation experience. In some instances, the user provides an image of themselves along with gender, ethnicity, and/or age information, which can be combined with empirical/statistical, age-based facial shape and complexion (texture and/or color) data models to visually communicate how the user will age. With this insight, the user can make choices about their skin care treatments and/or procedures to provide the skin appearance benefit they desire. Additionally or alternatively, these models, when combined with clinically-based efficacy data could be used to simulate expected results (average responder, best responder, etc.), thereby helping a user make an informed choice on a suitable treatment. The term “simulation” includes the predictive nature of the functionality in both 2D and 3D spaces, as well as a projection of imagery (such as a 2D projection), as described herein”). Chandran teaches a method and system that may generate a representation of a face having a facial identity and facial expressions that may be blended into other facial images, while D'Alessandro teaches a system and method that may determine the ethnicity and age of the patients based on a simulation that uses the delta image between the desired and the individual’s facial images.
Therefore, it would have been obvious to one of ordinary skill in the art to modify Chandran in view of D'Alessandro to generate the delta image in order to obtain more accurate facial image simulation results. The motivation to modify Chandran in view of D'Alessandro is “Use of known technique to improve similar devices (methods, or products) in the same way”.
However, Chandran, as modified by D'Alessandro, fails to explicitly disclose the following limitations: transfer, by the neural network, the makeup from the reference face image to the target face image by combining the target face image with the delta image; and to maintain a skin tone and a lighting environment of the target face image using the delta image.
However, Fu teaches transferring, by the neural network, the makeup from the reference face image to the target face image by combining the target face image with the delta image (See Fu: Figs. 3 and 9-10, and [0116], “While various regions and fiducial points may be used in the method and system herein, for purposes of explaining a preferred embodiment illustrating a first and/or second region to be extracted and one of such regions intrinsically decomposed, the following example illustrates such steps using the eye and mouth regions as follows. For the eye region 1040A, for example, an intrinsic image decomposition technique is utilized in Step 1045 to recover the shading and reflectance channels of the eye region. Then, in Step 1050A, the shading channel and reflectance channel are fed into histogram matching separately to get an image with the makeup removed in the eye region. For the lip/mouth region 1040B, for example, an image is first transferred, i.e., converted, to HSV color channels, and different histogram matching procedures are applied to the H, S, V channels separately with regard to different lighting conditions. For lip color removal, specific reference histograms of “saturation” and “value” were learned from a collected dataset of facial images without makeup on the lips. With those predefined lip histograms, an input lip makeup could be removed by matching the detected lip histogram to a corresponding one having no makeup. For the lip channel, the “hue” channel is used as the lip region which usually has only one value so that one need not use a histogram to represent it, and the procedure for “hue” channel is set as the value of the “hue” channel for each pixel compared to a pre-trained color value”); and
maintaining a skin tone and a lighting environment of the target face image using the delta image (See Fu: Fig. 3, and [0116], “While various regions and fiducial points may be used in the method and system herein, for purposes of explaining a preferred embodiment illustrating a first and/or second region to be extracted and one of such regions intrinsically decomposed, the following example illustrates such steps using the eye and mouth regions as follows. For the eye region 1040A, for example, an intrinsic image decomposition technique is utilized in Step 1045 to recover the shading and reflectance channels of the eye region. Then, in Step 1050A, the shading channel and reflectance channel are fed into histogram matching separately to get an image with the makeup removed in the eye region. For the lip/mouth region 1040B, for example, an image is first transferred, i.e., converted, to HSV color channels, and different histogram matching procedures are applied to the H, S, V channels separately with regard to different lighting conditions. For lip color removal, specific reference histograms of “saturation” and “value” were learned from a collected dataset of facial images without makeup on the lips. With those predefined lip histograms, an input lip makeup could be removed by matching the detected lip histogram to a corresponding one having no makeup. For the lip channel, the “hue” channel is used as the lip region which usually has only one value so that one need not use a histogram to represent it, and the procedure for “hue” channel is set as the value of the “hue” channel for each pixel compared to a pre-trained color value”. Note that the skin color is mapped to the claimed element “skin tone”).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Chandran to transfer, by the neural network, the makeup from the reference face image to the target face image by combining the target face image with the delta image, and to maintain a skin tone and a lighting environment of the target face image using the delta image, as taught by Fu, in order to make the system more robust (See Fu: Fig. 9, and [0204], “The unique use of makeup washoff or removal to generate non-makeup facial images makes the system more robust, but also represents a solution to the hardest part of solving the deep learning training problem which is to collect enough before and after makeup images to train deep learning models (DLM) 4040 resulting from the training. Thus, one can collect various images with makeup on them and instead of having to significant numbers of images with makeup off, the makeup removal method may be used to generate numbers of images with no makeup applied that are used as input data for training in step 4030”). Chandran teaches a method and system that may generate a representation of a face having a facial identity and facial expressions that may be blended into other facial images, while Fu teaches a system and method that may provide a facial image of a user with makeup applied and may combine the facial images to generate a facial image with makeup removed. Therefore, it would have been obvious to one of ordinary skill in the art to modify Chandran in view of Fu to add makeup to and/or remove makeup from the facial images. The motivation to modify Chandran in view of Fu to add makeup to the facial images is “Use of known technique to improve similar devices (methods, or products) in the same way”.
Regarding claim 2, Chandran, D'Alessandro, and Fu teach all the features with respect to claim 1 as outlined above. Further, Fu teaches that the non-transitory computer readable medium of claim 1, wherein:
the instructions that, when executed by the at least one processor, cause the computer device to receive, at the neural network, the target face image and the reference face image, include instructions that, when executed by the at least one processor, cause the computer device to receive, at the neural network, the target face image and both a first reference face image and a second reference face image, the first reference face image having a first reference facial component with makeup selected by the user via the GUI and the second reference image having a second reference facial component with makeup selected by the user via the GUI (See Fu: Fig. 11, and [0021], “Also within the scope of the invention is a system for detecting and removing makeup from an input image, where the system is configured to be capable of: receiving an input image from a user interface with makeup applied thereto; locating facial landmarks from the facial image of the user in at least a first region and/or a second region different from the first region, wherein the first region includes makeup and/or the second region includes makeup; if the first region is located, decomposing the first region of the facial image into first channels and feeding the first channels of the first region into histogram matching using a reference histogram from a dataset of histograms of faces each having no makeup to obtain a first image with the makeup removed in the first region and/or if the second region is located, converting the second region of the facial image into color channels and feeding the color channels into histogram matching under different lighting conditions and using a reference histogram from a dataset of histograms of faces under different lighting conditions each having no makeup to obtain a second image with makeup being removed in the second region; and if both the first region and the second region are located, combining the first image and the second image to form a resultant facial image with makeup removed from the first region and the second region”); and
the instructions that, when executed by the at least one processor, cause the computer device to transfer, by the neural network, the makeup from the reference face image to the target face image include instructions that, when executed by the at least one processor, cause the computer device to transfer, by the neural network, the makeup from the first reference facial component and the makeup from the second reference facial component to the target face image (See Fu: Fig. 3, and [0116], “While various regions and fiducial points may be used in the method and system herein, for purposes of explaining a preferred embodiment illustrating a first and/or second region to be extracted and one of such regions intrinsically decomposed, the following example illustrates such steps using the eye and mouth regions as follows. For the eye region 1040A, for example, an intrinsic image decomposition technique is utilized in Step 1045 to recover the shading and reflectance channels of the eye region. Then, in Step 1050A, the shading channel and reflectance channel are fed into histogram matching separately to get an image with the makeup removed in the eye region. For the lip/mouth region 1040B, for example, an image is first transferred, i.e., converted, to HSV color channels, and different histogram matching procedures are applied to the H, S, V channels separately with regard to different lighting conditions. For lip color removal, specific reference histograms of “saturation” and “value” were learned from a collected dataset of facial images without makeup on the lips. With those predefined lip histograms, an input lip makeup could be removed by matching the detected lip histogram to a corresponding one having no makeup. 
For the lip channel, the “hue” channel is used as the lip region which usually has only one value so that one need not use a histogram to represent it, and the procedure for “hue” channel is set as the value of the “hue” channel for each pixel compared to a pre-trained color value”).
Regarding claim 3, Chandran, D'Alessandro, and Fu teach all the features with respect to claim 1 as outlined above. Further, Fu teaches that the non-transitory computer readable medium of claim 1, wherein the neural network is trained using the semi-supervised learning using both paired data and unpaired data, wherein the unpaired data is a larger dataset than the paired data (See Fu: Fig. 19, and [0264], “Previous patches and current patches are then matched by computing the correlation coefficient of each pair of the patches. Then the best region of interest in the current patches are chosen and their centers are made as final landmarks 3090. In addition, the correlation coefficient may also be used to classify which landmarks are occluded. The calculation function is preferably”).
Regarding claim 4, Chandran, D'Alessandro, and Fu teach all the features with respect to claim 3 as outlined above. Further, Fu teaches that the non-transitory computer readable medium of claim 3, wherein the neural network is trained using the semi-supervised learning by i) training the neural network using only the paired data using supervised learning and then ii) iteratively training the neural network using subsets of the unpaired data using unsupervised learning (See Fu: Figs. 25A-B, and [0150], “The iterative method is used to detect the lip region using the complexion probability map of the lower part of the face. In each iteration, more offset is added on the base threshold until the binary image contains a contour region that satisfies the above criteria and has the convex hull configuration for the white region. Once such criteria are met, the detected region is considered to be the initial lip region”).
Regarding claim 5, Chandran, D'Alessandro, and Fu teach all the features with respect to claim 4 as outlined above. Further, Chandran and Fu teach that the non-transitory computer readable medium of claim 4, wherein:
training the neural network using only the paired data includes computing a first loss function between an output of the neural network and a ground truth (GT) (See Chandran: Fig. 11, and [0081], “At step 1106, the model trainer 116 trains the mapping between 2D landmarks and expression codes applied by the mapping module 806 based on the normalized landmarks and ground truth blendweights, while keeping the previously trained identity encoder 152 and the decoder 156 fixed. In some embodiments, the mapping may be trained using the ground truth blendweights, which permit supervision on the facial expression code, given the pre-trained expression encoder 154, and the resulting geometry may be included in the loss function during training using the pre-trained decoder 156”); and
iteratively training the neural network using the subsets of the unpaired data comprises computing a second loss function, wherein the second loss function is a different loss function than the first loss function (See Fu: Figs. 25A-B, and [0150], “The iterative method is used to detect the lip region using the complexion probability map of the lower part of the face. In each iteration, more offset is added on the base threshold until the binary image contains a contour region that satisfies the above criteria and has the convex hull configuration for the white region. Once such criteria are met, the detected region is considered to be the initial lip region”; and Fig. 36, and [0164], “The present texture simulator 100 is capable of learning any lipstick texture given a single reference image of such texture and is shown in a representative component flow chart in FIG. 36. The simulation pipeline consists of four modules (see, FIG. 36): training module 52, pre-process module 50, a mono-channel style transfer (MST) module 54 and a post-process module 56. Given a desired, deep convolutional neural network structure, the training module is responsible for learning all the hidden weights and bias through gradient descent guided by any self-defined loss function. The style transfer model may be trained on any image dataset 58 that is either under the creative commons attribution license or is self-prepared by an in-house dataset. After the training module, a style transfer model is ready to be used with the rest of the modules”).
Regarding claim 6, Chandran, D'Alessandro, and Fu teach all the features with respect to claim 5 as outlined above. Further, Fu teaches that the non-transitory computer readable medium of claim 5, wherein the first loss function includes a L1 loss function and the second loss function includes an adversarial loss function using a generator adversarial network (See Fu: Fig. 36, and [0163], “Existing virtual try-on techniques rely heavily on the original light distribution on the input lip region, which is intrinsically challenging for simulating textures that have a large deviation in luminance distribution compared to the input image. Therefore, to generate a more realistic texture, the original lip luminance pattern needs to be mapped into a reference pattern through a mapping function. Such a mapping function would have to be highly nonlinear and complex to be modeled explicitly by hand. For this reason, a deep learning model, which is known to have the capability to model highly nonlinear functions, is employed herein for solving style transfer problems. Research on style transfer has been increasing in recent years, especially in the deep learning domains. For instance, several publications demonstrate the capability of deep networks to mimic any input textures or art styles in real-time. See, for example, Johnson, Justin et al. “Perceptual Losses for Real-Time Style Transfer and Super-Resolution,” ECCV (2016); Zhang, Hang and Kristin J. Dana, “Multi-style Generative Network for Real-time Transfer,” CoRR abs/1703.06953 (2017); and Li, Chuan and Michael Wand, “Precomputed Real-Time Texture Synthesis with Markovian Generative Adversarial Networks.” ECCV (2016)”; and [0164], “The present texture simulator 100 is capable of learning any lipstick texture given a single reference image of such texture and is shown in a representative component flow chart in FIG. 36. The simulation pipeline consists of four modules (see, FIG. 36): training module 52, pre-process module 50, a mono-channel style transfer (MST) module 54 and a post-process module 56. Given a desired, deep convolutional neural network structure, the training module is responsible for learning all the hidden weights and bias through gradient descent guided by any self-defined loss function. The style transfer model may be trained on any image dataset 58 that is either under the creative commons attribution license or is self-prepared by an in-house dataset. After the training module, a style transfer model is ready to be used with the rest of the modules”).
Regarding claim 7, Chandran, D'Alessandro, and Fu teach all the features with respect to claim 3 as outlined above. Further, Chandran teaches that the non-transitory computer readable medium of claim 3, wherein the paired data includes a training reference face image with makeup, a training target face image without the makeup, and the training target face image with the makeup, wherein the training reference face image and the training target face image have a same identity (See Chandran: Fig. 1, and [0073], “As shown, a method 1000 begins at step 1002, where the model trainer 116 receives meshes of neutral faces and meshes of faces having expressions. The meshes may be obtained in any technically feasible manner. In some embodiments, the meshes are extracted, using well-known techniques, from standalone images and/or the frames of videos depicting human faces. For example, a passively-lit, multi-camera setup could be used to capture a number of individuals having different ethnicities, genders, age groups, and body mass index (BMI) in a predefined set of facial expressions, including the neutral expression. The captured images of individuals can then be reconstructed using well-known techniques, and a template mesh including a number of vertices can be semi-automatically registered to the reconstructions of each individual. In addition, facial expressions can be stabilized to remove rigid head motions and align the facial expressions to the same canonical space. Experience has shown that a relatively small number of individuals (e.g., hundreds of individuals) and predefined expressions (e.g., tens of expressions) can be used to train the face model 150”) and a same alignment (See Fu: Fig. 3, and [0115], “Many suitable commercial and open-source software exists for facial detection, such as Python, dlib and HOG, as well as for landmark detection and identification of fiducial points, such as that described by V. Kazemi et al., “One Millisecond Face Alignment with an Ensemble of Regression Trees,” KTH, Royal Institute of Technology, Computer Vision and Active Perception Lab, Stockholm, Sweden (2014). Preferred for use herein is Giaran, Inc. software”).
Regarding claim 8, Chandran, D'Alessandro, and Fu teach all the features with respect to claim 3 as outlined above. Further, Fu teaches that the non-transitory computer readable medium of claim 3, wherein the unpaired data includes a training reference face image with makeup and a training target face image without the makeup, wherein the training reference face image and the training target face image are randomly paired face images (See Fu: Fig. 19, and [0264], “Previous patches and current patches are then matched by computing the correlation coefficient of each pair of the patches. Then the best region of interest in the current patches are chosen and their centers are made as final landmarks 3090. In addition, the correlation coefficient may also be used to classify which landmarks are occluded”).
Regarding claim 9, Chandran, D'Alessandro, and Fu teach all the features with respect to claim 1 as outlined above. Further, Fu teaches that the non-transitory computer readable medium of claim 1, wherein the neural network comprises:
a first model that is trained to transfer the makeup from a lip region of the reference face image to a lip region of the target face image using the semi-supervised learning (See Fu: Figs. 10A-D, and [0197], “As noted above, the methods of makeup removal and application, as well as the applied end effects and texture simulations may be used independently or in an overall method and system, and may be supplemented by the various enhanced techniques noted below. FIG. 37 shows a general flow chart of a combination 500 of some of the embodiments of methods and systems herein. An input image II, II′ (as defined herein) can be provided by a user through a user interface (UI). The user interface can preferably communicate a digital input image as defined herein. The input image II, II′ may be processed and the device and color calibrated as described in this disclosure (200, 8000) and the landmarks detected and/or identified and annotated using various landmark detection and annotation methods described herein 300, 3000. When providing the input image II, II′, the user can elect to use the virtual makeup removal methods and systems described herein, including, for example, method 1000 to remove any makeup virtually from the input image should the user with to initially remove makeup. If the input image is sent without makeup so that removal is not required or once any makeup is removed using the methods herein, or, should the user with to use an add-on program without having removed makeup in one or more locations, the input image, is then optionally is sent to the makeup service (MS) and may be subjected to any of the makeup try-on, output end effects or texturing simulation as described in the systems and methods herein. For example, a virtual try-on may be used to apply an eye makeup virtual application of either a single or multiple type and layer eye makeup add-on as described in embodiment 400, including one or more of its specific sub-methods 10000, 20000, 30000, 40000 and 50000. Alternatively, a lip makeup color and/or output end effects as described herein (see method 2000), including an optional plumping effect and/or lip texture simulation (as in method 100) may be employed by the makeup service”); and
a second model that is trained to transfer the makeup from an eye region of the reference face image to an eye region of the target face image using the semi-supervised learning (See Fu: Fig. 27A-F, and [0172], “As shown in FIGS. 27a-27f, using a sample image photo II′″, various sections of eye makeup and/or eye features can be layered on an eye 424 of the photo II′″ layer by layer as shown. FIG. 27a shows an input image II″′ of a face 426 having no makeup applied. FIG. 27b includes an eye shadow layer add-on 428 applied to the face 426 of image II′″. FIG. 27c includes a middle eye shadow add-on 430 as applied to image II″′. FIG. 27d includes an eye corner add-on 432 applied to image II″′. FIG. 27e shows an eye tail add-on 434 applied to Image II″′, and FIG. 27f includes an eye lash 436 add-on also applied to Image II″′”).
Regarding claim 11, Chandran, D'Alessandro, and Fu teach all the features with respect to claim 1 as outlined above. Further, Chandran, D'Alessandro, and Fu teach that a system (See Chandran: Fig. 1, and [0031], “Fig. 1 illustrates a system 100 configured to implement one or more aspects of the various embodiments. As shown, the system 100 includes a machine learning server 110, a data store 120, and a computing device 140 in communication over a network 130, which may be a wide area network (WAN) such as the Internet, a local area network (LAN), or any other suitable network”) comprising:
at least one processor (See Chandran: Fig. 1, and [0032], “As shown, a model trainer 116 executes on a processor 112 of the machine learning server 110 and is stored in a system memory 114 of the machine learning server 110”); and
a non-transitory computer readable medium comprising instructions that, when executed by the at least one processor, cause the system (See Chandran: Fig. 1, and [0033], “The system memory 114 of the machine learning server 110 stores content, such as software applications and data, for use by the processor 112 and the GPU”) to:
receive, at a neural network, paired data, the paired data including target training face images without makeup and reference training face images with makeup (See Chandran: Fig. 1, and [0035], “The model trainer 116 is configured to train machine learning models, including a non-linear model for generating faces 150, which is also referred to herein as a “face model.” As shown, the face model 150 includes an identity encoder 152, an expression encoder 154, and a decoder 156. Any technically feasible types of encoders and decoders may be used. In some embodiments, each of the identity encoder 152 and the expression encoder 154 may include a deep neural network, such as an encoder from a variational autoencoder (VAE). Similarly, the decoder 156 may also include a deep neural network in some embodiments. Operation(s) performed to encode representations of facial identities using the identity encoder 152, or to encode representations of facial expressions using the expression encoder 154 (or another mapping), are also referred to herein as “encoding operation(s).” Operation(s) performed to generate a representation of a face using the decoder 156, based on an encoded representation of a facial identity and an encoded representation of a facial expression, are also referred to herein as “decoding operation(s).””; Fig. 15 and [0096], “As shown, a method 1500 begins at step 1502, where the application 146 receives an image of a face. For example, the image could be a standalone image or one of multiple frames of a video. In the case of a video, steps of the method 1500 may be repeated for each frame in the video”; and [0101], “At step 1508, the application 146 receives a representation of a facial identity. Similar to step 1202 of the method 1200 described above in conjunction with FIG. 12, the facial identity may be represented in any technically feasible manner, such as an identity code, a neutral face mesh that can be converted to an identity code using the identity encoder 152, an image or video frame from which a neutral face mesh can be determined, etc.”), wherein each pair of target training face images and reference training face images have a same identity (See Chandran: Fig. 10, and [0039], “In such cases, the training data set may include 3D meshes of neutral faces for different facial identities, as well as meshes for the same facial identities and a number of predefined expressions”) and a same alignment (See Fu: Fig. 3, and [0115], “Many suitable commercial and open-source software exists for facial detection, such as Python, dlib and HOG, as well as for landmark detection and identification of fiducial points, such as that described by V. Kazemi et al., “One Millisecond Face Alignment with an Ensemble of Regression Trees,” KTH, Royal Institute of Technology, Computer Vision and Active Perception Lab, Stockholm, Sweden (2014). Preferred for use herein is Giaran, Inc. software”);
train the neural network using the paired data using supervised learning (See Chandran: Fig. 10, and [0051], “As described, facial identity and expression are separated in the internal representation of the face model 150, which permits semantic control of identities and expressions of faces generated by the face model 150. Experience has shown that the face model 150 is capable of learning to generate more realistic-looking faces than conventional linear-based models. As discussed in greater detail below in conjunction with FIG. 10, in some embodiments the identity encoder 152, the expression encoder 154, and the decoder 156 are trained in an end-to-end and fully supervised manner using a L1 loss function, with the identity and expression latent spaces being constrained using Kullback-Leibler (KL) divergence losses, a fixed learning rate, and the Adaptive Moment Estimation (ADAM) optimizer. That is, three loss functions are used, the L1 loss on reconstruction, which is the mesh prediction output by the decoder 156, and two KL divergence losses on the identity and expression encoders 152 and 154, respectively”);
after training the neural network using the paired data, receive, at the neural network, unpaired data, the unpaired data including randomly paired target training face images without makeup and reference training face images with makeup (See Fu: Figs. 9-10, and [0204], “Thus, one can collect various images with makeup on them and instead of having to significant numbers of images with makeup off, the makeup removal method may be used to generate numbers of images with no makeup applied that are used as input data for training in step 4030”; and [0206], “FIGS. 10A-10D shows more detailed output examples of the makeup annotation system 5000 in accordance with an embodiment of the present disclosure. Through the makeup annotation system 5000, digitalized makeup information can be generated and this information may be used as input data of the deep learning training in step 4045”. Note that data 4020 and 5000 are unpaired); and
train the neural network using the unpaired data using unsupervised learning (See Fu: Figs. 9-10, and [0205], “For model training, a deep learning framework 4035 such as Caffe™, Caffe2™ or Pytorch™ is used to support many different types of deep learning architectures for image classification and image segmentation. Such a framework supports a variety of neural network patterns, as well as fully connected neural network designs”. Note that data 4020 and 5000 are unpaired, thus, the neural network training is unsupervised learning).
Regarding claim 12, Chandran, D'Alessandro, and Fu teach all the features with respect to claim 11 as outlined above. Further, Fu teaches that the system of claim 11, wherein the instructions that, when executed by the at least one processor, cause the system to train the neural network using the unpaired data include instructions that, when executed by the at least one processor, cause the system to iteratively train the neural network using subsets of the unpaired data (See Fu: Figs. 25A-B, and [0150], “The iterative method is used to detect the lip region using the complexion probability map of the lower part of the face. In each iteration, more offset is added on the base threshold until the binary image contains a contour region that satisfies the above criteria and has the convex hull configuration for the white region. Once such criteria are met, the detected region is considered to be the initial lip region”).
Regarding claim 13, Chandran, D'Alessandro, and Fu teach all the features with respect to claim 12 as outlined above. Further, Chandran and Fu teach that the system of claim 12, wherein:
the instructions that, when executed by the at least one processor, cause the system to train the neural network using the paired data include instructions that, when executed by the at least one processor, cause the system to compute a first loss function between an output of the neural network and a ground truth (GT) (See Chandran: Fig. 11, and [0081], “At step 1106, the model trainer 116 trains the mapping between 2D landmarks and expression codes applied by the mapping module 806 based on the normalized landmarks and ground truth blendweights, while keeping the previously trained identity encoder 152 and the decoder 156 fixed. In some embodiments, the mapping may be trained using the ground truth blendweights, which permit supervision on the facial expression code, given the pre-trained expression encoder 154, and the resulting geometry may be included in the loss function during training using the pre-trained decoder 156”); and
the instructions that, when executed by the at least one processor, cause the system to iteratively train the neural network using the subsets of the unpaired data include instructions that, when executed by the at least one processor, cause the system to compute a second loss function, wherein the second loss function is a different loss function than the first loss function (See Fu: Figs. 25A-B, and [0150], “The iterative method is used to detect the lip region using the complexion probability map of the lower part of the face. In each iteration, more offset is added on the base threshold until the binary image contains a contour region that satisfies the above criteria and has the convex hull configuration for the white region. Once such criteria are met, the detected region is considered to be the initial lip region”; and Fig. 36, and [0164], “The present texture simulator 100 is capable of learning any lipstick texture given a single reference image of such texture and is shown in a representative component flow chart in FIG. 36. The simulation pipeline consists of four modules (see, FIG. 36): training module 52, pre-process module 50, a mono-channel style transfer (MST) module 54 and a post-process module 56. Given a desired, deep convolutional neural network structure, the training module is responsible for learning all the hidden weights and bias through gradient descent guided by any self-defined loss function. The style transfer model may be trained on any image dataset 58 that is either under the creative commons attribution license or is self-prepared by an in-house dataset. After the training module, a style transfer model is ready to be used with the rest of the modules”).
Regarding claim 14, Chandran, D'Alessandro, and Fu teach all the features with respect to claim 13 as outlined above. Further, Fu teaches that the system of claim 13, wherein the first loss function includes a L1 loss function and the second loss function includes an adversarial loss function using a generator adversarial network (See Fu: Fig. 36, and [0163], “Existing virtual try-on techniques rely heavily on the original light distribution on the input lip region, which is intrinsically challenging for simulating textures that have a large deviation in luminance distribution compared to the input image. Therefore, to generate a more realistic texture, the original lip luminance pattern needs to be mapped into a reference pattern through a mapping function. Such a mapping function would have to be highly nonlinear and complex to be modeled explicitly by hand. For this reason, a deep learning model, which is known to have the capability to model highly nonlinear functions, is employed herein for solving style transfer problems. Research on style transfer has been increasing in recent years, especially in the deep learning domains. For instance, several publications demonstrate the capability of deep networks to mimic any input textures or art styles in real-time. See, for example, Johnson, Justin et al. “Perceptual Losses for Real-Time Style Transfer and Super-Resolution,” ECCV (2016); Zhang, Hang and Kristin J. Dana, “Multi-style Generative Network for Real-time Transfer,” CoRR abs/1703.06953 (2017); and Li, Chuan and Michael Wand, “Precomputed Real-Time Texture Synthesis with Markovian Generative Adversarial Networks.” ECCV (2016)”; and [0164], “The present texture simulator 100 is capable of learning any lipstick texture given a single reference image of such texture and is shown in a representative component flow chart in FIG. 36. The simulation pipeline consists of four modules (see, FIG. 36): training module 52, pre-process module 50, a mono-channel style transfer (MST) module 54 and a post-process module 56. Given a desired, deep convolutional neural network structure, the training module is responsible for learning all the hidden weights and bias through gradient descent guided by any self-defined loss function. The style transfer model may be trained on any image dataset 58 that is either under the creative commons attribution license or is self-prepared by an in-house dataset. After the training module, a style transfer model is ready to be used with the rest of the modules”).
Regarding claim 15, Chandran, D'Alessandro, and Fu teach all the features with respect to claim 11 as outlined above. Further, Fu teaches that the system of claim 11, wherein the unpaired data is a larger dataset than the paired data (See Fu: Fig. 19, and [0264], “Previous patches and current patches are then matched by computing the correlation coefficient of each pair of the patches. Then the best region of interest in the current patches are chosen and their centers are made as final landmarks 3090. In addition, the correlation coefficient may also be used to classify which landmarks are occluded. The calculation function is preferably”).
Regarding claim 16, Chandran, D'Alessandro, and Fu teach all the features with respect to claim 1 as outlined above. Further, Chandran, D'Alessandro, and Fu teach that a computer-implemented method for training a neural network to automatically transfer makeup from a first image to a second image, the method (See Chandran: Fig. 1, and [0031], “Fig. 1 illustrates a system 100 configured to implement one or more aspects of the various embodiments. As shown, the system 100 includes a machine learning server 110, a data store 120, and a computing device 140 in communication over a network 130, which may be a wide area network (WAN) such as the Internet, a local area network (LAN), or any other suitable network”) comprising:
receiving, at a neural network, paired data, the paired data including target training face images without makeup and reference training face images with makeup (See Chandran: Fig. 1, and [0035], “The model trainer 116 is configured to train machine learning models, including a non-linear model for generating faces 150, which is also referred to herein as a “face model.” As shown, the face model 150 includes an identity encoder 152, an expression encoder 154, and a decoder 156. Any technically feasible types of encoders and decoders may be used. In some embodiments, each of the identity encoder 152 and the expression encoder 154 may include a deep neural network, such as an encoder from a variational autoencoder (VAE). Similarly, the decoder 156 may also include a deep neural network in some embodiments. Operation(s) performed to encode representations of facial identities using the identity encoder 152, or to encode representations of facial expressions using the expression encoder 154 (or another mapping), are also referred to herein as “encoding operation(s).” Operation(s) performed to generate a representation of a face using the decoder 156, based on an encoded representation of a facial identity and an encoded representation of a facial expression, are also referred to herein as “decoding operation(s).””; Fig. 15 and [0096], “As shown, a method 1500 begins at step 1502, where the application 146 receives an image of a face. For example, the image could be a standalone image or one of multiple frames of a video. In the case of a video, steps of the method 1500 may be repeated for each frame in the video”; and [0101], “At step 1508, the application 146 receives a representation of a facial identity. Similar to step 1202 of the method 1200 described above in conjunction with FIG. 12, the facial identity may be represented in any technically feasible manner, such as an identity code, a neutral face mesh that can be converted to an identity code using the identity encoder 152, an image or video frame from which a neutral face mesh can be determined, etc.”), wherein each pair of target training face images and reference training face images have a same identity (See Chandran: Fig. 10, and [0039], “In such cases, the training data set may include 3D meshes of neutral faces for different facial identities, as well as meshes for the same facial identities and a number of predefined expressions”) and a same alignment (See Fu: Fig. 3, and [0115], “Many suitable commercial and open-source software exists for facial detection, such as Python, dlib and HOG, as well as for landmark detection and identification of fiducial points, such as that described by V. Kazemi et al., “One Millisecond Face Alignment with an Ensemble of Regression Trees,” KTH, Royal Institute of Technology, Computer Vision and Active Perception Lab, Stockholm, Sweden (2014). Preferred for use herein is Giaran, Inc. software”);
training the neural network using the paired data using supervised learning (See Chandran: Fig. 10, and [0051], “As described, facial identity and expression are separated in the internal representation of the face model 150, which permits semantic control of identities and expressions of faces generated by the face model 150. Experience has shown that the face model 150 is capable of learning to generate more realistic-looking faces than conventional linear-based models. As discussed in greater detail below in conjunction with FIG. 10, in some embodiments the identity encoder 152, the expression encoder 154, and the decoder 156 are trained in an end-to-end and fully supervised manner using a L1 loss function, with the identity and expression latent spaces being constrained using Kullback-Leibler (KL) divergence losses, a fixed learning rate, and the Adaptive Moment Estimation (ADAM) optimizer. That is, three loss functions are used, the L1 loss on reconstruction, which is the mesh prediction output by the decoder 156, and two KL divergence losses on the identity and expression encoders 152 and 154, respectively”);
after training the neural network using the paired data, receiving, at the neural network, unpaired data, the unpaired data including randomly paired target training face images without makeup and reference training face images with makeup (See Fu: Figs. 9-10, and [0204], “Thus, one can collect various images with makeup on them and instead of having to significant numbers of images with makeup off, the makeup removal method may be used to generate numbers of images with no makeup applied that are used as input data for training in step 4030”; and [0206], “FIGS. 10A-10D shows more detailed output examples of the makeup annotation system 5000 in accordance with an embodiment of the present disclosure. Through the makeup annotation system 5000, digitalized makeup information can be generated and this information may be used as input data of the deep learning training in step 4045”. Note that data 4020 and 5000 are unpaired); and
iteratively training the neural network using subsets of the unpaired data using unsupervised learning (See Fu: Figs. 9-10, and [0205], “For model training, a deep learning framework 4035 such as Caffe™, Caffe2™ or Pytorch™ is used to support many different types of deep learning architectures for image classification and image segmentation. Such a framework supports a variety of neural network patterns, as well as fully connected neural network designs”. Note that data 4020 and 5000 are unpaired, thus, the neural network training is unsupervised learning).
Regarding claim 17, Chandran, D'Alessandro, and Fu teach all the features with respect to claim 16 as outlined above. Further, Chandran teaches the computer-implemented method of claim 16, wherein training the neural network using the paired data includes computing a first loss function between an output of the neural network and a ground truth (GT) (See Chandran: Fig. 11, and [0081], “At step 1106, the model trainer 116 trains the mapping between 2D landmarks and expression codes applied by the mapping module 806 based on the normalized landmarks and ground truth blendweights, while keeping the previously trained identity encoder 152 and the decoder 156 fixed. In some embodiments, the mapping may be trained using the ground truth blendweights, which permit supervision on the facial expression code, given the pre-trained expression encoder 154, and the resulting geometry may be included in the loss function during training using the pre-trained decoder 156”).
Regarding claim 18, Chandran, D'Alessandro, and Fu teach all the features with respect to claim 17 as outlined above. Further, Chandran teaches the computer-implemented method of claim 17, wherein iteratively training the neural network using the subsets of the unpaired data includes computing a second loss function, wherein the second loss function is a different loss function than the first loss function (See Chandran: Fig. 10, and [0051], “As described, facial identity and expression are separated in the internal representation of the face model 150, which permits semantic control of identities and expressions of faces generated by the face model 150. Experience has shown that the face model 150 is capable of learning to generate more realistic-looking faces than conventional linear-based models. As discussed in greater detail below in conjunction with FIG. 10, in some embodiments the identity encoder 152, the expression encoder 154, and the decoder 156 are trained in an end-to-end and fully supervised manner using a L1 loss function, with the identity and expression latent spaces being constrained using Kullback-Leibler (KL) divergence losses, a fixed learning rate, and the Adaptive Moment Estimation (ADAM) optimizer. That is, three loss functions are used, the L1 loss on reconstruction, which is the mesh prediction output by the decoder 156, and two KL divergence losses on the identity and expression encoders 152 and 154, respectively”).
Regarding claim 19, Chandran, D'Alessandro, and Fu teach all the features with respect to claim 18 as outlined above. Further, Fu teaches the computer-implemented method of claim 18, wherein the first loss function includes a L1 loss function and the second loss function includes an adversarial loss function using a generator adversarial network (See Fu: Fig. 36, and [0163], “Existing virtual try-on techniques rely heavily on the original light distribution on the input lip region, which is intrinsically challenging for simulating textures that have a large deviation in luminance distribution compared to the input image. Therefore, to generate a more realistic texture, the original lip luminance pattern needs to be mapped into a reference pattern through a mapping function. Such a mapping function would have to be highly nonlinear and complex to be modeled explicitly by hand. For this reason, a deep learning model, which is known to have the capability to model highly nonlinear functions, is employed herein for solving style transfer problems. Research on style transfer has been increasing in recent years, especially in the deep learning domains. For instance, several publications demonstrate the capability of deep networks to mimic any input textures or art styles in real-time. See, for example, Johnson, Justin et al. “Perceptual Losses for Real-Time Style Transfer and Super-Resolution,” ECCV (2016); Zhang, Hang and Kristin J. Dana, “Multi-style Generative Network for Real-time Transfer,” CoRR abs/1703.06953 (2017); and Li, Chuan and Michael Wand, “Precomputed Real-Time Texture Synthesis with Markovian Generative Adversarial Networks.” ECCV (2016)”; and [0164], “The present texture simulator 100 is capable of learning any lipstick texture given a single reference image of such texture and is shown in a representative component flow chart in FIG. 36. The simulation pipeline consists of four modules (see, FIG. 36): training module 52, pre-process module 50, a mono-channel style transfer (MST) module 54 and a post-process module 56. Given a desired, deep convolutional neural network structure, the training module is responsible for learning all the hidden weights and bias through gradient descent guided by any self-defined loss function. The style transfer model may be trained on any image dataset 58 that is either under the creative commons attribution license or is self-prepared by an in-house dataset. After the training module, a style transfer model is ready to be used with the rest of the modules”).
Regarding claim 20, Chandran, D'Alessandro, and Fu teach all the features with respect to claim 16 as outlined above. Further, Fu teaches the computer-implemented method of claim 16, wherein the unpaired data is a larger dataset than the paired data (See Fu: Fig. 19, and [0264], “Previous patches and current patches are then matched by computing the correlation coefficient of each pair of the patches. Then the best region of interest in the current patches are chosen and their centers are made as final landmarks 3090. In addition, the correlation coefficient may also be used to classify which landmarks are occluded. The calculation function is preferably”).

Allowable Subject Matter
Claim 10 is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to GORDON G LIU whose telephone number is (571)270-0382. The examiner can normally be reached Monday - Friday 8:00-5:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jennifer Mehmood, can be reached at 571-272-2976. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/GORDON G LIU/
Primary Examiner, Art Unit 2612