DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Allowable Subject Matter
Claims 12 and 13 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 14 and 15 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Claim 14 recites the limitation "the second training rendered neural texture".  There is insufficient antecedent basis for this limitation in the claim. There is no “second training rendered neural texture” discussed in any of the claims from which claim 14 depends. Appropriate correction is required.
Claim 15 depends from claim 14 and is accordingly rejected.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1-11 and 16-20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Shysheya et al. (US Pub. 2021/0358197), hereinafter Shysheya.
Regarding claim 1, Shysheya discloses a method comprising, by a computing device: adjusting parameters of a three-dimensional geometry corresponding to a first person to make the three-dimensional geometry represent a desired pose for the first person (Paragraphs [0038]-[0040]: system for learning full body neural avatars is provided. The system trains a deep network to produce full body renderings of a person for varying person pose and camera positions. In the training process, the system explicitly estimates the 2D texture describing the appearance of the body surface. While retaining explicit texture estimation, the system bypasses the explicit estimation of 3D skin (surface) geometry at any time…classical (“neural-free”) avatar based on a standard computer graphics pipeline is to take a user-personalized body mesh in a neutral position, estimate the joint angles from the joint positions, perform skinning (deformation of the neutral pose) thus estimating the 3D geometry of the body. After that texture mapping is applied using precomputed 2D texture. Finally, the resulting textured model is lit using a certain lighting model and then projected onto the camera view. Creating a person's avatar in the classical pipeline thus requires personalizing the skinning process responsible for the geometry and the texture that is responsible for appearance); accessing a neural texture encoding an appearance of the first person (Paragraphs [0051]-[0056]: Textured neural avatar. The direct translation approach relies on the generalization ability of ConvNets and incorporates very little domain-specific knowledge into the system. 
As an alternative, the textured avatar approach is applied, that explicitly estimates the textures of body parts, thus ensuring the similarity of the body surface appearance under varying pose and cameras…training the neural textured avatar, a convolutional network gϕ is learned with learnable parameters ϕ to translate the input map stacks Bi into the body part assignments and the body part coordinates. As gϕ has two branches (“heads”), gϕP is the branch that produces the body part assignments stack, and gϕC is the branch that produces the body part coordinates. To learn the parameters of the textured neural avatar, the loss between the generated image and the ground truth image Ii is optimized); generating a first rendered neural texture based on a mapping between (1) a portion of the three-dimensional geometry that is visible from a viewing direction and (2) the neural texture, the first rendered neural texture comprising latent channels (Paragraph [0059]: FIG. 2 is the overview of the textured neural avatar system. The input pose is defined as a stack of “bone” rasterizations (one bone per channel; here we show it highlighted in red). The input is processed by the fully-convolutional network (orange) to produce body part assignment map stack and the body part coordinate map stack. These stacks are then used to sample the body texture maps at the locations prescribed by the part coordinate stack with the weights prescribed by the part assignment stack to produce the RGB image. In addition, the last body assignment stack map corresponds to the background probability. 
During learning, the mask and the RGB image are compared with ground-truth and the resulting losses are back-propagated through the sampling operation into the fully-convolutional network and onto the texture, resulting in their updates; Paragraphs [0079]-[0082]: the deep network is used to produce an RGB image (a three-channel stack) Ii and a single channel mask Mi, defining the pixels that are covered by the avatar. At training time, it is assumed that for each input frame i, the input joint locations and the “ground truth” foreground mask are estimated, and 3D body pose estimation and human semantic segmentation are used to extract them from raw video frames…Textured neural avatar. The direct translation approach relies on the generalization ability of the deep networks and incorporates very little domain-specific knowledge into the system. As an alternative, the textured avatar approach is applied, that explicitly estimates the textures of body parts, thus ensuring the similarity of the body surface appearance under varying pose and cameras. Following the DensePose approach [21], the body is subdivided into n parts, where each part has a 2D parameterization. Thus, it is assumed that in a person's image each pixel belongs to one of n parts or to the background. In the former case, the pixel is further associated with 2D part specific coordinates. The k-th body part is also associated with the texture map Tk that is estimated during training. The estimated textures are learned at training time and are reused for all camera views and all poses); generating a second rendered neural texture by processing the first rendered neural texture using a first neural network, the second rendered neural texture comprising color channels and latent channels (Paragraphs [0079]-[0082]: the deep network is used to produce an RGB image (a three-channel stack) Ii and a single channel mask Mi, defining the pixels that are covered by the avatar. 
At training time, it is assumed that for each input frame i, the input joint locations and the “ground truth” foreground mask are estimated, and 3D body pose estimation and human semantic segmentation are used to extract them from raw video frames…Textured neural avatar. The direct translation approach relies on the generalization ability of the deep networks and incorporates very little domain-specific knowledge into the system. As an alternative, the textured avatar approach is applied, that explicitly estimates the textures of body parts, thus ensuring the similarity of the body surface appearance under varying pose and cameras. Following the DensePose approach [21], the body is subdivided into n parts, where each part has a 2D parameterization. Thus, it is assumed that in a person's image each pixel belongs to one of n parts or to the background. In the former case, the pixel is further associated with 2D part specific coordinates. The k-th body part is also associated with the texture map Tk that is estimated during training. The estimated textures are learned at training time and are reused for all camera views and all poses); determining normal information associated with the portion of the three-dimensional geometry that is visible from the viewing direction (Fig. 3; Paragraph [0072]: a (full-body) avatar is defined as a system that is capable of rendering views of a certain person under varying human pose defined by a set of 3D positions of the body joints and varying camera positions (FIG. 3). FIG. 3 shows textured neural avatar results (without video-to-video post-processing) for different viewpoints during training. Reference numbers 1 to 6 denotes different viewpoints of the camera and images from viewpoints 1 to 6. In lower row of pictures on FIG. 3, the images on the left are obtained by processing the pose input shown on the right. 
Body joint positions are taken rather than joint angles as an input, since such positions are easier to estimate from data using marker-based or marker-less motion capture systems. A classical (“neural-free”) avatar based on a standard computer graphics pipeline is to take a user-personalized body mesh in a neutral position, estimate the joint angles from the joint positions, perform skinning (deformation of the neutral pose) thus estimating the 3D geometry of the body. After that texture mapping is applied using precomputed 2D texture. Finally, the resulting textured model is lit using a certain lighting model and then projected onto the camera view. Creating a person's avatar in the classical pipeline thus requires personalizing the skinning process responsible for the geometry and the texture that is responsible for appearance); generating a rendered image for the first person in the desired pose by processing the second rendered neural texture and the normal information using a second neural network (Fig. 3; Paragraph [0072]: FIG. 3 shows textured neural avatar results (without video-to-video post-processing) for different viewpoints during training. Reference numbers 1 to 6 denotes different viewpoints of the camera and images from viewpoints 1 to 6. In lower row of pictures on FIG. 3, the images on the left are obtained by processing the pose input shown on the right. Body joint positions are taken rather than joint angles as an input, since such positions are easier to estimate from data using marker-based or marker-less motion capture systems; Paragraphs [0103]-[0111]: a plurality of images of the person in different poses and from different viewpoints are received…the step 202, the 3D coordinates of body joint positions of the person defined in the camera coordinate frame are obtained for each image of the received plurality of images. The 3D coordinates may be obtained by using any appropriate technique. 
Such techniques are known in the prior art…the step 203, the machine learning predictor is initialized based on the 3D coordinates of the body joint positions and the received plurality of images to obtain parameters for predicting the map stack of body part assignments and the map stack of body part coordinates).
Regarding claim 2, Shysheya discloses the method of claim 1, wherein the three-dimensional geometry is constructed by interpolating three-dimensional geometries representing known poses for the first person (Paragraph [0046]: synthesizing images of a certain person given its pose is required. It is assumed that the pose for the i-th image comes in the form of 3D joint positions defined in the camera coordinate frame. As an input to the network, then consider a map stack Bi is considered, where each map Bji contains the rasterized j-th segment (bone) of the “stickman” (skeleton) projected on the camera plane. To retain the information about the third coordinate of the joints, the depth-value is linearly interpolated between the joints defining the segments, and the interpolated values are used to define the values in the map Bji corresponding to the bone pixels (the pixels not covered by the j-th bone are set to zero). Overall, the stack Bi incorporates the information about the person and the camera pose).
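For illustration only, the depth interpolation described in the cited Paragraph [0046] can be sketched as follows. This is a minimal sketch and not the reference's implementation: the function name `rasterize_bone`, the pixel-stepping scheme, and the (x, y, depth) joint format are assumptions made for this example.

```python
def rasterize_bone(joint_a, joint_b, height, width):
    """Rasterize one bone segment into a height x width map.

    Each joint is a hypothetical (x, y, depth) triple in pixel coordinates.
    Pixels on the segment receive a depth linearly interpolated between the
    two joints; pixels not covered by the bone stay at zero.
    """
    bone_map = [[0.0] * width for _ in range(height)]
    (xa, ya, za), (xb, yb, zb) = joint_a, joint_b
    # One step per pixel along the longer axis (at least one step).
    steps = int(max(abs(xb - xa), abs(yb - ya), 1))
    for s in range(steps + 1):
        t = s / steps
        x = round(xa + t * (xb - xa))
        y = round(ya + t * (yb - ya))
        if 0 <= x < width and 0 <= y < height:
            bone_map[y][x] = za + t * (zb - za)
    return bone_map


bone = rasterize_bone((0, 0, 1.0), (4, 0, 3.0), height=2, width=5)
```

In this sketch the interpolated quantity is the depth along a single bone segment, so one such map would be produced per bone and the maps stacked to form the input Bi.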
Regarding claim 3, Shysheya discloses the method of claim 1, wherein the three-dimensional geometry is constructed based on a three-dimensional geometry representing the desired pose for a second person (Paragraph [0016]: the trained machine learning predictor is retrained for another person based on a plurality of images of another person; Paragraphs [0038]-[0040]: system for learning full body neural avatars is provided. The system trains a deep network to produce full body renderings of a person for varying person pose and camera positions. In the training process, the system explicitly estimates the 2D texture describing the appearance of the body surface. While retaining explicit texture estimation, the system bypasses the explicit estimation of 3D skin (surface) geometry at any time…classical (“neural-free”) avatar based on a standard computer graphics pipeline is to take a user-personalized body mesh in a neutral position, estimate the joint angles from the joint positions, perform skinning (deformation of the neutral pose) thus estimating the 3D geometry of the body. After that texture mapping is applied using precomputed 2D texture. Finally, the resulting textured model is lit using a certain lighting model and then projected onto the camera view. Creating a person's avatar in the classical pipeline thus requires personalizing the skinning process responsible for the geometry and the texture that is responsible for appearance).
Regarding claim 4, Shysheya discloses the method of claim 1, wherein each texel of the neural texture has k-channel latent representation (Paragraph [0053]: map channel Pki for k=0 . . . n−1 is then interpreted as the probability of the pixel to belong to the k-th body part, and the map channel Pni corresponds to the probability of the background. The coordinate maps…correspond to the pixel coordinates on the k-th body part. Specifically, once the part assignments Pi and body part coordinates Ci are predicted, the image Ii at each pixel (x,y) are reconstructed as a weighted combination of texture elements, where the weights and texture coordinates are prescribed by the part assignment maps and the coordinate maps correspondingly; Paragraph [0077]: lower index i is used to denote objects that are specific to the i-th training or test image. Uppercase notation is used, e.g. Bi denotes a stack of maps (a third order tensor/three-dimensional array) corresponding to the i-th training or test image. The upper index is used to denote a specific map (channel) in the stack, e.g. Bji. Furthermore, square brackets is used to denote elements corresponding to a specific image location).
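For illustration only, the per-pixel reconstruction described in the cited Paragraph [0053] (a weighted combination of texture elements, with weights taken from the part assignment maps and lookup locations from the coordinate maps) can be sketched as follows. The names `sample_texture` and `reconstruct_pixel`, the nearest-neighbour lookup, and the normalized (u, v) coordinates are assumptions for this sketch; the reference does not specify them in the quoted passage.

```python
def sample_texture(texture, u, v):
    """Nearest-neighbour lookup in a 2D grid of RGB tuples; u, v in [0, 1]."""
    rows, cols = len(texture), len(texture[0])
    row = min(rows - 1, max(0, round(v * (rows - 1))))
    col = min(cols - 1, max(0, round(u * (cols - 1))))
    return texture[row][col]


def reconstruct_pixel(part_probs, part_coords, textures):
    """Combine per-part texture samples into one RGB value.

    part_probs  : n + 1 weights; the last entry is the background probability
                  and contributes no texture colour in this sketch.
    part_coords : n (u, v) pairs, one per body part.
    textures    : n texture maps, one per body part.
    """
    rgb = [0.0, 0.0, 0.0]
    for k, texture in enumerate(textures):
        u, v = part_coords[k]
        texel = sample_texture(texture, u, v)
        for c in range(3):
            rgb[c] += part_probs[k] * texel[c]
    return tuple(rgb)


red = [[(1.0, 0.0, 0.0)]]   # 1x1 texture for part 0
blue = [[(0.0, 0.0, 1.0)]]  # 1x1 texture for part 1
pixel = reconstruct_pixel([0.25, 0.75, 0.0], [(0.0, 0.0), (0.0, 0.0)], [red, blue])
```

Note that each texel in this sketch holds three color values, which is how the quoted passage describes the texture maps Tk.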
Regarding claim 5, Shysheya discloses the method of claim 1, wherein the rendered image for the first person is modified by swapping at least a part of the neural texture with the corresponding part of a neural texture encoding an appearance of a second person (Paragraph [0043]: Comparison of the performance of the textured neural avatar provided by the present invention with direct video-to-video translation approach [53] shows that explicit estimation of textures brings additional generalization capability and considerably improves the realism of the generated images for new views. Significant benefits provided by the present invention consist in the fact that the explicit decoupling of textures and geometry gives in the transfer learning scenarios, when the network is retrained to a new person with little training data; Paragraph [0065]: Once textured neural avatar is trained for a certain person based on a large amount of data, it can be retrained for a different person using much less data (so-called transfer learning). During retraining a new stack of texture maps is reestimated using the initialization procedure discussed above. After which the training process proceeds in a standard way but using the previously trained set of parameters ϕ as initialization).
Regarding claim 6, Shysheya discloses the method of claim 5, wherein the neural texture encoding the appearance of the first person and the neural texture encoding the appearance of the second person are simultaneously trained along with the first neural network and the second neural network (Paragraph [0069]: FIGS. 1-2, those descriptions are exemplary. Although the subject matter has been described in language specific to structural features or methodological acts, it is understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. Also, the invention is not limited by the illustrated order of the method steps, the order may be modified by a skilled person without creative efforts. Some or all of the method steps may be performed sequentially or concurrently).
Regarding claim 7, Shysheya discloses the method of claim 1, wherein the second neural network also produces a mask that is used for blending the generated rendered image for the first person in the desired pose with a background (Fig. 2; Paragraph [0042]: Keeping this component with the neural pipeline thus boosts generalization across such transforms. The role of the convolutional network in the approach of the present invention is then confined to predicting the texture coordinates of individual pixels given the body pose and the camera parameters (FIG. 2). Additionally, the network predicts the foreground/background mask).
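For illustration only, one common way a foreground/background mask can be used to blend a rendered pixel with a background is standard alpha compositing. This is a minimal sketch under that assumption; the quoted passage states only that the network predicts the mask, and the function name `blend_pixel` is hypothetical.

```python
def blend_pixel(foreground, background, mask):
    """Alpha-composite one RGB pixel: mask = 1 keeps the rendered foreground,
    mask = 0 keeps the background, values in between mix the two."""
    return tuple(mask * f + (1.0 - mask) * b
                 for f, b in zip(foreground, background))


blended = blend_pixel((1.0, 0.0, 0.0), (0.0, 0.0, 1.0), 0.25)
```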
Regarding claim 8, Shysheya discloses the method of claim 1, wherein a process for training the neural texture encoding the appearance of the first person, the first neural network, and the second neural network comprises: accessing a video stream of the first person taken from various viewing directions (Fig. 3; Paragraph [0072]: a (full-body) avatar is defined as a system that is capable of rendering views of a certain person under varying human pose defined by a set of 3D positions of the body joints and varying camera positions (FIG. 3). FIG. 3 shows textured neural avatar results (without video-to-video post-processing) for different viewpoints during training. Reference numbers 1 to 6 denotes different viewpoints of the camera and images from viewpoints 1 to 6. In lower row of pictures on FIG. 3, the images on the left are obtained by processing the pose input shown on the right. Body joint positions are taken rather than joint angles as an input, since such positions are easier to estimate from data using marker-based or marker-less motion capture systems. A classical (“neural-free”) avatar based on a standard computer graphics pipeline is to take a user-personalized body mesh in a neutral position, estimate the joint angles from the joint positions, perform skinning (deformation of the neutral pose) thus estimating the 3D geometry of the body. After that texture mapping is applied using precomputed 2D texture. Finally, the resulting textured model is lit using a certain lighting model and then projected onto the camera view. Creating a person's avatar in the classical pipeline thus requires personalizing the skinning process responsible for the geometry and the texture that is responsible for appearance); determining keyframes among a plurality of frames of the video stream that capture static salient appearances of the first person in the video stream (Fig. 
3; Paragraph [0008]: work of [59] is the most related to ours in this group, as they warp the individual frames of the multiview video dataset according to the target pose to generate new sequences; Paragraph [0047]: the network is used to produce an RGB image (a three-channel stack) Ii and a single channel mask Mi defining the pixels that are covered by the avatar. At training time, it is assumed that for each input frame i, the input joint locations and the “ground truth” foreground mask are estimated, and 3D body pose estimation and human semantic segmentation are used to extract them from raw video frames); for each of the determined keyframes: generating a training rendered image for the first person in a training pose shown in the frame using the neural texture, the first neural network, and the second neural network (Paragraphs [0047]-[0051]: training time, it is assumed that for each input frame i, the input joint locations and the “ground truth” foreground mask are estimated, and 3D body pose estimation and human semantic segmentation are used to extract them from raw video frames…the training or test examples i−1 and i−2 correspond to the preceding frames. The resulting video-to-video translation system provides a strong baseline for the present invention…the body is subdivided into n parts, where each part has a 2D parameterization. Thus, it is assumed that in a person's image each pixel belongs to one of n parts or to the background. In the former case, the pixel is further associated with 2D part specific coordinates. The k-th body part is also associated with the texture map Tk that is estimated during training. 
The estimated textures are learned at training time and are reused for all camera views and all poses); calculating losses by comparing the generated training rendered image and a ground truth image of the first person in the frame (Paragraphs [0056]-[0058]: learn the parameters of the textured neural avatar, the loss between the generated image and the ground truth image Ii is optimized…where d(⋅,⋅) is a loss comparing two images (the exact choice is discussed below). During the stochastic optimization, the gradient of the loss (4) is backpropagated through (2) both into the translation network gϕ and onto the texture maps Tk, so that minimizing this loss updates not only the network parameters but also the textures themselves. As an addition, the learning also optimizes the mask loss that measures the discrepancy between the ground truth background mask 1−Mi and the background mask prediction); and updating parameters of the neural texture, the first neural network, and the second neural network based on the calculated losses (Paragraphs [0057]-[0059]: a loss comparing two images (the exact choice is discussed below). During the stochastic optimization, the gradient of the loss (4) is backpropagated through (2) both into the translation network gϕ and onto the texture maps Tk, so that minimizing this loss updates not only the network parameters but also the textures themselves… After back-propagation of the weighted combination of (4) and (5), the network parameters ϕ and the textures maps Tk are updated. As the training progresses, the texture maps change (FIG. 2), and so does the body part coordinate predictions, so that the learning is free to choose the appropriate parameterization of body part surfaces…FIG. 2 is the overview of the textured neural avatar system. The input pose is defined as a stack of “bone” rasterizations (one bone per channel; here we show it highlighted in red). 
The input is processed by the fully-convolutional network (orange) to produce body part assignment map stack and the body part coordinate map stack. These stacks are then used to sample the body texture maps at the locations prescribed by the part coordinate stack with the weights prescribed by the part assignment stack to produce the RGB image. In addition, the last body assignment stack map corresponds to the background probability. During learning, the mask and the RGB image are compared with ground-truth and the resulting losses are back-propagated through the sampling operation into the fully-convolutional network and onto the texture, resulting in their updates).
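For illustration only, the joint update described in the cited Paragraphs [0057]-[0059], in which a single image loss is back-propagated into both the network parameters and the texture, can be sketched with a toy scalar model. Everything here is an assumption made for the sketch: a single scalar `weight` stands in for the network parameters ϕ, a single scalar `texel` stands in for a texture map Tk, the prediction is their product, and the loss is a squared error rather than the loss d(⋅,⋅) used in the reference.

```python
def train_joint(weight, texel, target, lr=0.1, steps=200):
    """Gradient descent on (weight * texel - target)**2, updating BOTH the
    stand-in network parameter and the stand-in texture value each step."""
    losses = []
    for _ in range(steps):
        pred = weight * texel
        err = pred - target
        losses.append(err * err)
        grad_weight = 2.0 * err * texel   # d(loss)/d(weight)
        grad_texel = 2.0 * err * weight   # d(loss)/d(texel)
        weight -= lr * grad_weight
        texel -= lr * grad_texel
    return weight, texel, losses


w, t, losses = train_joint(0.5, 0.5, target=1.0)
```

In this toy setting both values move away from their initial 0.5 and the loss shrinks over the iterations, which mirrors the quoted statement that minimizing the loss updates the network parameters and the textures together.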
Regarding claim 9, Shysheya discloses the method of claim 8, wherein generating the training rendered image for the first person in the training pose shown in the frame comprises: constructing a three-dimensional training geometry to represent the first person in the training pose shown in the frame based on a body shape model (Paragraphs [0051]-[0056]: Textured neural avatar. The direct translation approach relies on the generalization ability of ConvNets and incorporates very little domain-specific knowledge into the system. As an alternative, the textured avatar approach is applied, that explicitly estimates the textures of body parts, thus ensuring the similarity of the body surface appearance under varying pose and cameras…training the neural textured avatar, a convolutional network gϕ is learned with learnable parameters ϕ to translate the input map stacks Bi into the body part assignments and the body part coordinates. As gϕ has two branches (“heads”), gϕP is the branch that produces the body part assignments stack, and gϕC is the branch that produces the body part coordinates. To learn the parameters of the textured neural avatar, the loss between the generated image and the ground truth image Ii is optimized); generating a first training rendered neural texture based on a mapping between (1) a portion of the three-dimensional training geometry that is visible from a viewing direction of the frame and (2) the neural texture (Paragraph [0059]: FIG. 2 is the overview of the textured neural avatar system. The input pose is defined as a stack of “bone” rasterizations (one bone per channel; here we show it highlighted in red). The input is processed by the fully-convolutional network (orange) to produce body part assignment map stack and the body part coordinate map stack. 
These stacks are then used to sample the body texture maps at the locations prescribed by the part coordinate stack with the weights prescribed by the part assignment stack to produce the RGB image. In addition, the last body assignment stack map corresponds to the background probability. During learning, the mask and the RGB image are compared with ground-truth and the resulting losses are back-propagated through the sampling operation into the fully-convolutional network and onto the texture, resulting in their updates; Paragraphs [0079]-[0082]: the deep network is used to produce an RGB image (a three-channel stack) Ii and a single channel mask Mi, defining the pixels that are covered by the avatar. At training time, it is assumed that for each input frame i, the input joint locations and the “ground truth” foreground mask are estimated, and 3D body pose estimation and human semantic segmentation are used to extract them from raw video frames…Textured neural avatar. The direct translation approach relies on the generalization ability of the deep networks and incorporates very little domain-specific knowledge into the system. As an alternative, the textured avatar approach is applied, that explicitly estimates the textures of body parts, thus ensuring the similarity of the body surface appearance under varying pose and cameras. Following the DensePose approach [21], the body is subdivided into n parts, where each part has a 2D parameterization. Thus, it is assumed that in a person's image each pixel belongs to one of n parts or to the background. In the former case, the pixel is further associated with 2D part specific coordinates. The k-th body part is also associated with the texture map Tk that is estimated during training. 
The estimated textures are learned at training time and are reused for all camera views and all poses); generating a second training rendered neural texture by processing the first training rendered neural texture using the first neural network (Paragraph [0065]: Transfer learning. Once textured neural avatar is trained for a certain person based on a large amount of data, it can be retrained for a different person using much less data (so-called transfer learning). During retraining a new stack of texture maps is reestimated using the initialization procedure discussed above. After which the training process proceeds in a standard way but using the previously trained set of parameters ϕ as initialization); determining training normal information associated with the portion of the three-dimensional training geometry that is visible from the viewing direction (Fig. 3; Paragraph [0072]: a (full-body) avatar is defined as a system that is capable of rendering views of a certain person under varying human pose defined by a set of 3D positions of the body joints and varying camera positions (FIG. 3). FIG. 3 shows textured neural avatar results (without video-to-video post-processing) for different viewpoints during training. Reference numbers 1 to 6 denotes different viewpoints of the camera and images from viewpoints 1 to 6. In lower row of pictures on FIG. 3, the images on the left are obtained by processing the pose input shown on the right. Body joint positions are taken rather than joint angles as an input, since such positions are easier to estimate from data using marker-based or marker-less motion capture systems. A classical (“neural-free”) avatar based on a standard computer graphics pipeline is to take a user-personalized body mesh in a neutral position, estimate the joint angles from the joint positions, perform skinning (deformation of the neutral pose) thus estimating the 3D geometry of the body. 
After that texture mapping is applied using precomputed 2D texture. Finally, the resulting textured model is lit using a certain lighting model and then projected onto the camera view. Creating a person's avatar in the classical pipeline thus requires personalizing the skinning process responsible for the geometry and the texture that is responsible for appearance); and generating the training rendered image for the first person in the training pose by processing the second training rendered neural texture and the training normal information using the second neural network (Paragraphs [0079]-[0082]: the deep network is used to produce an RGB image (a three-channel stack) Ii and a single channel mask Mi, defining the pixels that are covered by the avatar. At training time, it is assumed that for each input frame i, the input joint locations and the “ground truth” foreground mask are estimated, and 3D body pose estimation and human semantic segmentation are used to extract them from raw video frames…Textured neural avatar. The direct translation approach relies on the generalization ability of the deep networks and incorporates very little domain-specific knowledge into the system. As an alternative, the textured avatar approach is applied, that explicitly estimates the textures of body parts, thus ensuring the similarity of the body surface appearance under varying pose and cameras. Following the DensePose approach [21], the body is subdivided into n parts, where each part has a 2D parameterization. Thus, it is assumed that in a person's image each pixel belongs to one of n parts or to the background. In the former case, the pixel is further associated with 2D part specific coordinates. The k-th body part is also associated with the texture map Tk that is estimated during training. The estimated textures are learned at training time and are reused for all camera views and all poses).
Regarding claim 10, Shysheya discloses the method of claim 8, wherein each frame of the video stream comprises an image with color channels (Paragraph [0047]: the network is used to produce an RGB image (a three-channel stack) Ii and a single channel mask Mi defining the pixels that are covered by the avatar. At training time, it is assumed that for each input frame i, the input joint locations and the “ground truth” foreground mask are estimated, and 3D body pose estimation and human semantic segmentation are used to extract them from raw video frames).
Regarding claim 11, Shysheya discloses the method of claim 8, wherein each determined keyframe is associated with a distinctive viewing direction (Fig. 3; Paragraph [0072]: a (full-body) avatar is defined as a system that is capable of rendering views of a certain person under varying human pose defined by a set of 3D positions of the body joints and varying camera positions (FIG. 3). FIG. 3 shows textured neural avatar results (without video-to-video post-processing) for different viewpoints during training. Reference numbers 1 to 6 denotes different viewpoints of the camera and images from viewpoints 1 to 6. In lower row of pictures on FIG. 3, the images on the left are obtained by processing the pose input shown on the right. Body joint positions are taken rather than joint angles as an input, since such positions are easier to estimate from data using marker-based or marker-less motion capture systems. A classical (“neural-free”) avatar based on a standard computer graphics pipeline is to take a user-personalized body mesh in a neutral position, estimate the joint angles from the joint positions, perform skinning (deformation of the neutral pose) thus estimating the 3D geometry of the body. After that texture mapping is applied using precomputed 2D texture. Finally, the resulting textured model is lit using a certain lighting model and then projected onto the camera view. Creating a person's avatar in the classical pipeline thus requires personalizing the skinning process responsible for the geometry and the texture that is responsible for appearance).
Regarding claim 16, Shysheya discloses the method of claim 8, wherein the losses comprise a red, green, and blue (RGB) loss, a feature loss, an adversarial loss, or a mask loss (Paragraph [0052]: During learning, the mask and the RGB image are compared with ground-truth and the resulting losses are back-propagated through the sampling operation into the fully-convolutional network and onto the texture, resulting in their updates).
Regarding claim 17, Shysheya discloses the method of claim 16, wherein the RGB loss is calculated based on a comparison between RGB channels of the generated training rendered image and RGB channels of the ground truth image of the first person in the frame (Paragraph [0047]: the network is used to produce an RGB image (a three-channel stack) Ii and a single channel mask Mi defining the pixels that are covered by the avatar. At training time, it is assumed that for each input frame i, the input joint locations and the “ground truth” foreground mask are estimated, and 3D body pose estimation and human semantic segmentation are used to extract them from raw video frames; Paragraph [0059]: FIG. 2 is the overview of the textured neural avatar system. The input pose is defined as a stack of “bone” rasterizations (one bone per channel; here we show it highlighted in red). The input is processed by the fully-convolutional network (orange) to produce body part assignment map stack and the body part coordinate map stack. These stacks are then used to sample the body texture maps at the locations prescribed by the part coordinate stack with the weights prescribed by the part assignment stack to produce the RGB image. In addition, the last body assignment stack map corresponds to the background probability. During learning, the mask and the RGB image are compared with ground-truth and the resulting losses are back-propagated through the sampling operation into the fully-convolutional network and onto the texture, resulting in their updates).
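For illustration only, the sampling and comparison steps quoted from Shysheya, Paragraph [0059], may be sketched as follows. This is a minimal sketch and not Shysheya's implementation: the function names sample_textures and rgb_l1_loss, the array layouts, the nearest-neighbor texture lookup, and the choice of a mean absolute (L1) difference as the RGB comparison are all assumptions made here, since the quoted passage states only that the RGB image is compared with ground truth.

```python
import numpy as np

def sample_textures(textures, assign, coords):
    """Blend per-part texture samples into one RGB image (hypothetical layout).

    textures: (n, S, S, 3) one square RGB texture map per body part
    assign:   (n + 1, H, W) per-pixel part weights; last channel is background
    coords:   (n, H, W, 2)  per-part (x, y) texture coordinates in [0, 1]
    """
    n, size = textures.shape[0], textures.shape[1]
    height, width = assign.shape[1:]
    rgb = np.zeros((height, width, 3))
    for k in range(n):
        # Assumption: nearest-neighbor lookup; a trainable system would
        # typically use a differentiable (e.g. bilinear) sampler instead.
        idx = np.clip(np.rint(coords[k] * (size - 1)).astype(int), 0, size - 1)
        sampled = textures[k][idx[..., 1], idx[..., 0]]  # (H, W, 3)
        # Weight each part's sample by its assignment probability.
        rgb += assign[k][..., None] * sampled
    return rgb

def rgb_l1_loss(pred, target):
    """Mean absolute difference over all pixels and RGB channels (assumed form)."""
    return float(np.mean(np.abs(pred - target)))
```

In this sketch the background channel of the assignment stack contributes no color, so a pixel assigned entirely to the background stays black.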
Regarding claim 18, Shysheya discloses the method of claim 16, wherein the feature loss is calculated based on a comparison between latent features extracted from the generated training rendered image and latent features extracted from the ground truth image of the first person in the frame (Paragraph [0047]: the network is used to produce an RGB image (a three-channel stack) Ii and a single channel mask Mi defining the pixels that are covered by the avatar. At training time, it is assumed that for each input frame i, the input joint locations and the “ground truth” foreground mask are estimated, and 3D body pose estimation and human semantic segmentation are used to extract them from raw video frames; Paragraph [0059]: FIG. 2 is the overview of the textured neural avatar system. The input pose is defined as a stack of “bone” rasterizations (one bone per channel; here we show it highlighted in red). The input is processed by the fully-convolutional network (orange) to produce body part assignment map stack and the body part coordinate map stack. These stacks are then used to sample the body texture maps at the locations prescribed by the part coordinate stack with the weights prescribed by the part assignment stack to produce the RGB image. In addition, the last body assignment stack map corresponds to the background probability. During learning, the mask and the RGB image are compared with ground-truth and the resulting losses are back-propagated through the sampling operation into the fully-convolutional network and onto the texture, resulting in their updates).
Regarding claim 19, the limitations of this claim substantially correspond to the limitations of claim 1 (except for the one or more computer-readable non-transitory storage media, which are disclosed by Shysheya, Paragraph [0116]: implemented, at least in part, by software-controlled data processing apparatus, it will be appreciated that a non-transitory machine-readable medium carrying such software, such as an optical disk, a magnetic disk, semiconductor memory or the like, is also considered to represent an embodiment of the present disclosure); thus they are rejected on similar grounds.
Regarding claim 20, the limitations of this claim substantially correspond to the limitations of claim 1 (except for the one or more processors; and a non-transitory memory coupled to the processors, which are disclosed by Shysheya, Paragraphs [0115]-[0117]: operations described above may be performed by the system for synthesizing 2-D images of a person. The system for synthesizing 2-D images of a person comprises a processor and a memory. The memory stores instructions causing the processor to implement the method for synthesizing 2-D image of a person…implemented, at least in part, by software-controlled data processing apparatus, it will be appreciated that a non-transitory machine-readable medium carrying such software, such as an optical disk, a magnetic disk, semiconductor memory or the like, is also considered to represent an embodiment of the present disclosure); thus they are rejected on similar grounds.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MATTHEW D SALVUCCI, whose telephone number is (571) 270-5748. The examiner can normally be reached M-F, 7:30-4:00 PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, XIAO WU, can be reached at (571) 272-7761. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/MATTHEW SALVUCCI/Primary Examiner, Art Unit 2613