DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Status of Claims
Applicant's amendments filed on 16 March 2022 have been entered.  Claims 4, 6-8, 10-12, 21, and 22 have been amended.  Claim 14 has been canceled.  No claims have been added.  Claims 4-13, 15-22, 25, and 26 are still pending in this application, with claims 4, 10-12, 21, and 22 being independent.

Response to Arguments
Applicant's arguments filed 16 March 2022 have been fully considered but they are not persuasive. 
Applicant argues that “Zuffi teaches randomly adding noise to the brightness, hue, and saturation levels and randomly adding lighting. See Zuffi at Section 3.2. Zuffi is clear that images in the training set are a mix of real and synthetic elements and are therefore not technically equivalent to the claimed “images of the object captured in a natural environment.” While Zuffi does disclose that the generated images have differing camera positions, the random nature of the generation fails to teach or suggest that the images generated for the training set are included in a video. See Zuffi at Section 3.2: For each zebra model we generated random images that differ in background, shape, pose, camera, and appearance. Zuffi fails to teach training the neural network using a “video including images of the object captured in a natural environment,” as now required by amended claims”.
Examiner notes that the amended claims recite: “a video including images of the object captured in a natural environment from associated camera poses, wherein the object is a real animal,” and that Zuffi discloses: “We automatically extract 3D textured models of zebras from in-the-wild images” (emphasis added). Examiner asserts that the synthetic elements discussed by Applicant are extracted from and/or added to these in-the-wild images of animals. In other words, the in-the-wild images, or “images of the object captured in a natural environment” as in the claim, are captured as such, and subsequent processing may then be applied to said images. Thus, Examiner maintains that Zuffi teaches these limitations as in amended independent claims 4, 10, and 11.
Applicant further argues that “Kato teaches identifying a best estimated shape/pose for each input image for each training iteration. Kato fails to teach or suggest differences between the best estimated shapes/poses. Kato also fails to teach or suggest differences between estimated shapes/poses. Instead, Kato is clear that the loss function is based on differences between an estimated shape/pose and the best shape/pose for each image. The claim limitations also require that the differences between the identity shapes are rotated. Kato fails to contemplate rotating any differences”.
Examiner points to the cited figures and sections of Kato, which, for example, disclose “error between estimated shapes/poses and the best shapes/poses that are recorded during training” (Section 3.2.3). This error reads on the claimed evaluating, and the poses read on the claimed rotations. Thus, Examiner maintains that Kato reads on this limitation.
With respect to claim 7, Applicant argues that: “Examiner asserts that Zuffi discloses images that are unlabeled. Applicants disagree with the Examiner’s assertion and contend that, in fact, Zuffi is clear that the generated images used for training are annotated with 2D keypoint labels”. Examiner notes that nowhere in Zuffi is a keypoint label mentioned and that the only mention of a label is hand labeling (i.e. after the processing). Thus, Examiner maintains that Zuffi teaches this limitation.
With respect to claim 10, Applicant argues that: “Kato is clear that a best pose, texture, and shape is recorded for each individual image. Kato fails to contemplate swapping any of the best pose, texture, and shape recorded for one image when reconstructing another image. Thus, Kato fails to teach or suggest any “swapping.” Zuffi fails to cure these deficiencies of Kato”. Examiner notes that no such “swapping” is claimed. Examiner thus maintains that the Zuffi reference indeed reads on these limitations.
Applicant makes the same argument for claim 11 as for claim 10, above, and Examiner maintains the rejection for the same reason.

Allowable Subject Matter
Claims 12, 21, and 22 are allowable over the prior art of record since the cited references taken individually or in combination fail to particularly disclose or suggest a method, system or computer-readable media, further comprising: applying a part pattern to the object in a middle image that is centered within a number of the images and propagating the part pattern from the middle image to images before and after the center image in the video to produce propagated part maps, as presented in the environment of the remaining limitations of claim 12 (and substantially similar limitations in each of independent claims 21 and 22).  It is noted that the closest prior art, Zuffi et al. (NPL: Three-D Safari: Learning to Estimate Zebra Pose, Shape, and Texture from Images “In the Wild”), hereinafter Zuffi, shows receiving, by a neural network model, a video including images of the object captured from associated camera poses; predicting, by the neural network model, a 3D shape representation of the object for a first image of the images based on a set of learned shape bases; predicting, by the neural network model, a texture flow for the first image; mapping pixels from the first image to a texture space according to the texture flow to produce a texture image, wherein transfer of the texture image onto the 3D shape representation constructs a 3D object corresponding to the object in the first image; mapping the propagated part maps into the texture space according to corresponding texture flows predicted for the number of images to produce part maps in the texture space.  However, Zuffi fails to disclose or suggest applying a part pattern to the object in a 
Claims 13, 15-20, 25, and 26 each depend from one of the above independent claims and are accordingly allowable.
	
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 4, 5, 7, and 9-11 are rejected under 35 U.S.C. 103 as being unpatentable over Zuffi et al. (NPL: Three-D Safari: Learning to Estimate Zebra Pose, Shape, and Texture from Images “In the Wild”), hereinafter Zuffi, in view of Kato et al. (NPL: Self-supervised Learning of 3D Objects from Natural Images), hereinafter Kato.
Regarding claim 4, Zuffi discloses a computer-implemented method of training a neural network model to construct a three-dimensional (3D) representation of an object, comprising: receiving, by the neural network model, a video including images of the object captured in a natural environment from associated camera poses, wherein the object is a real animal (Fig. 1; Abstract: perform automatic 3D pose, shape and texture capture of animals from images acquired in-the-wild…integrate the recent SMAL animal model into a network-based regression pipeline, which we train end-to-end on synthetically generated images with pose, shape, and background variation; Section 1: Using a fully generative model of animal shape, appearance, and neural rendering, we use a photometric loss to train a neural network that predicts the 3D pose, shape, and texture map of an animal from a single image. A key novelty of the network is that it links the texture prediction to 3D pose and shape through a shared feature space, such that, in predicting the texture map, the network estimates model parameters for an optimal mapping between image pixels and texture map elements; Section 2: Tung et al. [28] exploit temporal consistency over video frames to train an end-to-end prediction model on images without ground truth 3D pose and shape); predicting, by the neural network model, a 3D shape representation of the object for a first image of the images based on a set of learned shape bases (Fig. 
4; Section 3: estimating the 3D pose and shape of zebras from a single image as a model-based regression problem, where we train a neural network to predict 3D pose, shape and texture for the SMAL model; Section 3.1: let β be a row vector of shape variables, then vertices of a subject-specific shape in the reference T-pose are computed…this work we focus on an animal in the Equine family, thus the template vhorse is the shape that corresponds to the mean horse in the original SMAL model; Section 3.2: We increase the variability of the zebra shapes by adding noise to the shape variables (we use 20 shape variables). In addition to size variations due to depth, we add size variation by adding noise to the reference camera); predicting, by the neural network model, a texture flow for the first image (Section 3.2: For each image we also compute the texture uv-flow that represents the mapping between image pixels and textels, and can be interpreted as the flow between the image and the texture map; Section 3.4: texture prediction module is inspired by the work of Kanazawa et al. [13]. While [13] explores texture regression on a simple texture map that corresponds to a sphere, quadrupeds, like zebras, have a more complicated surface and texture map layout. We therefore predict the texture map as a collection of 4 sub-images that we then stitch together. 
We found this to work better than directly predicting the full texture map, probably because, given the complexity of the articulated model, the network has difficulty with the spatial discontinuities in the texture map); mapping pixels from the first image to a texture space according to the texture flow to produce a texture image, wherein transfer of the texture image onto the 3D shape representation constructs a 3D object corresponding to the object in the first image (Section 1: Using a fully generative model of animal shape, appearance, and neural rendering, we use a photometric loss to train a neural network that predicts the 3D pose, shape, and texture map of an animal from a single image. A key novelty of the network is that it links the texture prediction to 3D pose and shape through a shared feature space, such that, in predicting the texture map, the network estimates model parameters for an optimal mapping between image pixels and texture map elements. In order to prevent the network from just learning average texture map colors, inspired by [13], we predict the flow between the image pixels and the texture map; Section 3.2: For each image we also compute the texture uv-flow that represents the mapping between image pixels and textels, and can be interpreted as the flow between the image and the texture map; Section 3.4: decoders are composed of a set of convolutional layers and a final tanh module. The output of the decoders is stitched to create a full uv-flow map, that encodes which image pixels correspond to pixels in the texture map).
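By way of illustration only, the "mapping pixels from the first image to a texture space according to the texture flow" step discussed above can be sketched as a per-texel lookup into the input image. This is a minimal sketch and not code from Zuffi or from the application: the function name `sample_texture`, the array layout, the [-1, 1] coordinate convention (chosen here to match a tanh-bounded output), and the nearest-neighbor lookup are all assumptions made for the example.

```python
import numpy as np

def sample_texture(image, uv_flow):
    """Build a texture image by copying, for each texel, the image pixel the flow points to.

    image:   (H, W, 3) array of pixel colors.
    uv_flow: (Ht, Wt, 2) array with values in [-1, 1]; uv_flow[i, j] = (x, y) is the
             image location assigned to texel (i, j), where -1 is the left/top edge
             and +1 is the right/bottom edge (an assumed convention).
    Nearest-neighbor lookup is used for brevity; bilinear sampling would be an
    equally plausible choice.
    """
    image = np.asarray(image)
    uv_flow = np.asarray(uv_flow, dtype=float)
    h, w = image.shape[:2]
    # Convert normalized coordinates to integer pixel indices.
    xs = np.rint((uv_flow[..., 0] + 1.0) * 0.5 * (w - 1)).astype(int)
    ys = np.rint((uv_flow[..., 1] + 1.0) * 0.5 * (h - 1)).astype(int)
    # Guard against coordinates that fall slightly outside the image.
    xs = np.clip(xs, 0, w - 1)
    ys = np.clip(ys, 0, h - 1)
    # Integer-array indexing gathers one image pixel per texel.
    return image[ys, xs]
```

Under these assumptions the flow is the only learned quantity; the texture colors themselves are copied from the image rather than regressed, which is consistent with the cited statement that predicting the flow prevents the network from just learning average texture map colors.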
	Zuffi does not explicitly disclose predicting non-rigid motion deformations for the images; applying the non-rigid motion deformations to identity shapes predicted for the images to produce 3D shape representations of the object; computing first differences between the 3D shape representations; evaluating a loss function based on rotated differences between the identity shapes and the first differences; and updating parameters of the neural network model based on the loss function to reduce discontinuities in the 3D shape representations.
	However, Kato teaches using neural networks to create 3D objects from images (Abstract; Section 1), further comprising predicting non-rigid motion deformations for the images (Fig. 1; Fig. 3; Section 3: we propose a two-stage training method that focuses on shapes first. Fig. 3 illustrates the overview of our proposed approach. In the first step, a category-specific 3D base shape is generated by maximizing the similarity between images in a dataset and images of the shape. We use randomly sampled viewpoints and strongly limited textures. In the second step, the whole model is trained limiting generated shapes to deformations of the obtained base shape; Section 3.2.1: Instead of predicting a mesh directly, we predict shape deformations using free-form deformation… use a 4 spatial grid of 4 × 4 × 4 vertices, and regress the difference between the original grid and a deformed grid using a neural network. In addition, we use another network to regress the relative height, width, and length of shapes. After deformation, the size of the predicted shape is scaled to fit a unit cube); applying the non-rigid motion deformations to identity shapes predicted for the images to produce 3D shape representations of the object (Section 3.2.1: variation between generated shapes tends to be very small because exploring various shapes using only a differentiable renderer and gradient descent is difficult due to local minima. To overcome this problem, we explore and record the best shape for each input image at each training iteration. Specifically, we render images using an estimated shape, a recorded best shape, a slightly perturbed the best shape, and random shapes. Then, we compute reconstruction loss to find the best one and record it; Section 3.2.2: parameterize the 6DoF object/camera pose by azimuth and elevation as with Section 3.1.3, in-plane rotation of an object, center point of an object in 2D image coordinates, and scale of an object. 
We train a decoder that outputs these six parameters. We adopt multiple regressor approach used in [14]. Exploring best pose Similarly to shape prediction, we also need to actively explore the best poses. At each training iteration, we explore and record the best pose for each input image by rendering images using estimated, recorded, random, and perturbed poses); computing first differences between the 3D shape representations (Fig. 1; Section 3: method trains single-view reconstruction of 3D shape, pose, texture, and background with self-supervision as shown in Fig. 1 while avoiding unrealistic solutions like those shown in Fig. 2. One difficulty is training all elements at the same time because neural networks easily fall in the easiest solution of copying an input image into pixel arrays (textures or backgrounds). Therefore, we propose a two-stage training method that focuses on shapes first. Fig. 3 illustrates the overview of our proposed approach; Section 3.2.3: Mean absolute error between estimated shapes/poses and the best shapes/poses that are recorded during training. (3) Total variation of estimated texture images for denoising); evaluating a loss function based on rotated differences between the identity shapes and the first differences (Fig. 4; Section 3.1: we propose a model that generates a shape, texture, and background from random noise by minimizing the difference between the set of rendered images and the set of images in a dataset. In the following sections, we explain each component along with the additional constraints and regularization needed to obtain a meaningful shape; Section 3.2.2: parameterize the 6DoF object/camera pose by azimuth and elevation as with Section 3.1.3, in-plane rotation of an object, center point of an object in 2D image coordinates, and scale of an object. We train a decoder that outputs these six parameters. We adopt multiple regressor approach used in [14]. 
Exploring best pose Similarly to shape prediction, we also need to actively explore the best poses. At each training iteration, we explore and record the best pose for each input image by rendering images using estimated, recorded, random, and perturbed poses; Section 3.2.3: addition to the components described above, we employ view prior learning (VPL) [17] to reduce overfitting to the observed views. Summarily, a loss function is composed of the following four terms. (1) Reconstruction loss. Reconstructed images using the best shapes, estimated textures, the best poses, and estimated backgrounds are compared with input images. In addition, feature matching is also used. (2) Mean absolute error between estimated shapes/poses and the best shapes/poses that are recorded during training. (3) Total variation of estimated texture images for denoising. (4) VPL loss. To facilitate an early phase of training, at the i-th iteration, training samples are randomly selected from first to i-th data in the dataset. This makes the model see the same sample frequently in an early stage, which simplifies finding the best poses and makes the estimated poses diverse); and updating parameters of the neural network model based on the loss function to reduce discontinuities in the 3D shape representations (Section 3.2.1: variation between generated shapes tends to be very small because exploring various shapes using only a differentiable renderer and gradient descent is difficult due to local minima. To overcome this problem, we explore and record the best shape for each input image at each training iteration; Section 3.2.3: a loss function is composed of the following four terms. (1) Reconstruction loss. Reconstructed images using the best shapes, estimated textures, the best poses, and estimated backgrounds are compared with input images. In addition, feature matching is also used. (2) Mean absolute error between estimated shapes/poses and the best shapes/poses that are recorded during training. (3) Total variation of estimated texture images for denoising. (4) VPL loss. 
To facilitate an early phase of training, at the i-th iteration, training samples are randomly selected from first to i-th data in the dataset. This makes the model see the same sample frequently in an early stage, which simplifies finding the best poses and makes the estimated poses diverse; Section A.1.5: the number of training iterations of the base shape learning is set to 1200, 300, 200, 800, and 200 for CIFAR-10 car, CIFAR-10 horse, PASCAL aeroplane, PASCAL car, and PASCAL chair respectively. In full model training, the number of iterations is set to 10000 in all categories). Kato teaches that this will allow for reconstruction of the 3D shape, pose, and texture of objects from categorized natural images in a self-supervised manner (Abstract) and is in the same field of endeavor as Zuffi of using neural networks to create 3D objects from images (Abstract). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zuffi with the features of predicting non-rigid motion deformations for the images; applying the non-rigid motion deformations to identity shapes predicted for the images to produce 3D shape representations of the object; computing first differences between the 3D shape representations; evaluating a loss function based on rotated differences between the identity shapes and the first differences; and updating parameters of the neural network model based on the loss function to reduce discontinuities in 
Regarding claim 5, Zuffi, in view of Kato teaches the computer-implemented method of claim 4, Kato discloses wherein each identity shape of the identity shapes is computed as a sum of component shapes included in the set of learned shape bases and each component shape is corresponding scaled by a coefficient generated by the neural network model (Section 3.1.7: to obtain a category-specific base shape, a shape generator, a texture generator, and a background generator are trained by minimizing the sum of reconstruction loss Lrec and smoothing loss Ls under the constraints of shape symmetricity, and texture and background simplicity. Though the input is random noise, generated shapes converge to a single shape after training).
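By way of illustration only, the claim 5 limitation that each identity shape "is computed as a sum of component shapes" each "scaled by a coefficient" can be read as a linear combination over a shape basis. The following is a minimal sketch of that reading and is not taken from Zuffi, Kato, or the application: the name `identity_shape`, the (K, V, 3) vertex layout, and the absence of a mean-shape offset are assumptions made for the example.

```python
import numpy as np

def identity_shape(shape_bases, coefficients):
    """Combine K component shapes into one identity shape.

    shape_bases:  (K, V, 3) array, one set of V vertex positions per component shape.
    coefficients: (K,) array, one scalar weight per component shape (in the claim,
                  these are generated by the neural network model).
    Returns a (V, 3) array equal to sum_k coefficients[k] * shape_bases[k].
    """
    shape_bases = np.asarray(shape_bases, dtype=float)
    coefficients = np.asarray(coefficients, dtype=float)
    # Contract the coefficient axis against the first axis of the basis stack.
    return np.tensordot(coefficients, shape_bases, axes=1)
```

For example, with two component shapes whose vertices are all ones and all twos, coefficients of 0.5 and 0.25 give an identity shape whose vertices are all 1.0.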
Regarding claim 7, Zuffi, in view of Kato teaches the computer-implemented method of claim 4, Zuffi discloses wherein a portion of the images in the video are unlabeled (Abstract: integrate the recent SMAL animal model into a network-based regression pipeline, which we train end-to-end on synthetically generated images with pose, shape, and background variation; Section 1: We scan and model the body, label images by hand, and build motion capture systems of all kinds. This level of investment is not possible for every animal species…Using a fully generative model of animal shape, appearance, and neural rendering, we use a photometric loss to train a neural network that predicts the 3D pose, shape, and texture map of an animal from a single image. A key novelty of the network is that it links the texture prediction to 3D pose and shape through a shared feature space, such that, in predicting the texture map, the network estimates model parameters for an optimal mapping between image pixels and texture map elements; Section 2: Tung et al. [28] exploit temporal consistency over video frames to train an end-to-end prediction model on images without ground truth 3D pose and shape
Regarding claim 9, Zuffi, in view of Kato teaches the computer-implemented method of claim 4, Zuffi discloses further comprising: projecting the 3D object according to the camera pose to produce a rendered image (Fig. 4; Section 3.2: For each zebra model we generated random images that differ in background, shape, pose, camera, and appearance. Models are rendered with OpenDR…For each zebra model we generated 1000 images with different poses obtained by sampling a multivariate Gaussian distribution over the 3D Rodrigues vectors that describe pose. The sampling distribution is learned from the 57 poses obtained with SMALR and a synthetic walking sequence. We also add, for each zebra model, about 285 images obtained by adding noise to the 57 poses…Changing depth varies the size of the animal as we use perspective projection; Section 3.4: we define ground truth vertex displacements for network training…use the Neural Mesh Renderer (NMR) [14] for rendering the model and perspective projection); and updating parameters of the neural network model to reduce differences between the rendered image and the first image (Section 3.5: train the network to minimize the loss…where: Sgt is the mask, Lmask is the mask loss, defined as the L1 loss between Sgt and the predicted mas…the 2D keypoint loss, defined as the MSE loss between…and the projected 3D keypoints defined on the model vertices. Lcam is the camera loss, defined as the MSE loss between fgt and predicted focal length. Limg is the image loss, computed as the perceptual distance [32] between the masked input image and rendered zebra. Lpose is the MSE loss between θgt and predicted 3D poses, computed as geodesic distance [19]. Ltrans is the translation loss, defined as the MSE between γgt and predicted translation. Lshape is the shape loss, defined as the MSE between dvgt and predicted dv).
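By way of illustration only, the Section 3.5 loss quoted above can be sketched as a weighted sum of per-quantity error terms. This is a minimal sketch under stated assumptions and not Zuffi's implementation: the dictionary keys, the `weights` argument, and the use of mean rather than summed errors are hypothetical, and the perceptual image loss and geodesic pose loss are omitted because they depend on components not described in this action.

```python
import numpy as np

def l1(a, b):
    """Mean absolute error between two arrays of the same shape."""
    return float(np.mean(np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))))

def mse(a, b):
    """Mean squared error between two arrays of the same shape."""
    return float(np.mean((np.asarray(a, dtype=float) - np.asarray(b, dtype=float)) ** 2))

def total_loss(pred, gt, weights):
    """Weighted sum of the mask, keypoint, camera, translation, and shape terms.

    pred, gt: dicts with (hypothetical) keys "mask", "keypoints", "focal",
              "translation", and "dv" (vertex displacements).
    weights:  dict with one scalar per term name; the values are assumptions.
    Returns (total, terms) so each term can be inspected separately.
    """
    terms = {
        "mask": l1(pred["mask"], gt["mask"]),                        # L1 on silhouettes
        "keypoints": mse(pred["keypoints"], gt["keypoints"]),        # 2D keypoint MSE
        "camera": mse(pred["focal"], gt["focal"]),                   # focal length MSE
        "translation": mse(pred["translation"], gt["translation"]),  # translation MSE
        "shape": mse(pred["dv"], gt["dv"]),                          # displacement MSE
    }
    total = sum(weights[name] * value for name, value in terms.items())
    return total, terms
```

In a training loop, the parameters would be updated to reduce `total`; the sketch only evaluates the loss and does not perform the update.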
Regarding claim 10, Zuffi discloses a computer-implemented method of training a neural network model to construct a three-dimensional (3D) representation of an object, comprising: receiving, by the neural network model, a video including images of the object captured in a natural environment from associated camera poses, wherein the object is a real animal (Fig. 1; Abstract: perform automatic 3D pose, shape and texture capture of animals from images acquired in-the-wild…integrate the recent SMAL animal model into a network-based regression pipeline, which we train end-to-end on synthetically generated images with pose, shape, and background variation; Section 1: Using a fully generative model of animal shape, appearance, and neural rendering, we use a photometric loss to train a neural network that predicts the 3D pose, shape, and texture map of an animal from a single image. A key novelty of the network is that it links the texture prediction to 3D pose and shape through a shared feature space, such that, in predicting the texture map, the network estimates model parameters for an optimal mapping between image pixels and texture map elements; Section 2: Tung et al. [28] exploit temporal consistency over video frames to train an end-to-end prediction model on images without ground truth 3D pose and shape); predicting, by the neural network model, a 3D shape representation of the object for a first image of the images based on a set of learned shape bases (Fig. 
4; Section 3: estimating the 3D pose and shape of zebras from a single image as a model-based regression problem, where we train a neural network to predict 3D pose, shape and texture for the SMAL model; Section 3.1: let β be a row vector of shape variables, then vertices of a subject-specific shape in the reference T-pose are computed…this work we focus on an animal in the Equine family, thus the template vhorse is the shape that corresponds to the mean horse in the original SMAL model; Section 3.2: We increase the variability of the zebra shapes by adding noise to the shape variables (we use 20 shape variables). In addition to size variations due to depth, we add size variation by adding noise to the reference camera); predicting, by the neural network model, a texture flow for the first image (Section 3.2: For each image we also compute the texture uv-flow that represents the mapping between image pixels and textels, and can be interpreted as the flow between the image and the texture map; Section 3.4: texture prediction module is inspired by the work of Kanazawa et al. [13]. While [13] explores texture regression on a simple texture map that corresponds to a sphere, quadrupeds, like zebras, have a more complicated surface and texture map layout. We therefore predict the texture map as a collection of 4 sub-images that we then stitch together. 
We found this to work better than directly predicting the full texture map, probably because, given the complexity of the articulated model, the network has difficulty with the spatial discontinuities in the texture map); mapping pixels from the first image to a texture space according to the texture flow to produce a texture image, wherein transfer of the texture image onto the 3D shape representation constructs a 3D object corresponding to the object in the first image (Section 1: Using a fully generative model of animal shape, appearance, and neural rendering, we use a photometric loss to train a neural network that predicts the 3D pose, shape, and texture map of an animal from a single image. A key novelty of the network is that it links the texture prediction to 3D pose and shape through a shared feature space, such that, in predicting the texture map, the network estimates model parameters for an optimal mapping between image pixels and texture map elements. In order to prevent the network from just learning average texture map colors, inspired by [13], we predict the flow between the image pixels and the texture map; Section 3.2: For each image we also compute the texture uv-flow that represents the mapping between image pixels and textels, and can be interpreted as the flow between the image and the texture map; Section 3.4: decoders are composed of a set of convolutional layers and a final tanh module. The output of the decoders is stitched to create a full uv-flow map, that encodes which image pixels correspond to pixels in the texture map); transferring the texture image predicted for the first image to a second 3D shape representation predicted for a second image of the images to produce a first 3D object (Fig. 4; Section 3.2: For each image we also compute the texture uv-flow that represents the mapping between image pixels and textels, and can be interpreted as the flow between the image and the texture map. 
For each image we save the following annotation data: texture map Tgt, texture uv-flow uvgt, silhouette Sgt, pose θgt, global translation γgt, shape variables βgt, vertex displacements …use a total of 28 surface landmarks, placed at the joints, on the face, ears and tail tip. These are defined only once on the 3D model template); projecting the first 3D object according to a first camera pose associated with the first image to produce a first projected 3D object (Fig. 4; Section 3.2: For each zebra model we generated random images that differ in background, shape, pose, camera, and appearance. Models are rendered with OpenDR…For each zebra model we generated 1000 images with different poses obtained by sampling a multivariate Gaussian distribution over the 3D Rodrigues vectors that describe pose. The sampling distribution is learned from the 57 poses obtained with SMALR and a synthetic walking sequence. We also add, for each zebra model, about 285 images obtained by adding noise to the 57 poses…Changing depth varies the size of the animal as we use perspective projection; Section 3.4: we define ground truth vertex displacements for network training…use the Neural Mesh Renderer (NMR) [14] for rendering the model and perspective projection).
	Zuffi does not explicitly disclose transferring a second texture image predicted for the second image to the 3D shape representation predicted for the first image to produce a second 3D object; projecting the second 3D object according to a second camera pose associated with the second image to produce a second projected 3D object; and updating parameters of the neural network model to encourage consistency between the first projected 3D object and the second projected 3D object.
	However, Kato teaches using neural networks to create 3D objects from images (Abstract; Section 1), further comprising transferring a second texture image predicted for the second image to the 3D shape representation predicted for the first image to produce a second 3D object (Fig. 2; Fig. 4; Section 1: We train this model by comparing input images with reconstructed images. Given an image, 3D shape, pose, texture image, and background image are estimated by neural networks. Then, an image is rendered using these estimated elements; Section 3.2.3: we employ view prior learning (VPL) [17] to reduce overfitting to the observed views. Summarily, a loss function is composed of the following four terms. (1) Reconstruction loss. Reconstructed images using the best shapes, estimated textures, the best poses, and estimated backgrounds are compared with input images. In addition, feature matching is also used. (2) Mean absolute error between estimated shapes/poses and the best shapes/poses that are recorded during training. (3) Total variation of estimated texture images for denoising. (4) VPL loss. To facilitate an early phase of training, at the i-th iteration, training samples are randomly selected from first to i-th data in the dataset. This makes the model see the same sample frequently in an early stage, which simplifies finding the best poses and makes the estimated poses diverse); projecting the second 3D object according to a second camera pose associated with the second image to produce a second projected 3D object (Fig. 1; Fig. 6; Section 4.1.2: we evaluate the second step using the base shapes obtained in the previous step. Fig. 6 shows representative results on the test set. Reconstructed images demonstrate that the estimators trained by our method are able to reconstruct images that look similar to input images (a–b). Estimated shapes, poses, and backgrounds can be further improved by simple gradient descent and photometric reconstruction loss (c). 
Rendered images from other viewpoints show that these objects have correct 3D shapes, which are slightly different among different input images; Section 4.1.2: Secondly, we evaluate the second step using the base shapes obtained in the previous step. Fig. 6 shows representative results on the test set. Reconstructed images demonstrate that the estimators trained by our method are able to reconstruct images that look similar to input images (a–b). Estimated shapes, poses, and backgrounds can be further improved by simple gradient descent and photometric reconstruction loss (c). Rendered images from other viewpoints show that these objects have correct 3D shapes, which are slightly different among different input images); and updating parameters of the neural network model to encourage consistency between the first projected 3D object and the second projected 3D object (Fig. 4; Section 3.2.1: variation between generated shapes tends to be very small because exploring various shapes using only a differentiable renderer and gradient descent is difficult due to local minima. To overcome this problem, we explore and record the best shape for each input image at each training iteration; Section 3.2.3: a loss function is composed of the following four terms. (1) Reconstruction loss. Reconstructed images using the best shapes, estimated textures, the best poses, and estimated backgrounds are compared with input images. In addition, feature matching is also used. (2) Mean absolute error between estimated shapes/poses and the best shapes/poses that are recorded during training. (3) Total variation of estimated texture images for denoising. (4) VPL loss. To facilitate an early phase of training, at the i-th iteration, training samples are randomly selected from first to i-th data in the dataset. 
This makes the model see the same sample frequently in an early stage, which simplifies finding the best poses and makes the estimated poses diverse; Section A.1.5: the number of training iterations of the base shape learning is set to 1200, 300, 200, 800, and 200 for CIFAR-10 car, CIFAR-10 horse, PASCAL aeroplane, PASCAL car, and PASCAL chair respectively. In full model training, the number of iterations is set to 10000 in all categories; Section 3.1.7: to obtain a category-specific base shape, a shape generator, a texture generator, and a background generator are trained by minimizing the sum of reconstruction loss Lrec and smoothing loss Ls under the constraints of shape symmetricity, and texture and background simplicity. Though the input is random noise, generated shapes converge to a single shape after training, which is similar to mode collapse in GANs).
Regarding claim 11, Zuffi discloses a computer-implemented method of training a neural network model to construct a three-dimensional (3D) representation of an object, comprising: receiving, by the neural network model, a video including images of the object captured in a natural environment from associated camera poses, wherein the object is a real animal (Fig. 1; Abstract: perform automatic 3D pose, shape and texture capture of animals from images acquired in-the-wild…integrate the recent SMAL animal model into a network-based regression pipeline, which we train end-to-end on synthetically generated images with pose, shape, and background variation; Section 1: Using a fully generative model of animal shape, appearance, and neural rendering, we use a photometric loss to train a neural network that predicts the 3D pose, shape, and texture map of an animal from a single image. A key novelty of the network is that it links the texture prediction to 3D pose and shape through a shared feature space, such that, in predicting the texture map, the network estimates model parameters for an optimal mapping between image pixels and texture map elements; Section 2: Tung et al. [28] exploit temporal consistency over video frames to train an end-to-end prediction model on images without ground truth 3D pose and shape); bases (Fig. 4; Section 3: estimating the 3D pose and shape of zebras from a single image as a model-based regression problem, where we train a neural network to predict 3D pose, shape and texture for the SMAL model; Section 3.1: let β be a row vector of shape variables, then vertices of a subject-specific shape in the reference T-pose are computed…this work we focus on an animal in the Equine family, thus the template vhorse is the shape that corresponds to the mean horse in the original SMAL model; Section 3.2: We increase the variability of the zebra shapes by adding noise to the shape variables (we use 20 shape variables). 
In addition to size variations due to depth, we add size variation by adding noise to the reference camera); predicting, by the neural network model, a texture flow for the first image (Section 3.2: For each image we also compute the texture uv-flow that represents the mapping between image pixels and textels, and can be interpreted as the flow between the image and the texture map; Section 3.4: texture prediction module is inspired by the work of Kanazawa et al. [13]. While [13] explores texture regression on a simple texture map that corresponds to a sphere, quadrupeds, like zebras, have a more complicated surface and texture map layout. We therefore predict the texture map as a collection of 4 sub-images that we then stitch together. We found this to work better than directly predicting the full texture map, probably because, given the complexity of the articulated model, the network has difficulty with the spatial discontinuities in the texture map); mapping pixels from the first image to a texture space according to the texture flow to produce a texture image, wherein transfer of the texture image onto the 3D shape representation constructs a 3D object corresponding to the object in the first image (Section 1: Using a fully generative model of animal shape, appearance, and neural rendering, we use a photometric loss to train a neural network that predicts the 3D pose, shape, and texture map of an animal from a single image. A key novelty of the network is that it links the texture prediction to 3D pose and shape through a shared feature space, such that, in predicting the texture map, the network estimates model parameters for an optimal mapping between image pixels and texture map elements. 
In order to prevent the network from just learning average texture map colors, inspired by [13], we predict the flow between the image pixels and the texture map; Section 3.2: For each image we also compute the texture uv-flow that represents the mapping between image pixels and textels, and can be interpreted as the flow between the image and the texture map; Section 3.4: decoders are composed of a set of convolutional layers and a final tanh module. The output of the decoders is stitched to create a full uv-flow map, that encodes which image pixels correspond to pixels in the texture map); projecting the first 3D shape representation according to a first camera pose associated with the first image to produce a first projected 3D object (Fig. 4; Section 3.2: For each zebra model we generated random images that differ in background, shape, pose, camera, and appearance. Models are rendered with OpenDR…For each zebra model we generated 1000 images with different poses obtained by sampling a multivariate Gaussian distribution over the 3D Rodrigues vectors that describe pose. The sampling distribution is learned from the 57 poses obtained with SMALR and a synthetic walking sequence. We also add, for each zebra model, about 285 images obtained by adding noise to the 57 poses…Changing depth varies the size of the animal as we use perspective projection; Section 3.4: we define ground truth vertex displacements for network training…use the Neural Mesh Renderer (NMR) [14] for rendering the model and perspective projection).
	Zuffi does not explicitly disclose applying first non-rigid motion deformations predicted for the first image to a first identity shape predicted for a second image of the images to produce a first 3D shape representation; applying second non-rigid motion deformations predicted for the second image to a second identity shape predicted for the first image to produce a second 3D shape representation; projecting the second 3D shape representation according to a second camera pose associated with the second image to produce a second projected 3D object; and updating parameters of the neural network model to encourage consistency between the first projected 3D object and the second projected 3D object.
	However, Kato teaches using neural networks to create 3D objects from images (Abstract; Section 1), further comprising applying first non-rigid motion deformations predicted for the first image to a first identity shape predicted for a second image of the images to produce a first 3D shape representation (Section 3.2.1: variation between generated shapes tends to be very small because exploring various shapes using only a differentiable renderer and gradient descent is difficult due to local minima. To overcome this problem, we explore and record the best shape for each input image at each training iteration. Specifically, we render images using an estimated shape, a recorded best shape, a slightly perturbed the best shape, and random shapes. Then, we compute reconstruction loss to find the best one and record it; Section 3.2.2: parameterize the 6DoF object/camera pose by azimuth and elevation as with Section 3.1.3, in-plane rotation of an object, center point of an object in 2D image coordinates, and scale of an object. We train a decoder that outputs these six parameters. We adopt multiple regressor approach used in [14]. Exploring best pose Similarly to shape prediction, we also need to actively explore the best poses. At each training iteration, we explore and record the best pose for each input image by rendering images using estimated, recorded, random, and perturbed poses); applying second non-rigid motion deformations predicted for the second image to a second identity shape predicted for the first image to produce a second 3D shape representation (Fig. 2; Fig. 4; Section 1: We train this model by comparing input images with reconstructed images. Given an image, 3D shape, pose, texture image, and background image are estimated by neural networks. Then, an image is rendered using these estimated elements; Section 3.2.3: we employ view prior learning (VPL) [17] to reduce overfitting to the observed views. 
Summarily, a loss function is composed of the following four terms. (1) Reconstruction loss. Reconstructed images using the best shapes, estimated textures, the best poses, and estimated backgrounds are compared with input images. In addition, feature matching is also used. (2) Mean absolute error between estimated shapes/poses and the best shapes/poses that are recorded during training. (3) Total variation of estimated texture images for denoising. (4) VPL loss. To facilitate an early phase of training, at the i-th iteration, training samples are randomly selected from first to i-th data in the dataset. This makes the model see the same sample frequently in an early stage, which simplifies finding the best poses and makes the estimated poses diverse); projecting the second 3D shape representation according to a second camera pose associated with the second image to produce a second projected 3D object (Fig. 1; Fig. 6; Section 4.1.2: we evaluate the second step using the base shapes obtained in the previous step. Fig. 6 shows representative results on the test set. Reconstructed images demonstrate that the estimators trained by our method are able to reconstruct images that look similar to input images (a–b). Estimated shapes, poses, and backgrounds can be further improved by simple gradient descent and photometric reconstruction loss (c). Rendered images from other viewpoints show that these objects have correct 3D shapes, which are slightly different among different input images; Section 4.1.2: Secondly, we evaluate the second step using the base shapes obtained in the previous step. Fig. 6 shows representative results on the test set. Reconstructed images demonstrate that the estimators trained by our method are able to reconstruct images that look similar to input images (a–b). Estimated shapes, poses, and backgrounds can be further improved by simple gradient descent and photometric reconstruction loss (c). 
Rendered images from other viewpoints show that these objects have correct 3D shapes, which are slightly different among different input images); and updating parameters of the neural network model to encourage consistency between the first projected 3D object and the second projected 3D object (Fig. 4; Section 3.2.1: variation between generated shapes tends to be very small because exploring various shapes using only a differentiable renderer and gradient descent is difficult due to local minima. To overcome this problem, we explore and record the best shape for each input image at each training iteration; Section 3.2.3: a loss function is composed of the following four terms. (1) Reconstruction loss. Reconstructed images using the best shapes, estimated textures, the best poses, and estimated backgrounds are compared with input images. In addition, feature matching is also used. (2) Mean absolute error between estimated shapes/poses and the best shapes/poses that are recorded during training. (3) Total variation of estimated texture images for denoising. (4) VPL loss. To facilitate an early phase of training, at the i-th iteration, training samples are randomly selected from first to i-th data in the dataset. This makes the model see the same sample frequently in an early stage, which simplifies finding the best poses and makes the estimated poses diverse; Section A.1.5: the number of training iterations of the base shape learning is set to 1200, 300, 200, 800, and 200 for CIFAR-10 car, CIFAR-10 horse, PASCAL aeroplane, PASCAL car, and PASCAL chair respectively. In full model training, the number of iterations is set to 10000 in all categories; Section 3.1.7: to obtain a category-specific base shape, a shape generator, a texture generator, and a background generator are trained by minimizing the sum of reconstruction loss Lrec and smoothing loss Ls under the constraints of shape symmetricity, and texture and background simplicity. 
Though the input is random noise, generated shapes converge to a single shape after training, which is similar to mode collapse in GANs). Kato teaches that this will allow for reconstruction of the 3D shape, pose, and texture of objects from categorized natural images in a self-supervised manner (Abstract) and is in the same field of endeavor as Zuffi of using neural networks to create 3D objects from images (Abstract). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zuffi with the features of applying first non-rigid motion deformations predicted for the first image to a first identity shape predicted for a second image of the images to produce a first 3D shape representation; applying second non-rigid motion deformations predicted for the second image to a second identity shape predicted for the first image to produce a second 3D shape representation; projecting the second 3D shape representation according to a second camera pose associated with the second image to produce a second projected 3D object; and updating parameters of the neural network model to encourage consistency between the first projected 3D object and the second projected 3D object as taught by Kato so as to allow for self-supervised reconstruction as presented by Kato.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MATTHEW D SALVUCCI whose telephone number is (571)270-5748. The examiner can normally be reached M-F: 7:30-4:00PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, XIAO WU can be reached on (571) 272-7761. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/MATTHEW SALVUCCI/
Primary Examiner, Art Unit 2613