DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 19 January 2022 has been entered.
 
Response to Arguments
Examiner notes that a large number of NPL documents were submitted, after the indication of allowable subject matter, in the IDS forms filed 31 December 2021. As a result, the earlier indication of allowability no longer stands in light of the new rejections based on these documents.

Allowable Subject Matter
Claim 14 is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Claim Rejections - 35 USC § 103
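In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.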
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 4-13, 15-22, 25, and 26 are rejected under 35 U.S.C. 103 as being unpatentable over Zuffi et al. (NPL: Three-D Safari: Learning to Estimate Zebra Pose, Shape, and Texture from Images “In the Wild”), hereinafter Zuffi, in view of Kato et al. (NPL: Self-supervised Learning of 3D Objects from Natural Images), hereinafter Kato.
Regarding claim 4, Zuffi discloses a computer-implemented method of constructing a three-dimensional (3D) representation of an object, comprising: receiving, by a neural network model, a video including images of the object captured from a camera pose (Abstract: integrate the recent SMAL animal model into a network-based regression pipeline, which we train end-to-end on synthetically generated images with pose, shape, and background variation; Section 1: Using a fully generative model of animal shape, appearance, and neural rendering, we use a photometric loss to train a neural network that predicts the 3D pose, shape, and texture map of an animal from a single image. A key novelty of the network is that it links the texture prediction to 3D pose and shape through a shared feature space, such that, in predicting the texture map, the network estimates model parameters for an optimal mapping between image pixels and texture map elements; Section 2: Tung et al. [28] exploit temporal consistency over video frames to train an end-to-end prediction model on images without ground truth 3D pose and shape); predicting, by the neural network model, a 3D shape representation of the object for a first image of the images based on a set of learned shape bases (Fig. 4; Section 3: estimating the 3D pose and shape of zebras from a single image as a model-based regression problem, where we train a neural network to predict 3D pose, shape and texture for the SMAL model; Section 3.1: let β be a row vector of shape variables, then vertices of a subject-specific shape in the reference T-pose are computed…this work we focus on an animal in the Equine family, thus the template vhorse is the shape that corresponds to the mean horse in the original SMAL model; Section 3.2: We increase the variability of the zebra shapes by adding noise to the shape variables (we use 20 shape variables). 
In addition to size variations due to depth, we add size variation by adding noise to the reference camera); predicting, by the neural network model, a texture flow for the first image (Section 3.2: For each image we also compute the texture uv-flow that represents the mapping between image pixels and textels, and can be interpreted as the flow between the image and the texture map; Section 3.4: texture prediction module is inspired by the work of Kanazawa et al. [13]. While [13] explores texture regression on a simple texture map that corresponds to a sphere, quadrupeds, like zebras, have a more complicated surface and texture map layout. We therefore predict the texture map as a collection of 4 sub-images that we then stitch together. We found this to work better than directly predicting the full texture map, probably because, given the complexity of the articulated model, the network has difficulty with the spatial discontinuities in the texture map); mapping pixels from the first image to a texture space according to the texture flow to produce a texture image, wherein transfer of the texture image onto the 3D shape representation constructs a 3D object corresponding to the object in the first image (Section 1: Using a fully generative model of animal shape, appearance, and neural rendering, we use a photometric loss to train a neural network that predicts the 3D pose, shape, and texture map of an animal from a single image. A key novelty of the network is that it links the texture prediction to 3D pose and shape through a shared feature space, such that, in predicting the texture map, the network estimates model parameters for an optimal mapping between image pixels and texture map elements. 
In order to prevent the network from just learning average texture map colors, inspired by [13], we predict the flow between the image pixels and the texture map; Section 3.2: For each image we also compute the texture uv-flow that represents the mapping between image pixels and textels, and can be interpreted as the flow between the image and the texture map; Section 3.4: decoders are composed of a set of convolutional layers and a final tanh module. The output of the decoders is stitched to create a full uv-flow map, that encodes which image pixels correspond to pixels in the texture map).
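For illustration only, the step of mapping pixels from the first image to a texture space according to a texture flow can be sketched as a per-texel lookup. This is a minimal sketch under stated assumptions and is not taken from Zuffi or from the application: the function name `sample_texture`, the array layouts, the normalization of the flow to [-1, 1] (chosen to match a tanh output), and the nearest-neighbour lookup are all hypothetical.

```python
import numpy as np

def sample_texture(image, uv_flow):
    """Build a texture image by copying, for each texel, the image pixel the flow points to.

    image:   (H, W, 3) array of pixel colors.
    uv_flow: (Th, Tw, 2) array in [-1, 1]; channel 0 is x, channel 1 is y
             (assumed convention, not taken from the cited reference).
    """
    h, w = image.shape[:2]
    # Convert normalized coordinates in [-1, 1] to integer pixel indices
    # (nearest neighbour; a trained system would more likely use bilinear sampling).
    xs = np.clip(np.rint((uv_flow[..., 0] + 1.0) * 0.5 * (w - 1)), 0, w - 1).astype(int)
    ys = np.clip(np.rint((uv_flow[..., 1] + 1.0) * 0.5 * (h - 1)), 0, h - 1).astype(int)
    # Fancy indexing gathers one image pixel per texel.
    return image[ys, xs]
```

The resulting (Th, Tw, 3) array plays the role of the texture image that is then transferred onto the 3D shape representation.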
	Zuffi does not explicitly disclose predicting non-rigid motion deformations for the images; applying the non-rigid motion deformations to identity shapes predicted for the images to produce 3D shape representations of the object; evaluating a loss function based on rotated differences between the identity shapes and differences between the 3D shape representations; and updating parameters of the neural network model based on the loss function to reduce discontinuities in the 3D shape representations.
	However, Kato teaches using neural networks to create 3D objects from images (Abstract; Section 1), further comprising predicting non-rigid motion deformations for the images (Fig. 3; Section 3: we propose a two-stage training method that focuses on shapes first. Fig. 3 illustrates the overview of our proposed approach. In the first step, a category-specific 3D base shape is generated by maximizing the similarity between images in a dataset and images of the shape. We use randomly sampled viewpoints and strongly limited textures. In the second step, the whole model is trained limiting generated shapes to deformations of the obtained base shape; Section 3.2.1: Instead of predicting a mesh directly, we predict shape deformations using free-form deformation… use a 4 spatial grid of 4 × 4 × 4 vertices, and regress the difference between the original grid and a deformed grid using a neural network. In addition, we use another network to regress the relative height, width, and length of shapes. After deformation, the size of the predicted shape is scaled to fit a unit cube); applying the non-rigid motion deformations to identity shapes predicted for the images to produce 3D shape representations of the object (Section 3.2.1: variation between generated shapes tends to be very small because exploring various shapes using only a differentiable renderer and gradient descent is difficult due to local minima. To overcome this problem, we explore and record the best shape for each input image at each training iteration. Specifically, we render images using an estimated shape, a recorded best shape, a slightly perturbed the best shape, and random shapes. Then, we compute reconstruction loss to find the best one and record it; Section 3.2.2: parameterize the 6DoF object/camera pose by azimuth and elevation as with Section 3.1.3, in-plane rotation of an object, center point of an object in 2D image coordinates, and scale of an object. 
We train a decoder that outputs these six parameters. We adopt multiple regressor approach used in [14]. Exploring best pose Similarly to shape prediction, we also need to actively explore the best poses. At each training iteration, we explore and record the best pose for each input image by rendering images using estimated, recorded, random, and perturbed poses); evaluating a loss function based on rotated differences between the identity shapes and differences between the 3D shape representations (Fig. 4; Section 3.1: we propose a model that generates a shape, texture, and background from random noise by minimizing the difference between the set of rendered images and the set of images in a dataset. In the following sections, we explain each component along with the additional constraints and regularization needed to obtain a meaningful shape; Section 3.2.2: parameterize the 6DoF object/camera pose by azimuth and elevation as with Section 3.1.3, in-plane rotation of an object, center point of an object in 2D image coordinates, and scale of an object. We train a decoder that outputs these six parameters. We adopt multiple regressor approach used in [14]. Exploring best pose Similarly to shape prediction, we also need to actively explore the best poses. At each training iteration, we explore and record the best pose for each input image by rendering images using estimated, recorded, random, and perturbed poses; Section 3.2.3: addition to the components described above, we employ view prior learning (VPL) [17] to reduce overfitting to the observed views. Summarily, a loss function is composed of the following four terms. (1) Reconstruction loss. Reconstructed images using the best shapes, estimated textures, the best poses, and estimated backgrounds are compared with input images. In addition, feature matching is also used. (2) Mean absolute error between estimated shapes/poses and the best shapes/poses that are recorded during training. 
(3) Total variation of estimated texture images for denoising. (4) VPL loss. To facilitate an early phase of training, at the i-th iteration, training samples are randomly selected from first to i-th data in the dataset. This makes the model see the same sample frequently in an early stage, which simplifies finding the best poses and makes the estimated poses diverse); and updating parameters of the neural network model based on the loss function to reduce discontinuities in the 3D shape representations (Section 3.2.1: variation between generated shapes tends to be very small because exploring various shapes using only a differentiable renderer and gradient descent is difficult due to local minima. To overcome this problem, we explore and record the best shape for each input image at each training iteration; Section 3.2.3: a loss function is composed of the following four terms. (1) Reconstruction loss. Reconstructed images using the best shapes, estimated textures, the best poses, and estimated backgrounds are compared with input images. In addition, feature matching is also used. (2) Mean absolute error between estimated shapes/poses and the best shapes/poses that are recorded during training. (3) Total variation of estimated texture images for denoising. (4) VPL loss. To facilitate an early phase of training, at the i-th iteration, training samples are randomly selected from first to i-th data in the dataset. This makes the model see the same sample frequently in an early stage, which simplifies finding the best poses and makes the estimated poses diverse; Section A.1.5: the number of training iterations of the base shape learning is set to 1200, 300, 200, 800, and 200 for CIFAR-10 car, CIFAR-10 horse, PASCAL aeroplane, PASCAL car, and PASCAL chair respectively. In full model training, the number of iterations is set to 10000 in all categories). 
Kato teaches that this will allow for reconstruction of the 3D shape, pose, and texture of objects from categorized natural images in a self-supervised manner (Abstract) and is in the same field of endeavor as Zuffi of using neural networks to create 3D objects from images (Abstract). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zuffi with the teachings of Kato.
Regarding claim 5, Zuffi, in view of Kato, teaches the computer-implemented method of claim 4. Kato discloses wherein each identity shape of the identity shapes is computed as a sum of component shapes included in the set of learned shape bases and each component shape is corresponding scaled by a coefficient generated by the neural network model (Section 3.1.7: to obtain a category-specific base shape, a shape generator, a texture generator, and a background generator are trained by minimizing the sum of reconstruction loss Lrec and smoothing loss Ls under the constraints of shape symmetricity, and texture and background simplicity. Though the input is random noise, generated shapes converge to a single shape after training).
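For illustration only, the claim 5 limitation (an identity shape computed as a sum of component shapes, each scaled by a coefficient) corresponds to a linear combination over a shape basis. The sketch below is an assumption-laden example and is not drawn from Kato, Zuffi, or the application: the name `identity_shape`, the (K, V, 3) layout of the bases, and the example values are hypothetical.

```python
import numpy as np

def identity_shape(shape_bases, coefficients):
    """Return the coefficient-weighted sum of K component shapes.

    shape_bases:  (K, V, 3) array, K component shapes with V vertices each.
    coefficients: (K,) array, one scale factor per component shape
                  (in the claim, these would be generated by the neural network model).
    """
    # Contract the K axis: result[v, d] = sum_k coefficients[k] * shape_bases[k, v, d]
    return np.tensordot(coefficients, shape_bases, axes=1)

# Two hypothetical component shapes with four vertices each.
bases = np.stack([np.ones((4, 3)), 2.0 * np.ones((4, 3))])
coeffs = np.array([0.5, 0.25])
shape = identity_shape(bases, coeffs)  # every entry is 0.5 * 1 + 0.25 * 2 = 1.0
```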
Regarding claim 6, Zuffi, in view of Kato, teaches the computer-implemented method of claim 4. Zuffi discloses wherein the 3D shape representation is a mesh of vertices that define faces (Section 3.1: let β be a row vector of shape variables, then vertices of a subject-specific shape in the reference T-pose are computed…this work we focus on an animal in the Equine family, thus the template vhorse is the shape that corresponds to the mean horse in the original SMAL model).
Regarding claim 7, Zuffi, in view of Kato, teaches the computer-implemented method of claim 4. Zuffi discloses wherein the images in the video are unlabeled (Abstract: integrate the recent SMAL animal model into a network-based regression pipeline, which we train end-to-end on synthetically generated images with pose, shape, and background variation; Section 1: We scan and model the body, label images by hand, and build motion capture systems of all kinds. This level of investment is not possible for every animal species…Using a fully generative model of animal shape, appearance, and neural rendering, we use a photometric loss to train a neural network that predicts the 3D pose, shape, and texture map of an animal from a single image. A key novelty of the network is that it links the texture prediction to 3D pose and shape through a shared feature space, such that, in predicting the texture map, the network estimates model parameters for an optimal mapping between image pixels and texture map elements; Section 2: Tung et al. [28] exploit temporal consistency over video frames to train an end-to-end prediction model on images without ground truth 3D pose and shape).
Regarding claim 8, Zuffi, in view of Kato, teaches the computer-implemented method of claim 4. Zuffi discloses wherein the neural network model is further configured to predict the camera pose (Section 3.4: camera prediction layer predicts the focal length of the perspective camera. Since we also predict the depth in the network, this parameter can be redundant; however we have found empirically that it allows better model fits to images).
Regarding claim 9, Zuffi, in view of Kato, teaches the computer-implemented method of claim 4. Zuffi discloses further comprising: projecting the 3D object according to the camera pose to produce a rendered image (Fig. 4; Section 3.2: For each zebra model we generated random images that differ in background, shape, pose, camera, and appearance. Models are rendered with OpenDR…For each zebra model we generated 1000 images with different poses obtained by sampling a multivariate Gaussian distribution over the 3D Rodrigues vectors that describe pose. The sampling distribution is learned from the 57 poses obtained with SMALR and a synthetic walking sequence. We also add, for each zebra model, about 285 images obtained by adding noise to the 57 poses…Changing depth varies the size of the animal as we use perspective projection; Section 3.4: we define ground truth vertex displacements for network training…use the Neural Mesh Renderer (NMR) [14] for rendering the model and perspective projection); and updating parameters of the neural network model to reduce  train the network to minimize the loss…where: Sgt is the mask, Lmask is the mask loss, defined as the L1 loss between Sgt and the predicted mask…the 2D keypoint loss, defined as the MSE loss between…and the projected 3D keypoints defined on the model vertices. Lcam is the camera loss, defined as the MSE loss between fgt and predicted focal length. Limg is the image loss, computed as the perceptual distance [32] between the masked input image and rendered zebra. Lpose is the MSE loss between θgt and predicted 3D poses, computed as geodesic distance [19]. Ltrans is the translation loss, defined as the MSE between γgt and predicted translation. Lshape is the shape loss, defined as the MSE between dvgt and predicted dv).
Regarding claim 10, Zuffi discloses a computer-implemented method of constructing a three-dimensional (3D) representation of an object, comprising: receiving, by a neural network model, a video including images of the object captured from a camera pose (Abstract: integrate the recent SMAL animal model into a network-based regression pipeline, which we train end-to-end on synthetically generated images with pose, shape, and background variation; Section 1: Using a fully generative model of animal shape, appearance, and neural rendering, we use a photometric loss to train a neural network that predicts the 3D pose, shape, and texture map of an animal from a single image. A key novelty of the network is that it links the texture prediction to 3D pose and shape through a shared feature space, such that, in predicting the texture map, the network estimates model parameters for an optimal mapping between image pixels and texture map elements; Section 2: Tung et al. [28] exploit temporal consistency over video frames to train an end-to-end prediction model on images without ground truth 3D pose and shape); predicting, by the neural network model, a 3D shape representation of the object for a first image of the images based on a set of learned shape bases (Fig. 4; Section 3: estimating the 3D pose and shape of zebras from a single image as a model-based regression problem, where we train a neural network to predict 3D pose, shape and texture for the SMAL model; Section 3.1: let β be a row vector of shape variables, then vertices of a subject-specific shape in the reference T-pose are computed…this work we focus on an animal in the Equine family, thus the template vhorse is the shape that corresponds to the mean horse in the original SMAL model; Section 3.2: We increase the variability of the zebra shapes by adding noise to the shape variables (we use 20 shape variables). 
In addition to size variations due to depth, we add size variation by adding noise to the reference camera); predicting, by the neural network model, a texture flow for the first image (Section 3.2: For each image we also compute the texture uv-flow that represents the mapping between image pixels and textels, and can be interpreted as the flow between the image and the texture map; Section 3.4: texture prediction module is inspired by the work of Kanazawa et al. [13]. While [13] explores texture regression on a simple texture map that corresponds to a sphere, quadrupeds, like zebras, have a more complicated surface and texture map layout. We therefore predict the texture map as a collection of 4 sub-images that we then stitch together. We found this to work better than directly predicting the full texture map, probably because, given the complexity of the articulated model, the network has difficulty with the spatial discontinuities in the texture map); mapping pixels from the first image to a texture space according to the texture flow to produce a texture image, wherein transfer of the texture image onto the 3D shape representation constructs a 3D object corresponding to the object in the first image (Section 1: Using a fully generative model of animal shape, appearance, and neural rendering, we use a photometric loss to train a neural network that predicts the 3D pose, shape, and texture map of an animal from a single image. A key novelty of the network is that it links the texture prediction to 3D pose and shape through a shared feature space, such that, in predicting the texture map, the network estimates model parameters for an optimal mapping between image pixels and texture map elements. 
In order to prevent the network from just learning average texture map colors, inspired by [13], we predict the flow between the image pixels and the texture map; Section 3.2: For each image we also compute the texture uv-flow that represents the mapping between image pixels and textels, and can be interpreted as the flow between the image and the texture map; Section 3.4: decoders are composed of a set of convolutional layers and a final tanh module. The output of the decoders is stitched to create a full uv-flow map, that encodes which image pixels correspond to pixels in the texture map); transferring the texture image predicted for the first image to a second 3D shape representation predicted for a second image of the images to produce a first 3D object (Fig. 4; Section 3.2: For each image we also compute the texture uv-flow that represents the mapping between image pixels and textels, and can be interpreted as the flow between the image and the texture map. For each image we save the following annotation data: texture map Tgt, texture uv-flow uvgt, silhouette Sgt, pose θgt, global translation γgt, shape variables βgt, vertex displacements …use a total of 28 surface landmarks, placed at the joints, on the face, ears and tail tip. These are defined only once on the 3D model template); projecting the first 3D object according to a first camera pose associated with the first image to produce a first projected 3D object (Fig. 4; Section 3.2: For each zebra model we generated random images that differ in background, shape, pose, camera, and appearance. Models are rendered with OpenDR…For each zebra model we generated 1000 images with different poses obtained by sampling a multivariate Gaussian distribution over the 3D Rodrigues vectors that describe pose. The sampling distribution is learned from the 57 poses obtained with SMALR and a synthetic walking sequence. 
We also add, for each zebra model, about 285 images obtained by adding noise to the 57 poses…Changing depth varies the size of the animal as we use perspective projection; Section 3.4: we define ground truth vertex displacements for network training…use the Neural Mesh Renderer (NMR) [14] for rendering the model and perspective projection).
	Zuffi does not explicitly disclose transferring a second texture image predicted for the second image to the 3D shape representation predicted for the first image to produce a second 3D object; projecting the second 3D object according to a second camera pose associated with the second image to produce a second projected 3D object; and updating parameters of the neural network model to encourage consistency between the first projected 3D object and the second projected 3D object.
 We train this model by comparing input images with reconstructed images. Given an image, 3D shape, pose, texture image, and background image are estimated by neural networks. Then, an image is rendered using these estimated elements; Section 3.2.3: we employ view prior learning (VPL) [17] to reduce overfitting to the observed views. Summarily, a loss function is composed of the following four terms. (1) Reconstruction loss. Reconstructed images using the best shapes, estimated textures, the best poses, and estimated backgrounds are compared with input images. In addition, feature matching is also used. (2) Mean absolute error between estimated shapes/poses and the best shapes/poses that are recorded during training. (3) Total variation of estimated texture images for denoising. (4) VPL loss. To facilitate an early phase of training, at the i-th iteration, training samples are randomly selected from first to i-th data in the dataset. This makes the model see the same sample frequently in an early stage, which simplifies finding the best poses and makes the estimated poses diverse); projecting the second 3D object according to a second camera pose associated with the second image to produce a second projected 3D object (Fig. 1; Fig. 6; Section 4.1.2: we evaluate the second step using the base shapes obtained in the previous step. Fig. 6 shows representative results on the test set. Reconstructed images demonstrate that the estimators trained by our method are able to reconstruct images that look similar to input images (a–b). Estimated shapes, poses, and backgrounds can be further improved by simple gradient descent and photometric reconstruction loss (c). Rendered images from other viewpoints show that these objects have correct 3D shapes, which are slightly different among different input images; Section 4.1.2: Secondly, we evaluate the second step using the base shapes obtained in the previous step. Fig. 6 shows representative results on the test set. 
Reconstructed images demonstrate that the estimators trained by our method are able to reconstruct images that look similar to input images (a–b). Estimated shapes, poses, and backgrounds can be further improved by simple gradient descent and photometric reconstruction loss (c). Rendered images from other viewpoints show that these objects have correct 3D shapes, which are slightly different among different input images); and updating parameters of the neural network model to encourage consistency between the first projected 3D object and the second projected 3D object (Fig. 4; Section 3.2.1: variation between generated shapes tends to be very small because exploring various shapes using only a differentiable renderer and gradient descent is difficult due to local minima. To overcome this problem, we explore and record the best shape for each input image at each training iteration; Section 3.2.3: a loss function is composed of the following four terms. (1) Reconstruction loss. Reconstructed images using the best shapes, estimated textures, the best poses, and estimated backgrounds are compared with input images. In addition, feature matching is also used. (2) Mean absolute error between estimated shapes/poses and the best shapes/poses that are recorded during training. (3) Total variation of estimated texture images for denoising. (4) VPL loss. To facilitate an early phase of training, at the i-th iteration, training samples are randomly selected from first to i-th data in the dataset. This makes the model see the same sample frequently in an early stage, which simplifies finding the best poses and makes the estimated poses diverse; Section A.1.5: the number of training iterations of the base shape learning is set to 1200, 300, 200, 800, and 200 for CIFAR-10 car, CIFAR-10 horse, PASCAL aeroplane, PASCAL car, and PASCAL chair respectively. 
In full model training, the number of iterations is set to 10000 in all categories; Section 3.1.7: to obtain a category-specific base shape, a shape generator, a texture generator, and a background generator are trained by minimizing the sum of reconstruction loss Lrec and smoothing loss Ls under the constraints of shape symmetricity, and texture and background simplicity. Though the input is random noise, generated shapes converge to a single shape after training, which is similar to mode collapse in GANs). Kato teaches that this will allow for reconstruction of the 3D shape, pose, and texture of objects from categorized natural images in a self-supervised manner (Abstract) and is in the same field of endeavor as Zuffi of using neural networks to create 3D objects from images (Abstract).
Regarding claim 11, Zuffi discloses a computer-implemented method of constructing a three-dimensional (3D) representation of an object, comprising: receiving, by a neural network model, a video including images of the object captured from a camera pose (Abstract: integrate the recent SMAL animal model into a network-based regression pipeline, which we train end-to-end on synthetically generated images with pose, shape, and background variation; Section 1: Using a fully generative model of animal shape, appearance, and neural rendering, we use a photometric loss to train a neural network that predicts the 3D pose, shape, and texture map of an animal from a single image. A key novelty of the network is that it links the texture prediction to 3D pose and shape through a shared feature space, such that, in predicting the texture map, the network estimates model parameters for an optimal mapping between image pixels and texture map elements; Section 2: Tung et al. [28] exploit temporal consistency over video frames to train an end-to-end prediction model on images without ground truth 3D pose and shape); bases (Fig. 4; Section 3: estimating the 3D pose and shape of zebras from a single image as a model-based regression problem, where we train a neural network to predict 3D pose, shape and texture for the SMAL model; Section 3.1: let β be a row vector of shape variables, then vertices of a subject-specific shape in the reference T-pose are computed…this work we focus on an animal in the Equine family, thus the template vhorse is the shape that corresponds to the mean horse in the original SMAL model; Section 3.2: We increase the variability of the zebra shapes by adding noise to the shape variables (we use 20 shape variables). 
In addition to size variations due to depth, we add size variation by adding noise to the reference camera); predicting, by the neural network model, a texture flow for the first image (Section 3.2: For each image we also compute the texture uv-flow that represents the mapping between image pixels and textels, and can be interpreted as the flow between the image and the texture map; Section 3.4: texture prediction module is inspired by the work of Kanazawa et al. [13]. While [13] explores texture regression on a simple texture map that corresponds to a sphere, quadrupeds, like zebras, have a more complicated surface and texture map layout. We therefore predict the texture map as a collection of 4 sub-images that we then stitch together. We found this to work better than directly predicting the full texture map, probably because, given the complexity of the articulated model, the network has difficulty with the spatial discontinuities in the texture map); mapping pixels from the first image to a texture space according to the texture flow to produce a texture image, wherein transfer of the texture image onto the 3D shape representation constructs a 3D object corresponding to the object in the first image (Section 1: Using a fully generative model of animal shape, appearance, and neural rendering, we use a photometric loss to train a neural network that predicts the 3D pose, shape, and texture map of an animal from a single image. A key novelty of the network is that it links the texture prediction to 3D pose and shape through a shared feature space, such that, in predicting the texture map, the network estimates model parameters for an optimal mapping between image pixels and texture map elements. 
In order to prevent the network from just learning average texture map colors, inspired by [13], we predict the flow between the image pixels and the texture map; Section 3.2: For each image we also compute the texture uv-flow that represents the mapping between image pixels and textels, and can be interpreted as the flow between the image and the texture map; Section 3.4: decoders are composed of a set of convolutional layers and a final tanh module. The output of the decoders is stitched to create a full uv-flow map, that encodes which image pixels correspond to pixels in the texture map); projecting the first 3D shape representation according to a first camera pose associated with the first image to produce a first projected 3D object (Fig. 4; Section 3.2: For each zebra model we generated random images that differ in background, shape, pose, camera, and appearance. Models are rendered with OpenDR…For each zebra model we generated 1000 images with different poses obtained by sampling a multivariate Gaussian distribution over the 3D Rodrigues vectors that describe pose. The sampling distribution is learned from the 57 poses obtained with SMALR and a synthetic walking sequence. We also add, for each zebra model, about 285 images obtained by adding noise to the 57 poses…Changing depth varies the size of the animal as we use perspective projection; Section 3.4: we define ground truth vertex displacements for network training…use the Neural Mesh Renderer (NMR) [14] for rendering the model and perspective projection).
	Zuffi does not explicitly disclose applying first non-rigid motion deformations predicted for the first image to a first identity shape predicted for a second image of the images to produce a first 3D shape representation; applying second non-rigid motion deformations predicted for the second image to a second identity shape predicted for the first image to produce a second 3D shape representation; projecting the second 3D shape representation according to a second camera pose associated with the second image to produce a second projected 3D object; and updating parameters of the neural network model to encourage consistency between the first projected 3D object and the second projected 3D object.
	However, Kato teaches using neural networks to create 3D objects from images (Abstract; Section 1), further comprising applying first non-rigid motion deformations predicted for the first image to a first identity shape predicted for a second image of the images to produce a first 3D shape representation (Section 3.2.1: variation between generated shapes tends to be very small because exploring various shapes using only a differentiable renderer and gradient descent is difficult due to local minima. To overcome this problem, we explore and record the best shape for each input image at each training iteration. Specifically, we render images using an estimated shape, a recorded best shape, a slightly perturbed the best shape, and random shapes. Then, we compute reconstruction loss to find the best one and record it; Section 3.2.2: parameterize the 6DoF object/camera pose by azimuth and elevation as with Section 3.1.3, in-plane rotation of an object, center point of an object in 2D image coordinates, and scale of an object. We train a decoder that outputs these six parameters. We adopt multiple regressor approach used in [14]. Exploring best pose Similarly to shape prediction, we also need to actively explore the best poses. At each training iteration, we explore and record the best pose for each input image by rendering images using estimated, recorded, random, and perturbed poses); applying second non-rigid motion deformations predicted for the second image to a second identity shape predicted for the first image to produce a second 3D shape representation (Fig. 2; Fig. 4; Section 1: We train this model by comparing input images with reconstructed images. Given an image, 3D shape, pose, texture image, and background image are estimated by neural networks. Then, an image is rendered using these estimated elements; Section 3.2.3: we employ view prior learning (VPL) [17] to reduce overfitting to the observed views. 
Summarily, a loss function is composed of the following four terms. (1) Reconstruction loss. Reconstructed images using the best shapes, estimated textures, the best poses, and estimated backgrounds are compared with input images. In addition, feature matching is also used. (2) Mean absolute error between estimated shapes/poses and the best shapes/poses that are recorded during training. (3) Total variation of estimated texture images for denoising. (4) VPL loss. To facilitate an early phase of training, at the i-th iteration, training samples are randomly selected from first to i-th data in the dataset. This makes the model see the same sample frequently in an early stage, which simplifies finding the best poses and makes the estimated poses diverse); projecting the second 3D shape representation according to a second camera pose associated with the second image to produce a second projected 3D object (Section 4.1.2: Secondly, we evaluate the second step using the base shapes obtained in the previous step. Fig. 6 shows representative results on the test set. Reconstructed images demonstrate that the estimators trained by our method are able to reconstruct images that look similar to input images (a–b). Estimated shapes, poses, and backgrounds can be further improved by simple gradient descent and photometric reconstruction loss (c). 
Rendered images from other viewpoints show that these objects have correct 3D shapes, which are slightly different among different input images); and updating parameters of the neural network model to encourage consistency between the first projected 3D object and the second projected 3D object (Fig. 4; Section 3.2.1: variation between generated shapes tends to be very small because exploring various shapes using only a differentiable renderer and gradient descent is difficult due to local minima. To overcome this problem, we explore and record the best shape for each input image at each training iteration; Section 3.2.3: a loss function is composed of the following four terms. (1) Reconstruction loss. Reconstructed images using the best shapes, estimated textures, the best poses, and estimated backgrounds are compared with input images. In addition, feature matching is also used. (2) Mean absolute error between estimated shapes/poses and the best shapes/poses that are recorded during training. (3) Total variation of estimated texture images for denoising. (4) VPL loss. To facilitate an early phase of training, at the i-th iteration, training samples are randomly selected from first to i-th data in the dataset. This makes the model see the same sample frequently in an early stage, which simplifies finding the best poses and makes the estimated poses diverse; Section A.1.5: the number of training iterations of the base shape learning is set to 1200, 300, 200, 800, and 200 for CIFAR-10 car, CIFAR-10 horse, PASCAL aeroplane, PASCAL car, and PASCAL chair respectively. In full model training, the number of iterations is set to 10000 in all categories; Section 3.1.7: to obtain a category-specific base shape, a shape generator, a texture generator, and a background generator are trained by minimizing the sum of reconstruction loss Lrec and smoothing loss Ls under the constraints of shape symmetricity, and texture and background simplicity. 
Though the input is random noise, generated shapes converge to a single shape after training, which is similar to mode collapse in GANs). Kato teaches that this will allow for reconstruction of the 3D shape, pose, and texture of objects from categorized natural images in a self-supervised manner (Abstract) and is in the same field of endeavor as Zuffi of using neural networks to create 3D objects from images (Abstract). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zuffi with the features of applying first non-rigid motion deformations predicted for the first image to a first identity shape predicted for a second image of the images to produce a first 3D shape representation; applying second non-rigid motion deformations predicted for the second image to a second identity shape predicted for the first image to produce a second 3D shape representation; projecting the second 3D shape representation according to a second camera pose associated with the second image to produce a second projected 3D object; and updating parameters of the neural network model to encourage consistency between the first projected 3D object and the second projected 3D object as taught by Kato so as to allow for self-supervised reconstruction as presented by Kato.
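The cross-frame limitations recited for claim 11 (swapping the predicted non-rigid motion and identity shape between two images, projecting each result, and encouraging consistency) may be easier to follow with a minimal sketch. The sketch is not taken from Zuffi, Kato, or the application. The function names, the representation of non-rigid motion as per-vertex offsets, the pinhole camera given only by a focal length and a translation, and the mean-squared-distance loss are all assumptions made for illustration.

```python
# Illustrative sketch only. All names and data layouts here are assumptions,
# not the method of the application or of the cited references.

def apply_deformation(identity_shape, deformation):
    """Add per-vertex non-rigid offsets to an identity (rest) shape."""
    return [tuple(v + d for v, d in zip(vertex, offset))
            for vertex, offset in zip(identity_shape, deformation)]

def project(shape, camera):
    """Perspective-project 3D vertices with a hypothetical pinhole camera.

    camera = (focal_length, (tx, ty, tz)); tz must keep depth positive.
    """
    f, (tx, ty, tz) = camera
    return [(f * (x + tx) / (z + tz), f * (y + ty) / (z + tz))
            for x, y, z in shape]

def consistency_loss(proj_a, proj_b):
    """Mean squared distance between corresponding projected vertices."""
    return sum((ua - ub) ** 2 + (va - vb) ** 2
               for (ua, va), (ub, vb) in zip(proj_a, proj_b)) / len(proj_a)

# Toy predictions for two images of the same object.
identity_1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]  # identity shape from image 1
identity_2 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]  # identity shape from image 2
motion_1 = [(0.0, 0.1, 0.0), (0.0, 0.2, 0.0)]    # motion predicted for image 1
motion_2 = [(0.0, 0.1, 0.0), (0.0, 0.2, 0.0)]    # motion predicted for image 2
camera_1 = (1.0, (0.0, 0.0, 2.0))
camera_2 = (1.0, (0.0, 0.0, 2.0))

# Swap: motion of image 1 on the identity from image 2, and vice versa.
shape_1 = apply_deformation(identity_2, motion_1)
shape_2 = apply_deformation(identity_1, motion_2)

# A training loop would adjust network parameters to reduce this value.
loss = consistency_loss(project(shape_1, camera_1), project(shape_2, camera_2))
```

In this toy case the two sets of predictions agree, so the loss is zero; any disagreement between the swapped shapes or cameras yields a positive value.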
Regarding claim 12, Zuffi discloses a computer-implemented method of constructing a three-dimensional (3D) representation of an object, comprising: receiving, by a neural network model, a video including images of the object captured from a camera pose (Abstract: integrate the recent SMAL animal model into a network-based regression pipeline, which we train end-to-end on synthetically generated images with pose, shape, and background variation; Section 1: Using a fully generative model of animal shape, appearance, and neural rendering, we use a photometric loss to train a neural network that predicts the 3D pose, shape, and texture map of an animal from a single image. A key novelty of the network is that it links the texture prediction to 3D pose and shape through a shared feature space, such that, in predicting the texture map, the network estimates model parameters for an optimal mapping between image pixels and texture map elements; Section 2: Tung et al. [28] exploit temporal consistency over video frames to train an end-to-end prediction model on images without ground truth 3D pose and shape); bases (Fig. 4; Section 3: estimating the 3D pose and shape of zebras from a single image as a model-based regression problem, where we train a neural network to predict 3D pose, shape and texture for the SMAL model; Section 3.1: let β be a row vector of shape variables, then vertices of a subject-specific shape in the reference T-pose are computed…this work we focus on an animal in the Equine family, thus the template vhorse is the shape that corresponds to the mean horse in the original SMAL model; Section 3.2: We increase the variability of the zebra shapes by adding noise to the shape variables (we use 20 shape variables). 
In addition to size variations due to depth, we add size variation by adding noise to the reference camera); predicting, by the neural network model, a texture flow for the first image (Section 3.2: For each image we also compute the texture uv-flow that represents the mapping between image pixels and textels, and can be interpreted as the flow between the image and the texture map; Section 3.4: texture prediction module is inspired by the work of Kanazawa et al. [13]. While [13] explores texture regression on a simple texture map that corresponds to a sphere, quadrupeds, like zebras, have a more complicated surface and texture map layout. We therefore predict the texture map as a collection of 4 sub-images that we then stitch together. We found this to work better than directly predicting the full texture map, probably because, given the complexity of the articulated model, the network has difficulty with the spatial discontinuities in the texture map); mapping pixels from the first image to a texture space (Section 1: Using a fully generative model of animal shape, appearance, and neural rendering, we use a photometric loss to train a neural network that predicts the 3D pose, shape, and texture map of an animal from a single image. A key novelty of the network is that it links the texture prediction to 3D pose and shape through a shared feature space, such that, in predicting the texture map, the network estimates model parameters for an optimal mapping between image pixels and texture map elements. In order to prevent the network from just learning average texture map colors, inspired by [13], we predict the flow between the image pixels and the texture map; Section 3.2: For each image we also compute the texture uv-flow that represents the mapping between image pixels and textels, and can be interpreted as the flow between the image and the texture map; Section 3.4: decoders are composed of a set of convolutional layers and a final tanh module. 
The output of the decoders is stitched to create a full uv-flow map, that encodes which image pixels correspond to pixels in the texture map); propagating a part map across the object in a number of the images to produce propagated part maps (Fig. 4; Section 3.2: For each image we also compute the texture uv-flow that represents the mapping between image pixels and textels, and can be interpreted as the flow between the image and the texture map; Section 3.4: texture prediction module is inspired by the work of Kanazawa et al. [13]. While [13] explores texture regression on a simple texture map that corresponds to a sphere, quadrupeds, like zebras, have a more complicated surface and texture map layout. We therefore predict the texture map as a collection of 4 sub-images that we then stitch together. We found this to work better than directly predicting the full texture map, probably because, given the complexity of the articulated model, the network has difficulty with the spatial discontinuities in the texture map); mapping the propagated part maps into the texture space according to corresponding texture flows predicted for the number of images to produce part maps in the texture space (Fig. 4; Section 3.2: For each image we also compute the texture uv-flow that represents the mapping between image pixels and textels, and can be interpreted as the flow between the image and the texture map; Section 3.4: texture prediction module is inspired by the work of Kanazawa et al. [13]. While [13] explores texture regression on a simple texture map that corresponds to a sphere, quadrupeds, like zebras, have a more complicated surface and texture map layout. We therefore predict the texture map as a collection of 4 sub-images that we then stitch together. 
We found this to work better than directly predicting the full texture map, probably because, given the complexity of the articulated model, the network has difficulty with the spatial discontinuities in the texture map; Section 3.4: Texture prediction. The texture prediction module is inspired by the work of Kanazawa et al. [13]. While [13] explores texture regression on a simple texture map that corresponds to a sphere, quadrupeds, like zebras, have a more complicated surface and texture map layout. We therefore predict the texture map as a collection of 4 sub-images that we then stitch together. We found this to work better than directly predicting the full texture map, probably because, given the complexity of the articulated model, the network has difficulty with the spatial discontinuities in the texture map. We cut the texture map (that has size (256, 256)) into 4 regions illustrated in Figure 5. For each sub-image we define an encoder and decoder. Each encoder outputs a (256, H, W) feature map, where H and W are a reduction of 32 of the size of the sub-image, and is composed of 2 fully connected layers. The decoders are composed of a set of convolutional layers and a final tanh module. The output of the decoders is stitched to create a full uv-flow map, that encodes which image pixels correspond to pixels in the texture map).
	Zuffi does not explicitly disclose aggregating the part maps to produce a video-level part map.
	However, Kato teaches using neural networks to create 3D objects from images (Abstract; Section 1), further comprising aggregating the part maps to produce a video-level part map (Fig. 4; Section 3.2.1: variation between generated shapes tends to be very small because exploring various shapes using only a differentiable renderer and gradient descent is difficult due to local minima. To overcome this problem, we explore and record the best shape for each input image at each training iteration; Section 3.2.3: a loss function is composed of the following four terms. (1) Reconstruction loss. Reconstructed images using the best shapes, estimated textures, the best poses, and estimated backgrounds are compared with input images. In addition, feature matching is also used. (2) Mean absolute error between estimated shapes/poses and the best shapes/poses that are recorded during training. (3) Total variation of estimated texture images for denoising. (4) VPL loss. To facilitate an early phase of training, at the i-th iteration, training samples are randomly selected from first to i-th data in the dataset. This makes the model see the same sample frequently in an early stage, which simplifies finding the best poses and makes the estimated poses diverse; Section 4.1: mainly tested our method on the CIFAR-10 [19] dataset because it is composed of natural images and contains thousands of images per object category. Among ten object categories, we focused on car and horse classes because car is an artificial and rigid object and one of the most commonly used categories on the synthetic ShapeNet dataset [2] and horse is a deformable natural object not contained in ShapeNet. For feature extraction, we trained WRN-16-4 [48] on the CIFAR-10 training set. We used three layers right before sub-sampling as feature maps; Section 4.1.2: Secondly, we evaluate the second step using the base shapes obtained in the previous step. Fig. 6 shows representative results on the test set. 
Reconstructed images demonstrate that the estimators trained by our method are able to reconstruct images that look similar to input images (a–b). Estimated shapes, poses, and backgrounds can be further improved by simple gradient descent and photometric reconstruction loss (c). Rendered images from other viewpoints show that these objects have correct 3D shapes, which are slightly different among different input images; Section A.1.5: the number of training iterations of the base shape learning is set to 1200, 300, 200, 800, and 200 for CIFAR-10 car, CIFAR-10 horse, PASCAL aeroplane, PASCAL car, and PASCAL chair respectively. In full model training, the number of iterations is set to 10000 in all categories; Section 3.1.7: to obtain a category-specific base shape, a shape generator, a texture generator, and a background generator are trained by minimizing the sum of reconstruction loss Lrec and smoothing loss Ls under the constraints of shape symmetricity, and texture and background simplicity. Though the input is random noise, generated shapes converge to a single shape after training, which is similar to mode collapse in GANs). Kato teaches that this will allow for reconstruction of the 3D shape, pose, and texture of objects from categorized natural images in a self-supervised manner (Abstract) and is in the same field of endeavor as Zuffi of using neural networks to create 3D objects from images (Abstract). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zuffi with the features of aggregating the part maps to produce a video-level part map as taught by Kato so as to allow for self-supervised reconstruction as presented by Kato.
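The part-map limitations of claim 12 (mapping per-image part maps into the texture space according to a texture flow, then aggregating them into a video-level part map) can be illustrated with a minimal sketch. It is an assumption-laden illustration rather than the method of either reference: the flow is stored here as integer (row, column) pixel indices per texel, whereas the uv-flow quoted from Zuffi is produced by a tanh layer and would normally be sampled with interpolation, and the per-texel majority vote is only one hypothetical way to aggregate. The function names are likewise hypothetical.

```python
from collections import Counter

# Illustrative sketch only. Function names, the integer flow layout, and the
# majority-vote aggregation are assumptions, not taken from the cited art.

def warp_to_texture(part_map, uv_flow):
    """Fill each texel with the part label of the image pixel it points to.

    uv_flow[i][j] is the (row, col) of the image pixel that maps to texel (i, j).
    """
    return [[part_map[r][c] for (r, c) in flow_row] for flow_row in uv_flow]

def aggregate(texture_part_maps):
    """Per-texel majority vote over the texture-space part maps of all frames."""
    height = len(texture_part_maps[0])
    width = len(texture_part_maps[0][0])
    return [[Counter(m[i][j] for m in texture_part_maps).most_common(1)[0][0]
             for j in range(width)]
            for i in range(height)]

# Three toy frames with 2x2 part maps (labels 0, 1, 2) and a 1x2 texture space.
part_maps = [
    [[1, 2], [0, 0]],
    [[0, 1], [2, 0]],
    [[1, 2], [0, 0]],
]
flows = [
    [[(0, 0), (0, 1)]],
    [[(0, 1), (1, 0)]],
    [[(1, 0), (0, 1)]],
]

texture_maps = [warp_to_texture(p, f) for p, f in zip(part_maps, flows)]
video_level_part_map = aggregate(texture_maps)
```

Because every frame is warped into the same texture space, a label that is wrong in one frame (the first texel of the third frame here) can be outvoted by the other frames.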
Regarding claim 13, Zuffi in view of Kato teaches the computer-implemented method of claim 12, and Kato further discloses: rendering 3D shape representations predicted for the number of the images according to associated camera poses, wherein the video-level part map is transferred onto each one of the 3D shape representations to produce rendered images (Fig. 4; Section 3.1: we propose a model that generates a shape, texture, and background from random noise by minimizing the difference between the set of rendered images and the set of images in a dataset. In the following sections, we explain each component along with the additional constraints and regularization needed to obtain a meaningful shape; Section 3.2.2: parameterize the 6DoF object/camera pose by azimuth and elevation as with Section 3.1.3, in-plane rotation of an object, center point of an object in 2D image coordinates, and scale of an object. We train a decoder that outputs these six parameters. We adopt multiple regressor approach used in [14]. Exploring best pose Similarly to shape prediction, we also need to actively explore the best poses. At each training iteration, we explore and record the best pose for each input image by rendering images using estimated, recorded, random, and perturbed poses; Section 3.2.3: In addition to the components described above, we employ view prior learning (VPL) [17] to reduce overfitting to the observed views. Summarily, a loss function is composed of the following four terms. (1) Reconstruction loss. Reconstructed images using the best shapes, estimated textures, the best poses, and estimated backgrounds are compared with input images. In addition, feature matching is also used. (2) Mean absolute error between estimated shapes/poses and the best shapes/poses that are recorded during training. (3) Total variation of estimated texture images for denoising. (4) VPL loss. 
To facilitate an early phase of training, at the i-th iteration, training samples are randomly selected from first to i-th data in the dataset. This makes the model see the same sample frequently in an early stage, which simplifies finding the best poses and makes the estimated poses diverse; Section 4.1.2: Secondly, we evaluate the second step using the base shapes obtained in the previous step. Fig. 6 shows representative results on the test set. Reconstructed images demonstrate that the estimators trained by our method are able to reconstruct images that look similar to input images (a–b). Estimated shapes, poses, and backgrounds can be further improved by simple gradient descent and photometric reconstruction loss (c). Rendered images from other viewpoints show that these objects have correct 3D shapes, which are slightly different among different input images); and updating parameters of the neural network model to encourage consistency between the rendered images and the propagated part maps (Fig. 4; Section 3.2.1: variation between generated shapes tends to be very small because exploring various shapes using only a differentiable renderer and gradient descent is difficult due to local minima. To overcome this problem, we explore and record the best shape for each input image at each training iteration; Section 3.2.3: a loss function is composed of the following four terms. 
(1) Reconstruction loss. Reconstructed images using the best shapes, estimated textures, the best poses, and estimated backgrounds are compared with input images. In addition, feature matching is also used. (2) Mean absolute error between estimated shapes/poses and the best shapes/poses that are recorded during training. (3) Total variation of estimated texture images for denoising. (4) VPL loss. To facilitate an early phase of training, at the i-th iteration, training samples are randomly selected from first to i-th data in the dataset. This makes the model see the same sample frequently in an early stage, which simplifies finding the best poses and makes the estimated poses diverse; Section A.1.5: the number of training iterations of the base shape learning is set to 1200, 300, 200, 800, and 200 for CIFAR-10 car, CIFAR-10 horse, PASCAL aeroplane, PASCAL car, and PASCAL chair respectively. In full model training, the number of iterations is set to 10000 in all categories; Section 3.1.7: to obtain a category-specific base shape, a shape generator, a texture generator, and a background generator are trained by minimizing the sum of reconstruction loss Lrec and smoothing loss Ls under the constraints of shape symmetricity, and texture and background simplicity. Though the input is random noise, generated shapes converge to a single shape after training, which is similar to mode collapse in GANs).
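Of the four loss terms quoted from Kato Section 3.2.3, the third (total variation of the estimated texture image, used for denoising) is perhaps the least self-explanatory. The sketch below shows one common form of total variation, the sum of absolute differences between vertically and horizontally adjacent pixels. The quoted passage does not give Kato's exact formulation, so this anisotropic, unnormalized variant and the function name are assumptions for illustration only.

```python
# Illustrative sketch only. The anisotropic, unnormalized form used here is an
# assumption; the quoted passage does not specify the exact formulation.

def total_variation(image):
    """Sum of absolute differences between neighboring pixels of a 2D image."""
    height, width = len(image), len(image[0])
    tv = 0.0
    for i in range(height):
        for j in range(width):
            if i + 1 < height:  # vertical neighbor
                tv += abs(image[i + 1][j] - image[i][j])
            if j + 1 < width:   # horizontal neighbor
                tv += abs(image[i][j + 1] - image[i][j])
    return tv

smooth_texture = [[1.0, 1.0], [1.0, 1.0]]
noisy_texture = [[0.0, 1.0], [1.0, 0.0]]

tv_smooth = total_variation(smooth_texture)
tv_noisy = total_variation(noisy_texture)
```

A uniform texture scores zero while a checkerboard-like texture scores high, which is why penalizing this quantity discourages pixel-level noise in a predicted texture image.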
Regarding claim 15, Zuffi in view of Kato teaches the computer-implemented method of claim 12, and Zuffi further discloses wherein the images are each annotated and further comprising: mapping the annotations into the texture space according to corresponding texture flows predicted for the number of images to produce annotation maps in the texture space (Fig. 4; Section 3.2: For each image we also compute the texture uv-flow that represents the mapping between image pixels and textels, and can be interpreted as the flow between the image and the texture map; Section 3.4: texture prediction module is inspired by the work of Kanazawa et al. [13]. While [13] explores texture regression on a simple texture map that corresponds to a sphere, quadrupeds, like zebras, have a more complicated surface and texture map layout. We therefore predict the texture map as a collection of 4 sub-images that we then stitch together. We found this to work better than directly predicting the full texture map, probably because, given the complexity of the articulated model, the network has difficulty with the spatial discontinuities in the texture map; Section 3.4: Texture prediction. The texture prediction module is inspired by the work of Kanazawa et al. [13]. While [13] explores texture regression on a simple texture map that corresponds to a sphere, quadrupeds, like zebras, have a more complicated surface and texture map layout. We therefore predict the texture map as a collection of 4 sub-images that we then stitch together. We found this to work better than directly predicting the full texture map, probably because, given the complexity of the articulated model, the network has difficulty with the spatial discontinuities in the texture map. We cut the texture map (that has size (256, 256)) into 4 regions illustrated in Figure 5. For each sub-image we define an encoder and decoder. 
Each encoder outputs a (256, H, W) feature map, where H and W are a reduction of 32 of the size of the sub-image, and is composed of 2 fully connected layers. The decoders are composed of a set of convolutional layers and a final tanh module. The output of the decoders is stitched to create a full uv-flow map, that encodes which image pixels correspond to pixels in the texture map); and aggregating the annotation maps to produce a canonical annotation map for the video (Section 3.3: selected a set of 48 images of zebras that were not used to create the digital dataset, and we annotated them for 2D keypoints. We used this as validation set. We then selected 100 images as our test set, also avoiding zebras from the two sets above. For evaluation, we manually generated the segmentation mask for this set of images, and annotated the 2D keypoints. We mirror the images in order to double the test set data).
Regarding claim 16, Zuffi in view of Kato teaches the computer-implemented method of claim 15, and Zuffi further discloses: transferring the canonical annotation map to 3D shape representations predicted for the images to produce annotated 3D shape representations (Section 3.2: For each image we also compute the texture uv-flow that represents the mapping between image pixels and textels, and can be interpreted as the flow between the image and the texture map. For each image we save the following annotation data: texture map Tgt, texture uv-flow uvgt, silhouette Sgt, pose θgt, global translation γgt, shape variables βgt, vertex displacements …use a total of 28 surface landmarks, placed at the joints, on the face, ears and tail tip. These are defined only once on the 3D model template); projecting the annotated 3D shape representations according to the associated camera poses, to produce projected annotations for the images (Section 3.5: train the network to minimize the loss…where: Sgt is the mask, Lmask is the mask loss, defined as the L1 loss between Sgt and the predicted mas…the 2D keypoint loss, defined as the MSE loss between…and the projected 3D keypoints defined on the model vertices. Lcam is the camera loss, defined as the MSE loss between fgt and predicted focal length. Limg is the image loss, computed as the perceptual distance [32] between the masked input image and rendered zebra. Lpose is the MSE loss between θgt and predicted 3D poses, computed as geodesic distance [19]. Ltrans is the translation loss, defined as the MSE between γgt and predicted translation. Lshape is the shape loss, defined as the MSE between dvgt and predicted dv); and updating parameters of the neural network model to encourage consistency between the projected annotations and the annotations (Section 1: accurate synthetic human models exist for training, animals models of sufficient quality are rare, particularly for endangered species. 
A novelty of our approach is that instead of using completely synthetic data, we capture the texture of the animals from real images and render them with variability of background, pose, illumination and camera. This is obtained exploiting the recent SMALR method [33], which allows us to obtain accurate shape, pose, and texture of 10 animals by annotating only about 50 images. From this, adding variations to the subjects, we generate thousands of synthetic training images (Figure 3). We demonstrate that these are realistic enough for our method to learn to estimate body shape, pose and texture from image pixels without any fine-tuning on additional hand-labeled images; Section 2: Zuffi et al. [34] introduced the SMAL model, a 3D articulated shape model of animals, that can represent inter and intra species shape variations. They train the model from scans of toys, which may not exist for endangered species or may not be accurate. They go further in [33] to fit the model to multiple images, while allowing the shape to deform to fit to the individual shape of the animals. This allows them to capture shape outside of the SMAL shape space, increasing realism and generalization to unseen animal shapes. Unfortunately, the method is based on manually extracted silhouettes and keypoint annotations. More recently, Biggs et al. [5] fit the SMAL model to images automatically by training a joint detector on synthetically generated silhouettes. At inference time, their method requires accurate segmentation and is not robust to occlusion).
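The Section 3.5 passage quoted for claim 16 describes a training loss built from several terms, including an L1 mask loss and an MSE loss between annotated 2D keypoints and projected 3D keypoints. A minimal sketch of how two such terms might be computed and combined is given below. The quoted passage elides the full loss expression, so the weighted-sum combination, the weight values, the mean reduction, and all function and variable names are assumptions for illustration.

```python
# Illustrative sketch only. The weights, the mean reduction, and all names are
# assumptions; the quoted passage elides the full loss expression.

def l1_loss(target, predicted):
    """Mean absolute difference, used here for the silhouette (mask) term."""
    return sum(abs(t - p) for t, p in zip(target, predicted)) / len(target)

def mse_loss(target, predicted):
    """Mean squared difference, used here for the 2D keypoint term."""
    return sum((t - p) ** 2 for t, p in zip(target, predicted)) / len(target)

def total_loss(terms, weights):
    """Weighted sum of named loss terms."""
    return sum(weights[name] * value for name, value in terms.items())

# Toy flattened values: a 4-pixel mask and one keypoint as (x, y).
mask_gt = [1.0, 1.0, 0.0, 0.0]
mask_pred = [1.0, 0.5, 0.0, 0.5]
keypoint_gt = [2.0, 4.0]          # annotated 2D keypoint
keypoint_projected = [1.0, 4.0]   # projection of the 3D model keypoint

terms = {
    "mask": l1_loss(mask_gt, mask_pred),
    "keypoint": mse_loss(keypoint_gt, keypoint_projected),
}
weights = {"mask": 1.0, "keypoint": 2.0}  # hypothetical weights

loss = total_loss(terms, weights)
```

Minimizing the keypoint term pulls the projected model keypoints toward the annotated ones, which is the sense in which such a loss encourages consistency between projected annotations and the annotations.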
Regarding claim 17, Zuffi in view of Kato teaches the computer-implemented method of claim 15, and Zuffi further discloses wherein the annotation is a semantic keypoint (Section 3.3: selected a set of 48 images of zebras that were not used to create the digital dataset, and we annotated them for 2D keypoints. We used this as validation set. We then selected 100 images as our test set, also avoiding zebras from the two sets above. For evaluation, we manually generated the segmentation mask for this set of images, and annotated the 2D keypoints. We mirror the images in order to double the test set data).
Regarding claim 18, Zuffi, in view of Kato, teaches the computer-implemented method of claim 12. Zuffi discloses wherein the object is non-rigid animal (Figs. 1-8; Abstract: We present the first method to perform automatic 3D pose, shape and texture capture of animals from images acquired in-the-wild. In particular, we focus on the problem of capturing 3D information about Grevy’s zebras from a collection of images. The Grevy’s zebra is one of the most endangered species in Africa, with only a few thousand individuals left. Capturing the shape and pose of these animals can provide biologists and conservationists with information about animal health and behavior. In contrast to research on human pose, shape and texture estimation, training data for endangered species is limited, the animals are in complex natural scenes with occlusion, they are naturally camouflaged, travel in herds, and look similar to each other. To overcome these challenges, we integrate the recent SMAL animal model into a network-based regression pipeline, which we train end-to-end on synthetically generated images with pose, shape, and background variation).
Regarding claim 19, Zuffi, in view of Kato, teaches the computer-implemented method of claim 12. Zuffi discloses wherein the steps of predicting the 3D shape representation, predicting the texture flow, and mapping the pixels are performed on a server in a data center, or in a cloud-based computing environment to construct the 3D object, and the 3D object is streamed to a user device (Section 1: key novelty of the network is that it links the texture prediction to 3D pose and shape through a shared feature space, such that, in predicting the texture map, the network estimates model parameters for an optimal mapping between image pixels and texture map elements. In order to prevent the network from just learning average texture map colors, inspired by [13], we predict the flow between the image pixels and the texture map. We go beyond [13], however, to deal with an articulated object with a much more complex texture map containing multiple, disconnected regions).
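For illustration only, and not as part of the cited disclosure, the flow between image pixels and texture map elements quoted above may be sketched as a per-texel lookup into the input image. The sketch assumes a grayscale image, pixel-coordinate flow values, and bilinear interpolation; the names `bilinear_sample` and `texture_from_flow` are hypothetical and do not appear in Zuffi.

```python
from typing import List, Tuple

Image = List[List[float]]               # grayscale image, indexed [row][col]
Flow = List[List[Tuple[float, float]]]  # per-texel (x, y) image location


def bilinear_sample(image: Image, x: float, y: float) -> float:
    """Sample the image at continuous pixel coordinates (x = column, y = row)."""
    h, w = len(image), len(image[0])
    # Clamp so the 2x2 neighbourhood stays inside the image.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
    bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def texture_from_flow(image: Image, flow: Flow) -> Image:
    """Fill each texel by sampling the image where the (assumed) flow points."""
    return [[bilinear_sample(image, x, y) for (x, y) in row] for row in flow]


if __name__ == "__main__":
    image = [[0.0, 10.0], [20.0, 30.0]]
    flow = [[(0.0, 0.0), (1.0, 0.0)], [(0.5, 0.5), (1.0, 1.0)]]
    print(texture_from_flow(image, flow))
```

Because each texel copies colour from a predicted image location rather than regressing colour directly, a sketch of this kind reflects the stated aim of avoiding average texture map colors.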
Regarding claim 20, Zuffi, in view of Kato, teaches the computer-implemented method of claim 12. Kato discloses wherein the steps of predicting the 3D shape representation, predicting the texture flow, and mapping the pixels are performed to generate the 3D object that is used for training, testing, or certifying a second neural network that is employed in a machine, robot, or autonomous vehicle (Section 1: Implementing this ability in machines, known as single-view 3D object reconstruction and object pose estimation in computer vision, has many practical applications such as robot grasping and augmented reality).
Regarding claim 21, the limitations of this claim substantially correspond to the limitations of claim 12; thus, the claim is rejected on similar grounds.
Regarding claim 22, the limitations of this claim substantially correspond to the limitations of claim 12; thus, the claim is rejected on similar grounds.
Regarding claim 25, Zuffi, in view of Kato, teaches the computer-implemented method of claim 12. Kato discloses further comprising: predicting non-rigid motion deformations of the 3D shape representation for the first image (Fig. 3; Section 3: we propose a two-stage training method that focuses on shapes first. Fig. 3 illustrates the overview of our proposed approach. In the first step, a category-specific 3D base shape is generated by maximizing the similarity between images in a dataset and images of the shape. We use randomly sampled viewpoints and strongly limited textures. In the second step, the whole model is trained limiting generated shapes to deformations of the obtained base shape; Section 3.2.1: Instead of predicting a mesh directly, we predict shape deformations using free-form deformation… use a spatial grid of 4 × 4 × 4 vertices, and regress the difference between the original grid and a deformed grid using a neural network. In addition, we use another network to regress the relative height, width, and length of shapes. After deformation, the size of the predicted shape is scaled to fit a unit cube); and applying the non-rigid motion deformations to an identity shape to produce the 3D shape representation (Section 3.2.1: variation between generated shapes tends to be very small because exploring various shapes using only a differentiable renderer and gradient descent is difficult due to local minima. To overcome this problem, we explore and record the best shape for each input image at each training iteration. Specifically, we render images using an estimated shape, a recorded best shape, a slightly perturbed the best shape, and random shapes. Then, we compute reconstruction loss to find the best one and record it; Section 3.2.2: parameterize the 6DoF object/camera pose by azimuth and elevation as with Section 3.1.3, in-plane rotation of an object, center point of an object in 2D image coordinates, and scale of an object. We train a decoder that outputs these six parameters. We adopt multiple regressor approach used in [14]. Exploring best pose: Similarly to shape prediction, we also need to actively explore the best poses. At each training iteration, we explore and record the best pose for each input image by rendering images using estimated, recorded, random, and perturbed poses).
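For illustration only, the free-form deformation quoted from Kato Section 3.2.1 may be sketched as follows. The sketch assumes the classic Bernstein-polynomial formulation of free-form deformation over a 4 × 4 × 4 control grid spanning the unit cube; Kato does not state which basis is used, and the names `bernstein` and `free_form_deform` are hypothetical. The `offsets` argument stands in for the regressed difference between the original grid and the deformed grid.

```python
from math import comb
from typing import List, Tuple

Point = Tuple[float, float, float]
# offsets[i][j][k] is the displacement of control vertex (i, j, k).
Offsets = List[List[List[Point]]]


def bernstein(i: int, n: int, t: float) -> float:
    """Bernstein basis polynomial B_{i,n}(t) on [0, 1]."""
    return comb(n, i) * t**i * (1 - t) ** (n - i)


def free_form_deform(points: List[Point], offsets: Offsets, n: int = 3) -> List[Point]:
    """Displace base-shape points by a blend of control-vertex offsets.

    With n = 3 the grid has 4 x 4 x 4 vertices. Points are assumed to lie
    in the unit cube; zero offsets leave the base shape unchanged.
    """
    deformed = []
    for (x, y, z) in points:
        dx = dy = dz = 0.0
        for i in range(n + 1):
            bi = bernstein(i, n, x)
            for j in range(n + 1):
                bj = bernstein(j, n, y)
                for k in range(n + 1):
                    w = bi * bj * bernstein(k, n, z)
                    ox, oy, oz = offsets[i][j][k]
                    dx += w * ox
                    dy += w * oy
                    dz += w * oz
        deformed.append((x + dx, y + dy, z + dz))
    return deformed


if __name__ == "__main__":
    zero = [[[(0.0, 0.0, 0.0)] * 4 for _ in range(4)] for _ in range(4)]
    base = [(0.0, 0.0, 0.0), (0.5, 0.25, 1.0)]
    print(free_form_deform(base, zero))  # zero offsets: base shape unchanged
```

The rescaling of the deformed shape to a unit cube, which the quoted passage also describes, is omitted from this sketch.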
Regarding claim 26, Zuffi, in view of Kato, teaches the computer-implemented method of claim 12. Kato discloses further comprising: predicting non-rigid motion deformations for the images (Fig. 3; Section 3: we propose a two-stage training method that focuses on shapes first. Fig. 3 illustrates the overview of our proposed approach. In the first step, a category-specific 3D base shape is generated by maximizing the similarity between images in a dataset and images of the shape. We use randomly sampled viewpoints and strongly limited textures. In the second step, the whole model is trained limiting generated shapes to deformations of the obtained base shape; Section 3.2.1: Instead of predicting a mesh directly, we predict shape deformations using free-form deformation… use a spatial grid of 4 × 4 × 4 vertices, and regress the difference between the original grid and a deformed grid using a neural network. In addition, we use another network to regress the relative height, width, and length of shapes. After deformation, the size of the predicted shape is scaled to fit a unit cube); applying the non-rigid motion deformations to identity shapes predicted for the images to produce 3D shape representations of the object (Section 3.2.1: variation between generated shapes tends to be very small because exploring various shapes using only a differentiable renderer and gradient descent is difficult due to local minima. To overcome this problem, we explore and record the best shape for each input image at each training iteration. Specifically, we render images using an estimated shape, a recorded best shape, a slightly perturbed the best shape, and random shapes. Then, we compute reconstruction loss to find the best one and record it; Section 3.2.2: parameterize the 6DoF object/camera pose by azimuth and elevation as with Section 3.1.3, in-plane rotation of an object, center point of an object in 2D image coordinates, and scale of an object. 
We train a decoder that outputs these six parameters. We adopt multiple regressor approach used in [14]. Exploring best pose Similarly to shape prediction, we also need to actively explore the best poses. At each training iteration, we explore and record the best pose for each input image by rendering images using estimated, recorded, random, and perturbed poses) representations (Fig. 4; Section 3.1: we propose a model that generates a shape, texture, and background from random noise by minimizing the difference between the set of rendered images and the set of images in a dataset. In the following sections, we explain each component along with the additional constraints and regularization needed to obtain a meaningful shape; Section 3.2.2: parameterize the 6DoF object/camera pose by azimuth and elevation as with Section 3.1.3, in-plane rotation of an object, center point of an object in 2D image coordinates, and scale of an object. We train a decoder that outputs these six parameters. We adopt multiple regressor approach used in [14]. Exploring best pose Similarly to shape prediction, we also need to actively explore the best poses. At each training iteration, we explore and record the best pose for each input image by rendering images using estimated, recorded, random, and perturbed poses; Section 3.2.3: addition to the components described above, we employ view prior learning (VPL) [17] to reduce overfitting to the observed views. Summarily, a loss function is composed of the following four terms. (1) Reconstruction loss. Reconstructed images using the best shapes, estimated textures, the best poses, and estimated backgrounds are compared with input images. In addition, feature matching is also used. (2) Mean absolute error between estimated shapes/poses and the best shapes/poses that are recorded during training. (3) Total variation of estimated texture images for denoising. (4) VPL loss. 
To facilitate an early phase of training, at the i-th iteration, training samples are randomly selected from first to i-th data in the dataset. This makes the model see the same sample frequently in an early stage, which simplifies finding the best poses and makes the estimated poses diverse); and evaluating a loss function based on rotated differences between the identity shapes and differences between the 3D shape representations (Fig. 4; Section 3.1: we propose a model that generates a shape, texture, and background from random noise by minimizing the difference between the set of rendered images and the set of images in a dataset. In the following sections, we explain each component along with the additional constraints and regularization needed to obtain a meaningful shape; Section 3.2.2: parameterize the 6DoF object/camera pose by azimuth and elevation as with Section 3.1.3, in-plane rotation of an object, center point of an object in 2D image coordinates, and scale of an object. We train a decoder that outputs these six parameters. We adopt multiple regressor approach used in [14]. Exploring best pose Similarly to shape prediction, we also need to actively explore the best poses. At each training iteration, we explore and record the best pose for each input image by rendering images using estimated, recorded, random, and perturbed poses; Section 3.2.3: addition to the components described above, we employ view prior learning (VPL) [17] to reduce overfitting to the observed views. Summarily, a loss function is composed of the following four terms. (1) Reconstruction loss. Reconstructed images using the best shapes, estimated textures, the best poses, and estimated backgrounds are compared with input images. In addition, feature matching is also used. (2) Mean absolute error between estimated shapes/poses and the best shapes/poses that are recorded during training. (3) Total variation of estimated texture images for denoising. (4) VPL loss. 
To facilitate an early phase of training, at the i-th iteration, training samples are randomly selected from first to i-th data in the dataset. This makes the model see the same sample frequently in an early stage, which simplifies finding the best poses and makes the estimated poses diverse).
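For illustration only, the four-term loss quoted from Kato Section 3.2.3 may be sketched as a weighted sum. The weights, the function names, and the treatment of the reconstruction and VPL terms as precomputed scalars are assumptions made for this sketch and are not taken from Kato; only the mean absolute error and total variation terms are computed explicitly.

```python
from typing import List, Sequence


def mean_absolute_error(estimated: Sequence[float], best: Sequence[float]) -> float:
    """Term (2): mean absolute error between estimated and recorded-best parameters."""
    return sum(abs(e - b) for e, b in zip(estimated, best)) / len(estimated)


def total_variation(texture: List[List[float]]) -> float:
    """Term (3): sum of absolute differences between neighbouring texels."""
    rows, cols = len(texture), len(texture[0])
    horizontal = sum(
        abs(texture[r][c + 1] - texture[r][c])
        for r in range(rows) for c in range(cols - 1)
    )
    vertical = sum(
        abs(texture[r + 1][c] - texture[r][c])
        for r in range(rows - 1) for c in range(cols)
    )
    return horizontal + vertical


def total_loss(reconstruction: float, best_match: float, tv: float, vpl: float,
               weights: Sequence[float] = (1.0, 1.0, 1.0, 1.0)) -> float:
    """Weighted sum of the four terms; the weights are hypothetical placeholders."""
    terms = (reconstruction, best_match, tv, vpl)
    return sum(w * t for w, t in zip(weights, terms))


if __name__ == "__main__":
    texture = [[0.0, 1.0], [1.0, 3.0]]
    mae = mean_absolute_error([1.0, 2.0, 3.0], [1.0, 4.0, 6.0])
    print(total_loss(0.5, mae, total_variation(texture), 0.25))
```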


Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MATTHEW D SALVUCCI whose telephone number is (571)270-5748. The examiner can normally be reached M-F: 7:30-4:00PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/MATTHEW SALVUCCI/
Primary Examiner, Art Unit 2613