DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. Applicant’s election without traverse of Group I (claims 1-20) in the reply filed on 14 November 2022 is acknowledged. Claims 1-37 are pending, of which claims 21-37 are withdrawn. Claims 1-20 are rejected.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2, 4-9, 11-13, 15-17, and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over “DeepVoxels: Learning Persistent 3D Feature Embeddings” by Sitzmann et al. (cited in the IDS filed 6/29/21; hereinafter “Sitzmann”) in view of “Implicit 3D Orientation Learning for 6D Object Detection from RGB Images” by Sundermeyer et al. (cited in the IDS filed 6/29/21; hereinafter “Sundermeyer”).
As to independent claim 1, Sitzmann discloses a computer-implemented method (Abstract and Fig. 2 disclose that Sitzmann is directed to a deep learning model having an encoder-decoder based architecture for 3D scene representation and novel view synthesis, such a model requiring implementation by a computer), comprising: generating a three-dimensional image volume based on a plurality of image volumes derived from two-dimensional image data (Section 3 discloses generating a persistent 3D DeepVoxels representation of an object by integrating 3D feature volumes lifted from 2D feature maps extracted from source views Si of the object; see also Figs. 1-2); processing the three-dimensional image volume to generate image data comprising a plurality of image views of an object (Section 3 discloses that the trained rendering network processes the 3D DeepVoxels representation to generate multiple novel views of the object; see also Fig. 2).
Sitzmann does not expressly disclose using at least one of the plurality of image views of the object to estimate an object pose.
Sundermeyer, like Sitzmann, is directed to a trained deep network architecture that inputs 2D images of an object, learns a 3D representation of the object in a latent space, and renders synthetic views of the learned object with varying poses (Abstract, Section 1, and Fig. 4). In addition, Sundermeyer discloses performing pose estimation (Abstract). In particular, Sundermeyer discloses that a codebook is generated including all synthetic object views and the corresponding pose of the object therein (Section 3.5). At test time, a query image including the object is input to the trained network, the resulting code output by the network is compared with all codes from the codebook, and the rotation of the code that is the closest match to the test code is returned as a rotation estimate for the query image (Fig. 1 and Section 3.5). Sundermeyer further discloses a similar process of codebook comparison for estimating translation of the query object (Section 3.6 and Fig. 1).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Sitzmann to use the generated novel views for object pose estimation, as taught by Sundermeyer, to arrive at the claimed invention discussed above. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. The motivation for the proposed modification would have been to improve mobile robotic manipulation (Section 1 of Sundermeyer). 
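For illustration only, the codebook comparison described above with reference to Section 3.5 of Sundermeyer can be sketched as a nearest-neighbor lookup. The sketch below is not taken from either reference: the function names, the codebook layout as (code, rotation) pairs, the toy three-element codes, and the use of cosine similarity as the comparison measure are all assumptions made for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_rotation(query_code, codebook):
    """Return the rotation stored with the codebook entry closest to query_code.

    codebook is a list of (code, rotation) pairs; this layout is hypothetical.
    """
    _, best_rotation = max(
        codebook, key=lambda entry: cosine_similarity(query_code, entry[0])
    )
    return best_rotation


# Toy codebook: one code per synthetic view, each paired with that view's rotation.
codebook = [
    ([1.0, 0.0, 0.0], "rot_000"),
    ([0.0, 1.0, 0.0], "rot_090"),
    ([0.0, 0.0, 1.0], "rot_180"),
]

# Code produced for a query image (values invented for the example).
query = [0.1, 0.9, 0.2]
estimate = nearest_rotation(query, codebook)
```

Under these assumptions, the query code lies closest to the second entry, so the rotation stored with that entry is returned as the rotation estimate for the query image.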

As to claim 2, the proposed combination of Sitzmann and Sundermeyer further teaches that the plurality of image volumes is a plurality of three-dimensional feature volumes based on the two-dimensional image data (Section 3 of Sitzmann discloses 3D feature volumes lifted from 2D feature maps extracted from source views Si of the object; see also Fig. 2).

As to claim 4, the proposed combination of Sitzmann and Sundermeyer further teaches obtaining an input image comprising image data of the object, wherein the estimated object pose is for the object associated with the input image (Section 3 of Sundermeyer discloses that, at test time, a query image including the object is input to the trained network and the resulting code output by the network is compared with all codes from the codebook to estimate pose for the object in the query image). 

As to claim 5, the proposed combination of Sitzmann and Sundermeyer further teaches that generating the three-dimensional image volume comprises fusing the plurality of image volumes derived from the two-dimensional image data to provide the three-dimensional image volume (Section 3 of Sitzmann discloses that the 3D DeepVoxels representation is generated by integrating the 3D feature volumes using a recurrent fusion process, the 3D feature volumes being lifted from 2D feature maps extracted from source views Si of the object; see also Fig. 2).

As to claim 6, the proposed combination of Sitzmann and Sundermeyer further teaches that processing the three-dimensional image volume comprises transforming the three-dimensional image volume to generate the image data comprising the plurality of image views of the object (Sections 3-4 of Sitzmann disclose that the 3D DeepVoxels representation is transformed by projection, an occlusion network, and a 2D U-Net rendering network to generate the novel views of the object; see also Fig. 2).

As to claim 7, Sitzmann discloses that processing the two-dimensional image data comprises generating the plurality of image volumes based on a camera model comprising a collection of camera parameters, the collection of camera parameters comprising coordinate data of a principal point associated with the camera, and at least one of rotation or translation of the camera (Section 3.2 of Sitzmann discloses that the network architecture that generates the respective 3D feature volumes follows a perspective pinhole camera model comprising extrinsic and intrinsic camera parameters which include coordinates u and v and depth data d of voxel centers from the camera, and rotation and translation of the camera).
Sitzmann does not expressly disclose that the camera parameters include one or more focal lengths of the camera. 
Sundermeyer, like Sitzmann, is directed to a trained deep network architecture that inputs 2D images of an object, learns a 3D representation of the object in a latent space, and renders synthetic views of the learned object with varying poses, wherein the network relies on a pinhole camera model (Abstract, Sections 1 and 3.6, and Fig. 4). Sundermeyer discloses that the pinhole camera model includes focal lengths of the camera (Section 3.6).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Sitzmann to include focal lengths of the camera as parameters in the pinhole camera model, as taught by Sundermeyer, to arrive at the claimed invention discussed above. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. The motivation for the proposed modification would have been to more accurately model the camera. 
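For illustration only, a pinhole camera model parameterized by focal lengths, a principal point, and a camera rotation and translation can be sketched as follows. This is the standard textbook pinhole projection and is not reproduced from either reference; the function names and all numeric values are hypothetical.

```python
def world_to_camera(point_world, rotation, translation):
    """Apply extrinsics: p_cam = R * p_world + t (R is a 3x3 row-major matrix)."""
    return tuple(
        sum(rotation[i][j] * point_world[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def project_point(point_cam, fx, fy, cx, cy):
    """Apply intrinsics: map a camera-space point to pixel coordinates (u, v).

    fx, fy are the focal lengths in pixels; (cx, cy) is the principal point.
    """
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return fx * x / z + cx, fy * y / z + cy


# Hypothetical camera: identity rotation, placed so the scene origin is 2 units away.
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 2.0]

p_cam = world_to_camera((0.5, -0.25, 0.0), R, t)
u, v = project_point(p_cam, fx=100.0, fy=100.0, cx=64.0, cy=48.0)
```

With these values the point maps to pixel (89.0, 35.5). The same relation, applied in reverse with a known depth d, is what allows a pixel coordinate (u, v) to be associated with a 3D location.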

As to claim 8, the proposed combination of Sitzmann and Sundermeyer further teaches that generating the three-dimensional image volume comprises combining the plurality of image volumes using a recurrent neural network that sequentially integrates the plurality of image volumes to generate the three-dimensional image volume (Section 3.2 of Sitzmann discloses that the 3D feature volumes are integrated to generate the 3D DeepVoxels representation using a gated recurrent neural network architecture that integrates the 3D feature volumes incrementally and sequentially; for example, equations 2-5 show that a 3D feature volume of a current timestamp lifted from a source image affects the trainable parameters of the recurrent neural network, and a 3D feature volume of a subsequent timestamp lifted from a subsequently input source image further affects the trainable parameters of the recurrent neural network).
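For illustration only, sequential integration of feature volumes by a gated recurrent unit can be sketched as follows. This is a generic GRU-style update applied independently to each voxel with scalar weights; it is not a reproduction of equations 2-5 of Sitzmann. The scalar weights (in place of learned convolutions), the tanh candidate activation, the flattened-list representation of a volume, and all names are assumptions made for this example.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def gated_update(state, observation, w=1.0, u=1.0, b=0.0):
    """One GRU-style update of a persistent volume with a newly lifted volume.

    state and observation are flattened volumes (lists of floats) of equal
    length. The same scalar weights w, u, b are reused for every gate purely
    to keep the sketch short.
    """
    new_state = []
    for h, x in zip(state, observation):
        z = sigmoid(w * x + u * h + b)                 # update gate
        r = sigmoid(w * x + u * h + b)                 # reset gate
        candidate = math.tanh(w * x + u * (r * h) + b)
        new_state.append((1.0 - z) * h + z * candidate)
    return new_state


def fuse(volumes):
    """Integrate the volumes one at a time, starting from an all-zero state."""
    state = [0.0] * len(volumes[0])
    for volume in volumes:
        state = gated_update(state, volume)
    return state


# Two hypothetical lifted feature volumes, each flattened to two voxels.
fused_one = fuse([[1.0, 0.0]])
fused_two = fuse([[1.0, 0.0], [0.5, 0.5]])
```

Each call to gated_update blends the persistent state with one more observation, so the fused result depends on every volume seen so far and on the order in which the volumes were presented.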

As to claim 9, the proposed combination of Sitzmann and Sundermeyer further teaches that processing the three-dimensional image volume comprises flattening the three-dimensional image volume to generate a two-dimensional feature grid based on at least a camera model comprising one or more camera parameters, the image data comprising the plurality of image views of the object based at least on the two-dimensional feature grid (Section 3 and Fig. 3 of Sitzmann disclose that the occlusion-aware projection operation which processes the 3D DeepVoxels representation into the novel views includes sampling the 3D DeepVoxels representation into a view volume and collapsing the view volume in the depth direction (interpreted as “flattening”) to generate a 2D feature grid used to generate the novel views; see also Fig. 2; Fig. 3 shows that the feature grid is dependent on the camera model, and Section 3 discloses that the camera model includes intrinsic and extrinsic camera parameters). 
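For illustration only, collapsing a view volume in the depth direction to obtain a 2D feature grid can be sketched as a per-pixel weighted sum over depth. The sketch is not taken from Sitzmann: the nested-list layout indexed as [depth][row][column], the assumption that per-pixel visibility weights sum to one along depth, the single feature channel, and all names and values are hypothetical.

```python
def collapse_depth(view_volume, visibility):
    """Flatten a view volume into a 2D grid by a weighted sum along depth.

    view_volume[d][y][x] holds one feature value per sample;
    visibility[d][y][x] holds the weight given to that sample.
    """
    depth = len(view_volume)
    height = len(view_volume[0])
    width = len(view_volume[0][0])
    grid = [[0.0] * width for _ in range(height)]
    for d in range(depth):
        for y in range(height):
            for x in range(width):
                grid[y][x] += visibility[d][y][x] * view_volume[d][y][x]
    return grid


# Toy view volume with 2 depth samples over a 1x2 image.
view_volume = [
    [[1.0, 2.0]],  # near depth slice
    [[3.0, 4.0]],  # far depth slice
]
visibility = [
    [[1.0, 0.25]],  # left pixel sees only the near sample
    [[0.0, 0.75]],  # right pixel mostly sees the far sample
]

feature_grid = collapse_depth(view_volume, visibility)
```

Here the left pixel keeps the near feature unchanged, while the right pixel blends the two depth samples, giving the grid [[1.0, 3.5]].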

As to claim 11, the proposed combination of Sitzmann and Sundermeyer further teaches estimating a coarse object pose using at least one of the plurality of image views of the object; and estimating the object pose based on the coarse object pose and the three-dimensional image volume (Sections 3.5-3.6 of Sundermeyer disclose estimating the rotation of the object in the test image, then further estimating the translation of the object in the test image based on the synthetic views generated based on the trained network’s 3D representation of the object in latent space; Section 3.6 further discloses refining the pose estimate using depth data).

As to claim 12, the proposed combination of Sitzmann and Sundermeyer further teaches obtaining a query image comprising image data of the object; calculating depth loss based on the image data of the query image and image data of at least one of the plurality of image views of the object; and estimating the object pose based the calculated depth loss, wherein the object pose is associated with the query image comprising the image data of the object (Sections 3.5-3.6 of Sundermeyer disclose obtaining a test image of the object and estimating a pose (translation and rotation) of the object in the test image, the object pose being of the object in the test image; Section 3.6 of Sundermeyer further discloses that the pose estimation involves pose refinement using an iterative closest point (“ICP”) approach on depth data of the provided test image; specifically, Appendix A.4 discloses projecting the depth image into a 3D point cloud, generating random points on the surface of the object model, and performing ICP to minimize the difference (interpreted as depth loss) between these point sets to arrive at the pose estimate). 
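For illustration only, the difference between a point cloud projected from a depth image and a set of points sampled from an object model can be sketched as a mean squared nearest-neighbor distance, which is the kind of residual an iterative closest point procedure repeatedly reduces. The sketch does not perform the ICP iterations themselves and is not taken from Sundermeyer; the function names, the pinhole back-projection, the treatment of zero depth as "no measurement", and all numeric values are assumptions made for this example.

```python
def backproject(depth_image, fx, fy, cx, cy):
    """Convert a depth image (rows of z values) into a list of 3D points."""
    points = []
    for v, row in enumerate(depth_image):
        for u, z in enumerate(row):
            if z > 0:  # assumption: zero depth marks a missing measurement
                points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def mean_nn_squared_distance(source, target):
    """Mean squared distance from each source point to its nearest target point."""
    total = 0.0
    for p in source:
        total += min(sum((a - b) ** 2 for a, b in zip(p, q)) for q in target)
    return total / len(source)


# Toy 2x2 depth image with two valid pixels, and two hypothetical model points.
depth_image = [
    [1.0, 0.0],
    [0.0, 2.0],
]
observed = backproject(depth_image, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
model_points = [(0.0, 0.0, 1.0), (2.0, 2.0, 2.5)]

residual = mean_nn_squared_distance(observed, model_points)
```

With these values the first observed point coincides with a model point and the second is 0.5 units away in depth, giving a residual of 0.125. A pose refinement step would adjust the rotation and translation applied to one point set so as to lower this value.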

As to independent claim 13, Sitzmann discloses a computer system comprising one or more processors and computer readable memory storing executable instructions that, as a result of being executed by the one or more processors, cause the computer system to at least (Abstract and Fig. 2 disclose that Sitzmann is directed to a deep learning model having an encoder-decoder based architecture for 3D scene representation and novel view synthesis, such processing requiring implementation by software instructions stored in memory and executed by a processor; for example, Section 5 discloses a GPU and memory): obtain input image data comprising at least a first image of an object and a second image of the object (Section 3 discloses a training corpus comprising M source views Si of an object which are input to the network architecture of Fig. 2; any two of the source views Si correspond to the claimed first and second images of the object); process the input image data to generate a first three-dimensional feature volume corresponding to the first image and a second three-dimensional feature volume corresponding to the second image (Section 3 and Fig. 2 disclose a lifting layer that lifts a 3D feature volume from each 2D feature map extracted from the respective source views Si); combine the first and second three-dimensional feature volumes to generate a combined feature volume (Section 3 and Fig. 2 disclose generating a persistent 3D DeepVoxels representation of the object by integrating the lifted 3D feature volumes using a recurrent fusion process); transform the combined feature volume to generate output image data comprising a plurality of image views of the object (Section 3 discloses that the trained rendering network processes the 3D DeepVoxels representation to generate multiple novel views of the object; see also Fig. 2).
Sitzmann does not expressly disclose that the computer system estimates an object pose based on at least one of the plurality of image views of the object.
Sundermeyer, like Sitzmann, is directed to a trained deep network architecture that inputs 2D images of an object, learns a 3D representation of the object in a latent space, and renders synthetic views of the learned object with varying poses (Abstract, Section 1, and Fig. 4). In addition, Sundermeyer discloses performing pose estimation (Abstract). In particular, Sundermeyer discloses that a codebook is generated including all synthetic object views and the corresponding pose of the object therein (Section 3.5). At test time, a query image including the object is input to the trained network, the resulting code output by the network is compared with all codes from the codebook, and the rotation of the code that is the closest match to the test code is returned as a rotation estimate for the query image (Fig. 1 and Section 3.5). Sundermeyer further discloses a similar process of codebook comparison for estimating translation of the query object (Section 3.6 and Fig. 1).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Sitzmann to use the generated novel views for object pose estimation, as taught by Sundermeyer, to arrive at the claimed invention discussed above. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. The motivation for the proposed modification would have been to improve mobile robotic manipulation (Section 1 of Sundermeyer).

As to claim 15, the proposed combination of Sitzmann and Sundermeyer further teaches that processing the image data comprises generating the first and second three-dimensional feature volumes based on a camera model comprising camera parameters, the camera parameters comprising one or more focal lengths of a camera, coordinate data of a principal point associated with the camera, or at least one of rotation or translation of the camera (Section 3.2 of Sitzmann discloses that the network architecture that generates the respective 3D feature volumes follows a perspective pinhole camera model comprising extrinsic and intrinsic camera parameters which include coordinates u and v and depth data d of voxel centers from the camera, and rotation and translation of the camera).

As to claim 16, the proposed combination of Sitzmann and Sundermeyer further teaches that combining the first and second three-dimensional feature volumes comprises fusing the first and second three-dimensional feature volumes using a recurrent neural network that sequentially integrates the first and second three-dimensional feature volumes to generate the combined feature volume (Section 3.2 of Sitzmann discloses that the 3D feature volumes are integrated to generate the 3D DeepVoxels representation using a gated recurrent neural network architecture that integrates the 3D feature volumes incrementally and sequentially; for example, equations 2-5 show that a 3D feature volume of a current timestamp lifted from a source image affects the trainable parameters of the recurrent neural network, and a 3D feature volume of a subsequent timestamp lifted from a subsequently input source image further affects the trainable parameters of the recurrent neural network; by this iterative process, the recurrent fusion is performed to integrate the 3D feature volumes and thereby generate the 3D DeepVoxels representation).

As to claim 17, the proposed combination of Sitzmann and Sundermeyer further teaches that transforming the combined feature volume comprises flattening the combined feature volume to generate a two-dimensional feature grid based on at least a camera model comprising one or more camera parameters (Section 3 and Fig. 3 of Sitzmann disclose that the occlusion-aware projection operation which processes the 3D DeepVoxels representation into the novel views includes sampling the 3D DeepVoxels representation into a view volume and collapsing the view volume in the depth direction (interpreted as “flattening”) to generate a 2D feature grid used to generate the novel views; see also Fig. 2; Fig. 3 shows that the feature grid is dependent on the camera model, and Section 3 discloses that the camera model includes intrinsic and extrinsic camera parameters).

As to claim 19, the proposed combination of Sitzmann and Sundermeyer further teaches that estimating the object pose comprises: estimating another object pose based on at least one of the plurality of image views of the object; and estimating the object pose based on the other object pose and the combined feature volume (Sections 3.5-3.6 of Sundermeyer disclose that a codebook of poses is generated for each generated synthetic view, and the object pose of the object in the test image is calculated by comparing the test object pose estimated by the network with the stored object poses for the synthetic views to estimate object pose for the object in the test image; each of the generated synthetic views and the test object pose estimated by the network are based on the 3D representation of the object in a latent space characterized by the trained network).

As to claim 20, the proposed combination of Sitzmann and Sundermeyer further teaches that estimating the object pose comprises: obtaining a query image comprising image data; calculating depth loss based on the image data of the query image and image data of at least one of the plurality of image views of the object; and estimating the object pose based the calculated depth loss, wherein the object pose is associated with the query image (Sections 3.5-3.6 of Sundermeyer disclose obtaining a test image of the object and estimating a pose (translation and rotation) of the object in the test image, the object pose being of the object in the test image; Section 3.6 of Sundermeyer further discloses that the pose estimation involves pose refinement using an iterative closest point (“ICP”) approach on depth data of the provided test image; specifically, Appendix A.4 discloses projecting the depth image into a 3D point cloud, generating random points on the surface of the object model, and performing ICP to minimize the difference (interpreted as depth loss) between these point sets to arrive at the pose estimate).

Claims 3 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Sitzmann in view of Sundermeyer and further in view of “Category-Specific Object Reconstruction from a Single Image” to Kar et al. (hereinafter “Kar”).
As to claim 3, the proposed combination of Sitzmann and Sundermeyer further teaches that the two-dimensional image data comprises a plurality of RGB images of the object (Section 3 and Fig. 2 of Sitzmann disclose that the source images Si are color images of the object).
The proposed combination of Sitzmann and Sundermeyer does not expressly disclose that the 2D image data further comprises a plurality of masks of the object. 
Kar, like Sitzmann, is directed to training a model to generate a 3D reconstruction of an object in the training images (Abstract, Figs. 1-2). Kar discloses that each training image is provided to the model along with a binary mask of the object such that all keypoints of the object lie inside its binary mask (see Section 2, equation 2, and Fig. 2). 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the proposed combination of Sitzmann and Sundermeyer to provide a mask of the object along with the color training image, as taught by Kar, to arrive at the claimed invention discussed above. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. The motivation for the proposed modification would have been to save computation resources. 

As to claim 14, the proposed combination of Sitzmann, Sundermeyer, and Kar further teaches that the input image data further comprises a first binary mask based on the first image of the object and a second binary mask based on the second image of the object (Section 3 and Fig. 2 of Sitzmann disclose that the source images Si are color images of the object; Section 2, equation 2, and Fig. 2 of Kar disclose that each training image is provided to the model along with a binary mask of the object such that all keypoints of the object lie inside its binary mask; the reasons for combining the references are the same as those discussed above in conjunction with claim 3).

Claims 10 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Sitzmann in view of Sundermeyer and further in view of “Neural Volumes: Learning Dynamic Renderable Volumes from Images” by Lombardi et al. (cited in IDS filed 6/29/21; hereinafter “Lombardi”).
As to claim 10, the proposed combination of Sitzmann and Sundermeyer further teaches that the plurality of image views of the object comprises a first image view including a first depth image data and a second image view including a second depth image data (Section 3 and Fig. 3 of Sitzmann disclose a depth map that is processed to form each of the plurality of novel views). The proposed combination of Sitzmann and Sundermeyer does not expressly disclose that the first image view includes a first mask image data and the second image view includes a second mask image data.
Lombardi, like Sitzmann, is directed to modeling an object as a 3D volume in latent space based on multiple images of the object and rendering novel views of the object based on the latent 3D volume representation (Abstract and Fig. 2). Lombardi discloses that each of the rendered novel views comprises an associated alpha mask (Fig. 2).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the proposed combination of Sitzmann and Sundermeyer to output an associated mask with each rendered novel view, as taught by Lombardi, to arrive at the claimed invention discussed above. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. The motivation for the proposed modification would have been to save computation resources.

As to claim 18, the proposed combination of Sitzmann, Sundermeyer, and Lombardi further teaches that the plurality of image views of the object comprises a first image view including first depth image data and first mask image data and a second image view including second depth image data and second mask image data (Section 3 and Fig. 3 of Sitzmann disclose a depth map that is processed to form each of the plurality of novel views; Fig. 2 of Lombardi discloses that each of the rendered novel views comprises an associated alpha mask; the reasons for combining the references are analogous to those discussed above in conjunction with claim 10).

Pertinent Art
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
“Unsupervised Geometry-Aware Representation for 3D Human Pose Estimation” by Rhodin et al. discloses a deep network architecture that learns a latent 3D representation of a human based on multiple 2D images thereof and then renders novel views of the human based on the learned latent 3D representation. 

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SEAN M CONNER whose telephone number is (571)272-1486. The examiner can normally be reached noon - 8:30 PM Monday through Thursday and Saturday.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Claire Wang, can be reached on (571) 270-1051. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/SEAN M CONNER/Primary Examiner, Art Unit 2663