Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claim limitations “depth estimation/refinement module” and “pose estimation/refinement module” have been interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because they use the generic placeholder “module” coupled with functional language “configured to” without reciting sufficient structure to achieve the function. Furthermore, the generic placeholder is not preceded by a structural modifier.
Since the claim limitations invoke 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, claims 1, 8 and 15 have been interpreted to cover the corresponding structure described in the specification that achieves the claimed function, and equivalents thereof.
A review of the specification shows that the following appears to be the corresponding structure described in the specification for the 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph limitation: the CPU and processor of paragraph 0067 as published and/or of Figure 6.
If applicant wishes to provide further explanation or dispute the examiner’s interpretation of the corresponding structure, applicant must identify the corresponding structure with reference to the specification by page and line number, and to the drawing, if any, by reference characters in response to this Office action. 
If applicant does not intend to have the claim limitations treated under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may amend the claims so that they will clearly not invoke 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, or present a sufficient showing that the claims recite sufficient structure, material, or acts for performing the claimed function to preclude application of 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph.
et seq. and Supplementary Examination Guidelines for Determining Compliance With 35 U.S.C. 112 and for Treatment of Related Issues in Patent Applications, 76 FR 7162, 7167 (Feb. 9, 2011).
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over US 2021/0004660 A1 (effective filing date July 5, 2019) to Ambrus et al., hereinafter, “Ambrus”, in view of DF-Net: Unsupervised Joint Learning of Depth and Flow using Cross-Task Consistency to Zou et al., hereinafter, “Zou”, and US 2019/0279379 A1 to Srinivasan et al., hereinafter, “Srinivasan”.
Claim 1. A computer-implemented method executed on a processor for improving geometry-based monocular structure from motion (SfM) by exploiting depth maps predicted by convolutional neural networks (CNNs), the method comprising: Ambrus [0019] teaches additional benefits of the disclosed embodiments include 3D metric reconstruction and increased understanding of scenes through monocular depth and ego-motion estimation from unlabeled images. The ability to bootstrap and learn a metric depth network from monocular camera sensors also benefits fusion stages for 3D spatial reconstruction (e.g., either from single/multi-view monocular imagery, or from both LIDAR and monocular imagery combined).
Ambrus [0032] teaches FIG. 3 shows an example network architecture for the pose module 230. In one or more embodiments the pose module 230 can be implemented in a two-stream network architecture including an appearance stream convolution neural network ("CNN") 310 for processing image data 250 and a structure stream CNN 320 for processing depth estimate data 260… Receiving two separate modalities (i.e., image and depth) allows the pose module 230 to learn both appearance and geometry features, leading to improved results. See also Ambrus [0035].
Also in the same field of depth and motion estimation in image data Zou [Abstract] teaches we present an unsupervised learning framework for simultaneously training single-view depth prediction and optical flow estimation models using unlabeled video sequences. Existing unsupervised methods often exploit brightness constancy and spatial smoothness priors to train depth or flow models. In this paper, we propose to leverage geometric consistency as additional supervisory signals.

Zou [Introduction] teaches Single-view depth prediction and optical flow estimation are two fundamental problems in computer vision. While the two tasks aim to recover highly correlated information from the scene (i.e., the scene structure and the dense motion field between consecutive frames), existing efforts typically study each problem in isolation. In this paper, we demonstrate the benefits of exploring the geometric relationship between depth, camera motion, and flow for unsupervised learning of depth and flow estimation models. With the rapid development of deep convolutional neural networks (CNNs), numerous approaches have been proposed to tackle dense prediction problems in an end-to-end manner. However, supervised training CNN for such tasks often involves in constructing large-scale, diverse datasets with dense pixelwise ground truth labels. Collecting such densely labeled datasets in real-world requires significant amounts of human efforts and is prone to error. Existing efforts of RGB-D dataset construction [18,45,53,54] often have limited scope (e.g., in terms of locations, scenes, and objects), and hence are lack of diversity. For optical flow, dense motion annotations are even more difficult to acquire [37].
Consequently, existing CNN-based methods rely on synthetic datasets for training the models [5,12,16,24]. These synthetic datasets, however, do not capture the complexity of motion blur, occlusion, and natural image statistics from real scenes.…Several work [17,21,28] have been proposed to capitalize on large-scale real-world videos to train the CNNs in the unsupervised setting. The main idea lies to exploit the brightness constancy and spatial smoothness assumptions of flow fields or disparity maps as supervisory signals. These assumptions, however, often do not hold at motion boundaries and hence makes the training unstable. Many recent efforts [59,60,65,73] explore the geometric relationship between the two problems. With the estimated depth and camera pose, these methods can produce dense optical flow by backprojecting the 3D scene flow induced from camera ego-motion. However, these methods implicitly assume perfect depth and camera pose estimation when “synthesizing” the optical flow. The errors in either depth or camera pose estimation inevitably produce inaccurate flow predictions. In this paper, we present a technique for jointly learning a single-view depth estimation model and a flow prediction model using unlabeled videos as shown in Figure 2. Our key observation is that the predictions from depth, pose, and optical flow should be consistent with each other. By exploiting this geometry cue, we present a novel cross-task consistency loss that provides additional supervisory signals for training both networks. We validate the effectiveness of the proposed approach through extensive experiments on several benchmark datasets. Experimental results show that our joint training method significantly improves the performance of both models (Figure 1). The proposed depth and flow models compare favorably with state-of-the-art unsupervised methods. We make the following contributions. 
(1) We propose an unsupervised learning framework to simultaneously train a depth prediction network and an optical flow network. We achieve this by introducing a cross-task consistency loss that enforces geometric consistency.

Zou [Structure from motion.] teaches Joint estimation of structure and camera pose from multiple images of a given scene is a long-standing problem [46,15,64]. Conventional methods can recover (semi-)dense depth estimation and camera pose through keypoint tracking/matching.
capturing a sequence of RGB images from an unlabeled monocular video stream obtained by a monocular camera; Ambrus [0007] teaches the disclosed systems and methods relate to advancing and improving the task of learning depth and ego-motion estimation from streams of unlabeled red-green-blue (RGB) images in a self-supervised regime.
Ambrus [0019] teaches Additional benefits of the disclosed embodiments include 3D metric reconstruction and increased understanding of scenes through monocular depth and ego-motion estimation from unlabeled images.
Zou [Fig. 2] teaches In contrast, our work leverages the readily available unlabeled video sequences to jointly train the depth and flow models.
feeding the RGB images into a depth estimation/refinement module; outputting depth maps; Ambrus [0035] teaches the self-supervised learning problem of the ego-motion estimation system 170 may be defined as the task of recovering the following functions (Ambrus [0036]-[0037]): (i) $f_d: I \to D$ and (ii) $f_x: (I_t, D_t, I_s, D_s) \to x_{t \to s}$, where (i) maps an RGB image $I$ to its corresponding depth $D$.
Ambrus [0038] teaches ego-motion estimation system 270 includes a two-stream network (e.g., FIG. 3) that receives RGB image data and depth information as input and fuses the input into a unified pose output. As shown in FIG. 4, using the inferred depth $\hat{D}_t$ and the estimated ego-motion, the synthesis module 240 transforms $\hat{D}_t$ into a reference frame of $I_s$ and synthesizes a predicted image $I_t$ from $I_s$ in a differentiable manner.
outputting camera poses and point clouds; Zou [Methods exploiting geometry cues.] teaches recently, a number of work exploits the geometric relationship between depth, camera pose, and flow for learning depth or flow models [60,65,68,73]. These methods first estimate the depth of the input images. Together with the estimated camera poses between two consecutive frames, these methods “synthesize” the flow field of rigid regions.
Zou Fig. 5: Sample results on KITTI raw test set. The ground truth depth is interpolated from sparse point cloud for visualization only.
and constructing a 3D map of a surrounding environment displayed on a visualization device. Ambrus [0028] teaches image data 250 can include, for example, two or more RGB monocular images (e.g., a source image I.sub.s and a temporally subsequent target image I.sub.t) captured in sequential time frames by the camera 126 and encompassing a field-of-view about the vehicle 100 of at least a portion of the surrounding environment. That is, the image data 250 is, in one approach, generally limited to a subregion of the surrounding 360 environment.
Ambrus [0081] teaches the autonomous driving module(s) 160 can be configured to receive, and/or determine location information for obstacles within the external environment of the vehicle 100 for use by the processor(s) 110 , and/or one or more of the modules described herein to estimate position and orientation of the vehicle 100, vehicle position in global coordinates based on signals from a plurality of satellites, or any other data and/or signals that could be used to determine the current state of the vehicle 100 or determine the position of the vehicle 100 with respect to its environment for use in either creating a map or determining the position of the vehicle 100 in respect to map data.
feeding the depth maps and the RGB images to a pose estimation/refinement module, the depths maps and the RGB images collectively defining pseudo RGB-D images; Ambrus [0034] teaches FIG. 4 shows an example network architecture 400 of the ego-motion estimation system 170 according to the disclosed embodiments. As described above, the depth module 220 receives a source image $I_s$ and target image $I_t$ as input and outputs a depth estimation $\hat{D}_t$ for the target image $I_t$ and a depth estimation $\hat{D}_s$ for the source image. The pose module 230 receives the source image $I_s$, target image $I_t$, depth estimation $\hat{D}_s$, and depth estimation $\hat{D}_t$ as input and outputs an ego-motion estimation in the form of a 6-DOF transformation between the source image $I_s$ and target image $I_t$.
Ambrus [0038] teaches ego-motion estimation system 270 includes a two-stream network (e.g., FIG. 3) that receives RGB image data and depth information as input and fuses the input into a unified pose output. As shown in FIG. 4, using the inferred depth $\hat{D}_t$ and the estimated ego-motion, the synthesis module 240 transforms $\hat{D}_t$ into a reference frame of $I_s$ and synthesizes a predicted image $I_t$ from $I_s$ in a differentiable manner.
Ambrus and Zou fail to explicitly teach “pseudo” RGB-D images. However, Srinivasan, in the same field of depth estimation, teaches at [0120]: In step 335, the electronic device 100 generates the dense depth map by combining or fusing the intermediate depth map with the high quality RGB image of the scene. The dense depth map generator 143 may generate the dense depth map by combining or fusing the intermediate depth map with the high quality RGB image. Examiner interprets the intermediate depth map combined with the RGB image to be pseudo RGB-D images.
Therefore, before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Ambrus with the teachings of Zou and Srinivasan to improve the results of single-view depth prediction and optical flow estimation (Zou [Abstract]) and improve the imaging characteristics of the depth map in different conditions (Srinivasan [0007]).
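For illustration, the claimed pairing of a predicted depth map with its RGB image into a pseudo RGB-D image can be sketched as follows. This is a minimal sketch under stated assumptions, not the applicant's or any cited reference's implementation: the function name `make_pseudo_rgbd` and the nested-list image layout are hypothetical choices made only for this example.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]
RGBD = Tuple[int, int, int, float]


def make_pseudo_rgbd(rgb: List[List[RGB]], depth: List[List[float]]) -> List[List[RGBD]]:
    """Attach a predicted depth value to every RGB pixel (hypothetical helper).

    rgb   -- H x W grid of (r, g, b) tuples from the monocular camera
    depth -- H x W grid of depth values predicted for the same frame
    """
    if len(rgb) != len(depth):
        raise ValueError("RGB image and depth map must have the same height")
    rgbd = []
    for rgb_row, depth_row in zip(rgb, depth):
        if len(rgb_row) != len(depth_row):
            raise ValueError("RGB image and depth map must have the same width")
        # Each output pixel carries color plus network-predicted ("pseudo") depth.
        rgbd.append([(r, g, b, d) for (r, g, b), d in zip(rgb_row, depth_row)])
    return rgbd
```

The depth channel here is "pseudo" in the sense that it comes from a prediction rather than from a depth sensor.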
Claim 2. Ambrus and Zou further teach wherein common tracked keypoints from neighboring keyframes are employed. Ambrus [0052] teaches at operation 520, the ego-motion estimation system 170 obtains image data 250. For example, in one or more embodiments the camera 126 of vehicle 100 captures two or more images in adjacent, sequential time frames, e.g., a source image $I_s$ and a target image $I_t$.
Zou [Structure from motion.] teaches Joint estimation of structure and camera pose from multiple images of a given scene is a long-standing problem [46,15,64]. Conventional methods can recover (semi-)dense depth estimation and camera pose through keypoint tracking/matching.
Claim 3. Zou further teaches wherein a symmetric depth transfer loss and a depth consistency loss are imposed. Zou [3.1 Method Overview] teaches hence, we propose a cross-task consistency loss to enforce this constraint (Section 3.5). Our overall objective function can be formulated as follows: $L = L_{\text{photometric}} + \lambda_s L_{\text{smooth}} + \lambda_f L_{\text{forward-backward}} + \lambda_c L_{\text{cross}}$ (1). All of the four loss terms are applied to both depth and flow networks. Also, all of the four loss terms are symmetric for forward and backward directions, for simplicity we only derive them for the forward direction.
Claim 4. Zou further teaches wherein the symmetric depth transfer loss is given as: $(w) = |d^i_{c \to k_1}(w) - d^i_{k_1}(w)| + |d^i_{k_1 \to c}(w) - d^i_c(w)|$, where $d^i_{k_1}(w)$ and $d^i_c(w)$ are the depth values from the depth network, $d^i_{c \to k_1}(w)$ and $d^i_{k_1 \to c}(w)$ are the transferred depth values, $k_1$ and $k_2$ are two neighboring keyframes of a current frame $c$, and $w$ represents the depth network parameters. Zou [Fig. 3, Overview of our unsupervised joint learning framework] teaches our framework consists of three major modules: (1) a Depth Net for single-view depth estimation; (2) a Pose Net that takes two stacked input frames and estimates the relative camera pose between the two input frames; and (3) a Flow Net that estimates dense optical flow field between the two input frames. Given a pair of input images $I_t$ and $I_{t+1}$ sampled from an unlabeled video, we first estimate the depth of each frame, the 6D camera pose, and the dense forward and backward flows. Using the predicted scene depth and the estimated camera pose, we can synthesize 2D forward and backward optical flows (referred as rigid flow) by backprojecting the induced 3D forward and backward scene flows (Section 3.2). As we do not have ground truth depth and flow maps for supervision, we leverage standard photometric and spatial smoothness costs to regularize the network training (Section 3.3, not shown in this figure for clarity). To enforce the consistency of flow and depth prediction in both directions, we exploit the forward-backward consistency (Section 3.4), and adopt the valid masks derived from it to filter out invalid regions (e.g., occlusion/dis-occlusion) for the photometric loss. Finally, we propose a novel cross-network consistency loss (Section 3.5) — encouraging the optical flow estimation (from the Flow Net) and the rigid flow (from the Depth and Pose Net) to be consistent to each other within in valid regions.
Zou [3.1 Method Overview] teaches Our overall objective function can be formulated as follows: $L = L_{\text{photometric}} + \lambda_s L_{\text{smooth}} + \lambda_f L_{\text{forward-backward}} + \lambda_c L_{\text{cross}}$ (1). All of the four loss terms are applied to both depth and flow networks. Also, all of the four loss terms are symmetric for forward and backward directions, for simplicity we only derive them for the forward direction.
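As an aid to reading the claim 4 expression, the per-keypoint symmetric depth transfer term can be sketched as below. This is a minimal sketch assuming scalar depth values for a single keypoint $i$; the function and parameter names are hypothetical and do not come from the application or from Zou.

```python
def symmetric_depth_transfer_loss(
    d_c_to_k1: float,  # depth transferred from current frame c into keyframe k1
    d_k1: float,       # depth predicted by the depth network in keyframe k1
    d_k1_to_c: float,  # depth transferred from keyframe k1 into current frame c
    d_c: float,        # depth predicted by the depth network in current frame c
) -> float:
    """Sum of the two absolute differences recited in claim 4 (hypothetical helper)."""
    forward_term = abs(d_c_to_k1 - d_k1)   # c -> k1 direction
    backward_term = abs(d_k1_to_c - d_c)   # k1 -> c direction
    return forward_term + backward_term
```

The term is symmetric in the sense that the transfer is penalized in both directions, from $c$ to $k_1$ and from $k_1$ to $c$.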
Claim 5. Zou further teaches wherein the depth consistency loss is given as: $D_c = \sum_{i \in X} |d^i_c(w) - d^i_c(\mathrm{SLAM})|$, where $X$ represents a set of common tracked keypoints, $c$ is a current frame, $w$ is a depth network parameter, $d^i_c(w)$ is a depth value from the depth network, and $d^i_c(\mathrm{SLAM})$ is a depth value from SLAM. Zou [3.1 Method Overview] teaches Our overall objective function can be formulated as follows: $L = L_{\text{photometric}} + \lambda_s L_{\text{smooth}} + \lambda_f L_{\text{forward-backward}} + \lambda_c L_{\text{cross}}$ (1). All of the four loss terms are applied to both depth and flow networks. Also, all of the four loss terms are symmetric for forward and backward directions, for simplicity we only derive them for the forward direction.
Zou [Fig. 3, Overview of our unsupervised joint learning framework] teaches our framework consists of three major modules: (1) a Depth Net for single-view depth estimation; (2) a Pose Net that takes two stacked input frames and estimates the relative camera pose between the two input frames; and (3) a Flow Net that estimates dense optical flow field between the two input frames. Given a pair of input images $I_t$ and $I_{t+1}$ sampled from an unlabeled video, we first estimate the depth of each frame, the 6D camera pose, and the dense forward and backward flows. Using the predicted scene depth and the estimated camera pose, we can synthesize 2D forward and backward optical flows (referred as rigid flow) by backprojecting the induced 3D forward and backward scene flows (Section 3.2). As we do not have ground truth depth and flow maps for supervision, we leverage standard photometric and spatial smoothness costs to regularize the network training (Section 3.3, not shown in this figure for clarity). To enforce the consistency of flow and depth prediction in both directions, we exploit the forward-backward consistency (Section 3.4), and adopt the valid masks derived from it to filter out invalid regions (e.g., occlusion/dis-occlusion) for the photometric loss. Finally, we propose a novel cross-network consistency loss (Section 3.5) — encouraging the optical flow estimation (from the Flow Net) and the rigid flow (from the Depth and Pose Net) to be consistent to each other within in valid regions.

Zou [Structure from motion.] teaches Joint estimation of structure and camera pose from multiple images of a given scene is a long-standing problem [46,15,64]. Conventional methods can recover (semi-)dense depth estimation and camera pose through keypoint tracking/matching.

See also Zou [3.2 Flow synthesis using depth and pose predictions].
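The depth consistency loss recited in claim 5 compares, over the common tracked keypoints, the network-predicted depth against the SLAM depth. A minimal sketch follows, assuming depths are stored in dictionaries keyed by a keypoint identifier and assuming an absolute difference per keypoint; both are assumptions made for this example, and the names are hypothetical.

```python
from typing import Dict


def depth_consistency_loss(net_depths: Dict[int, float], slam_depths: Dict[int, float]) -> float:
    """Sum |d_c^i(w) - d_c^i(SLAM)| over keypoints present in both inputs (hypothetical helper).

    net_depths  -- keypoint id -> depth predicted by the depth network for frame c
    slam_depths -- keypoint id -> depth recovered by SLAM for frame c
    """
    # Only keypoints tracked by both sources form the common set X.
    common_keypoints = net_depths.keys() & slam_depths.keys()
    return sum(abs(net_depths[i] - slam_depths[i]) for i in common_keypoints)
```

Keypoints that appear in only one of the two sources contribute nothing, which mirrors the restriction of the sum to the set of common tracked keypoints.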
Claim 6. Zou further teaches wherein a photometric reconstruction loss between a synthesized frame and a current frame is given as: $= pe(I_{c+1 \to c}(d_c(w), T^{SLAM}_{c+1 \to c}, K), I_c) + pe(I_{c-1 \to c}(d_c(w), T^{SLAM}_{c-1 \to c}, K), I_c)$, where $I_c$ is a current keyframe, $I_{c-1}$ and $I_{c+1}$ are adjacent frames, $K$ is a camera intrinsic matrix, $w$ is the depth network parameter, $d_c(w)$ is a network-predicted depth value, and $T^{SLAM}_{c-1 \to c}$ and $T^{SLAM}_{c+1 \to c}$ represent relative camera poses between two frames. Zou [pages 6-8, Sections 3.3 and 3.4].
Claim 7. Zou further teaches wherein a total loss is computed as a weighted sum of the symmetric depth transfer loss, the depth consistency loss, and the photometric reconstruction loss. Zou [3.1 Method Overview] teaches Our overall objective function can be formulated as follows: $L = L_{\text{photometric}} + \lambda_s L_{\text{smooth}} + \lambda_f L_{\text{forward-backward}} + \lambda_c L_{\text{cross}}$ (1). All of the four loss terms are applied to both depth and flow networks. Also, all of the four loss terms are symmetric for forward and backward directions, for simplicity we only derive them for the forward direction.
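The weighted-sum structure recited in claim 7 has the same form as Zou's Equation (1), where each loss term is scaled by a coefficient before summation. A minimal sketch is given below; the function name, the argument order, and the default weights are hypothetical, since neither the claim nor the cited passage fixes particular weight values.

```python
from typing import Tuple


def total_loss(
    transfer: float,      # symmetric depth transfer loss
    consistency: float,   # depth consistency loss
    photometric: float,   # photometric reconstruction loss
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),  # hypothetical weights
) -> float:
    """Weighted sum of the three loss terms named in claim 7 (hypothetical helper)."""
    w_transfer, w_consistency, w_photometric = weights
    return (
        w_transfer * transfer
        + w_consistency * consistency
        + w_photometric * photometric
    )
```

For example, `total_loss(0.75, 1.5, 2.0, weights=(0.5, 1.0, 0.25))` evaluates to 0.375 + 1.5 + 0.5 = 2.375.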
Claim 8. It differs from claim 1 in that it is a non-transitory computer-readable storage medium comprising a computer-readable program for improving geometry-based monocular structure from motion (SfM) by exploiting depth maps predicted by convolutional neural networks (CNNs), wherein the computer-readable program when executed on a computer causes the computer to perform the steps of the method of claim 1. Therefore claim 8 has been analyzed and reviewed in the same way as claim 1. See the above analysis. 
Claim 9. It differs from claim 2 in that it is a non-transitory computer-readable storage medium comprising a computer-readable program for improving geometry-based monocular structure from motion (SfM) by exploiting depth maps predicted by convolutional neural networks (CNNs), wherein the computer-readable program when executed on a computer causes the computer to perform the steps of the method of claim 2. Therefore claim 9 has been analyzed and reviewed in the same way as claim 2. See the above analysis. 
Claim 10. It differs from claim 3 in that it is a non-transitory computer-readable storage medium comprising a computer-readable program for improving geometry-based monocular structure from motion (SfM) by exploiting depth maps predicted by convolutional neural networks (CNNs), wherein the computer-readable program when executed on a computer causes the computer to perform the steps of the method of claim 3. Therefore claim 10 has been analyzed and reviewed in the same way as claim 3. See the above analysis. 
Claim 11. It differs from claim 4 in that it is a non-transitory computer-readable storage medium comprising a computer-readable program for improving geometry-based monocular structure from motion (SfM) by exploiting depth maps predicted by convolutional neural networks (CNNs), wherein the computer-readable program when executed on a computer causes the computer to perform the steps of the method of claim 4. Therefore claim 11 has been analyzed and reviewed in the same way as claim 4. See the above analysis. 
Claim 12. It differs from claim 5 in that it is a non-transitory computer-readable storage medium comprising a computer-readable program for improving geometry-based monocular structure from motion (SfM) by exploiting depth maps predicted by convolutional neural networks (CNNs), wherein the computer-readable program when executed on a computer causes the computer to perform the steps of the method of claim 5. Therefore claim 12 has been analyzed and reviewed in the same way as claim 5. See the above analysis. 
Claim 13. It differs from claim 6 in that it is a non-transitory computer-readable storage medium comprising a computer-readable program for improving geometry-based monocular structure from motion (SfM) by exploiting depth maps predicted by convolutional neural networks (CNNs), wherein the computer-readable program when executed on a computer causes the computer to perform the steps of the method of claim 6. Therefore claim 13 has been analyzed and reviewed in the same way as claim 6. See the above analysis.
Claim 14. It differs from claim 7 in that it is a non-transitory computer-readable storage medium comprising a computer-readable program for improving geometry-based monocular structure from motion (SfM) by exploiting depth maps predicted by convolutional neural networks (CNNs), wherein the computer-readable program when executed on a computer causes the computer to perform the steps of the method of claim 7. Therefore claim 14 has been analyzed and reviewed in the same way as claim 7. See the above analysis. 
Claim 15. It differs from claim 1 in that it is a system for improving geometry-based monocular structure from motion (SfM) by exploiting depth maps predicted by convolutional neural networks (CNNs), the system to perform the steps of the method of claim 1. Therefore claim 15 has been analyzed and reviewed in the same way as claim 1. See the above analysis. 
Claim 16. It differs from claim 2 in that it is a system for improving geometry-based monocular structure from motion (SfM) by exploiting depth maps predicted by convolutional neural networks (CNNs), the system to perform the steps of the method of claim 2. Therefore claim 16 has been analyzed and reviewed in the same way as claim 2. See the above analysis. 
Claim 17. It differs from claim 3 in that it is a system for improving geometry-based monocular structure from motion (SfM) by exploiting depth maps predicted by convolutional neural networks (CNNs), the system to perform the steps of the method of claim 3. Therefore claim 17 has been analyzed and reviewed in the same way as claim 3. See the above analysis. 
Claim 18. It differs from claim 4 in that it is a system for improving geometry-based monocular structure from motion (SfM) by exploiting depth maps predicted by convolutional neural networks (CNNs), the system to perform the steps of the method of claim 4. Therefore claim 18 has been analyzed and reviewed in the same way as claim 4. See the above analysis. 
Claim 19. It differs from claim 5 in that it is a system for improving geometry-based monocular structure from motion (SfM) by exploiting depth maps predicted by convolutional neural networks (CNNs), the system to perform the steps of the method of claim 5. Therefore claim 19 has been analyzed and reviewed in the same way as claim 5. See the above analysis. 
Claim 20. It differs from claim 6 in that it is a system for improving geometry-based monocular structure from motion (SfM) by exploiting depth maps predicted by convolutional neural networks (CNNs), the system to perform the steps of the method of claim 6. Therefore claim 20 has been analyzed and reviewed in the same way as claim 6. See the above analysis. 
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: US 2019/0139179 A1 to Wang et al., hereinafter, “Wang”, and US 2010/0111370 A1 to Black et al., hereinafter, “Black”.
Wang [0009] teaches FIG. 4 is a flowchart for training a model on a set of unlabeled images to predict depths in an image in accordance with various embodiments of the present disclosure.
Wang [0059] teaches Embodiments use smoothness over neighboring normal values to provide higher order interaction between pixels. Formally, the smoothness for normals may have the same form as .sub.s in Eq. 3 for depth, while the first order gradient may be applied, i.e., .sub.s(N, 1). In embodiments, matching corresponding pixels between frames is used to find the correct geometry.
Wang [0064] teaches FIG. 4 is a flowchart for training a model on a set of unlabeled images to predict depths
Wang [0084] teaches it is noted that embodiments of the present invention may be trained on any frame sequence captured with a camera, e.g., a monocular camera. Certain embodiments are evaluated on known datasets, e.g., datasets comprising raw data comprising RGB and/or gray-scale videos captured by stereo cameras from different scenes and having a known image size.
Black [0057] teaches for image data, a standard image error function is implemented by projecting the 3D body model onto the camera image plane.
Black [0074] teaches camera calibration defines the transformation from any 3D world point $X = [x, y, z]^T$ to a 2D image position $U = [u, v]^T$ on an image sensor. Given the correct full calibration for a camera in its environment, the exact projection of any point in the world on the camera's sensor can be predicted (with the caveat that some 3D points may not be in the frustum of the sensor). Practically, calibration encodes both extrinsic parameters (the position/rotation of the camera in the world coordinate system)
Black [0104] teaches it is desirable to have a single world coordinate frame that relates all the camera views with consistent extrinsic parameters between views. Unlike the patch detection step 304, where the correspondence of a detected quadrilateral with the checkerboard was established arbitrarily, here we need to search for the correct correspondence in each camera view.
Black [0184-0187] teaches 3D measurements simplifies the matching problem with a 3D body model. These measurements may consist of point clouds or polygonal meshes, and optionally contain color information or surface orientation… where the target shape is represented as a mesh or an oriented point cloud, the compatibility criterion also safeguards against front-facing surfaces being matched to back-facing surfaces, measured in terms of the angle between the surface normals. Two points are considered incompatible if their normals are significantly apart… The weight wv is used to account for holes in the target shape, particularly in the case of partial scans or depth maps that only provide a partial view of the body shape
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DELOMIA L GILLIARD whose telephone number is (571)272-1681. The examiner can normally be reached 8am-5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Vincent Rudolph can be reached on 571 272-8243. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/DELOMIA L GILLIARD/Primary Examiner, Art Unit 2661