Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claim(s) 1, 3, 8, 9, 14, 15, 18 and 20 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Competitive Collaboration: Joint Unsupervised Learning of Depth, Camera Motion, Optical Flow and Motion Segmentation to Ranjan et al., hereinafter, “Ranjan”.
Claim 1. A computer-implemented method comprising: obtaining a reference image and a target image each representing an environment containing a moving feature and a static feature, wherein the reference image has been captured by a camera at a first time and the target image has been captured by the camera at a second time different from the first time; Ranjan [Abstract] teaches we address the unsupervised learning of several interconnected problems in low-level vision: single view depth prediction, camera motion estimation, optical flow, and segmentation of a video into the static scene and moving regions… Collaboration works much like expectation-maximization, but with neural networks that act as both competitors to explain pixels that correspond to static or moving regions, and as collaborators through a moderator that assigns pixels to be either static or independently moving. Our novel method integrates all these problems in a common framework and simultaneously reasons about the segmentation of the scene into moving objects and the static background, the camera motion, depth of the static scene structure, and the optical flow of moving objects.

Ranjan [page 4] teaches consider an image sequence I-; I; I+ with target frame I and temporally neighboring reference frames I-; I+. In general, we can have many neighboring frames. In our implementation, we use 5-frame sequences for C_ and M_ but for simplicity use 3 frames to describe our approach. We estimate the depth of the target frame as equation (3). We estimate the camera motion, e, of each of the reference frames I-; I+ w.r.t. the target frame I as equation (4). Similarly, we estimate the segmentation of the target image into the static scene and moving regions. The optical flow of the static scene is defined only by the camera motion and depth. This generally refers to the structure of the scene. The moving regions have independent motion w.r.t. the scene. The segmentation masks corresponding to each pair of target and reference image are given by…equation (5)

determining an object mask configured to (i) mask out the moving feature in the target image and (ii) preserve the static feature in the target image; Ranjan [Figure 1] teaches Unsupervised Learning of Depth, Camera Motion, Optical Flow and Motion Segmentation. Left, top to bottom: sample image, soft masks representing motion segmentation, estimated depth map.

Ranjan [Figure 2] teaches the motion segmentation network, M, masks out static scene pixels from F to produce composite optical flow over the full image.

Ranjan [pages 3-4] teaches in the context of jointly learning depth, camera motion, optical flow and motion segmentation, the first player R = (D;C) consists of the depth and camera motion networks that reason about the static regions in the scene. The second player F is the optical flow network that reasons about the moving regions… In the rest of this section, we formulate the joint unsupervised estimation of depth, camera motion, optical flow and motion segmentation within this framework.

Ranjan [page 4] teaches the moving regions have independent motion w.r.t. the scene. The segmentation masks corresponding to each pair of target and reference image are given by…

Ranjan [Figure 4] teaches Visual results. Top to bottom: Sample image, estimated depth, soft consensus masks, motion segmented optical flow and combined optical flow.

determining, based on one or more of the reference image or the target image, a static depth image that represents depth values of the static features in the target image; Ranjan [Figures 1, 2 and 4] equation 2

and generating, using a machine learning (ML) model and based on (i) the static depth image, (ii) the object mask, and (iii) one or more of the target image or the reference image, a dynamic depth image that represents depth values of both the static features and the moving features in the target image. Ranjan [Abstract] teaches our novel method integrates all these problems in a common framework and simultaneously reasons about the segmentation of the scene into moving objects and the static background, the camera motion, depth of the static scene structure, and the optical flow of moving objects. See also Ranjan [Figure 3].

Ranjan [Figure 1] teaches Unsupervised Learning of Depth, Camera Motion, Optical Flow and Motion Segmentation. Left, top to bottom: sample image, soft masks representing motion segmentation, estimated depth map. Right, top to bottom: static scene optical flow, segmented flow in the moving regions and combined optical flow.

Ranjan [Figure 2] teaches The network R = (D;C) reasons about the scene by estimating optical flow over static regions using depth, D, and camera motion, C. The optical flow network F estimates flow over the whole image. The motion segmentation network, M, masks out static scene pixels from F to produce composite optical flow over the full image. A loss, E, using the composite flow is applied over neighboring frames to train all these models jointly.

Ranjan [Introduction] teaches deep learning methods have achieved state-of-the-art results on computer vision problems with supervision using large amounts of data

Ranjan [Figure 4] teaches Visual results. Top to bottom: Sample image, estimated depth, soft consensus masks, motion segmented optical flow and combined optical flow.

Ranjan [page 4] teaches Loss. We learn the parameters of the networks {….} by jointly minimizing the energy…equation 7, where {…} are the weights on the respective energy terms. The terms…- the loss EM minimizes the cross entropy, H, between the masks and a unit tensor regulated by… Ranjan [page 5] teaches the loss in Eq. (7) is formulated to minimize the reconstruction error of the neighboring frames. Two competitors, the static scene reconstructor R = (D_;C_) and moving region reconstructor F minimize this loss. The reconstructor R reasons about the static scene using Eq. (8) and the reconstructor F reasons about the moving regions using Eq. (9). The moderation is achieved by the mask network, Mx using Eq. (11). Furthermore, the collaboration between R; F is driven using Eq. (12) to train the network Mx.

Claim 3. Ranjan also teaches wherein the object mask comprises a binary image that assigns a first value to a region of the target image that contains the moving feature and a second value to a region of the target image that contains the static feature. Ranjan [Figure 2] teaches the network R = (D;C) reasons about the scene by estimating optical flow over static regions using depth, D, and camera motion, C. The optical flow network F estimates flow over the whole image. The motion segmentation network, M, masks out static scene pixels from F to produce composite optical flow over the full image. A loss, E, using the composite flow is applied over neighboring frames to train all these models jointly.
Ranjan [page 4] teaches Similarly, we estimate the segmentation of the target image into the static scene and moving regions. The optical flow of the static scene is defined only by the camera motion and depth. This generally refers to the structure of the scene. The moving regions have independent motion w.r.t. the scene. The segmentation masks corresponding to each pair of target and reference image are given by…equation (5)
Claim 8. Ranjan also teaches wherein determining the object mask comprises processing the target image by way of an object instance segmentation algorithm configured to identify the moving feature within the target image and generate a mask region representing the moving feature. Ranjan [Abstract] teaches our novel method integrates all these problems in a common framework and simultaneously reasons about the segmentation of the scene into moving objects and the static background, [Figure 1]

Ranjan [Idea, page 2] teaches Motion segmentation classifies a scene into static and moving regions.

Claim 9. Ranjan also teaches wherein determining the static depth image comprises: determining an optical flow image based on the reference image and the target image; determining a camera pose associated with the target image; and determining a motion parallax depth image that represents depth values of both the static feature and the moving feature in the target image based on the optical flow image and the camera pose. Ranjan [Figures 1, 2 and 4] equation 2

Claim 14. Ranjan also teaches further comprising: inserting into the target image a visual representation of an object at a selected position within the environment; determining, based on the dynamic depth image and the selected position, an occlusion between the visual representation of the object and at least one feature of the target image; and rendering the target image to indicate the object, the at least one feature, and the occlusion therebetween.  Ranjan [Figure 2] teaches the network R = (D,C) reasons about the scene by estimating optical flow over static regions using depth, D, and camera motion, C. The optical flow network F estimates flow over the whole image. The motion segmentation network, M, masks out static scene pixels from F to produce composite optical flow over the full image. A loss, E, using the composite flow is applied over neighboring frames to train all these models jointly.
Claim 15. Ranjan also teaches wherein the reference image and the target image form part of a video, and wherein the method further comprises: removing from the target image a visual representation of the moving feature; and inpainting, based on other image frames within the video and the dynamic depth image, portions of the environment within the target image that, prior to removal of the moving feature, were occluded by the moving feature and have been exposed by removal of the moving feature. Ranjan [Competitive Collaboration] teaches in the context of jointly learning depth, camera motion, optical flow and motion segmentation, the first player R = (D, C) consists of the depth and camera motion networks that reason about the static regions in the scene. The second player F is the optical flow network that reasons about the moving regions. For training the competitors, the motion segmentation network M selects networks (D, C) on pixels that are static and selects F on pixels that belong to moving regions. The competition ensures that (D, C) reasons only about the static parts and prevents moving pixels from corrupting its training. Similarly, it prevents any static pixels from appearing in the training loss of F, thereby improving its performance in the moving regions. In the second phase of the training cycle, the competitors (D, C) and F now collaborate to reason about static scene and moving regions by forming a consensus that is used as a loss for training the moderator, M. In the rest of this section, we formulate the joint unsupervised estimation of depth, camera motion, optical flow and motion segmentation within this framework.
Ranjan [Figures 1, 2 and 4]

Claim 18. It differs from claim 1 in that it is a non-transitory computer readable storage medium having stored thereon instructions that, when executed by a computing device, cause the computing device to perform operations of the method of claim 1. Therefore claim 18 has been analyzed and reviewed in the same way as claim 1. See the above analysis.

Claim 20. It differs from claim 1 in that it is a system performing the method of claim 1. Therefore claim 20 has been analyzed and reviewed in the same way as claim 1. See the above analysis.   

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claim(s) 2, 4-6, 10-12, 16 and 19 is/are rejected under 35 U.S.C. 103 as being unpatentable over Competitive Collaboration: Joint Unsupervised Learning of Depth, Camera Motion, Optical Flow and Motion Segmentation to Ranjan et al., hereinafter, “Ranjan” in view of Self-supervised Learning for Single View Depth and Surface Normal Estimation to Zhan  et al., hereinafter, “Zhan” and Multi-view Inpainting for RGB-D Sequence to Li et al., hereinafter, “Li”.
Claim 2. Ranjan is silent on the limitations of claim 2. Zhan, in the field of predicting depth in images, teaches wherein determining the static depth image comprises: processing the one or more of the reference image or the target image by at least one of (i) a multi view stereo (MVS) algorithm, (ii) a structure from motion (SfM) algorithm, or (iii) a motion parallax algorithm. Zhan [Fig. 1] teaches our test-time setup where depths and surface normals are predicted from a single image, and ego-motion is predicted from two views. At train time, all three networks are trained in a self-supervised manner from stereo image sequence data.

Zhan [II. Related Work] teaches self-supervised learning of single-view geometry. Recent work has started to incorporate multi-view geometry based loss functions for depth regression resulting in a self-supervised learning framework for inferring depth from single image. This stream of work aims to replace the more explicit sensory-data based ground truth supervision with a good image alignment loss between different views observing the same scene (by using stereo data or monocular videos for supervision).

Li [Introduction] teaches 3D reconstruction tools [1, 2] such as structure from motion (SfM) and visual simultaneous localization and mapping (v-SLAM) serve for many purposes from path planning to scene understanding. Many of such approaches have provided RGB-D versions for the easily achievable distance information.

Li [Introduction] teaches second, we use the local homography based warping method to achieve more accurate alignments than previous work [7, 9]; as well as to prevent information loss during 2D-3D projection in the SfM based methods [10, 11]. Our third contribution is a series of inpainting methods to make use of information from multiple source frames, in which we propose an MRF based approach for combining the candidates; extend the searching region of exemplar based color image inpainting methods to multiple views and coherently inpaint the depth.

Hence the prior art includes each element claimed, although not necessarily in a single prior art reference, with the only difference between the claimed invention and the prior art being the lack of actual combination of the elements in a single prior art reference. Thus, it would have been obvious to one of ordinary skill in the art to modify the reference image and target image, each representing an environment containing moving features and static features, of Ranjan with Zhan and Li’s teaching that each of the movable training features is fixed in a respective pose while being filmed by the respective camera. One would have been motivated to perform this combination because it allows one to accurately determine depth in image data. In combination, Ranjan is not altered in that Ranjan continues to acquire a reference image and a target image each representing an environment containing moving features and static features. Zhan's teaching that each of the movable training features is fixed in a respective pose while being filmed by the respective camera performs the same as it does separately. Li continues to teach a multi view stereo (MVS) algorithm and a structure from motion (SfM) algorithm.
Therefore one of ordinary skill in the art, such as an individual working in the field of analyzing moving objects in images, could have combined the elements as claimed by known methods, and in combination, each element merely performs the same function as it does separately. It is for at least the aforementioned reasons that the Examiner has reached a conclusion of obviousness with respect to claim 2.

Claim 4. Zhan also teaches wherein the ML model has been trained using a training process comprising: obtaining a video captured by a camera moving through a training environment that contains (i) a static training feature and (ii) a movable training feature that is fixed in a respective pose while being filmed by the camera; Zhan [Fig. 1.] teaches our test-time setup where depths and surface normals are predicted from a single image, and ego-motion is predicted from two views. At traintime, all three networks are trained in a self-supervised manner from stereo image sequence data.

Zhan [Fig. 1.] teaches encouraging piece-wise smooth depth maps. [6] extended the above framework to jointly estimate depth and ego-motion using monocular videos

determining a supervised depth image of a scene represented by the video, wherein the supervised depth image is determined based on (i) a training reference image from the video that represent the scene from a first point of view and (ii) a training target image from the video that represent the scene from a second point of view different from the first point of view; and determining one or more parameters of the ML model based on the supervised depth image. Zhan [III. Framework for Joint Learning of Depths and Surface Normals] teaches we present our system which consists of three CNNs. One each for per-pixel single-view depth prediction, for single view surface normal prediction and a pose-net which takes two images – consecutive images of a monocular video – as input to predict the camera motion (vehicle’s egomotion in KITTI) between these two frames in metric units. Our system is trained in a self-supervised manner, which means no ground truth data (depths or surface normals) is required for training. Instead we use stereo sequences for training for depths and surface normal from a single image where two consecutive stereo-pairs…form a single training instance. The goal is to predict the depth map D and surface normal map ^N of ItL (the left image at time t) which we define as the reference image I for a particular training instance. At the same time we also want to predict Tt→t-1 which is the relative pose (ego-motion) between the left/right image at time t and the left/right image at time t-1… involving the scene’s depth observed by the left camera at time t and t-1 with the estimated egomotion, LDN enforces the estimated depths and normals to be consistent, LN enforces the predicted normals to face the camera and LNS is a smoothness prior which favors the predicted normals to be piece-wise smooth.
Additionally, assuming the scene is rigid, two temporal geometric consistency terms LDC and LNC enforce the estimated depths and normals at the two time instances to be consistent given the egomotion. Each of these terms are elaborated in the following sections.

Zhan [A. Enforcing Multi-View Photometric Consistency]

Claim 5. Zhan also teaches wherein determining the one or more parameters of the ML model comprises: determining a training object mask configured to (i) mask out the movable training feature in the training target image and (ii) preserve the static training feature in the training target image; determining, based on at least one of the training reference image and the training target image, a training static depth image that represents depth values of the static training feature in the training target image; and generating, using the ML model and based on (i) the training static depth image, (ii) the training object mask, and (iii) one or more of the training target image or the training reference image, a training dynamic depth image that represents depth values of both the static training feature and the movable training feature in the training target image; determining a difference between the training dynamic depth image and the supervised depth image; and adjusting the one or more parameters of the ML model based on the difference. Zhan [C. Depth-Normal Consistency], [E. Enforcing Temporal Consistency of Predicted Geometry]

Claim 6. Ranjan also teaches wherein the movable training feature comprise a first human, wherein the moving feature comprises a second human, and wherein the object mask comprises a human-shaped region. Ranjan [page 2] teaches a key reason is that the constraints applied here do not distinguish or segment objects that move independently, such as people and cars.

Claim 10. Li also teaches further comprising: determining a confidence map that corresponds to the static depth image and indicates, for each respective pixel within the static depth image, a confidence value associated with the depth value of the respective pixel, wherein the ML model is configured to generate the dynamic depth image further based on the confidence map. Li [Abstract] teaches in this work we propose a novel approach to remove undesired objects from RGB-D sequences captured with freely moving cameras, which enables static 3D reconstruction… for the left holes, we employ exemplar based multi-view inpainting method to deal with the color image and coherently use it as guidance to complete the depth correspondence. Experiments show that our approach is qualified for removing the undesired objects and inpainting the holes

Li [2.4. Depth Inpainting] teaches depth inpainting is similar to the propagation methods designed for the color images to a certain degree. Result quality may however be limited if the algorithms designed for color images are simply transplanted to the depth counterparts. Therefore popular solutions use color images as guidance to complete the holes on the depth ones. Miao et al. [19] introduce a texture assisted inpainting technique via dividing the target area into smooth and edge classes and distribute different partial differential equations (PDE) to each class. Atapour-Abarghouei et al. [20] perform semantic segmentation on the color images to get the object edges and the depth value is coherently propagated within every object. Such work targets on assigning value to each unknown pixel. In this work, however, we take the unknown as one of the existing values and only inpaint the mask left by the removed undesired objects

Li [2.4. Depth Transformation]
 
Claim 11. Li also teaches further comprising: based on the confidence map and prior to providing the static depth image as input to the ML model, removing, from the static depth image, pixels associated with corresponding confidence values that are below a threshold confidence value. Li [Abstract] teaches in this work we propose a novel approach to remove undesired objects from RGB-D sequences captured with freely moving cameras, which enables static 3D reconstruction… for the left holes, we employ exemplar based multi-view inpainting method to deal with the color image and coherently use it as guidance to complete the depth correspondence. Experiments show that our approach is qualified for removing the undesired objects and inpainting the holes

Li [Introduction] teaches It was not until the recent blossom of image semantic segmentation that demonstrates new insight on how to remove the undesired objects in 2D image level with more flexibility

Li [2.4. Depth Inpainting] teaches depth inpainting is similar to the propagation methods designed for the color images to a certain degree. Result quality may however be limited if the algorithms designed for color images are simply transplanted to the depth counterparts. Therefore popular solutions use color images as guidance to complete the holes on the depth ones. Miao et al. [19] introduce a texture assisted inpainting technique via dividing the target area into smooth and edge classes and distribute different partial differential equations (PDE) to each class. Atapour-Abarghouei et al. [20] perform semantic segmentation on the color images to get the object edges and the depth value is coherently propagated within every object. Such work targets on assigning value to each unknown pixel. In this work, however, we take the unknown as one of the existing values and only inpaint the mask left by the removed undesired objects

Claim 12. Zhan also teaches wherein determining the confidence map comprises: determining a left-right consistency between (i) a forward optical flow field and (ii) a backward optical flow field, each determined based on the target image and the reference image; determining an extent to which the forward optical flow field complies with an epipolar constraint of the reference image and the target image; determining an extent of parallax between respective portions of the target image and the reference image; and determining the confidence map based on (i) the left-right consistency, (ii) the extent to which the forward optical flow field complies with the epipolar constraint, and (iii) the extent of parallax. Zhan [Introduction] teaches we introduce a depth and normal consistency term over time and penalize the inconsistent depth and normal predictions for two consecutive frames of the video sequences during training.
Zhan [III. Framework for Joint Learning of Depths and Surface Normals] teaches we present our system which consists of three CNNs. One each for per-pixel single-view depth prediction, for single view surface normal prediction and a pose-net which takes two images – consecutive images of a monocular video – as input to predict the camera motion (vehicle’s egomotion in KITTI) between these two frames in metric units. Our system is trained in a self-supervised manner, which means no ground truth data (depths or surface normals) is required for training. Instead we use stereo sequences for training for depths and surface normal from a single image where two consecutive stereo-pairs…form a single training instance. The goal is to predict the depth map D and surface normal map ^N of ItL (the left image at time t) which we define as the reference image I for a particular training instance. At the same time we also want to predict Tt→t-1 which is the relative pose (ego-motion) between the left/right image at time t and the left/right image at time t-1… involving the scene’s depth observed by the left camera at time t and t-1 with the estimated egomotion, LDN enforces the estimated depths and normals to be consistent, LN enforces the predicted normals to face the camera and LNS is a smoothness prior which favors the predicted normals to be piece-wise smooth. Additionally, assuming the scene is rigid, two temporal geometric consistency terms LDC and LNC enforce the estimated depths and normals at the two time instances to be consistent given the egomotion. Each of these terms are elaborated in the following sections.
Zhan [A. Enforcing Multi-View Photometric Consistency]
Claim 16. Li also teaches wherein the reference image and the target image form part of a video, and wherein the method further comprises: determining, based on other image frames within the video and the dynamic depth image, an additional image of a first point of view of the environment different from a second point of view represented by the target image, wherein determining the additional image comprises rendering portions of the environment that are (i) represented in the other image frames, (ii) not represented in the target image, and (iii) visible from the first point of view of the additional image. Li [2.3. Multi-View based Inpainting] teaches recent research begins to show interests on using the geometric connections among different views. Baek et al. [18] present a multi-view based method to complete the user defined region by jointly inpaint the color and depth image, which takes advantages from SfM to achieve geometric registration among different views.

Claim 19. It differs from claim 10 in that it is a non-transitory computer readable storage medium having stored thereon instructions that, when executed by a computing device, cause the computing device to perform operations of the method of claim 10. Therefore claim 19 has been analyzed and reviewed in the same way as claim 10. See the above analysis.

Claim(s) 7 is/are rejected under 35 U.S.C. 103 as being unpatentable over Competitive Collaboration: Joint Unsupervised Learning of Depth, Camera Motion, Optical Flow and Motion Segmentation to Ranjan et al., hereinafter, “Ranjan” in view of US 2020/0320720 A1 to Baig et al., hereinafter, “Baig”.
Claim 7. Ranjan is silent on the limitations of claim 7. Baig, in the field of determining movement of an object in images, teaches wherein the camera is moving through the environment while capturing the reference image and the target image, wherein the static feature maintains a fixed pose within the environment between the first time and the second time, and wherein a pose of the moving feature within the environment changes between the first time and the second time. Baig [Abstract] teaches determining and displaying movement of an object in an environment using a moving camera includes identifying later environment features located in the environment in a later image, earlier environment features located in the environment in an earlier image, and earlier object features located on the object in the earlier image. The method further includes estimating object features in the later image using the earlier object features and a determined camera movement. The method further includes locating, in the later image, matched object features that are actual object features in the later image at a same location as the estimated object features. The method further includes determining that the object has moved between the earlier image and the later image if a number of matched object features does not exceed a threshold. Examiner interprets the earlier image and the later image to be the reference image and the target image, respectively.

Baig [0008] teaches a method of determining and displaying movement of an object in an environment using a moving camera. The method includes acquiring an earlier image and a later image of the environment from an image stream captured by the camera. The method further includes identifying later environment features located in the environment in the later image, earlier environment features located in the environment in the earlier image, and earlier object features located on the object in the earlier image. The method further includes determining a camera movement from the earlier image to the later image using a difference in location between the earlier environment features and the later environment features. The method further includes estimating object features in the later image using the earlier object features and the determined camera movement. The method further includes locating, in the later image, matched object features that are actual object features in the later image at a same location as the estimated object features. The method further includes determining that the object has moved between the earlier image and the later image if the number of matched object features does not exceed a threshold. The method further includes determining that the object has not moved between the earlier image and the later image if the number of matched object features exceeds the threshold.
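
For illustration only, the movement determination described in Baig [0008] may be sketched as follows. This is a minimal sketch and not Baig's implementation: the function names, the pixel tolerance, and the modeling of the determined camera movement as a pure two-dimensional image shift are assumptions made for the example. Only the comparison of the number of matched object features against a threshold is taken from the reference.

```python
from math import hypot


def count_matched_features(estimated, actual, tolerance=2.0):
    """Count estimated locations that have an actual feature within `tolerance` pixels.

    `tolerance` is a hypothetical stand-in for "at a same location".
    """
    matched = 0
    for ex, ey in estimated:
        if any(hypot(ex - ax, ey - ay) <= tolerance for ax, ay in actual):
            matched += 1
    return matched


def object_has_moved(earlier_object_features, camera_shift, later_features,
                     threshold, tolerance=2.0):
    """Return True if the object is determined to have moved between the images.

    Assumption: camera movement is approximated as a 2D shift (dx, dy) of
    image coordinates, which is a simplification chosen for this sketch.
    """
    dx, dy = camera_shift
    # Estimate where the object features would appear in the later image
    # if only the camera had moved.
    estimated = [(x + dx, y + dy) for x, y in earlier_object_features]
    matched = count_matched_features(estimated, later_features, tolerance)
    # Moved if the number of matched features does not exceed the threshold;
    # not moved if it exceeds the threshold.
    return matched <= threshold
```

Under these assumptions, an object whose features all reappear at the estimated locations yields a matched count above the threshold and is treated as not having moved, while an object whose features mostly appear elsewhere is treated as having moved.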

Baig [0011] teaches if the object has moved, the method further includes determining a pose of the object in the later image using the actual object features in the later image and the earlier object features, and updating a location of a displayed object based on the determined pose of the object.

Hence the prior art includes each element claimed, although not necessarily in a single prior art reference, with the only difference between the claimed invention and the prior art being the lack of actual combination of the elements in a single prior art reference. Thus, it would have been obvious to one of ordinary skill in the art to modify Ranjan's obtaining of a reference image and a target image, each representing an environment containing moving features and static features, with Baig's teaching that the camera is moving through the environment while capturing the reference image and the target image. One would have been motivated to perform this combination because it allows one to accurately determine depth in image data. In combination, Ranjan is not altered in that Ranjan continues to acquire a reference image and a target image each representing an environment containing moving features and static features. Baig's teaching that the camera is moving through the environment while capturing the reference image and the target image performs the same as it does separately.
Therefore, one of ordinary skill in the art, such as an individual working in the field of analyzing moving objects in images, could have combined the elements as claimed by known methods, and in combination, each element merely performs the same function as it does separately. It is for at least the aforementioned reasons that the Examiner has reached a conclusion of obviousness with respect to claim 7.

Claims 13 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Competitive Collaboration: Joint Unsupervised Learning of Depth, Camera Motion, Optical Flow and Motion Segmentation to Ranjan et al., hereinafter, “Ranjan”, in view of US 2013/0141539 A1 to Awazu et al., hereinafter, “Awazu”.
Claim 13. Ranjan is silent on the limitations of claim 13. Awazu, in the field of depth imaging, teaches further comprising: applying a focus effect to a selected feature of the target image based on the dynamic depth image. Awazu [0099] teaches as illustrated in FIG. 9A and FIG. 9B, the CPU 40 controls the degree of aperture of the diaphragm 16 in accordance with the variation of the focus distance, thereby stabilizing the stereoscopic effect at a constant level around the focus distance.

Awazu [0100] teaches as illustrated in FIG. 9A, since the stereoscopic effect becomes greater when the zoom lens is moved from the wide-angle end toward the telephoto end (in the direction of increasing the focus distance), the CPU 40 decreases the degree of aperture of the diaphragm 16 so as to decrease the stereoscopic effect. FIG. 10A and FIG. 10B are drawings explaining a relation between the degree of aperture of the diaphragm 16 and the parallax with the focus position located in back of the object.

Hence the prior art includes each element claimed, although not necessarily in a single prior art reference, with the only difference between the claimed invention and the prior art being the lack of actual combination of the elements in a single prior art reference. Thus, it would have been obvious to one of ordinary skill in the art to modify Ranjan's obtaining of a reference image and a target image, each representing an environment containing moving features and static features, with Awazu's teaching of applying a focus effect to a selected feature of the target image based on the dynamic depth image. One would have been motivated to perform this combination because it allows one to accurately determine depth in image data. In combination, Ranjan is not altered in that Ranjan continues to acquire a reference image and a target image each representing an environment containing moving features and static features. Awazu's teaching of applying a focus effect to a selected feature of the target image based on the dynamic depth image performs the same as it does separately.
Therefore, one of ordinary skill in the art, such as an individual working in the field of analyzing moving objects in images, could have combined the elements as claimed by known methods, and in combination, each element merely performs the same function as it does separately. It is for at least the aforementioned reasons that the Examiner has reached a conclusion of obviousness with respect to claim 13.

Claim 17. Awazu also teaches wherein the reference image and the target image form part of a video generated by a monoscopic camera. Ranjan [page 6] teaches Monocular Depth and Camera Motion Estimation,

and wherein the method further comprises: determining a stereo video stream based on the video and the dynamic depth image, wherein the stereo video stream comprises a left-eye video stream and a right-eye video stream. Awazu [0007] teaches when photographing an image for a three-dimensional display using a monocular 3D camera, a user can monitor how the image has been photographed by three-dimensionally displaying an image for a left-eye (referred to as a left-eye image, hereinafter) and an image for a right-eye (referred to as a right-eye image, hereinafter). 

Awazu [0013] teaches an object of the presently disclosed subject matter, which has been made in view of circumstances described above, is to provide a monocular stereoscopic imaging device capable of maintaining a stereoscopic effect of a left-eye image and a right-eye image that are three-dimensionally displayed at a substantially constant level even if a zoom lens is moved while photographing through images or moving images using a monocular 3D camera.

Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/process/file/efs/guidance/eTD-info-I.jsp.
Claim 1 and similarly recited claims 18 and 20 are rejected on the ground of nonstatutory double patenting as being unpatentable over claim 1 of U.S. Patent No. 11315274 B2. Although the claims at issue are not identical, they are not patentably distinct from each other because they are similar in scope.
Comparison of claim 1 of Application No. 17/656165 with claim 1 of US 11315274 B2, limitation by limitation:

17/656165, Claim 1: A computer-implemented method comprising: obtaining a reference image and a target image each representing an environment containing a moving feature and a static feature, wherein the reference image has been captured by a camera at a first time and the target image has been captured by the camera at a second time different from the first time;
US 11315274 B2, Claim 1: A method comprising: obtaining, by a processor, a reference image and a target image each representing an environment containing moving features and static features, wherein the reference image has been captured by a camera at a first time and the target image has been captured by the camera at a second time later than the first time;

17/656165: determining an object mask configured to (i) mask out the moving feature in the target image and (ii) preserve the static feature in the target image;
US 11315274 B2: determining, by the processor, an object mask configured to (i) mask out the moving features in the target image and (ii) preserve the static features in the target image;

17/656165: determining, based on one or more of the reference image or the target image, a static depth image that represents depth values of the static features in the target image;
US 11315274 B2: determining, by the processor and based on motion parallax between the reference image and the target image, a static depth image that represents depth values of the static features in the target image;

17/656165: and generating, using a machine learning (ML) model and based on (i) the static depth image, (ii) the object mask, and (iii) one or more of the target image or the reference image, a dynamic depth image that represents depth values of both the static features and the moving features in the target image.
US 11315274 B2: and generating, by the processor and by way of a machine learning (ML) model, a dynamic depth image that represents depth values of both the static features and the moving features in the target image


Likewise, claims 2-17 are rejected on the ground of nonstatutory double patenting as being unpatentable over claims 2-16 of U.S. Patent No. 11315274 B2.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DELOMIA L GILLIARD whose telephone number is (571)272-1681. The examiner can normally be reached 8am-5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Vincent Rudolph, can be reached at 571-272-8243. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/DELOMIA L GILLIARD/Primary Examiner, Art Unit 2661