DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claims 1-20 are pending.
Information Disclosure Statement
The information disclosure statements (IDS) submitted on 11/20/2019 and 4/30/2020 are in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statements are being considered by the examiner.
Specification
The disclosure is objected to because of the following informalities: In Para. 0087, line 3, “at leas the” should read “at least the”.  
Appropriate correction is required.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2, 5-14, and 17-20 are rejected under 35 U.S.C. 103 as being unpatentable over “Semi-Supervised Deep Learning for Monocular Depth Map Prediction” by Kuznietsov et al. in view of “DeMoN: Depth and Motion Network for Learning Monocular Stereo” by Uhrig et al.
Regarding claim 1, Kuznietsov et al. teaches, a depth system for training a depth model for monocular depth estimation (Pg. 2: first paragraph under “3. Approach”: we base our approach on supervised as well as unsupervised principles for learning single image depth map prediction; Pg. 8: first paragraph under “5. Conclusions”: we propose a novel semi-supervised deep learning approach to monocular depth map prediction; Note: semi-supervised learning incorporates a series of algorithms for a computer system to learn), comprising: one or more processors (Fig. 1: train a CNN from unsupervised and supervised depth cues; Note: training a CNN is done using a computer, in which the computer contains a processor for executing instructions from the neural network); a memory communicably coupled to the one or more processors and storing: a network module including instructions that when executed by the one or more processors (Pg. 6, second paragraph of left-hand column: to train the CNN on KITTI we use stochastic gradient descent with momentum with a learning rate of 0.01 and momentum of 0.9. We train the variants of our model for at least 15 epochs on a 6 GB NVIDIA GTX 980Ti with 6GB memory which allows for a batch size of 5; Note: the computer used to train a CNN contains a processor for executing instructions from the neural network stored in the memory of the computer) cause the one or more processors to: generate, as part of training the depth model according to a supervised training stage, a depth map from a first image of a pair of training images using the depth model, wherein the pair of training images are separate frames depicting a scene from a monocular video, and wherein at least the first image includes corresponding depth data (Abstract: novel approach to depth map prediction from monocular 1 and stereo image I2; Pg. 5: first paragraph of right-hand column: the sequences contain stereo imagery taken from a driving car in an urban scenario; Fig. 1: sparse ground-truth depth readings from a 3D sensor are used for supervised training), generate a transformation from the first image and a second image of the pair using a pose model, the transformation defining a relationship between the pair of training images; and a training module including instructions that when executed by the one or more processors cause the one or more processors to compute a supervised loss based (Pg. 3, first paragraph of right-hand column: we quantify the supervised loss in both images by projecting the ground truth laser data into each of the stereo images; Fig. 1: train a CNN from unsupervised and supervised depth cues; Note: training a CNN is done using a computer, in which the computer contains a processor for executing instructions from the neural network), at least in part, on reprojecting the depth map and the depth data onto an image space of the second image according to at least the transformation, and update the depth model and the pose model according to at least the supervised loss.
Kuznietsov et al. does not expressly disclose the following limitations underlined above: 1) generate a transformation from the first image and a second image of the pair using a pose model, the transformation defining a relationship between the pair of training images, and 2) at least in part, on reprojecting the depth map and the depth data onto an image space of the second image according to at least the transformation, and update the depth model and the pose model according to at least the supervised loss.

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include spatial transformation of 
Regarding claim 2, Kuznietsov et al. teaches, the depth system of claim 1 (Pg. 2: first paragraph under “3. Approach”: we base our approach on supervised as well as unsupervised principles for learning single image depth map prediction; Pg. 8: first paragraph under “5. Conclusions”: we propose a novel semi-supervised deep learning approach to monocular depth map prediction; Note: semi-supervised learning incorporates a series of algorithms for a computer system to learn; see claim 1 above for more details), wherein the training module includes instructions to compute the supervised loss (Pg. 3, first paragraph of right-hand column: we quantify the supervised loss in both images by projecting the ground truth laser data into each of the stereo images; Pg. 3: equation 6) including instructions to compute the supervised loss in combination with a self-supervised loss for the supervised training stage that is a second stage of training in a semi-supervised training process (Pg. 3, first paragraph under “3.1. Loss function”: we formulate a single loss function that incorporates both types of constraints that arise from supervised and unsupervised cues; Pg. 3: equation 5; Fig. 2: semi-supervised loss function), and wherein a first stage of training is self-supervised and occurs prior to the second stage (Pg. 2, second paragraph of right-hand column: our semi-supervised approach simplifies the use of unsupervised cues; Pg. 2: second paragraph of left-hand column: the use of supervised training also simplifies unsupervised learning significantly).

However, Uhrig et al. teaches, generate a transformation from the first image and a second image of the pair using a pose model, the transformation defining a relationship between the pair of training images (Fig. 1: input to the network is two successive images from a monocular camera. The network estimates the depth in the first image and the camera motion; Fig. 2: DeMoN takes an image pair as input and predicts the depth map of the first image and the relative pose of the second camera; Fig. 2: egomotion estimation r, t). Uhrig et al. also teaches, at least in part, on reprojecting the depth map and the depth data onto an image space of the second image according to at least the transformation (Abstract: a crucial component of the approach is a training loss based on spatial relative difference; Note: training loss is derived by spatial transformation and reprojecting the depth data to the original image depth data; Fig. 1: egomotion R, t; Pg. 6, second paragraph under “6.2. Error metrics”: for evaluating the camera motion estimation, we report the angle (in degrees) between the prediction and the ground truth for both the translation and the rotation; Pg. 6, first paragraph of right-hand column: we minimized the reprojection error using the ceres library), and update the depth model and the pose model (Fig. 2: DeMoN takes an image pair as input and predicts the depth map of the first image and the relative pose of the second camera) according to at 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate spatial transformation of images and reprojecting the depth map and the depth data onto an image space as taught by Uhrig et al. into the semi-supervised depth model as taught by Kuznietsov et al. in order to improve depth accuracy and minimize reprojection error (Uhrig et al., Pg. 6, first and second paragraphs of right-hand column).
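For illustration only, the reprojection limitation discussed above (carrying a pixel of the first image, with a predicted depth and with a ground truth depth, into the image space of the second image through a relative pose) can be sketched as follows. This is a minimal sketch assuming a pinhole camera; it is not taken from Kuznietsov et al., from Uhrig et al., or from the application, and every name in it (backproject, reproject_pixel, reprojected_distance_loss, the intrinsics fx, fy, cx, cy, and the pose R, t) is hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Intrinsics = Tuple[float, float, float, float]  # (fx, fy, cx, cy), assumed pinhole model


def backproject(u: float, v: float, depth: float, k: Intrinsics) -> Vec3:
    """Lift pixel (u, v) with a depth value to a 3D point in the first camera frame."""
    fx, fy, cx, cy = k
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def transform(p: Vec3, R: List[List[float]], t: Vec3) -> Vec3:
    """Apply the relative pose (rotation R, translation t) to a 3D point."""
    x, y, z = (sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))
    return (x, y, z)


def project(p: Vec3, k: Intrinsics) -> Tuple[float, float]:
    """Project a 3D point in the second camera frame onto its image plane."""
    fx, fy, cx, cy = k
    return (fx * p[0] / p[2] + cx, fy * p[1] / p[2] + cy)


def reproject_pixel(u, v, depth, R, t, k) -> Tuple[float, float]:
    """Pixel of the first image -> pixel location in the second image."""
    return project(transform(backproject(u, v, depth, k), R, t), k)


def reprojected_distance_loss(
    pixels: Sequence[Tuple[float, float]],
    pred_depths: Sequence[float],
    gt_depths: Sequence[float],
    R: List[List[float]],
    t: Vec3,
    k: Intrinsics,
) -> float:
    """Mean pixel distance between reprojected predicted and ground truth points."""
    total = 0.0
    for (u, v), d_pred, d_gt in zip(pixels, pred_depths, gt_depths):
        up, vp = reproject_pixel(u, v, d_pred, R, t, k)
        ug, vg = reproject_pixel(u, v, d_gt, R, t, k)
        total += math.hypot(up - ug, vp - vg)
    return total / len(pixels)


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
K = (100.0, 100.0, 50.0, 50.0)
# Sideways camera motion of one unit: a depth error shows up as a pixel offset.
example_loss = reprojected_distance_loss(
    [(50.0, 50.0)], [2.0], [4.0], IDENTITY, (1.0, 0.0, 0.0), K
)
```

In this sketch the loss is zero when the two images share a viewpoint, since depth along a ray does not change where the ray lands; the error only becomes observable once the transformation moves the camera.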
Regarding claim 5, Kuznietsov et al. teaches, the depth system of claim 1 (Pg. 2: first paragraph under “3. Approach”: we base our approach on supervised as well as unsupervised principles for learning single image depth map prediction; Pg. 8: first paragraph under “5. Conclusions”: we propose a novel semi-supervised deep learning approach to monocular depth map prediction; Note: semi-supervised learning incorporates a series of algorithms for a computer system to learn; see claim 1 above for more details), wherein the training module includes instructions to compute the supervised loss (Fig. 1: train a CNN from unsupervised and supervised depth cues; Note: training a CNN is done using a computer, in which the computer contains a processor for executing instructions from the neural network; Pg. 3: equation 6 for supervised loss) including instructions to apply a reprojected distance loss function to project predicted pixels and ground truth pixels onto the image space that corresponds to a contextual view of a camera when capturing the second image, wherein the predicted pixels correspond with points in the scene as identified in the depth map and observed from the contextual view, wherein the ground truth pixels correspond with the depth data for the first image (Fig. 1: for l and Ir), and wherein the training module includes instructions to compute the supervised loss using the reprojected distance loss function including instructions to compare corresponding ones of the predicted pixels with the ground truth pixels to generate a reprojected distance loss (Fig. 1: train a CNN from unsupervised and supervised depth cues; Note: training a CNN is done using a computer, in which the computer contains a processor for executing instructions from the neural network; Pg. 3: equation 6 for supervised loss; Pg. 3: first paragraph of right-hand column: we quantify the supervised loss in both images by projecting the ground truth laser data into each of the stereo images).
Kuznietsov et al. does not expressly disclose the following limitations in claim 1 from which claim 5 depends: 1) generate a transformation from the first image and a second image of the pair using a pose model, the transformation defining a relationship between the pair of training images, and 2) at least in part, on reprojecting the depth map and the depth data onto an image space of the second image according to at least the transformation, and update the depth model and the pose model according to at least the supervised loss. Kuznietsov et al. also does not expressly disclose the following limitation underlined above: including instructions to apply a reprojected distance loss function to project predicted pixels and ground truth pixels onto the image space that corresponds to a contextual view of a camera when capturing the second image, wherein the predicted pixels correspond with points in the scene as identified in the depth map and observed from the contextual view.

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate spatial transformation of images and reprojection of the depth points according to the image space of the camera view as taught by Uhrig et al. into the semi-supervised depth model as taught by Kuznietsov et al. in order to improve depth accuracy and minimize reprojection error (Uhrig et al., Pg. 6, first and second paragraphs of right-hand column).
Regarding claim 6, Kuznietsov et al. teaches, the depth system of claim 1 (Pg. 2: first paragraph under “3. Approach”: we base our approach on supervised as well as unsupervised principles for learning single image depth map prediction; Pg. 8: first paragraph under “5. Conclusions”: we propose a novel semi-supervised deep learning approach to monocular depth map prediction; Note: semi-supervised learning incorporates a series of algorithms for a computer system to learn; see claim 1 for more details), wherein the depth data includes sparse LiDAR data comprising depth information from four beams that correspond with sparse t and γ are trade-off parameters between supervised loss $L_\theta^S$, unsupervised loss $L_\theta^U$; Note: the weights are parameters (i.e. λ and γ) that can be changed; Pg. 5, paragraph under “4.1. Implementation Details”: we initialize the encoder part of our network with ResNet-50 [11] weights pretrained for ImageNet classification task. The convolution filter weights in the decoder part are initialized randomly) and the pose model using the depth data to train the depth model on scale by accounting for scale aware differences between the depth maps and the sparse LiDAR data to improve scale awareness of the depth model in producing depth estimates.
Kuznietsov et al. does not expressly disclose the following limitations in claim 1 from which claim 6 depends: 1) generate a transformation from the first image and a second image of the pair using a pose model, the transformation defining a relationship between the pair of training images, and 2) at least in part, on reprojecting the depth map and the depth data onto an image space of the second image according to at least the transformation, and update the depth model and the pose model according to at least the supervised loss. Kuznietsov et al. also does not expressly disclose the following limitation underlined above: and the pose model using the depth data to train the depth model on scale by accounting for scale aware differences between the depth maps and the sparse LiDAR data to improve scale awareness of the depth model in producing depth estimates.

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate spatial transformation of images, reprojecting the depth map and the depth data onto an image space, and scale awareness as taught by Uhrig et al. into the semi-supervised depth model as taught by Kuznietsov et al. in order to improve depth accuracy and minimize reprojection error (Uhrig et al., Pg. 6, first and second paragraphs of right-hand column).
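For illustration only, the weighting of a supervised loss against an unsupervised loss by trade-off parameters, as cited for claim 6 above, can be sketched as follows. This is a minimal sketch, not the loss of either reference: the function names are hypothetical, the absolute difference is a stand-in for whatever norm the reference actually applies, and sparse depth data is modeled by marking pixels without a LiDAR return as None.

```python
from typing import Optional, Sequence


def sparse_supervised_loss(
    pred_depth: Sequence[float], gt_depth: Sequence[Optional[float]]
) -> float:
    """Mean absolute depth error over the pixels that carry ground truth.

    Pixels whose ground truth is None (no LiDAR return) are skipped, which is
    how sparse depth data is assumed to enter the loss in this sketch.
    """
    diffs = [abs(p - g) for p, g in zip(pred_depth, gt_depth) if g is not None]
    return sum(diffs) / len(diffs) if diffs else 0.0


def semi_supervised_loss(
    supervised: float, unsupervised: float, lam: float, gamma: float
) -> float:
    """Combine both terms with trade-off weights lam and gamma (hypothetical names)."""
    return lam * supervised + gamma * unsupervised


# Three predicted depths; only the last two pixels have a LiDAR measurement.
sup = sparse_supervised_loss([2.0, 4.0, 6.0], [None, 5.0, 4.0])
total = semi_supervised_loss(sup, unsupervised=4.0, lam=2.0, gamma=0.5)
```

Because the supervised term is expressed in metric depth units, giving it nonzero weight is what ties the predicted depths to the scale of the depth data in this sketch.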
Regarding claim 7, Kuznietsov et al. teaches, the depth system of claim 1 (Pg. 2: first paragraph under “3. Approach”: we base our approach on supervised as well as unsupervised principles for learning single image depth map prediction; Pg. 8: first paragraph under “5. Conclusions”: we propose a novel semi-supervised deep learning approach to monocular depth map prediction; Note: semi-supervised learning incorporates a series of algorithms for a computer system to learn; see claim 1 above for more details), wherein the network module includes instructions to provide the depth model to infer distances from monocular images in a device (Fig. 1: train a CNN from unsupervised and supervised depth cues; Note: training a CNN is done using a computer, in which the computer contains a processor for executing instructions from the neural network; Abstract: novel approach to depth map prediction from monocular images that learns in a semi-supervised way; Pg. 1, first paragraph under “1. Introduction”: supervised deep learning approaches have demonstrated promising results for Ir) and the warped image (warped Il) based on the prediction of the CNN; Pg. 4, equation 9 for unsupervised loss; Note: photometric loss can include self-supervised loss which is part of the unsupervised loss category)  by generating a synthesized version of the first image using the depth map and the transformation, and calculate the photometric loss according to a comparison of the synthesized version with the first image (Fig. 2: semi-supervised loss function showing the difference between the input image (i.e. Ir) and the warped image (warped Il) based on the prediction of the CNN; Pg. 4, equation 9 for unsupervised loss; Note: photometric loss can include self-supervised loss which is part of the unsupervised loss category; Pg. 2, second paragraph of right-hand column: the loss quantifies the photometric error of the input image warped into its corresponding stereo image using the predicted depth).
Kuznietsov et al. does not expressly disclose the following limitations in claim 1 from which claim 7 depends: 1) generate a transformation from the first image and a second image of the pair using a pose model, the transformation defining a relationship between the pair of training images, and 2) at least in part, on reprojecting the depth map and the depth data onto an image space of the second image according to at least the transformation, and update the depth model and the pose model according to at least the supervised loss. Kuznietsov et al. also does not expressly disclose the following limitation underlined above: by generating a synthesized version of the first image using the depth map and the transformation.

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate spatial transformation of images and reprojecting the depth map and the depth data onto an image space as taught by Uhrig et al. into the semi-supervised depth model as taught by Kuznietsov et al. in order to improve depth accuracy and minimize reprojection error (Uhrig et al., Pg. 6, first and second paragraphs of right-hand column).
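For illustration only, the photometric loss discussed for claim 7 above (synthesizing a version of one image by warping the other, then comparing the synthesized version with the original) can be sketched on a single image row as follows. This is a minimal sketch under stated assumptions: the per-pixel sampling coordinates are taken as given, whereas in the claimed arrangement they would follow from the depth map and the transformation, and all names are hypothetical.

```python
import math
from typing import List, Sequence


def sample_linear(row: Sequence[float], x: float) -> float:
    """Sample a row of intensities at a fractional position with linear interpolation."""
    x = min(max(x, 0.0), float(len(row) - 1))  # clamp to the valid range
    x0 = int(math.floor(x))
    x1 = min(x0 + 1, len(row) - 1)
    w = x - x0
    return (1.0 - w) * row[x0] + w * row[x1]


def synthesize_row(source_row: Sequence[float], coords: Sequence[float]) -> List[float]:
    """Build the synthesized row by sampling the source row at the warped coordinates."""
    return [sample_linear(source_row, x) for x in coords]


def photometric_loss(target_row: Sequence[float], synthesized_row: Sequence[float]) -> float:
    """Mean absolute intensity difference between the target and its synthesized version."""
    return sum(abs(a - b) for a, b in zip(target_row, synthesized_row)) / len(target_row)


source = [0.0, 10.0, 20.0, 30.0]
coords = [1.0, 1.5, 2.0, 3.0]  # assumed output of the depth and pose based warp
synth = synthesize_row(source, coords)
loss_match = photometric_loss([10.0, 15.0, 20.0, 30.0], synth)
loss_off = photometric_loss([10.0, 15.0, 20.0, 26.0], synth)
```

A loss of zero means the warp explains the target row exactly; any error in depth or pose shifts the sampling coordinates and raises the loss, which is what allows the comparison to serve as a training signal without ground truth depth.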
Regarding claim 8, Kuznietsov et al. teaches, the depth system of claim 1 (Pg. 2: first paragraph under “3. Approach”: we base our approach on supervised as well as unsupervised principles for learning single image depth map prediction; Pg. 8: first paragraph under “5. Conclusions”: we propose a novel semi-supervised deep learning approach to monocular depth map prediction; Note: semi-supervised learning incorporates a series of algorithms for a computer system to learn; see claim 1 above for more details), wherein the depth model is a machine learning algorithm comprised of an encoder and a decoder that function together to generate depth estimates of a scene from a monocular image (we concurrently train a CNN from unsupervised and supervised depth cues to achieve state-of-the-art performance in single image depth prediction; Pg. 8, first paragraph under “5. Conclusions”: we propose a novel semi-supervised deep learning approach to monocular depth map prediction; Pg. 2, second paragraph of left-hand column: we base our approach on a state-of-the-art deep residual network in an encoder decoder architecture for this task [17] and augment it with long skip connections between corresponding layers in encoder and decoder to predict high detail and wherein the pose model is a machine learning algorithm that performs a dimensional reduction of the training images to derive the transformation describing a change in pose between images within respective ones of the pairs.
Kuznietsov et al. does not expressly disclose the following limitations in claim 1 from which claim 8 depends: 1) generate a transformation from the first image and a second image of the pair using a pose model, the transformation defining a relationship between the pair of training images, and 2) at least in part, on reprojecting the depth map and the depth data onto an image space of the second image according to at least the transformation, and update the depth model and the pose model according to at least the supervised loss. Kuznietsov et al. also does not expressly disclose the following limitation underlined above: and wherein the pose model is a machine learning algorithm that performs a dimensional reduction of the training images to derive the transformation describing a change in pose between images within respective ones of the pairs.
However, Uhrig et al. teaches, generate a transformation from the first image and a second image of the pair using a pose model, the transformation defining a relationship between the pair of training images (Fig. 1: input to the network is two successive images from a monocular camera. The network estimates the depth in the first image and the camera motion; Fig. 2: DeMoN takes an image pair as input and predicts the depth map of the first image and the relative pose of the second camera; Fig. 2: egomotion estimation r, t). Uhrig et al. also teaches, at least in part, on reprojecting the depth map and the depth data onto an image space of the second image according to at least the transformation (Abstract: a crucial

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate spatial transformation of images and reprojecting the depth map and the depth data onto an image space as taught by 
Regarding claim 9, Kuznietsov et al. teaches, a non-transitory computer-readable medium (Pg. 6, second paragraph of left-hand column: to train the CNN on KITTI we use stochastic gradient descent with momentum with a learning rate of 0.01 and momentum of 0.9. We train the variants of our model for at least 15 epochs on a 6 GB NVIDIA GTX 980Ti with 6GB memory which allows for a batch size of 5; Note: the computer used to train a CNN contains a processor for executing instructions from the neural network stored in the memory of the computer) for training a depth model for monocular depth estimation (Pg. 2: first paragraph under “3. Approach”: we base our approach on supervised as well as unsupervised principles for learning single image depth map prediction; Pg. 8: first paragraph under “5. Conclusions”: we propose a novel semi-supervised deep learning approach to monocular depth map prediction) and including instructions that when executed by one or more processors (Fig. 1: train a CNN from unsupervised and supervised depth cues; Note: training a CNN is done using a computer, in which the computer contains a processor for executing instructions from the neural network) cause the one or more processors to: generate a depth map from a first image of a pair of training images using the depth model, wherein the pair of training images are separate frames depicting a scene from a monocular video, and wherein at least the first image includes corresponding depth data (Abstract: novel approach to depth map prediction from monocular images that learns in a semi-supervised way; Pg. 2: first paragraph under “3. Approach”: we base our approach on supervised as well as unsupervised principles for learning 1 and stereo image I2; Pg. 5: first paragraph of right-hand column: the sequences contain stereo imagery taken from a driving car in an urban scenario; Fig. 1: sparse ground-truth depth readings from a 3D sensor are used for supervised training); generate a transformation from the first image and a second image of the pair using a pose model, the transformation defining a relationship between the pair of training images; compute a supervised loss based (Pg. 3, first paragraph of right-hand column: we quantify the supervised loss in both images by projecting the ground truth laser data into each of the stereo images; Fig. 1: train a CNN from unsupervised and supervised depth cues; Note: training a CNN is done using a computer, in which the computer contains a processor for executing instructions from the neural network), at least in part, on reprojecting the depth map and the depth data onto an image space of the second image according to at least the transformation; and update the depth model and the pose model according to at least the supervised loss.
Kuznietsov et al. does not expressly disclose the following limitations underlined above: 1) generate a transformation from the first image and a second image of the pair using a pose model, the transformation defining a relationship between the pair of training images, and 2) at least in part, on reprojecting the depth map and the depth data onto an image space of the second image according to at least the transformation, and update the depth model and the pose model according to at least the supervised loss.
However, Uhrig et al. teaches, generate a transformation from the first image and a second image of the pair using a pose model, the transformation defining a relationship between the pair of training images (Fig. 1: input to the network is two successive images from 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate spatial transformation of images and reprojecting the depth map and the depth data onto an image space as taught by Uhrig et al. into the semi-supervised depth model as taught by Kuznietsov et al. in order to 
Regarding claim 10, Kuznietsov et al. teaches, the non-transitory computer-readable medium of claim 9 (Pg. 6, second paragraph of left-hand column: to train the CNN on KITTI we use stochastic gradient descent with momentum with a learning rate of 0.01 and momentum of 0.9. We train the variants of our model for at least 15 epochs on a 6 GB NVIDIA GTX 980Ti with 6GB memory which allows for a batch size of 5; Note: the computer used to train a CNN contains a processor for executing instructions from the neural network stored in the memory of the computer; see claim 9 for more details), wherein the instructions to compute the supervised loss (Pg. 3, first paragraph of right-hand column: we quantify the supervised loss in both images by projecting the ground truth laser data into each of the stereo images; Pg. 3: equation 6) include instructions to compute the supervised loss in combination with a self-supervised loss for a supervised training stage that is a second stage of training in a semi-supervised training process (Pg. 3, first paragraph under “3.1. Loss function”: we formulate a single loss function that incorporates both types of constraints that arise from supervised and unsupervised cues; Pg. 3: equation 5; Fig. 2: semi-supervised loss function), and wherein a first stage of training is self-supervised and occurs prior to the second stage (Pg. 2, second paragraph of right-hand column: our semi-supervised approach simplifies the use of unsupervised cues; Pg. 2: second paragraph of left-hand column: the use of supervised training also simplifies unsupervised learning significantly).
Kuznietsov et al. does not expressly disclose the following limitations in claim 9 from which claim 10 depends: 1) generate a transformation from the first image and a second image 
However, Uhrig et al. teaches, generate a transformation from the first image and a second image of the pair using a pose model, the transformation defining a relationship between the pair of training images (Fig. 1: input to the network is two successive images from a monocular camera. The network estimates the depth in the first image and the camera motion; Fig. 2: DeMoN takes an image pair as input and predicts the depth map of the first image and the relative pose of the second camera; Fig. 2: egomotion estimation r, t). Uhrig et al. also teaches, at least in part, on reprojecting the depth map and the depth data onto an image space of the second image according to at least the transformation (Abstract: a crucial component of the approach is a training loss based on spatial relative difference; Note: training loss is derived by spatial transformation and reprojecting the depth data to the original image depth data; Fig. 1: egomotion R, t; Pg. 6, second paragraph under “6.2. Error metrics”: for evaluating the camera motion estimation, we report the angle (in degrees) between the prediction and the ground truth for both the translation and the rotation; Pg. 6, first paragraph of right-hand column: we minimized the reprojection error using the ceres library), and update the depth model and the pose model (Fig. 2: DeMoN takes an image pair as input and predicts the depth map of the first image and the relative pose of the second camera) according to at least the supervised loss (Abstract: a crucial component of the approach is a training loss based on spatial relative difference; see section “5.1. Loss functions”).

Regarding claim 11, Kuznietsov et al. teaches, the non-transitory computer-readable medium of claim 9 (Pg. 6, second paragraph of left-hand column: to train the CNN on KITTI we use stochastic gradient descent with momentum with a learning rate of 0.01 and momentum of 0.9. We train the variants of our model for at least 15 epochs on a 6 GB NVIDIA GTX 980Ti with 6GB memory which allows for a batch size of 5; Note: the computer used to train a CNN contains a processor for executing instructions from the neural network stored in the memory of the computer; see claim 9 for more details), wherein the instructions to compute the supervised loss include instructions to apply a reprojected distance loss function to project predicted pixels and ground truth pixels onto the image space that corresponds to a contextual view of a camera when capturing the second image.
Kuznietsov et al. does not expressly disclose the following limitations in claim 9 from which claim 11 depends: 1) generate a transformation from the first image and a second image of the pair using a pose model, the transformation defining a relationship between the pair of training images, and 2) at least in part, on reprojecting the depth map and the depth data onto an image space of the second image according to at least the transformation, and update the depth model and the pose model according to at least the supervised loss. Kuznietsov et al. also 
However, Uhrig et al. teaches, generate a transformation from the first image and a second image of the pair using a pose model, the transformation defining a relationship between the pair of training images (Fig. 1: input to the network is two successive images from a monocular camera. The network estimates the depth in the first image and the camera motion; Fig. 2: DeMoN takes an image pair as input and predicts the depth map of the first image and the relative pose of the second camera; Fig. 2: egomotion estimation r, t). Uhrig et al. also teaches, at least in part, on reprojecting the depth map and the depth data onto an image space of the second image according to at least the transformation (Abstract: a crucial component of the approach is a training loss based on spatial relative difference; Note: training loss is derived by spatial transformation and reprojecting the depth data to the original image depth data; Fig. 1: egomotion R, t; Pg. 6, second paragraph under “6.2. Error metrics”: for evaluating the camera motion estimation, we report the angle (in degrees) between the prediction and the ground truth for both the translation and the rotation; Pg. 6, first paragraph of right-hand column: we minimized the reprojection error using the ceres library), and update the depth model and the pose model (Fig. 2: DeMoN takes an image pair as input and predicts the depth map of the first image and the relative pose of the second camera) according to at least the supervised loss (Abstract: a crucial component of the approach is a training loss based on spatial relative difference; see section “5.1. Loss functions”). Uhrig et al. also teaches, 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate spatial transformation of images and reprojection of the depth points according to the image space of the camera view, as taught by Uhrig et al., into the semi-supervised depth model taught by Kuznietsov et al. in order to improve depth accuracy and minimize reprojection error (Uhrig et al., Pg. 6, first and second paragraphs of right-hand column).
Regarding claim 12, Kuznietsov et al. teaches, the non-transitory computer-readable medium of claim 11 (Pg. 6, second paragraph of left-hand column: to train the CNN on KITTI we use stochastic gradient descent with momentum with a learning rate of 0.01 and momentum of wherein the predicted pixels correspond with points in the scene as identified in the depth map and observed from the contextual view, wherein the ground truth pixels correspond with the depth data for the first image (Fig. 1: for supervised training, we use (sparse) ground-truth depth readings from a supplementary sensing cue such as a 3D laser; Fig. 2: sparse ground-truth depth readings for Il and Ir), and wherein the instructions to compute the supervised loss using the reprojected distance loss function include instructions to compare corresponding ones of the predicted pixels with the ground truth pixels to generate a reprojected distance loss (Fig. 1: train a CNN from unsupervised and supervised depth cues; Note: training a CNN is done using a computer, in which the computer contains a processor for executing instructions from the neural network; Pg. 3: equation 6 for supervised loss; Pg. 3: first paragraph of right-hand column: we quantify the supervised loss in both images by projecting the ground truth laser data into each of the stereo images).
Kuznietsov et al. does not expressly disclose the following limitations in claim 9 from which claim 11 depends: 1) generate a transformation from the first image and a second image of the pair using a pose model, the transformation defining a relationship between the pair of training images, and 2) at least in part, on reprojecting the depth map and the depth data onto an image space of the second image according to at least the transformation, and update the depth model and the pose model according to at least the supervised loss. Kuznietsov et al. also does not expressly disclose the following limitation in claim 11 from which claim 12 depends: 
However, Uhrig et al. teaches, generate a transformation from the first image and a second image of the pair using a pose model, the transformation defining a relationship between the pair of training images (Fig. 1: input to the network is two successive images from a monocular camera. The network estimates the depth in the first image and the camera motion; Fig. 2: DeMoN takes an image pair as input and predicts the depth map of the first image and the relative pose of the second camera; Fig. 2: egomotion estimation r, t). Uhrig et al. also teaches, at least in part, on reprojecting the depth map and the depth data onto an image space of the second image according to at least the transformation (Abstract: a crucial component of the approach is a training loss based on spatial relative difference; Note: training loss is derived by spatial transformation and reprojecting the depth data to the original image depth data; Fig. 1: egomotion R, t; Pg. 6, second paragraph under “6.2. Error metrics”: for evaluating the camera motion estimation, we report the angle (in degrees) between the prediction and the ground truth for both the translation and the rotation; Pg. 6, first paragraph of right-hand column: we minimized the reprojection error using the ceres library), and update the depth model and the pose model (Fig. 2: DeMoN takes an image pair as input and predicts the depth map of the first image and the relative pose of the second camera) according to at 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate spatial transformation of images and reprojection of the depth points according to the image space of the camera view, as taught by Uhrig et al., into the semi-supervised depth model taught by Kuznietsov et al. in order to improve depth accuracy and minimize reprojection error (Uhrig et al., Pg. 6, first and second paragraphs of right-hand column).
Regarding claim 13, Kuznietsov et al. teaches, a method of training a depth model for monocular depth estimation (Pg. 7, first paragraph of right-hand column: our semi-supervised learning method converges much faster (in about one third the number of iterations) than purely supervised training; Pg. 2: first paragraph under “3. Approach”: we base our approach on supervised as well as unsupervised principles for learning single image depth map prediction; Pg. 8: first paragraph under “5. Conclusions”: we propose a novel semi-supervised deep learning approach to monocular depth map prediction), comprising: generating, as part of training the depth model according to a supervised training stage, a depth map from a first image of a pair of training images using the depth model, wherein the pair of training images are separate frames depicting a scene from a monocular video, and wherein at least the first image includes corresponding depth data (Abstract: novel approach to depth map prediction from monocular images that learns in a semi-supervised way; Pg. 2: first paragraph under “3. Approach”: we base our approach on supervised as well as unsupervised principles for learning 1 and stereo image I2; Pg. 5: first paragraph of right-hand column: the sequences contain stereo imagery taken from a driving car in an urban scenario; Fig. 1: sparse ground-truth depth readings from a 3D sensor are used for supervised training); generating a transformation from the first image and a second image of the pair using a pose model, the transformation defining a relationship between the pair of training images; computing a supervised loss based (Pg. 3, first paragraph of right-hand column: we quantify the supervised loss in both images by projecting the ground truth laser data into each of the stereo images; Fig. 1: train a CNN from unsupervised and supervised depth cues; Note: training a CNN is done using a computer, in which the computer contains a processor for executing instructions from the neural network), at least in part, on reprojecting the depth map and the depth data onto an image space of the second image according to at least the transformation; and updating the depth model and the pose model according to at least the supervised loss.
Kuznietsov et al. does not expressly disclose the following limitations underlined above: 1) generating a transformation from the first image and a second image of the pair using a pose model, the transformation defining a relationship between the pair of training images, and 2) at least in part, on reprojecting the depth map and the depth data onto an image space of the second image according to at least the transformation, and update the depth model and the pose model according to at least the supervised loss.
However, Uhrig et al. teaches, generating a transformation from the first image and a second image of the pair using a pose model, the transformation defining a relationship between the pair of training images (Fig. 1: input to the network is two successive images from 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate spatial transformation of images and reprojecting the depth map and the depth data onto an image space, as taught by Uhrig et al., into the semi-supervised depth model taught by Kuznietsov et al. in order to 
Regarding claim 14, Kuznietsov et al. teaches, the method of claim 13 (Pg. 7, first paragraph of right-hand column: our semi-supervised learning method converges much faster (in about one third the number of iterations) than purely supervised training; Pg. 2: first paragraph under “3. Approach”: we base our approach on supervised as well as unsupervised principles for learning single image depth map prediction; Pg. 8: first paragraph under “5. Conclusions”: we propose a novel semi-supervised deep learning approach to monocular depth map prediction; see claim 13 for more details), wherein computing the supervised loss (Pg. 3, first paragraph of right-hand column: we quantify the supervised loss in both images by projecting the ground truth laser data into each of the stereo images; Pg. 3: equation 6) includes computing the supervised loss in combination with a self-supervised loss as the supervised training stage that is a second stage of training in a semi-supervised training process (Pg. 3, first paragraph under “3.1. Loss function”: we formulate a single loss function that incorporates both types of constraints that arise from supervised and unsupervised cues; Pg. 3: equation 5; Fig. 2: semi-supervised loss function), and wherein a first stage of training is self-supervised and occurs prior to the second stage (Pg. 2, second paragraph of right-hand column: Our semi-supervised approach simplifies the use of unsupervised cues; Pg. 2: second paragraph of left-hand column: the use of supervised training also simplifies unsupervised learning significantly).
Kuznietsov et al. does not expressly disclose the following limitations in claim 13 from which claim 14 depends: 1) generating a transformation from the first image and a second 
However, Uhrig et al. teaches, generating a transformation from the first image and a second image of the pair using a pose model, the transformation defining a relationship between the pair of training images (Fig. 1: input to the network is two successive images from a monocular camera. The network estimates the depth in the first image and the camera motion; Fig. 2: DeMoN takes an image pair as input and predicts the depth map of the first image and the relative pose of the second camera; Fig. 2: egomotion estimation r, t). Uhrig et al. also teaches, at least in part, on reprojecting the depth map and the depth data onto an image space of the second image according to at least the transformation (Abstract: a crucial component of the approach is a training loss based on spatial relative difference; Note: training loss is derived by spatial transformation and reprojecting the depth data to the original image depth data; Fig. 1: egomotion R, t; Pg. 6, second paragraph under “6.2. Error metrics”: for evaluating the camera motion estimation, we report the angle (in degrees) between the prediction and the ground truth for both the translation and the rotation; Pg. 6, first paragraph of right-hand column: we minimized the reprojection error using the ceres library), and update the depth model and the pose model (Fig. 2: DeMoN takes an image pair as input and predicts the depth map of the first image and the relative pose of the second camera) according to at least the supervised loss (Abstract: a crucial component of the approach is a training loss based on spatial relative difference; see section “5.1. Loss functions”).
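For illustration only, the two-stage arrangement discussed with respect to claim 14 (a self-supervised first stage followed by a second stage that combines a supervised loss with a self-supervised loss using trade-off parameters) may be sketched as follows. The sketch is not taken from the application or from the cited references; the function names, the default weights, and the optional regularization term are hypothetical.

```python
def semi_supervised_loss(supervised, self_supervised, lambda_t, gamma, regularizer=0.0):
    """Weighted sum of a supervised term and a self-supervised term.

    lambda_t and gamma play the role of trade-off parameters; the
    regularizer argument is an assumed extra term and defaults to zero.
    """
    return lambda_t * supervised + gamma * self_supervised + regularizer


def staged_loss(stage, supervised, self_supervised, lambda_t=1.0, gamma=1.0):
    """Select the loss for a hypothetical two-stage training schedule."""
    if stage == 1:
        # First stage: self-supervised only, no depth annotations are used.
        return gamma * self_supervised
    if stage == 2:
        # Second stage: supervised loss in combination with the self-supervised loss.
        return semi_supervised_loss(supervised, self_supervised, lambda_t, gamma)
    raise ValueError("stage must be 1 or 2")


stage_one = staged_loss(1, supervised=2.0, self_supervised=0.5)
stage_two = staged_loss(2, supervised=2.0, self_supervised=0.5, lambda_t=0.1)
print(stage_one, stage_two)
```

In this sketch the supervised term has no effect in the first stage and is scaled by lambda_t in the second stage, which is one simple way to realize a staged semi-supervised schedule.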

Regarding claim 17, Kuznietsov et al. teaches, the method of claim 13 (Pg. 7, first paragraph of right-hand column: our semi-supervised learning method converges much faster (in about one third the number of iterations) than purely supervised training; Pg. 2: first paragraph under “3. Approach”: we base our approach on supervised as well as unsupervised principles for learning single image depth map prediction; Pg. 8: first paragraph under “5. Conclusions”: we propose a novel semi-supervised deep learning approach to monocular depth map prediction; see claim 13 for more details), wherein computing the supervised loss (Fig. 1: train a CNN from unsupervised and supervised depth cues; Note: training a CNN is done using a computer, in which the computer contains a processor for executing instructions from the neural network; Pg. 3: equation 6 for supervised loss) includes using a reprojected distance loss function to project predicted pixels and ground truth pixels onto the image space that corresponds to a contextual view of a camera when capturing the second image, wherein the predicted pixels correspond with points in the scene as identified in the depth map and observed from the contextual view, wherein the ground truth pixels correspond with the depth data for the first image (Fig. 1: for supervised training, we use (sparse) ground-truth depth readings from a supplementary sensing cue such as a 3D laser; Fig. 2: sparse ground-truth depth readings for Il and Ir), and wherein computing the supervised loss using the reprojected distance loss function includes comparing corresponding ones of the predicted pixels with the ground truth pixels to generate a reprojected distance loss (Fig. 1: train a CNN from unsupervised and supervised depth cues; Pg. 3: equation 6 for supervised loss; Pg. 3: first paragraph of right-hand column: we quantify the supervised loss in both images by projecting the ground truth laser data into each of the stereo images).
Kuznietsov et al. does not expressly disclose the following limitations in claim 13 from which claim 17 depends: 1) generating a transformation from the first image and a second image of the pair using a pose model, the transformation defining a relationship between the pair of training images, and 2) at least in part, on reprojecting the depth map and the depth data onto an image space of the second image according to at least the transformation, and update the depth model and the pose model according to at least the supervised loss. Kuznietsov et al. also does not expressly disclose the following limitation underlined above: includes using a reprojected distance loss function to project predicted pixels and ground truth pixels onto the image space that corresponds to a contextual view of a camera when capturing the second image, wherein the predicted pixels correspond with points in the scene as identified in the depth map and observed from the contextual view.
However, Uhrig et al. teaches, generating a transformation from the first image and a second image of the pair using a pose model, the transformation defining a relationship between the pair of training images (Fig. 1: input to the network is two successive images from a monocular camera. The network estimates the depth in the first image and the camera motion; Fig. 2: DeMoN takes an image pair as input and predicts the depth map of the first 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate spatial transformation of images and reprojection of the depth points according to the image space of the camera view, as taught by Uhrig et al., into the semi-supervised depth model taught by Kuznietsov et al. in order to improve depth accuracy and minimize reprojection error (Uhrig et al., Pg. 6, first and second paragraphs of right-hand column).
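For illustration only, the reprojected distance loss discussed with respect to claim 17 (projecting predicted pixels and ground truth pixels onto the image space of the second image and comparing corresponding ones) may be sketched as follows. The sketch assumes a pinhole camera model and a rigid transformation x' = Rx + t between the two views; it is not taken from the application or from the cited references, and all function names and the intrinsic parameters fx, fy, cx, cy are hypothetical.

```python
import math


def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) at the given depth to a 3D point in the first camera frame."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def rigid_transform(point, rotation, translation):
    """Apply x' = R x + t, mapping a point from the first to the second camera frame."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def project(point, fx, fy, cx, cy):
    """Project a 3D point (with positive z) onto the image plane."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


def reprojected_distance_loss(pixels, pred_depths, gt_depths,
                              rotation, translation, fx, fy, cx, cy):
    """Mean pixel distance between reprojected predicted and ground-truth points."""
    total = 0.0
    for (u, v), d_pred, d_gt in zip(pixels, pred_depths, gt_depths):
        reprojected = []
        for depth in (d_pred, d_gt):
            point = backproject(u, v, depth, fx, fy, cx, cy)
            moved = rigid_transform(point, rotation, translation)
            reprojected.append(project(moved, fx, fy, cx, cy))
        (u_p, v_p), (u_g, v_g) = reprojected
        total += math.hypot(u_p - u_g, v_p - v_g)
    return total / len(pixels)


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
loss = reprojected_distance_loss(
    pixels=[(0.0, 0.0)], pred_depths=[2.0], gt_depths=[4.0],
    rotation=IDENTITY, translation=[1.0, 0.0, 0.0],
    fx=100.0, fy=100.0, cx=0.0, cy=0.0,
)
print(loss)  # 25.0
```

In this example the predicted depth (2.0) and the ground truth depth (4.0) for the same pixel land 25 pixels apart in the second view, so the depth error is expressed as a distance in the image space of the second image; when the two depths agree, the loss is zero.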
Regarding claim 18, Kuznietsov et al. teaches, the method of claim 13 (Pg. 7, first paragraph of right-hand column: our semi-supervised learning method converges much faster (in about one third the number of iterations) than purely supervised training; Pg. 2: first paragraph under “3. Approach”: we base our approach on supervised as well as unsupervised principles for learning single image depth map prediction; Pg. 8: first paragraph under “5. Conclusions”: we propose a novel semi-supervised deep learning approach to monocular depth map prediction; see claim 13 for more details), wherein the depth data includes sparse LiDAR data comprising depth information from four beams that correspond with sparse locations in the training images (Fig. 1: for supervised training we use (sparse) ground-truth depth readings from a supplementary sensing cue such as a 3D laser; Note: LiDAR uses a laser to calculate depth information), and wherein the second stage refines learned weights of the depth model (Pg. 3: equation 5 for loss function in which λt and γ are trade-off parameters between supervised loss $L_{\theta}^{S}$, unsupervised loss $L_{\theta}^{U}$; Note: the weights are parameters (i.e., λt and γ) that can be changed; Pg. 5, paragraph under “4.1. Implementation Details”: we initialize the encoder part of our network with ResNet-50 [11] weights pretrained for ImageNet classification task. The convolution filter weights in the decoder part are initialized randomly) and the pose model using the depth data to train the depth model on scale by accounting for scale aware differences between the depth maps and the sparse LiDAR data to improve scale awareness of the depth model in producing depth estimates.
Kuznietsov et al. does not expressly disclose the following limitations in claim 13 from which claim 18 depends: 1) generating a transformation from the first image and a second image of the pair using a pose model, the transformation defining a relationship between the pair of training images, and 2) at least in part, on reprojecting the depth map and the depth data onto an image space of the second image according to at least the transformation, and update the depth model and the pose model according to at least the supervised loss. Kuznietsov et al. also does not expressly disclose the following limitation underlined above:  and the pose model using the depth data to train the depth model on scale by accounting for scale aware differences between the depth maps and the sparse LiDAR data to improve scale awareness of the depth model in producing depth estimates.
However, Uhrig et al. teaches, generating a transformation from the first image and a second image of the pair using a pose model, the transformation defining a relationship between the pair of training images (Fig. 1: input to the network is two successive images from a monocular camera. The network estimates the depth in the first image and the camera motion; Fig. 2: DeMoN takes an image pair as input and predicts the depth map of the first 

Regarding claim 19, Kuznietsov et al. teaches, the method of claim 13 (Pg. 7, first paragraph of right-hand column: our semi-supervised learning method converges much faster (in about one third the number of iterations) than purely supervised training; Pg. 2: first paragraph under “3. Approach”: we base our approach on supervised as well as unsupervised principles for learning single image depth map prediction; Pg. 8: first paragraph under “5. Conclusions”: we propose a novel semi-supervised deep learning approach to monocular depth map prediction; see claim 13 for more details), wherein the depth model is a machine learning algorithm comprised of an encoder and a decoder that function together to generate depth estimates of a scene from a monocular image (we concurrently train a CNN from unsupervised and supervised depth cues to achieve state-of-the-art performance in single image depth prediction; Pg. 8, first paragraph under “5. Conclusions”: we propose a novel semi-supervised deep learning approach to monocular depth map prediction; Pg. 2, second paragraph of left-hand column: we base our approach on a state-of-the-art deep residual network in an encoder decoder architecture for this task [17] and augment it with long skip connections between corresponding layers in encoder and decoder to predict high detail output depth maps; see section “3.2, Network Architecture), and wherein the pose model is a machine learning algorithm that performs a dimensional reduction of the training images to derive the transformation describing a change in pose between images within respective ones of the pairs.
Kuznietsov et al. does not expressly disclose the following limitations in claim 13 from which claim 19 depends: 1) generating a transformation from the first image and a second image of the pair using a pose model, the transformation defining a relationship between the pair of training images, and 2) at least in part, on reprojecting the depth map and the depth data onto an image space of the second image according to at least the transformation, and update the depth model and the pose model according to at least the supervised loss. Kuznietsov et al. also does not expressly disclose the following limitation underlined above: and wherein the pose model is a machine learning algorithm that performs a dimensional reduction of the training images to derive the transformation describing a change in pose between images within respective ones of the pairs.
However, Uhrig et al. teaches, generating a transformation from the first image and a second image of the pair using a pose model, the transformation defining a relationship between the pair of training images (Fig. 1: input to the network is two successive images from a monocular camera. The network estimates the depth in the first image and the camera motion; Fig. 2: DeMoN takes an image pair as input and predicts the depth map of the first image and the relative pose of the second camera; Fig. 2: egomotion estimation r, t). Uhrig et al. also teaches, at least in part, on reprojecting the depth map and the depth data onto an image space of the second image according to at least the transformation (Abstract: a crucial component of the approach is a training loss based on spatial relative difference; Note: training loss is derived by spatial transformation and reprojecting the depth data to the original image 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate spatial transformation of images and reprojecting the depth map and the depth data onto an image space, as taught by Uhrig et al., into the semi-supervised depth model taught by Kuznietsov et al. in order to 
Regarding claim 20, Kuznietsov et al. teaches, the method of claim 13 (Pg. 7, first paragraph of right-hand column: our semi-supervised learning method converges much faster (in about one third the number of iterations) than purely supervised training; Pg. 2: first paragraph under “3. Approach”: we base our approach on supervised as well as unsupervised principles for learning single image depth map prediction; Pg. 8: first paragraph under “5. Conclusions”: we propose a novel semi-supervised deep learning approach to monocular depth map prediction; see claim 13 for more details), further comprising: providing the depth model to infer distances from monocular images in a device (Abstract: novel approach to depth map prediction from monocular images that learns in a semi-supervised way; Pg. 1, first paragraph under “1. Introduction”: supervised deep learning approaches have demonstrated promising results for single image depth prediction. These learning approaches appear to capture the statistical relationship between appearance and distance to objects well), wherein computing the supervised loss further includes generating a photometric loss (Pg. 3, equation 6 for supervised loss; Fig. 2: semi-supervised loss function showing the difference between the input image (i.e. Ir) and the warped image (warped Il) based on the prediction of the CNN; Pg. 4, equation 9 for unsupervised loss; Note: photometric loss can include self-supervised loss which is part of the unsupervised loss category) by generating a synthesized version of the first image using the depth map and the transformation, and calculating the photometric loss according to a comparison of the synthesized version with the first image (Fig. 2: semi-supervised loss function showing the difference between the input image (i.e. Ir) and the warped image (warped Il) based on the prediction of the CNN; Pg. 4, equation 9 for unsupervised loss; Note: photometric loss can include self-supervised loss which is part of the unsupervised loss category; Pg. 2, second paragraph of right-hand column: the loss quantifies the photometric error of the input image warped into its corresponding stereo image using the predicted depth).
Kuznietsov et al. does not expressly disclose the following limitations in claim 13 from which claim 20 depends: 1) generating a transformation from the first image and a second image of the pair using a pose model, the transformation defining a relationship between the pair of training images, and 2) at least in part, on reprojecting the depth map and the depth data onto an image space of the second image according to at least the transformation, and update the depth model and the pose model according to at least the supervised loss. Kuznietsov et al. also does not expressly disclose the following limitation underlined above: by generating a synthesized version of the first image using the depth map and the transformation.
However, Uhrig et al. teaches, generating a transformation from the first image and a second image of the pair using a pose model, the transformation defining a relationship between the pair of training images (Fig. 1: input to the network is two successive images from a monocular camera. The network estimates the depth in the first image and the camera motion; Fig. 2: DeMoN takes an image pair as input and predicts the depth map of the first image and the relative pose of the second camera; Fig. 2: egomotion estimation r, t). Uhrig et al. also teaches, at least in part, on reprojecting the depth map and the depth data onto an image space of the second image according to at least the transformation (Abstract: a crucial 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate spatial transformation of images and reprojecting the depth map and the depth data onto an image space, as taught by Uhrig et al., into the semi-supervised depth model taught by Kuznietsov et al. in order to improve depth accuracy and minimize reprojection error (Uhrig et al., Pg. 6, first and second paragraphs of right-hand column).
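For illustration only, the photometric loss discussed with respect to claim 20 (synthesizing a version of the first image using the depth map and the transformation, then comparing the synthesized version with the first image) may be sketched as follows. The sketch assumes a pinhole camera model, grayscale images stored as nested lists, nearest-neighbor sampling, and a mean absolute difference as the comparison; it is not taken from the application or from the cited references, and all names are hypothetical.

```python
def synthesize_first_image(second_image, depth, rotation, translation, fx, fy, cx, cy):
    """Rebuild the first image by sampling the second image at reprojected pixels.

    Pixels that reproject outside the second image are left as None.
    """
    height, width = len(depth), len(depth[0])
    synthesized = [[None] * width for _ in range(height)]
    for v in range(height):
        for u in range(width):
            d = depth[v][u]
            # Back-project the pixel of the first image to a 3D point.
            point = ((u - cx) * d / fx, (v - cy) * d / fy, d)
            # Move the point into the second camera frame: x' = R x + t.
            moved = [
                sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
                for i in range(3)
            ]
            if moved[2] <= 0:
                continue  # point is behind the second camera
            # Project into the second image and sample the nearest pixel.
            us = round(fx * moved[0] / moved[2] + cx)
            vs = round(fy * moved[1] / moved[2] + cy)
            if 0 <= us < width and 0 <= vs < height:
                synthesized[v][u] = second_image[vs][us]
    return synthesized


def photometric_loss(first_image, synthesized):
    """Mean absolute intensity difference over pixels that could be synthesized."""
    diffs = [
        abs(a - b)
        for row_a, row_b in zip(first_image, synthesized)
        for a, b in zip(row_a, row_b)
        if b is not None
    ]
    return sum(diffs) / len(diffs) if diffs else 0.0


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
first = [[10, 20, 30, 40]]
second = [[0, 10, 20, 30]]  # same scene, shifted by one pixel
shift = [1.0, 0.0, 0.0]

good = synthesize_first_image(second, [[1.0] * 4], IDENTITY, shift, 1.0, 1.0, 0.0, 0.0)
bad = synthesize_first_image(second, [[0.5] * 4], IDENTITY, shift, 1.0, 1.0, 0.0, 0.0)
print(photometric_loss(first, good), photometric_loss(first, bad))  # 0.0 10.0
```

With the correct depth the synthesized image matches the first image and the loss is zero, while an incorrect depth samples the wrong pixels of the second image and yields a nonzero loss, which is the signal such a loss provides during training.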
Claims 3-4 and 15-16 are rejected under 35 U.S.C. 103 as being unpatentable over “Semi-Supervised Deep Learning for Monocular Depth Map Prediction” by Kuznietsov et al. in view of “DeMoN: Depth and Motion Network for Learning Monocular Stereo” by Uhrig et al. and further in view of “Digging into Self-Supervised Monocular Depth Estimation” by Godard et al.
Regarding claim 3, Kuznietsov et al. teaches, the depth system of claim 2 (Pg. 2: first paragraph under “3. Approach”: we base our approach on supervised as well as unsupervised principles for learning single image depth map prediction; Pg. 8: first paragraph under “5. Conclusions”: we propose a novel semi-supervised deep learning approach to monocular depth map prediction; Note: semi-supervised learning incorporates a series of algorithms for a computer system to learn; see claim 2 above for more details), wherein the first stage is a self-supervised structure from motion (SfM) training process that accounts for motion of a camera between the training images to cause the depth model to learn how to infer depths without annotated training data, and wherein the training module includes instructions to compute the supervised loss during the second stage (Pg. 3, first paragraph under “3.1. Loss function”: we formulate a single loss function that incorporates both types of constraints that arise from supervised and unsupervised cues; Pg. 3: equation 6), including instructions to compute the supervised loss to refine the depth model using the depth data as selective dispersed ground truths providing limited supervision over depth estimates of the depth model (Pg. 3, first paragraph of right-hand column: we quantify the supervised loss in both images by projecting the ground truth laser data into each of the stereo images; Pg. 3, last paragraph of right-hand 
Kuznietsov et al. does not expressly disclose the following limitations in claim 1 from which claim 2 and thus claim 3 depends:  1) generate a transformation from the first image and a second image of the pair using a pose model, the transformation defining a relationship between the pair of training images, and 2) at least in part, on reprojecting the depth map and the depth data onto an image space of the second image according to at least the transformation, and update the depth model and the pose model according to at least the supervised loss.
However, Uhrig et al. teaches, generate a transformation from the first image and a second image of the pair using a pose model, the transformation defining a relationship between the pair of training images (Fig. 1: input to the network is two successive images from a monocular camera. The network estimates the depth in the first image and the camera motion; Fig. 2: DeMoN takes an image pair as input and predicts the depth map of the first image and the relative pose of the second camera; Fig. 2: egomotion estimation r, t). Uhrig et al. also teaches, at least in part, on reprojecting the depth map and the depth data onto an image space of the second image according to at least the transformation (Abstract: a crucial component of the approach is a training loss based on spatial relative difference; Note: training loss is derived by spatial transformation and reprojecting the depth data to the original image depth data; Fig. 1: egomotion R, t; Pg. 6, second paragraph under “6.2. Error metrics”: for evaluating the camera motion estimation, we report the angle (in degrees) between the prediction and the ground truth for both the translation and the rotation; Pg. 6, first paragraph 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate spatial transformation of images and reprojecting the depth map and the depth data onto an image space, as taught by Uhrig et al., into the semi-supervised depth model taught by Kuznietsov et al. in order to improve depth accuracy and minimize reprojection error (Uhrig et al., Pg. 6, first and second paragraphs of right-hand column).
The combination of Kuznietsov et al. and Uhrig et al. does not expressly disclose the following limitation underlined above: wherein the first stage is a self-supervised structure from motion (SfM) training process that accounts for motion of a camera between the training images to cause the depth model to learn how to infer depths without annotated training data. 
However, Godard et al. teaches, wherein the first stage is a self-supervised structure from motion (SfM) training process that accounts for motion of a camera between the training images to cause the depth model to learn how to infer depths without annotated training data (Pg. 1, second paragraph of right-hand column: in addition to estimating depth, the model also needs to estimate the egomotion between temporal image pairs during training. This typically involves training a pose estimation network that takes a finite sequence of frames as input, and outputs the corresponding camera transformations; Pg. 7: first paragraph under “5. 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include a self-supervised structure from motion (SfM) as taught by Godard et al. into the combined depth model of Kuznietsov et al. and Uhrig et al. in order to generate a useful sparse training signal for both camera pose and depth (Godard et al., Pg. 2, last paragraph of left-hand column).
Regarding claim 4, Kuznietsov et al. teaches, the depth system of claim 2 (Pg. 2: first paragraph under “3. Approach”: we base our approach on supervised as well as unsupervised principles for learning single image depth map prediction; Pg. 8: first paragraph under “5. Conclusions”: we propose a novel semi-supervised deep learning approach to monocular depth map prediction; Note: semi-supervised learning incorporates a series of algorithms for a computer system to learn; see claim 2 above for more details), wherein the training module includes instructions to, during the first stage, produce first stage loss values from a first stage loss function that includes a photometric loss function and a depth smoothness loss function that separately account for pixel-level similarities and irregularities along edge regions between a synthesized image derived from depth predictions of the depth model and a target image of a respective one of the pairs.
Kuznietsov et al. does not expressly disclose the following limitations in claim 1 from which claim 2 and thus claim 4 depends:  1) generate a transformation from the first image and a second image of the pair using a pose model, the transformation defining a relationship 
However, Uhrig et al. teaches, generate a transformation from the first image and a second image of the pair using a pose model, the transformation defining a relationship between the pair of training images (Fig. 1: input to the network is two successive images from a monocular camera. The network estimates the depth in the first image and the camera motion; Fig. 2: DeMoN takes an image pair as input and predicts the depth map of the first image and the relative pose of the second camera; Fig. 2: egomotion estimation r, t). Uhrig et al. also teaches, at least in part, on reprojecting the depth map and the depth data onto an image space of the second image according to at least the transformation (Abstract: a crucial component of the approach is a training loss based on spatial relative difference; Note: training loss is derived by spatial transformation and reprojecting the depth data to the original image depth data; Fig. 1: egomotion R, t; Pg. 6, second paragraph under “6.2. Error metrics”: for evaluating the camera motion estimation, we report the angle (in degrees) between the prediction and the ground truth for both the translation and the rotation; Pg. 6, first paragraph of right-hand column: we minimized the reprojection error using the ceres library), and update the depth model and the pose model (Fig. 2: DeMoN takes an image pair as input and predicts the depth map of the first image and the relative pose of the second camera) according to at least the supervised loss (Abstract: a crucial component of the approach is a training loss based on spatial relative difference; see section “5.1. Loss functions”).

The combination of Kuznietsov et al. and Uhrig et al. does not expressly disclose the following limitation underlined above: wherein the training module includes instructions to, during the first stage, produce first stage loss values from a first stage loss function that includes a photometric loss function and a depth smoothness loss function that separately account for pixel-level similarities and irregularities along edge regions between a synthesized image derived from depth predictions of the depth model and a target image of a respective one of the pairs.
However, Godard et al. teaches, wherein the training module includes instructions to, during the first stage, produce first stage loss values from a first stage loss function that includes a photometric loss function and a depth smoothness loss function (Pg. 3, second paragraph of right-hand column: we also formulate our problem as the minimization of a photometric reprojection error at training time; Pg. 3: equations 1-3; Pg. 3: first paragraph of right-hand column: classical binocular and multi-view stereo methods typically address this ambiguity by enforcing smoothness in the depth maps; Pg. 5, paragraph under “Final Training Loss”: we combine our per-pixel smoothness and masked photometric losses as, L = µLp + λLs) that separately account for pixel-level similarities and irregularities along edge regions between a synthesized image derived from depth predictions of the depth model and a target image of a respective one of the pairs (Pg. 3: second paragraph of right-hand column: We express the relative pose for each source view It’, with respect to the target image It’s pose, as Tt → t’. We predict a dense depth map Dt that minimizes the photometric reprojection error Lp; Pg. 3: equations 1-3; Pg. 5, paragraph under “Final Training Loss”: we combine our per-pixel smoothness and masked photometric losses as, L = µLp + λLs, and average over each pixel, scale, and batch; Pg. 5, paragraph under “Final Training Loss”: edge-aware smoothness Ls; Pg. 12, equation 7).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include a photometric loss function and a depth smoothness loss function as taught by Godard et al. into the combined depth model of Kuznietsov et al. and Uhrig et al. in order to improve depth estimation performance and improve synthetic image output (Godard et al., Pg. 3, paragraph under “Appearance Based Losses”).
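For illustration only, the combined loss quoted above from Godard et al. (L = µLp + λLs) can be sketched as follows. This is a simplified sketch under stated assumptions and is not the implementation of any cited reference: a plain L1 difference stands in for the photometric term, the smoothness term uses a common edge-aware form that down-weights depth gradients where the image itself has strong gradients, images are single-channel arrays, and mu and lam are treated as plain scalar weights. All function and parameter names are hypothetical.

```python
import numpy as np

def photometric_loss(synth, target):
    """Mean absolute difference between a synthesized image and the target image."""
    return np.abs(synth - target).mean()

def smoothness_loss(depth, image):
    """Edge-aware smoothness: penalize depth gradients, less so at image edges."""
    dx_d = np.abs(depth[:, 1:] - depth[:, :-1])   # horizontal depth gradient
    dy_d = np.abs(depth[1:, :] - depth[:-1, :])   # vertical depth gradient
    dx_i = np.abs(image[:, 1:] - image[:, :-1])   # horizontal image gradient
    dy_i = np.abs(image[1:, :] - image[:-1, :])   # vertical image gradient
    return (dx_d * np.exp(-dx_i)).mean() + (dy_d * np.exp(-dy_i)).mean()

def first_stage_loss(synth, target, depth, mu=1.0, lam=0.001):
    """Weighted sum of the two terms; mu and lam are assumed scalar weights."""
    return mu * photometric_loss(synth, target) + lam * smoothness_loss(depth, target)
```

The two terms are kept as separate functions to mirror the claim language, in which the photometric and depth smoothness losses separately account for pixel-level similarities and for irregularities along edge regions.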
Regarding claim 15, Kuznietsov et al. teaches, the method of claim 14 (Pg. 7, first paragraph of right-hand column: our semi-supervised learning method converges much faster (in about one third the number of iterations) than purely supervised training; Pg. 2: first paragraph under “3. Approach”: we base our approach on supervised as well as unsupervised principles for learning single image depth map prediction; Pg. 8: first paragraph under “5. Conclusions”: we propose a novel semi-supervised deep learning approach to monocular depth map prediction; see claim 14 for more details), wherein the first stage is a self-supervised structure from motion (SfM) training process that accounts for motion of a camera between the training images to cause the depth model to learn how to infer depths without annotated training data, and wherein computing the supervised loss during the second stage includes computing the supervised loss (Pg. 3, first paragraph under “3.1. Loss function”: we formulate a single loss function that incorporates both types of constraints that arise from supervised and unsupervised cues; Pg. 3: equation 6) to refine the depth model using the depth data as selective dispersed ground truths providing limited supervision over depth estimates of the depth model (Pg. 3, first paragraph of right-hand column: we quantify the supervised loss in both images by projecting the ground truth laser data into each of the stereo images; Pg. 3, last paragraph of right-hand column: the supervised loss term measures the deviation of the predicted depth map from the available ground truth at the pixels).
Kuznietsov et al. does not expressly disclose the following limitations in claim 13 from which claim 14 and thus claim 15 depends: 1) generating a transformation from the first image and a second image of the pair using a pose model, the transformation defining a relationship between the pair of training images, and 2) at least in part, on reprojecting the depth map and the depth data onto an image space of the second image according to at least the transformation, and update the depth model and the pose model according to at least the supervised loss.
However, Uhrig et al. teaches, generating a transformation from the first image and a second image of the pair using a pose model, the transformation defining a relationship between the pair of training images (Fig. 1: input to the network is two successive images from a monocular camera. The network estimates the depth in the first image and the camera motion; Fig. 2: DeMoN takes an image pair as input and predicts the depth map of the first 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include spatial transformation of images and reprojection of the depth points as taught by Uhrig et al. into the semi-supervised depth model as taught by Kuznietsov et al. in order to improve depth accuracy and minimize reprojection error (Uhrig et al., Pg. 6, first and second paragraphs of right-hand column).
The combination of Kuznietsov et al. and Uhrig et al. does not expressly disclose the following limitation underlined above: wherein the first stage is a self-supervised structure from motion (SfM) training process that accounts for motion of a camera between the training images to cause the depth model to learn how to infer depths without annotated training data

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include a self-supervised structure from motion (SfM) as taught by Godard et al. into the combined depth model of Kuznietsov et al. and Uhrig et al. in order to generate a useful sparse training signal for both camera pose and depth (Godard et al., Pg. 2, last paragraph of left-hand column).
Regarding claim 16, Kuznietsov et al. teaches, the method of claim 14 (Pg. 7, first paragraph of right-hand column: our semi-supervised learning method converges much faster (in about one third the number of iterations) than purely supervised training; Pg. 2: first paragraph under “3. Approach”: we base our approach on supervised as well as unsupervised principles for learning single image depth map prediction; Pg. 8: first paragraph under “5. Conclusions”: we propose a novel semi-supervised deep learning approach to monocular depth map prediction; see claim 14 for more details), wherein the first stage includes producing first stage loss values from a first stage loss function that includes a photometric loss function and a depth smoothness loss function that separately account for pixel-level similarities and irregularities along edge regions between a synthesized image derived from depth predictions of the depth model and a target image of a respective one of the pairs.
Kuznietsov et al. does not expressly disclose the following limitations in claim 13 from which claim 14 and thus claim 16 depends: 1) generating a transformation from the first image and a second image of the pair using a pose model, the transformation defining a relationship between the pair of training images, and 2) at least in part, on reprojecting the depth map and the depth data onto an image space of the second image according to at least the transformation, and update the depth model and the pose model according to at least the supervised loss.
However, Uhrig et al. teaches, generating a transformation from the first image and a second image of the pair using a pose model, the transformation defining a relationship between the pair of training images (Fig. 1: input to the network is two successive images from a monocular camera. The network estimates the depth in the first image and the camera motion; Fig. 2: DeMoN takes an image pair as input and predicts the depth map of the first image and the relative pose of the second camera; Fig. 2: egomotion estimation r, t). Uhrig et al. also teaches, at least in part, on reprojecting the depth map and the depth data onto an image space of the second image according to at least the transformation (Abstract: a crucial component of the approach is a training loss based on spatial relative difference; Note: training loss is derived by spatial transformation and reprojecting the depth data to the original image depth data; Fig. 1: egomotion R, t; Pg. 6, second paragraph under “6.2. Error metrics”: for 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include spatial transformation of images and reprojecting the depth map and the depth data onto an image space as taught by Uhrig et al. into the semi-supervised depth model as taught by Kuznietsov et al. in order to improve depth accuracy and minimize reprojection error (Uhrig et al., Pg. 6, first and second paragraphs of right-hand column).
The combination of Kuznietsov et al. and Uhrig et al. does not expressly disclose the following limitation underlined above: wherein the first stage includes producing first stage loss values from a first stage loss function that includes a photometric loss function and a depth smoothness loss function that separately account for pixel-level similarities and irregularities along edge regions between a synthesized image derived from depth predictions of the depth model and a target image of a respective one of the pairs.
However, Godard et al. teaches, wherein the first stage includes producing first stage loss values from a first stage loss function that includes a photometric loss function and a depth smoothness loss function (Pg. 3, second paragraph of right-hand column: we also formulate our problem as the minimization of a photometric reprojection error at training time; Pg. 3: equations 1-3; Pg. 3: first paragraph of right-hand column: classical binocular and multi-view stereo methods typically address this ambiguity by enforcing smoothness in the depth maps; Pg. 5, paragraph under “Final Training Loss”: we combine our per-pixel smoothness and masked photometric losses as, L = µLp + λLs) that separately account for pixel-level similarities and irregularities along edge regions between a synthesized image derived from depth predictions of the depth model and a target image of a respective one of the pairs (Pg. 3: second paragraph of right-hand column: We express the relative pose for each source view It’, with respect to the target image It’s pose, as Tt → t’. We predict a dense depth map Dt that minimizes the photometric reprojection error Lp; Pg. 3: equations 1-3; Pg. 5, paragraph under “Final Training Loss”: we combine our per-pixel smoothness and masked photometric losses as, L = µLp + λLs, and average over each pixel, scale, and batch; Pg. 5, paragraph under “Final Training Loss”: edge-aware smoothness Ls; Pg. 12, equation 7).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include a photometric loss function and a depth smoothness loss function as taught by Godard et al. into the combined depth model of Kuznietsov et al. and Uhrig et al. in order to improve depth estimation performance and improve synthetic image output (Godard et al., Pg. 3, paragraph under “Appearance Based Losses”).
Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA  as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA . A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b). 
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA /25, or PTO/AIA /26) should be used. A web-based eTerminal Disclaimer may be filled out completely 
Claims 1, 3-4, 6-9, 13, 15-16 and 18-20 are provisionally rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1, 3-5, 7-9, 13, 15-17 and 19-20 of copending Application No. 16/701,515 (reference application). Although the claims at issue are not identical, they are not patentably distinct from each other because of variation of wording.
This is a provisional nonstatutory double patenting rejection because the patentably indistinct claims have not in fact been patented.
Instant application, Claims 1, 3-4, 6-9, 13, 15-16 and 18-20 | Application 16/701,515, Claims 1, 3-5, 7-9, 13, 15-17 and 19-20
1 (independent claim) | 1 (independent claim)
3 | 5
4 | 4
6 | 3
7 | 8
8 | 7
9 (independent claim) | 9 (independent claim)
13 (independent claim) | 13 (independent claim)
15 | 17
16 | 16
18 | 15
19 | 19
20 | 20

Claim 1 of the instant application is anticipated by claim 1 of application 16/701,515. Claim 1 of both applications disclose a depth system for training of a depth model. The training of a depth model in claim 1 of the instant application is broad, whereas the training of a depth 
Claim 3 of the instant application is anticipated by claim 5 of application 16/701,515. Claim 3 of the instant application and claim 5 of application 16/701,515 disclose a self-supervised structure from motion (SfM) training process in which the depth model learns how to infer depths without annotated training data, with slight variation in wording.
Claim 4 of the instant application is anticipated by claim 4 of application 16/701,515. Claim 4 of both applications disclose producing first stage loss values, including photometric loss and depth smoothness loss, with slight variation in wording.
Claim 6 of the instant application is anticipated by claim 3 of application 16/701,515. Claim 6 of the instant application and claim 3 of application 16/701,515 disclose training or adapting the depth model based on the difference in depths (i.e. loss), with slight variation in wording.
Claim 7 of the instant application is anticipated by claim 8 of application 16/701,515. Claim 7 of the instant application and claim 8 of application 16/701,515 disclose generating a photometric loss, with slight variation in wording.
Claim 8 of the instant application is anticipated by claim 7 of application 16/701,515. Claim 8 of the instant application and claim 7 of application 16/701,515 disclose transformation of images. The transformation of images in claim 8 of the instant application is broad, whereas the transformation of images in claim 7 of application 16/701,515 is specific (i.e. a rigid-body transformation).
Claim 9 of the instant application is anticipated by claim 9 of application 16/701,515. Claim 9 of both applications disclose a non-transitory computer-readable medium of training of a depth model for monocular depth estimation. The training of a depth model in claim 9 of the instant application is broad, whereas the training of a depth model in claim 9 of application 16/701,515 is specific (i.e. self-supervised training and weakly supervised training).
Claim 13 of the instant application is anticipated by claim 13 of application 16/701,515. Claim 13 of both applications disclose a method of training of a depth model for monocular depth estimation. The training of a depth model in claim 13 of the instant application is broad, whereas the training of a depth model in claim 13 of application 16/701,515 is specific (i.e. self-supervised training and weakly supervised training).
Claim 15 of the instant application is anticipated by claim 17 of application 16/701,515. Claim 15 of the instant application and claim 17 of application 16/701,515 disclose a self-supervised structure from motion (SfM) training process in which the depth model learns how to infer depths without annotated training data, with slight variation in wording.
Claim 16 of the instant application is anticipated by claim 16 of application 16/701,515. Claim 16 of both applications disclose producing first stage loss values, including photometric loss and depth smoothness loss, with slight variation in wording.
Claim 18 of the instant application is anticipated by claim 15 of application 16/701,515. Claim 18 of the instant application and claim 15 of application 16/701,515 disclose training or adapting the depth model based on the difference in depths (i.e. loss), with slight variation in wording.
Claim 19 of the instant application is anticipated by claim 19 of application 16/701,515. Claim 19 of both applications disclose transformation of images. The transformation of images in claim 19 of the instant application is broad, whereas the transformation of images in claim 19 of application 16/701,515 is specific (i.e. a rigid-body transformation).
Claim 20 of the instant application is anticipated by claim 20 of application 16/701,515. Claim 20 of both applications disclose generating a photometric loss, with slight variation in wording. 
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Mousavian et al. (US Patent No. 10,580,158 B1) teaches a dense depth estimation of image data. Lukierski et al. (US 2018/0189565 A1) teaches using the monocular multi-directional camera device, a sequence of images are obtained at different angular positions during the instructed movement and pose data is determined (Abstract).
Contact Information
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Daniella M. DiGuglielmo whose telephone number is 571-272-2682.  The examiner can normally be reached on Monday - Friday 7:30 AM - 5:00 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/Daniella M. DiGuglielmo/ Examiner, Art Unit 2664
/NANCY BITAR/ Primary Examiner, Art Unit 2664