Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-2, 5, 8, 10, 13, 14, 17, 20 and 22 are rejected under 35 U.S.C. 103 as being unpatentable over US 2021/0150203 A1 to Liu et al., hereinafter, “Liu”, in view of US 2019/0005718 A1 to Zhou et al., hereinafter, “Zhou”, and “Unsupervised Learning of Depth and Ego-Motion from Video” to Zhou et al., hereinafter, “Zhou2”.
Claim 1. A method of map construction using a video sequence captured on a camera of a vehicle in an environment, comprising: Liu [0002] teaches generating parametric top-view representation of road scenes and more particularly to systems and methods of capturing and converting perspective video frames into a top-down view of complex road scenes.

Liu [0066] teaches the perception of a road scene involves capturing a perspective image using a camera (e.g., a digital video camera), where the image(s) can be made up of an array (i.e., row and column) of pixels that can be analyzed pixel by pixel. In various embodiments, a plurality of images can be captured in sequence as a video, where the images are digital images made up of row by column pixels. The digital images can be captured and recorded by a digital camera. The camera can be mounted forward-facing in a vehicle.

receiving a video sequence from the camera, the video sequence comprising a plurality of image frames capturing a scene of the environment of the vehicle; Liu [0025] teaches a sequence of images 101, 102, 103, 104 of a roadway 110 can be captured as a video, where the images 101, 102, 103, 104 can be a perspective view of a roadway 110.

Liu fails to explicitly teach predicting a ray surface. However, Zhou, in the same field of training depth images, teaches using a neural camera model to predict a depth map and a ray surface for the plurality of image frames in the received video sequence; Zhou [Abstract] teaches performing pairwise feature matching on the plurality of images, generating a corresponding eigen matrix according to the pairwise feature matching, and performing noise processing on the eigen matrix; performing 3D reconstruction according to the feature matching and the noise-processed eigen matrix and based on a ray model, to generate a 3D feature point cloud and a reconstructed camera pose set.

Zhou [0029] teaches it can be understood that, compared with pixel-based plane models in the related art, the present disclosure can be applied to various types of camera (such as a panoramic type, a fisheye type, a plane type, etc.) and unify them by using the ray model.

Zhou [0037] teaches In detail, the models of the camera (such as a panoramic model, a fisheye model, a plane model, etc.) can be acquired first, and then the corresponding ray models can be defined according to the models of the camera. It should be noted that, the ray model may be defined based on a fact that each ray r can be defined by an origin point and another point x(x, y, z), $x^2 + y^2 + z^2 = 1$, on a unit ball. The ray is one-to-one corresponding to an image coordinate u(u, v) through a mapping function. The mapping function can be defined as $x = k(u, K)$, $u = k^{-1}(x, K)$, where K is internal parameters of the camera.
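
For illustration only, the mapping function quoted above from Zhou [0037] can be sketched for one camera type. The sketch below assumes a plane (pinhole) camera whose internal parameters K are focal lengths fx, fy and a principal point cx, cy; the function names and numeric values are hypothetical and are not taken from Zhou, and other camera types (fisheye, panoramic) would need a different k.

```python
import math

def pixel_to_ray(u, v, fx, fy, cx, cy):
    """Assumed pinhole form of x = k(u, K): map a pixel to a point on the unit ball."""
    x = (u - cx) / fx
    y = (v - cy) / fy
    z = 1.0
    norm = math.sqrt(x * x + y * y + z * z)
    # Normalizing enforces x^2 + y^2 + z^2 = 1.
    return (x / norm, y / norm, z / norm)

def ray_to_pixel(ray, fx, fy, cx, cy):
    """Assumed pinhole form of u = k^-1(x, K): map a unit ray back to a pixel."""
    x, y, z = ray
    if z <= 0:
        raise ValueError("ray does not intersect the image plane")
    return (fx * x / z + cx, fy * y / z + cy)

# Hypothetical internal parameters K, chosen only for this example.
K = dict(fx=500.0, fy=500.0, cx=320.0, cy=240.0)

ray = pixel_to_ray(100.0, 50.0, **K)
pixel = ray_to_pixel(ray, **K)
center_ray = pixel_to_ray(320.0, 240.0, **K)
```

The round trip from pixel to ray and back recovers the original image coordinate, which is the one-to-one correspondence the quoted passage describes.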

Zhou2, in the same field of training depth images, teaches and constructing a map of the scene of the environment based on image data captured in the plurality of frames and depth information in the predicted depth maps. Zhou2 Figure 2. Overview of the supervision pipeline based on view synthesis. The depth network takes only the target view as input, and outputs a per-pixel depth map $\hat{D}_t$. The pose network takes both the target view ($I_t$) and the nearby/source views (e.g., $I_{t-1}$ and $I_{t+1}$) as input, and outputs the relative camera poses ($\hat{T}_{t \to t-1}$, $\hat{T}_{t \to t+1}$). The outputs of both networks are then used to inverse warp the source views (see Sec. 3.2) to reconstruct the target view, and the photometric reconstruction loss is used for training the CNNs.
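
For illustration only, the geometric core of the inverse warp named in the Zhou2 Figure 2 caption can be sketched as follows. The sketch assumes a pinhole camera with known intrinsics and a relative pose given as a rotation matrix and translation vector, and it locates where a target-view pixel with a predicted depth lands in a source view. The function name, the intrinsics and the pose values are hypothetical; sampling the source image at the returned location and comparing it with the target pixel would give the photometric term.

```python
def warp_target_pixel(u, v, depth, intrinsics, rotation, translation):
    """Project a target-view pixel into a source view (assumed pinhole model).

    intrinsics  : (fx, fy, cx, cy), hypothetical camera parameters
    rotation    : 3x3 rotation of the relative pose, target -> source
    translation : 3-vector of the relative pose, target -> source
    Returns (u_s, v_s) in the source image, or None if the point is behind it.
    """
    fx, fy, cx, cy = intrinsics
    # Back-project the pixel to a 3D point using its predicted depth.
    point_t = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    # Move the point into the source camera frame.
    point_s = tuple(
        sum(rotation[i][j] * point_t[j] for j in range(3)) + translation[i]
        for i in range(3)
    )
    if point_s[2] <= 0:
        return None
    # Project onto the source image plane.
    return (fx * point_s[0] / point_s[2] + cx, fy * point_s[1] / point_s[2] + cy)

INTRINSICS = (500.0, 500.0, 320.0, 240.0)
IDENTITY = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))

# With no relative motion the pixel maps to itself.
same = warp_target_pixel(100.0, 50.0, 10.0, INTRINSICS, IDENTITY, (0.0, 0.0, 0.0))
# A 0.5 unit sideways offset shifts the principal-point pixel horizontally.
shifted = warp_target_pixel(320.0, 240.0, 10.0, INTRINSICS, IDENTITY, (0.5, 0.0, 0.0))
```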

Thus, before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to combine the teachings of Liu, Zhou and Zhou2, which are in the same field of endeavor, to achieve effective depth training in image data (Zhou [0005] and Zhou2 [Abstract]).

Claim 2. Liu further teaches wherein predicting the depth map comprises performing the prediction under a constraint that predicted depths for corresponding pixels in the plurality of image frames are consistent across the plurality of image frames in the video sequence. Liu [0070] teaches at block 218, the location of objects within each image relative to each other can be identified using depth prediction/estimation.

Liu [0073] teaches a convolutional neural network (CNN) can predict dense depth, where for each pixel in perspective image space, a depth value is provided for it. Such depth value(s) represent the predicted absolute depth of this pixel in real world, e.g. 10.5 meters from the point in the image represented by this pixel in real world to camera. The CNN can take as input a perspective image with occluded regions (corresponding to foreground objects) masked out, and estimate the segmentation labels and depth values over the entire image.

Claim 5. Zhou further teaches further comprising using the plurality of image frames to train the neural camera model at the same time the neural camera model is used to predict the depth map and ray surface for the plurality of image frames. Zhou [Abstract] teaches performing pairwise feature matching on the plurality of images, generating a corresponding eigen matrix according to the pairwise feature matching, and performing noise processing on the eigen matrix; performing 3D reconstruction according to the feature matching and the noise-processed eigen matrix and based on a ray model, to generate a 3D feature point cloud and a reconstructed camera pose set.

Zhou [0029] teaches it can be understood that, compared with pixel-based plane models in the related art, the present disclosure can be applied to various types of camera (such as a panoramic type, a fisheye type, a plane type, etc.) and unify them by using the ray model.

Zhou [0037] teaches In detail, the models of the camera (such as a panoramic model, a fisheye model, a plane model, etc.) can be acquired first, and then the corresponding ray models can be defined according to the models of the camera. It should be noted that, the ray model may be defined based on a fact that each ray r can be defined by an origin point and another point x(x, y, z), $x^2 + y^2 + z^2 = 1$, on a unit ball. The ray is one-to-one corresponding to an image coordinate u(u, v) through a mapping function. The mapping function can be defined as $x = k(u, K)$, $u = k^{-1}(x, K)$, where K is internal parameters of the camera.

Claim 8. Zhou further teaches wherein predicting the ray surfaces comprises performing the prediction under a constraint that predicted ray surfaces for corresponding pixels in the plurality of image frames are consistent across the plurality of image frames in the video sequence. Zhou [Abstract] teaches performing pairwise feature matching on the plurality of images, generating a corresponding eigen matrix according to the pairwise feature matching, and performing noise processing on the eigen matrix; performing 3D reconstruction according to the feature matching and the noise-processed eigen matrix and based on a ray model, to generate a 3D feature point cloud and a reconstructed camera pose set.

Zhou [0037] teaches In detail, the models of the camera (such as a panoramic model, a fisheye model, a plane model, etc.) can be acquired first, and then the corresponding ray models can be defined according to the models of the camera. It should be noted that, the ray model may be defined based on a fact that each ray r can be defined by an origin point and another point x(x, y, z), $x^2 + y^2 + z^2 = 1$, on a unit ball. The ray is one-to-one corresponding to an image coordinate u(u, v) through a mapping function. The mapping function can be defined as $x = k(u, K)$, $u = k^{-1}(x, K)$, where K is internal parameters of the camera.

Claim 10. Zhou2 further teaches further comprising using the camera model to predict a depth map and a ray surface for the plurality of image frames for each of a plurality of different video sequences to train the neural camera model independently on each of the different video sequences. Zhou2 [Introduction] teaches in this work, we mimic this approach by training a model that observes sequences of images and aims to explain its observations by predicting likely camera motion and the scene structure (as shown in Fig. 1). We take an end-to-end approach in allowing the model to map directly from input pixels to an estimate of ego-motion (parameterized as 6-DoF transformation matrices) and the underlying scene structure (parameterized as per-pixel depth maps under a reference view).
Zhou2 Figure 2. Overview of the supervision pipeline based on view synthesis. The depth network takes only the target view as input, and outputs a per-pixel depth map $\hat{D}_t$. The pose network takes both the target view ($I_t$) and the nearby/source views (e.g., $I_{t-1}$ and $I_{t+1}$) as input, and outputs the relative camera poses ($\hat{T}_{t \to t-1}$, $\hat{T}_{t \to t+1}$). The outputs of both networks are then used to inverse warp the source views (see Sec. 3.2) to reconstruct the target view, and the photometric reconstruction loss is used for training the CNNs.
Claim 13. It differs from claim 1 in that it is a system for map construction using a video sequence captured on a camera of a vehicle in an environment, the system comprising: a non-transitory memory configured to store instructions; a processor configured to execute the instructions to perform the method of claim 1. Therefore claim 13 has been analyzed and reviewed in the same way as claim 1. See the above analysis. 
Claim 14. It differs from claim 2 in that it is a system for map construction using a video sequence captured on a camera of a vehicle in an environment, the system comprising: a non-transitory memory configured to store instructions; a processor configured to execute the instructions to perform the method of claim 2. Therefore claim 14 has been analyzed and reviewed in the same way as claim 2. See the above analysis. 
Claim 17. It differs from claim 5 in that it is a system for map construction using a video sequence captured on a camera of a vehicle in an environment, the system comprising: a non-transitory memory configured to store instructions; a processor configured to execute the instructions to perform the method of claim 5. Therefore claim 17 has been analyzed and reviewed in the same way as claim 5. See the above analysis. 
Claim 20. It differs from claim 8 in that it is a system for map construction using a video sequence captured on a camera of a vehicle in an environment, the system comprising: a non-transitory memory configured to store instructions; a processor configured to execute the instructions to perform the method of claim 8. Therefore claim 20 has been analyzed and reviewed in the same way as claim 8. See the above analysis.

Claim 22. It differs from claim 10 in that it is a system for map construction using a video sequence captured on a camera of a vehicle in an environment, the system comprising: a non-transitory memory configured to store instructions; a processor configured to execute the instructions to perform the method of claim 10. Therefore claim 22 has been analyzed and reviewed in the same way as claim 10. See the above analysis. 

Claims 3, 6, 7, 9, 12, 15, 18, 21 and 24 are rejected under 35 U.S.C. 103 as being unpatentable over US 2021/0150203 A1 to Liu et al., hereinafter, “Liu”, in view of US 2019/0005718 A1 to Zhou et al., hereinafter, “Zhou”, and “Unsupervised Learning of Depth and Ego-Motion from Video” to Zhou et al., hereinafter, “Zhou2”, and in further view of US 2020/0258249 A1 to Angelova et al., hereinafter, “Angelova”.
Claim 3. The combination of Liu, Zhou, and Zhou2 is silent as to the limitation of claim 3; however, Angelova, in the same field of depth prediction in image data, teaches further comprising using the neural camera model to estimate ego motion between a first image frame and a second image frame to determine displacement relative to objects in the scene. Angelova [0020] teaches the subject matter described in this specification is generally directed to a training scheme for unsupervised learning of depth and camera motion (or ego-motion) from a sequence of images, e.g., frames of a video captured by a camera of a robotic agent (i.e. a monocular video).

Thus, before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to modify the teachings of Liu with the teachings of Angelova to predict depth in image data (Angelova [0006]-[0022]).


Claim 6. Angelova further teaches, further comprising using a neural camera model to predict a pose of the camera. Angelova [0040] teaches the subset of images includes three or more images from the sequence of images 106. For example, the subset of images may include images $X_{t-2}$, $X_{t-1}$, and $X_t$. In this example, given the subset of images, the camera motion network 104 generates a camera motion output that represents the camera's movement from time t−2 to time t.

Claim 7. Angelova further teaches wherein using a neural camera model to predict a depth map and a ray surface for the plurality of image frames in the received video sequence comprises passing each frame of the video sequence through the neural camera model individually to train the neural camera model and to provide depth and ray surface predictions for each image. Angelova [Abstract] teaches receive a sequence of images and, for each image in the sequence: process the image in accordance with a current internal state of the recurrent neural network to (i) update the current internal state and (ii) generate a depth output that characterizes a predicted depth of a future image in the sequence.

Angelova [0008] teaches the system may further include an image generation subsystem configured to, for each image in the sequence: receive the depth output that characterizes the predicted depth of the future image, and generate a prediction of the future image using the depth output. The depth output may include a predicted depth value for each pixel of a plurality of pixels in the future image that represents a respective distance of a scene depicted at the pixel from a focal plane of the future image.

Angelova [0030] teaches the image prediction system 100 is configured to receive a sequence of images 106 and to process the sequence of images 106 to generate, for each image in the sequence, an output image that is a prediction of a future image in the sequence of images.

Claim 9. Angelova further teaches wherein the video sequence comprises a portion of an entire video file. Angelova [0030] teaches the image prediction system 100 is configured to receive a sequence of images 106 and to process the sequence of images 106 to generate, for each image in the sequence, an output image that is a prediction of a future image in the sequence of images. For example, the sequence of images 106 may include frames of video being captured by the camera of a robotic agent and a future image may be an image that will be captured by the camera of the robotic agent in the future. A future image can be, for example, an image that immediately follows the current image in the sequence, an image that is three images after the current image in the sequence, or an image that is five images after the current image in the sequence.

Claim 12. Angelova further teaches wherein the neural camera model is configured to learn a pixel-wise ray surface that enables learning depth and pose estimates in a self-supervised way. Angelova [0036] teaches once the neural network 102 has generated the depth map $D_{k-1}$, the subsystem 104 uses the depth map $D_{k-1}$ and the current image $X_{k-1}$ to construct multiple three-dimensional (3D) points, each 3D point corresponding to a different pixel in the current image $X_{k-1}$. In particular, for each pixel in the multiple pixels in the current image, the subsystem uses (i) the x and y coordinates of the pixel, and (ii) the pixel's depth value obtained from the depth map $D_{k-1}$ in order to construct a 3D point. The newly constructed 3D points form a point cloud C. Each point has x, y, z coordinates, in which the x and y coordinates of the 3D point in the point cloud C are determined based on the x and y coordinates of the pixel in the current image, and the z coordinate of the 3D point is determined based on the depth value of the pixel. The 3D point is assigned the same pixel values (e.g. RGB values) as its corresponding pixel in the current image $X_{k-1}$.

Angelova [0038] teaches camera motion of the camera can be computed based on a given sequence of camera pose vectors $\{P_1, P_2, \ldots, P_k\}$. A camera pose vector $P_i$ represents a position and orientation of the camera at time step i. Specifically, a camera pose vector $P_i$ includes a 3D position and 3D orientation, i.e. yaw, pitch, and roll angles, of the camera at time step i. To predict the depth map $D_k$ of the future image $X_k$, the subsystem 104 computes, based on camera pose vectors $P_{k-1}$ and $P_k$, the camera motion between frames $X_{k-1}$ and $X_k$. The computed camera motion includes three translation components $t_x$, $t_y$, $t_z$ and three rotation components $r_x$, $r_y$, $r_z$. The subsystem 104 then computes, based on the camera motion between frame $X_{k-1}$ and $X_k$, new coordinates and orientation of the camera at time step k. Given the new coordinates and orientation of the camera, the subsystem 104 projects the point cloud C to a plane that is at a predetermined distance from the camera and is orthogonal to the camera's principal axis, which is formed by the yaw, pitch, and roll orientation angles of the camera. The subsystem 104 then updates the depth value of each projected point in the plane based on a respective newly-calculated distance from its corresponding 3D point in the point cloud C to the plane. The obtained projected points in the plane form the future depth map $D_k$ of the future frame $X_k$. The subsystem 104 then creates a prediction of the future frame $X_k$ by painting each of the projected points in the plane with the respective pixel values, such as RGB values, that were assigned to its corresponding 3D point in the point cloud C.
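
For illustration only, the two Angelova passages above (building a point cloud from a depth map, then projecting it after camera motion to obtain a future depth map and a painted future frame) can be sketched as follows. The sketch makes simplifying assumptions that are not taken from Angelova: a pinhole back-projection with hypothetical intrinsics fx, fy, cx, cy, a camera motion consisting of translation only (the rotation components are omitted), and nearest-pixel rounding with the closest point kept when several points land on one pixel. All function names are hypothetical.

```python
def depth_map_to_point_cloud(depth, image, fx, fy, cx, cy):
    """Build a colored point cloud: one 3D point per pixel (assumed pinhole model)."""
    cloud = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            x = (u - cx) / fx * z
            y = (v - cy) / fy * z
            # Each point keeps the pixel value of the pixel it came from.
            cloud.append(((x, y, z), image[v][u]))
    return cloud

def predict_future_view(cloud, translation, width, height, fx, fy, cx, cy):
    """Project the cloud into a camera moved by `translation` (rotation omitted)."""
    tx, ty, tz = translation
    future_depth = [[None] * width for _ in range(height)]
    future_image = [[None] * width for _ in range(height)]
    for (x, y, z), color in cloud:
        # Coordinates of the point relative to the moved camera.
        xn, yn, zn = x - tx, y - ty, z - tz
        if zn <= 0:
            continue  # point is behind the new camera position
        u = round(fx * xn / zn + cx)
        v = round(fy * yn / zn + cy)
        if 0 <= u < width and 0 <= v < height:
            # Keep the nearest point when several project to the same pixel.
            if future_depth[v][u] is None or zn < future_depth[v][u]:
                future_depth[v][u] = zn
                future_image[v][u] = color
    return future_depth, future_image

# Hypothetical 3x3 example: a flat wall 4 units away, camera moves 1 unit forward.
depth = [[4.0] * 3 for _ in range(3)]
image = [[(10 * r + c, 0, 0) for c in range(3)] for r in range(3)]
params = dict(fx=1.0, fy=1.0, cx=1.0, cy=1.0)

cloud = depth_map_to_point_cloud(depth, image, **params)
still_depth, still_image = predict_future_view(cloud, (0.0, 0.0, 0.0), 3, 3, **params)
moved_depth, moved_image = predict_future_view(cloud, (0.0, 0.0, 1.0), 3, 3, **params)
```

In this toy case the forward motion reduces every depth value from 4 to 3, while the painted pixel values stay attached to the 3D points they were assigned to.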

Angelova [0040] teaches the image depth prediction neural network is configured to receive a sequence of images and to generate, for each image in the sequence, a depth map that characterizes a current depth of the current image or a predicted depth of a future image. The image depth prediction neural network may be a neural network which is trained using supervised training employing image sequences associated with ground truth depth maps.

Claim 15. It differs from claim 3 in that it is a system for map construction using a video sequence captured on a camera of a vehicle in an environment, the system comprising: a non-transitory memory configured to store instructions; a processor configured to execute the instructions to perform the method of claim 3. Therefore claim 15 has been analyzed and reviewed in the same way as claim 3. See the above analysis. 
Claim 18. It differs from claim 6 in that it is a system for map construction using a video sequence captured on a camera of a vehicle in an environment, the system comprising: a non-transitory memory configured to store instructions; a processor configured to execute the instructions to perform the method of claim 6. Therefore claim 18 has been analyzed and reviewed in the same way as claim 6. See the above analysis. 

Claim 21. It differs from claim 9 in that it is a system for map construction using a video sequence captured on a camera of a vehicle in an environment, the system comprising: a non-transitory memory configured to store instructions; a processor configured to execute the instructions to perform the method of claim 9. Therefore claim 21 has been analyzed and reviewed in the same way as claim 9. See the above analysis. 

Claim 24. It differs from claim 12 in that it is a system for map construction using a video sequence captured on a camera of a vehicle in an environment, the system comprising: a non-transitory memory configured to store instructions; a processor configured to execute the instructions to perform the method of claim 12. Therefore claim 24 has been analyzed and reviewed in the same way as claim 12. See the above analysis.

Claims 4 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over US 2021/0150203 A1 to Liu et al., hereinafter, “Liu”, in view of US 2019/0005718 A1 to Zhou et al., hereinafter, “Zhou”, and “Unsupervised Learning of Depth and Ego-Motion from Video” to Zhou et al., hereinafter, “Zhou2”, and in further view of US 2020/0258249 A1 to Angelova et al., hereinafter, “Angelova”, and US 2021/0065391 A1 to Tran et al., hereinafter, “Tran”.
Claim 4. Tran, in the same field of depth prediction in image data, teaches wherein optimizing further comprises using ego motion estimated between two frames to transfer depth information from the first image frame to the second image frame. Tran [0053] teaches here, $d_{c \to k1}^{i}(w)$ is the transferred depth of the i-th keypoint from frame c to frame k1.

Thus, before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to modify the teachings of Liu and Angelova with the teachings of Tran to improve monocular geometric SLAM and to benefit from the rapid advances in unsupervised monocular depth prediction approaches (Tran [0020] and [0028]).

Claim 16. It differs from claim 4 in that it is a system for map construction using a video sequence captured on a camera of a vehicle in an environment, the system comprising: a non-transitory memory configured to store instructions; a processor configured to execute the instructions to perform the method of claim 4. Therefore claim 16 has been analyzed and reviewed in the same way as claim 4. See the above analysis. 

Claims 11 and 23 are rejected under 35 U.S.C. 103 as being unpatentable over US 2021/0150203 A1 to Liu et al., hereinafter, “Liu”, in view of US 2019/0005718 A1 to Zhou et al., hereinafter, “Zhou”, and “Unsupervised Learning of Depth and Ego-Motion from Video” to Zhou et al., hereinafter, “Zhou2”, and in further view of “Depth from Videos in the Wild: Unsupervised Monocular Depth Learning from Unknown Cameras” to Gordon et al., hereinafter, “Gordon”.
Claim 11. Gordon, in the field of learning depth in image data, teaches wherein predicting is performed without a known or calibrated camera model for the camera. Gordon [Abstract] teaches learn the camera intrinsic parameters, including lens distortion, from video in an unsupervised manner, thereby allowing us to extract accurate depth and motion from arbitrary videos of unknown origin at scale.

Thus, before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to modify the teachings of Liu with the teachings of Gordon to simultaneously learn depth, egomotion, object motion, and camera intrinsics from monocular videos, using only consistency across neighboring video frames as a supervision signal (Gordon [Abstract]).

Claim 23. It differs from claim 11 in that it is a system for map construction using a video sequence captured on a camera of a vehicle in an environment, the system comprising: a non-transitory memory configured to store instructions; a processor configured to execute the instructions to perform the method of claim 11. Therefore claim 23 has been analyzed and reviewed in the same way as claim 11. See the above analysis. 
Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to DELOMIA L GILLIARD whose telephone number is (571)272-1681. The examiner can normally be reached 8am-5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Vincent Rudolph, can be reached at (571) 272-8243. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/DELOMIA L GILLIARD/Primary Examiner, Art Unit 2661