Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-2, 4-6, 11-13, and 15-17 are rejected under 35 U.S.C. 103 as being unpatentable over Lin et al. (US 20160350930) in view of McCormac et al. (“SemanticFusion: Dense 3D Semantic Mapping with Convolutional Neural Networks”).
	Re claim 1, Lin teaches a depth information generation method implemented by a computer system having at least one processor configured to execute computer-readable instructions included in a memory, the method comprising:
by the at least one processor, (see [0092], computing device with processor and memory).
calculating distance information of an object on a map using the map corresponding to a street view image (see [0028], wherein each pixel in the image is assigned a depth value representing an absolute distance value between the camera and objects represented by the pixel), (see [0039-0040], in reference to Fig. 4, wherein the input image 112 is used to obtain depth layout 304), (see [0045], street view), and (see [0026]: “In the following, a complementary effect from the typical failure cases of the two tasks is observed, which leads to a description of a unified coarse-to-fine framework for joint semantic segmentation and depth estimation that is usable for a single image. For example, a framework is proposed that first predicts a coarse global model composed of semantic labels and depth values (e.g., absolute depth values) through machine learning to represent an overall context of an image. The semantic labels describe "what" is being represented by respective pixels in the image, e.g., sky, plant, ground, wall, building, and so forth. The depth values describe a distance between a camera used to capture a scene in the image and respective objects in the scene represented by the pixel, e.g., a "z" distance in the scene captured by the image”);

generating depth information of the street view image based on the distance (see [0042], in reference to Fig. 4, wherein the merge calculation module merges depth layouts to form depth map 122) and (see [0026], quoted above, wherein the depth values describe a distance between a camera used to capture a scene in the image and respective objects in the scene represented by the pixel).
	Lin does not explicitly teach wherein the map is a 2D map, and wherein generating depth information on the image is based on the semantic information.
	However, McCormac teaches wherein the map is a 2D map (see p. 4628, in reference to Figs. 1 and 2, wherein the input data is 2D frames: “Our approach is to use the SLAM system to provide correspondence from the 2D frame into a globally consistent 3D map”).
McCormac further teaches wherein generating depth information on the image is based on the semantic information (see p. 4628, I. Introduction: “what is the distance between the lectern and its nearest chair?  In this work, we combine the geometric information from a state-of-the-art SLAM system…with recent advances in semantic segmentation using Convolution Neural Networks (CNN)”) and (see Figs. 1-2, in reference to p. 4629, III. Method, wherein, from input 2D data, the SLAM reconstruction (distance information) and the CNN probability maps (semantic information) are used to generate a Semantically Fused Dense Reconstruction (generated depth information based on distance and semantic information)).
	Lin and McCormac teach claim 1. It would have been obvious to one of ordinary skill in the art at the time of filing to modify Lin’s depth and semantic segmentation system to explicitly include generating depth information based on semantic information, as taught by McCormac, as the references are in the analogous art of semantic segmentation of input images. An advantage of the modification is that it achieves the result of explicitly using input 2D images and generating a semantically fused dense reconstruction of the input data using semantic information with distance information of objects within the input 2D map.
	Re claim 2, Lin and McCormac teach claim 1. Furthermore, Lin teaches wherein the calculating of the distance information comprises calculating a distance from the object based on location information included in the street view image on the 2D map (see [0026], wherein depth values describe a distance between a camera and respective objects in the scene represented by the pixel (location of objects in the captured image)).
	Re claim 4, Lin and McCormac teach claim 1. Furthermore, Lin teaches wherein extracting of the semantic information comprises generating an object mask as depth information in a vertical direction by extracting a portion corresponding to the object in the street view image (see [0026]: “The semantic labels describe "what" is being represented by respective pixels in the image, e.g., sky, plant, ground, wall, building, and so forth. The depth values describe a distance between a camera used to capture a scene in the image and respective objects in the scene represented by the pixel, e.g., a "z" distance in the scene captured by the image”) and (see [0040], in reference to Fig. 3, semantic layout 302 labeling pixels in a vertical direction, such as sky, building, and ground, to create an object mask).
	Re claim 5, Lin and McCormac teach claim 1. Furthermore, Lin teaches wherein the extracting of the semantic information comprises extracting the semantic information by classifying each pixel of the street view image into a background and the object through a semantic segmentation scheme (see [0026] and [0040], in reference to Fig. 3, wherein sky and ground are considered segmented backgrounds from a building object).
	Re claim 6, Lin and McCormac teach claim 4. Furthermore, Lin teaches wherein the generating of the object mask comprises generating the depth information of the street view image by inserting the distance information as depth information in a horizontal direction (see [0026], wherein the distance between a camera used to capture a scene in the image and respective objects in the scene is represented with a z depth distance as a “horizontal direction”) and (see Fig. 4, in reference to [0063-0064], wherein the depth map is generated).
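For illustration only, the limitations of claims 4 and 6 as characterized above can be sketched as follows. This is one possible reading and is not taken from the application, Lin, or McCormac: it assumes the object mask is a per-pixel binary grid and that one distance value is available per image column. The function name `fill_depth` and its parameters are hypothetical.

```python
def fill_depth(mask, column_distances, background=float("inf")):
    """Build a depth map from an object mask and per-column distances.

    mask: list of rows; a truthy cell marks an object pixel (assumed format).
    column_distances: one distance per image column (assumed format).
    Pixels outside the mask receive the `background` value.
    """
    width = len(column_distances)
    depth = []
    for row in mask:
        if len(row) != width:
            raise ValueError("each mask row must have one cell per column")
        # Object pixels take the distance of their column; others are background.
        depth.append([column_distances[c] if row[c] else background
                      for c in range(width)])
    return depth
```

In this sketch the mask supplies the vertical extent of the object in each column, and the distance supplies the value written into that extent.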

	Claim 11 recites limitations similar in scope to those of claim 1 and is rejected for at least the reasons above.
Re claim 12, Lin teaches a computer system comprising:
at least one processor configured to execute computer-readable instructions included in a memory (see [0092], computing device with processor and memory),
wherein the processor comprises:
a distance information calculator configured to calculate distance information of an object on a map using the map corresponding to a street view image (see [0028], wherein each pixel in the image is assigned a depth value representing an absolute distance value between the camera and objects represented by the pixel), (see [0039-0040], in reference to Fig. 4, wherein the input image 112 is used to obtain depth layout 304), and (see [0045], street view); and
a mask generator configured to generate an object mask that includes semantic information on the object from the street view image (see [0038-0039], in reference to Fig. 4, wherein input image 112 is used to obtain semantic layout 302 and create semantically labeled image 120), (see [0040], in reference to Fig. 3, semantic layout 302 labeling pixels such as sky, building, and ground to create an object mask), and (see [0026]: “The semantic labels describe "what" is being represented by respective pixels in the image, e.g., sky, plant, ground, wall, building, and so forth. The depth values describe a distance between a camera used to capture a scene in the image and respective objects in the scene represented by the pixel, e.g., a "z" distance in the scene captured by the image”).
	Lin does not explicitly teach wherein the map is a 2D map, and wherein generating depth information on the image is based on the semantic information.
	However, McCormac teaches wherein the map is a 2D map (see p. 4628, in reference to Figs. 1 and 2, wherein the input data is 2D frames: “Our approach is to use the SLAM system to provide correspondence from the 2D frame into a globally consistent 3D map”).
Wherein generating depth information on the image is based on the semantic information (see p. 4628, I. Introduction: “what is the distance between the lectern and its nearest chair?  In this work, we combine the geometric information from a state-of-the-art SLAM system…with recent advances in semantic segmentation using Convolution Neural Networks (CNN)”), and (see Fig. 1-2, in reference to p. 4629, III. Method, wherein from input 2D 
	Lin and McCormac teach claim 12. For motivation, see claim 1.
Claim 13 recites limitations similar in scope to those of claim 2 and is rejected for at least the reasons above.
Claims 15-17 recite limitations similar in scope to those of claims 4-6, respectively, and are rejected for at least the reasons above.
Claims 3 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Lin et al. (US 20160350930) in view of McCormac et al. (“SemanticFusion: Dense 3D Semantic Mapping with Convolutional Neural Networks”) and MonkeyProof Solutions (“How to calculate the shortest distance between a point and a line,” hereinafter “Monkey”).
Re claim 3, Lin and McCormac teach claim 1. Lin and McCormac teach objects located on a 2D map, but do not explicitly teach finding an intersection point with the object by projecting a virtual ray based on location information included in the street view image on the 2D map and calculating a distance from the location information to the intersection point.
However, Monkey teaches finding an intersection point with the object by projecting a virtual ray based on location information included in the street view image on the 2D map (see p. 1-2, wherein, at location P, an orthogonal projection “virtual ray” is projected to an object (line AB), and point C is the intersection point between the ray and the line object on a 2D coordinate map).
Calculating a distance from the location information to the intersection point (see p. 1, “The figure below shows how we construct the distance (red, dashed) and closes point C”), (see figure on p. 2, showing visually calculations of a location information P to an intersection point 
Lin, McCormac, and Monkey teach claim 3. It would have been obvious to one of ordinary skill in the art at the time of filing to modify Lin and McCormac’s mapping system including an input 2D image to explicitly include calculating distances between points on a 2D coordinate map, as taught by Monkey, as the reference is pertinent to the problem of solving for a distance between a point and an intersection point. An advantage of the modification is that it achieves the result of using known mathematical techniques to obtain the distance value between a location point on a 2D coordinate map and an intersection point of an object on the same 2D map.
Claim 14 recites limitations similar in scope to those of claim 3 and is rejected for at least the reasons above.
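For illustration only, the point-to-line calculation relied upon from Monkey (location P, line object AB, closest point C) can be sketched as a standard orthogonal projection. This is a minimal sketch of the well-known technique and is not code from the reference; the function names are hypothetical, and the line through A and B is treated as infinite.

```python
import math

def closest_point_on_line(p, a, b):
    """Orthogonally project point p onto the infinite line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    denom = dx * dx + dy * dy
    if denom == 0:
        raise ValueError("a and b must be distinct points")
    # t is the position of the projection along AB (0 at A, 1 at B).
    t = ((px - ax) * dx + (py - ay) * dy) / denom
    return (ax + t * dx, ay + t * dy)

def point_line_distance(p, a, b):
    """Distance from p to its orthogonal projection C on line AB."""
    cx, cy = closest_point_on_line(p, a, b)
    return math.hypot(p[0] - cx, p[1] - cy)
```

For example, with P = (0, 5), A = (-1, 0), and B = (1, 0), the projection C is (0, 0) and the distance is 5.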
Claims 7-8 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Lin et al. (US 20160350930) in view of McCormac et al. (“SemanticFusion: Dense 3D Semantic Mapping with Convolutional Neural Networks”) and Huston et al. (US 20180108172).
	Re claim 7, Lin and McCormac teach claim 1. Lin and McCormac do not explicitly teach, by the at least one processor, acquiring the street view image and the 2D map as data open through a map service. However, Huston teaches, by the at least one processor, acquiring the street view image and the 2D map as data open through a map service (see [0003], wherein images captured by moving vehicles comprise photographs and video images that users can access from a mapping service).
	Lin, McCormac, and Huston teach claim 7. It would have been obvious to one of ordinary skill in the art at the time of filing to modify Lin and McCormac’s mapping system to 
	Re claim 8, Lin, McCormac, and Huston teach claim 7. Furthermore, Huston teaches acquiring a 2D map of an area in which the street view image is captured based on location information included in the street view image (see [0007], wherein an image processing server is connected to the network for receiving the images and metadata, with the server processing the images to determine the location of various targets in the images). For motivation, see claim 7.
	Claim 18 recites limitations similar in scope to those of claims 7-8 and is rejected for at least the reasons above.
Claims 9 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Lin et al. (US 20160350930) in view of McCormac et al. (“SemanticFusion: Dense 3D Semantic Mapping with Convolutional Neural Networks”) and Lynch (US 9404764).
Re claim 9, Lin and McCormac teach claim 1. Lin and McCormac do not explicitly teach, by the at least one processor, preprocessing a spherical panoramic image that is the street view image through cropping for each unit angle.
However, Lynch teaches, by the at least one processor, preprocessing a spherical panoramic image that is the street view image through cropping for each unit angle (see col 2, lines 47-57: “The system 150 receives image data from at least one panoramic image or image bubble. The at least one image bubble may be collected by a camera. The image bubble may have a center point measured in Cartesian coordinates such as an X-coordinate, a Y-coordinate, and a Z-coordinate. Each point on the image bubble is defined by the center point and one or more angles (e.g., roll, pitch, yaw). By correlating the Cartesian space of the image associates one or more points of the optical data with one or more pixels in the image bubble”), (see col 2, lines 16-26, wherein Fig. 1b is generated from a plurality of image bubbles at a different perspective angle), and (see col 7, lines 38-51: “FIG. 10 illustrates a plurality of image bubbles 410a-e correlated with a street side view. Because of the panoramic nature of the image bubbles 410a-e, successive image bubbles overlap. That is, a single point in the real world, and accordingly in optical data 201, occurs in multiple image bubbles 410a-e. This principle is shown by building 411 in FIG. 10. Each of image bubbles 410a-e may provide a different perspective of building 411. Any of the algorithms for selecting the pixel values for the predefined two-dimensional plane, as described above, may be modified to include pixel values from a plurality of image bubbles. The pixel values from the plurality of image bubbles may be averaged. In other examples, pixel values from certain image bubbles may be ignored”).
Lynch teaches preprocessing a spherical panoramic image that is the street view (for example, in Fig. 10, spherical image bubbles corresponding to a panoramic view of the street image are shown) through cropping for each unit angle (see Fig. 1b, wherein each image strip 102 corresponds to a different perspective “unit angle” of the panorama, generated from image bubbles, and Fig. 10, wherein each image bubble 410a-e is a cropping at a “unit angle” of the panorama).
Lin, McCormac, and Lynch teach claim 9. It would have been obvious to one of ordinary skill in the art at the time of filing to modify Lin and McCormac’s street view mapping system to explicitly include a spherical panoramic image of the street view, as taught by Lynch, as the references are in the analogous art of street view mapping systems. An advantage of the 
Claim 19 recites limitations similar in scope to those of claim 9 and is rejected for at least the reasons above.
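For illustration only, cropping a panorama for each unit angle can be sketched as below. This is not taken from Lynch or from the application. It assumes the panorama is stored in an equirectangular layout (columns spanning 360 degrees of yaw) as a list of pixel rows, and that the unit angle is an integer number of degrees dividing 360 evenly. The function name `crop_by_unit_angle` is hypothetical.

```python
def crop_by_unit_angle(panorama, unit_angle_deg):
    """Split an equirectangular panorama into equal yaw slices.

    panorama: list of pixel rows, all of the same width (assumed format).
    unit_angle_deg: integer slice width in degrees; must divide 360.
    Returns a list of crops, each itself a list of pixel rows.
    """
    if unit_angle_deg <= 0 or 360 % unit_angle_deg != 0:
        raise ValueError("unit_angle_deg must be a positive divisor of 360")
    width = len(panorama[0])
    count = 360 // unit_angle_deg
    crops = []
    for i in range(count):
        # Integer column bounds for the i-th yaw slice.
        start = i * width // count
        stop = (i + 1) * width // count
        crops.append([row[start:stop] for row in panorama])
    return crops
```

With an 8-column panorama and a unit angle of 90 degrees, this yields four crops of two columns each.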
Claims 10 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Lin et al. (US 20160350930) in view of McCormac et al. (“SemanticFusion: Dense 3D Semantic Mapping with Convolutional Neural Networks”), Lynch (US 9404764), and Morin (US 20140125812).
Re claim 10, Lin, McCormac, and Lynch teach claim 9. Furthermore, Lynch teaches converting the street view image to a spherical coordinate system (see col 5, lines 37-59, mapping of spherical coordinates of the image bubble to the geographical coordinate space, wherein spherical coordinates may be converted to image pixel coordinates).
Acquiring an image of unit angle by projecting an image on the spherical coordinate system (see col 7, lines 38-51, in reference to Fig. 10, wherein images at “unit angles” are projected in the spherical image bubbles based on a spherical coordinate system).
Lin, McCormac, and Lynch do not explicitly teach wherein the spherical coordinate system is based on a pin-hole camera model.
However, Morin teaches wherein the spherical coordinate system is based on a pin-hole camera model (see abstract: “A camera image processing subsystem processes image data corresponding to observations taken through a lens of focal point f using a spherical pin-hole model that maps the image data through a perspective center of a pin-hole prospective plane located within the lens onto a model sphere that is a focal length f in diameter and has its center at the perspective center of the pin-hole prospective plane. The subsystem models systematic distortion as rotation about coordinate axis of the pin-hole prospective plane, and maps all of the data, over the entire field of view of the lens, to corresponding spherical coordinates”). That is, Morin’s spherical pin-hole model maps all data over the entire field of view of the lens to corresponding spherical coordinates.
	Lin, McCormac, Lynch, and Morin teach claim 10. It would have been obvious to one of ordinary skill in the art at the time of filing to modify Lin, McCormac, and Lynch’s spherical panoramic image mapping system to explicitly include a pin-hole camera model, as taught by Morin, as the references are in the analogous art of spherical coordinate mapping systems. An advantage of the modification is that it achieves the result of using well-known pin-hole camera models to determine x, y, z coordinates of image data (see [0005]).
	Claim 20 recites limitations similar in scope to those of claim 10 and is rejected for at least the reasons above.
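For illustration only, the general relationship between a pin-hole camera model and spherical coordinates can be sketched as below. This is the textbook ideal pin-hole model and is not Morin’s spherical pin-hole model, which additionally maps onto a model sphere and models systematic distortion. The function name, the principal point (cx, cy), and the focal length f in pixels are assumptions introduced for the sketch.

```python
import math

def pixel_to_spherical(u, v, cx, cy, f):
    """Convert a pixel to (yaw, pitch) in radians under an ideal pin-hole model.

    (cx, cy): principal point in pixels (assumed known).
    f: focal length in pixels (assumed known, positive).
    The viewing ray through the pixel is (u - cx, v - cy, f) in camera axes.
    """
    x = u - cx
    y = v - cy
    yaw = math.atan2(x, f)                    # left/right angle from the optical axis
    pitch = math.atan2(-y, math.hypot(x, f))  # up is positive; image rows grow downward
    return yaw, pitch
```

A pixel at the principal point maps to (0, 0), and a pixel offset horizontally by f pixels maps to a yaw of 45 degrees.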
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Peter Hoang whose telephone number is (571)270-1346. The examiner can normally be reached Monday-Friday 8:00 am - 5:00 pm PST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Gregory Tryder, can be reached on (571)270-7365. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/PETER HOANG/Primary Examiner, Art Unit 2616