DETAILED ACTION
Remarks
This final office action is in response to the application filed on 4/18/2022.
Claims 1, 8 and 15 are amended. 
Claims 1-20 are pending and examined below. 
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 5-8, 12-15, 19 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over US 2019/0080467 (“Hirzer”) in view of US 2019/0219401 (“Daniilidis”), and further in view of US 2019/0026917 (“Liao”).
Regarding claim 1, Hirzer discloses a method of robot autonomous navigation in an environment (see [0030], where “Accurate geo-localization of images is used by applications such as outdoor augmented reality (AR), autonomous driving, mobile robotics, and navigation, extended reality (XR), virtual reality (VR), and augmented virtuality (AV).”; see also [0045], where “The image capture device 102 may also be used for navigation”), the method comprising: 
capturing an image of the environment (see [0032], where “the image capture device captures an image, and the method can use semantic segmentation to localize the image capture device. The pose determination based on semantic segmentation of the captured image is considered to be ground truth, and is used to update the 3D tracker. The system can efficiently and reliably determine the pose in an urban environment or other environment having buildings.”; see also [0044], where “System 100 includes an image capture device 102 having image capture hardware (optics and an imaging sensor, not shown) capable of capturing images of a scene including object/environment 114.”); 
segmenting the captured image to identify one or more foreground objects and one or more background objects (The submitted specification does not provide any examples of foreground and background objects. For examination purposes, foreground objects are interpreted as objects closer to the camera, for example cars, pedestrians, and buildings. Background objects are interpreted as objects far from the camera or behind the scene, for example sky and vegetation. See Hirzer fig 3B, block 368, which performs semantic segmentation to generate a segmented image; see also fig 5, where a block diagram of semantic segmentation is shown; see also fig 8B, where a segmented image corresponding to the image of fig 8A is shown; see also [0037], where “The semantic segmentation can identify the edges of buildings reliably”; see also [0039], where “The semantic segmentation can also distinguish the vertical and horizontal edges at the boundaries of a building facade from smaller architectural features, such as windows, doors, and ledges.”; see also [0089], where “In some embodiments, all image features can be classified as either facades, vertical edges, horizontal edges, or background. Other static objects, which do not block a facade (e.g., roofs, ground, sky or vegetation), are all classified as background.”; see also [0099], where “For example, a shrub within the outline of a facade is labeled as a facade. Similarly, the sky is labeled as background, and an airplane or bird (not shown) within an area of the sky is also labeled as background.”); 
determining a match between one or more of the foreground objects to one or more predefined image files (see [0034], where “an image can be generated and matched against a 2.5D map.”; see also [0041], where “The semantic segmentation can use a small number of classes to allow accurate matching (or alignment) between an input image from a camera and a 2.5D model.”; the camera image is compared/matched with an existing model; see also fig 9A-C, where examples of labeled training images are shown; see also [0095], where “FIG. 9A shows the handling of architectural features. An input labeled image 900 has a building 901 with a facade 902, vertical edges 902e and 902f, a roof 904, windows 906, and a door 908, set against a background 910. The semantic segmentation block 308 outputs the corresponding segmented image 920 having a facade 922, a pair of vertical edges 924a, 924b and a horizontal edge 926.”; see also [0100], where “FIGS. 9A-9C are only exemplary. The training dataset can include a large number (e.g., 1000 or more) of labeled images, having a variety of building configurations, background configurations and poses, and a large number of blocking objects partially blocking the facade, vertical edges, horizontal edges and/or the background.”; see also [0062], where “A training set containing labeled images of buildings is input. Foreground objects that partially block the facade (including pedestrians, automotive vehicles, shrubs, trees, or the like) are labeled as portions of the facade. Foreground objects that partially block a vertical edge (or horizontal edge) of a facade-including pedestrians, automotive vehicles, shrubs, trees, or the like-are labeled as portions of the vertical edge (or horizontal edge). Foreground objects that partially block the background outside the perimeter of the building (including pedestrians, automotive vehicles, shrubs, trees, or the like) are labeled as portions of the background.”; see also [0119]);
estimating an object pose for the one or more foreground objects by implementing an iterative estimation loop (for examination purposes, the claim limitation is interpreted as estimating the pose of another vehicle/pedestrian/building, not of the host robot/vehicle. See Hirzer fig 3B, blocks 370-380; see also fig 11, where a block diagram of pose hypothesis is shown; see also [0067], where “The 3D tracker 314 provides the initial pose to the pose hypothesis sampling block 310.”; see also [0069], where “At block 374, the pose hypothesis sampling block 310 performs a loop containing block 376 for each respective pose hypothesis.”; see also [0070], where “At block 376, the pose hypothesis sampling block 310 generates a 3D rendering of the scene corresponding to the respective pose.”; the 3D rendering of the scene corresponds to a robot-centric environment model; see also [0071-72] and [0113-124], where one pose is selected from a plurality of poses by generating pose hypotheses and 3D renderings. Thus, the system determines the object pose iteratively. The pose hypotheses are based on the possible poses of the camera of the robot. The appearance of the surrounding environment/scene depends on the direction of the camera pose, and which foreground/background objects are included in the scene depends on that direction. Hirzer does not explicitly disclose that the robot with the camera will also be in the scene. Hirzer discloses that foreground/background objects will be in the scene.); 

associating semantic labels to the matched foreground object (see fig 9A-C, where examples of labeled training images are shown. see also [0096], where “FIG. 9A shows the handling of architectural features. An input labeled image 900 has a building 901 with a facade 902, vertical edges 902e and 902f, a roof 904, windows 906, and a door 908, set against a background 910. The semantic segmentation block 308 outputs the corresponding segmented image 920 having a facade 922, a pair of vertical edges 924a, 924b and a horizontal edge 926.”; see also [0038], where “the blocking foreground objects are labeled as belonging to the same class as the component”; see also [0096-99]); 
compiling a semantic map containing the semantic labels and segmented foreground object image pose (see fig 5, block 501; see [0034], where “Examples below generate 3D renderings from the 2.5D maps for several poses to facilitate matching.”; see also [0085], where “For each pixel in the imaging sensor of the image capture device 102 (FIG. 1), the CNN or FCN 501 determines a respective probability that the feature captured by that pixel belongs to a respective classification.”); and 
providing localization information to the robot based on the semantic map (see [0032], where “An exemplary system described below can determine a pose of an image capture device at any given time using a 3D tracker, such as visual odometry tracking or simultaneous localization and mapping (SLAM).”; see also [0078], where “The pose hypothesis sampling block 310 projects map points onto the image based on the initial pose estimate from SLAM based 3D tracker 452.”; see also [0007], where “Simultaneous localization and mapping (SLAM) based systems may be used in outdoor localization tasks.”; see also [0034] and [0050]).
Hirzer does not disclose the following limitations:
at least one of the one or more predefined image files comprising a three-dimensional (3D) computer-aided design (CAD) wireframe model or a 3D CAD solid modeling model; 
determining a robot pose estimate by applying a robot-centric environmental model to the object foreground pose estimate by implementing an iterative refinement loop; and
providing localization information to the robot based on the robot pose estimate.
However, Daniilidis discloses a method for simultaneous localization and mapping of a mobile robot, wherein determining a robot pose estimate by applying a robot-centric environmental model to the object foreground pose estimate by implementing an iterative refinement loop (see fig 1, where an example method of semantic SLAM is shown; see also [0033], where “In robotics, simultaneous localization and mapping (SLAM) is the problem of mapping an unknown environment while estimating a robot's pose within it.”; see also [0038], where “we provide a formal decomposition of the joint metric-semantic SLAM problem into continuous (pose)”; see also [0134], where “The second optimization above is typically carried out via filtering [4]-[6] or pose-graph optimization”; pose-graph optimization is interpreted as an iterative refinement loop.); and
providing localization information to the robot based on the robot pose estimate (see [0040], where “Consider the classical localization and mapping problem, in which a mobile sensor moves through an unknown environment, modeled as a collection … of static landmarks. Given a set of sensor measurements … the task is to estimate the landmark positions ℒ and a sequence of poses … representing the sensor trajectory.”; see also [0074], where “The advantage of our work is that by having semantic features directly into the optimization, we include a relatively sparse and easily distinguishable set of features that allows for improved localization performance and loop closure,”).
Both Hirzer and Daniilidis are in the same field of endeavor of mobile robot localization and mapping systems. Thus, before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to have modified Hirzer to incorporate the teachings of Daniilidis by including the above features, namely determining a robot pose estimate by applying a robot-centric environmental model to the object foreground pose estimate by implementing an iterative refinement loop; and providing localization information to the robot based on the robot pose estimate, in order to provide correct localization and mapping information by assigning semantic labels to all landmarks observed in the environment.
Hirzer in view of Daniilidis does not disclose the following limitation:
at least one of the one or more predefined image files comprising a three-dimensional (3D) computer-aided design (CAD) wireframe model or a 3D CAD solid modeling model. 
However, Liao discloses an object matching method wherein at least one of the one or more predefined image files comprises a three-dimensional (3D) computer-aided design (CAD) wireframe model or a 3D CAD solid modeling model (see [0075], where “That is, given a library of 3D models (e.g., 3D CAD models) from various object categories and their sub-categories, the system finds the 3D model that best matches the object in an image. 3D-to-2D object mapping improves system performance (e.g., reduces processor and memory load) because the system does not need to build a completely new 3D model from scratch for every new image.”). 
Hirzer, Daniilidis and Liao are in the same field of endeavor of mobile robot localization and mapping systems. Thus, before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to have modified Hirzer in view of Daniilidis to incorporate the teachings of Liao by including the above feature, namely at least one of the one or more predefined image files comprising a three-dimensional (3D) computer-aided design (CAD) wireframe model or a 3D CAD solid modeling model, in order to provide a correct match between a captured image of an object and a database object without building a model for every new image.
Regarding claim 5, Hirzer further discloses a method including performing a model training process to train one or more deep neural networks by placing an object within the environment (see [0094], where “The network is fine-tuned with data, and then the resulting model is used to initialize the weights of a more fine-grained network (FCN-16s). This process is repeated in order to compute the final segmentation network having an 8 pixels prediction stride (FCN-8s).”; see also [0102], where “In a variation of the training method, to create ground truth data with reduced effort, one can record short video sequences in an urban environment. A model and key point-based 3D tracking system can use untextured 2.5D models, with this approach, one can label the facades and their edges efficiently.”; see also [0118]), the object having annotated pixels (see [0062], where “A training set containing labeled images of buildings is input. Foreground objects that partially block the facade (including pedestrians, automotive vehicles, shrubs, trees, or the like) are labeled as portions of the facade. Foreground objects that partially block the vertical edge (or horizontal edge) of a facade-including pedestrians, automotive vehicles, shrubs, trees, or the like-are labeled as portions of the vertical edge (or horizontal edge). Foreground objects that partially block the background outside the perimeter of the building (including pedestrians, automotive vehicles, shrubs, trees, or the like) are labeled as portions of the background.”; see also [0085], [0111] and [0116]).
Regarding claim 6, Hirzer further discloses a method including performing an adversarial training process to train one or more deep neural networks to recognize boundaries between the one or more foreground and the one or more background objects (Per the submitted specification, adversarial training can train the networks to recognize both foreground and background portions of the images; see [0014] of the PGPUB of the submitted specification. See Hirzer fig 3B, block 360; see also [0057], where “The semantic segmentation block 308 performs: image rectification, classification of scene components within the rectified image into facades, vertical edge, horizontal edges and background, division of the image into regions (e.g., columns), and classification of each column into one of a predetermined number of combinations of facades, vertical edge, horizontal edges and background.”; see also [0089], where “In some embodiments, all image features can be classified as either facades, vertical edges, horizontal edges, or background.”; see also [0099], where “For example, a shrub within the outline of a facade is labeled as a facade. Similarly, the sky is labeled as background, and an airplane or bird (not shown) within an area of the sky is also labeled as background.”; see also [0039], where “The semantic segmentation can also distinguish the vertical and horizontal edges at the boundaries of a building facade from smaller architectural features, such as windows, doors, and ledges.”; see also [0062], where “A training set containing labeled images of buildings is input. Foreground objects that partially block the facade (including pedestrians, automotive vehicles, shrubs, trees, or the like) are labeled as portions of the facade. Foreground objects that partially block the vertical edge (or horizontal edge) of a facade-including pedestrians, automotive vehicles, shrubs, trees, or the like-are labeled as portions of the vertical edge (or horizontal edge). Foreground objects that partially block the background outside the perimeter of the building (including pedestrians, automotive vehicles, shrubs, trees, or the like) are labeled as portions of the background.”).
Regarding claim 7, Hirzer further discloses a method including refining the segmented image using a deep neural network (see [0083], where “Semantic segmentation block 308 can use deep learning methods. The semantic segmentation block 308 includes image rectification block 500, ANN (e.g., CNN or FCN) 501,”; see also [0094]).
Regarding claim 8, Hirzer further discloses a non-transitory computer-readable medium having stored thereon executable instructions when executed by a processor unit cause the processor unit to perform a method of autonomous robot navigation (see [0030], where “Accurate geo-localization of images is used by applications such as outdoor augmented reality (AR), autonomous driving, mobile robotics, and navigation, extended reality (XR), virtual reality (VR), and augmented virtuality (AV).”; see also [0045], where “The image capture device 102 may also be used for navigation”; see also fig 3A, where 320 is processor and 330 is storage medium), the method comprising: 
capturing an image of the environment (see [0032], where “the image capture device captures an image, and the method can use semantic segmentation to localize the image capture device. The pose determination based on semantic segmentation of the captured image is considered to be ground truth, and is used to update the 3D tracker. The system can efficiently and reliably determine the pose in an urban environment or other environment having buildings.”; see also [0044], where “System 100 includes an image capture device 102 having image capture hardware (optics and an imaging sensor, not shown) capable of capturing images of a scene including object/environment 114.”); 
segmenting the captured image to identify one or more foreground objects and one or more background objects (The submitted specification does not provide any examples of foreground and background objects. For examination purposes, foreground objects are interpreted as objects closer to the camera, for example cars, pedestrians, and buildings. Background objects are interpreted as objects far from the camera or behind the scene, for example sky and vegetation. See Hirzer fig 3B, block 368, which performs semantic segmentation to generate a segmented image; see also fig 5, where a block diagram of semantic segmentation is shown; see also fig 8B, where a segmented image corresponding to the image of fig 8A is shown; see also [0037], where “The semantic segmentation can identify the edges of buildings reliably”; see also [0039], where “The semantic segmentation can also distinguish the vertical and horizontal edges at the boundaries of a building facade from smaller architectural features, such as windows, doors, and ledges.”; see also [0089], where “In some embodiments, all image features can be classified as either facades, vertical edges, horizontal edges, or background. Other static objects, which do not block a facade (e.g., roofs, ground, sky or vegetation), are all classified as background.”; see also [0099], where “For example, a shrub within the outline of a facade is labeled as a facade. Similarly, the sky is labeled as background, and an airplane or bird (not shown) within an area of the sky is also labeled as background.”); 
determining a match between one or more of the foreground objects to one or more predefined image files (see [0034], where “an image can be generated and matched against a 2.5D map.”; see also [0041], where “The semantic segmentation can use a small number of classes to allow accurate matching (or alignment) between an input image from a camera and a 2.5D model.”; the camera image is compared/matched with an existing model; see also fig 9A-C, where examples of labeled training images are shown; see also [0095], where “FIG. 9A shows the handling of architectural features. An input labeled image 900 has a building 901 with a facade 902, vertical edges 902e and 902f, a roof 904, windows 906, and a door 908, set against a background 910. The semantic segmentation block 308 outputs the corresponding segmented image 920 having a facade 922, a pair of vertical edges 924a, 924b and a horizontal edge 926.”; see also [0100], where “FIGS. 9A-9C are only exemplary. The training dataset can include a large number (e.g., 1000 or more) of labeled images, having a variety of building configurations, background configurations and poses, and a large number of blocking objects partially blocking the facade, vertical edges, horizontal edges and/or the background.”; see also [0062], where “A training set containing labeled images of buildings is input. Foreground objects that partially block the facade (including pedestrians, automotive vehicles, shrubs, trees, or the like) are labeled as portions of the facade. Foreground objects that partially block the vertical edge (or horizontal edge) of a facade-including pedestrians, automotive vehicles, shrubs, trees, or the like-are labeled as portions of the vertical edge (or horizontal edge). Foreground objects that partially block the background outside the perimeter of the building (including pedestrians, automotive vehicles, shrubs, trees, or the like) are labeled as portions of the background.”; see also [0119]);
estimating an object pose for the one or more foreground objects by implementing an iterative estimation loop (for examination purposes, the claim limitation is interpreted as estimating the pose of another vehicle/pedestrian/building, not of the host robot/vehicle. See Hirzer fig 3B, blocks 370-380; see also fig 11, where a block diagram of pose hypothesis is shown; see also [0067], where “The 3D tracker 314 provides the initial pose to the pose hypothesis sampling block 310.”; see also [0069], where “At block 374, the pose hypothesis sampling block 310 performs a loop containing block 376 for each respective pose hypothesis.”; see also [0070], where “At block 376, the pose hypothesis sampling block 310 generates a 3D rendering of the scene corresponding to the respective pose.”; the 3D rendering of the scene corresponds to a robot-centric environment model; see also [0071-72] and [0113-124], where one pose is selected from a plurality of poses by generating pose hypotheses and 3D renderings. Thus, the system determines the object pose iteratively. The pose hypotheses are based on the possible poses of the camera of the robot. The appearance of the surrounding environment/scene depends on the direction of the camera pose, and which foreground/background objects are included in the scene depends on that direction. Hirzer does not explicitly disclose that the robot with the camera will also be in the scene. Hirzer discloses that foreground/background objects will be in the scene.); 

associating semantic labels to the matched foreground object (see fig 9A-C, where examples of labeled training images are shown. see also [0096], where “FIG. 9A shows the handling of architectural features. An input labeled image 900 has a building 901 with a facade 902, vertical edges 902e and 902f, a roof 904, windows 906, and a door 908, set against a background 910. The semantic segmentation block 308 outputs the corresponding segmented image 920 having a facade 922, a pair of vertical edges 924a, 924b and a horizontal edge 926.”; see also [0038], where “the blocking foreground objects are labeled as belonging to the same class as the component”; see also [0096-99]); 
compiling a semantic map containing the semantic labels and segmented foreground object image pose (see fig 5, block 501; see [0034], where “Examples below generate 3D renderings from the 2.5D maps for several poses to facilitate matching.”; see also [0085], where “For each pixel in the imaging sensor of the image capture device 102 (FIG. 1), the CNN or FCN 501 determines a respective probability that the feature captured by that pixel belongs to a respective classification.”); and 
providing localization information to the robot based on the semantic map (see [0032], where “An exemplary system described below can determine a pose of an image capture device at any given time using a 3D tracker, such as visual odometry tracking or simultaneous localization and mapping (SLAM).”; see also [0078], where “The pose hypothesis sampling block 310 projects map points onto the image based on the initial pose estimate from SLAM based 3D tracker 452.”; see also [0007], where “Simultaneous localization and mapping (SLAM) based systems may be used in outdoor localization tasks.”; see also [0034] and [0050]).
Hirzer does not disclose the following limitations:
at least one of the one or more predefined image files comprising a three-dimensional (3D) computer-aided design (CAD) wireframe model or a 3D CAD solid modeling model; 
determining a robot pose estimate by applying a robot-centric environmental model to the foreground object pose estimate by implementing an iterative refinement loop; and
providing localization information to the robot based on the robot pose estimate.
However, Daniilidis further discloses a method for simultaneous localization and mapping of a mobile robot, wherein determining a robot pose estimate by applying a robot-centric environmental model to the foreground object pose estimate by implementing an iterative refinement loop (see fig 1, where an example method of semantic SLAM is shown; see also [0033], where “In robotics, simultaneous localization and mapping (SLAM) is the problem of mapping an unknown environment while estimating a robot's pose within it.”; see also [0038], where “we provide a formal decomposition of the joint metric-semantic SLAM problem into continuous (pose)”; see also [0134], where “The second optimization above is typically carried out via filtering [4]-[6] or pose-graph optimization”; pose-graph optimization is interpreted as an iterative refinement loop.); and
providing localization information to the robot based on the robot pose estimate (see [0040], where “Consider the classical localization and mapping problem, in which a mobile sensor moves through an unknown environment, modeled as a collection … of static landmarks. Given a set of sensor measurements … the task is to estimate the landmark positions ℒ and a sequence of poses … representing the sensor trajectory.”; see also [0074], where “The advantage of our work is that by having semantic features directly into the optimization, we include a relatively sparse and easily distinguishable set of features that allows for improved localization performance and loop closure,”).
Both Hirzer and Daniilidis are in the same field of endeavor of mobile robot localization and mapping systems. Thus, before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to have modified Hirzer to incorporate the teachings of Daniilidis by including the above features, namely determining a robot pose estimate by applying a robot-centric environmental model to the foreground object pose estimate by implementing an iterative refinement loop; and providing localization information to the robot based on the robot pose estimate, in order to provide correct localization and mapping information by assigning semantic labels to all landmarks observed in the environment.
Hirzer in view of Daniilidis does not disclose the following limitation:
at least one of the one or more predefined image files comprising a three-dimensional (3D) computer-aided design (CAD) wireframe model or a 3D CAD solid modeling model. 
However, Liao further discloses an object matching method wherein at least one of the one or more predefined image files comprises a three-dimensional (3D) computer-aided design (CAD) wireframe model or a 3D CAD solid modeling model (see [0075], where “That is, given a library of 3D models (e.g., 3D CAD models) from various object categories and their sub-categories, the system finds the 3D model that best matches the object in an image. 3D-to-2D object mapping improves system performance (e.g., reduces processor and memory load) because the system does not need to build a completely new 3D model from scratch for every new image.”). 
Hirzer, Daniilidis and Liao are in the same field of endeavor of mobile robot localization and mapping systems. Thus, before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to have modified Hirzer in view of Daniilidis to incorporate the teachings of Liao by including the above feature, namely at least one of the one or more predefined image files comprising a three-dimensional (3D) computer-aided design (CAD) wireframe model or a 3D CAD solid modeling model, in order to provide a correct match between a captured image of an object and a database object without building a model for every new image.
Regarding claim 12, Hirzer further discloses a non-transitory computer-readable medium, the executable instructions configured to cause the processor unit to perform the method, including performing a model training process to train one or more deep neural networks by placing an object within the environment (see [0094], where “The network is fine-tuned with data, and then the resulting model is used to initialize the weights of a more fine-grained network (FCN-16s). This process is repeated in order to compute the final segmentation network having an 8 pixels prediction stride (FCN-8s).”; see also [0102], where “In a variation of the training method, to create ground truth data with reduced effort, one can record short video sequences in an urban environment. A model and key point-based 3D tracking system can use untextured 2.5D models, with this approach, one can label the facades and their edges efficiently.”; see also [0118]), the object having annotated pixels (see [0062], where “A training set containing labeled images of buildings is input. Foreground objects that partially block the facade (including pedestrians, automotive vehicles, shrubs, trees, or the like) are labeled as portions of the facade. Foreground objects that partially block the vertical edge (or horizontal edge) of a facade-including pedestrians, automotive vehicles, shrubs, trees, or the like-are labeled as portions of the vertical edge (or horizontal edge). Foreground objects that partially block the background outside the perimeter of the building (including pedestrians, automotive vehicles, shrubs, trees, or the like) are labeled as portions of the background.”; see also [0085], [0111] and [0116]).
Regarding claim 13, Hirzer further discloses a non-transitory computer-readable medium, the executable instructions configured to cause the processor unit to perform the method, including performing an adversarial training process to train one or more deep neural networks to recognize boundaries between the one or more foreground and the one or more background objects (Per the submitted specification, adversarial training can train the networks to recognize both foreground and background portions of the images; see [0014] of the PGPUB of the submitted specification. See Hirzer fig 3B, block 360; see also [0057], where “The semantic segmentation block 308 performs: image rectification, classification of scene components within the rectified image into facades, vertical edge, horizontal edges and background, division of the image into regions (e.g., columns), and classification of each column into one of a predetermined number of combinations of facades, vertical edge, horizontal edges and background.”; see also [0089], where “In some embodiments, all image features can be classified as either facades, vertical edges, horizontal edges, or background.”; see also [0099], where “For example, a shrub within the outline of a facade is labeled as a facade. Similarly, the sky is labeled as background, and an airplane or bird (not shown) within an area of the sky is also labeled as background.”; see also [0039], where “The semantic segmentation can also distinguish the vertical and horizontal edges at the boundaries of a building facade from smaller architectural features, such as windows, doors, and ledges.”; see also [0062], where “A training set containing labeled images of buildings is input. Foreground objects that partially block the facade (including pedestrians, automotive vehicles, shrubs, trees, or the like) are labeled as portions of the facade. Foreground objects that partially block the vertical edge (or horizontal edge) of a facade-including pedestrians, automotive vehicles, shrubs, trees, or the like-are labeled as portions of the vertical edge (or horizontal edge). Foreground objects that partially block the background outside the perimeter of the building (including pedestrians, automotive vehicles, shrubs, trees, or the like) are labeled as portions of the background.”).
Regarding claim 14, Hirzer further discloses a non-transitory computer-readable medium, the executable instructions further configured to cause the processor unit to perform the method, including refining the segmented image using a deep neural network (see [0083], where “Semantic segmentation block 308 can use deep learning methods. The semantic segmentation block 308 includes image rectification block 500, ANN (e.g., CNN or FCN) 501,”; see also [0094]).
Regarding claim 15, Hirzer further discloses a system for autonomous robot navigation (see [0030], where “Accurate geo-localization of images is used by applications such as outdoor augmented reality (AR), autonomous driving, mobile robotics, and navigation, extended reality (XR), virtual reality (VR), and augmented virtuality (AV).”; see also [0045], where “The image capture device 102 may also be used for navigation”), the system comprising: 
an image capture device (see fig 2, where 202 and 204 are cameras. See also [0051], where “FIG. 2 shows another example, in which an automotive vehicle 200 has two image capture devices 202, 204 mounted thereto.”); 
a data store having executable instructions stored thereon (see fig 3A, where 330 is storage medium);
 a neural network in communication with a processor unit and the data store (see fig 3A, where 320 is processor. See also [0053], where “The processor 320 can also be coupled to a non-transitory, machine readable storage medium 330 storing the image.”);
 the executable instructions when executed by the processor unit cause the processor unit to perform a method comprising:
 capturing an image of the environment with the image capture device (see [0032], where “the image capture device captures an image, and the method can use semantic segmentation to localize the image capture device. The pose determination based on semantic segmentation of the captured image is considered to be ground truth, and is used to update the 3D tracker. The system can efficiently and reliably determine the pose in an urban environment or other environment having buildings.”; see also [0044], where “System 100 includes an image capture device 102 having image capture hardware (optics and an imaging sensor, not shown) capable of capturing images of a scene including object/environment 114.”); 
segmenting the captured image to identify one or more foreground objects using a foreground deep neural network (DNN) and to identify one or more background objects using a background DNN, the foreground and the background DNN located within the neural network (Submitted specification does not provide any examples of foreground and background objects. For the examination purposes, foreground objects are interpreted as objects closer to the camera for example cars, pedestrians, buildings etc. Background objects are interpreted as objects far from the camera, behind the scene objects for example sky, vegetation etc. see Hirzer fig 3B, block 368, perform semantic segmentation to generate segmented image. see also fig 5, where block diagram of semantic segmentation is shown. see also fig 8B, where a segmented image corresponding to image of fig 8A is shown. see also [0037], where “The semantic segmentation can identify the edges of buildings reliably”; see also [0039], where “The semantic segmentation can also distinguish the vertical and horizontal edges at the boundaries of a building facade from smaller architectural features, such as windows, doors, and ledges.”; see also [0089], where “In some embodiments, all image features can be classified as either facades, vertical edges, horizontal edges, or background. Other static objects, which do not block a facade (e.g., roofs, ground, sky or vegetation), are all classified as background.”; see also [0099], where “For example, a shrub within the outline of a facade is labeled as a facade. Similarly, the sky is labeled as background, and an airplane or bird (not shown) within an area of the sky is also labeled as background.”; see also fig 3A, block 308, where CNN is used for semantic segmentation. See also [0083], where “Semantic segmentation block 308 can use deep learning methods.”); 
determining a match between one or more of the foreground objects to one or more predefined image files (See [0034], where “an image can be generated and matched against a 2.5D map.”; see also [0041], where “The semantic segmentation can use a small number of classes to allow accurate matching (or alignment) between an input image from a camera and a 2.5D model.”; the camera image is compared/matched with an existing model. See also fig 9A-C, where examples of labeled training images are shown. See also [0095], where “FIG. 9A shows the handling of architectural features. An input labeled image 900 has a building 901 with a facade 902, vertical edges 902e and 902f, a roof 904, windows 906, and a door 908, set against a background 910. The semantic segmentation block 308 outputs the corresponding segmented image 920 having a facade 922, a pair of vertical edges 924a, 924b and a horizontal edge 926.”; see also [0100], where “FIGS. 9A-9C are only exemplary. The training dataset can include a large number (e.g., 1000 or more) of labeled images, having a variety of building configurations, background configurations and poses, and a large number of blocking objects partially blocking the facade, vertical edges, horizontal edges and/or the background.”; see also [0062], where “A training set containing labeled images of buildings is input. Foreground objects that partially block the facade (including pedestrians, automotive vehicles, shrubs, trees, or the like) are labeled as portions of the facade. Foreground objects that partially block the vertical edge (or horizontal edge) of a facade-including pedestrians, automotive vehicles, shrubs, trees, or the like-are labeled as portions of the vertical edge (or horizontal edge). Foreground objects that partially block the background outside the perimeter of the building (including pedestrians, automotive vehicles, shrubs, trees, or the like) are labeled as portions of the background.”; see also [0119]); 
estimating an object pose for the one or more foreground objects by implementing an iterative estimation loop (For examination purposes, the claim limitation is interpreted as estimating the pose of another vehicle/pedestrian/building, not the host robot/vehicle. See Hirzer fig 3B, blocks 370-380. See also fig 11, where a block diagram of pose hypothesis is shown. See also [0067], where “The 3D tracker 314 provides the initial pose to the pose hypothesis sampling block 310.”; see also [0069], where “At block 374, the pose hypothesis sampling block 310 performs a loop containing block 376 for each respective pose hypothesis.”; see also [0070], where “At block 376, the pose hypothesis sampling block 310 generates a 3D rendering of the scene corresponding to the respective pose.”; the 3D rendering of the scene corresponds to a robot-centric environment model. See also [0071-72] and [0113-124], where one pose is selected from a plurality of poses by generating pose hypotheses and 3D renderings. So, the system is determining the object pose iteratively. The pose hypothesis is based on the possible poses of the camera of the robot. How the surrounding environment/scene would look depends on the direction of the camera pose. Foreground/background objects would be included in the scene depending on the direction that the scene is facing. Hirzer does not explicitly disclose that the robot with the camera will also be in the scene. Hirzer discloses that foreground/background objects will be in the scene.); 

associating semantic labels to the matched foreground object (see fig 9A-C, where examples of labeled training images are shown. see also [0096], where “FIG. 9A shows the handling of architectural features. An input labeled image 900 has a building 901 with a facade 902, vertical edges 902e and 902f, a roof 904, windows 906, and a door 908, set against a background 910. The semantic segmentation block 308 outputs the corresponding segmented image 920 having a facade 922, a pair of vertical edges 924a, 924b and a horizontal edge 926.”; see also [0038], where “the blocking foreground objects are labeled as belonging to the same class as the component”; see also [0096-99]); 
compiling a semantic map containing the semantic labels and segmented foreground object image pose (see fig 5, block 501; see [0034], where “Examples below generate 3D renderings from the 2.5D maps for several poses to facilitate matching.”; see also [0085], where “For each pixel in the imaging sensor of the image capture device 102 (FIG. 1), the CNN or FCN 501 determines a respective probability that the feature captured by that pixel belongs to a respective classification.”); and 
providing localization information to the robot based on the semantic map (see [0032], where “An exemplary system described below can determine a pose of an image capture device at any given time using a 3D tracker, such as visual odometry tracking or simultaneous localization and mapping (SLAM).”; see also [0078], where “The pose hypothesis sampling block 310 projects map points onto the image based on the initial pose estimate from SLAM based 3D tracker 452.”; see also [0007], where “Simultaneous localization and mapping (SLAM) based systems may be used in outdoor localization tasks.”; see also [0034] and [0050]).
Hirzer does not disclose the following limitations:
at least one of the one or more predefined image files comprising a three- dimensional (3D) computer-aided design (CAD) wireframe model or a 3D CAD solid modeling model;
determining a robot pose estimate by applying a robot-centric environmental model to the foreground object pose estimate by using a refinement DNN to implement an iterative refinement loop, the refinement DNN located within the neural network; and
providing localization information to the robot based on the robot pose estimate.
However, Daniilidis further discloses a method for simultaneous localization and mapping of a mobile robot, wherein determining a robot pose estimate by applying a robot-centric environmental model to the foreground object pose estimate by using a refinement DNN to implement an iterative refinement loop, the refinement DNN located within the neural network (see fig 1, where an example method of semantic SLAM is shown. see also [0033], where “In robotics, simultaneous localization and mapping (SLAM) is the problem of mapping an unknown environment while estimating a robot's pose within it.”; see also [0038], where “we provide a formal decomposition of the joint metricsemantic SLAM problem into continuous (pose)”; see also [0134], where “The second optimization above is typically carried out via filtering [4]-[6] or pose-graph optimization”; pose-graph optimization is interpreted as iterative refinement loop.); and
providing localization information to the robot based on the robot pose estimate (see [0040], where “Consider the classical localization and mapping problem, in which a mobile sensor moves through an unknown environment, modeled as a collection… of static landmarks. Given a set of sensor measurements…the task is to estimate the landmark positions ℒ and a sequence of poses …representing the sensor trajectory.”; see also [0074], where “The advantage of our work is that by having semantic features directly into the optimization, we include a relatively sparse and easily distinguishable set of features that allows for improved localization performance and loop closure,”).
Hirzer and Daniilidis are both in the same field of endeavor of mobile robot localization and mapping systems. Thus, before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to have modified Hirzer to incorporate the teachings of Daniilidis by including the above features, namely determining a robot pose estimate by applying a robot-centric environmental model to the foreground object pose estimate by using a refinement DNN to implement an iterative refinement loop, the refinement DNN located within the neural network; and providing localization information to the robot based on the robot pose estimate, in order to provide correct localization and mapping information by assigning semantic labels to all landmarks observed in the environment.
Hirzer in view of Daniilidis does not disclose the following limitation:
at least one of the one or more predefined image files comprising a three- dimensional (3D) computer-aided design (CAD) wireframe model or a 3D CAD solid modeling model. 
However, Liao further discloses an object matching method wherein at least one of the one or more predefined image files comprising a three- dimensional (3D) computer-aided design (CAD) wireframe model or a 3D CAD solid modeling model (see [0075], where “That is, given a library of 3D models (e.g., 3D CAD models) from various object categories and their sub-categories, the system finds the 3D model that best matches the object in an image. 3D-to-2D object mapping improves system performance (e.g., reduces processor and memory load) because the system does not need to build a completely new 3D model from scratch for every new image.”). 
Hirzer, Daniilidis and Liao are all in the same field of endeavor of mobile robot localization and mapping systems. Thus, before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to have modified Hirzer in view of Daniilidis to incorporate the teachings of Liao by including the above feature, namely at least one of the one or more predefined image files comprising a three-dimensional (3D) computer-aided design (CAD) wireframe model or a 3D CAD solid modeling model, in order to provide a correct match between the captured image of an object and a database object without building a model for every new image.
Regarding claim 19, Hirzer further discloses a system, the executable instructions further configured to cause the processor unit to perform the method, including performing a model training process to train one or more deep neural networks by placing an object within the environment (see [0094], where “The network is fine-tuned with data, and then the resulting model is used to initialize the weights of a more fine-grained network (FCN-16s). This process is repeated in order to compute the final segmentation network having an 8 pixels prediction stride (FCN-8s).”; see also [0102], where “In a variation of the training method, to create ground truth data with reduced effort, one can record short video sequences in an urban environment. A model and key point-based 3D tracking system can use untextured 2.5D models, with this approach, one can label the facades and their edges efficiently.”; see also [0118]), the object having annotated pixels (see [0062], where “A training set containing labeled images of buildings is input. Foreground objects that partially block the facade (including pedestrians, automotive vehicles, shrubs, trees, or the like) are labeled as portions of the facade. Foreground objects that partially block the vertical edge (or horizontal edge) of a facade-including pedestrians, automotive vehicles, shrubs, trees, or the like-are labeled as portions of the vertical edge (or horizontal edge). Foreground objects that partially block the background outside the perimeter of the building (including pedestrians, automotive vehicles, shrubs, trees, or the like) are labeled as portions of the background.”; see also [0085], [0111] and [0116]).
Regarding claim 20, Hirzer further discloses a system, the executable instructions configured to cause the processor unit to perform: 
a (see fig 3B, block 360; see also [0057], where “The semantic segmentation block 308 performs: image rectification, classification of scene components within the rectified image into facades, vertical edge, horizontal edges and background, division of the image into regions (e.g., columns), and classification of each column into one of a predetermined number of combinations of facades, vertical edge, horizontal edges and background.”; see also [0089], where “In some embodiments, all image features can be classified as either facades, vertical edges, horizontal edges”; see also [0039], where “The semantic segmentation can also distinguish the vertical and horizontal edges at the boundaries of a building facade from smaller architectural features, such as windows, doors, and ledges.”; see also [0062], where “A training set containing labeled images of buildings is input. Foreground objects that partially block the facade (including pedestrians, automotive vehicles, shrubs, trees, or the like) are labeled as portions of the facade. Foreground objects that partially block the vertical edge (or horizontal edge) of a facade-including pedestrians, automotive vehicles, shrubs, trees, or the like-are labeled as portions of the vertical edge (or horizontal edge). Foreground objects that partially block the background outside the perimeter of the building (including pedestrians, automotive vehicles, shrubs, trees, or the like) are labeled as portions of the background.”); and 
a  (see fig 3B, block 360; see also [0057], where “The semantic segmentation block 308 performs: image rectification, classification of scene components within the rectified image into facades, vertical edge, horizontal edges and background, division of the image into regions (e.g., columns), and classification of each column into one of a predetermined number of combinations of facades, vertical edge, horizontal edges and background.”; see also [0089], where “In some embodiments, all image features can be classified as … background.”; see also [0099], where “For example, a shrub within the outline of a facade is labeled as a facade. Similarly, the sky is labeled as background, and an airplane or bird (not shown) within an area of the sky is also labeled as background.”).
Per the submitted specification, adversarial training can train a network to recognize both foreground and background portions of the images; see [0014] of the PGPUB of the submitted specification. 
Hirzer discloses a training process that recognizes both foreground and background portions of an image (see citation above). So, it would have been obvious to train a system to recognize both foreground and background objects via a single training process, instead of having separate training processes for foreground and background objects, in order to classify an object and its background in the image by using a trained model without any delay.

Claim(s) 2, 4, 9, 11, 16 and 18 is/are rejected under 35 U.S.C. 103 as being unpatentable over US 2019/0080467 (“Hirzer”), in view of US 2019/0219401 (“Daniilidis”), and in view of US 2019/0026917 (“Liao”), as applied to claims 1, 8 and 15 above, and further in view of US 2019/0213443 (“Cunningham”). 
Regarding claim 2, Hirzer further discloses a method, including: the iterative estimation loop performing the matching and object pose estimation (see fig 3B, block 370-380. See also fig 11, where block diagram of pose hypothesis is shown. see also [0067], where “The 3D tracker 314 provides the initial pose to the pose hypothesis sampling block 310.”; see also [0069], where “At block 374, the pose hypothesis sampling block 310 performs a loop containing block 376 for each respective pose hypothesis.”; see also [0070], where “At block 376, the pose hypothesis sampling block 310 generates a 3D rendering of the scene corresponding to the respective pose.”; see also [0071-72] and [0113-124], where one pose is selected from plurality of poses by generating pose hypothesis and 3D rendering generation. So, the system is determining the object pose iteratively. See also [0034] and [0041]). 
Hirzer in view of Daniilidis and Liao does not disclose the following limitation:
terminating the iterative estimation loop based on a determination that a comparison between a segmented image and the predefined image file is within a predetermined tolerance.
However, Cunningham discloses a method for detecting objects in images, including terminating the iterative estimation loop based on a determination that a comparison between a segmented image and the predefined image file is within a predetermined tolerance (see [0050], where “The training module 230 stops 370 backpropagation of the error terms through the neural network model 220 after both the first loss function and the second loss function satisfy a criterion, for example, once the error terms are within a predetermined acceptable range.”; the predetermined acceptable range corresponds to the predetermined tolerance. See also fig 1, where the object detection module is connected with the training server. See also fig 5, where an object in the acquired image is detected using a trained neural network model. See also fig 6, where fully/partially labeled training images and acquired images are compared. When the error terms are within a predetermined range, the system stops the iteration.).
Hirzer, Daniilidis, Liao and Cunningham are all in the same field of endeavor of object detection in an image. Thus, before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to have modified Hirzer in view of Daniilidis and Liao to incorporate the teachings of Cunningham by including the above feature, namely terminating the iterative estimation loop based on a determination that a comparison between a segmented image and the predefined image file is within a predetermined tolerance, in order to classify an object in the image by using a trained model without any delay.
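For illustration only, the claimed combination of an iterative estimation loop with termination on a predetermined tolerance can be sketched as follows. This is a toy example and is not taken from Hirzer, Daniilidis, Liao, Cunningham or the submitted specification: the "pose" is reduced to a one-dimensional horizontal shift, the "predefined image file" is a short row of class labels, and the names `render`, `mismatch` and `estimate_shift` are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def render(template: Sequence[int], shift: int, width: int) -> List[int]:
    """Draw the template labels at offset `shift` on a background (label 0) row."""
    row = [0] * width
    for i, label in enumerate(template):
        x = shift + i
        if 0 <= x < width:
            row[x] = label
    return row


def mismatch(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of positions whose labels differ between two equal-length rows."""
    return sum(1 for p, q in zip(a, b) if p != q) / len(a)


def estimate_shift(
    segmented: Sequence[int],
    template: Sequence[int],
    hypotheses: Sequence[int],
    tolerance: float,
) -> Tuple[Optional[int], float, int]:
    """Try pose hypotheses in order; stop once the best mismatch is within tolerance.

    Returns (best shift, its mismatch, number of hypotheses evaluated).
    """
    best_shift: Optional[int] = None
    best_err = float("inf")
    iterations = 0
    for shift in hypotheses:
        iterations += 1
        err = mismatch(render(template, shift, len(segmented)), segmented)
        if err < best_err:
            best_shift, best_err = shift, err
        if best_err <= tolerance:
            break  # comparison is within the predetermined tolerance
    return best_shift, best_err, iterations


# Toy segmented row: labels 1, 1, 2 (the "object") start at position 3.
segmented = [0, 0, 0, 1, 1, 2, 0, 0]
template = [1, 1, 2]
result = estimate_shift(segmented, template, range(8), tolerance=0.0)
print(result)
```

With a tolerance of 0.0 the loop runs until an exact match is found; a looser tolerance ends the loop earlier with a coarser estimate, which is the trade-off a termination criterion of this kind controls.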
Regarding claim 4, Hirzer in view of Daniilidis and Liao does not disclose the following limitation:
terminating the iterative refinement loop based on a determination that a result of the iterative refinement loop is within a predetermined tolerance.
However, Cunningham further discloses a method for detecting objects in images, including terminating the iterative refinement loop based on a determination that a result of the iterative refinement loop is within a predetermined tolerance (see [0050], where “The training module 230 stops 370 backpropagation of the error terms through the neural network model 220 after both the first loss function and the second loss function satisfy a criterion, for example, once the error terms are within a predetermined acceptable range.”; the predetermined acceptable range corresponds to the predetermined tolerance. See also fig 1, where the object detection module is connected with the training server. See also fig 5, where an object in the acquired image is detected using a trained neural network model. See also fig 6, where fully/partially labeled training images and acquired images are compared. When the error terms are within a predetermined range, the system stops the iteration.).
Hirzer, Daniilidis, Liao and Cunningham are all in the same field of endeavor of object detection in an image. Thus, before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to have modified Hirzer in view of Daniilidis and Liao to incorporate the teachings of Cunningham by including the above feature, namely terminating the iterative refinement loop based on a determination that a result of the iterative refinement loop is within a predetermined tolerance, in order to classify an object in the image by using a trained model without any delay.
Regarding claim 9, Hirzer further discloses a non-transitory computer-readable medium, the executable instructions configured to cause the processor unit to perform the method, including: 
the iterative estimation loop performing the matching and object pose estimation (see fig 3B, block 370-380. See also fig 11, where block diagram of pose hypothesis is shown. see also [0067], where “The 3D tracker 314 provides the initial pose to the pose hypothesis sampling block 310.”; see also [0069], where “At block 374, the pose hypothesis sampling block 310 performs a loop containing block 376 for each respective pose hypothesis.”; see also [0070], where “At block 376, the pose hypothesis sampling block 310 generates a 3D rendering of the scene corresponding to the respective pose.”; see also [0071-72] and [0113-124], where one pose is selected from plurality of poses by generating pose hypothesis and 3D rendering generation. So, the system is determining the object pose iteratively. See also [0034] and [0041]). 
Hirzer in view of Daniilidis and Liao does not disclose the following limitation:
terminating the iterative estimation loop based on a determination that a comparison between a segmented image and the predefined image file is within a predetermined tolerance.
However, Cunningham further discloses a method for detecting objects in images, including terminating the iterative estimation loop based on a determination that a comparison between a segmented image and the predefined image file is within a predetermined tolerance (see [0050], where “The training module 230 stops 370 backpropagation of the error terms through the neural network model 220 after both the first loss function and the second loss function satisfy a criterion, for example, once the error terms are within a predetermined acceptable range.”; the predetermined acceptable range corresponds to the predetermined tolerance. See also fig 1, where the object detection module is connected with the training server. See also fig 5, where an object in the acquired image is detected using a trained neural network model. See also fig 6, where fully/partially labeled training images and acquired images are compared. When the error terms are within a predetermined range, the system stops the iteration.).
Hirzer, Daniilidis, Liao and Cunningham are all in the same field of endeavor of object detection in an image. Thus, before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to have modified Hirzer in view of Daniilidis and Liao to incorporate the teachings of Cunningham by including the above feature, namely terminating the iterative estimation loop based on a determination that a comparison between a segmented image and the predefined image file is within a predetermined tolerance, in order to classify an object in the image by using a trained model without any delay.
Regarding claim 11, Hirzer in view of Daniilidis and Liao does not disclose the following limitation:
 including terminating the iterative refinement loop based on a determination that a result of the iterative refinement loop is within a predetermined tolerance.
However, Cunningham further discloses a method for detecting objects in images, including terminating the iterative refinement loop based on a determination that a result of the iterative refinement loop is within a predetermined tolerance (see [0050], where “The training module 230 stops 370 backpropagation of the error terms through the neural network model 220 after both the first loss function and the second loss function satisfy a criterion, for example, once the error terms are within a predetermined acceptable range.”; the predetermined acceptable range corresponds to the predetermined tolerance. See also fig 1, where the object detection module is connected with the training server. See also fig 5, where an object in the acquired image is detected using a trained neural network model. See also fig 6, where fully/partially labeled training images and acquired images are compared. When the error terms are within a predetermined range, the system stops the iteration.).
Hirzer, Daniilidis, Liao and Cunningham are all in the same field of endeavor of object detection in an image. Thus, before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to have modified Hirzer in view of Daniilidis and Liao to incorporate the teachings of Cunningham by including the above feature, namely terminating the iterative refinement loop based on a determination that a result of the iterative refinement loop is within a predetermined tolerance, in order to classify an object in the image by using a trained model without any delay.
Regarding claim 16, Hirzer further discloses a system, the executable instructions configured to cause the processor unit to perform the method, including: 
the iterative estimation loop performing the matching and object pose estimation (see fig 3B, block 370-380. See also fig 11, where block diagram of pose hypothesis is shown. see also [0067], where “The 3D tracker 314 provides the initial pose to the pose hypothesis sampling block 310.”; see also [0069], where “At block 374, the pose hypothesis sampling block 310 performs a loop containing block 376 for each respective pose hypothesis.”; see also [0070], where “At block 376, the pose hypothesis sampling block 310 generates a 3D rendering of the scene corresponding to the respective pose.”; see also [0071-72] and [0113-124], where one pose is selected from plurality of poses by generating pose hypothesis and 3D rendering generation. So, the system is determining the object pose iteratively. See also [0034] and [0041]). 
Hirzer in view of Daniilidis and Liao does not disclose the following limitation:
terminating the iterative estimation loop based on a determination that a comparison between a segmented image and the predefined image file is within a predetermined tolerance.
However, Cunningham further discloses a method for detecting objects in images, including terminating the iterative estimation loop based on a determination that a comparison between a segmented image and the predefined image file is within a predetermined tolerance (see [0050], where “The training module 230 stops 370 backpropagation of the error terms through the neural network model 220 after both the first loss function and the second loss function satisfy a criterion, for example, once the error terms are within a predetermined acceptable range.”; the predetermined acceptable range corresponds to the predetermined tolerance. See also fig 1, where the object detection module is connected with the training server. See also fig 5, where an object in the acquired image is detected using a trained neural network model. See also fig 6, where fully/partially labeled training images and acquired images are compared. When the error terms are within a predetermined range, the system stops the iteration.).
Hirzer, Daniilidis, Liao and Cunningham are all in the same field of endeavor of object detection in an image. Thus, before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to have modified Hirzer in view of Daniilidis and Liao to incorporate the teachings of Cunningham by including the above feature, namely terminating the iterative estimation loop based on a determination that a comparison between a segmented image and the predefined image file is within a predetermined tolerance, in order to classify an object in the image by using a trained model without any delay.
Regarding claim 18, Hirzer in view of Daniilidis and Liao does not disclose the following limitation: 
terminating the iterative refinement loop based on a determination by the refinement DNN that a result of the iterative refinement loop is within a predetermined tolerance.
However, Cunningham further discloses a method for detecting objects in images, including terminating the iterative refinement loop based on a determination by the refinement DNN that a result of the iterative refinement loop is within a predetermined tolerance (see [0050], where “The training module 230 stops 370 backpropagation of the error terms through the neural network model 220 after both the first loss function and the second loss function satisfy a criterion, for example, once the error terms are within a predetermined acceptable range.”; the predetermined acceptable range corresponds to the predetermined tolerance. See also fig 1, where the object detection module is connected with the training server. See also fig 5, where an object in the acquired image is detected using a trained neural network model. See also fig 6, where fully/partially labeled training images and acquired images are compared. When the error terms are within a predetermined range, the system stops the iteration.).
Hirzer, Daniilidis, Liao and Cunningham are in the same field of endeavor of object detection on an image. Thus, before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to have modified Hirzer in view of Daniilidis and Liao to incorporate the teachings of Cunningham by including the above feature, terminating the iterative refinement loop based on a determination by the refinement DNN that a result of the iterative refinement loop is within a predetermined tolerance, for classifying an object in the image by using a trained model without any delay.

Claim(s) 3, 10 and 17 is/are rejected under 35 U.S.C. 103 as being unpatentable over US 2019/0080467 (“Hirzer”), in view of US 2019/0219401 (“Daniilidis”), and in view of US 2019/0026917 (“Liao”), as applied to claims 1, 8 and 15 above, and further in view of US 2013/0033522 (“Calman”). 
Regarding claim 3, Hirzer in view of Daniilidis and Liao does not disclose the following limitation: 
projecting the predefined image file determined to be a match to adjust for the segmented object pose.
However, Calman discloses a method for populating a database based on video analysis of identified objects, including projecting the predefined image file determined to be a match to adjust for the segmented object pose (for examination purposes, projecting is interpreted as predicting; see at least [0034] of the PGPUB of the submitted specification. See Calman [0041], where “In block 106, the computer-implemented method 100 identifies a document associated with the object. In an embodiment, the database storing the reference images also includes associated documents. For example, when the computer-implemented method 100 identifies a match for the image of the object captured in the real-time video stream, the computer-implemented method receives information on the associated document.”; the documents associated with the object correspond to projecting the predefined image file. See also [0014], where “The apparatus further comprises image comparison logic stored in the memory, executable by the processor, and configured to identify a document based on comparison of the object to a reference image”; see also [0083], where “As discussed herein, the mobile device captures the objects 320 and the processor compares the objects to reference images stored in databases, such as an Augmented Reality (AR) database 514 or a financial institution database 512.”).
Hirzer, Daniilidis, Liao and Calman are in the same field of endeavor of object detection on an image. Thus, before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to have modified Hirzer in view of Daniilidis and Liao to incorporate the teachings of Calman by including the above feature, projecting the predefined image file determined to be a match to adjust for the segmented object pose, for reducing the object pose population time by narrowing the available databases (selecting a database from the many available databases).
Regarding claim 10, Hirzer in view of Daniilidis and Liao does not disclose the following limitation: 
projecting the predefined image file determined to be a match to adjust for the segmented object pose.
However, Calman further discloses a method for populating a database based on video analysis of identified objects, including projecting the predefined image file determined to be a match to adjust for the segmented object pose (for examination purposes, projecting is interpreted as predicting; see at least [0034] of the PGPUB of the submitted specification. See Calman [0041], where “In block 106, the computer-implemented method 100 identifies a document associated with the object. In an embodiment, the database storing the reference images also includes associated documents. For example, when the computer-implemented method 100 identifies a match for the image of the object captured in the real-time video stream, the computer-implemented method receives information on the associated document.”; the documents associated with the object correspond to projecting the predefined image file. See also [0014], where “The apparatus further comprises image comparison logic stored in the memory, executable by the processor, and configured to identify a document based on comparison of the object to a reference image”; see also [0083], where “As discussed herein, the mobile device captures the objects 320 and the processor compares the objects to reference images stored in databases, such as an Augmented Reality (AR) database 514 or a financial institution database 512.”).
Hirzer, Daniilidis, Liao and Calman are in the same field of endeavor of object detection on an image. Thus, before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to have modified Hirzer in view of Daniilidis and Liao to incorporate the teachings of Calman by including the above feature, projecting the predefined image file determined to be a match to adjust for the segmented object pose, for reducing the object pose population time by narrowing the available databases (selecting a database from the many available databases).
Regarding claim 17, Hirzer in view of Daniilidis and Liao does not disclose the following limitation: 
projecting the predefined image file determined to be a match to adjust for the segmented object pose.
However, Calman further discloses a method for populating a database based on video analysis of identified objects, including projecting the predefined image file determined to be a match to adjust for the segmented object pose (for examination purposes, projecting is interpreted as predicting; see at least [0034] of the PGPUB of the submitted specification. See Calman [0041], where “In block 106, the computer-implemented method 100 identifies a document associated with the object. In an embodiment, the database storing the reference images also includes associated documents. For example, when the computer-implemented method 100 identifies a match for the image of the object captured in the real-time video stream, the computer-implemented method receives information on the associated document.”; the documents associated with the object correspond to projecting the predefined image file. See also [0014], where “The apparatus further comprises image comparison logic stored in the memory, executable by the processor, and configured to identify a document based on comparison of the object to a reference image”; see also [0083], where “As discussed herein, the mobile device captures the objects 320 and the processor compares the objects to reference images stored in databases, such as an Augmented Reality (AR) database 514 or a financial institution database 512.”).
Hirzer, Daniilidis, Liao and Calman are in the same field of endeavor of object detection on an image. Thus, before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to have modified Hirzer in view of Daniilidis and Liao to incorporate the teachings of Calman by including the above feature, projecting the predefined image file determined to be a match to adjust for the segmented object pose, for reducing the object pose population time by narrowing the available databases (selecting a database from the many available databases).
Response to Arguments
Applicant’s arguments with respect to claims 1-20 have been considered but are moot because the arguments do not apply to the new combination of references used in the current rejection, which was necessitated by the newly added claim amendments.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SOHANA TANJU KHAYER whose telephone number is (408)918-7597.  The examiner can normally be reached on Monday - Thursday, 7 am-5:30 pm, PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Abby Lin, can be reached on 571-270-3976.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/S.T.K./
Examiner, Art Unit 3664
/ABBY Y LIN/
Supervisory Patent Examiner, Art Unit 3664