DETAILED ACTION
Notice of Pre-AIA or AIA Status
Claims 1-19 are pending in this application. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

Specification
The title of the invention is not descriptive.  A new title is required that is clearly indicative of the invention to which the claims are directed. 

35 U.S.C. § 112 Sixth Paragraph - Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.  The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. 
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
This application includes claim limitations that do not use the word “means” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because each limitation uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function, and the generic placeholder is not preceded by a structural modifier. Such claim limitations are: “unit” in claims 16-19.
Because these claim limitations are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, they are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have these limitations interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitations to avoid their being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitations recite sufficient structure to perform the claimed function so as to avoid their being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-19 are rejected under 35 U.S.C. 103 as being unpatentable over Sharp et al. (US Patent 9251590, hereinafter “Sharp”) in view of Kendall et al. (Alex Kendall, Matthew Grimes, Roberto Cipolla; "PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization", Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 2938-2946, hereinafter “Kendall”). Kendall was cited by applicant in the IDS submitted on December 10, 2020.
Consider Claims 1 and 9. 
Sharp teaches: 
1. A method for image-based positioning to predict a camera pose from image data, comprising: / 9. An apparatus for image-based positioning to predict a camera pose from image data, comprising: (Sharp: abstract, Camera pose estimation for 3D reconstruction is described, for example, to enable position and orientation of a depth camera moving in an environment to be tracked for robotics, gaming and other applications. In various embodiments, depth observations from the mobile depth camera are aligned with surfaces of a 3D model of the environment in order to find an updated position and orientation of the mobile depth camera which facilitates the alignment. Column 2 lines 57-67, FIG. 1 is a schematic diagram of a person 100 standing in a room and holding a mobile depth camera 102 which in this example also incorporates a projector which is projecting the image of a cat 108 into the room. The room contains various objects 106 such as a chair, door, window, plant, light and another person 104. Many of the objects 106 are static although some of the objects such as person 104 may move. As the person moves around the room the mobile depth camera captures images which are used by a real-time camera tracking system 112 to monitor the location and orientation of the camera in the room.)
9. an interface device configured to receive image data for image-based positioning; a memory in which map information on cells constituting a space for positioning is stored, wherein the map information includes minimum and maximum values of coordinates constituting a cell for each cell index, and a cell size; and a processor configured to predict a camera pose based on the image data, wherein the processor is configured to perform operations by: (Sharp: column 10 lines 56-67, column 11 lines 1-20, Starting with the lowest resolution versions of the current and previous depth image frames, a frame to frame energy function is minimized to find a camera pose which best aligns the two frames. The camera pose is refined by repeating this process for successively higher resolution pairs of the previous and current depth image frames from the pyramids. For example the frame to frame energy function which is an energy of a function of a camera pose registration parameter vector is equal to the sum over image elements i of the squares of the differences between, the depth component (obtained by applying a matrix transpose T with the unit vector e) of the 3D points in view space of the image elements of the previous frame, times a rigid transformation M(0) from the view space of the previous frame to that of the current frame, and the 3D points in view space of the image element of the current frame, given a function L(x) which represents the projection of a 3D view space coordinate X to a 2D image coordinate; where X is given by the corresponding depth value of the image element of the previous frame times the rigid transformation from the view space of the previous frame to that of the current frame.)
1. obtaining, by a positioning apparatus, a prediction result indicating which cell the image data belongs to among cells constituting a space for positioning from a classification network that processes the image data based on a pre-learned weight; / 9. obtaining, through the interface device, a prediction result indicating which cell the image data belongs to among cells constituting a space for positioning from a classification network learned based on hard classification; (Sharp: column 5 lines 30-51, For example, in the case of FIG. 1 the 3D model would be a 3D model of the surfaces and objects in the room. In the case of FIG. 2 the 3D model would be a 3D model of the floor of the building including the objects and surfaces on that floor of the building. The dense 3D model 326 may be stored in GPU memory or in other ways. For example, the dense 3D model may be stored as a linear array (Examiner note: a linear array comprises cells) in slice-row-column order, optionally with some padding so that slices and rows align certain memory block sizes.)
1. obtaining, by the positioning apparatus, map information on the space for positioning by using a cell index selected based on the prediction result, wherein the map information includes minimum and maximum values of coordinates constituting a cell for each cell index, and a cell size; / 9. obtaining map information on the space for positioning by using a cell index selected based on the prediction result; (Sharp: column 5 lines 30-51, For example, the model may be stored in GPU texture memory or it may be stored as a linear array of memory locations used to represent a 3D volume. This may be achieved by mapping each voxel (or other 3D image element such as a group of voxels) to a memory array index using a linear pitched memory which provides fast, parallel access to the data stored on the parallel computing unit memory. Each voxel may store a numerical value which may be zero at a surface represented by the model, positive outside objects represented by the model and negative inside objects represented by the model, where the magnitude of the numerical value is related to distance from the closest surface represented by the model. Column 6 lines 1-18, The camera pose engine 318 of the real-time tracker is arranged to compute the camera pose by finding a camera pose which gives a good alignment of a depth map frame with the dense 3D model. It uses an iterative process which may be implemented using one or more graphics processing units, or using multiple CPU cores, in order that the camera pose engine operates in real-time.)
1. and calculating, by the positioning apparatus, a position of the image data based on the map information and outputting a corresponding camera pose. / 9. and calculating a location for the image data based on the map information and outputting a corresponding camera pose. (Sharp: column 6 lines 29-59, The relocalization engine 322 is arranged to deal with the situation where the real-time tracker loses the current location of the mobile environment sensor 300 and relocalizes or finds the current location again. The processing performed by the real-time tracker 316 and/or the dense 3D model formation system 324 can, in one example, be executed remotely from the location of the mobile environment sensor 300. For example, the mobile environment sensor 300 can be connected to (or comprise) a computing device having relatively low processing power, and which streams the depth images over a communications network to a server.)
Sharp does not teach: 
- 1. processes the image data based on a pre-learned weight
Kendall teaches: 
1. A method for image-based positioning to predict a camera pose from image data, comprising: / 9. An apparatus for image-based positioning to predict a camera pose from image data, comprising: (Kendall: page 2938 section Abstract and Introduction, Figure 1 PoseNet Convolutional neural network monocular camera re-localization. We present a robust and real-time monocular six degree of freedom re-localization system. Our system trains a convolutional neural network to regress the 6-DOF camera pose from a single RGB image in an end-to-end manner with no need of additional engineering or graph optimization)
9. an interface device configured to receive image data for image-based positioning; a memory in which map information on cells constituting a space for positioning is stored, wherein the map information includes minimum and maximum values of coordinates constituting a cell for each cell index, and a cell size; and a processor configured to predict a camera pose based on the image data, wherein the processor is configured to perform operations by: (Kendall: page 2939-2940 section 3 Model for Deep Regression of Camera Pose, Figure 3: Magnified view of a sequence of training (green) and testing (blue) cameras for King’s College. We show the predicted camera pose in red for each testing frame. The images show the test image (top), the predicted view from our convnet overlaid in red on the input image (middle) and the nearest neighbour training image overlaid in red on the input image (bottom). This shows our system can interpolate camera pose effectively in space between training frames)
1. obtaining, by a positioning apparatus, a prediction result indicating which cell the image data belongs to among cells constituting a space for positioning from a classification network that processes the image data based on a pre-learned weight; / 9. obtaining, through the interface device, a prediction result indicating which cell the image data belongs to among cells constituting a space for positioning from a classification network learned based on hard classification; (Kendall: page 2940 section 3.1. Simultaneously learning location and orientation, We found it was important to randomly initialize the final position regressor layer so that the norm of the weights corresponding to each position dimension was proportional to that dimension’s spatial extent…. Furthermore, other convnets that have been used for regression operate off very large datasets [25, 19]. For localization regression to work off limited data we leverage the powerful representations learned off these large classification datasets by pretraining the weights on these datasets.)
1. obtaining, by the positioning apparatus, map information on the space for positioning by using a cell index selected based on the prediction result, wherein the map information includes minimum and maximum values of coordinates constituting a cell for each cell index, and a cell size; / 9. obtaining map information on the space for positioning by using a cell index selected based on the prediction result; (Kendall: page 2939 section 2 Related Work, column 2 paragraph 2, Our work most closely follows from the Scene Coordinate Regression Forests for relocalization proposed in [20]. This algorithm uses depth images to create scene coordinate labels which map each pixel from camera coordinates to global scene coordinates. This was then used to train a regression forest to regress these labels and localize the camera. Page 2941, Section 5 Experiments, Figure 6: Dataset details and results. We show median performance for PoseNet on all scenes, evaluated on a single 224x224 center crop and 128 uniformly separated dense crops. For comparison we plot the results from SCoRe Forest [20] which uses depth, therefore fails on outdoor scenes. This system regresses pixel-wise world coordinates of the input image at much larger resolution. This requires a dense depth map for training and an extra RANSAC step to determine the camera’s pose. Additionally, we compare to matching the nearest neighbour feature vector representation from PoseNet. This demonstrates our regression PoseNet performs better than a classifier. )
1. and calculating, by the positioning apparatus, a position of the image data based on the map information and outputting a corresponding camera pose. / 9. and calculating a location for the image data based on the map information and outputting a corresponding camera pose. (Kendall: page 2941 Figure 3: Magnified view of a sequence of training (green) and testing (blue) cameras for King’s College. We show the predicted camera pose in red for each testing frame. The images show the test image (top), the predicted view from our convnet overlaid in red on the input image (middle) and the nearest neighbour training image overlaid in red on the input image (bottom). This shows our system can interpolate camera pose effectively in space between training frames. Section 5 Experiments, Figure 6: Dataset details and results. We show median performance for PoseNet on all scenes, evaluated on a single 224x224 center crop and 128 uniformly separated dense crops. For comparison we plot the results from SCoRe Forest [20] which uses depth, therefore fails on outdoor scenes. This system regresses pixel-wise world coordinates of the input image at much larger resolution. This requires a dense depth map for training and an extra RANSAC step to determine the camera’s pose. Additionally, we compare to matching the nearest neighbour feature vector representation from PoseNet. This demonstrates our regression PoseNet performs better than a classifier.)
It would have been obvious before the effective filing date of the claimed invention to one of ordinary skill in the art to modify Sharp’s method and system for camera pose estimation to substitute in Kendall’s improved CNN-based camera re-localization algorithm, as both references are directed towards the same field of endeavor. The determination of obviousness is predicated upon the following findings: One skilled in the art would have been motivated to modify the method and system of Sharp for camera pose estimation in order to use Kendall’s improved algorithm and architecture for camera pose estimation using a CNN-based real-time 6-DOF relocalization system. Furthermore, the prior art collectively includes each element claimed (though not all in the same reference), and one of ordinary skill in the art could have combined the elements in the manner explained above using known engineering design, interface and programming techniques, without changing a “fundamental” operating principle of Sharp, while the teaching of Kendall continues to perform the same function as originally taught prior to being combined, in order to produce the repeatable and predictable result of a more accurate pose estimation that is more adaptable for different lighting and image acquisition settings. It is for at least the aforementioned reasons that the examiner has reached a conclusion of obviousness with respect to the claims in question.
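
For illustration only, the sequence of steps recited in claims 1 and 9 (prediction result, selected cell index, map information, position) can be sketched in Python as follows. The sketch is not taken from the application, Sharp, or Kendall. The names CellInfo, build_grid_map, select_cell_index and cell_center are hypothetical, the confidence values are made up, and using the cell midpoint as the position is an assumption, since the claims do not say how the position is derived from the map information.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellInfo:
    """Hypothetical map record for one cell: coordinate bounds and cell size."""
    min_xy: tuple
    max_xy: tuple
    size: float


def build_grid_map(cols, rows, size):
    """Build a hypothetical row-major 2D grid map keyed by cell index."""
    cells = {}
    for r in range(rows):
        for c in range(cols):
            lo = (c * size, r * size)
            hi = ((c + 1) * size, (r + 1) * size)
            cells[r * cols + c] = CellInfo(lo, hi, size)
    return cells


def select_cell_index(confidences):
    """Pick the class (assumed to map one-to-one to a cell index) with the highest confidence."""
    return max(range(len(confidences)), key=lambda i: confidences[i])


def cell_center(info):
    """Assumed position estimate: the midpoint of the cell's coordinate bounds."""
    return tuple((lo + hi) / 2.0 for lo, hi in zip(info.min_xy, info.max_xy))


grid = build_grid_map(cols=3, rows=2, size=2.0)
confidences = [0.05, 0.10, 0.05, 0.10, 0.60, 0.10]  # made-up classifier output
index = select_cell_index(confidences)
position = cell_center(grid[index])
```

In this sketch the cell index acts purely as a lookup key: the classifier output never carries coordinates, which come only from the stored map record.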

Consider Claim 2. The combination of Sharp and Kendall teaches: 
2. The method of claim 1, wherein the outputting of a camera pose comprises calculating a position of the image data based on a sum of weights of neighboring cells of the selected cell index in the space for positioning. (Sharp: column 10 lines 56-67, column 11 lines 1-20, Starting with the lowest resolution versions of the current and previous depth image frames, a frame to frame energy function is minimized to find a camera pose which best aligns the two frames. The camera pose is refined by repeating this process for successively higher resolution pairs of the previous and current depth image frames from the pyramids. For example the frame to frame energy function which is an energy of a function of a camera pose registration parameter vector is equal to the sum over image elements i of the squares of the differences between, the depth component (obtained by applying a matrix transpose T with the unit vector e) of the 3D points in view space of the image elements of the previous frame, times a rigid transformation M(0) from the view space of the previous frame to that of the current frame, and the 3D points in view space of the image element of the current frame, given a function L(x) which represents the projection of a 3D view space coordinate X to a 2D image coordinate; where X is given by the corresponding depth value of the image element of the previous frame times the rigid transformation from the view space of the previous frame to that of the current frame.)
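
For illustration only, one possible reading of the "sum of weights of neighboring cells" recited in claim 2 is a confidence-weighted average of cell centers over the selected cell and its four edge-adjacent neighbors. The Python sketch below assumes that reading. It is not taken from the application or from either reference, and the function name weighted_position, the 3x3 grid and the weight values are all hypothetical.

```python
def weighted_position(centers, weights, selected, cols):
    """Weighted average of the centers of the selected cell and its 4-neighbors.

    centers: list of (x, y) cell centers in row-major order (assumed layout).
    weights: per-cell weights, e.g. classifier confidences (assumed).
    """
    rows = len(centers) // cols
    r, c = divmod(selected, cols)
    candidates = [(r, c), (r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    # Keep only the candidates that fall inside the grid.
    idxs = [rr * cols + cc for rr, cc in candidates
            if 0 <= rr < rows and 0 <= cc < cols]
    total = sum(weights[i] for i in idxs)
    if total == 0:
        # No weight in the neighborhood: fall back to the selected cell's center.
        return centers[selected]
    x = sum(weights[i] * centers[i][0] for i in idxs) / total
    y = sum(weights[i] * centers[i][1] for i in idxs) / total
    return x, y


# 3x3 grid of unit cells; the center of cell (row r, column c) is (c + 0.5, r + 0.5).
cols = 3
centers = [(c + 0.5, r + 0.5) for r in range(3) for c in range(3)]
weights = [0.0, 0.1, 0.0, 0.1, 0.5, 0.3, 0.0, 0.0, 0.0]  # made-up values
estimate = weighted_position(centers, weights, selected=4, cols=cols)
```

With these made-up weights the estimate is pulled from the center of cell 4 toward the higher-weighted neighbor on its right, which is the behavior a weighted sum over neighboring cells would be expected to produce.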

Consider Claim 3. The combination of Sharp and Kendall teaches:
3. The method of claim 1, wherein a cell constituting the space for positioning is mapped to one class, and the prediction result includes confidence for each class, wherein the outputting of a camera pose comprises outputting of the camera pose and confidence corresponding thereto. (Kendall: page 2942, Figure 6, Dataset details and results. We show median performance for PoseNet on all scenes, evaluated on a single 224x224 center crop and 128 uniformly separated dense crops. For comparison we plot the results from SCoRe Forest [20] which uses depth, therefore fails on outdoor scenes. This system regresses pixel-wise world coordinates of the input image at much larger resolution. This requires a dense depth map for training and an extra RANSAC step to determine the camera’s pose. Additionally, we compare to matching the nearest neighbour feature vector representation from PoseNet. This demonstrates our regression PoseNet performs better than a classifier. Figure 7: Localization performance. These figures show our localization accuracy for both position and orientation as a cumulative histogram of errors for the entire testing set. The regression convnet outperforms the nearest neighbour feature matching which demonstrates we regress finer resolution results than given by training. Comparing to the RGB-D SCoRe Forest approach shows that our method is competitive, but outperformed by a more expensive depth approach. Our method does perform better on the hardest few frames, above the 95th percentile, with our worst error lower than the worst error from the SCoRe approach.)

Consider Claim 4. The combination of Sharp and Kendall teaches:
4. The method of claim 3, wherein the obtaining of map information comprises: selecting a class having highest confidence among the confidence for each class, and obtaining a cell index mapped to the selected class; and obtaining map information on the space for the positioning by using the obtained cell index. (Kendall: page 2942, Figure 6, Dataset details and results. We show median performance for PoseNet on all scenes, evaluated on a single 224x224 center crop and 128 uniformly separated dense crops. For comparison we plot the results from SCoRe Forest [20] which uses depth, therefore fails on outdoor scenes. This system regresses pixel-wise world coordinates of the input image at much larger resolution. This requires a dense depth map for training and an extra RANSAC step to determine the camera’s pose. Additionally, we compare to matching the nearest neighbour feature vector representation from PoseNet. This demonstrates our regression PoseNet performs better than a classifier. Figure 7: Localization performance. These figures show our localization accuracy for both position and orientation as a cumulative histogram of errors for the entire testing set. The regression convnet outperforms the nearest neighbour feature matching which demonstrates we regress finer resolution results than given by training. Comparing to the RGB-D SCoRe Forest approach shows that our method is competitive, but outperformed by a more expensive depth approach. Our method does perform better on the hardest few frames, above the 95th percentile, with our worst error lower than the worst error from the SCoRe approach.)

Consider Claim 5. The combination of Sharp and Kendall teaches:
5. The method of claim 1, wherein: the classification network is learned through hard classification-based learning, the hard classification-based learning is performed by converting training data into an index for applying camera pose classification and performing hard labeling for learning on each index, and the hard labeling is performed by setting only one index cell corresponding to a camera pose to "1" and setting the rest to "0". (Kendall: page 2939-2940 section 3 Model for Deep Regression of Camera Pose, Figure 3: Magnified view of a sequence of training (green) and testing (blue) cameras for King’s College. We show the predicted camera pose in red for each testing frame. The images show the test image (top), the predicted view from our convnet overlaid in red on the input image (middle) and the nearest neighbour training image overlaid in red on the input image (bottom). This shows our system can interpolate camera pose effectively in space between training frames. Page 2940 section 3.2 Architecture, Replace all three softmax classifiers with affine regressors. The softmax layers were removed and each final fully connected layer was modified to output a pose vector of 7-dimensions representing position (3) and orientation (4). Insert another fully connected layer before the final regressor of feature size 2048. This was to form a localization feature vector which may then be explored for generalization. At test time we also normalize the quaternion orientation vector to unit length. We rescaled the input image so that the smallest dimension was 256 pixels before cropping to the 224x224 pixel input to the GoogLeNet convnet. The convnet was trained on random crops (which do not affect the camera pose). At test time we evaluate it with both a single center crop and also densely with 128 uniformly spaced crops of the input image, averaging the resulting pose vectors. With parallel GPU processing, this results in a computational time increase from 5ms to 95ms per image.)
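
For illustration only, the hard labeling recited in claim 5 (one index cell set to "1", the rest set to "0") corresponds to a one-hot label over cell indices. The Python sketch below assumes a 2D row-major grid with a known origin and cell size. The function names cell_index_for and hard_label and all numeric values are hypothetical and are not taken from the application or the cited references.

```python
def cell_index_for(position, origin, cell_size, cols, rows):
    """Convert an (x, y) training position into a row-major cell index (assumed layout)."""
    col = int((position[0] - origin[0]) // cell_size)
    row = int((position[1] - origin[1]) // cell_size)
    if not (0 <= col < cols and 0 <= row < rows):
        raise ValueError("position lies outside the space for positioning")
    return row * cols + col


def hard_label(index, num_cells):
    """One-hot label: 1 for the cell containing the camera pose, 0 for every other cell."""
    label = [0] * num_cells
    label[index] = 1
    return label


# Made-up training sample on a 3x2 grid of cells with side length 2.0.
idx = cell_index_for((3.2, 0.7), origin=(0.0, 0.0), cell_size=2.0, cols=3, rows=2)
label = hard_label(idx, num_cells=6)
```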

Consider Claim 6. The combination of Sharp and Kendall teaches:
6. The method of claim 1, wherein a cell constituting the space for positioning is mapped to one class, the prediction result includes a score for each class, and the score is an evaluation score calculated based on a loss function, wherein the obtaining of map information comprises: selecting a class having a highest score among the scores for each class, and obtaining a cell index mapped to the selected class; and obtaining map information on the space for the positioning by using the obtained cell index. (Sharp: column 10 lines 56-67, column 11 lines 1-20, Starting with the lowest resolution versions of the current and previous depth image frames, a frame to frame energy function is minimized to find a camera pose which best aligns the two frames. The camera pose is refined by repeating this process for successively higher resolution pairs of the previous and current depth image frames from the pyramids. For example the frame to frame energy function which is an energy of a function of a camera pose registration parameter vector is equal to the sum over image elements i of the squares of the differences between, the depth component (obtained by applying a matrix transpose T with the unit vector e) of the 3D points in view space of the image elements of the previous frame, times a rigid transformation M(0) from the view space of the previous frame to that of the current frame, and the 3D points in view space of the image element of the current frame, given a function L(x) which represents the projection of a 3D view space coordinate X to a 2D image coordinate; where X is given by the corresponding depth value of the image element of the previous frame times the rigid transformation from the view space of the previous frame to that of the current frame.)

Consider Claim 7. The combination of Sharp and Kendall teaches:
7. The method of claim 1, wherein: the classification network is learned through soft classification-based learning, the soft classification-based learning is performed by converting training data into indexes for applying camera pose classification and performing soft labeling for learning on each index, and the soft labeling is based on a linear interpolation method. (Kendall: page 2939-2940 section 3 Model for Deep Regression of Camera Pose, Figure 3: Magnified view of a sequence of training (green) and testing (blue) cameras for King’s College. We show the predicted camera pose in red for each testing frame. The images show the test image (top), the predicted view from our convnet overlaid in red on the input image (middle) and the nearest neighbour training image overlaid in red on the input image (bottom). This shows our system can interpolate camera pose effectively in space between training frames. Page 2940 section 3.2 Architecture, Replace all three softmax classifiers with affine regressors. The softmax layers were removed and each final fully connected layer was modified to output a pose vector of 7-dimensions representing position (3) and orientation (4). Insert another fully connected layer before the final regressor of feature size 2048. This was to form a localization feature vector which may then be explored for generalization. At test time we also normalize the quaternion orientation vector to unit length. We rescaled the input image so that the smallest dimension was 256 pixels before cropping to the 224x224 pixel input to the GoogLeNet convnet. The convnet was trained on random crops (which do not affect the camera pose). At test time we evaluate it with both a single center crop and also densely with 128 uniformly spaced crops of the input image, averaging the resulting pose vectors. With parallel GPU processing, this results in a computational time increase from 5ms to 95ms per image.)

Consider Claim 8. The combination of Sharp and Kendall teaches:
8. The method of claim 7, wherein the soft labeling determines a soft label based on a distance and an area of a neighboring cell adjacent to an absolute location. (Kendall: page 2940 section 3.1. Simultaneously learning location and orientation, We found it was important to randomly initialize the final position regressor layer so that the norm of the weights corresponding to each position dimension was proportional to that dimension’s spatial extent…. Furthermore, other convnets that have been used for regression operate off very large datasets [25, 19]. For localization regression to work off limited data we leverage the powerful representations learned off these large classification datasets by pretraining the weights on these datasets. Sharp: column 6 lines 29-59, The relocalization engine 322 is arranged to deal with the situation where the real-time tracker loses the current location of the mobile environment sensor 300 and relocalizes or finds the current location again. The processing performed by the real-time tracker 316 and/or the dense 3D model formation system 324 can, in one example, be executed remotely from the location of the mobile environment sensor 300. For example, the mobile environment sensor 300 can be connected to (or comprise) a computing device having relatively low processing power, and which streams the depth images over a communications network to a server.)
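
For illustration only, soft labeling "based on a linear interpolation method" (claim 7) that depends on "a distance and an area" of neighboring cells (claim 8) could be read as bilinear interpolation over the four cell centers nearest to the absolute location. The Python sketch below assumes that reading, a 2D row-major grid with its origin at (0, 0), and clamping at the grid border. The function name soft_label and all values are hypothetical and are not taken from the application or the cited references.

```python
import math


def soft_label(position, cell_size, cols, rows):
    """Bilinear soft label over the four cell centers nearest to `position`.

    Each weight is the product of two 1D linear weights, which equals the area
    of the sub-rectangle opposite the corresponding cell center.
    """
    # Express the position in cell units, measured from the first cell center.
    gx = position[0] / cell_size - 0.5
    gy = position[1] / cell_size - 0.5
    c0, r0 = math.floor(gx), math.floor(gy)
    fx, fy = gx - c0, gy - r0  # fractional distances to the lower cell centers
    label = [0.0] * (cols * rows)
    for dr, wy in ((0, 1.0 - fy), (1, fy)):
        for dc, wx in ((0, 1.0 - fx), (1, fx)):
            # Clamp to the grid so positions near the border keep all their weight.
            r = min(max(r0 + dr, 0), rows - 1)
            c = min(max(c0 + dc, 0), cols - 1)
            label[r * cols + c] += wx * wy
    return label


# Made-up location halfway between the centers of cells 1 and 4 on a 3x3 unit grid.
label_a = soft_label((1.5, 1.0), cell_size=1.0, cols=3, rows=3)
# Made-up location closer to the center of cell 4 than to cells 0, 1 and 3.
label_b = soft_label((1.25, 1.25), cell_size=1.0, cols=3, rows=3)
```

Unlike the one-hot hard label, the soft label spreads the probability mass over adjacent cells while still summing to one, so a location on a cell boundary is not forced into a single class.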


Consider Claim 10. The combination of Sharp and Kendall teaches:
10. The apparatus of claim 9, wherein the processor is configured to calculate a position for the image data based on a sum of weights of neighboring cells of the selected cell index in the space for positioning when performing the operation of outputting a camera pose. (Sharp: column 10 lines 56-67, column 11 lines 1-20, Starting with the lowest resolution versions of the current and previous depth image frames, a frame to frame energy function is minimized to find a camera pose which best aligns the two frames. The camera pose is refined by repeating this process for successively higher resolution pairs of the previous and current depth image frames from the pyramids. For example the frame to frame energy function which is an energy of a function of a camera pose registration parameter vector is equal to the sum over image elements i of the squares of the differences between, the depth component (obtained by applying a matrix transpose T with the unit vector e) of the 3D points in view space of the image elements of the previous frame, times a rigid transformation M(0) from the view space of the previous frame to that of the current frame, and the 3D points in view space of the image element of the current frame, given a function L(x) which represents the projection of a 3D view space coordinate X to a 2D image coordinate; where X is given by the corresponding depth value of the image element of the previous frame times the rigid transformation from the view space of the previous frame to that of the current frame.)
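As a hedged illustration of the limitation of claim 10, one possible reading of "a sum of weights of neighboring cells" is a weight-normalized average of the centers of the selected cell and its immediate neighbors. That reading, the function name `weighted_position`, and the 3x3 neighborhood are assumptions for the example and are not drawn from the application or from either reference.

```python
def weighted_position(selected, weights, centers, cols, rows):
    """Estimate a 2D position from the selected cell and its neighbors.

    weights: per-cell scores (e.g. classification confidences).
    centers: per-cell (x, y) centers, indexed like weights.
    The 3x3 neighborhood is an assumption made for this sketch.
    """
    r0, c0 = divmod(selected, cols)
    total = x = y = 0.0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            r, c = r0 + dr, c0 + dc
            if 0 <= r < rows and 0 <= c < cols:
                idx = r * cols + c
                w = weights[idx]
                cx, cy = centers[idx]
                total += w
                x += w * cx
                y += w * cy
    if total == 0.0:
        raise ValueError("all neighboring weights are zero")
    return x / total, y / total

# 3x3 grid with unit cells; weight split evenly between cells 4 and 5.
centers = [(c + 0.5, r + 0.5) for r in range(3) for c in range(3)]
weights = [0.0, 0.0, 0.0, 0.0, 0.5, 0.5, 0.0, 0.0, 0.0]
position = weighted_position(4, weights, centers, cols=3, rows=3)
```

Here the estimate falls halfway between the centers of cells 4 and 5, which shows how neighbor weights can refine a position beyond the resolution of a single cell.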

Consider Claim 11. The combination of Sharp and Kendall teaches:
11. The apparatus of claim 11, wherein cells constituting the space for positioning are mapped to one class, and the prediction result includes confidence for each class, wherein the processor is configured to output the camera pose and confidence corresponding thereto when performing an operation of outputting a camera pose. (Kendall: page 2939-2940 section 3 Model for Deep Regression of Camera Pose, Figure 3: Magnified view of a sequence of training (green) and testing (blue) cameras for King’s College. We show the predicted camera pose in red for each testing frame. The images show the test image (top), the predicted view from our convnet overlaid in red on the input image (middle) and the nearest neighbour training image overlaid in red on the input image (bottom). This shows our system can interpolate camera pose effectively in space between training frames). page 2940 section 3.2 Architecture, • Replace all three softmax classifiers with affine regressors. The softmax layers were removed and each final fully connected layer was modified to output a pose vector of 7-dimensions representing position (3) and orientation (4).  Insert another fully connected layer before the final regressor of feature size 2048. This was to form a localization feature vector which may then be explored for generalization. At test time we also normalize the quaternion orientation vector to unit length. We rescaled the input image so that the smallest dimension was 256 pixels before cropping to the 224x224 pixel input to the GoogLeNet convnet. The convnet was trained on random crops (which do not affect the camera pose). At test time we evaluate it with both a single center crop and also densely with 128 uniformly spaced crops of the input image, averaging the resulting pose vectors. With parallel GPU processing, this results in a computational time increase from 5ms to 95ms per image.)
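The test-time procedure described in the quoted Kendall passage (7-dimensional pose vectors averaged over several crops, with the quaternion normalized to unit length) can be sketched as follows. The order of operations (average first, then normalize) and the function name `average_pose` are assumptions; the quoted passage does not specify them.

```python
import math

def average_pose(crop_poses):
    """Average 7-D pose vectors [x, y, z, qw, qx, qy, qz] over crops.

    Component-wise quaternion averaging is only a reasonable
    approximation when the per-crop orientations are close together.
    """
    n = len(crop_poses)
    if n == 0:
        raise ValueError("at least one pose vector is required")
    mean = [sum(p[i] for p in crop_poses) / n for i in range(7)]
    position, quat = mean[:3], mean[3:]
    norm = math.sqrt(sum(q * q for q in quat))
    if norm == 0.0:
        raise ValueError("averaged quaternion has zero length")
    # Rescale the orientation part to a unit quaternion.
    return position, [q / norm for q in quat]

position, orientation = average_pose([
    [1.0, 2.0, 3.0, 2.0, 0.0, 0.0, 0.0],
    [3.0, 2.0, 1.0, 2.0, 0.0, 0.0, 0.0],
])
```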

Consider Claim 12. The combination of Sharp and Kendall teaches:
12. The apparatus of claim 11, wherein when performing the operation of obtaining map information, the processor is configured to perform operation by: selecting a class having highest confidence among the confidence for each class, and obtaining a cell index mapped to the selected class; and obtaining map information on the space for positioning by using the obtained cell index. (Kendall: page 2939-2940 section 3 Model for Deep Regression of Camera Pose, Figure 3: Magnified view of a sequence of training (green) and testing (blue) cameras for King’s College. We show the predicted camera pose in red for each testing frame. The images show the test image (top), the predicted view from our convnet overlaid in red on the input image (middle) and the nearest neighbour training image overlaid in red on the input image (bottom). This shows our system can interpolate camera pose effectively in space between training frames). page 2940 section 3.2 Architecture, • Replace all three softmax classifiers with affine regressors. The softmax layers were removed and each final fully connected layer was modified to output a pose vector of 7-dimensions representing position (3) and orientation (4).  Insert another fully connected layer before the final regressor of feature size 2048. This was to form a localization feature vector which may then be explored for generalization. At test time we also normalize the quaternion orientation vector to unit length. We rescaled the input image so that the smallest dimension was 256 pixels before cropping to the 224x224 pixel input to the GoogLeNet convnet. The convnet was trained on random crops (which do not affect the camera pose). At test time we evaluate it with both a single center crop and also densely with 128 uniformly spaced crops of the input image, averaging the resulting pose vectors. With parallel GPU processing, this results in a computational time increase from 5ms to 95ms per image.)
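The selection step recited in claim 12 (choose the class with the highest confidence, obtain the cell index mapped to it, then look up map information by that index) can be illustrated with a minimal sketch. The names `select_cell`, `class_to_cell`, and `cell_map`, and the use of a cell center as the "map information", are hypothetical and chosen only for the example.

```python
def select_cell(confidences, class_to_cell, cell_map):
    """Return (cell index, map information, confidence) for the best class."""
    # Class with the highest confidence.
    best_class = max(range(len(confidences)), key=confidences.__getitem__)
    # Cell index mapped to that class.
    cell_index = class_to_cell[best_class]
    # Map information stored for that cell.
    return cell_index, cell_map[cell_index], confidences[best_class]

confidences = [0.1, 0.7, 0.2]
class_to_cell = [10, 11, 12]
cell_map = {10: (0.5, 0.5), 11: (1.5, 0.5), 12: (2.5, 0.5)}
cell_index, map_info, confidence = select_cell(confidences, class_to_cell, cell_map)
```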

Consider Claim 13. The combination of Sharp and Kendall teaches:
13. The apparatus of claim 9, wherein: the classification network is learned through hard classification-based learning, the hard classification-based learning is performed by converting training data into an index for applying camera pose classification and performing hard labeling for learning on each index, and the hard labeling is performed by setting only one index cell corresponding to a camera pose to "1" and setting the rest to "0". (Kendall: page 2939-2940 section 3 Model for Deep Regression of Camera Pose, Figure 3: Magnified view of a sequence of training (green) and testing (blue) cameras for King’s College. We show the predicted camera pose in red for each testing frame. The images show the test image (top), the predicted view from our convnet overlaid in red on the input image (middle) and the nearest neighbour training image overlaid in red on the input image (bottom). This shows our system can interpolate camera pose effectively in space between training frames; page 2940 section 3.2 Architecture, • Replace all three softmax classifiers with affine regressors. The softmax layers were removed and each final fully connected layer was modified to output a pose vector of 7-dimensions representing position (3) and orientation (4).  Insert another fully connected layer before the final regressor of feature size 2048. This was to form a localization feature vector which may then be explored for generalization. At test time we also normalize the quaternion orientation vector to unit length. We rescaled the input image so that the smallest dimension was 256 pixels before cropping to the 224x224 pixel input to the GoogLeNet convnet. The convnet was trained on random crops (which do not affect the camera pose). At test time we evaluate it with both a single center crop and also densely with 128 uniformly spaced crops of the input image, averaging the resulting pose vectors. With parallel GPU processing, this results in a computational time increase from 5ms to 95ms per image.)
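For illustration, the hard labeling recited in claim 13 (one index cell set to 1, all others set to 0) amounts to a one-hot encoding of the cell that contains the camera position. The grid layout, cell size, and the helper names `pose_to_cell_index` and `hard_label` below are assumptions made for this sketch and do not come from the application or the cited references.

```python
def pose_to_cell_index(x, y, origin=(0.0, 0.0), cell_size=1.0, cols=4, rows=4):
    """Map a 2D camera position to a flat cell index on a rows x cols grid."""
    col = int((x - origin[0]) // cell_size)
    row = int((y - origin[1]) // cell_size)
    if not (0 <= col < cols and 0 <= row < rows):
        raise ValueError("position lies outside the positioning space")
    return row * cols + col

def hard_label(index, num_cells):
    """One-hot label: 1 at the cell containing the pose, 0 elsewhere."""
    label = [0] * num_cells
    label[index] = 1
    return label

index = pose_to_cell_index(2.3, 1.7)
label = hard_label(index, 16)
```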

Consider Claim 14. The combination of Sharp and Kendall teaches:
14. The apparatus of claim 9, wherein a cell constituting the space for positioning is mapped to one class, the prediction result includes a score for each class, and the score is an evaluation score calculated based on a loss function, wherein when performing the operation of obtaining map information, the processor is configured to perform operation by: selecting a class having a highest score among the scores for each class, and obtaining a cell index mapped to the selected class; and obtaining map information on the space for the positioning by using the obtained cell index. (Sharp: column 5 lines 30-51, For example, the model may be stored in GPU texture memory or it may be stored as a linear array of memory locations used to represent a 3D volume. This may be achieved by mapping each voxel (or other 3D image element such as a group of voxels) to a memory array index using a linear pitched memory which provides fast, parallel access to the data stored on the parallel computing unit memory. Each voxel may store a numerical value which may be zero at a surface represented by the model, positive outside objects represented by the model and negative inside objects represented by the model, where the magnitude of the numerical value is related to distance from the closest surface represented by the model. Column 6 lines 1-18, The camera pose engine 318 of the real-time tracker is arranged to compute the camera pose by finding a camera pose which gives a good alignment of a depth map frame with the dense 3D model. It uses an iterative process which may be implemented using one or more graphics processing units, or using multiple CPU cores, in order that the camera pose engine operates in real-time.)

Consider Claim 15. The combination of Sharp and Kendall teaches:
15. The apparatus of claim 9, wherein: the classification network is learned through soft classification-based learning, the soft classification-based learning is performed by converting training data into indexes for applying camera pose classification and performing soft labeling for learning on each index, and the soft labeling is based on a linear interpolation method. (Kendall: page 2939-2940 section 3 Model for Deep Regression of Camera Pose, Figure 3: Magnified view of a sequence of training (green) and testing (blue) cameras for King’s College. We show the predicted camera pose in red for each testing frame. The images show the test image (top), the predicted view from our convnet overlaid in red on the input image (middle) and the nearest neighbour training image overlaid in red on the input image (bottom). This shows our system can interpolate camera pose effectively in space between training frames). page 2940 section 3.2 Architecture, • Replace all three softmax classifiers with affine regressors. The softmax layers were removed and each final fully connected layer was modified to output a pose vector of 7-dimensions representing position (3) and orientation (4).  Insert another fully connected layer before the final regressor of feature size 2048. This was to form a localization feature vector which may then be explored for generalization. At test time we also normalize the quaternion orientation vector to unit length. We rescaled the input image so that the smallest dimension was 256 pixels before cropping to the 224x224 pixel input to the GoogLeNet convnet. The convnet was trained on random crops (which do not affect the camera pose). At test time we evaluate it with both a single center crop and also densely with 128 uniformly spaced crops of the input image, averaging the resulting pose vectors. With parallel GPU processing, this results in a computational time increase from 5ms to 95ms per image.)

Consider Claim 16. The combination of Sharp and Kendall teaches:
16. The apparatus of claim 9, wherein the processor is configured to include: a hard classification layer unit configured to output a hard classification result including a camera pose predicted for the image data and confidence corresponding thereto by performing the operation of obtaining a prediction result by using a classification network learned through hard classification-based learning, the operation of obtaining map information, and the operation of outputting a corresponding camera pose; a soft classification layer unit configured to output a soft classification result including a camera pose predicted for the image data by performing the operation of obtaining a prediction result by using a classification network learned through soft classification-based learning, the operation of obtaining map information, and the operation of outputting a corresponding camera pose; and a fusion processing unit configured to output a final camera pose predicted for the image data by converging the hard classification result and the soft classification result. (Kendall: page 2939-2940 section 3 Model for Deep Regression of Camera Pose, Figure 3: Magnified view of a sequence of training (green) and testing (blue) cameras for King’s College. We show the predicted camera pose in red for each testing frame. The images show the test image (top), the predicted view from our convnet overlaid in red on the input image (middle) and the nearest neighbour training image overlaid in red on the input image (bottom). This shows our system can interpolate camera pose effectively in space between training frames). page 2940 section 3.2 Architecture, • Replace all three softmax classifiers with affine regressors. The softmax layers were removed and each final fully connected layer was modified to output a pose vector of 7-dimensions representing position (3) and orientation (4).  Insert another fully connected layer before the final regressor of feature size 2048. 
This was to form a localization feature vector which may then be explored for generalization. At test time we also normalize the quaternion orientation vector to unit length. We rescaled the input image so that the smallest dimension was 256 pixels before cropping to the 224x224 pixel input to the GoogLeNet convnet. The convnet was trained on random crops (which do not affect the camera pose). At test time we evaluate it with both a single center crop and also densely with 128 uniformly spaced crops of the input image, averaging the resulting pose vectors. With parallel GPU processing, this results in a computational time increase from 5ms to 95ms per image.)

Consider Claim 17. The combination of Sharp and Kendall teaches:
17. The apparatus of claim 9, wherein the processor is configured to include: a hard classification layer unit configured to output a hard classification result including a camera pose predicted for the image data and confidence corresponding thereto by performing the operation of obtaining a prediction result by using a classification network learned through hard classification-based learning, the operation of obtaining map information, and the operation of outputting a corresponding camera pose; a regression layer unit configured to output a regression estimation result including a camera pose predicted for the image data through regression estimation; and a fusion processing unit configured to output a final camera pose predicted for the image data by converging the hard classification result and the regression estimation result. (Kendall: page 2939-2940 section 3 Model for Deep Regression of Camera Pose, Figure 3: Magnified view of a sequence of training (green) and testing (blue) cameras for King’s College. We show the predicted camera pose in red for each testing frame. The images show the test image (top), the predicted view from our convnet overlaid in red on the input image (middle) and the nearest neighbour training image overlaid in red on the input image (bottom). This shows our system can interpolate camera pose effectively in space between training frames). page 2940 section 3.2 Architecture, • Replace all three softmax classifiers with affine regressors. The softmax layers were removed and each final fully connected layer was modified to output a pose vector of 7-dimensions representing position (3) and orientation (4).  Insert another fully connected layer before the final regressor of feature size 2048. This was to form a localization feature vector which may then be explored for generalization. At test time we also normalize the quaternion orientation vector to unit length. 
We rescaled the input image so that the smallest dimension was 256 pixels before cropping to the 224x224 pixel input to the GoogLeNet convnet. The convnet was trained on random crops (which do not affect the camera pose). At test time we evaluate it with both a single center crop and also densely with 128 uniformly spaced crops of the input image, averaging the resulting pose vectors. With parallel GPU processing, this results in a computational time increase from 5ms to 95ms per image.)

Consider Claim 18. The combination of Sharp and Kendall teaches:
18. The apparatus of claim 9, wherein the processor is configured to include: a soft classification layer unit configured to output a soft classification result including a camera pose predicted for the image data by performing the operation of obtaining a prediction result by using a classification network learned through soft classification-based learning, the operation of obtaining map information, and the operation of outputting a corresponding camera pose; a regression layer unit configured to output a regression estimation result including a camera pose predicted for the image data through regression estimation; and a fusion processing unit configured to output a final camera pose predicted for the image data by converging the soft classification result and the regression estimation result. (Kendall: page 2939-2940 section 3 Model for Deep Regression of Camera Pose, Figure 3: Magnified view of a sequence of training (green) and testing (blue) cameras for King’s College. We show the predicted camera pose in red for each testing frame. The images show the test image (top), the predicted view from our convnet overlaid in red on the input image (middle) and the nearest neighbour training image overlaid in red on the input image (bottom). This shows our system can interpolate camera pose effectively in space between training frames). page 2940 section 3.2 Architecture, • Replace all three softmax classifiers with affine regressors. The softmax layers were removed and each final fully connected layer was modified to output a pose vector of 7-dimensions representing position (3) and orientation (4).  Insert another fully connected layer before the final regressor of feature size 2048. This was to form a localization feature vector which may then be explored for generalization. At test time we also normalize the quaternion orientation vector to unit length. 
We rescaled the input image so that the smallest dimension was 256 pixels before cropping to the 224x224 pixel input to the GoogLeNet convnet. The convnet was trained on random crops (which do not affect the camera pose). At test time we evaluate it with both a single center crop and also densely with 128 uniformly spaced crops of the input image, averaging the resulting pose vectors. With parallel GPU processing, this results in a computational time increase from 5ms to 95ms per image.)

Consider Claim 19. The combination of Sharp and Kendall teaches:
19. The apparatus of claim 9, wherein the processor is configured to include: a hard classification layer unit configured to output a hard classification result including a camera pose predicted for the image data and confidence corresponding thereto by performing the operation of obtaining a prediction result by using a classification network learned through hard classification-based learning, the operation of obtaining map information, and the operation of outputting a corresponding camera pose; a soft classification layer unit configured to output a soft classification result including a camera pose predicted for the image data by performing the operation of obtaining a prediction result by using a classification network learned through soft classification-based learning, the operation of obtaining map information, and the operation of outputting a corresponding camera pose; a regression layer unit configured to output a regression estimation result including a camera pose predicted for the image data through regression estimation; and a fusion processing unit configured to output a final camera pose predicted for the image data by converging the soft classification result, the hard classification result, and the regression estimation result. (Kendall: page 2939-2940 section 3 Model for Deep Regression of Camera Pose, Figure 3: Magnified view of a sequence of training (green) and testing (blue) cameras for King’s College. We show the predicted camera pose in red for each testing frame. The images show the test image (top), the predicted view from our convnet overlaid in red on the input image (middle) and the nearest neighbour training image overlaid in red on the input image (bottom). This shows our system can interpolate camera pose effectively in space between training frames). page 2940 section 3.2 Architecture, • Replace all three softmax classifiers with affine regressors. 
The softmax layers were removed and each final fully connected layer was modified to output a pose vector of 7-dimensions representing position (3) and orientation (4).  Insert another fully connected layer before the final regressor of feature size 2048. This was to form a localization feature vector which may then be explored for generalization. At test time we also normalize the quaternion orientation vector to unit length. We rescaled the input image so that the smallest dimension was 256 pixels before cropping to the 224x224 pixel input to the GoogLeNet convnet. The convnet was trained on random crops (which do not affect the camera pose). At test time we evaluate it with both a single center crop and also densely with 128 uniformly spaced crops of the input image, averaging the resulting pose vectors. With parallel GPU processing, this results in a computational time increase from 5ms to 95ms per image.)
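As a hedged illustration of the fusion processing unit recited in claims 16 through 19, one simple way to converge several pose estimates is a weighted average of their positions. The claims do not specify the fusion rule; the weighted average, the function name `fuse_poses`, and the example weights are all assumptions made for this sketch.

```python
def fuse_poses(estimates):
    """Fuse (position, weight) pairs into one position by weighted average.

    estimates: list of ((x, y, z), weight) tuples, e.g. one each from the
    hard classification, soft classification, and regression branches.
    """
    total = sum(weight for _, weight in estimates)
    if total <= 0.0:
        raise ValueError("weights must sum to a positive value")
    return tuple(
        sum(position[i] * weight for position, weight in estimates) / total
        for i in range(3)
    )

final_position = fuse_poses([
    ((1.0, 2.0, 0.0), 0.5),   # hard classification result (assumed weight)
    ((2.0, 2.0, 0.0), 0.25),  # soft classification result (assumed weight)
    ((3.0, 2.0, 0.0), 0.25),  # regression estimation result (assumed weight)
])
```

Orientation is left out of the sketch because averaging rotations requires additional care that the claim language does not address.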



Conclusion
The prior art made of record in form PTO-892 and not relied upon is considered pertinent to applicant's disclosure. 


Any inquiry concerning this communication or earlier communications from the examiner should be directed to TAHMINA ANSARI whose telephone number is 571-270-3379.  The examiner can normally be reached on IFP Flex - Monday through Friday 9 to 5.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, SUMATI LEFKOWITZ, can be reached on 571-272-3638.  The fax phone numbers for the organization where this application or proceeding is assigned are 571-273-8300 for regular communications and 571-273-8300 for After Final communications. TC 2600’s customer service number is 571-272-2600.
Any inquiry of a general nature or relating to the status of this application or proceeding should be directed to the receptionist whose telephone number is 571-272-2600.



2662
/Tahmina Ansari/

June 18, 2022
/TAHMINA N ANSARI/Primary Examiner, Art Unit 2662