DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. Claims 28-54 are pending under this Office action.

Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP §§ 706.02(l)(1) - 706.02(l)(3) for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/process/file/efs/guidance/eTD-info-I.jsp.
Claims 28-50 are rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1-27 of U.S. Patent No. 10,943,120. Although the claims at issue are not identical, they are not patentably distinct from each other because they read on each other; see the following mapping table.

Application No. 17/193,568 (Instant Application)
U.S. Patent No. 10,943,120
28. A system comprising: 

one or more imaging devices; one or more processors; and 

one or more computer storage media storing instructions that, when executed by the one or more processors, cause the one or more processors to perform operations comprising: 

matching a portion of a current image of a real-world environment with a patch stored by the system, the patch being associated with a first salient point being tracked by the system, the first salient point being included in a prior image of the real-world environment, wherein the matching is usable to identify the first salient point in the current image, and wherein the first salient point represents a first feature of the real-world environment; 

accessing respective descriptors for the first salient point and a second salient point identified in the current image, wherein the second salient point represents a second feature of the real-world environment, and wherein the system stores a descriptor-based map of the real-world environment indicating real-world locations associated with the first feature and the second feature; and 

determining a pose associated with the system, the pose being based on the accessed descriptors and the descriptor-based map.
1. A system comprising: 

one or more imaging devices; one or more processors; and 

one or more computer storage media storing instructions that, when executed by the one or more processors, cause the one or more processors to perform operations comprising: 

obtaining, via the one or more imaging devices, a current image of a real-world environment, the current image including a plurality of points for determining pose; 

accessing a first patch associated with a first salient point which is being tracked by the system, the first salient point being included in a prior image of the real-world environment, wherein the first salient point represents a first feature of the real-world environment; 

projecting the first patch onto the current image, wherein the first salient point is matched with a corresponding one of the plurality of points included in the current image, such that a position at which the first feature is represented in the current image is determined to correspond to the one of the plurality of points; 

extracting a second salient point from the current image, such that the second salient point is tracked by the system, the second salient point representing a second feature of the real-world environment; 

providing respective descriptors for the salient points associated with the current image, the salient points comprising the first salient point and the second salient point; 

matching, based on the descriptors, the salient points associated with the current image with real-world locations specified in a descriptor-based map of the real-world environment, such that real-world locations associated with the first feature and the second feature are identified; and 

determining, based on the matching, a pose associated with the system, the pose indicating at least an orientation of the one or more imaging devices in the real-world environment.
29. The system of claim 28, wherein the portion of the current image is matched with the patch stored by the system based on minimizing a cost function.
3. The system of claim 2, wherein locating the second patch comprises minimizing a difference between the first patch in the previous image and the second patch in the current image.
30. The system of claim 28, wherein the portion of the current image is matched with the patch using information from an inertial measurement unit of the system.
4. The system of claim 2, wherein projecting the first patch onto the current image is based, at least in part, on information from an inertial measurement unit of the system.
31. The system of claim 28, wherein the descriptors are generated based on respective image pixels associated with respective locations of the first salient point and the second salient point in the current image.
13. The system of claim 1, wherein providing descriptors comprises generating descriptors for each of the salient points.
32. The system of claim 28, wherein the descriptors are generated based on respective image areas associated with respective locations of the first salient point and the second salient point in the current image.
1. providing respective descriptors for the salient points associated with the current image, the salient points comprising the first salient point and the second salient point;
33. The system of claim 28, wherein the descriptor-based map includes a plurality of descriptors associated with a plurality of features, the features including the first feature and the second feature.
8. The system of claim 1, wherein matching salient points associated with the current image with real-world locations specified in the map of the real-world environment comprises: accessing the descriptor-based map, the descriptor-based map comprising real-world locations of salient points and associated descriptors; and matching descriptors for salient points of the current image with descriptors for salient points at real-world locations.
34. The system of claim 33, wherein the operations further comprise matching a subset of the plurality of descriptors with the accessed descriptors to identify at least a first descriptor and a second descriptor which match with the accessed descriptors.
7. The system of claim 5, wherein the image area comprises a subset of the current image, and wherein the system is configured to adjust a size associated with the subset based on one or more of processing constraints or differences between one or more prior determined poses.
35. The system of claim 34, wherein the descriptor-based map includes three-dimensional coordinates of salient points associated with the plurality of descriptors, and wherein the pose is based on comparing:

(1) the three-dimensional locations associated with the first descriptor and the second descriptor; and

(2) two-dimensional locations of the first salient point and the second salient point in the current image.
8. The system of claim 1, wherein matching salient points associated with the current image with real-world locations specified in the map of the real-world environment comprises: accessing the descriptor-based map, the descriptor-based map comprising real-world locations of salient points and associated descriptors; and matching descriptors for salient points of the current image with descriptors for salient points at real-world locations.
36. The system of claim 28, wherein the second salient point is extracted from the current image, and wherein extracting comprises:

determining that an image area of the current image has less than a threshold number of salient points being tracked by the system; and extracting one or more additional salient points from the image area, the extracted salient points including the second salient point.
5. The system of claim 1, wherein extracting the second salient point comprises: determining that an image area of the current image has less than a threshold number of salient points projected from the previous image; and extracting one or more additional salient points from the image area, the extracted salient points including the second salient point.
37. The system of claim 28, wherein matching the portion of the current image with the patch comprises projecting the patch onto the current image and refining a location associated with the patch.
1. projecting the first patch onto the current image, wherein the first salient point is matched with a corresponding one of the plurality of points included in the current image, such that a position at which the first feature is represented in the current image is determined to correspond to the one of the plurality of points;
38. The system of claim 28, wherein the operations further comprise: projecting salient points included in the descriptor-based map onto the current image, wherein the projection is based on one or more of an inertial measurement unit, an extended kalman filter, or visual-inertial odometry.
9. The system of claim 8, wherein the operations further comprise: projecting salient points provided in the descriptor-based map onto the current image, wherein the projection is based on one or more of an inertial measurement unit, an extended kalman filter, or visual-inertial odometry.
39. A method implemented by a head-mounted system, the method comprising: 

matching a portion of a current image of a real-world environment with a patch stored by the system, the patch being associated with a first salient point being tracked by the system, the first salient point being included in a prior image of the real-world environment, wherein the matching is usable to identify the first salient point in the current image, and wherein the first salient point represents a first feature of the real-world environment; 

accessing respective descriptors for the first salient point and a second salient point identified in the current image, wherein the second salient point represents a second feature of the real-world environment, and wherein the system stores a descriptor-based map of the real-world environment indicating real-world locations associated with the first feature and the second feature; and 

determining a pose associated with the system, the pose being based on the descriptors and the descriptor-based map.
15. A method comprising: 

obtaining, via one or more imaging devices, a current image of a real-world environment, the current image including a plurality of points for determining pose; 

accessing a first patch associated with a first salient point, the first salient point being included in a prior image of the real-world environment, wherein the first salient point represents a first feature of the real-world environment; 

projecting the first patch onto the current image, wherein the first salient point is matched with a corresponding one of the plurality of points included in the current image, such that a position at which the first feature is represented in the current image is determined to correspond to the one of the plurality of points; 

extracting a second salient point from the current image, the second salient point representing a second feature of the real-world environment; providing respective descriptors for the salient points associated with the current image, the salient points comprising the first salient point and the second salient point; 

matching, based on the descriptors, the salient points associated with the current image with real-world locations specified in a descriptor-based map of the real-world environment, such that real-world locations associated with the first feature and the second feature are identified; and 

determining, based on the matching, a pose associated with the system, the pose indicating at least an orientation of the one or more imaging devices in the real-world environment.
40. The method of claim 39, wherein the portion of the current image is matched with the patch stored by the system based on minimizing a cost function.
17. The method of claim 16, wherein locating the second patch comprises determining a patch in the current image with a minimum of differences with the first patch.
41. The method of claim 39, wherein the portion of the current image is matched with the patch using information from an inertial measurement unit of the system.
18. The method of claim 16, wherein projecting the first patch onto the current image is based, at least in part, on information from an inertial measurement unit of the display device.
42. The method of claim 39, wherein the descriptors are generated based on respective image pixels associated with respective locations of the first salient point and second salient point in the current image.
26. The method of claim 15, wherein providing descriptors comprises generating descriptors for each of the salient points.
43. The method of claim 39, wherein the descriptors are generated based on respective image areas associated with respective locations of the first salient point and second salient point in the current image.
15. providing respective descriptors for the salient points associated with the current image, the salient points comprising the first salient point and the second salient point;
44. The method of claim 39, wherein the descriptor-based map includes a plurality of descriptors associated with a plurality of features, the features including the first feature and the second feature.
22. The method of claim 15, wherein matching salient points associated with the current image with real-world locations specified in the map of the real-world environment comprises: accessing the descriptor-based map, the descriptor-based map comprising real-world locations of salient points and associated descriptors; and matching descriptors for salient points of the current image with descriptors for salient points at real-world locations.
45. The method of claim 44, wherein the method further comprises matching a subset of the plurality of descriptors with the accessed descriptors to identify at least a first descriptor and a second descriptor which match with the accessed descriptors.
21. The method of claim 19, wherein the image area comprises a subset of the current image, and wherein the processors are configured to adjust a size associated with the subset based on one or more of processing constraints or differences between one or more prior determined poses.
46. The method of claim 45, wherein the descriptor-based map includes three-dimensional coordinates of salient points associated with the plurality of descriptors, and wherein the pose is based on comparing:

(1) the three-dimensional locations associated with the first descriptor and the second descriptor; and

(2) two-dimensional locations of the first salient point and the second salient point in the current image.
22. The method of claim 15, wherein matching salient points associated with the current image with real-world locations specified in the map of the real-world environment comprises: accessing the descriptor-based map, the descriptor-based map comprising real-world locations of salient points and associated descriptors; and matching descriptors for salient points of the current image with descriptors for salient points at real-world locations.
47. The method of claim 39, wherein the second salient point is extracted from the current image, and wherein extracting comprises:

determining that an image area of the current image has less than a threshold number of salient points being tracked by the system; and extracting one or more additional salient points from the image area, the extracted salient points including the second salient point.
19. The method of claim 15, wherein extracting the second salient point comprises: determining that an image area of the current image has less than a threshold number of salient points projected from the previous image; and extracting one or more additional salient points from the image area, the extracted salient points including the second salient point.
48. The method of claim 39, wherein matching the portion of the current image with the patch comprises projecting the patch onto the current image and refining a location associated with the patch.
15. projecting the first patch onto the current image, wherein the first salient point is matched with a corresponding one of the plurality of points included in the current image, such that a position at which the first feature is represented in the current image is determined to correspond to the one of the plurality of points;
49. The method of claim 39, further comprising:

projecting salient points included in the descriptor-based map onto the current image, wherein the projection is based on one or more of an inertial measurement unit, an extended kalman filter, or visual-inertial odometry.
23. The method of claim 22, further comprising: projecting salient points provided in the descriptor-based map onto the current image, wherein the projection is based on one or more of an inertial measurement unit, an extended kalman filter, or visual-inertial odometry.
50. A head-mounted augmented reality display system comprising:

one or more outwardly-facing imaging devices configured to obtain images of a real-world environment;

one or more processors, the processors configured to:

obtain a current image of the real-world environment;

perform frame-to-frame tracking on the current image, such that patch-based salient points included in a previous image are matched with locations in the current image, each of the salient points representing a respective feature of the real-world environment;

perform map-to-frame tracking on the current image, wherein map-to-frame tracking comprises matching descriptors for the patch-based salient points with map-based descriptors stored in a descriptor-based map of the real-world environment, the map-based descriptors being associated with real-world locations of features of the real-world environment; and

determine a pose associated with the display device, the pose indicating at least an orientation of the one or more outwardly-facing imaging devices in the real-world environment.
14. A head-mounted augmented reality display system comprising: 

one or more outwardly-facing imaging devices configured to obtain images of a real-world environment; 

one or more processors, the processors configured to: 

obtain a current image of the real-world environment; 

perform frame-to-frame tracking on the current image, such that patch-based salient points included in a previous image are projected onto the current image, each salient point representing a respective feature of the real-world environment; 

perform map-to-frame tracking on the current image, wherein map-to-frame tracking comprises: 

obtaining descriptors of the salient points in the current image, the salient points comprising the patch-based salient points, and matching the obtained descriptors of the salient points with descriptors stored in a map database, the stored descriptors corresponding to descriptors of features of the real-world environment, and the map database storing real-world locations associated with the features, such that real-world locations of the salient points are identified; and 

determine a pose associated with the display device, the pose indicating at least an orientation of the one or more imaging devices in the real-world environment.

Claim 28 of the instant application is drawn to a system comprising: one or more imaging devices; one or more processors; and one or more computer storage media storing instructions that, when executed by the one or more processors, cause the one or more processors to perform operations comprising: matching a portion of a current image of a real-world environment with a patch stored by the system, the patch being associated with a first salient point being tracked by the system, the first salient point being included in a prior image of the real-world environment, wherein the matching is usable to identify the first salient point in the current image, and wherein the first salient point represents a first feature of the real-world environment; accessing respective descriptors for the first salient point and a second salient point identified in the current image, wherein the second salient point represents a second feature of the real-world environment, and wherein the system stores a descriptor-based map of the real-world environment indicating real-world locations associated with the first feature and the second feature; and determining a pose associated with the system, the pose being based on the accessed descriptors and the descriptor-based map.
While the exact wording of claim 1 of the ’120 patent may not be the same as that of claim 28 of the instant application, there is no significant difference in scope between claim 28 of the instant application and claim 1 of the ’120 patent. Therefore, claim 28 of the instant application cannot be considered patentably distinct over claim 1 of the ’120 patent.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 28-54 are rejected under 35 U.S.C. 103 as being unpatentable over Zhu et al. (US 20120169887 A1) in view of Davis et al. (US 20150286873 A1), further in view of Hummel et al. (US 20130342671 A1).
Regarding claim 28, Zhu teaches a system (See Zhu: Fig. 1, and [0049], "Referring now to the drawings, in which like numerals refer to like parts throughout the several views. FIG. 1 shows one exemplary configuration 100 according to an embodiment of the invention. The configuration 100 includes a display 112, a camera 103 and a processing unit, also referred to as a head motion capture unit 114. In operation, the camera 103 captures images that are transported to the head motion capture unit 114 for analysis (e.g., to determine the pose of one or more heads in the images). The display 112 is driven by the head motion capture unit 114") comprising: 
one or more imaging devices (See Zhu: Fig. 1, and [0052], "The camera 103 is an image capture device, e.g., a PlayStation Eye webcam, to capture the motion of an end user as an image sequence 104. It should be noted that there is no need to require a depth camera in this embodiment. The image sequence 104 is a collection of image frames capturing the source motion. An exemplary image sequence 113 is captured from a source motion for a head yaw movement. The image sequence 113 shows that a head has a neutral pose in the top of the image while the head rotates to the left, and the head rotates to the right in the two bottom images"); 
one or more processors (See Zhu: Fig. 1, and [0016], "According to another embodiment, the present invention is a device for determining motion of a head. The device comprises an interface, coupled to a camera, to receive a sequence of images from the camera disposed to look at a user, a memory space for storing code, a processor, coupled to the memory space, executing the code to perform operations of: estimating a position and a size of a head of the user in each of the images using a scale-invariant head tracking technique designed to scan the each of the images at all scales; and determining pose of the head from the position and size of the head"); and 
one or more computer storage media storing instructions that, when executed by the one or more processors, cause the one or more processors to perform operations (See Zhu: Fig. 1, and [0016], "According to another embodiment, the present invention is a device for determining motion of a head. The device comprises an interface, coupled to a camera, to receive a sequence of images from the camera disposed to look at a user, a memory space for storing code, a processor, coupled to the memory space, executing the code to perform operations of: estimating a position and a size of a head of the user in each of the images using a scale-invariant head tracking technique designed to scan the each of the images at all scales; and determining pose of the head from the position and size of the head") comprising: 
matching a portion of a current image of a real-world environment with a patch stored by the system (See Zhu: Fig. 4, and [0081], "During the head pose estimation, a set of reference frames are generated in an online manner. Each reference frame includes two pieces of information: head pose and its corresponding SURF features. All of the reference frames are recorded in a central place (e.g., a memory commonly accessible) at 416. The SURF feature matching at 409 between 408 and 416 using the SURF feature description is performed to find the reference frame which has the most number of matched SURF features. The matched SURF feature information is recorded at 410 in which the estimated head pose from the reference frame provides an initial head pose for the model-based head pose estimation at 411"), the patch being associated with a first salient point being tracked by the system, the first salient point being included in a prior image of the real-world environment, wherein the matching is usable to identify the first salient point in the current image, and wherein the first salient point represents a first feature of the real-world environment; 
accessing respective descriptors for the first salient point and a second salient point identified in the current image, wherein the second salient point represents a second feature of the real-world environment, and wherein the system stores a descriptor-based map of the real-world environment indicating real-world locations associated with the first feature and the second feature; and 
determining a pose associated with the system, the pose being based on the accessed descriptors and the descriptor-based map (See Zhu: Fig. 4, and [0085], "To minimize the sum of squared distances between the detected facial features at the frame t and projected model vertices obtained from the ray-model intersection, in one embodiment, the analytical Jacobian in Gaussian-Newton iteration which has a second order convergence rate for the model-based head pose estimation is utilized. The model-based head pose estimation method converges in about 10 iterations where the feature point residual errors are less than 2 pixels"; [0088], “After Kalman filtering, the optimal head pose estimation for head model at image frame t, t-1, and corresponding reference frame is obtained. This is followed by updating the reference frame at 415”; and Fig. 5B, and [0095], “In the application example of Camera position and orientation control 507, it uses an end user's estimated six degrees of freedom head motion, without using any marker and wearing any intrusive device, to control the virtual camera motion through a true one-to-one mapping function. It provides a natural and immersive interface for an end user to control a virtual camera with his or her head movement so that the camera control becomes simple and intuitive”. Note that the position and orientation of the head are used to control the camera position and orientation, which may correspond to the pose indicating the imaging device position and orientation).
However, Zhu fails to explicitly disclose the patch being associated with a first salient point being tracked by the system, the first salient point being included in a prior image of the real-world environment, wherein the matching is usable to identify the first salient point in the current image, and wherein the first salient point represents a first feature of the real-world environment; and accessing respective descriptors for the first salient point and a second salient point identified in the current image, wherein the second salient point represents a second feature of the real-world environment, and wherein the system stores a descriptor-based map of the real-world environment indicating real-world locations associated with the first feature and the second feature.
Davis, however, teaches the patch being associated with a first salient point being tracked by the system, the first salient point being included in a prior image of the real-world environment (See Davis: Fig. 20, and [0456], “FIG. 20 shows the location of the remaining salient points, in both the first and second frames. As can be seen, points near the center of the frame closely coincide. Further away, there is some shifting—some due to slightly different scale between the two image frames (e.g., the user moved the camera closer to the subject), and some due to translation (e.g., the user jittered the camera a bit)”), wherein the matching is usable to identify the first salient point in the current image, and wherein the first salient point represents a first feature of the real-world environment (See Davis: Fig. 1, and [0422], "The smartphone can use this knowledge about reference salient points on the page being viewed in various ways. For example, it can identify which particular part of the page is being imaged, by matching salient points identified by the database with salient points found within the phone's field of view"); and 
accessing respective descriptors for the first salient point and a second salient point identified in the current image, wherein the second salient point represents a second feature of the real-world environment (See Davis: Fig. 1, and [0420], “In one particular embodiment, location of the smartphone relative to the page is not determined by reference to registration components of the watermark signal. Instead, the decoded watermark payload is sent to a remote server (database), which returns information about the page. Unlike application Ser. No. 13/011,618, however, the returned information is not page layout data exported from the publishing software. Instead, the database returns earlier-stored reference data about salient points (features) that are present on the page”), and wherein the system stores a descriptor-based map of the real-world environment indicating real-world locations associated with the first feature and the second feature.
Therefore, it would have been obvious to one of ordinary skill in the art at the time the invention was effectively filed to modify Zhu to have the patch being associated with a first salient point being tracked by the system, the first salient point being included in a prior image of the real-world environment, wherein the matching is usable to identify the first salient point in the current image, and wherein the first salient point represents a first feature of the real-world environment; and accessing respective descriptors for the first salient point and a second salient point identified in the current image, wherein the second salient point represents a second feature of the real-world environment as taught by Davis in order to optimize the use of system resources and enhance user experience (See Davis: Fig. 1, and [0252], “Our work has shown that logical sensors help optimize the use of system resources and enhance user experience. In like fashion, logical sensors that provide additional information about user context and device context enable still more complex operations. Information about when, where, how, and by whom a device is used is desirably included in all decisions involving the middleware. Some implementations employ a formal representation of context, and an artificial intelligence-based inference engine. In such arrangements, sensors and RAs may be conceived as knowledge sources”). Zhu teaches a method and system that may track the head pose using feature matching, while Davis teaches a system and method that may recognize the object based on salient point matching. Therefore, it would have been obvious to one of ordinary skill in the art to modify Zhu by Davis to recognize the object with the salient point matching. The motivation to modify Zhu by Davis is “Use of known technique to improve similar devices (methods, or products) in the same way”.
However, Zhu, as modified by Davis, fails to explicitly disclose wherein the system stores a descriptor-based map of the real-world environment indicating real-world locations associated with the first feature and the second feature.
However, Hummel teaches wherein the system stores a descriptor-based map of the real-world environment indicating real-world locations associated with the first feature and the second feature (See Hummel: Figs. 1-3, and [0144], “In order to perform template matching, various versions of the template can be generated and stored in a data structure that can be rapidly traversed and pruned during the template matching search. In several embodiments, the set of templates that is used to perform template matching is generated through rotation and scaling of a base finger template. In other embodiments, a single template can be utilized and the image in which the search is being conducted can be scaled and/or rotated to normalize the object size within the image. The basic template can be a synthetic shape chosen based upon template matching performance (as opposed to a shape learnt by analysis of images of fingers). By application of appropriate rotation and scaling, the template matching process can limit the impact of variation in size, orientation, and distance of a finger from the camera(s) on the ability of the image processing system to detect the finger”).
Therefore, it would have been obvious to one of ordinary skill in the art at the time the invention was effectively filed to modify Zhu so that the system stores a descriptor-based map of the real-world environment indicating real-world locations associated with the first feature and the second feature as taught by Hummel in order to limit the impact of variation in size, orientation, and distance of a finger from the camera(s) on the ability of the image processing system to detect the finger (See Hummel: Fig. 3, and [0144], “By application of appropriate rotation and scaling, the template matching process can limit the impact of variation in size, orientation, and distance of a finger from the camera(s) on the ability of the image processing system to detect the finger”). Zhu teaches a method and system that may track the head pose using feature matching, while Hummel teaches a system and method that may track human hands using patch-based matching algorithms. Therefore, it would have been obvious to one of ordinary skill in the art to modify Zhu by Hummel to match the features using the patch-based algorithms and store those data in a data structure. The motivation to modify Zhu by Hummel is “Use of known technique to improve similar devices (methods, or products) in the same way”.
Regarding claim 29, Zhu, Davis and Hummel teach all the features with respect to claim 28 as outlined above. Further, Zhu teaches that the system of claim 28, wherein the portion of the current image is matched with the patch stored by the system based on minimizing a cost function (See Zhu: Fig. 4, and [0085], “To minimize the sum of squared distances between the detected facial features at the frame t and projected model vertices obtained from the ray-model intersection, in one embodiment, the analytical Jacobian in Gaussian-Newton iteration which has a second order convergence rate for the model-based head pose estimation is utilized. The model-based head pose estimation method converges in about 10 iterations where the feature point residual errors are less than 2 pixels”).
Regarding claim 30, Zhu, Davis and Hummel teach all the features with respect to claim 28 as outlined above. Further, Davis teaches that the system of claim 28, wherein the portion of the current image is matched with the patch using information from an inertial measurement unit of the system (See Davis: Fig. 11, and [0391], " (Another group of deblurring techniques does not focus on prior information about features of the captured image, but rather concerns technical attributes about the image capture. For example, the earlier-referenced research team at Microsoft equipped cameras with inertial sensors (e.g., accelerometers and gyroscopes) to collect data about camera movement during image exposure. This movement data was then used in estimating a corrective blur kernel. See Joshi et al, “Image Deblurring Using Inertial Measurement Sensors,” SIGGRAPH '10, Vol 29, No 4, July 2010. (A corresponding patent application is also believed to have been filed, prior to SIGGRAPH.) Although detailed in the context of an SLR with add-on hardware sensors, applicant believes the Microsoft method is suitable for use with smartphones (which increasingly are equipped with 3D accelerometers and gyroscopes; c.f. the Apple iPhone 4).)”). 
Regarding claim 31, Zhu, Davis and Hummel teach all the features with respect to claim 28 as outlined above. Further, Zhu teaches that the system of claim 28, wherein the descriptors are generated based on respective image pixels associated with respective locations of the first salient point and the second salient point in the current image (See Zhu: Fig. 2, and [0061], “The scale-invariant head descriptor has several distinguishable features that make a head tracker appropriate for game applications. Firstly, the descriptor is defined as a composite feature vector formed from an 8×8×8 color histogram, where each color channel is quantized into 8 bins, and from a 4×4×8 gradient histogram where the head image is quantized into 4×4 spatial cells and 8 gradient orientations. Both the color histogram and the gradient histogram are scale invariant, and thus the descriptor is optimal for efficient multi-scale searching during tracking. For interactive game applications, such multi-scale searching is a very important feature to successfully track a fast moving head in an image”).
Regarding claim 32, Zhu, Davis and Hummel teach all the features with respect to claim 28 as outlined above. Further, Zhu teaches that the system of claim 28, wherein the descriptors are generated based on respective image areas associated with respective locations of the first salient point and the second salient point in the current image (See Zhu: Fig. 2, and [0059], "If there is a detection of one or more heads from the image at 204, the initialization stage ends. The process 200 goes to 205 to initiate or update a scale-invariant head descriptor for the image, wherein the detection result is expressed at 207. Subsequently, the process 200 starts a local head searching at 211”).
Regarding claim 33, Zhu, Davis and Hummel teach all the features with respect to claim 28 as outlined above. Further, Zhu teaches that the system of claim 28, wherein the descriptor-based map includes a plurality of descriptors associated with a plurality of features, the features including the first feature and the second feature (See Zhu: Figs. 5A-B, and [0090], " Referring now to FIG. 5A, it shows a functional diagram 500 in an application scenario in which a scale-invariance head tracker according to one embodiment is used in conjunction with one or more applications. The scale-invariance head tracker unit 501 generates the tracked head position and size that is stored in a space 502. The tracked head position is mapped to a game application's 2D camera position, and the tracked head size is mapped to a game application's camera depth through camera position control unit 503”).
Regarding claim 34, Zhu, Davis and Hummel teach all the features with respect to claim 33 as outlined above. Further, Zhu teaches that the system of claim 33, wherein the operations further comprise matching a subset of the plurality of descriptors with the accessed descriptors to identify at least a first descriptor and a second descriptor which match with the accessed descriptors (See Zhu: Fig. 2, and [0063], "After the initialization at 202 is finished, the process 200 starts the tracking stage. At each of the tracking stage, a region of interest (ROI) is defined as an area occupied by a detected head from 204. Typically, the ROI is larger than the detected head but much smaller than the original image in size. A local head detector 209 is configured to scan this region of interest at two scales based on the tracked head size from the last time stamp. If any local head is detected as decided in 210, the tracked head position and size are updated at 212. Otherwise, a local head searcher is configured to continue to search a local head inside the region of interest at 211. The optimal head position and scale, which has the maximal similarity, is used to update the head position and size 208. The tracked head position and size information 212 is exported to other computational unit at 212, e.g., a module 303 of FIG. 3 and a module 401 of FIG. 4, which will be described below”).
Regarding claim 35, Zhu, Davis and Hummel teach all the features with respect to claim 34 as outlined above. Further, Zhu teaches that the system of claim 34, wherein the descriptor-based map includes three-dimensional coordinates of salient points associated with the plurality of descriptors, and wherein the pose is based on comparing:
(1) the three-dimensional locations associated with the first descriptor and the second descriptor (See Zhu: Figs. 5A-B, and [0094], "FIG. 5B shows an exemplary functional diagram 510 for an application scenario in which the 3D head pose tracker 505 is configured to generate the 3D head position and orientation 506, and provides immersive experience through authentic 3D camera control and character gaze control with the tracked head poses”); and 
(2) two-dimensional locations of the first salient point and the second salient point in the current image (See Zhu: Figs. 5A-B, and [0090], "Referring now to FIG. 5A, it shows a functional diagram 500 in an application scenario in which a scale-invariance head tracker according to one embodiment is used in conjunction with one or more applications. The scale-invariance head tracker unit 501 generates the tracked head position and size that is stored in a space 502. The tracked head position is mapped to a game application's 2D camera position, and the tracked head size is mapped to a game application's camera depth through camera position control unit 503”).
Regarding claim 36, Zhu, Davis and Hummel teach all the features with respect to claim 28 as outlined above. Further, Zhu and Davis teach that the system of claim 28, wherein the second salient point is extracted from the current image, and wherein extracting comprises:
determining that an image area of the current image has less than a threshold number of salient points being tracked by the system (See Davis: Fig. 1, and [0427], “In one particular arrangement, the database also returns scale and rotation data, related to salient point information provided to the smartphone. For example, the database may return a numeric value useful to indicate which direction is towards the top of the imaged object (i.e., vertical). This value can express, e.g., the angle between vertical, and a line between the first- and last-listed salient points. Similarly, the database may return a numeric value indicating the distance—in inches—between the first- and last-listed salient points, in the scale with which the object (e.g., newspaper) was originally printed. (These simple illustrations are exemplary only, but serve to illustrate the concepts.)”); and 
extracting one or more additional salient points from the image area, the extracted salient points including the second salient point (See Zhu: Fig. 2, and [0062], "Secondly, the color histogram is able to make a head distinct from background, and it is robust toward partial occlusion. A gradient histogram is able to remember the appearance information that is special to the tracked person, which makes it robust toward surrounding distracting heads and temporary occlusion. The similarity between two scale-invariant descriptors is defined as weighted summation of normalized correlations for the color histogram and for the gradient histogram. A composite feature vector formed from the color and gradient histograms, along with the weighted similarity criteria, makes it robust under partial occlusion, temporary occlusion, and surrounding distracting heads. The weights for similarity from the color histogram and similarity from the gradient histogram are determined automatically based on the robustness of the tracker under partial and temporary occlusion”).
Regarding claim 37, Zhu, Davis and Hummel teach all the features with respect to claim 28 as outlined above. Further, Zhu teaches that the system of claim 28, wherein matching the portion of the current image with the patch comprises projecting the patch onto the current image and refining a location associated with the patch (See Zhu: Fig. 4, and [0084], “For matched feature points at 410, the mode vertices are obtained by back-projecting the feature points in the reference frame, and performing ray-model intersection where 3D head model is transformed based on the initial pose estimated at the reference frame”).
Regarding claim 38, Zhu, Davis and Hummel teach all the features with respect to claim 28 as outlined above. Further, Zhu teaches that the system of claim 28, wherein the operations further comprise: projecting salient points included in the descriptor-based map onto the current image (See Zhu: Fig. 4, and [0083], “For the matched feature points stored at 406, the model vertices are obtained by back-projecting the feature points in the image frame at t-1 and performing the ray-model intersection, where the 3D head model is transformed based on the initial pose estimated at frame t-1”), wherein the projection is based on one or more of an inertial measurement unit, an extended kalman filter, or visual-inertial odometry (See Zhu: Fig. 4, and [0075], “FIG. 4 shows a flowchart or process 400 for operations in a head pose tracker unit configured to perform 3D head pose tracking. Depending on implementation, the process 400 may be implemented in software, hardware or in combination of both. According to one embodiment, some of the features, benefits or advantages in the process 400 include: (1) robust feature point tracking with large head movement compensation based on scale-invariant head tracking; (2) model-based head pose estimation with sparse optical flow; (3) model-based head pose estimation with reference frames; (4) head pose estimation result integrated with Kalman filtering”).
Regarding claim 39, Zhu, Davis and Hummel teach all the features with respect to claim 28 as outlined above. Further, Zhu, Davis and Hummel teach a method implemented by a head-mounted system, the method (See Zhu: Fig. 1, and [0049], “Referring now to the drawings, in which like numerals refer to like parts throughout the several views. FIG. 1 shows one exemplary configuration 100 according to an embodiment of the invention. The configuration 100 includes a display 112, a camera 103 and a processing unit, also referred to as a head motion capture unit 114. In operation, the camera 103 captures images that are transported to the head motion capture unit 114 for analysis (e.g., to determine the pose of one or more heads in the images). The display 112 is driven by the head motion capture unit 114”) comprising:
matching a portion of a current image of a real-world environment with a patch stored by the system (See Hummel: Figs. 1-3, and [0144], “In order to perform template matching, various versions of the template can be generated and stored in a data structure that can be rapidly traversed and pruned during the template matching search. In several embodiments, the set of templates that is used to perform template matching is generated through rotation and scaling of a base finger template. In other embodiments, a single template can be utilized and the image in which the search is being conducted can be scaled and/or rotated to normalize the object size within the image. The basic template can be a synthetic shape chosen based upon template matching performance (as opposed to a shape learnt by analysis of images of fingers). By application of appropriate rotation and scaling, the template matching process can limit the impact of variation in size, orientation, and distance of a finger from the camera(s) on the ability of the image processing system to detect the finger”), the patch being associated with a first salient point being tracked by the system, the first salient point being included in a prior image of the real-world environment (See Davis: Fig. 20, and [0456], “FIG. 20 shows the location of the remaining salient points, in both the first and second frames. As can be seen, points near the center of the frame closely coincide. Further away, there is some shifting—some due to slightly different scale between the two image frames (e.g., the user moved the camera closer to the subject), and some due to translation (e.g., the user jittered the camera a bit)”), wherein the matching is usable to identify the first salient point in the current image, and wherein the first salient point represents a first feature of the real-world environment (See Davis: Fig. 1, and [0422], “The smartphone can use this knowledge about reference salient points on the page being viewed in various ways. For example, it can identify which particular part of the page is being imaged, by matching salient points identified by the database with salient points found within the phone's field of view”);
accessing respective descriptors for the first salient point and a second salient point identified in the current image, wherein the second salient point represents a second feature of the real-world environment (See Davis: Fig. 1, and [0420], “In one particular embodiment, location of the smartphone relative to the page is not determined by reference to registration components of the watermark signal. Instead, the decoded watermark payload is sent to a remote server (database), which returns information about the page. Unlike application Ser. No. 13/011,618, however, the returned information is not page layout data exported from the publishing software. Instead, the database returns earlier-stored reference data about salient points (features) that are present on the page”), and wherein the system stores a descriptor-based map of the real-world environment indicating real-world locations associated with the first feature and the second feature (See Hummel: Fig. 1-3, and [0144], "In order to perform template matching, various versions of the template can be generated and stored in a data structure that can be rapidly traversed and pruned during the template matching search. In several embodiments, the set of templates that is used to perform template matching is generated through rotation and scaling of a base finger template. In other embodiments, a single template can be utilized and the image in which the search is being conducted can be scaled and/or rotated to normalize the object size within the image. The basic template can be a synthetic shape chosen based upon template matching performance (as opposed to a shape learnt by analysis of images of fingers). By application of appropriate rotation and scaling, the template matching process can limit the impact of variation in size, orientation, and distance of a finger from the camera(s) on the ability of the image processing system to detect the finger”); and 
determining a pose associated with the system, the pose being based on the descriptors and the descriptor-based map (See Zhu: Fig. 4, and [0085], “To minimize the sum of squared distances between the detected facial features at the frame t and projected model vertices obtained from the ray-model intersection, in one embodiment, the analytical Jacobian in Gaussian-Newton iteration which has a second order convergence rate for the model-based head pose estimation is utilized. The model-based head pose estimation method converges in about 10 iterations where the feature point residual errors are less than 2 pixels”; [0088], “After Kalman filtering, the optimal head pose estimation for head model at image frame t, t-1, and corresponding reference frame is obtained. This is followed by updating the reference frame at 415”; and Fig. 5B, and [0095], “In the application example of Camera position and orientation control 507, it uses an end user's estimated six degrees of freedom head motion, without using any marker and wearing any intrusive device, to control the virtual camera motion through a true one-to-one mapping function. It provides a natural and immersive interface for an end user to control a virtual camera with his or her head movement so that the camera control becomes simple and intuitive”. Note that the position and orientation of the head are used to control the camera position and orientation, which may correspond to the pose indicating the position and orientation of the imaging device).
Regarding claim 40, Zhu, Davis and Hummel teach all the features with respect to claim 39 as outlined above. Further, Zhu teaches that the method of claim 39, wherein the portion of the current image is matched with the patch stored by the system based on minimizing a cost function (See Zhu: Fig. 4, and [0085], “To minimize the sum of squared distances between the detected facial features at the frame t and projected model vertices obtained from the ray-model intersection, in one embodiment, the analytical Jacobian in Gaussian-Newton iteration which has a second order convergence rate for the model-based head pose estimation is utilized. The model-based head pose estimation method converges in about 10 iterations where the feature point residual errors are less than 2 pixels”).
Regarding claim 41, Zhu, Davis and Hummel teach all the features with respect to claim 39 as outlined above. Further, Davis teaches that the method of claim 39, wherein the portion of the current image is matched with the patch using information from an inertial measurement unit of the system (See Davis: Fig. 11, and [0391], " (Another group of deblurring techniques does not focus on prior information about features of the captured image, but rather concerns technical attributes about the image capture. For example, the earlier-referenced research team at Microsoft equipped cameras with inertial sensors (e.g., accelerometers and gyroscopes) to collect data about camera movement during image exposure. This movement data was then used in estimating a corrective blur kernel. See Joshi et al, “Image Deblurring Using Inertial Measurement Sensors,” SIGGRAPH '10, Vol 29, No 4, July 2010. (A corresponding patent application is also believed to have been filed, prior to SIGGRAPH.) Although detailed in the context of an SLR with add-on hardware sensors, applicant believes the Microsoft method is suitable for use with smartphones (which increasingly are equipped with 3D accelerometers and gyroscopes; c.f. the Apple iPhone 4).)”).
Regarding claim 42, Zhu, Davis and Hummel teach all the features with respect to claim 39 as outlined above. Further, Zhu teaches that the method of claim 39, wherein the descriptors are generated based on respective image pixels associated with respective locations of the first salient point and second salient point in the current image (See Zhu: Fig. 2, and [0061], “The scale-invariant head descriptor has several distinguishable features that make a head tracker appropriate for game applications. Firstly, the descriptor is defined as a composite feature vector formed from an 8×8×8 color histogram, where each color channel is quantized into 8 bins, and from a 4×4×8 gradient histogram where the head image is quantized into 4×4 spatial cells and 8 gradient orientations. Both the color histogram and the gradient histogram are scale invariant, and thus the descriptor is optimal for efficient multi-scale searching during tracking. For interactive game applications, such multi-scale searching is a very important feature to successfully track a fast moving head in an image”).
Regarding claim 43, Zhu, Davis and Hummel teach all the features with respect to claim 39 as outlined above. Further, Zhu teaches that the method of claim 39, wherein the descriptors are generated based on respective image areas associated with respective locations of the first salient point and second salient point in the current image (See Zhu: Fig. 2, and [0059], "If there is a detection of one or more heads from the image at 204, the initialization stage ends. The process 200 goes to 205 to initiate or update a scale-invariant head descriptor for the image, wherein the detection result is expressed at 207. Subsequently, the process 200 starts a local head searching at 211”).
Regarding claim 44, Zhu, Davis and Hummel teach all the features with respect to claim 39 as outlined above. Further, Zhu teaches that the method of claim 39, wherein the descriptor-based map includes a plurality of descriptors associated with a plurality of features, the features including the first feature and the second feature (See Zhu: Figs. 5A-B, and [0090], " Referring now to FIG. 5A, it shows a functional diagram 500 in an application scenario in which a scale-invariance head tracker according to one embodiment is used in conjunction with one or more applications. The scale-invariance head tracker unit 501 generates the tracked head position and size that is stored in a space 502. The tracked head position is mapped to a game application's 2D camera position, and the tracked head size is mapped to a game application's camera depth through camera position control unit 503”).
Regarding claim 45, Zhu, Davis and Hummel teach all the features with respect to claim 44 as outlined above. Further, Zhu teaches that the method of claim 44, wherein the method further comprises matching a subset of the plurality of descriptors with the accessed descriptors to identify at least a first descriptor and a second descriptor which match with the accessed descriptors (See Zhu: Fig. 2, and [0063], "After the initialization at 202 is finished, the process 200 starts the tracking stage. At each of the tracking stage, a region of interest (ROI) is defined as an area occupied by a detected head from 204. Typically, the ROI is larger than the detected head but much smaller than the original image in size. A local head detector 209 is configured to scan this region of interest at two scales based on the tracked head size from the last time stamp. If any local head is detected as decided in 210, the tracked head position and size are updated at 212. Otherwise, a local head searcher is configured to continue to search a local head inside the region of interest at 211. The optimal head position and scale, which has the maximal similarity, is used to update the head position and size 208. The tracked head position and size information 212 is exported to other computational unit at 212, e.g., a module 303 of FIG. 3 and a module 401 of FIG. 4, which will be described below”).
Regarding claim 46, Zhu, Davis and Hummel teach all the features with respect to claim 45 as outlined above. Further, Zhu teaches that the method of claim 45, wherein the descriptor-based map includes three-dimensional coordinates of salient points associated with the plurality of descriptors, and wherein the pose is based on comparing:
(1) the three-dimensional locations associated with the first descriptor and the second descriptor (See Zhu: Figs. 5A-B, and [0094], "FIG. 5B shows an exemplary functional diagram 510 for an application scenario in which the 3D head pose tracker 505 is configured to generate the 3D head position and orientation 506, and provides immersive experience through authentic 3D camera control and character gaze control with the tracked head poses”); and 
(2) two-dimensional locations of the first salient point and the second salient point in the current image (See Zhu: Figs. 5A-B, and [0090], "Referring now to FIG. 5A, it shows a functional diagram 500 in an application scenario in which a scale-invariance head tracker according to one embodiment is used in conjunction with one or more applications. The scale-invariance head tracker unit 501 generates the tracked head position and size that is stored in a space 502. The tracked head position is mapped to a game application's 2D camera position, and the tracked head size is mapped to a game application's camera depth through camera position control unit 503”).
Regarding claim 47, Zhu, Davis and Hummel teach all the features with respect to claim 39 as outlined above. Further, Zhu and Davis teach that the method of claim 39, wherein the second salient point is extracted from the current image, and wherein extracting comprises:
determining that an image area of the current image has less than a threshold number of salient points being tracked by the system (See Davis: Fig. 1, and [0427], “In one particular arrangement, the database also returns scale and rotation data, related to salient point information provided to the smartphone. For example, the database may return a numeric value useful to indicate which direction is towards the top of the imaged object (i.e., vertical). This value can express, e.g., the angle between vertical, and a line between the first- and last-listed salient points. Similarly, the database may return a numeric value indicating the distance—in inches—between the first- and last-listed salient points, in the scale with which the object (e.g., newspaper) was originally printed. (These simple illustrations are exemplary only, but serve to illustrate the concepts.)”); and 
extracting one or more additional salient points from the image area, the extracted salient points including the second salient point (See Zhu: Fig. 2, and [0062], "Secondly, the color histogram is able to make a head distinct from background, and it is robust toward partial occlusion. A gradient histogram is able to remember the appearance information that is special to the tracked person, which makes it robust toward surrounding distracting heads and temporary occlusion. The similarity between two scale-invariant descriptors is defined as weighted summation of normalized correlations for the color histogram and for the gradient histogram. A composite feature vector formed from the color and gradient histograms, along with the weighted similarity criteria, makes it robust under partial occlusion, temporary occlusion, and surrounding distracting heads. The weights for similarity from the color histogram and similarity from the gradient histogram are determined automatically based on the robustness of the tracker under partial and temporary occlusion”). 
Regarding claim 48, Zhu, Davis and Hummel teach all the features with respect to claim 39 as outlined above. Further, Zhu teaches that the method of claim 39, wherein matching the portion of the current image with the patch comprises projecting the patch onto the current image and refining a location associated with the patch (See Zhu: Fig. 4, and [0084], "For matched feature points at 410, the mode vertices are obtained by back-projecting the feature points in the reference frame, and performing ray-model intersection where 3D head model is transformed based on the initial pose estimated at the reference frame”).
Regarding claim 49, Zhu, Davis and Hummel teach all the features with respect to claim 39 as outlined above. Further, Zhu teaches that the method of claim 39, further comprising:
projecting salient points included in the descriptor-based map onto the current image (See Zhu: Fig. 4, and [0083], “For the matched feature points stored at 406, the model vertices are obtained by back-projecting the feature points in the image frame at t-1 and performing the ray-model intersection, where the 3D head model is transformed based on the initial pose estimated at frame t-1”), wherein the projection is based on one or more of an inertial measurement unit, an extended kalman filter, or visual-inertial odometry (See Zhu: Fig. 4, and [0075], “FIG. 4 shows a flowchart or process 400 for operations in a head pose tracker unit configured to perform 3D head pose tracking. Depending on implementation, the process 400 may be implemented in software, hardware or in combination of both. According to one embodiment, some of the features, benefits or advantages in the process 400 include: (1) robust feature point tracking with large head movement compensation based on scale-invariant head tracking; (2) model-based head pose estimation with sparse optical flow; (3) model-based head pose estimation with reference frames; (4) head pose estimation result integrated with Kalman filtering”).
Regarding claim 50, Zhu, Davis and Hummel teach all the features with respect to claim 28 as outlined above. Further, Zhu, Davis and Hummel teach that a head-mounted augmented reality display system (See Zhu: Fig. 1, and [0049], "Referring now to the drawings, in which like numerals refer to like parts throughout the several views. FIG. 1 shows one exemplary configuration 100 according to an embodiment of the invention. The configuration 100 includes a display 112, a camera 103 and a processing unit, also referred to as a head motion capture unit 114. In operation, the camera 103 captures images that are transported to the head motion capture unit 114 for analysis (e.g., to determine the pose of one or more heads in the images). The display 112 is driven by the head motion capture unit 114") comprising:
one or more outwardly-facing imaging devices configured to obtain images of a real-world environment (See Zhu: Fig. 1, and [0052], "The camera 103 is an image capture device, e.g., a PlayStation Eye webcam, to capture the motion of an end user as an image sequence 104. It should be noted that there is no need to require a depth camera in this embodiment. The image sequence 104 is a collection of image frames capturing the source motion. An exemplary image sequence 113 is captured from a source motion for a head yaw movement. The image sequence 113 shows that a head has a neutral pose in the top of the image while the head rotates to the left, and the head rotates to the right in the two bottom images");
one or more processors (See Zhu: Fig. 1, and [0016], "According to another embodiment, the present invention is a device for determining motion of a head. The device comprises an interface, coupled to a camera, to receive a sequence of images from the camera disposed to look at a user, a memory space for storing code, a processor, coupled to the memory space, executing the code to perform operations of: estimating a position and a size of a head of the user in each of the images using a scale-invariant head tracking technique designed to scan the each of the images at all scales; and determining pose of the head from the position and size of the head"), the processors configured to:
obtain a current image of the real-world environment (See Zhu: Fig. 1, and [0049], “Referring now to the drawings, in which like numerals refer to like parts throughout the several views. FIG. 1 shows one exemplary configuration 100 according to an embodiment of the invention. The configuration 100 includes a display 112, a camera 103 and a processing unit, also referred to as a head motion capture unit 114. In operation, the camera 103 captures images that are transported to the head motion capture unit 114 for analysis (e.g., to determine the pose of one or more heads in the images). The display 112 is driven by the head motion capture unit 114”);
perform frame-to-frame tracking on the current image, such that patch-based salient points included in a previous image are matched with locations in the current image (See Zhu: Fig. 4, and [0077], “Subsequently, the head tracker defines the region of interest at two successive frames both for optical flow based feature tracking and for SURF based feature tracking with reference frames. The optical flow based feature tracking and head pose estimation start with corner extraction at 402 within the region of interest at a frame t-1 and at the frame t. The resulting feature points are recorded at 403 and 404 respectively”), each of the salient points representing a respective feature of the real-world environment (See Davis: Fig. 20, and [0456], “FIG. 20 shows the location of the remaining salient points, in both the first and second frames. As can be seen, points near the center of the frame closely coincide. Further away, there is some shifting—some due to slightly different scale between the two image frames (e.g., the user moved the camera closer to the subject), and some due to translation (e.g., the user jittered the camera a bit)”), wherein the matching is usable to identify the first salient point in the current image, and wherein the first salient point represents a first feature of the real-world environment (See Davis: Fig. 1, and [0422], "The smartphone can use this knowledge about reference salient points on the page being viewed in various ways. For example, it can identify which particular part of the page is being imaged, by matching salient points identified by the database with salient points found within the phone's field of view");
perform map-to-frame tracking on the current image (See Zhu: Figs. 5A-B, and [0091], “In one embodiment, the camera position control unit 503 may employ a linear mapping function which maps the tracked head position and size to the camera position in its local XY dimension and local camera depth respectively. An end user then chooses the appropriate scaling coefficients in the linear mapping function to exaggerate or diminish the head movement so as to achieve a natural way to control the camera motion for immersive experiences”), wherein map-to-frame tracking comprises matching descriptors for the patch-based salient points with map-based descriptors stored in a descriptor-based map of the real-world environment, the map-based descriptors being associated with real-world locations of features of the real-world environment (See Hummel: Fig. 1-3, and [0144], "In order to perform template matching, various versions of the template can be generated and stored in a data structure that can be rapidly traversed and pruned during the template matching search. In several embodiments, the set of templates that is used to perform template matching is generated through rotation and scaling of a base finger template. In other embodiments, a single template can be utilized and the image in which the search is being conducted can be scaled and/or rotated to normalize the object size within the image. The basic template can be a synthetic shape chosen based upon template matching performance (as opposed to a shape learnt by analysis of images of fingers). By application of appropriate rotation and scaling, the template matching process can limit the impact of variation in size, orientation, and distance of a finger from the camera(s) on the ability of the image processing system to detect the finger”); and
determine a pose associated with the display device, the pose indicating at least an orientation of the one or more outwardly-facing imaging devices in the real-world environment (See Zhu: Fig. 4, and [0085], "To minimize the sum of squared distances between the detected facial features at the frame t and projected model vertices obtained from the ray-model intersection, in one embodiment, the analytical Jacobian in Gaussian-Newton iteration which has a second order convergence rate for the model-based head pose estimation is utilized. The model-based head pose estimation method converges in about 10 iterations where the feature point residual errors are less than 2 pixels"; [0088], “After Kalman filtering, the optimal head pose estimation for head model at image frame t, t-1, and corresponding reference frame is obtained. This is followed by updating the reference frame at 415”; and Fig. 5B, and [0095], “In the application example of Camera position and orientation control 507, it uses an end user's estimated six degrees of freedom head motion, without using any marker and wearing any intrusive device, to control the virtual camera motion through a true one-to-one mapping function. It provides a natural and immersive interface for an end user to control a virtual camera with his or her head movement so that the camera control becomes simple and intuitive”. Note that the position and orientation of the head are used to control the camera position and orientation, which may correspond to the pose indicating the position and orientation of the imaging device).
Regarding claim 51, Zhu, Davis and Hummel teach all the features with respect to claim 50 as outlined above. Further, Zhu teaches that the head-mounted augmented reality display system of claim 50, wherein matching patch-based salient points with locations in the current image is based on minimizing a cost function (See Zhu: Fig. 4, and [0085], "To minimize the sum of squared distances between the detected facial features at the frame t and projected model vertices obtained from the ray-model intersection, in one embodiment, the analytical Jacobian in Gaussian-Newton iteration which has a second order convergence rate for the model-based head pose estimation is utilized. The model-based head pose estimation method converges in about 10 iterations where the feature point residual errors are less than 2 pixels”).
Regarding claim 52, Zhu, Davis and Hummel teach all the features with respect to claim 50 as outlined above. Further, Zhu teaches that the head-mounted augmented reality display system of claim 50, wherein a subset of the map-based descriptors are determined to match with the descriptors for the patch-based salient points (See Zhu: Fig. 2, and [0063], "After the initialization at 202 is finished, the process 200 starts the tracking stage. At each of the tracking stage, a region of interest (ROI) is defined as an area occupied by a detected head from 204. Typically, the ROI is larger than the detected head but much smaller than the original image in size. A local head detector 209 is configured to scan this region of interest at two scales based on the tracked head size from the last time stamp. If any local head is detected as decided in 210, the tracked head position and size are updated at 212. Otherwise, a local head searcher is configured to continue to search a local head inside the region of interest at 211. The optimal head position and scale, which has the maximal similarity, is used to update the head position and size 208. The tracked head position and size information 212 is exported to other computational unit at 212, e.g., a module 303 of FIG. 3 and a module 401 of FIG. 4, which will be described below”).
Regarding claim 53, Zhu, Davis and Hummel teach all the features with respect to claim 52 as outlined above. Further, Zhu teaches that the head-mounted augmented reality display system of claim 52, wherein the pose is based on comparing (1) real-world locations associated with the subset of the map-based descriptors (See Zhu: Figs. 5A-B, and [0094], "FIG. 5B shows an exemplary functional diagram 510 for an application scenario in which the 3D head pose tracker 505 is configured to generate the 3D head position and orientation 506, and provides immersive experience through authentic 3D camera control and character gaze control with the tracked head poses”) and (2) two-dimensional locations of the patch-based salient points in the current image (See Zhu: Figs. 5A-B, and [0090], "Referring now to FIG. 5A, it shows a functional diagram 500 in an application scenario in which a scale-invariance head tracker according to one embodiment is used in conjunction with one or more applications. The scale-invariance head tracker unit 501 generates the tracked head position and size that is stored in a space 502. The tracked head position is mapped to a game application's 2D camera position, and the tracked head size is mapped to a game application's camera depth through camera position control unit 503”).
Regarding claim 54, Zhu, Davis and Hummel teach all the features with respect to claim 52 as outlined above. Further, Davis teaches that the head-mounted augmented reality display system of claim 52, further comprising an inertial measurement unit (IMU), wherein the processors are configured to use the IMU to perform frame-to-frame tracking (See Davis: Fig. 11, and [0391], “(Another group of deblurring techniques does not focus on prior information about features of the captured image, but rather concerns technical attributes about the image capture. For example, the earlier-referenced research team at Microsoft equipped cameras with inertial sensors (e.g., accelerometers and gyroscopes) to collect data about camera movement during image exposure. This movement data was then used in estimating a corrective blur kernel. See Joshi et al, “Image Deblurring Using Inertial Measurement Sensors,” SIGGRAPH '10, Vol 29, No 4, July 2010. (A corresponding patent application is also believed to have been filed, prior to SIGGRAPH.) Although detailed in the context of an SLR with add-on hardware sensors, applicant believes the Microsoft method is suitable for use with smartphones (which increasingly are equipped with 3D accelerometers and gyroscopes; c.f. the Apple iPhone 4).)”).

Conclusion


Any inquiry concerning this communication or earlier communications from the examiner should be directed to GORDON G LIU whose telephone number is (571)270-0382. The examiner can normally be reached Monday - Friday 8:00-5:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jennifer Mehmood, can be reached at 571-272-2976. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/GORDON G LIU/
Primary Examiner, Art Unit 2612