DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1-17 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, for lack of antecedent basis.

In regards to independent claim 1, the limitation recites “the lowest” in line 21, in which no previous instance of “a lowest” has been provided, and thus there is insufficient antecedent basis for this limitation in the claim.
In regards to independent claim 10, the limitation recites “the highest” in line 25, in which no previous instance of “a highest” has been provided, and thus there is insufficient antecedent basis for this limitation in the claim.
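In regards to dependent claims 2-9 and 11-17, these claims are rejected based on their dependency from rejected independent claims 1 and 10, respectively.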
Claim 7 is rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor, or for pre-AIA the applicant, regards as the invention.
In regards to dependent claim 7, the limitation recites “blends with the second texture to a higher degree” and “performing the adjustment in step (f) to a greater degree” in lines 4 and 6-7. The language is unclear and subjective as to what would be considered a higher or greater degree when determining the level of blend between the background and the second texture, and as to how the parameters are adjusted to achieve a higher degree for the different images; thus the limitation has been rendered indefinite. The Examiner suggests providing a clearer scope as to what exactly would be deemed greater or higher for these specific limitations, such as a range of values, a formula, or further steps that clearly define what qualifies as a higher or greater degree.

Allowable Subject Matter
Claim 7 would be objected to once the 35 U.S.C. 112(b) rejections listed above are resolved. The following is a statement of reasons for the indication of allowable subject matter:
In regards to dependent claim 7, none of the cited prior art, alone or in combination, teaches or provides motivation for “wherein the method further comprises: (i) determining that the first background image blends with the second texture to a higher degree than the second background image; and (j) performing the adjustment in step (f) to a greater degree for the first and second 2D images than for the third and fourth 2D images”. The references only teach the use of object detection algorithms via machine learning in which a background can be changed from a first image to a second image. The references do not explicitly detail steps for determining the degree to which the blending is performed, or for further adjusting the algorithm parameters for subsequent images that depend on an initially generated set, in conjunction with the limitations of claim 1, from which claim 7 depends.
In addition, there is no teaching, suggestion, or motivation found in the current references and none that can be inferred from the examiner’s own knowledge with respect to the current limitation.


Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.


Claims 1, 2, 4, 8-11 and 14-18 are rejected under 35 U.S.C. 103 as being unpatentable over Kehl (2017 “SSD-6D: Making RGB-Based 3D Detection and 6D Pose Estimation Great Again”, hereinafter referenced “Kehl”) in view of Szeto (US 2018/0137366 A1, hereinafter referenced “Szeto”).

In regards to claim 1. Kehl discloses a non-transitory computer readable medium that embodies instructions that cause one or more processors to perform a method (Kehl, Abstract) comprising: 
-(a) selecting a 3D model corresponding to an object (Kehl, Fig. 3 and “3.2 Training Stage” section page 1532; Reference shows an object in the center having multiple 2D projections captured of it); 
-(b) generating a first 2D image of the 3D model in a first pose (Kehl, Fig. 3 and “3.2 Training Stage” section page 1532; Reference discloses determining transformations regarding closest sampled discrete viewpoint and in-plane rotation as well as set its four corner values to the tightest fit around the mask as a regression target. We show some training images in Figure 2. ); 
-(c) generating a second 2D image of the 3D model in the first pose, the second 2D image having a different texture on the 3D model than the first 2D image (Kehl, Fig. 3 and “3.2 Training Stage” section page 1532; Reference discloses determining transformations regarding closest sampled discrete viewpoint and in-plane rotation as well as set its four corner values to the tightest fit around the mask as a regression target. We show some training images in Figure 2. ); 
-(d) determining, using an algorithm, a first location of a first feature on the 3D model in the first 2D image and a second location of a second feature on the 3D model in the second 2D image (Kehl, “3.1 Network architecture” section page 1532; Reference discloses specifically, each of these six feature maps is convolved with prediction kernels that are supposed to regress localized detections from feature map positions. Let (ws, hs, cs) be the width, height and channel depth at scale s. For each scale, we train a 3×3×cs kernel that provides for each feature map location the scores for object ID, discrete viewpoint and in-plane rotation. The feature map having the different locations in relation to the detected or captured object interpreted as determining, using an algorithm, a first location of a first feature on the 3D model in the first 2D image and a second location of a second feature on the 3D model in the second 2D image); 
-(e) calculating a difference based on the first location and the second location (Kehl, “3.1 Network architecture” section page 1532; Reference discloses specifically, each of these six feature maps is convolved with prediction kernels that are supposed to regress localized detections from feature map positions. Let (ws, hs, cs) be the width, height and channel depth at scale s. For each scale, we train a 3×3×cs kernel that provides for each feature map location the scores for object ID, discrete viewpoint and in-plane rotation. Since we introduce a discretization error by this grid, we create Bs bounding boxes at each location with different aspect ratios. Creating bounding boxes at different aspect ratios for the feature map locations interpreted as the calculating a difference based on first and second locations); 



Kehl does not explicitly disclose but Szeto teaches
-(f) adjusting parameters representing the algorithm based on the calculated difference (Szeto, paragraph [0041]; Reference discloses the association unit 113 optimizes a pose represented by a rigid body conversion matrix included in view parameters on the basis of the view and the depth map so that re-projection errors are minimized on a virtual plane (in this case, a plane corresponding to an imaging surface of the imaging section 40) on the basis of 3D model points obtained by inversely converting the 2D model points, and image points corresponding to the 2D model points. Optimization, that is, refinement of the pose is performed through iterative computations using, for example, the Gauss-Newton method. Paragraph [0042] discloses the pose tracking process according to the present embodiment is based on tracking of features (feature points) on the real object OB1 appearing in a captured image acquired by the imaging section 40. Tracking pose features from different image frames and optimization of the pose to reduce errors interpreted as parameter adjustment for the algorithm based on calculated difference ); 
-(g) iterating steps (d) to (f) at least twice (Szeto, paragraph [0041]; Reference discloses optimization, that is, refinement of the pose is performed through iterative computations using, for example, the Gauss-Newton method. If the pose is optimized (refined), the image contour and the contour of the 2D model are aligned with each other on the display section 20 with higher accuracy. Reference discloses use of iterative method for optimization and puts no limit on the number of iterations thus encompassing the “at least twice” concept); 
-and (h) storing, in a memory, the parameters representing the algorithm, the parameters causing the difference in (e) to be the lowest among the iteration or lower than or equal to a threshold (Szeto, paragraphs [0030] and [0031]; Reference at paragraph [0030] discloses the storage unit 120 includes a 3D model storage portion 121, a created data storage portion 122, and a captured image database 123 (captured image DB 123). Paragraph [0031] discloses as details of data stored in the created data storage portion 122 will be described later, the created data storage portion 122 stores association data in which 2D model data corresponding to a predetermined view of a 3D model, appearance data of the real object OB1 imaged by the imaging section 40, and the predetermined view are associated with each other (i.e. algorithm parameters). Paragraph [0007] previously discloses generating a template using a 3D model corresponding to the real object and appearance information of the real object in the case where the number of the feature elements is equal to or greater than a threshold value (i.e. causing the difference to be equal to a threshold)).
Kehl and Szeto are combinable because they are in the same field of endeavor regarding training for object detection. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention for the 3D detection and 6D pose estimation features of Kehl to include the object detection algorithm features of Szeto in order to provide the user with a system that allows for use of a deep network for object detection that can accurately deal with 3D models and 6D pose estimation by assuming an RGB image as unique input at test time as taught by Kehl while incorporating the object detection algorithm features of Szeto in order to incorporate tracking functions for deriving the pose of the real object included in image frames in the video sequence in forward and/or backward directions and determining appearance information obtained from the corresponding image frame so that the appearance information and the data of the 2D model are associated with corresponding tracked or derived poses increasing accuracy of object detection applicable to improving object detection systems such as those taught in Kehl.
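
For illustration only, the sequence of steps (d) through (h) as recited in claim 1 can be sketched as a small training loop. The sketch below is not taken from Kehl, Szeto, or the application: the one-dimensional "images", the two-parameter linear predictor, the learning rate, and all function names are hypothetical assumptions chosen to keep the example short. It assumes the two renderings share a pose, so the two predicted feature locations are expected to agree.

```python
# Illustrative sketch only. The predictor, learning rate, and names are
# hypothetical and do not come from Kehl, Szeto, or the application.

def predict_location(params, image_value):
    """Step (d): a toy 'algorithm' mapping an image value to a feature location."""
    scale, offset = params
    return scale * image_value + offset

def train(first_image, second_image, params=(1.0, 0.0), lr=0.1, iterations=5):
    """Run steps (d)-(g) and return what step (h) would store."""
    best_params, best_diff = params, float("inf")
    for _ in range(iterations):                       # step (g): iterate at least twice
        loc1 = predict_location(params, first_image)  # step (d): first location
        loc2 = predict_location(params, second_image) # step (d): second location
        diff = (loc1 - loc2) ** 2                     # step (e): squared difference
        if diff < best_diff:                          # remember the lowest difference
            best_params, best_diff = params, diff
        # Step (f): gradient-descent update of the scale parameter.
        # d(diff)/d(scale) = 2 * (loc1 - loc2) * (first_image - second_image)
        scale, offset = params
        grad_scale = 2.0 * (loc1 - loc2) * (first_image - second_image)
        params = (scale - lr * grad_scale, offset)
    return best_params, best_diff                     # step (h): values to store

if __name__ == "__main__":
    stored_params, stored_diff = train(1.0, 2.0)
    print(stored_params, stored_diff)
```

The returned pair corresponds to the parameters giving the lowest difference among the iterations; a threshold-based stop, the alternative recited in step (h), would replace the comparison against the running minimum.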

In regards to claim 2. Kehl in view of Szeto teach the non-transitory computer readable medium according to claim 1.
Kehl further discloses
-wherein in the second 2D image, the 3D model is textureless (Kehl, “4.1 Single Object Scenario” section, page 1535; Reference discloses 2D detection and 6D hypothesis refinement processes in which the system “performance for objects of smaller scale such as ’ape’, ’duck’ and ’cat’ is worse and we observed a drop both in recall and precision….The lower precision, on the other hand, stems from the fact that these objects are textureless and of uniform color which increases confusion with the heavy scene clutter.” Thus image pose capture for a textureless object provides a textureless 3D model).

In regards to claim 4. Kehl in view of Szeto teach the non-transitory computer readable medium according to claim 1.
Kehl further discloses
-wherein classification information for the first 2D image and second 2D image is included in the algorithm (Kehl, “3.1 Network Architecture” section page 1532; Reference discloses the choice of viewpoint classification over pose regression is deliberate…early experimentation showed clearly that the classification approach is more reliable for the task of detecting poses…The decomposition of a 6D pose in viewpoint and in-plane rotation is elegant and allows us to tackle the problem more naturally…simultaneous scoring of all views allows us to parse multiple detections at a given image location, e.g. by accepting all viewpoints above a certain threshold. Interpreted as the multiple captured image views having viewpoint scoring included thus being classification info for first second etc. images included in the overall algorithm process).

In regards to claim 8. Kehl in view of Szeto teach the non-transitory computer readable medium according to claim 1.
Kehl further discloses
-wherein in steps (d) and (e), multiple keypoints as the feature are used, and the calculated difference is based on an aggregation of the multiple keypoints (Kehl, “3.1 Network architecture” section page 1532; Reference discloses specifically, each of these six feature maps is convolved with prediction kernels that are supposed to regress localized detections from feature map positions. Let (ws, hs, cs) be the width, height and channel depth at scale s. For each scale, we train a 3×3×cs kernel that provides for each feature map location the scores for object ID, discrete viewpoint and in-plane rotation. Since we introduce a discretization error by this grid, we create Bs bounding boxes at each location with different aspect ratios. Creating bounding boxes at different aspect ratios for the feature map locations interpreted as the calculating a difference based on first and second locations. Fig. 3 illustrates each point representing the different viewpoints to be used as the formula 1 relates to the calculation used for training the data set inclusive of summation of the bounding boxes in relation to the viewpoint or multiple keypoints).
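
For illustration only, a difference "based on an aggregation of the multiple keypoints" could take the form of a mean squared distance over corresponding keypoint locations. This is a hedged sketch under stated assumptions: the function name and the choice of mean squared Euclidean distance are hypothetical and are not drawn from Kehl, Szeto, or the application, which may aggregate differently.

```python
# Illustrative sketch only. Mean squared distance is an assumed aggregation,
# not one taken from the cited references or the application.

def aggregated_keypoint_difference(first_locations, second_locations):
    """Average the squared 2D distance over corresponding keypoints."""
    if not first_locations or len(first_locations) != len(second_locations):
        raise ValueError("keypoint lists must be non-empty and of equal length")
    total = 0.0
    for (x1, y1), (x2, y2) in zip(first_locations, second_locations):
        total += (x1 - x2) ** 2 + (y1 - y2) ** 2  # squared distance for one keypoint
    return total / len(first_locations)           # aggregate over all keypoints

if __name__ == "__main__":
    first = [(0.0, 0.0), (1.0, 1.0)]
    second = [(3.0, 4.0), (1.0, 1.0)]
    print(aggregated_keypoint_difference(first, second))
```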

In regards to claim 9. Kehl in view of Szeto teach the non-transitory computer readable medium according to claim 1.
Kehl further discloses
-wherein the calculation in step (e) is a loss function (Kehl, “3.2 Training Stage” section, page 1533; Reference discloses the formula 1 used for the training stage which incorporates discrete views and in plane rotations).

In regards to claim 10. Kehl discloses a non-transitory computer readable medium that embodies instructions that cause one or more processors to perform a method (Kehl, Abstract) comprising: 
-(a) selecting a 3D model corresponding to an object (Kehl, Fig. 3 and “3.2 Training Stage” section page 1532; Reference shows an object in the center having multiple 2D projections captured of it); 
-(b) generating a first 2D image of the 3D model superimposed on a first background image in a first pose, with a centroid of the 3D model being located at a first position relative to the first background image (Kehl, Figs. 2 and 3 and “3.2 Training Stage” section page 1532; Reference discloses determining transformations regarding closest sampled discrete viewpoint and in-plane rotation as well as set its four corner values to the tightest fit around the mask as a regression target. We show some training images in Figure 2 which show the 3D models superimposed on background images. Page 1533 discloses in the figure 4 description determining the 2D object centroid with respect to a given distance as the objects are shown in fig. 2); 
-(c) generating a second 2D image of the 3D model superimposed on the first background image, with the centroid of the 3D model being located at the first position relative to the first background image, the 3D model having a second pose in the second 2D image different from the first pose (Kehl, Figs. 2 and 3 and “3.2 Training Stage” section page 1532; Reference discloses determining transformations regarding closest sampled discrete viewpoint and in-plane rotation as well as set its four corner values to the tightest fit around the mask as a regression target. We show some training images in Figure 2 which show the 3D models superimposed on background images. Page 1533 discloses in the figure 4 description determining the 2D object centroid with respect to a given distance as the objects are shown in fig. 2); 
-(d) determining, using an algorithm, a first location of a first feature on the 3D model in the first 2D image and a second location of a second feature on the 3D model in the second 2D image (Kehl, “3.1 Network architecture” section page 1532; Reference discloses specifically, each of these six feature maps is convolved with prediction kernels that are supposed to regress localized detections from feature map positions. Let (ws, hs, cs) be the width, height and channel depth at scale s. For each scale, we train a 3×3×cs kernel that provides for each feature map location the scores for object ID, discrete viewpoint and in-plane rotation. The feature map having the different locations in relation to the detected or captured object interpreted as determining, using an algorithm, a first location of a first feature on the 3D model in the first 2D image and a second location of a second feature on the 3D model in the second 2D image); 
-(e) calculating a difference based on the first location and the second location (Kehl, “3.1 Network architecture” section page 1532; Reference discloses specifically, each of these six feature maps is convolved with prediction kernels that are supposed to regress localized detections from feature map positions. Let (ws, hs, cs) be the width, height and channel depth at scale s. For each scale, we train a 3×3×cs kernel that provides for each feature map location the scores for object ID, discrete viewpoint and in-plane rotation. Since we introduce a discretization error by this grid, we create Bs bounding boxes at each location with different aspect ratios. Creating bounding boxes at different aspect ratios for the feature map locations interpreted as the calculating a difference based on first and second locations);  



Kehl does not explicitly disclose but Szeto teaches
-(f) adjusting parameters representing the algorithm based on the calculated difference (Szeto, paragraph [0041]; Reference discloses the association unit 113 optimizes a pose represented by a rigid body conversion matrix included in view parameters on the basis of the view and the depth map so that re-projection errors are minimized on a virtual plane (in this case, a plane corresponding to an imaging surface of the imaging section 40) on the basis of 3D model points obtained by inversely converting the 2D model points, and image points corresponding to the 2D model points. Optimization, that is, refinement of the pose is performed through iterative computations using, for example, the Gauss-Newton method. Paragraph [0042] discloses the pose tracking process according to the present embodiment is based on tracking of features (feature points) on the real object OB1 appearing in a captured image acquired by the imaging section 40. Tracking pose features from different image frames and optimization of the pose to reduce errors interpreted as parameter adjustment for the algorithm based on calculated difference ); 
-(g) iterating steps (d) to (f) at least twice (Szeto, paragraph [0041]; Reference discloses optimization, that is, refinement of the pose is performed through iterative computations using, for example, the Gauss-Newton method. If the pose is optimized (refined), the image contour and the contour of the 2D model are aligned with each other on the display section 20 with higher accuracy. Reference discloses use of iterative method for optimization and puts no limit on the number of iterations thus encompassing the “at least twice” concept);
-and (h) storing, in a memory, parameters representing the algorithm, the parameters causing the difference in (e) to be the highest among the iteration or higher than or equal to a threshold (Szeto, paragraphs [0030] and [0031]; Reference at paragraph [0030] discloses the storage unit 120 includes a 3D model storage portion 121, a created data storage portion 122, and a captured image database 123 (captured image DB 123). Paragraph [0031] discloses as details of data stored in the created data storage portion 122 will be described later, the created data storage portion 122 stores association data in which 2D model data corresponding to a predetermined view of a 3D model, appearance data of the real object OB1 imaged by the imaging section 40, and the predetermined view are associated with each other (i.e. algorithm parameters). Paragraph [0007] previously discloses generating a template using a 3D model corresponding to the real object and appearance information of the real object in the case where the number of the feature elements is equal to or greater than a threshold value (i.e. causing the difference to be equal to a threshold)).
Kehl and Szeto are combinable because they are in the same field of endeavor regarding training for object detection. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention for the 3D detection and 6D pose estimation features of Kehl to include the object detection algorithm features of Szeto in order to provide the user with a system that allows for use of a deep network for object detection that can accurately deal with 3D models and 6D pose estimation by assuming an RGB image as unique input at test time as taught by Kehl while incorporating the object detection algorithm features of Szeto in order to incorporate tracking functions for deriving the pose of the real object included in image frames in the video sequence in forward and/or backward directions and determining appearance information obtained from the corresponding image frame so that the appearance information and the data of the 2D model are associated with corresponding tracked or derived poses increasing accuracy of object detection applicable to improving object detection systems such as those taught in Kehl.
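
For illustration only, the iterative Gauss-Newton refinement that Szeto, paragraph [0041], is cited for can be sketched on a deliberately reduced problem. The sketch assumes a single unknown, an in-plane rotation angle, in place of the rigid body conversion matrix described in the reference, and it minimizes the squared re-projection error between rotated 2D model points and observed image points. The function name, the point sets, and the one-parameter simplification are hypothetical.

```python
import math

# Illustrative sketch only. A single rotation angle stands in for the full
# pose; this simplification is an assumption and is not Szeto's formulation.

def gauss_newton_rotation(model_points, image_points, theta=0.0, iterations=10):
    """Refine theta so that rotated model points align with image points."""
    for _ in range(iterations):
        c, s = math.cos(theta), math.sin(theta)
        jtj = 0.0  # J^T J (a scalar, since there is one parameter)
        jtr = 0.0  # J^T r
        for (px, py), (qx, qy) in zip(model_points, image_points):
            # Residual: rotated model point minus observed image point.
            rx = c * px - s * py - qx
            ry = s * px + c * py - qy
            # Jacobian of the rotated point with respect to theta.
            jx = -s * px - c * py
            jy = c * px - s * py
            jtj += jx * jx + jy * jy
            jtr += jx * rx + jy * ry
        if jtj == 0.0:
            break  # degenerate input: no information about the angle
        theta -= jtr / jtj  # Gauss-Newton update step
    return theta

if __name__ == "__main__":
    true_theta = 0.5
    model = [(1.0, 0.0), (0.0, 2.0), (-1.0, 1.0)]
    image = [(math.cos(true_theta) * x - math.sin(true_theta) * y,
              math.sin(true_theta) * x + math.cos(true_theta) * y)
             for x, y in model]
    print(gauss_newton_rotation(model, image))
```

Each pass through the loop corresponds to one iteration of the refinement; the number of passes is a free parameter here, consistent with the observation that the reference places no limit on the iteration count.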

In regards to claim 11. Kehl in view of Szeto teach the non-transitory computer readable medium according to claim 10.
Kehl further discloses
-wherein the 3D model is textureless in the first 2D image and the second 2D image (Kehl, “4.1 Single Object Scenario” section, page 1535; Reference discloses 2D detection and 6D hypothesis refinement processes in which the system “performance for objects of smaller scale such as ’ape’, ’duck’ and ’cat’ is worse and we observed a drop both in recall and precision….The lower precision, on the other hand, stems from the fact that these objects are textureless and of uniform color which increases confusion with the heavy scene clutter.” Thus image pose capture for a textureless object provides a textureless 3D model).

In regards to claim 14. Kehl in view of Szeto teach the non-transitory computer readable medium according to claim 10.
Kehl further discloses
-wherein classification information for the first 2D image and second 2D image is included in the algorithm (Kehl, “3.1 Network Architecture” section page 1532; Reference discloses the choice of viewpoint classification over pose regression is deliberate…early experimentation showed clearly that the classification approach is more reliable for the task of detecting poses…The decomposition of a 6D pose in viewpoint and in-plane rotation is elegant and allows us to tackle the problem more naturally…simultaneous scoring of all views allows us to parse multiple detections at a given image location, e.g. by accepting all viewpoints above a certain threshold. Interpreted as the multiple captured image views having viewpoint scoring included thus being classification info for first second etc. images included in the overall algorithm process).

In regards to claim 15. Kehl in view of Szeto teach the non-transitory computer readable medium according to claim 10.
Kehl does not disclose but Szeto teaches
-wherein the second pose is the first pose flipped 180° in one direction (Szeto, paragraphs [0027] and [0033]; Reference at paragraph [0027] discloses the real object OB1 is disposed on the specific axis, and thus the imaging section 40 can image the real object OB1 while being rotated by 360 degrees. Paragraph [0033] discloses in a case where the real object OB1 is imaged by the imaging section 40, the association unit 113 associates a contour of the imaged real object OB1 with the contour of the 2D model at a timing of receiving a predetermined command from a user, so as to estimate a pose of the imaged real object OB1. Thus this process of capturing the object in 360 degrees and estimating pose provides multiple poses with a flipped direction).
Kehl and Szeto are combinable because they are in the same field of endeavor regarding training for object detection. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention for the 3D detection and 6D pose estimation features of Kehl to include the object detection algorithm features of Szeto in order to provide the user with a system that allows for use of a deep network for object detection that can accurately deal with 3D models and 6D pose estimation by assuming an RGB image as unique input at test time as taught by Kehl while incorporating the object detection algorithm features of Szeto in order to incorporate tracking functions for deriving the pose of the real object included in image frames in the video sequence in forward and/or backward directions and determining appearance information obtained from the corresponding image frame so that the appearance information and the data of the 2D model are associated with corresponding tracked or derived poses increasing accuracy of object detection applicable to improving object detection systems such as those taught in Kehl.

In regards to claim 16. Kehl in view of Szeto teach the non-transitory computer readable medium according to claim 10.
Kehl further discloses
-wherein in steps (d) and (e), multiple keypoints as the feature are used, and the calculated difference is based on an aggregation of the multiple keypoints (Kehl, “3.1 Network architecture” section page 1532; Reference discloses specifically, each of these six feature maps is convolved with prediction kernels that are supposed to regress localized detections from feature map positions. Let (ws, hs, cs) be the width, height and channel depth at scale s. For each scale, we train a 3×3×cs kernel that provides for each feature map location the scores for object ID, discrete viewpoint and in-plane rotation. Since we introduce a discretization error by this grid, we create Bs bounding boxes at each location with different aspect ratios. Creating bounding boxes at different aspect ratios for the feature map locations interpreted as the calculating a difference based on first and second locations. Fig. 3 illustrates each point representing the different viewpoints to be used as the formula 1 relates to the calculation used for training the data set inclusive of summation of the bounding boxes in relation to the viewpoint or multiple keypoints).

In regards to claim 17. Kehl in view of Szeto teach the non-transitory computer readable medium according to claim 10.
Kehl further discloses
-wherein the calculation in step (e) is a loss function (Kehl, “3.2 Training Stage” section, page 1533; Reference discloses the formula 1 used for the training stage which incorporates discrete views and in plane rotations).

In regards to claim 18. Kehl discloses a non-transitory computer readable medium storing instructions to cause one or more processors (Kehl, Abstract) to: 
-train an object detection model with a training dataset so as to derive, by regression, a pose of an object from an image of a 3D model corresponding to the object, the training dataset containing image sets of the 3D model at respective poses (Kehl, Fig. 1 description, page 1531; Reference discloses a schematic overview of the SSD-style network prediction. We feed our network with a 299 × 299 RGB image and produce six feature maps at different scales from the input image using branches from InceptionV4 (i.e. training data set). Each map is then convolved with trained prediction kernels of shape (4 + C + V + R) to determine object class (i.e. object detection model), 2D bounding box as well as scores for possible viewpoints and in-plane rotations that are parsed to build 6D pose hypotheses. Thereby, C denotes the number of object classes, V the number of viewpoints and R the number of in-plane rotation classes. Interpreted as deriving, by regression, a pose of an object from an image of a 3D model corresponding to the object); 
-each image set including a second image of the 3D model at the pose without texture (Kehl, “4.1 Single Object Scenario” section, page 1535; Reference discloses 2D detection and 6D hypothesis refinement processes in which the system “performance for objects of smaller scale such as ’ape’, ’duck’ and ’cat’ is worse and we observed a drop both in recall and precision….The lower precision, on the other hand, stems from the fact that these objects are textureless and of uniform color which increases confusion with the heavy scene clutter.” Thus image pose capture for a textureless object provides a textureless 3D model).
Kehl does not disclose but Szeto teaches
-each image set including at least one first image of the 3D model at a pose with texture (Szeto, paragraph [0041]; Reference discloses a pose represented by a view where the two contours are aligned with each other substantially matches the pose of the real object OB1 relative to the imaging section 40. Image information of the real object OB1 in the captured image is stored as appearance data in association with the pose. The appearance data according to the present embodiment includes texture information (information regarding an appearance such as an edge, a pattern, or a color) of an outer surface of the real object OB1 imaged by the imaging section 40 in the pose).
Kehl and Szeto are combinable because they are in the same field of endeavor regarding training for object detection. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention for the 3D detection and 6D pose estimation features of Kehl to include the object detection algorithm features of Szeto in order to provide the user with a system that allows for use of a deep network for object detection that can accurately deal with 3D models and 6D pose estimation by assuming an RGB image as unique input at test time as taught by Kehl while incorporating the object detection algorithm features of Szeto in order to incorporate tracking functions for deriving the pose of the real object included in image frames in the video sequence in forward and/or backward directions and determining appearance information obtained from the corresponding image frame so that the appearance information and the data of the 2D model are associated with corresponding tracked or derived poses increasing accuracy of object detection applicable to improving object detection systems such as those taught in Kehl.


Claims 3 and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Kehl (2017 “SSD-6D: Making RGB-Based 3D Detection and 6D Pose Estimation Great Again”) in view of Szeto (US 2018/0137366 A1) as applied to claims 1 and 10 above, and further in view of Jiang (2019 “CNN-Based Non-contact Detection of Food Level in Bottles from RGB Images”, hereinafter referenced “Jiang”).

In regards to claim 3, Kehl in view of Szeto teaches the non-transitory computer readable medium according to claim 1.
Kehl and Szeto do not explicitly disclose but Jiang teaches
-wherein the 3D model has a first texture in the first 2D image and a second texture different from the first texture in the second 2D image (Jiang, “1. Introduction” section, page 203 and “4. Classification using CNN’s” section, page 207; Reference at page 203 discloses we augment the training sets used in learning the CNNs by (i) attaching physically printed labels with synthetic textures to the training bottles to provide invariance to label shape and texture, (ii) interchanging the contents of the training bottles to strengthen the invariance of the CNN to food color, and (iii) altering the intensities of images in random blocks in regions of the label and bottle border to prevent overfitting to bottle geometry, label shape, and label appearance. Reference at page 207 discloses the training bottles termed 'Syn' have labels with synthetic texture added; the bottles in the 'RAN' set have random image alterations, and the bottles in the 'Int' set have bottles with interchanged liquids; including bottles with synthetically-generated data on their labels maps to having bottles in the training data with algorithmically-chosen labels, and the 'Syn' bottles have a different texture from the 'Int' bottles (i.e. the images of bottles having different textures are interpreted as wherein the 3D model has a first texture in the first 2D image and a second texture different from the first texture in the second 2D image)).
Kehl and Szeto are combinable because they are in the same field of endeavor regarding training for object detection. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention for the 3D detection and 6D pose estimation features of Kehl to include the object detection algorithm features of Szeto in order to provide the user with a system that allows for use of a deep network for object detection that can accurately deal with 3D models and 6D pose estimation by assuming an RGB image as unique input at test time as taught by Kehl while incorporating the object detection algorithm features of Szeto in order to incorporate tracking functions for deriving the pose of the real object included in image frames in the video sequence in forward and/or backward directions and determining appearance information obtained from the corresponding image frame so that the appearance information and the data of the 2D model are associated with corresponding tracked or derived poses increasing accuracy of object detection applicable to improving object detection systems such as those taught in Kehl.
Kehl and Jiang are also combinable because they are in the same field of endeavor regarding training for object detection. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention for the 3D detection and 6D pose estimation features of Kehl, in view of the object detection algorithm features of Szeto, to include the non-contact image detection features of Jiang in order to provide the user with a system that allows for use of a deep network for object detection that can accurately deal with 3D models and 6D pose estimation by assuming an RGB image as unique input at test time as taught by Kehl while incorporating the object detection algorithm features of Szeto in order to incorporate tracking functions for deriving the pose of the real object included in image frames in the video sequence in forward and/or backward directions and determining appearance information obtained from the corresponding image frame so that the appearance information and the data of the 2D model are associated with corresponding tracked or derived poses. Further incorporating the non-contact image detection features of Jiang allows for use of deep convolutional networks trained on RGB images, including data with printed labels having synthetic textures, to increase object detection accuracy, applicable to improving object detection systems such as those taught in Kehl and Jiang.
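For illustration only, augmentation (iii) quoted above from Jiang, altering the intensities of images in random blocks within a region, can be sketched as follows. The function name, the grayscale nested-list image representation, the square block shape, and all parameters are assumptions made for this sketch and are not taken from Jiang.

```python
import random
from typing import List, Tuple

Image = List[List[int]]  # grayscale image, values in 0..255 (assumed format)


def alter_random_blocks(image: Image, region: Tuple[int, int, int, int],
                        num_blocks: int, block_size: int, max_delta: int,
                        rng: random.Random) -> Image:
    """Return a copy of image with intensities shifted in random square blocks.

    region is (top, left, bottom, right) with exclusive bottom/right; only
    pixels inside it are touched, standing in for the label or border region.
    """
    out = [row[:] for row in image]  # leave the input image unmodified
    top, left, bottom, right = region
    for _ in range(num_blocks):
        # Choose a block origin so the block starts inside the region.
        r0 = rng.randrange(top, max(top + 1, bottom - block_size + 1))
        c0 = rng.randrange(left, max(left + 1, right - block_size + 1))
        delta = rng.randint(-max_delta, max_delta)
        for r in range(r0, min(r0 + block_size, bottom)):
            for c in range(c0, min(c0 + block_size, right)):
                # Clamp so the result stays a valid 8-bit intensity.
                out[r][c] = min(255, max(0, out[r][c] + delta))
    return out
```

Applying the function with different seeds to the same source image yields training images whose appearance differs only inside the chosen region.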

In regards to claim 12, Kehl in view of Szeto teaches the non-transitory computer readable medium according to claim 10.
Kehl and Szeto do not disclose but Jiang teaches
-wherein the 3D model has a same texture in both the first 2D image and the second 2D image (Jiang, “1. Introduction” section, page 203 and “4. Classification using CNN’s” section page 207; Reference at page 203 discloses we augment the training sets used in learning the CNNs by (i) attaching physically printed labels with synthetic textures to the training bottles to provide invariance to label shape and texture, (ii) interchanging the contents of the training bottles to strengthen the invariance of the CNN to food color, and (iii) altering the intensities of images in random blocks in regions of the label and bottle border to prevent overfitting to bottle geometry, label shape, and label appearance. Page 207 discloses the training bottles termed 'Syn' have labels with synthetic texture added; the bottles in the 'RAN' set have random image alterations, and the bottles in the 'Int' set have bottles with interchanged liquids; including bottles with synthetically-generated data on their labels maps to having bottles in the training data with algorithmically-chosen labels. Thus images relating to 'Syn' bottles have the same texture regarding the synthetic data set.).
Kehl and Szeto are combinable because they are in the same field of endeavor regarding training for object detection. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention for the 3D detection and 6D pose estimation features of Kehl to include the object detection algorithm features of Szeto in order to provide the user with a system that allows for use of a deep network for object detection that can accurately deal with 3D models and 6D pose estimation by assuming an RGB image as unique input at test time as taught by Kehl while incorporating the object detection algorithm features of Szeto in order to incorporate tracking functions for deriving the pose of the real object included in image frames in the video sequence in forward and/or backward directions and determining appearance information obtained from the corresponding image frame so that the appearance information and the data of the 2D model are associated with corresponding tracked or derived poses increasing accuracy of object detection applicable to improving object detection systems such as those taught in Kehl.
Kehl and Jiang are also combinable because they are in the same field of endeavor regarding training for object detection. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention for the 3D detection and 6D pose estimation features of Kehl, in view of the object detection algorithm features of Szeto, to include the non-contact image detection features of Jiang in order to provide the user with a system that allows for use of a deep network for object detection that can accurately deal with 3D models and 6D pose estimation by assuming an RGB image as unique input at test time as taught by Kehl while incorporating the object detection algorithm features of Szeto in order to incorporate tracking functions for deriving the pose of the real object included in image frames in the video sequence in forward and/or backward directions and determining appearance information obtained from the corresponding image frame so that the appearance information and the data of the 2D model are associated with corresponding tracked or derived poses. Further incorporating the non-contact image detection features of Jiang allows for use of deep convolutional networks trained on RGB images, including data with printed labels having synthetic textures, to increase object detection accuracy, applicable to improving object detection systems such as those taught in Kehl and Jiang.


Claims 5 and 6 are rejected under 35 U.S.C. 103 as being unpatentable over Kehl (2017 “SSD-6D: Making RGB-Based 3D Detection and 6D Pose Estimation Great Again”) in view of Szeto (US 2018/0137366 A1) as applied to claim 1 above, and further in view of Rad (US 2018/0137644 A1, hereinafter referenced “Rad”).

In regards to claim 5, Kehl in view of Szeto teaches the non-transitory computer readable medium according to claim 1.
Kehl and Szeto do not explicitly disclose but Rad teaches
(Rad, paragraph [0096]; Reference discloses Furthermore, to be robust to clutter and scale changes, the segmented objects can be scaled by a factor of s ϵ [S1, S2] and the background can be changed by a patch extracted from a randomly selected image from a dataset of available backgrounds (e.g., from an ImageNet dataset). Selecting images from a dataset of available background images for generating the training images for the object detection training is interpreted as a selected background for blending with texture of the multiple images used for training).
Kehl and Szeto are combinable because they are in the same field of endeavor regarding training for object detection. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention for the 3D detection and 6D pose estimation features of Kehl to include the object detection algorithm features of Szeto in order to provide the user with a system that allows for use of a deep network for object detection that can accurately deal with 3D models and 6D pose estimation by assuming an RGB image as unique input at test time as taught by Kehl while incorporating the object detection algorithm features of Szeto in order to incorporate tracking functions for deriving the pose of the real object included in image frames in the video sequence in forward and/or backward directions and determining appearance information obtained from the corresponding image frame so that the appearance information and the data of the 2D model are associated with corresponding tracked or derived poses increasing accuracy of object detection applicable to improving object detection systems such as those taught in Kehl.
Kehl and Rad are also combinable because they are in the same field of endeavor regarding object detection features. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention for the 3D detection and 6D pose estimation features of Kehl, in view of the object detection algorithm features of Szeto, to include the object pose estimation features of Rad in order to provide the user with a system that allows for use of a deep network for object detection that can accurately deal with 3D models and 6D pose estimation by assuming an RGB image as unique input at test time as taught by Kehl while incorporating the object detection algorithm features of Szeto in order to incorporate tracking functions for deriving the pose of the real object included in image frames in the video sequence in forward and/or backward directions and determining appearance information obtained from the corresponding image frame so that the appearance information and the data of the 2D model are associated with corresponding tracked or derived poses. Further incorporating the object pose estimation features of Rad allows for use of bounding boxes for determining location and orientation of objects in real images to facilitate more effective object detection operation in various applications, applicable to improving object detection systems such as those taught in Kehl and Rad.

In regards to claim 6, Kehl in view of Szeto teaches the non-transitory computer readable medium according to claim 1.
Kehl further discloses
-wherein the method further comprises: (Kehl, Fig. 3 and “3.2 Training Stage” section, page 1532; Reference discloses determining transformations regarding closest sampled discrete viewpoint and in-plane rotation as well as setting its four corner values to the tightest fit around the mask as a regression target. We show some training images in Figure 2 (i.e. process for generating multiple images)),
Kehl does not explicitly disclose but Szeto teaches


and (k) performing steps (d) through (h) using the first and second 2D images and separately the third and fourth 2D images (Szeto, paragraph [0041]; Reference discloses optimization, that is, refinement of the pose is performed through iterative computations using, for example, the Gauss-Newton method. If the pose is optimized (refined), the image contour and the contour of the 2D model are aligned with each other on the display section 20 with higher accuracy. Reference discloses use of iterative method for optimization and puts no limit on the number of iterations thus encompassing performing the steps multiple times for at least 4 images as claimed).
Kehl and Szeto do not disclose but Rad teaches
(i) inserting a first background image into the first and second 2D images (Rad, paragraph [0096]; Reference discloses Furthermore, to be robust to clutter and scale changes, the segmented objects can be scaled by a factor of s ϵ [S1, S2] and the background can be changed by a patch extracted from a randomly selected image from a dataset of available backgrounds (e.g., from an ImageNet dataset). Selecting images from a dataset of available background images for generating the training images for the object detection training is interpreted as a selected background for multiple images used for training);
-but with a second background image different from the first background image (Rad, paragraph [0096]; Reference discloses Furthermore, to be robust to clutter and scale changes, the segmented objects can be scaled by a factor of s ϵ [S1, S2] and the background can be changed (i.e. different background) by a patch extracted from a randomly selected image from a dataset of available backgrounds (e.g., from an ImageNet dataset). Selecting images from a dataset of available background images for generating the training images for the object detection training is interpreted as a selected background for blending with texture of the multiple images generated for training);
Kehl and Rad are also combinable because they are in the same field of endeavor regarding object detection features. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention for the 3D detection and 6D pose estimation features of Kehl, in view of the object detection algorithm features of Szeto, to include the object pose estimation features of Rad in order to provide the user with a system that allows for use of a deep network for object detection that can accurately deal with 3D models and 6D pose estimation by assuming an RGB image as unique input at test time as taught by Kehl while incorporating the object detection algorithm features of Szeto in order to incorporate tracking functions for deriving the pose of the real object included in image frames in the video sequence in forward and/or backward directions and determining appearance information obtained from the corresponding image frame so that the appearance information and the data of the 2D model are associated with corresponding tracked or derived poses. Further incorporating the object pose estimation features of Rad allows for use of bounding boxes for determining location and orientation of objects in real images to facilitate more effective object detection operation in various applications, applicable to improving object detection systems such as those taught in Kehl and Rad.
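For illustration only, the background replacement quoted above from Rad, paragraph [0096], can be sketched as compositing a segmented object over a patch cropped from a randomly selected background image, done twice with two different backgrounds. The function names, the nested-list image representation, and the binary mask are assumptions made for this sketch and are not taken from Rad; the scaling by a factor s in [S1, S2] is omitted.

```python
import random
from typing import List, Tuple

Image = List[List[int]]  # single-channel image as nested lists (assumed format)


def crop_patch(background: Image, height: int, width: int,
               rng: random.Random) -> Image:
    """Crop a random height x width patch from a background image."""
    max_r = len(background) - height
    max_c = len(background[0]) - width
    if max_r < 0 or max_c < 0:
        raise ValueError("background is smaller than the requested patch")
    r0 = rng.randint(0, max_r)
    c0 = rng.randint(0, max_c)
    return [row[c0:c0 + width] for row in background[r0:r0 + height]]


def replace_background(obj: Image, mask: Image, patch: Image) -> Image:
    """Keep object pixels where mask is non-zero, patch pixels elsewhere."""
    return [[o if m else p for o, m, p in zip(orow, mrow, prow)]
            for orow, mrow, prow in zip(obj, mask, patch)]


def pair_with_distinct_backgrounds(obj: Image, mask: Image,
                                   backgrounds: List[Image],
                                   rng: random.Random) -> Tuple[Image, Image]:
    """Composite the same object over two different randomly chosen backgrounds."""
    first_bg, second_bg = rng.sample(backgrounds, 2)  # two distinct list entries
    h, w = len(obj), len(obj[0])
    return (replace_background(obj, mask, crop_patch(first_bg, h, w, rng)),
            replace_background(obj, mask, crop_patch(second_bg, h, w, rng)))
```

The two returned images share the object pixels and differ only in the background, which is the property the first and second background images are relied on for above.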

Claim 13 is rejected under 35 U.S.C. 103 as being unpatentable over Kehl (2017 “SSD-6D: Making RGB-Based 3D Detection and 6D Pose Estimation Great Again”) in view of Szeto (US 2018/0137366 A1) in view of Jiang (2019 “CNN-Based Non-contact Detection of Food Level in Bottles from RGB Images”) as applied to claim 12 above, and further in view of Rad (US 2018/0137644 A1).

In regards to claim 13, Kehl in view of Szeto in further view of Jiang teaches the non-transitory computer readable medium according to claim 12.
Kehl and Szeto do not disclose but Rad teaches
-wherein a background in the first 2D image and the second 2D image is selected to blend with the texture (Rad, paragraph [0096]; Reference discloses Furthermore, to be robust to clutter and scale changes, the segmented objects can be scaled by a factor of s ϵ [S1, S2] and the background can be changed by a patch extracted from a randomly selected image from a dataset of available backgrounds (e.g., from an ImageNet dataset). Selecting images from a dataset of available background images for generating the training images for the object detection training is interpreted as a selected background for blending with texture of the multiple images used for training).
Kehl and Szeto are combinable because they are in the same field of endeavor regarding training for object detection. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention for the 3D detection and 6D pose estimation features of Kehl to include the object detection algorithm features of Szeto in order to provide the user with a system that allows for use of a deep network for object detection that can accurately deal with 3D models and 6D pose estimation by assuming an RGB image as unique input at test time as taught by Kehl while incorporating the object detection algorithm features of Szeto in order to incorporate tracking functions for deriving the pose of the real object included in image frames in the video sequence in forward and/or backward directions and determining appearance information obtained from the corresponding image frame so that the appearance information and the data of the 2D model are associated with corresponding tracked or derived poses increasing accuracy of object detection applicable to improving object detection systems such as those taught in Kehl.
Kehl and Jiang are also combinable because they are in the same field of endeavor regarding training for object detection. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention for the 3D detection and 6D pose estimation features of Kehl, in view of the object detection algorithm features of Szeto, to include the non-contact image detection features of Jiang in order to provide the user with a system that allows for use of a deep network for object detection that can accurately deal with 3D models and 6D pose estimation by assuming an RGB image as unique input at test time as taught by Kehl while incorporating the object detection algorithm features of Szeto in order to incorporate tracking functions for deriving the pose of the real object included in image frames in the video sequence in forward and/or backward directions and determining appearance information obtained from the corresponding image frame so that the appearance information and the data of the 2D model are associated with corresponding tracked or derived poses. Further incorporating the non-contact image detection features of Jiang allows for use of deep convolutional networks trained on RGB images, including data with printed labels having synthetic textures, to increase object detection accuracy, applicable to improving object detection systems such as those taught in Kehl and Jiang.
Kehl and Rad are also combinable because they are in the same field of endeavor regarding object detection features. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention for the 3D detection and 6D pose estimation features of Kehl, in view of the object detection algorithm features of Szeto in further view of the non-contact image detection features of Jiang, to include the object pose estimation features of Rad in order to provide the user with a system that allows for use of a deep network for object detection that can accurately deal with 3D models and 6D pose estimation by assuming an RGB image as unique input at test time as taught by Kehl while incorporating the object detection algorithm features of Szeto in order to incorporate tracking functions for deriving the pose of the real object included in image frames in the video sequence in forward and/or backward directions and determining appearance information obtained from the corresponding image frame so that the appearance information and the data of the 2D model are associated with corresponding tracked or derived poses. Further incorporating the non-contact image detection features of Jiang allows for use of deep convolutional networks trained on RGB images, including data with printed labels having synthetic textures, to increase object detection accuracy. Adding the object pose estimation features of Rad allows for use of bounding boxes for determining location and orientation of objects in real images to facilitate more effective object detection operation in various applications, applicable to improving object detection systems such as those taught in Kehl, Jiang, and Rad.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: See the Notice of References Cited (PTO-892).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to TERRELL M ROBINSON whose telephone number is (571)270-3526. The examiner can normally be reached 8am-5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kent Chang, can be reached at 571-272-7667. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.






/TERRELL M ROBINSON/Examiner, Art Unit 2619