United States Patent and Trademark Office

Commissioner for Patents
United States Patent and Trademark Office
P.O. Box 1450
Alexandria, VA 22313-1450
www.uspto.gov

BEFORE THE PATENT TRIAL AND APPEAL BOARD


Application Number: 17/018,141
Filing Date: 11 Sep 2020
Appellant(s): FANUC CORPORATION



__________________
John A. Miller
For Appellant


EXAMINER’S ANSWER





This is in response to the appeal brief filed 6/15/2022.
(1) Grounds of Rejection to be Reviewed on Appeal
Every ground of rejection set forth in the Office action dated 3/4/2022 from which the appeal is taken is being maintained by the examiner except for the grounds of rejection (if any) listed under the subheading “WITHDRAWN REJECTIONS.”  New grounds of rejection (if any) are provided under the subheading “NEW GROUNDS OF REJECTION.”
The following ground(s) of rejection are applicable to the appealed claims.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1-5 and 7-21 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Claim 1 recites the limitation “the pixels in each object in the segmentation image have the same label and the pixels of different objects in the segmentation image have different labels” in lines 7-9, and “the 2D image” in line 14. It is indefinite and unclear which objects have the same label (i.e. same object?) since pixels of different objects have different labels. It is also indefinite and unclear which 2D image is being referred to in line 14 (i.e. 2D RGB image, 2D segmentation image, 2D cropped image).
Claims 2-5, 7-9, and 10-12 are dependent on claim 1 and are therefore rejected under 112(b) for the same reasons as set forth above. 
Claim 13 recites the limitation “the pixels in each object in the segmentation image have the same label and the pixels of different objects in the segmentation image have different labels” in lines 9-11. It is indefinite and unclear which objects have the same label (i.e. same object?) since pixels of different objects have different labels. 
Claims 14-17 are dependent on claim 13 and are therefore rejected under 112(b) for the same reasons as set forth above. 
Claim 18 recites the limitation “the pixels in each object in the segmentation image have the same label and the pixels of different objects in the segmentation image have different labels” in lines 8-10. It is indefinite and unclear which objects have the same label (i.e. same object?) since pixels of different objects have different labels.
Claims 19-21 are dependent on claim 18 and are therefore rejected under 112(b) for the same reasons as set forth above. 

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2, 10-13 and 17-18 are rejected under 35 U.S.C. 103 as being unpatentable over “SegICP: Integrated Deep Semantic Segmentation and Pose Estimation” by Wong et al. (hereinafter “Wong”) in view of “Deep Polarization Cues for Transparent Object Segmentation” by Kalra et al. (hereinafter “Kalra”) and further in view of “Patch-based 3D Human Pose Refinement” by Wan et al. (hereinafter “Wan”).
Regarding claim 1, Wong et al. teaches, a method for obtaining a 3D pose of objects in a group of objects, said method comprising (As shown in Fig. 2, a 6-DOF pose (i.e., includes 3D pose) is estimated for each object): 
obtaining a 2D red-green-blue (RGB) color image of the objects using a camera (As shown in Fig. 2, a camera RGB image of the objects is obtained and used as an input to the CNN; As shown in Fig. 4, the motion capture system consists of an RGB-D camera);
generating a 2D segmentation image of the RGB image by performing an image segmentation process that extracts features from the RGB image and assigns a label to pixels in the segmentation image so that the pixels in each object in the segmentation image have the same label (As shown in Fig. 2, there is pixel-level semantic segmentation on the RGB image (i.e. 2D image) in which similar objects are labeled with the same color (i.e. two objects are labeled with blue color); Pg. 3, left-hand column: RGB frames are first passed through a CNN which outputs a segmented mask with pixel-wise semantic object labels; Pg. 3, left-hand column: the resulting segmentation is used to extract each object’s 3D point cloud from the scene cloud. The identity of each segmented object (the object’s semantic label) predicted by SegNet is then used to retrieve its corresponding 3D mesh model from the object model library) 
separating the segmentation image into a plurality of 2D cropped images where each cropped image includes one of the objects (As shown in Fig. 2, the point cloud is cropped using the segmentation; Pg. 3, left-hand column: performs 3D point cloud matching against cropped versions of these mesh models… This mask is then used to crop the corresponding point cloud, generating individual point clouds for each detected object); 
estimating the 3D pose of each object in each cropped image (Pg. 3, left-hand column: this mask is then used to crop the corresponding point cloud, generating individual point clouds for each detected object. ICP is used to register each object’s point cloud with its full point cloud database model and estimate the pose of the object with respect to the sensor; As shown in Fig. 2, the pose is estimated for each object based on point cloud alignment to get the best estimated object pose; As shown in Fig. 3, the best fit score is the pose estimation for each object in the cropped image) that includes extracting a plurality of features on the object from the 2D image (As shown in Fig. 2, a 2D RGB image is captured by a camera and the 3D pose is estimated after segmentation of objects and their features with a convolutional neural network; Pg. 3, left-hand column: the resulting segmentation is used to extract each object’s 3D point cloud from the scene cloud and the RGB image (i.e. 2D image) is passed through the CNN to output a mask which is used to generate points (i.e. features) for each object). 
Wong does not expressly disclose the following limitations: and the pixels of different objects in the segmentation image have different labels including objects that have a same or similar shape; and combining the 3D poses into a single pose image.
However, Kalra teaches, and the pixels of different objects in the segmentation image have different labels including objects that have a same or similar shape (As shown in Fig. 4 and Fig. 6, an instance segmentation mask is output in which similar objects (i.e. spherical objects) each have a different label (i.e. pixels of different objects have different colors)).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include having different labels for different objects that have same or similar shape in a segmentation image as taught by Kalra into the method for obtaining a 3D pose of objects as taught by Wong to improve segmentation of transparent objects (Kalra, Pg. 8605, right-hand column).
The combination of Wong and Kalra does not expressly disclose the following limitation: and combining the 3D poses into a single pose image.
However, Wan teaches, and combining the 3D poses into a single pose image (As shown in Fig. 1, the residual 3D pose and the initial pose estimate are combined to create a final 3D refined pose estimate image).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include combining the 3D poses into a single pose image as taught by Wan into the combined method for obtaining a 3D pose of objects of Wong and Kalra to improve the accuracy of 3D poses through refinement methods (Wan, Pg. 1).
Regarding claim 2, the combination of Wong, Kalra, and Wan teaches the limitations as explained above in claim 1. 
Kalra in the combination further teaches, wherein generating a segmentation image includes using a deep learning mask R-CNN (convolutional neural network) (As shown in Pg. 8599, right-hand column, there is a deep learning framework for segmentation of transparent objects which includes a mask R-CNN; Pg. 8600, right-hand column; Fig. 4).
Regarding claim 10, the combination of Wong, Kalra, and Wan teaches the limitations as explained above in claim 1. 
Kalra in the combination further teaches, wherein the objects are transparent (As shown in Pg. 8600, left-hand column, there is a method for transparent object instance segmentation; see Fig. 7, transparent object bin picking).
Regarding claim 11, the combination of Wong, Kalra, and Wan teaches the limitations as explained above in claim 1. 
Wong in the combination further teaches, wherein the group of objects includes objects having different shapes (As shown in Fig. 2, a 6-DOF pose (i.e., includes 3D pose) is estimated for each object and the objects in the image are of different shapes).
Regarding claim 12, the combination of Wong, Kalra, and Wan teaches the limitations as explained above in claim 1. 
The combination of Wong, Kalra, and Wan further teaches, wherein the method is employed in a robot system (Wong, As shown in Fig. 2, a 6-DOF pose (i.e., includes 3D pose) is estimated for each object and the colored pixels in the segmented image are detected by the Kinect1 mounted on top of a PR2 robot) and the objects are being picked up by a robot (Kalra, As shown in Fig. 7, a UR3 robotic arm picks transparent objects from a bin with a suction cup gripper).
Regarding claim 13, Wong teaches, a method for obtaining a 3D pose (As shown in Fig. 2, a 6-DOF pose (i.e., includes 3D pose) is estimated for each object),
said method comprising: obtaining a 2D red-green-blue (RGB) color image of the objects using a camera (As shown in Fig. 2, a camera RGB image of the objects is obtained and used as an input to the CNN; As shown in Fig. 4, the motion capture system consists of an RGB-D camera);
generating a segmentation image of the RGB image by performing an image segmentation process using a deep learning convolutional neural network that extracts features from the RGB image and assigns a label to pixels in the segmentation image so that the pixels in each object in the segmentation image have the same label (As shown in Fig. 2, there is pixel-level semantic segmentation on the RGB image in which similar objects are labeled with the same color; Pg. 3, left-hand column: RGB frames are first passed through a CNN which outputs a segmented mask with pixel-wise semantic object labels; Pg. 3, left-hand column: the resulting segmentation is used to extract each object’s 3D point cloud from the scene cloud. The identity of each segmented object (the object’s semantic label) predicted by SegNet is then used to retrieve its corresponding 3D mesh model from the object model library);
separating the segmentation image into a plurality of cropped images where each cropped image includes one of the objects (As shown in Fig. 2, the point cloud is cropped using the segmentation; Pg. 3, left-hand column: performs 3D point cloud matching against cropped versions of these mesh models… This mask is then used to crop the corresponding point cloud, generating individual point clouds for each detected object); 
estimating the 3D pose of each object in each cropped image (Pg. 3, left-hand column: this mask is then used to crop the corresponding point cloud, generating individual point clouds for each detected object. ICP is used to register each object’s point cloud with its full point cloud database model and estimate the pose of the object with respect to the sensor; As shown in Fig. 2, the pose is estimated for each object based on point cloud alignment to get the best estimated object pose; As shown in Fig. 3, the best fit score is the pose estimation for each object in the cropped image) that includes extracting a plurality of features on the object from the 2D image (As shown in Fig. 2, a 2D RGB image is captured by a camera and the 3D pose is estimated after segmentation of objects and their features with a convolutional neural network; Pg. 3, left-hand column: the resulting segmentation is used to extract each object’s 3D point cloud from the scene cloud and the RGB image (i.e. 2D image) is passed through the CNN to output a mask which is used to generate points (i.e. features) for each object);
wherein obtaining a color image, generating a segmentation image, separating the segmentation image, estimating a 3D pose of each object (As shown in Fig. 2, the RGB image obtained is segmented, cropped, and the estimated object pose is determined for the PR2 robot system).
Wong does not expressly disclose the following limitations: of transparent objects in a group of transparent objects to allow a robot to pick up the objects, and the pixels of different objects in the segmentation image have different labels including objects that have a same or similar shape; and combining the 3D poses into a single pose image, and combining the 3D poses are performed each time an object is picked up from the group of objects by the robot.
However, Kalra teaches, of transparent objects in a group of transparent objects to allow a robot to pick up the objects (As shown in Fig. 7, a UR3 robotic arm picks transparent objects from a bin with a suction cup gripper), 
and the pixels of different objects in the segmentation image have different labels including objects that have a same or similar shape (As shown in Fig. 4 and Fig. 6, an instance segmentation mask is output in which similar objects (i.e. spherical objects) each have a different label (i.e. pixels of different objects have different colors));
performed each time an object is picked up from the group of objects by the robot (As shown in Pg. 8605, right-hand column, there is a pose estimation component during bin picking; in which the robot arm picks the objects; As shown in Pg. 8606, it is determined how many times the robotic arm misses certain picks (i.e. object picking is performed multiple times by the robotic arm)).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include transparent objects being picked by a robot, a robot picking up objects multiple times, and having different labels for different objects that have same or similar shape in a segmentation image, as taught by Kalra into the method for obtaining a 3D pose of objects as taught by Wong to improve segmentation of transparent objects (Kalra, Pg. 8605, right-hand column).
The combination of Wong and Kalra does not expressly disclose the following limitation: and combining the 3D poses into a single pose image; and combining the 3D poses.
However, Wan teaches, and combining the 3D poses into a single pose image (As shown in Fig. 1, the residual 3D pose and the initial pose estimate are combined to create a final 3D refined pose estimate image);
and combining 3D poses (As shown in Fig. 1, the residual 3D pose and the initial pose estimate are combined to create a final 3D refined pose estimate image).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include combining the 3D poses and combining the 3D poses into a single pose image as taught by Wan into the combined method for obtaining a 3D pose of objects of Wong and Kalra to improve the accuracy of 3D poses through refinement methods (Wan, Pg. 1).
Regarding claim 17, the combination of Wong, Kalra, and Wan teaches the limitations as explained above in claim 13.
Wong in the combination further teaches, wherein the camera is a 2D camera or a 3D camera (As shown in Fig. 2, a camera RGB image of the objects is obtained and used as an input to the CNN; As shown in Fig. 4, the motion capture system consists of an RGB-D camera).
Regarding claim 18, Wong teaches, a system for obtaining a 3D pose of objects in a group of objects, said system comprising (As shown in Fig. 2, a 6-DOF pose (i.e., includes 3D pose) is estimated for each object and the colored pixels in the segmented image are detected by the Kinect1 mounted on top of a PR2 robot): 
a camera that provides a 2D red-green-blue (RGB) color image of the objects (As shown in Fig. 2, a camera RGB image of the objects is obtained and used as an input to the CNN; As shown in Fig. 4, the motion capture system consists of an RGB-D camera);
a deep learning convolutional neural network that generates a segmentation image of the objects by performing an image segmentation process that extracts features from the RGB image and assigns a label to pixels in the segmentation image so that the pixels in each object in the segmentation image have the same label (As shown in Fig. 2, there is pixel-level semantic segmentation on the RGB image in which similar objects are labeled with the same color; Pg. 3, left-hand column: RGB frames are first passed through a CNN which outputs a segmented mask with pixel-wise semantic object labels; Pg. 3, left-hand column: the resulting segmentation is used to extract each object’s 3D point cloud from the scene cloud. The identity of each segmented object (the object’s semantic label) predicted by SegNet is then used to retrieve its corresponding 3D mesh model from the object model library)
means for separating the segmentation image into a plurality of cropped images where each cropped image includes one of the objects (As shown in Fig. 2, the point cloud is cropped using the segmentation; Pg. 3, left-hand column: performs 3D point cloud matching against cropped versions of these mesh models… This mask is then used to crop the corresponding point cloud, generating individual point clouds for each detected object);
means for estimating the 3D pose of each object in each cropped image (Pg. 3, left-hand column: this mask is then used to crop the corresponding point cloud, generating individual point clouds for each detected object. ICP is used to register each object’s point cloud with its full point cloud database model and estimate the pose of the object with respect to the sensor; As shown in Fig. 2, the pose is estimated for each object based on point cloud alignment to get the best estimated object pose; As shown in Fig. 3, the best fit score is the pose estimation for each object in the cropped image) that includes extracting a plurality of features on the object from the 2D image (As shown in Fig. 2, a 2D RGB image is captured by a camera and the 3D pose is estimated after segmentation of objects and their features with a convolutional neural network; Pg. 3, left-hand column: the resulting segmentation is used to extract each object’s 3D point cloud from the scene cloud and the RGB image (i.e. 2D image) is passed through the CNN to output a mask which is used to generate points (i.e. features) for each object).
Wong does not expressly disclose the following limitations: and the pixels of different objects in the segmentation image have different labels including objects that have a same or similar shape; and means for combining the 3D poses into a single pose image.
However, Kalra teaches, and the pixels of different objects in the segmentation image have different labels including objects that have a same or similar shape (As shown in Fig. 4 and Fig. 6, an instance segmentation mask is output in which similar objects (i.e. spherical objects) each have a different label (i.e. pixels of different objects have different colors)).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include having different labels for different objects that have same or similar shape in a segmentation image as taught by Kalra into the method for obtaining a 3D pose of objects as taught by Wong to improve segmentation of transparent objects (Kalra, Pg. 8605, right-hand column).
The combination of Wong and Kalra does not expressly disclose the following limitation: and means for combining the 3D poses into a single pose image.
However, Wan teaches, and means for combining the 3D poses into a single pose image (As shown in Fig. 1, the residual 3D pose and the initial pose estimate are combined to create a final 3D refined pose estimate image).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include combining the 3D poses into a single pose image as taught by Wan into the combined method for obtaining a 3D pose of objects of Wong and Kalra to improve the accuracy of 3D poses through refinement methods (Wan, Pg. 1).

Claims 3-4 are rejected under 35 U.S.C. 103 as being unpatentable over “SegICP: Integrated Deep Semantic Segmentation and Pose Estimation” by Wong et al. (hereinafter “Wong”) in view of “Deep Polarization Cues for Transparent Object Segmentation” by Kalra et al. (hereinafter “Kalra”) and further in view of “Patch-based 3D Human Pose Refinement” by Wan et al. (hereinafter “Wan”) and “An Object Detector based on Multiscale Sliding Window Search using a Fully Pipelined Binarized CNN on an FPGA” by Nakahara et al. (hereinafter “Nakahara”).
Regarding claim 3, the combination of Wong, Kalra, and Wan teaches the limitations as explained above in claim 1. 
Wong in the combination further teaches, wherein generating a segmentation image includes (As shown in Fig. 2, there is pixel-level semantic segmentation on the RGB image).
The combination of Wong, Kalra, and Wan does not expressly disclose the following limitations: providing a plurality of bounding boxes, aligning the bounding boxes to the extracted features and providing a bounding box image that includes bounding boxes surrounding the objects.
However, Nakahara teaches, providing a plurality of bounding boxes (As shown in Fig. 1, bounding boxes are placed around each object detected), 
aligning the bounding boxes to the extracted features and providing a bounding box image that includes bounding boxes surrounding the objects (As shown in Fig. 7, the images are extracted by the sliding window which includes features of the objects and bounding boxes are added around the objects of interest; Pg. 171, right-hand column: the classifier extracts the image that is equal to or larger than the threshold as a region of interest (ROI) candidate).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include providing bounding boxes around the objects as taught by Nakahara into the combined method for obtaining a 3D pose of objects of Wong, Kalra, and Wan to improve detection of the presence of an object (Nakahara, Pg. 172).
Regarding claim 4, the combination of Wong, Kalra, Wan, and Nakahara teaches the limitations as explained above in claim 3. 
The combination of Wong, Kalra, Wan, and Nakahara further teaches, wherein generating a segmentation image includes (Wong, As shown in Fig. 2, there is pixel-level semantic segmentation on the RGB image) determining a probability that an object exists in each bounding box (Nakahara, As shown in Fig. 1, the classification probability is determined and object detection with bounding boxes includes both class probabilities and localization; Pg. 172, left-hand column: Fig. 8 shows a computation of a sliding window. For object detection, the specified hardware detects the presence of the object in a bounding box at various positions and scales in the image… To actually classify whether the object exists or not in the window, one can use any of the thousands of classification methods proposed. In this paper, we use BCNN to provide higher accuracy classification).

Claims 5, 14 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over “SegICP: Integrated Deep Semantic Segmentation and Pose Estimation” by Wong et al. (hereinafter “Wong”) in view of “Deep Polarization Cues for Transparent Object Segmentation” by Kalra et al. (hereinafter “Kalra”) and further in view of “Patch-based 3D Human Pose Refinement” by Wan et al. (hereinafter “Wan”), “An Object Detector based on Multiscale Sliding Window Search using a Fully Pipelined Binarized CNN on an FPGA” by Nakahara et al. (hereinafter “Nakahara”), and Shellshear (US 2015/0178568 A1).
Regarding claim 5, the combination of Wong, Kalra, Wan, and Nakahara teaches the limitations as explained above in claim 3. 
Wong in the combination further teaches, wherein generating a segmentation image includes (As shown in Fig. 2, there is pixel-level semantic segmentation on the RGB image).
The combination of Wong, Kalra, Wan, and Nakahara does not expressly disclose the following limitations: removing pixels from each bounding box in the bounding box image that are not associated with an object.
However, Shellshear teaches, removing pixels from each bounding box in the bounding box image that are not associated with an object (Para. 0104: remove the effects of pixels inside the initial bounding box that do not correspond to the object).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include removing pixels from the bounding box that are not associated with the object as taught by Shellshear into the combined method for obtaining a 3D pose of objects of Wong, Kalra, Wan, and Nakahara to avoid including background pixels in an image for improved object tracking (Shellshear, Para. 0014).
Regarding claim 14, the combination of Wong, Kalra, and Wan teaches the limitations as explained above in claim 13.
Wong in the combination further teaches, generating a segmentation image includes (As shown in Fig. 2, there is pixel-level semantic segmentation on the RGB image).
The combination of Wong, Kalra, and Wan does not expressly disclose the following limitations: providing a plurality of vertically aligned bounding boxes having the same orientation, aligning the bounding boxes to the extracted features using a sliding window template, providing a bounding box image that includes bounding boxes surrounding the objects, determining a probability that an object exists in each bounding box, removing pixels from each bounding box that are not associated with an object and identifying a center pixel of each object in the bounding boxes.
However, Nakahara teaches, providing a plurality of vertically aligned bounding boxes having the same orientation (As shown in Fig. 1, vertical bounding boxes are placed around each object detected and have the same orientation), 
aligning the bounding boxes to the extracted features using a sliding window template (As shown in Fig. 7, the images are extracted by the sliding window which includes features of the objects and bounding boxes are added around the objects of interest; Pg. 171, right-hand column: the classifier extracts the image that is equal to or larger than the threshold as a region of interest (ROI) candidate),
providing a bounding box image that includes bounding boxes surrounding the objects (As shown in Fig. 7, the images are extracted by the sliding window which includes features of the objects and bounding boxes are added around the objects of interest),
determining a probability that an object exists in each bounding box (As shown in Fig. 1, the classification probability is determined and object detection with bounding boxes includes both class probabilities and localization; Pg. 172, left-hand column: Fig. 8 shows a computation of a sliding window. For object detection, the specified hardware detects the presence of the object in a bounding box at various positions and scales in the image… To actually classify whether the object exists or not in the window, one can use any of the thousands of classification methods proposed. In this paper, we use BCNN to provide higher accuracy classification).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include providing bounding boxes around the objects and determining a probability that an object exists in the bounding boxes as taught by Nakahara into the combined method for obtaining a 3D pose of objects of Wong, Kalra, and Wan to improve object detection and recognition accuracy (Nakahara, Pg. 172).
The combination of Wong, Kalra, Wan, and Nakahara does not expressly disclose the following limitations: removing pixels from each bounding box that are not associated with an object and identifying a center pixel of each object in the bounding boxes.
However, Shellshear teaches removing pixels from each bounding box that are not associated with an object (Para. 0104: remove the effects of pixels inside the initial bounding box that do not correspond to the object) and identifying a center pixel of each object in the bounding boxes (Para. 0016: the bounding box centroid is modified; Para. 0159: An average bounding box is then created and used as the bC(AVERAGE) track prediction for the current frame. The determination of track velocity is done by making use of the Track Position List to estimate centroid movement of the object over time; Note: the centroid is the geometric center of the object and is tracked in the bounding box over time).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include removing pixels from the bounding box that are not associated with the object and identifying a center of the object in the bounding box as taught by Shellshear into the combined method for obtaining a 3D pose of objects of Wong, Kalra, Wan, and Nakahara to improve object tracking (Shellshear, Para. 0014).
Regarding claim 19, the combination of Wong, Kalra, and Wan teaches the limitations as explained above in claim 18.
Wong in the combination further teaches, wherein the deep learning neural network (Pg. 3, left-hand column: RGB frames are first passed through a CNN which outputs a segmented mask with pixel-wise semantic object labels; See claim 18 above).
The combination of Wong, Kalra, and Wan does not expressly disclose the following limitations: provides a plurality of vertically aligned bounding boxes having the same orientation, aligns the bounding boxes to the extracted features using a sliding window template, provides a bounding box image that includes bounding boxes surrounding the objects, determines a probability that an object exists in each bounding box, removes pixels from each bounding box that are not associated with an object and identifies a center pixel of each object in the bounding boxes.
However, Nakahara teaches, provides a plurality of vertically aligned bounding boxes having the same orientation (As shown in Fig. 1, vertical bounding boxes are placed around each object detected and have the same orientation), 
aligns the bounding boxes to the extracted features using a sliding window template (As shown in Fig. 7, the images are extracted by the sliding window which includes features of the objects and bounding boxes are added around the objects of interest; Pg. 171, right-hand column: the classifier extracts the image that is equal to or larger than the threshold as a region of interest (ROI) candidate),
provides a bounding box image that includes bounding boxes surrounding the objects (As shown in Fig. 7, the images are extracted by the sliding window which includes features of the objects and bounding boxes are added around the objects of interest),
determines a probability that an object exists in each bounding box (As shown in Fig. 1, the classification probability is determined and object detection with bounding boxes includes both class probabilities and localization; Pg. 172, left-hand column: Fig. 8 shows a computation of a sliding window. For object detection, the specified hardware detects the presence of the object in a bounding box at various positions and scales in the image… To actually classify whether the object exists or not in the window, one can use any of the thousands of classification methods proposed. In this paper, we use BCNN to provide higher accuracy classification).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include providing bounding boxes around the objects and determining a probability that an object exists in the bounding boxes as taught by Nakahara into the combined method for obtaining a 3D pose of objects of Wong, Kalra, and Wan to improve object detection and recognition accuracy (Nakahara, Pg. 172).
The combination of Wong, Kalra, Wan, and Nakahara does not expressly disclose the following limitations: removes pixels from each bounding box that are not associated with an object and identifies a center pixel of each object in the bounding boxes.
However, Shellshear teaches, removes pixels from each bounding box that are not associated with an object (Para. 0104: remove the effects of pixels inside the initial bounding box that do not correspond to the object) and identifies a center pixel of each object in the bounding boxes (Para. 0016: the bounding box centroid is modified; Para. 0159: An average bounding box is then created and used as the bC(AVERAGE) track prediction for the current frame. The determination of track velocity is done by making use of the Track Position List to estimate centroid movement of the object over time; Note: the centroid is the geometric center of the object and is tracked in the bounding box over time).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include removing pixels from the bounding box that are not associated with the object and identifying a center of the object in the bounding box as taught by Shellshear into the combined method for obtaining a 3D pose of objects of Wong, Kalra, Wan, and Nakahara to improve object tracking (Shellshear, Para. 0014).
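For illustration only, the two operations discussed above (discarding bounding-box pixels that do not belong to the object, then locating the object's center pixel) can be sketched as follows. This is a minimal sketch under stated assumptions and is not the method of Shellshear or of the claimed invention: the function name `object_center_in_box`, the 0/1 mask representation, and the (top, left, bottom, right) box layout are all hypothetical.

```python
def object_center_in_box(mask, box):
    """Return the (row, col) center pixel of the object inside a bounding box.

    mask: 2D list of 0/1 flags, 1 where a pixel belongs to the object
          (hypothetical representation, assumed for this sketch).
    box:  (top, left, bottom, right), with bottom and right exclusive.
    """
    top, left, bottom, right = box
    row_sum = col_sum = count = 0
    for r in range(top, bottom):
        for c in range(left, right):
            # Pixels inside the box that are not part of the object are skipped,
            # so they have no effect on the computed center.
            if mask[r][c]:
                row_sum += r
                col_sum += c
                count += 1
    if count == 0:
        return None  # no object pixels inside this box
    # Integer mean of the remaining pixel coordinates (the centroid pixel).
    return (row_sum // count, col_sum // count)


# A 5x5 mask containing a 3x3 object, inside a box covering the whole image.
example_mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
example_center = object_center_in_box(example_mask, (0, 0, 5, 5))
```

In this sketch the center is the integer mean of the object pixels, which is one common way to define a centroid; other definitions (for example, the bounding-box midpoint) would give different results for asymmetric objects.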

Claims 7-8, 15 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over “SegICP: Integrated Deep Semantic Segmentation and Pose Estimation” by Wong et al. (hereinafter “Wong”) in view of “Deep Polarization Cues for Transparent Object Segmentation” by Kalra et al. (hereinafter “Kalra”) and further in view of “Patch-based 3D Human Pose Refinement” by Wan et al. (hereinafter “Wan”) and Kim et al. (US 2020/0247321 A1, hereinafter “Kim”).
Regarding claim 7, the combination of Wong, Kalra, and Wan teaches the limitations as explained above in claim 1.
Wong in the combination further teaches, wherein estimating the 3D pose of each object includes using a neural network for extracting the features (As seen in Fig. 2, a 2D RGB image is captured by a camera and the 3D pose is estimated after segmentation of objects and their features with a convolutional neural network; Pg. 3, left-hand column: the resulting segmentation is used to extract each object’s 3D point cloud from the scene cloud),
providing a feature point image (As shown in Fig. 2, step 4, each object in the image has a point cloud (i.e. feature point image)),
and estimating the 3D pose of the object using the feature point image (As shown in Fig. 2, the pose is estimated for each object based on point cloud (i.e. feature) alignment to get the best estimated object pose).
The combination of Wong, Kalra, and Wan does not expressly disclose the following limitations: generating a heatmap for each of the extracted features that identify a probability of a location of a feature point on the object, that combines the feature points from the heatmaps and the 2D image.
However, Kim teaches, generating a heatmap for each of the extracted features that identify a probability of a location of a feature point on the object (Para. 0015: (i) generate each of one or more feature tensors by extracting one or more features from each of the upper body image and the lower body image via a feature extractor, (ii) generate each of one or more keypoint heatmaps and one or more part affinity fields corresponding to each of the feature tensors via a keypoint heatmap & part affinity field extractor; Para. 0058: each of the part affinity fields may be a map showing connections of a specific keypoint with other keypoints, and may be a map representing each of mutual connection probabilities of each of the keypoints in each of keypoint heatmap pairs. And, a meaning of the "heatmap" may represent a combination of heat and a map, which may graphically show various information that can be expressed by colors as heat-like distribution on an image; Note: the colors of a heatmap represent the probability of a location), 
that combines the feature points from the heatmaps and the 2D image (Para. 0015: (iii) extract one or more keypoints from each of the keypoint heatmaps and group each of the extracted keypoints by referring to each of the part affinity fields, and thus generate the body keypoints corresponding to the driver, via a keypoint grouping layer; As shown in Fig. 3, the heatmaps and part affinity fields are combined into a result image).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include generating a heatmap for the extracted features that identify the probability of a location of a feature point and combining the feature points from the heatmaps and the 2D image into a feature point image as taught by Kim into the combined method for obtaining a 3D pose of objects of Wong, Kalra, and Wan to determine the highest points on each of the key point heatmaps during pose estimation (Kim, Para. 0018). 
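For illustration only, the general idea of reading a feature point location out of a per-feature heatmap can be sketched as below. This is a simplified sketch under stated assumptions, not the keypoint extractor of Kim or the claimed method: each heatmap is assumed to be a 2D list of probability-like scores, the feature point is taken to be the highest-scoring cell, and the names `heatmap_peak` and `feature_points` are hypothetical.

```python
def heatmap_peak(heatmap):
    """Return ((row, col), score) of the highest-scoring cell in one heatmap."""
    best_pos = (0, 0)
    best_val = heatmap[0][0]
    for r, row in enumerate(heatmap):
        for c, val in enumerate(row):
            if val > best_val:
                best_pos = (r, c)
                best_val = val
    return best_pos, best_val


def feature_points(heatmaps):
    """Collect one (row, col) feature point per heatmap, in heatmap order."""
    return [heatmap_peak(h)[0] for h in heatmaps]


# Two small heatmaps, one per extracted feature (values are made up).
example_heatmaps = [
    [[0.1, 0.2],
     [0.9, 0.3]],
    [[0.0, 0.7],
     [0.1, 0.2]],
]
example_points = feature_points(example_heatmaps)
```

The resulting list of pixel coordinates is the kind of data that could be overlaid on the 2D image to form a feature point image; how the points are actually combined with the image is left open in this sketch.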
Regarding claim 8, the combination of Wong, Kalra, Wan, and Kim teaches the limitations as explained above in claim 7.
Wong in the combination further teaches, wherein estimating the 3D pose of each object includes comparing the feature point image to a 3D virtual model of the object (As shown in Fig. 2, the pose is estimated for each object based on point cloud (i.e. feature) alignment to get the best estimated object pose; As shown in Fig. 3 and Pg. 3, the best fit score is the pose estimation from matching the 3D point clouds of the object to a 3D mesh model from the object model library).
Regarding claim 15, the combination of Wong, Kalra, and Wan teaches the limitations as explained above in claim 13.
Wong in the combination further teaches, wherein estimating the 3D pose of each object includes using a neural network for extracting the features (As seen in Fig. 2, a 2D RGB image is captured by a camera and the 3D pose is estimated after segmentation of objects and their features with a convolutional neural network; Pg. 3, left-hand column: the resulting segmentation is used to extract each object’s 3D point cloud from the scene cloud),
providing a feature point image (As shown in Fig. 2, step 4, each object in the image has a point cloud (i.e. feature point image)),
and estimating the 3D pose of the object using the feature point image by comparing the feature point image to a 3D virtual model of the object (As shown in Fig. 2, the pose is estimated for each object based on point cloud (i.e. feature) alignment to get the best estimated object pose; As shown in Fig. 3 and Pg. 3, the best fit score is the pose estimation from matching the 3D point clouds of the object to a 3D mesh model from the object model library).
The combination of Wong, Kalra, and Wan does not expressly disclose the following limitations: generating a heatmap for each of the extracted features that identify a probability of a location of a feature point on the object, that combines the feature points from the heatmaps and the 2D image.
However, Kim teaches, generating a heatmap for each of the extracted features that identify a probability of a location of a feature point on the object (Para. 0015: (i) generate each of one or more feature tensors by extracting one or more features from each of the upper body image and the lower body image via a feature extractor, (ii) generate each of one or more keypoint heatmaps and one or more part affinity fields corresponding to each of the feature tensors via a keypoint heatmap & part affinity field extractor; Para. 0058: each of the part affinity fields may be a map showing connections of a specific keypoint with other keypoints, and may be a map representing each of mutual connection probabilities of each of the keypoints in each of keypoint heatmap pairs. And, a meaning of the "heatmap" may represent a combination of heat and a map, which may graphically show various information that can be expressed by colors as heat-like distribution on an image; Note: the colors of a heatmap represent the probability of a location), 
that combines the feature points from the heatmaps and the 2D image (Para. 0015: (iii) extract one or more keypoints from each of the keypoint heatmaps and group each of the extracted keypoints by referring to each of the part affinity fields, and thus generate the body keypoints corresponding to the driver, via a keypoint grouping layer; As shown in Fig. 3, the heatmaps and part affinity fields are combined into a result image).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include generating a heatmap for the extracted features that identify the probability of a location of a feature point and combining the feature points from the heatmaps and the 2D image into a feature point image as taught by Kim into the combined method for obtaining a 3D pose of objects of Wong, Kalra, and Wan to determine the highest points on each of the key point heatmaps during pose estimation (Kim, Para. 0018). 
Regarding claim 20, the combination of Wong, Kalra, and Wan teaches the limitations as explained above in claim 18.
Wong in the combination further teaches, wherein the means for estimating the 3D pose of each object uses a neural network (As shown in Fig. 2, a 6-DOF pose (i.e., includes 3D pose) is estimated for each object; Pg. 3, left-hand column: RGB frames are first passed through a CNN which outputs a segmented mask with pixel-wise semantic object labels; Pg. 3, left-hand column: the resulting segmentation is used to extract each object’s 3D point cloud (i.e. features) from the scene cloud), 
provides a feature point image (As shown in Fig. 2, step 4, each object in the image has a point cloud (i.e. feature point image)),
and estimates the 3D pose of the object using the feature point image by comparing the feature point image to a 3D virtual model of the object (As shown in Fig. 2, the pose is estimated for each object based on point cloud (i.e. feature) alignment to get the best estimated object pose; As shown in Fig. 3 and Pg. 3, the best fit score is the pose estimation from matching the 3D point clouds of the object to a 3D mesh model from the object model library).
The combination of Wong, Kalra, and Wan does not expressly disclose the following limitations: generates a heatmap for each of the extracted features that identify a probability of a location of a feature point on the object, that combines the feature points from the heatmaps and the 2D image.
However, Kim teaches, generates a heatmap for each of the extracted features that identify a probability of a location of a feature point on the object (Para. 0015: (i) generate each of one or more feature tensors by extracting one or more features from each of the upper body image and the lower body image via a feature extractor, (ii) generate each of one or more keypoint heatmaps and one or more part affinity fields corresponding to each of the feature tensors via a keypoint heatmap & part affinity field extractor; Para. 0058: each of the part affinity fields may be a map showing connections of a specific keypoint with other keypoints, and may be a map representing each of mutual connection probabilities of each of the keypoints in each of keypoint heatmap pairs. And, a meaning of the "heatmap" may represent a combination of heat and a map, which may graphically show various information that can be expressed by colors as heat-like distribution on an image; Note: the colors of a heatmap represent the probability of a location), 
that combines the feature points from the heatmaps and the 2D image (Para. 0015: (iii) extract one or more keypoints from each of the keypoint heatmaps and group each of the extracted keypoints by referring to each of the part affinity fields, and thus generate the body keypoints corresponding to the driver, via a keypoint grouping layer; As shown in Fig. 3, the heatmaps and part affinity fields are combined into a result image).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include generating a heatmap for the extracted features that identify the probability of a location of a feature point and combining the feature points from the heatmaps and the 2D image into a feature point image as taught by Kim into the combined method for obtaining a 3D pose of objects of Wong, Kalra, and Wan to determine the highest points on each of the key point heatmaps during pose estimation (Kim, Para. 0018). 

Claims 9, 16 and 21 are rejected under 35 U.S.C. 103 as being unpatentable over “SegICP: Integrated Deep Semantic Segmentation and Pose Estimation” by Wong et al. (hereinafter “Wong”) in view of “Deep Polarization Cues for Transparent Object Segmentation” by Kalra et al. (hereinafter “Kalra”) and further in view of “Patch-based 3D Human Pose Refinement” by Wan et al. (hereinafter “Wan”), Kim et al. (US 2020/0247321 A1, hereinafter “Kim”), and “Real-Time Seamless Single Shot 6D Object Pose Prediction” by Tekin et al. (hereinafter “Tekin”).
Regarding claim 9, the combination of Wong, Kalra, Wan, and Kim teaches the limitations as explained above in claim 8. 
The combination of Wong, Kalra, Wan, and Kim does not expressly disclose the following limitation: wherein estimating the 3D pose of each object includes using a perspective-n-point algorithm.
However, Tekin teaches, wherein estimating the 3D pose of each object includes using a perspective-n-point algorithm (Abstract: the object’s 6D pose is then estimated using a PNP algorithm; Pg. 2, right-hand column: Given the 2D coordinate predictions, we calculate the object’s 6D pose using a PnP algorithm; see Pgs. 4-5, section “3.3. Pose Prediction”; Note: 6D pose includes 3D pose).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include estimating the 3D pose using a PNP algorithm as taught by Tekin into the combined method for obtaining a 3D pose of objects of Wong, Kalra, Wan, and Kim to estimate the pose in augmented reality, virtual reality, and robotics applications (Tekin, Pg. 1).
Regarding claim 16, the combination of Wong, Kalra, Wan, and Kim teaches the limitations as explained above in claim 15.
The combination of Wong, Kalra, Wan, and Kim does not expressly disclose the following limitation: wherein estimating the 3D pose of each object includes using a perspective-n-point algorithm.
However, Tekin teaches, wherein estimating the 3D pose of each object includes using a perspective-n-point algorithm (Abstract: the object’s 6D pose is then estimated using a PNP algorithm; Pg. 2, right-hand column: Given the 2D coordinate predictions, we calculate the object’s 6D pose using a PnP algorithm; see Pgs. 4-5, section “3.3. Pose Prediction”; Note: 6D pose includes 3D pose).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include estimating the 3D pose using a PNP algorithm as taught by Tekin into the combined method for obtaining a 3D pose of objects of Wong, Kalra, Wan, and Kim to estimate the pose in augmented reality, virtual reality, and robotics applications (Tekin, Pg. 1).
Regarding claim 21, the combination of Wong, Kalra, Wan, and Kim teaches the limitations as explained above in claim 20.
The combination of Wong, Kalra, Wan, and Kim does not expressly disclose the following limitation: wherein the means for estimating the 3D pose of each object uses a perspective-n-point algorithm.
However, Tekin teaches, wherein the means for estimating the 3D pose of each object uses a perspective-n-point algorithm (Abstract: the object’s 6D pose is then estimated using a PNP algorithm; Pg. 2, right-hand column: Given the 2D coordinate predictions, we calculate the object’s 6D pose using a PnP algorithm; see Pgs. 4-5, section “3.3. Pose Prediction”; Note: 6D pose includes 3D pose).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include estimating the 3D pose using a PNP algorithm as taught by Tekin into the combined method for obtaining a 3D pose of objects of Wong, Kalra, Wan, and Kim to estimate the pose in augmented reality, virtual reality, and robotics applications (Tekin, Pg. 1).

WITHDRAWN REJECTIONS
The following grounds of rejection are not presented for review on appeal because they have been withdrawn by the examiner.  
Regarding claim 1, the 112(b) rejection for only the following limitations has been withdrawn: “where each cropped image includes one of the objects” in line 12 and “the object” in line 14.
Regarding claim 3, the 112(b) rejection for only the following limitation has been withdrawn: “the objects” in line 4. 
Regarding claim 10, the 112(b) rejection for only the following limitation has been withdrawn: “the objects” in line 1.
Regarding claim 12, the 112(b) rejection for only the following limitation has been withdrawn: “the objects” in line 2.
Regarding claim 13, the 112(b) rejection for only the following limitation has been withdrawn: “where each cropped image includes one of the objects” in line 13.
Regarding claim 14, the 112(b) rejection for only the following limitation has been withdrawn: “the objects” in line 5. 
Regarding claim 18, the 112(b) rejection for only the following limitation has been withdrawn: “where each cropped image includes one of the objects” in line 12.
Regarding claim 19, the 112(b) rejection for only the following limitation has been withdrawn: “the objects” in line 5. 

(2) Response to Argument
A.  Appellant argues claims 1-5 and 7-21 are not indefinite under 35 U.S.C. 112(b).
The examiner respectfully disagrees. Claims 1, 13, and 18 state that each object has the same label but also state that different objects have different labels, and these two statements contradict each other. It is unclear from the present claim language whether the Appellant is trying to claim that the same objects have the same label. Claim 1 also states that features on the objects from the 2D image are extracted. It is unclear from which 2D image the object features are being extracted, since claim 1 has three different 2D images (a 2D RGB color image, a 2D segmentation image, and a 2D cropped image).
Dependent claims 2-5, 7-12, 14-17, and 19-21 are rejected under section 112(b) for the same reasons as set forth above. 
Thus, the rejections are proper and maintained.

B. Appellant argues the cited references do not teach assigning a different label to pixels of different objects in the segmentation image including objects that have the same or similar shape in independent claims 1, 13, and 18.
The examiner respectfully disagrees. Appellant argues that Wong does not teach the above limitation; however, the examiner referred to Kalra in the final office action (see above) to teach this limitation. Kalra shows in Figs. 4 and 6 an instance segmentation mask in which similarly shaped objects, such as different spherical objects, have a different color. The different colors of the different spherical objects show each spherical object has a different label. The claim calls for different objects with a same or similar shape having different labels, not each object that has a same or similar shape being assigned different labels. Therefore, Kalra teaches the above limitation.
In response to appellant's argument that the references fail to show certain features of appellant’s invention, it is noted that the features upon which applicant relies (i.e., instance segmentation; only a single feature extraction network is required and there is no need for an attention neural network for channel fusion) are not recited in the rejected claim(s).  Although the claims are interpreted in light of the specification, limitations from the specification are not read into the claims.  See In re Van Geuns, 988 F.2d 1181, 26 USPQ2d 1057 (Fed. Cir. 1993).
The claim language does not limit interpretation of the claim to instance segmentation or a single feature extraction network that does not use an attention neural network for channel fusion. Additionally, Kalra does teach in Pg. 3 a single CNN is used to output a segmented mask to extract each object’s 3D point cloud (i.e. object features).
Thus, the rejections are proper and maintained.

C. Appellant argues the cited references do not teach estimating the 3D pose of each object in each cropped image that includes extracting a plurality of features on the object from the 2D image in independent claims 1, 13, and 18.
The examiner respectfully disagrees. Wong teaches in Pg. 3 that a segmentation mask is used to crop the point cloud to generate individual point clouds for each detected object and estimate the pose of the object. Pg. 3 of Wong also discusses extracting the object’s 3D point cloud and the RGB image is passed through the CNN to output a segmentation mask to generate the points for each object. Fig. 2 of Wong also shows a 2D RGB image is captured and the 3D pose of each object is estimated after segmentation of the objects. In Wong, the cropped image is interpreted as the cropped point cloud from segmentation and the points generated for each object are interpreted as the features of the object.
In response to appellant's argument that the references fail to show certain features of appellant’s invention, it is noted that the features upon which applicant relies (i.e., the 3D pose of transparent objects cannot be determined using point clouds) are not recited in the rejected claim(s).  Although the claims are interpreted in light of the specification, limitations from the specification are not read into the claims.  See In re Van Geuns, 988 F.2d 1181, 26 USPQ2d 1057 (Fed. Cir. 1993).
The claim language does not limit interpretation of the claim to exclude determining the 3D pose of transparent objects using point clouds. 

D. Appellant argues the cited references do not teach cropping a segmentation image so that the 3D pose of each object in the segmentation image is separately obtained.
The examiner respectfully disagrees. As previously stated in point C above, Wong teaches in Pg. 3 that a segmentation mask is used to crop the point cloud to generate individual point clouds for each detected object and estimate the pose of the object. In Wong, the cropped image is interpreted as the cropped point cloud from segmentation. Appellant argues that Kalra does not teach point D above; however, the examiner referred to Wong in the final office action (see above) to teach this point. The examiner believes point D is referring to the “separating the segmentation…from the 2D image” limitations in the claims, as the phrasing “cropping a segmentation image so that the 3D pose of each object in the segmentation image is separately obtained” is not expressly used in the claims.

E. Appellant argues the cited references Kalra and Wong cannot be combined in independent claims 1, 13, and 18.
Appellant’s arguments have been fully considered but they are not persuasive. Appellant states that it is not possible to combine the Kalra instance image segmentation process with the Wong semantic segmentation process. The examiner respectfully disagrees.
In response to applicant’s argument that there is no teaching, suggestion, or motivation to combine the references, the examiner recognizes that obviousness may be established by combining or modifying the teachings of the prior art to produce the claimed invention where there is some teaching, suggestion, or motivation to do so found either in the references themselves or in the knowledge generally available to one of ordinary skill in the art.  See In re Fine, 837 F.2d 1071, 5 USPQ2d 1596 (Fed. Cir. 1988), In re Jones, 958 F.2d 347, 21 USPQ2d 1941 (Fed. Cir. 1992), and KSR International Co. v. Teleflex, Inc., 550 U.S. 398, 82 USPQ2d 1385 (2007).  In this case, Kalra teaches in Pg. 8605 that attention-fusion improves transparent object segmentation. Therefore, it would have been obvious to one of ordinary skill in the art to use instance segmentation techniques as taught by Kalra to improve upon transparent object detection methods such as semantic segmentation as taught by Wong.

For the above reasons, it is believed that the rejections should be sustained.
Respectfully submitted,
/Daniella M. DiGuglielmo/Examiner, Art Unit 2664
8/10/2022

Conferees:

/NAY A MAUNG/Supervisory Patent Examiner, Art Unit 2664
/EDWARD F URBAN/Supervisory Patent Examiner, Art Unit 2665


Requirement to pay appeal forwarding fee.  In order to avoid dismissal of the instant appeal in any application or ex parte reexamination proceeding, 37 CFR 41.45 requires payment of an appeal forwarding fee within the time permitted by 37 CFR 41.45(a), unless appellant had timely paid the fee for filing a brief required by 37 CFR 41.20(b) in effect on March 18, 2013.