DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claims 1-27 are pending in the instant application.

Priority
The Examiner acknowledges Applicant’s claim to the priority benefit of Application No. 62/878,659, filed 07/25/2019.

Information Disclosure Statement
The information disclosure statement(s) (IDS) submitted on 07/24/2020, 05/26/2021, 07/22/202, 08/23/2021, 05/03/2022, and 08/15/2022 are in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement(s) is/are being considered if signed and initialed by the Examiner.

Claim Objections
Claim 22 is objected to because of the following informality: the acronym “DNN” is not defined in the claim.
Appropriate correction is required.

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103(a) which forms the basis for all obviousness rejections set forth in this Office action:
(a) A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103(a) are summarized as follows:
1.	Determining the scope and contents of the prior art.
2.	Ascertaining the differences between the prior art and the claims at issue.
3.	Resolving the level of ordinary skill in the pertinent art.
4.	Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1, 3-9, 13-18, and 20-22 are rejected under 35 U.S.C. 103(a) as being unpatentable over Banerjee et al. (USPGPub 2020/0301013) in view of Ryan (USPGPub 2019/0026588).
	As per claim 1, Banerjee discloses a method comprising:
	applying, to a neural network, data representative of one or more images of an environment (see at least paragraph 0054; wherein the image data and the encoded projected depth data (hybrid data) can be fed into one or more convolutional neural networks configured to detect and/or classify objects in the scene based on the image data and the encoded projected depth data);
	generating, using the neural network and based at least in part on the data:
		a first output representative of, for each pixel of a plurality of pixels, one or more classifications corresponding to one or more detected objects in the environment (see at least paragraph 0054; wherein the image data and the encoded projected depth data (hybrid data) can be fed into one or more convolutional neural networks configured to detect and/or classify objects in the scene based on the image data and the encoded projected depth data); and
		a second output representative of, for each pixel of the plurality of pixels, one or more values representative of an association between the pixel and one or more instances of the one or more detected objects (see at least paragraph 0009; wherein encoding the projected depth data comprises encoding respective depth values into three-channel color information to generate the encoded projected depth data. Such examples relate to so-called JET encoding which is a coloring scheme which converts the distance value J at each pixel i into three channels, for example, each with 8 bit values);
	generating, based at least in part on the second output, one or more bounding shapes corresponding to the one or more instances of the one or more detected objects (see at least paragraph 0007; wherein fused hybrid data are fed into a convolutional neural network to learn features and then fed through fully connected layers to detect and classify objects (class score) and predict respective bounding boxes for the objects);
	associating the one or more classifications with the one or more bounding shapes (see at least paragraph 0131; wherein RPN is a convolution layer whose output layer is connected to a classification layer which classifies the object and to a regression layer which predict the coordinates of the bounding box).
	Banerjee does not explicitly mention performing one or more operations by an autonomous machine based at least in part on the one or more classifications associated with the one or more bounding shapes.
	However, Ryan does disclose:
	performing one or more operations by an autonomous machine based at least in part on the one or more classifications associated with the one or more bounding shapes (see at least paragraph 0008; wherein generates, by the processor, a bounding box around the element; projects, by the processor, segments of the element onto the bounding box to obtain a depth image; and classifies the object by providing the depth image to a machine learning model and receiving a classification output that classifies the element as an object for assisting in control of the autonomous vehicle).
	Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to utilize the teachings of Ryan with the teachings of Banerjee. The motivation for doing so would have been to improve the AV system by increasing the accuracy of classifying an object sensed in a vehicle's surroundings, see Ryan paragraph 0004.
	As per claim 3, Banerjee discloses further comprising generating the data by stitching together, using an image stitching technique, image data from a plurality of images captured by a plurality of cameras of an ego-actor (see at least paragraph 0001; wherein object detection and/or classification based on fused sensor data of different sensors).
	As per claim 4, Banerjee discloses wherein the data representative of the one or more images corresponds to a multi-channel tensor with a first channel storing color data and a second channel storing range data (see at least paragraph 0020; wherein a vehicle, the vehicle comprising a LiDAR to capture a depth image data of the vehicle's environment, a camera to capture a color image data of the vehicle's environment, processing circuitry configured to generate a projection of the depth image data onto the color image data, and to encode the projection of the depth image data to three-channel information to generate an encoded projection of the depth image data, and one or more convolutional neural networks configured to detect or classify objects in the vehicle's environment based on the color image data and the encoded projection of the depth image data).
	As per claim 5, Banerjee discloses further comprising generating the multi-channel tensor by projecting at least one of a LiDAR point cloud or a RADAR point cloud into a range image, and storing a portion of data representative of the range image in the second channel of the multi-channel tensor (see at least paragraph 0020; wherein a vehicle, the vehicle comprising a LiDAR to capture a depth image data of the vehicle's environment, a camera to capture a color image data of the vehicle's environment, processing circuitry configured to generate a projection of the depth image data onto the color image data, and to encode the projection of the depth image data to three-channel information to generate an encoded projection of the depth image data, and one or more convolutional neural networks configured to detect or classify objects in the vehicle's environment based on the color image data and the encoded projection of the depth image data).
	As per claim 6, Banerjee discloses wherein: the neural network comprises a common trunk connected to a class confidence head and an instance regression head; the class confidence head includes a classification channel for each supported class; and the instance regression head includes a plurality of regression channels, each regression channel regressing at least one of: a location, geometry, or orientation corresponding to a bounding shape of the one or more bounding shapes (see at least paragraph 0131; wherein at each sliding window location, the network predict multiple region proposals which outputs a score and a bounding box per anchor. RPN is a convolution layer whose output layer is connected to a classification layer which classifies the object and to a regression layer which predict the coordinates of the bounding box. At each sliding window location, K regions are proposed, classification layer having 2k outputs (objects or not objects) and regression layer having 4k outputs (coordinates of the bounding box). The RPN is also translational invariant. An anchor with Intersection over Union (IoU) greater than 0.7 with any ground truth bounding box can be given an object label or positive label. All other anchors are given not an object label or negative label and anchors with IoU less than 0.3 with ground truth bounding box are given a negative label).
	As per claim 7, Banerjee discloses wherein: the neural network comprises a common trunk connected to a class confidence head and a second head; the class confidence head computes the first output representative of the one or more classifications (see at least paragraph 0054; wherein the image data and the encoded projected depth data (hybrid data) can be fed into one or more convolutional neural networks configured to detect and/or classify objects in the scene based on the image data and the encoded projected depth data);
	the second head computes the second output representing the one or more values (see at least paragraph 0009; wherein encoding the projected depth data comprises encoding respective depth values into three-channel color information to generate the encoded projected depth data. Such examples relate to so-called JET encoding which is a coloring scheme which converts the distance value J at each pixel i into three channels, for example, each with 8 bit values); and
	the method further comprises co-training the class confidence head and the second head together (see at least paragraph 0097; wherein machine learning techniques with neural networks where labelled data can be used for training and evaluating the neural network).
	As per claim 8, Banerjee discloses wherein the neural network further generates, based at least in part on the data and for each pixel of the plurality of pixels, a third output representative of a distance from the pixel to a corresponding object represented by the pixel (see at least paragraph 0009; wherein JET encoding which is a coloring scheme which converts the distance value J at each pixel i into three channels, for example, each with 8 bit values. This can be achieved by using linear interpolation. In other examples, encoding the projected depth data comprises encoding respective depth values into three channels comprising horizontal disparity, height above ground, and angle to gravity to generate the encoded projected depth data. These examples relate to so-called HHA encoding which converts the distance value J at each pixel i into three channels horizontal disparity, height above ground, and angle to gravity (HHA)).
	As per claim 9, Banerjee discloses wherein the neural network further generates, based at least in part on the data and using an instance clustering head with a channel for each of a plurality of instances to distinguish, a confidence map representing pixels that belong to a corresponding instance of the plurality of instances (see at least paragraph 0113; wherein the HHA features can be extracted from the up-sampled depth map. HHA encoding converts the distance value J at each pixel i into 3 channels horizontal disparity, height above ground, and angle to gravity (HHA) as h1, h2, and a. HHA encodes the properties like geocentric pose which will be harder for the neural networks to learn from the limited depth data).
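For illustration only, the JET encoding cited from Banerjee paragraph 0009 (a per-pixel distance value converted by linear interpolation into three 8-bit channels) might be sketched as follows. The piecewise-linear colormap, the function names `jet_encode` and `encode_depth_image`, and the range parameters `d_min` and `d_max` are assumptions made for this sketch; the cited passage does not specify them, and the sketch is not part of the record.

```python
def jet_encode(depth, d_min, d_max):
    """Map one depth value to an (R, G, B) triple of 8-bit values.

    Assumption: a common piecewise-linear approximation of the "jet"
    colormap, in which near values come out blue and far values red.
    """
    if d_max <= d_min:
        raise ValueError("d_max must be greater than d_min")
    # Normalize the depth to [0, 1] and clamp out-of-range values.
    x = (depth - d_min) / (d_max - d_min)
    x = min(1.0, max(0.0, x))

    def ramp(center):
        # Linear ramp peaking around `center` (in units of 4 * x), clamped to [0, 1].
        return min(1.0, max(0.0, 1.5 - abs(4.0 * x - center)))

    r, g, b = ramp(3.0), ramp(2.0), ramp(1.0)
    return tuple(int(round(255 * c)) for c in (r, g, b))


def encode_depth_image(depth_rows, d_min, d_max):
    """Apply jet_encode to every pixel of a depth image given as a list of rows."""
    return [[jet_encode(d, d_min, d_max) for d in row] for row in depth_rows]


if __name__ == "__main__":
    # Hypothetical 2x2 depth image, in arbitrary distance units.
    for row in encode_depth_image([[0.0, 25.0], [50.0, 100.0]], 0.0, 100.0):
        print(row)
```

In this sketch each pixel carries three channels derived only from its distance value, which is the property of the encoding on which the rejection relies.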
	As per claim 13, Banerjee discloses a method comprising:
	generating, using at least one camera of an ego-actor in an environment, image data representing a scene of the environment (see at least paragraph 0054; wherein the image data and the encoded projected depth data (hybrid data) can be fed into one or more convolutional neural networks configured to detect and/or classify objects in the scene based on the image data and the encoded projected depth data);
	performing panoptic segmentation by predicting, using a neural network and based at least in part on the image data, a first output representing a class segmentation of the scene and a second output representing one or more regressed values of one or more unique instances in the scene (see at least paragraph 0131; wherein at each sliding window location, the network predict multiple region proposals which outputs a score and a bounding box per anchor. RPN is a convolution layer whose output layer is connected to a classification layer which classifies the object and to a regression layer which predict the coordinates of the bounding box. At each sliding window location, K regions are proposed, classification layer having 2k outputs (objects or not objects) and regression layer having 4k outputs (coordinates of the bounding box). The RPN is also translational invariant. An anchor with Intersection over Union (IoU) greater than 0.7 with any ground truth bounding box can be given an object label or positive label. All other anchors are given not an object label or negative label and anchors with IoU less than 0.3 with ground truth bounding box are given a negative label);
	generating, based at least in part on the second output, one or more bounding shapes corresponding to the one or more unique instances (see at least paragraph 0007; wherein fused hybrid data are fed into a convolutional neural network to learn features and then fed through fully connected layers to detect and classify objects (class score) and predict respective bounding boxes for the objects);
	associating, based at least in part on the first output, classes with the one or more bounding shapes (see at least paragraph 0131; wherein RPN is a convolution layer whose output layer is connected to a classification layer which classifies the object and to a regression layer which predict the coordinates of the bounding box).
	Banerjee does not explicitly mention performing one or more operations by the ego-actor based at least in part on the one or more bounding shapes and the classes.
	However, Ryan does disclose:
	performing one or more operations by the ego-actor based at least in part on the one or more bounding shapes and the classes (see at least paragraph 0008; wherein generates, by the processor, a bounding box around the element; projects, by the processor, segments of the element onto the bounding box to obtain a depth image; and classifies the object by providing the depth image to a machine learning model and receiving a classification output that classifies the element as an object for assisting in control of the autonomous vehicle).
	Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to utilize the teachings of Ryan with the teachings of Banerjee. The motivation for doing so would have been to improve the AV system by increasing the accuracy of classifying an object sensed in a vehicle's surroundings, see Ryan paragraph 0004.
	As per claim 14, Banerjee discloses wherein: the neural network comprises a common trunk connected to a class confidence head and an instance regression head; the class confidence head includes a classification channel for each supported class; and the instance regression head includes a regression channel for each of the one or more regressed values, each regression channel regressing a particular type of location, geometry, or orientation corresponding to a unique instance of the one or more unique instances in the scene (see at least paragraph 0131; wherein at each sliding window location, the network predict multiple region proposals which outputs a score and a bounding box per anchor. RPN is a convolution layer whose output layer is connected to a classification layer which classifies the object and to a regression layer which predict the coordinates of the bounding box. At each sliding window location, K regions are proposed, classification layer having 2k outputs (objects or not objects) and regression layer having 4k outputs (coordinates of the bounding box). The RPN is also translational invariant. An anchor with Intersection over Union (IoU) greater than 0.7 with any ground truth bounding box can be given an object label or positive label. All other anchors are given not an object label or negative label and anchors with IoU less than 0.3 with ground truth bounding box are given a negative label).
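For illustration only, the anchor labelling described in the cited Banerjee paragraph 0131 (an object label when the Intersection over Union with any ground truth box exceeds 0.7, and a not-object label otherwise) might be sketched as follows. The corner-coordinate box format `(x1, y1, x2, y2)` and the names `iou` and `label_anchor` are assumptions made for this sketch and do not appear in the reference.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle, if any.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_anchor(anchor, ground_truth_boxes, positive_threshold=0.7):
    """Label an anchor following the thresholds quoted from paragraph 0131.

    Assumption: as in the quoted passage, every anchor that does not exceed
    the positive threshold receives the not-object label.
    """
    best = max((iou(anchor, gt) for gt in ground_truth_boxes), default=0.0)
    return "object" if best > positive_threshold else "not object"


if __name__ == "__main__":
    anchor = (0, 0, 10, 10)
    print(label_anchor(anchor, [(0, 0, 10, 9)]))    # large overlap
    print(label_anchor(anchor, [(5, 5, 15, 15)]))   # small overlap
```

The sketch covers only the labelling rule; the 2k classification outputs and 4k regression outputs per sliding-window location described in the same paragraph are not modelled.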
	As per claim 15, Banerjee discloses wherein: the neural network comprises a common trunk connected to a class confidence head and a second head; the class confidence head is configured to predict the class segmentation (see at least paragraph 0054; wherein the image data and the encoded projected depth data (hybrid data) can be fed into one or more convolutional neural networks configured to detect and/or classify objects in the scene based on the image data and the encoded projected depth data); and the method further comprises co-training the class confidence head and the second head together (see at least paragraph 0097; wherein machine learning techniques with neural networks where labelled data can be used for training and evaluating the neural network).
	As per claim 16, Banerjee discloses wherein the neural network comprises a depth head configured to predict a third output representing a distance from each pixel to a corresponding object represented by the pixel (see at least paragraph 0009; wherein JET encoding which is a coloring scheme which converts the distance value J at each pixel i into three channels, for example, each with 8 bit values. This can be achieved by using linear interpolation. In other examples, encoding the projected depth data comprises encoding respective depth values into three channels comprising horizontal disparity, height above ground, and angle to gravity to generate the encoded projected depth data. These examples relate to so-called HHA encoding which converts the distance value J at each pixel i into three channels horizontal disparity, height above ground, and angle to gravity (HHA)).
	As per claim 17, Banerjee discloses a method comprising:
	applying, to a neural network, data representing one or more images of an environment from a perspective of an image sensor of an ego-actor (see at least paragraph 0054; wherein the image data and the encoded projected depth data (hybrid data) can be fed into one or more convolutional neural networks configured to detect and/or classify objects in the scene based on the image data and the encoded projected depth data);
	generating, using the neural network and based at least in part on the data:
		a first output representing one or more first classifications of one or more detected objects in the scene into one or more supported classes (see at least paragraph 0054; wherein the image data and the encoded projected depth data (hybrid data) can be fed into one or more convolutional neural networks configured to detect and/or classify objects in the scene based on the image data and the encoded projected depth data); and
		a second output representing one or more second classifications of the one or more detected objects in the scene into one or more unique instances of a supported class of the one or more supported classes (see at least paragraph 0009; wherein encoding the projected depth data comprises encoding respective depth values into three-channel color information to generate the encoded projected depth data. Such examples relate to so-called JET encoding which is a coloring scheme which converts the distance value J at each pixel i into three channels, for example, each with 8 bit values);
	generating, based at least in part on the second output, at least one bounding shape corresponding to the one or more unique instances (see at least paragraph 0007; wherein fused hybrid data are fed into a convolutional neural network to learn features and then fed through fully connected layers to detect and classify objects (class score) and predict respective bounding boxes for the objects).
	Banerjee does not explicitly mention performing one or more operations by the ego-actor based at least in part on the at least one bounding shape.
	However, Ryan does disclose:
	performing one or more operations by the ego-actor based at least in part on the at least one bounding shape (see at least paragraph 0008; wherein generates, by the processor, a bounding box around the element; projects, by the processor, segments of the element onto the bounding box to obtain a depth image; and classifies the object by providing the depth image to a machine learning model and receiving a classification output that classifies the element as an object for assisting in control of the autonomous vehicle).
	Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to utilize the teachings of Ryan with the teachings of Banerjee. The motivation for doing so would have been to improve the AV system by increasing the accuracy of classifying an object sensed in a vehicle's surroundings, see Ryan paragraph 0004.
	As per claim 18, Banerjee discloses wherein the neural network comprises an instance clustering head with a channel for each of a plurality of unique instances to distinguish, and each channel is configured to predict a confidence map representing pixels that belong to a corresponding instance of the plurality of unique instances (see at least paragraph 0113; wherein the HHA features can be extracted from the up-sampled depth map. HHA encoding converts the distance value J at each pixel i into 3 channels horizontal disparity, height above ground, and angle to gravity (HHA) as h1, h2, and a. HHA encodes the properties like geocentric pose which will be harder for the neural networks to learn from the limited depth data).
	As per claim 20, Banerjee discloses wherein: the neural network comprises a common trunk connected to a class confidence head and a second head; the class confidence head computes the first output representing the one or more first classifications (see at least paragraph 0054; wherein the image data and the encoded projected depth data (hybrid data) can be fed into one or more convolutional neural networks configured to detect and/or classify objects in the scene based on the image data and the encoded projected depth data); and the method further comprises co-training the class confidence head and the second head together (see at least paragraph 0097; wherein machine learning techniques with neural networks where labelled data can be used for training and evaluating the neural network).
	As per claim 21, Banerjee discloses a method of operating a vehicle, the vehicle including a sensor for providing sensor data depicting static elements and moving objects in an environment in which the vehicle is operated, the vehicle further including computing hardware and/or software implementing a neural network (see at least paragraph 0054; wherein the image data and the encoded projected depth data (hybrid data) can be fed into one or more convolutional neural networks configured to detect and/or classify objects in the scene based on the image data and the encoded projected depth data), the method comprising the neural network performing at least these steps:
	receiving the sensor data from the sensor (see at least paragraph 0054; wherein the image data and the encoded projected depth data (hybrid data) can be fed into one or more convolutional neural networks configured to detect and/or classify objects in the scene based on the image data and the encoded projected depth data);
	classifying the static elements according to the sensor data (see at least paragraph 0054; wherein the image data and the encoded projected depth data (hybrid data) can be fed into one or more convolutional neural networks configured to detect and/or classify objects in the scene based on the image data and the encoded projected depth data);
	detecting the moving objects according to the sensor data (see at least abstract; wherein object detection in a scene is based on lidar data and radar data of the scene).
	Banerjee does not explicitly mention maneuvering the vehicle in response to the classified static elements and the detected moving objects.
	However, Ryan does disclose:
	maneuvering the vehicle in response to the classified static elements and the detected moving objects (see at least paragraph 0008; wherein generates, by the processor, a bounding box around the element; projects, by the processor, segments of the element onto the bounding box to obtain a depth image; and classifies the object by providing the depth image to a machine learning model and receiving a classification output that classifies the element as an object for assisting in control of the autonomous vehicle).
	Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to utilize the teachings of Ryan with the teachings of Banerjee. The motivation for doing so would have been to improve the AV system by increasing the accuracy of classifying an object sensed in a vehicle's surroundings, see Ryan paragraph 0004.
	As per claim 22, Banerjee discloses further comprising the steps of: operating the neural network as a DNN (see at least paragraph 0153; wherein deep neural networks).

Claim 2 is rejected under 35 U.S.C. 103(a) as being unpatentable over Banerjee et al. (USPGPub 2020/0301013), in view of Ryan (USPGPub 2019/0026588), and further in view of Douillard et al. (USPGPub 2020/0193606).
	As per claim 2, Banerjee and Ryan do not explicitly mention wherein the first output representative of the one or more classifications includes a plurality of confidence maps corresponding to a plurality of supported classes of the neural network, the plurality of supported classes including a navigable space and animate objects, wherein the method further comprises generating a segmentation mask demarcating at least the navigable space and the animate objects.
	However, Douillard does disclose:
	wherein the first output representative of the one or more classifications includes a plurality of confidence maps corresponding to a plurality of supported classes of the neural network, the plurality of supported classes including a navigable space and animate objects, wherein the method further comprises generating a segmentation mask demarcating at least the navigable space and the animate objects (see at least paragraph 0058; wherein the classification module 316 may classify one or more objects, including but not limited to cars, buildings, pedestrians, bicycles, trees, free space, occupied space, street signs, lane markings, etc.).
	Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to utilize the teachings of Douillard with the teachings of Banerjee and Ryan. The motivation for doing so would have been to improve a functioning of a computing device by converting data into one or more formats that improves performance of segmentation and/or classification of objects represented in the data, see Douillard paragraph 0031.

Claims 10, 19, and 23-27 are rejected under 35 U.S.C. 103(a) as being unpatentable over Banerjee et al. (USPGPub 2020/0301013), in view of Ryan (USPGPub 2019/0026588), and further in view of McCormac et al. (USPGPub 2021/0166426).
	As per claim 10, Banerjee discloses wherein: the neural network comprises an instance clustering head computes the second output (see at least paragraph 0113; wherein the HHA features can be extracted from the up-sampled depth map. HHA encoding converts the distance value J at each pixel i into 3 channels horizontal disparity, height above ground, and angle to gravity (HHA) as h1, h2, and a. HHA encodes the properties like geocentric pose which will be harder for the neural networks to learn from the limited depth data);
	the second output corresponds to one or more confidence maps representative of pixels of the plurality of pixels that belong to locally unique instances (see at least paragraph 0113; wherein the HHA features can be extracted from the up-sampled depth map. HHA encoding converts the distance value J at each pixel i into 3 channels horizontal disparity, height above ground, and angle to gravity (HHA) as h1, h2, and a. HHA encodes the properties like geocentric pose which will be harder for the neural networks to learn from the limited depth data).
	Banerjee and Ryan do not explicitly mention the method further comprises: performing a connected-component analysis of the one or more confidence maps to detect a set of the locally unique instances for each of a plurality of clusters of instances; and determining globally unique instances from the set of locally unique instances for each of the plurality of clusters of instances.
	However, McCormac does disclose:
	the method further comprises: performing a connected-component analysis of the one or more confidence maps to detect a set of the locally unique instances for each of a plurality of clusters of instances (see at least paragraph 0103; wherein the object recognition pipeline may be trained on labelled image data. In a second operation 720, the mask output of the object recognition pipeline is fused with depth data associated with the frames of video data to generate a map of object instances. The map of object instances may comprise a set of 3D object volumes for respective objects detected within the environment. These 3D object volumes may comprise volume elements (e.g. voxels) that have associated surface-distance metric values, such as TSDF values. An object pose estimate may be defined for each object instance that indicates how the 3D object volume may be mapped to a model space for the environment, e.g. from a local coordinate system for the object (an “object frame”) to a global coordinate system for the environment (a “world frame”)); and
	determining globally unique instances from the set of locally unique instances for each of the plurality of clusters of instances (see at least paragraph 0103; wherein the object recognition pipeline may be trained on labelled image data. In a second operation 720, the mask output of the object recognition pipeline is fused with depth data associated with the frames of video data to generate a map of object instances. The map of object instances may comprise a set of 3D object volumes for respective objects detected within the environment. These 3D object volumes may comprise volume elements (e.g. voxels) that have associated surface-distance metric values, such as TSDF values. An object pose estimate may be defined for each object instance that indicates how the 3D object volume may be mapped to a model space for the environment, e.g. from a local coordinate system for the object (an “object frame”) to a global coordinate system for the environment (a “world frame”)).
	Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to utilize the teachings of McCormac with the teachings of Banerjee and Ryan. The motivation for doing so would have been to generate a map of object instances that may be used by a robotic device to navigate and/or interact with its environment, see McCormac paragraph 0002.
	As per claim 19, Banerjee discloses wherein: the neural network comprises an instance clustering head computes the second output (see at least paragraph 0113; wherein the HHA features can be extracted from the up-sampled depth map. HHA encoding converts the distance value J at each pixel i into 3 channels horizontal disparity, height above ground, and angle to gravity (HHA) as h1, h2, and a. HHA encodes the properties like geocentric pose which will be harder for the neural networks to learn from the limited depth data);
	the second output corresponds to one or more confidence maps corresponding to the one or more second classifications and representing pixels that belong to locally unique instances (see at least paragraph 0113; wherein the HHA features can be extracted from the up-sampled depth map. HHA encoding converts the distance value J at each pixel i into 3 channels horizontal disparity, height above ground, and angle to gravity (HHA) as h1, h2, and a. HHA encodes the properties like geocentric pose which will be harder for the neural networks to learn from the limited depth data).
Banerjee and Ryan do not explicitly mention the method further comprises: performing a connected-component analysis of the one or more confidence maps to detect a set of the locally unique instances for each of a plurality of clusters of instances; and determining globally unique instances from the set of locally unique instances for each of the plurality of clusters of instances.	However McCormac does disclose:	the method further comprises: performing a connected-component analysis of the one or more confidence maps to detect a set of the locally unique instances for each of a plurality of clusters of instances (see at least paragraph 0103; wherein the object recognition pipeline may be trained on labelled image data. In a second operation 720, the mask output of the object recognition pipeline is fused with depth data associated with the frames of video data to generate a map of object instances. The map of object instances may comprise a set of 3D object volumes for respective objects detected within the environment. These 3D object volumes may comprise volume elements (e.g. voxels) that have associated surface-distance metric values, such as TSDF values. An object pose estimate may be defined for each object instance that indicates how the 3D object volume may be mapped to a model space for the environment, e.g. from a local coordinate system for the object (an “object frame”) to a global coordinate system for the environment (a “world frame”)); and 		determining globally unique instances from the set of locally unique instances for each of the plurality of clusters of instances (see at least paragraph 0103; wherein the object recognition pipeline may be trained on labelled image data. In a second operation 720, the mask output of the object recognition pipeline is fused with depth data associated with the frames of video data to generate a map of object instances. 
The map of object instances may comprise a set of 3D object volumes for respective objects detected within the environment. These 3D object volumes may comprise volume elements (e.g. voxels) that have associated surface-distance metric values, such as TSDF values. An object pose estimate may be defined for each object instance that indicates how the 3D object volume may be mapped to a model space for the environment, e.g. from a local coordinate system for the object (an “object frame”) to a global coordinate system for the environment (a “world frame”)). 	Therefore it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to utilize the teachings as in McCormac with the teachings as in Banerjee and Ryan. The motivation for doing so would have been to generate a map of object instances that may be used by a robotic device to navigate and/or interact with its environment, see McCormac paragraph 0002.	As per claim 23, Banerjee and Ryan do not explicitly mention wherein the neural network includes a plurality of output heads, the method further comprising: outputting from a first output head a class confidence map representing, for each pixel in the image data, a predicted confidence that the pixel belongs to a particular class.	However McCormac does disclose:	wherein the neural network includes a plurality of output heads, the method further comprising: outputting from a first output head a class confidence map representing, for each pixel in the image data, a predicted confidence that the pixel belongs to a particular class (see at least paragraph 0077; wherein this may be used to determine a class of a detected object. In some cases, a probability or confidence of an object being associated with a particular semantic class is compared against a threshold (such as a 50% confidence level) before accepting that an object has indeed been detected. 
A bounding box for the detected object may also be output (e.g. a definition of a 2D rectangle in image space), indicating an area that contains the detected object. In such cases, the mask output may be calculated within the bounding box).  	Therefore it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to utilize the teachings as in McCormac with the teachings as in Banerjee and Ryan. The motivation for doing so would have been to generate a map of object instances that may be used by a robotic device to navigate and/or interact with its environment, see McCormac paragraph 0002.	As per claim 24, Banerjee discloses further comprising: outputting from a second output head an instance regression prediction representative of each separate instance of an identified object (see at least paragraph 0131; wherein at each sliding window location, the network predict multiple region proposals which outputs a score and a bounding box per anchor. RPN is a convolution layer whose output layer is connected to a classification layer which classifies the object and to a regression layer which predict the coordinates of the bounding box. At each sliding window location, K regions are proposed, classification layer having 2k outputs (objects or not objects) and regression layer having 4k outputs (coordinates of the bounding box). The RPN is also translational invariant. An anchor with Intersection over Union (IoU) greater than 0.7 with any ground truth bounding box can be given an object label or positive label. All other anchors are given not an object label or negative label and anchors with IoU less than 0.3 with ground truth bounding box are given a negative label).  	
As per claim 25, Banerjee discloses further comprising: performing the instance regression prediction relative to vertices of detected instances (see at least paragraph 0131; wherein at each sliding window location, the network predict multiple region proposals which outputs a score and a bounding box per anchor. RPN is a convolution layer whose output layer is connected to a classification layer which classifies the object and to a regression layer which predict the coordinates of the bounding box. At each sliding window location, K regions are proposed, classification layer having 2k outputs (objects or not objects) and regression layer having 4k outputs (coordinates of the bounding box). The RPN is also translational invariant. An anchor with Intersection over Union (IoU) greater than 0.7 with any ground truth bounding box can be given an object label or positive label. All other anchors are given not an object label or negative label and anchors with IoU less than 0.3 with ground truth bounding box are given a negative label).  	As per claim 26, Banerjee discloses further comprising: outputting from a third output head an object confidence map representing, for each pixel in the image data, a predicted confidence that the pixel belongs to a given instance of an identified object (see at least paragraph 0131; wherein at each sliding window location, the network predict multiple region proposals which outputs a score and a bounding box per anchor. RPN is a convolution layer whose output layer is connected to a classification layer which classifies the object and to a regression layer which predict the coordinates of the bounding box. At each sliding window location, K regions are proposed, classification layer having 2k outputs (objects or not objects) and regression layer having 4k outputs (coordinates of the bounding box). The RPN is also translational invariant. 
An anchor with Intersection over Union (IoU) greater than 0.7 with any ground truth bounding box can be given an object label or positive label. All other anchors are given not an object label or negative label and anchors with IoU less than 0.3 with ground truth bounding box are given a negative label).  	As per claim 27, Banerjee discloses further comprising: post-processing the outputted class confidence map and instance regression prediction to identify bounding shapes and class labels for the environment (see at least paragraph 0097; wherein use the fused RGBD information from the camera and LiDAR sensors (D being the depth information for each pixel in the camera image) for object detection. One approach for object detection is to use machine learning techniques with neural networks where labelled data can be used for training and evaluating the neural network. Labelled data in the context of object recognition is manually labelling the bounding boxes for each object of interest in an image and assigning a class label for each object of interest. Data labelling manually is very expensive and time consuming process as each image needs to be augmented with bounding box and class label information).

Claims 11-12 are rejected under 35 U.S.C. 103(a) as being unpatentable over Banerjee et al. (USPGPub 2020/0301013), in view of Ryan (USPGPub 2019/0026588), and further in view of Shen et al. (USPGPub 2020/0175326).
As per claim 11, Banerjee discloses wherein: the first output of the neural network corresponds to a first tensor storing one or more confidence maps representing the one or more classifications corresponding to the one or more detected objects (see at least paragraph 0113; wherein the HHA features can be extracted from the up-sampled depth map. HHA encoding converts the distance value J at each pixel i into 3 channels horizontal disparity, height above ground, and angle to gravity (HHA) as hi, h.sub.2, and a. HHA encodes the properties like geocentric pose which will be harder for the neural networks to learn from the limited depth data);
wherein the second output of the neural network corresponds to a second tensor with one or more channels, each of the one or more channels regressing a location, geometry, or orientation corresponding to a bounding shape of the one or more bounding shapes (see at least paragraph 0113; wherein the HHA features can be extracted from the up-sampled depth map. HHA encoding converts the distance value J at each pixel i into 3 channels horizontal disparity, height above ground, and angle to gravity (HHA) as hi, h.sub.2, and a. HHA encodes the properties like geocentric pose which will be harder for the neural networks to learn from the limited depth data).
Banerjee and Ryan do not explicitly mention the generating the one or more bounding shapes is based at least in part on the first tensor and the second tensor.
However, Shen does disclose:
the generating the one or more bounding shapes is based at least in part on the first tensor and the second tensor (see at least paragraph 0039; wherein the output may include a set of labeled boxes (windows) surrounding objects detected in each image (e.g., optionally labeled with the object class, object pose, or other object parameter)).
Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to utilize the teachings as in Shen with the teachings as in Banerjee and Ryan. The motivation for doing so would have been to enhance object detection from a vehicle, see Shen paragraph 0002.
As per claim 12, Banerjee and Ryan do not explicitly mention wherein: the first output of the neural network corresponds to a first tensor storing one or more confidence maps representing the one or more classifications corresponding to the one or more detected objects; the second output of the neural network corresponds to a second tensor with one or more channels, each of the one or more channels regressing at least one of: a location, geometry, or orientation corresponding to a bounding shape of the one or more bounding shapes; and the generating the one or more bounding shapes includes: generating candidate bounding shapes based at least in part on the first tensor and the second tensor; and removing duplicate candidates from the candidate bounding shapes by performing at least one of filtering or clustering of the candidate bounding shapes.
However, Shen does disclose:
wherein: the first output of the neural network corresponds to a first tensor storing one or more confidence maps representing the one or more classifications corresponding to the one or more detected objects (see at least paragraph 0039; wherein the output may include a set of labeled boxes (windows) surrounding objects detected in each image (e.g., optionally labeled with the object class, object pose, or other object parameter));
the second output of the neural network corresponds to a second tensor with one or more channels, each of the one or more channels regressing at least one of: a location, geometry, or orientation corresponding to a bounding shape of the one or more bounding shapes (see at least paragraph 0039; wherein the output may include a set of labeled boxes (windows) surrounding objects detected in each image (e.g., optionally labeled with the object class, object pose, or other object parameter)); and the generating the one or more bounding shapes includes:
generating candidate bounding shapes based at least in part on the first tensor and the second tensor (see at least paragraph 0039; wherein the output may include a set of labeled boxes (windows) surrounding objects detected in each image (e.g., optionally labeled with the object class, object pose, or other object parameter)); and
removing duplicate candidates from the candidate bounding shapes by performing at least one of filtering or clustering of the candidate bounding shapes (see at least paragraph 0045; wherein since the first bounding box may, as an example, more closely adhere to a contour of the vehicle, the second bounding box may be removed as a duplicate).
Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to utilize the teachings as in Shen with the teachings as in Banerjee and Ryan. The motivation for doing so would have been to enhance object detection from a vehicle, see Shen paragraph 0002.

Relevant Art
The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure:
USPGPub 2021/0082181 – Provides an object detection technology, in particular a method for object detection, an intelligent driving method, an apparatus for object detection, an electronic device, and a computer storage medium.
USPGPub 2020/0013219 – Provides a method that enables removing unfilled space across segmentation masks in a convenient manner, and cleaning high confidence image mask and low confidence image mask.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MAHMOUD S ISMAIL whose telephone number is (571) 272-1326. The examiner can normally be reached M-F, 9:00 AM-5:00 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jelani Smith, can be reached at 571-270-3969. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/MAHMOUD S ISMAIL/
Primary Examiner, Art Unit 3662