DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
The information disclosure statements (IDS) submitted on 4/20/2020, 6/30/2021 and 8/13/2021 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-3, 5-15 and 17-20 are rejected under 35 U.S.C. 103 as being unpatentable over US Patent Application Publication No. 20200082219 (Li et al.) in view of US Patent Application Publication No. 20200050887 (Gautam et al.).
	Regarding claim 1, Li et al. discloses: “a method for image processing ([0072]: “FIG. 5 illustrates a method 500 for panoptic segmentation”), comprising: identifying a target frame (FIG. 3: 306) and a reference frame (FIG. 3: 304; [0064]: “The multi-stream network 310 receives a data stream from the first sensor 306 and/or the second sensor 304”) from a video ([0064]: “The data stream may include multiple frames, such as image frames”); and generating panoptic segmentation information for the target frame based on the feature matrix” (ABSTRACT: “generating an instance map and a semantic map from the input. The method further includes generating the panoptic map from the instance map and the semantic map based on a binary mask”; [0025]: “In conventional panoptic segmentation networks, various approaches are used to obtain and combine (e.g., fuse) information from the instance map with information from the segmentation map”).  
	However, Li et al. does not clearly disclose the remaining limitations of claim 1. To that end, Gautam et al. discloses: “generating target features (FIG. 6B: 604; [0109]: “Feature extraction unit 604 may apply various filters, such as convolutional, residual, and pooling layers, as some examples, to image 602 to extract an initial set of feature maps”) for the target frame (FIG. 6B: 602; [0109]: “the feature extraction unit 604 takes input 602 as input”) and reference features (FIG. 6B: 624; [0115]: “feature extraction units 604 and 624”) for the reference frame (FIG. 6B: 622; [0114]: “input respective captured images 602 and 622, which may be from different perspectives”); combining the target features (FIG. 6B: 604) and the reference features (FIG. 6B: 624) to produce fused features (FIG. 6B: 616; [0115]: “fusion layer 616 after first feature extraction units 604 and 624”; [0116]: “fusion layers 616 and 618 may take a first feature map from a first perspective and combine the feature map from the first perspective with a summarized value of a second feature map from a second perspective. Further, fusion layers 616 and 618 may take the second feature map from the second perspective and combine the feature map from the second perspective with a summarized value of the feature map from the first perspective”) for the target frame (FIG. 6B: 602; [0109]: “the feature extraction unit 604 takes input 602 as input”); generating a feature matrix (FIG. 6A: 606, 608; FIG. 6B: 606, 608; [0110]: “ROI pooling layer 608 may use a function such as a max pooling function to convert the features inside any valid region of interest into a small feature map with a fixed spatial extent. After generating these small ROI feature maps, ROI pooling layer 608 may input the ROI feature maps into one or more fully-connected layers (not pictured) to generate an ROI vector”; FIG. 6A: 610; FIG. 6B: 610; [0111]: “Finally, the outputs of the fully-connected layers (e.g., the ROI vector, etc.) are inputted into second stage feature extraction layer 610, which generates two outputs: (1) class prediction scores, and (2) refined bounding boxes 614”) comprising a correspondence between objects ([0130]: “perform object detection on the image from the first perspective and the second image from the second perspective by cross-referencing data related to the first image and the second image between the first and second perspectives with the object detector”) from the reference features (FIG. 6B: 622, 624, 626, 628; [0114]: “First and second two-stage object detectors 660 and 680 may each input respective captured images 602 and 622, which may be from different perspectives”) and objects from the fused features” (FIG. 6B: 602, 606, 616, 610, 626, 630; FIG. 9: 906). It is respectfully submitted that it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Li et al. with the invention of Gautam et al. in order to generate feature information from input images/frames and to further fuse and cross-reference feature information to improve detection confidence and accuracy (e.g., see Gautam et al. at [0095]).
	Regarding claim 13, Li et al. discloses: “training an artificial neural network (ANN) for video segmentation (FIG. 2: 208, 210, 216; [0052]: “The panoptic segmentation network may be an artificial neural network”), comprising: identifying a target frame (FIG. 3: 306) and a reference frame (FIG. 3: 304; [0064]: “The multi-stream network 310 receives a data stream from the first sensor 306 and/or the second sensor 304”) for each of the plurality of video clips ([0064]: “The data stream may include multiple frames, such as image frames”); and generating predicted panoptic segmentation information for the target frame based on the feature matrix (ABSTRACT: “generating an instance map and a semantic map from the input. The method further includes generating the panoptic map from the instance map and the semantic map based on a binary mask”; [0025]: “In conventional panoptic segmentation networks, various approaches are used to obtain and combine (e.g., fuse) information from the instance map with information from the segmentation map”); identifying a training set (FIG. 5: 500; [0072]: “a panoptic segmentation network is trained to generate a binary mask based on a training input labeled with object instances”) comprising a plurality of video clips and original panoptic segmentation information for each of the plurality of video clips (FIG. 5: 502; [0073]: “training, at block 502, the panoptic segmentation network receives an input from one or more sensors of a vehicle. For example, the input is an RGB image. The sensors may include a RGB camera, an RGB-D camera, LIDAR, RADAR, and the like”); and comparing the predicted panoptic segmentation information to the original panoptic segmentation information ([0072]: “A binary mask generated by the fusion model may be compared against a ground truth mask based on labeled object instances”; FIG. 5: 510, 512; [0076]: “At block 510, the panoptic segmentation network generates the binary mask based on the input, the instance map, and the semantic map”); and updating the ANN based on the comparison” ([0072]: “The panoptic segmentation network may be an artificial neural network as discussed herein”).
	In addition, Gautam et al. discloses: “generating target features (FIG. 6B: 604; [0109]: “Feature extraction unit 604 may apply various filters, such as convolutional, residual, and pooling layers, as some examples, to image 602 to extract an initial set of feature maps”) for the target frame (FIG. 6B: 602; [0109]: “the feature extraction unit 604 takes input 602 as input”) and reference features (FIG. 6B: 624; [0115]: “feature extraction units 604 and 624”) for the reference frame (FIG. 6B: 622; [0114]: “input respective captured images 602 and 622, which may be from different perspectives”); combining the target features (FIG. 6B: 604) and the reference features (FIG. 6B: 624) to produce fused features (FIG. 6B: 616; [0115]: “fusion layer 616 after first feature extraction units 604 and 624”; [0116]: “fusion layers 616 and 618 may take a first feature map from a first perspective and combine the feature map from the first perspective with a summarized value of a second feature map from a second perspective. Further, fusion layers 616 and 618 may take the second feature map from the second perspective and combine the feature map from the second perspective with a summarized value of the feature map from the first perspective”) for the target frame (FIG. 6B: 602; [0109]: “the feature extraction unit 604 takes input 602 as input”); generating a feature matrix (FIG. 6A: 606, 608; FIG. 6B: 606, 608; [0110]: “ROI pooling layer 608 may use a function such as a max pooling function to convert the features inside any valid region of interest into a small feature map with a fixed spatial extent. After generating these small ROI feature maps, ROI pooling layer 608 may input the ROI feature maps into one or more fully-connected layers (not pictured) to generate an ROI vector”; FIG. 6A: 610; FIG. 6B: 610; [0111]: “Finally, the outputs of the fully-connected layers (e.g., the ROI vector, etc.) are inputted into second stage feature extraction layer 610, which generates two outputs: (1) class prediction scores, and (2) refined bounding boxes 614”) comprising a correspondence between objects ([0130]: “perform object detection on the image from the first perspective and the second image from the second perspective by cross-referencing data related to the first image and the second image between the first and second perspectives with the object detector”) from the reference features (FIG. 6B: 622, 624, 626, 628; [0114]: “First and second two-stage object detectors 660 and 680 may each input respective captured images 602 and 622, which may be from different perspectives”) and objects from the fused features” (FIG. 6B: 602, 606, 616, 610, 626, 630; FIG. 9: 906).
	Regarding claim 17, Li et al. discloses: “an apparatus for image processing (FIG. 3: 300; [0055]: “FIG. 3 is a diagram illustrating an example of a hardware implementation for a panoptic segmentation system 300”), comprising: a segmentation component (FIG. 3: 308; [0064]: “As shown in FIG. 3, the panoptic segmentation network 308 may include a multi-stream network 310 and a fusion network 312”) configured to generate panoptic segmentation information for the target frame (FIG. 3: 306) based on the feature matrix (ABSTRACT: “generating an instance map and a semantic map from the input. The method further includes generating the panoptic map from the instance map and the semantic map based on a binary mask”; [0025]: “In conventional panoptic segmentation networks, various approaches are used to obtain and combine (e.g., fuse) information from the instance map with information from the segmentation map”); a semantic head (FIG. 3: 308) configured to classify each pixel of the target frame (FIG. 3: 306) based on the fused features (FIG. 3: 312); and the classification of each pixel of the target frame” ([0066]: “The fusion network 312 extracts information (e.g., features) from the output of each model of the multi-stream network 310. The fusion network 312 also extracts features from the data stream. Based on the extracted features, the fusion network 312 generates a mask (e.g., binary mask). The mask 216 is used to determine whether a pixel is associated with a unique identifiable instance of an object. That is, the mask defines how to merge a segmentation map and an instance map with a single function to generate a panoptic map”).
	In addition, Gautam et al. discloses: “an encoder (FIG. 6B: 604, 624) configured to generate target features (FIG. 6B: 604; [0109]: “Feature extraction unit 604 may apply various filters, such as convolutional, residual, and pooling layers, as some examples, to image 602 to extract an initial set of feature maps”) for a target frame (FIG. 6B: 602; [0109]: “the feature extraction unit 604 takes input 602 as input”) and reference features (FIG. 6B: 624; [0115]: “feature extraction units 604 and 624”) for a reference frame of a video (FIG. 6B: 622; [0114]: “input respective captured images 602 and 622, which may be from different perspectives”); a fusion component (FIG. 6B: 616) configured to combine the target features (FIG. 6B: 604) and the reference features (FIG. 6B: 624) to produce fused features (FIG. 6B: 616; [0115]: “fusion layer 616 after first feature extraction units 604 and 624”; [0116]: “fusion layers 616 and 618 may take a first feature map from a first perspective and combine the feature map from the first perspective with a summarized value of a second feature map from a second perspective. Further, fusion layers 616 and 618 may take the second feature map from the second perspective and combine the feature map from the second perspective with a summarized value of the feature map from the first perspective”) for the target frame (FIG. 6B: 602; [0109]: “the feature extraction unit 604 takes input 602 as input”); a track head (FIG. 6A: 606, 608; FIG. 6B: 606, 608) configured to generate a feature matrix ([0110]: “ROI pooling layer 608 may use a function such as a max pooling function to convert the features inside any valid region of interest into a small feature map with a fixed spatial extent. After generating these small ROI feature maps, ROI pooling layer 608 may input the ROI feature maps into one or more fully-connected layers (not pictured) to generate an ROI vector”; FIG. 6A: 610; FIG. 6B: 610; [0111]: “Finally, the outputs of the fully-connected layers (e.g., the ROI vector, etc.) are inputted into second stage feature extraction layer 610, which generates two outputs: (1) class prediction scores, and (2) refined bounding boxes 614”) comprising a correspondence between objects ([0130]: “perform object detection on the image from the first perspective and the second image from the second perspective by cross-referencing data related to the first image and the second image between the first and second perspectives with the object detector”) from the reference features (FIG. 6B: 622, 624, 626, 628; [0114]: “First and second two-stage object detectors 660 and 680 may each input respective captured images 602 and 622, which may be from different perspectives”) and objects from the fused features” (FIG. 6B: 602, 606, 616, 610, 626, 630; FIG. 9: 906).
	With respect to claim 2, Gautam et al. discloses: “combining a plurality of input features to produce the target features (FIG. 6B: 604), wherein each of the plurality of input features (FIG. 6B: 604, 624) has a different resolution, and wherein the target features have a same resolution as the target frame” (FIG. 6B: 602; [0042]:  “Both one and two stage object detectors use convolutional neural networks to generate feature maps at various positions and spatial resolutions”).
	Regarding claim 3, Li et al. discloses: “aligning the reference features with the target features, wherein the fused features are combined based on the aligned reference features” (FIG. 3: 312; [0066]: “The fusion network 312 extracts information (e.g., features) from the output of each model of the multi-stream network 310. The fusion network 312 also extracts features from the data stream. Based on the extracted features, the fusion network 312 generates a mask (e.g., binary mask). The mask 216 is used to determine whether a pixel is associated with a unique identifiable instance of an object. That is, the mask defines how to merge a segmentation map and an instance map with a single function to generate a panoptic map”).
	Regarding claim 5, Li et al. discloses: “identifying an object order for the objects from the fused features based on the feature matrix, wherein the panoptic segmentation information is based at least in part on the object order” (FIG. 3: 312; [0066]: “The fusion network 312 extracts information (e.g., features) from the output of each model of the multi-stream network 310. The fusion network 312 also extracts features from the data stream. Based on the extracted features, the fusion network 312 generates a mask (e.g., binary mask). The mask 216 is used to determine whether a pixel is associated with a unique identifiable instance of an object. That is, the mask defines how to merge a segmentation map and an instance map with a single function to generate a panoptic map”).
	With respect to claim 6, Gautam et al. discloses: “identifying a bounding box (FIG. 6B: 614) for each of the objects from the fused features” (FIG. 6B: 618; [0111]: “Finally, the outputs of the fully-connected layers (e.g., the ROI vector, etc.) are inputted into second stage feature extraction layer 610, which generates two outputs: (1) class prediction scores, and (2) refined bounding boxes 614”; Class prediction scores 612 comprises a set of values in which each value indicates a respective likelihood that a given refined bounding box contains a given class of object).
	Regarding claim 7, Li et al. discloses: “classifying each of the objects from the fused features” ([0066]: “The fusion network 312 extracts information (e.g., features) from the output of each model of the multi-stream network 310. The fusion network 312 also extracts features from the data stream. Based on the extracted features, the fusion network 312 generates a mask (e.g., binary mask). The mask 216 is used to determine whether a pixel is associated with a unique identifiable instance of an object. That is, the mask defines how to merge a segmentation map and an instance map with a single function to generate a panoptic map”).
	With respect to claim 8, Li et al. discloses: “generating a pixel mask for each of the objects from the fused features” ([0066]: “The fusion network 312 extracts information (e.g., features) from the output of each model of the multi-stream network 310. The fusion network 312 also extracts features from the data stream. Based on the extracted features, the fusion network 312 generates a mask (e.g., binary mask). The mask 216 is used to determine whether a pixel is associated with a unique identifiable instance of an object. That is, the mask defines how to merge a segmentation map and an instance map with a single function to generate a panoptic map”).
	Regarding claim 9, Li et al. discloses: “classifying each pixel of the target frame (FIG. 3: 306) based on the fused features” (FIG. 3: 312; [0066]: “The fusion network 312 extracts information (e.g., features) from the output of each model of the multi-stream network 310. The fusion network 312 also extracts features from the data stream. Based on the extracted features, the fusion network 312 generates a mask (e.g., binary mask). The mask 216 is used to determine whether a pixel is associated with a unique identifiable instance of an object. That is, the mask defines how to merge a segmentation map and an instance map with a single function to generate a panoptic map”).
	With respect to claim 10, Li et al. discloses: “the panoptic segmentation information comprises classification information and instance information for each pixel of the target frame” (FIG. 3: 312; [0066]: “The fusion network 312 extracts information (e.g., features) from the output of each model of the multi-stream network 310. The fusion network 312 also extracts features from the data stream. Based on the extracted features, the fusion network 312 generates a mask (e.g., binary mask). The mask 216 is used to determine whether a pixel is associated with a unique identifiable instance of an object. That is, the mask defines how to merge a segmentation map and an instance map with a single function to generate a panoptic map”).
	Regarding claim 11, Li et al. discloses: “the panoptic segmentation information is generated based on an object order for the objects from the fused features, an object classification for each of the objects from the fused features, a pixel mask for each of the objects from the fused features, and a pixel classification for each pixel of the target frame” (FIG. 3: 312; [0066]: “The fusion network 312 extracts information (e.g., features) from the output of each model of the multi-stream network 310. The fusion network 312 also extracts features from the data stream. Based on the extracted features, the fusion network 312 generates a mask (e.g., binary mask). The mask 216 is used to determine whether a pixel is associated with a unique identifiable instance of an object. That is, the mask defines how to merge a segmentation map and an instance map with a single function to generate a panoptic map”).
	With respect to claim 12, Li et al. discloses: “sampling a plurality of frames from the video ([0064]: “The data stream may include multiple frames, such as image frames”); and generating the panoptic segmentation information for each of the plurality of frames” (ABSTRACT: “generating an instance map and a semantic map from the input. The method further includes generating the panoptic map from the instance map and the semantic map based on a binary mask”; [0025]: “In conventional panoptic segmentation networks, various approaches are used to obtain and combine (e.g., fuse) information from the instance map with information from the segmentation map”).
	With respect to claim 14, Li et al. discloses: “sampling a plurality of frames from each of the plurality of video clips ([0064]: “The data stream may include multiple frames, such as image frames”); and generating panoptic segmentation information for each of the plurality of frames” (ABSTRACT: “generating an instance map and a semantic map from the input. The method further includes generating the panoptic map from the instance map and the semantic map based on a binary mask”; [0025]: “In conventional panoptic segmentation networks, various approaches are used to obtain and combine (e.g., fuse) information from the instance map with information from the segmentation map”).
	Regarding claim 15, Li et al. discloses: “the predicted panoptic segmentation information is generated based on an object order for the objects from the fused features, an object classification for each of the objects from the fused features, a pixel mask for each of the objects from the fused features, and a pixel classification for each pixel of the target frame” (FIG. 3: 312; [0066]: “The fusion network 312 extracts information (e.g., features) from the output of each model of the multi-stream network 310. The fusion network 312 also extracts features from the data stream. Based on the extracted features, the fusion network 312 generates a mask (e.g., binary mask). The mask 216 is used to determine whether a pixel is associated with a unique identifiable instance of an object. That is, the mask defines how to merge a segmentation map and an instance map with a single function to generate a panoptic map”).
	With respect to claim 18, Gautam et al. discloses: “a bounding box head (FIG. 6B: 614) configured to identify a bounding box (FIG. 6B: 614) for each of the objects from the fused features” (FIG. 6B: 618; [0111]: “Finally, the outputs of the fully-connected layers (e.g., the ROI vector, etc.) are inputted into second stage feature extraction layer 610, which generates two outputs: (1) class prediction scores, and (2) refined bounding boxes 614”; Class prediction scores 612 comprises a set of values in which each value indicates a respective likelihood that a given refined bounding box contains a given class of object).
	Regarding claim 19, Li et al. discloses: “a mask head configured to generate a pixel mask for each of the objects from the fused features” ([0066]: “The fusion network 312 extracts information (e.g., features) from the output of each model of the multi-stream network 310. The fusion network 312 also extracts features from the data stream. Based on the extracted features, the fusion network 312 generates a mask (e.g., binary mask). The mask 216 is used to determine whether a pixel is associated with a unique identifiable instance of an object. That is, the mask defines how to merge a segmentation map and an instance map with a single function to generate a panoptic map”).
	With respect to claim 20, Li et al. discloses: “classify each object from the fused features” ([0066]: “The fusion network 312 extracts information (e.g., features) from the output of each model of the multi-stream network 310. The fusion network 312 also extracts features from the data stream. Based on the extracted features, the fusion network 312 generates a mask (e.g., binary mask). The mask 216 is used to determine whether a pixel is associated with a unique identifiable instance of an object. That is, the mask defines how to merge a segmentation map and an instance map with a single function to generate a panoptic map”).
	In addition, Gautam et al. discloses: “the track head” (FIG. 6A: 606, 608; FIG. 6B: 606, 608).

Claims 4 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Li et al. in view of Gautam et al. and US Patent Application Publication No. 20090244093 (Chen et al.).
	Claims 4 and 16 are dependent upon claims 1 and 13, respectively. As discussed above, claims 1 and 13 are disclosed by the combination of Li et al. in view of Gautam et al. Thus, those limitations of claims 4 and 16 that are recited in claims 1 and 13 are also disclosed by the combination of Li et al. and Gautam et al.
	However, the combination of Li et al. and Gautam et al. does not clearly disclose the remaining limitations of the claims. To that end, with respect to claim 4, Chen et al. discloses: “applying a spatial-temporal attention module to the target features and the reference features” (FIG. 1: 15a; [0042]: “a spatial-temporal processing module 15a, which does a spatial-temporal processing to guarantee the video is smooth and acceptable and eliminates the artefacts”). It is respectfully submitted that it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to further modify the combination of Li et al. and Gautam et al. with the invention of Chen et al. in order to eliminate artefacts in the features (e.g., see Chen et al. at [0042]).
	With respect to claim 16, Chen et al. discloses: “applying a spatial-temporal attention module to the target features and the reference features” (FIG. 1: 15a; [0042]: “a spatial-temporal processing module 15a, which does a spatial-temporal processing to guarantee the video is smooth and acceptable and eliminates the artefacts”).  

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MYRON K WYCHE whose telephone number is (571)272-3390. The examiner can normally be reached 7:30 am - 3:30 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kathy Wang-Hurst, can be reached at 571-270-5371. The fax number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/Myron Wyche/                        8/13/2022
Primary Examiner                   AU2644