MWSIS: Multimodal Weakly Supervised Instance Segmentation with 2D Box Annotations for Autonomous Driving

Guangfeng Jiang1, Jun Liu1*, Yuzhi Wu1, Wenlong Liao2, Tao He3, Pai Peng3*
1Department of Electronic Engineering and Information Science, University of Science and Technology of China, 2Shanghai Jiao Tong University, 3COWAROBOT
AAAI-2024
*Corresponding author
The framework consists of three main parts: the image pseudo label generation branch, the point cloud pseudo label generation branch, and the CSCS module. In the 2D branch, the pseudo 2D masks generated by the IPG module self-supervise the 2D masks. In the 3D branch, the labels are refined by the SPG and PVC modules to supervise the 3D masks. Finally, the pseudo masks from the teacher model are used for cross-supervision in the CSCS module.

Abstract

Instance segmentation is a fundamental research topic in computer vision, especially in autonomous driving. However, manual mask annotation for instance segmentation is time-consuming and costly. To address this problem, some prior works apply weak supervision using 2D or 3D boxes. However, no prior work has successfully segmented 2D and 3D instances simultaneously using only 2D box annotations, which could further reduce the annotation cost by an order of magnitude.

Thus, we propose a novel framework called Multimodal Weakly Supervised Instance Segmentation (MWSIS), which incorporates various fine-grained label correction modules for both 2D and 3D modalities, along with a new multimodal cross-supervision approach. In the 2D pseudo label generation branch, the Instance-based Pseudo Mask Generation (IPG) module utilizes predictions for self-supervised correction. Similarly, in the 3D pseudo label generation branch, the Spatial-based Pseudo Label Generation (SPG) module generates pseudo labels by incorporating the spatial prior information of the point cloud. To further refine the generated pseudo labels, the Point-based Voting Label Correction (PVC) module utilizes historical predictions for correction. Additionally, a Ring Segment-based Label Correction (RSC) module is proposed to refine the predictions by leveraging the depth prior information from the point cloud. Finally, the Consistency Sparse Cross-modal Supervision (CSCS) module reduces the inconsistency of multimodal predictions by response distillation.
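The voting idea behind the PVC module can be illustrated with a minimal sketch: keep a short history of per-point predictions and take the majority label for each point. The class name `PointVoteBuffer`, the window size, and the rule of ignoring tied votes are assumptions made for illustration; they are not taken from the paper.

```python
from collections import Counter, deque


class PointVoteBuffer:
    """Hypothetical helper: majority vote over recent per-point predictions."""

    def __init__(self, num_points, window=3):
        # One bounded history per point; the oldest prediction drops out
        # automatically once the window is full.
        self.history = [deque(maxlen=window) for _ in range(num_points)]

    def update(self, predictions):
        """Record one round of predicted labels (one integer per point)."""
        if len(predictions) != len(self.history):
            raise ValueError("expected exactly one prediction per point")
        for votes, label in zip(self.history, predictions):
            votes.append(label)

    def voted_labels(self, ignore_label=-1):
        """Return the most frequent label per point.

        Points with no history or with a tied vote get ignore_label
        (an assumed convention for this sketch).
        """
        result = []
        for votes in self.history:
            top = Counter(votes).most_common(2)
            if not top:
                result.append(ignore_label)
            elif len(top) == 2 and top[0][1] == top[1][1]:
                result.append(ignore_label)  # ambiguous vote
            else:
                result.append(top[0][0])
        return result


buffer = PointVoteBuffer(num_points=3, window=3)
for round_prediction in ([1, 0, 2], [1, 1, 2], [0, 1, 0]):
    buffer.update(round_prediction)
print(buffer.voted_labels())  # [1, 1, 2]
```

A bounded window keeps the vote responsive to recent predictions; how the actual module weights or stores its history may differ.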

Notably, transferring the 3D backbone to downstream tasks not only improves the performance of 3D detectors, but also outperforms fully supervised instance segmentation while using only 5% of the fully supervised annotations.

On the Waymo dataset, the proposed framework demonstrates significant improvements over the baseline, achieving increases of 2.59% mAP and 12.75% mAP for 2D and 3D instance segmentation, respectively.

Contributions

1. To the best of our knowledge, we are the first to use 2D box annotations as the sole external supervision signal to train both image and point cloud instance segmentors simultaneously.

2. We propose various fine-grained label correction modules for different modalities, including instance-based, spatial-based, point-based, and ring segment-based modules. These modules not only enhance the instance segmentation performance, but also improve the quality of the pseudo labels.

3. We propose a novel cross-modal supervision method, named CSCS, which exploits the complementary properties of the point cloud and image modalities. This method improves the performance of the segmentors.

4. Our framework can be used as a pre-training method to improve the performance of 3D downstream tasks such as semantic segmentation, instance segmentation, and object detection.
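The cross-modal supervision in contribution 3 can be pictured as a consistency term evaluated only where LiDAR points project into the image. The sketch below is a simplified assumption, not the paper's loss: it uses a mean squared difference between the image-branch mask probability at each projected pixel and the point-branch probability, and the function name `sparse_consistency_loss` and its arguments are hypothetical.

```python
def sparse_consistency_loss(mask_probs, point_probs, pixel_coords):
    """Mean squared disagreement between 2D and 3D foreground probabilities.

    mask_probs   -- 2D list [row][col] of image-branch foreground probabilities
    point_probs  -- one point-branch foreground probability per LiDAR point
    pixel_coords -- (u, v) integer pixel of each point, u = column, v = row
    Points that project outside the image are skipped, so the term is
    "sparse": it only touches pixels that have a corresponding point.
    """
    if len(point_probs) != len(pixel_coords):
        raise ValueError("need one pixel coordinate per point")
    height = len(mask_probs)
    width = len(mask_probs[0]) if height else 0

    total, count = 0.0, 0
    for prob_3d, (u, v) in zip(point_probs, pixel_coords):
        if 0 <= v < height and 0 <= u < width:
            prob_2d = mask_probs[v][u]
            total += (prob_2d - prob_3d) ** 2
            count += 1
    # Assumed convention: no overlapping points means no penalty.
    return total / count if count else 0.0


mask_probs = [[0.9, 0.1],
              [0.2, 0.8]]
point_probs = [1.0, 0.0, 0.5]
pixel_coords = [(0, 0), (1, 0), (5, 5)]  # the last point falls outside the image
print(sparse_consistency_loss(mask_probs, point_probs, pixel_coords))
```

In a teacher-student setup, one side of the difference would come from the teacher's pseudo masks and be treated as a fixed target; this sketch leaves that detail out.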

Pseudo Label Quality

Left: Visualization of 3D pseudo labels. Right: Comparison of the IoU obtained with different methods on the Waymo validation set. SAM denotes masks obtained from SAM using 2D boxes as prompts.
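For reference, the IoU between a pseudo label and a ground-truth mask can be computed as in the minimal sketch below. Masks are assumed to be equal-length sequences of per-point (or per-pixel) foreground flags, and returning 1.0 when both masks are empty is an assumed convention; the function name `mask_iou` is hypothetical.

```python
def mask_iou(pred, target):
    """Intersection over union of two binary masks given as flat sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Assumed convention: two empty masks agree perfectly.
    return intersection / union if union else 1.0


pseudo_label = [1, 1, 0, 0]
ground_truth = [1, 0, 1, 0]
print(mask_iou(pseudo_label, ground_truth))  # one shared point out of three
```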

Main Results

2D Instance Segmentation.

3D Instance Segmentation.

Ring Segment

To leverage the prior information about the depth variation of the point cloud, we propose the Depth Clustering Segment (DCS) algorithm to segment the point cloud.

As shown in the following images, different colors represent different ring segments.
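One simple way to picture depth-based segmentation of a ring is to walk along the points of a single LiDAR ring in azimuth order and start a new segment whenever the depth jumps sharply. The sketch below follows that idea only; the function name `depth_cluster_ring` and the 0.5 m threshold are assumptions for illustration, and the actual DCS algorithm may differ.

```python
def depth_cluster_ring(depths, jump_threshold=0.5):
    """Assign a segment id to each point of one ring.

    depths         -- ranges (in metres) of the ring's points, ordered by azimuth
    jump_threshold -- assumed depth gap that separates two segments
    """
    segment_ids = []
    current = 0
    for i, depth in enumerate(depths):
        # A large depth change between neighbours suggests an object boundary.
        if i > 0 and abs(depth - depths[i - 1]) > jump_threshold:
            current += 1
        segment_ids.append(current)
    return segment_ids


ring = [10.0, 10.1, 10.2, 6.0, 6.1, 6.05, 10.3, 10.4]
print(depth_cluster_ring(ring))  # [0, 0, 0, 1, 1, 1, 2, 2]
```

This sketch treats the ring as an open sequence; a full implementation would also need to decide whether the first and last segments should be merged across the 360° wrap-around.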

BibTeX

@misc{jiang2023mwsis,
  title={MWSIS: Multimodal Weakly Supervised Instance Segmentation with 2D Box Annotations for Autonomous Driving},
  author={Guangfeng Jiang and Jun Liu and Yuzhi Wu and Wenlong Liao and Tao He and Pai Peng},
  year={2023},
  eprint={2312.06988},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}