Weakly-Supervised Monocular 3D Object Detection Empowered by Visual Foundation Models in Autonomous Driving
Many methods for 3D bounding box prediction in autonomous driving require large amounts of annotated 3D data, which is time-consuming and expensive to obtain. WeakM3D alleviates this problem by using 2D bounding boxes and Region-of-Interest (RoI) LiDAR points as weakly-supervised labels. Building on this, I leveraged several visual foundation models to generate these weak labels with less human labor. Specifically, I introduced the Segment Anything Model (SAM), which can segment objects in arbitrary input images, to achieve "zero-shot" label generation, removing the need for additional pre-training on specific categories. In addition, I introduced a text-prompting scheme based on Contrastive Language-Image Pretraining (CLIP) to realize "open-vocabulary" label generation, allowing users to steer label generation toward the desired classes with text prompts, again without pre-training. Finally, I introduced Distillation with No Labels (DINO) and integrated it with SAM and text prompting to produce high-quality labels without pre-training or any hand-annotated labels. All three label-generation methods produce high-quality labels, and the detector trained on these labels showed improved performance on the KITTI dataset.
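To illustrate the core idea, below is a minimal sketch (not this repo's exact pipeline) of SAM-based weak label generation: a 2D box prompts SAM for an instance mask, the LiDAR points are projected into the image with the KITTI calibration, and the points landing on the mask are kept as the object's RoI LiDAR points. The file paths, sample index, box coordinates, and checkpoint file are placeholders.

```python
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor  # official SAM package

def load_kitti_calib(calib_path):
    """Parse a KITTI object-detection calib file into the matrices we need."""
    calib = {}
    with open(calib_path) as f:
        for line in f:
            if ":" not in line:
                continue
            key, vals = line.split(":", 1)
            calib[key] = np.array([float(x) for x in vals.split()])
    P2 = calib["P2"].reshape(3, 4)                       # camera 2 projection matrix
    R0 = np.eye(4); R0[:3, :3] = calib["R0_rect"].reshape(3, 3)
    Tr = np.eye(4); Tr[:3, :4] = calib["Tr_velo_to_cam"].reshape(3, 4)
    return P2, R0, Tr

def project_lidar_to_image(points_velo, Tr_velo_to_cam, R0_rect, P2):
    """Project N x 3 LiDAR points to pixel coordinates; also return a front-of-camera mask."""
    pts = np.hstack([points_velo, np.ones((points_velo.shape[0], 1))])   # N x 4 homogeneous
    pts_cam = (R0_rect @ Tr_velo_to_cam @ pts.T).T                       # rectified camera coords
    in_front = pts_cam[:, 2] > 0.1                                       # drop points behind the camera
    pts_img = (P2 @ pts_cam.T).T
    depth = np.clip(pts_img[:, 2:3], 1e-6, None)                         # avoid division by zero
    return pts_img[:, :2] / depth, in_front

# --- placeholder inputs: paths, sample index, box, and checkpoint are assumptions ---
image = cv2.cvtColor(cv2.imread("kitti/training/image_2/000123.png"), cv2.COLOR_BGR2RGB)
points_velo = np.fromfile("kitti/training/velodyne/000123.bin", dtype=np.float32).reshape(-1, 4)[:, :3]
P2, R0, Tr = load_kitti_calib("kitti/training/calib/000123.txt")
box_2d = np.array([440, 160, 700, 320])                 # 2D box prompt, XYXY pixel coordinates

# prompt SAM with the 2D box to get an instance mask ("zero-shot", no category-specific training)
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(image)
masks, _, _ = predictor.predict(box=box_2d, multimask_output=False)
mask = masks[0]                                          # H x W boolean instance mask

# keep LiDAR points whose projection falls on the mask -> RoI points used as the weak label
pts_img, in_front = project_lidar_to_image(points_velo, Tr, R0, P2)
h, w = mask.shape
u = np.round(pts_img[:, 0]).astype(int)
v = np.round(pts_img[:, 1]).astype(int)
inside = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
on_mask = np.zeros_like(inside)
on_mask[inside] = mask[v[inside], u[inside]]
roi_points = points_velo[on_mask]                        # weak label: RoI LiDAR points for this object
```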
| Method | BEV Easy | BEV Mod | BEV Hard | 3D Easy | 3D Mod | 3D Hard |
|---|---|---|---|---|---|---|
| WeakM3D (Official) | 60.72 | 40.32 | 31.34 | 53.28 | 33.30 | 25.76 |
| WeakM3D (Reproduction) | 54.51 | 35.81 | 28.29 | 46.76 | 29.70 | 22.95 |
| PointSAM | 52.45 | 36.19 | 28.98 | 46.44 | 30.48 | 23.96 |
| YOLO-World | 58.57 | 38.20 | 30.51 | 52.02 | 32.34 | 25.24 |
| LangSAM | | | | | | |

Table 1. Detector performance in AP40 (%) at IoU = 0.5, for BEV and 3D detection.
| Method | BEV Easy | BEV Mod | BEV Hard | 3D Easy | 3D Mod | 3D Hard |
|---|---|---|---|---|---|---|
| WeakM3D (Official) | 26.92 | 18.57 | 15.86 | 18.27 | 12.95 | 11.50 |
| WeakM3D (Reproduction) | 15.69 | 11.80 | 10.24 | 10.20 | 7.89 | 7.34 |
| PointSAM | 22.55 | 15.93 | 14.78 | 17.08 | 12.98 | 11.58 |
| YOLO-World | 22.73 | 16.01 | 14.45 | 15.86 | 11.48 | 11.06 |
| LangSAM | | | | | | |

Table 2. Detector performance in AP11 (%) at IoU = 0.7, for BEV and 3D detection.
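For reference, the AP40 and AP11 numbers above follow the standard KITTI protocol: precision is interpolated at 40 (respectively 11) evenly spaced recall positions and averaged. A minimal sketch of that interpolation is shown below; it assumes the precision/recall arrays have already been computed from detections matched to ground truth at the chosen IoU threshold (0.5 or 0.7 in the tables).

```python
import numpy as np

def kitti_interpolated_ap(recall, precision, n_points=40):
    """KITTI-style interpolated AP: average of the max precision at each sampled recall.

    AP40 samples recall at 1/40, 2/40, ..., 1.0; AP11 samples 0.0, 0.1, ..., 1.0.
    """
    if n_points == 40:
        sample_points = np.linspace(1.0 / 40, 1.0, 40)
    else:
        sample_points = np.linspace(0.0, 1.0, 11)
    ap = 0.0
    for r in sample_points:
        above = recall >= r
        p = precision[above].max() if above.any() else 0.0  # interpolated precision at recall r
        ap += p / len(sample_points)
    return 100.0 * ap  # percentage, as reported in the tables

# toy usage with made-up precision/recall values
recall = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
precision = np.array([0.95, 0.90, 0.80, 0.60, 0.40])
print(kitti_interpolated_ap(recall, precision, n_points=40))
```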
I used the KITTI dataset; for all preparation steps, please refer to the Dataset Preparation section of WeakM3D.
Python = 3.8
PyTorch = 2.2
CUDA = 11.8
The code benefits from: