
English | 简体中文

Are VLMs Ready for Autonomous Driving?
An Empirical Study from the Reliability, Data, and Metric Perspectives

Shaoyuan Xie1     Lingdong Kong2,3     Yuhao Dong2,4     Chonghao Sima2,5
Wenwei Zhang2     Qi Alfred Chen1     Ziwei Liu4     Liang Pan2

1University of California, Irvine     2Shanghai AI Laboratory     3National University of Singapore     4S-Lab, Nanyang Technological University     5The University of Hong Kong


About

We introduce 🚙 DriveBench, a benchmark dataset designed to evaluate VLM reliability across 17 settings (clean, corrupted, and text-only inputs). It comprises 19,200 frames and 20,498 question-answer pairs covering three question types and four mainstream driving tasks, on which we benchmark 12 popular VLMs.
Our findings reveal that VLMs often generate plausible responses derived from general knowledge or textual cues rather than from true visual grounding, especially under degraded or missing visual inputs. This behavior, concealed by dataset imbalances and insufficient evaluation metrics, poses significant risks in safety-critical scenarios such as autonomous driving.
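
For orientation, the sketch below shows what one evaluation sample looks like conceptually. The field names and the example question are illustrative assumptions on our part, not the toolkit's actual schema (see the data preparation guide for the real format).

```python
# Conceptual sketch of one DriveBench-style evaluation sample.
# All field names here are illustrative assumptions; the actual
# data format is documented in the data preparation guide.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class DriveBenchSample:
    frame_paths: Optional[List[str]]  # multi-view camera frames; None in text-only settings
    setting: str        # one of the 17 input settings, e.g. "clean" or a corruption name
    task: str           # "perception" | "prediction" | "planning" | "behavior"
    question_type: str  # one of the three question types, e.g. multiple-choice
    question: str
    answer: str         # ground-truth reference answer

# A text-only probe: the question is asked with no image at all, which is
# how the "T.O." columns in the benchmark study below are produced.
probe = DriveBenchSample(
    frame_paths=None,
    setting="text-only",
    task="perception",
    question_type="MCQ",
    question="What is the moving status of the object ahead? (A) Moving. (B) Parked.",
    answer="A",
)
```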

📝 Updates

Table of Contents

📊 Benchmark Comparison

| Benchmark | Frames (Test) | QA (Test) | Logic | Evaluation Metrics |
|---|---|---|---|---|
| BDD-X | - | - | None | Language |
| BDD-OIA | - | - | None | F1 Score |
| nuScenes-QA | 36,114 | 83,337 | None | Acc |
| Talk2Car | ~1.8k | 2,447 | None | - |
| nuPrompt | ~36k | ~6k | None | AMOTA |
| DRAMA | - | ~14k | Chain | Language |
| Rank2Tell | - | - | Chain | Accuracy, Language |
| DriveMLLM | 880 | - | None | Acc |
| DriveVLM | - | - | None | GPTctx |
| DriveLM | 4,794 | 15,480 | Graph | Language, GPT |
| DriveBench (Ours) | 19,200 | 20,498 | Graph | Acc, Language, GPT, GPTctx |
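
The last column mixes rule-based and LLM-judged metrics. As a rough illustration of the rule-based families, the sketch below implements exact-match accuracy (Acc) for multiple-choice answers and ROUGE-L F1 as one representative "Language" metric; the GPT and GPTctx scores come from prompting an LLM judge and are not reproduced here.

```python
# Illustrative metric helpers, assuming MCQ answers lead with an option
# letter and open-ended answers are free text.
# Requires `pip install rouge-score`.
from rouge_score import rouge_scorer

def mcq_accuracy(predictions: list[str], references: list[str]) -> float:
    """Exact-match accuracy on the leading option letter of each answer."""
    hits = sum(p.strip().upper()[:1] == r.strip().upper()[:1]
               for p, r in zip(predictions, references))
    return hits / len(references)

def rouge_l_f1(prediction: str, reference: str) -> float:
    """ROUGE-L F1 between a free-form answer and its reference."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    return scorer.score(reference, prediction)["rougeL"].fmeasure
```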

⚙️ Installation

For details on installation and environment setup, kindly refer to INSTALL.md.

♨️ Data Preparation

Kindly refer to DATA_PREPARE.md for details on preparing the datasets.

🚀 Getting Started

To learn more about using this codebase, kindly refer to GET_STARTED.md.

🚡 Benchmark Results

Benchmark Configuration

- Commercial VLMs
- Open-Source VLMs
- Specialist VLMs

Benchmark Study

| Model | Size | Type | Perception (Clean) | Perception (Corr.) | Perception (T.O.) | Prediction (Clean) | Prediction (Corr.) | Prediction (T.O.) | Planning (Clean) | Planning (Corr.) | Planning (T.O.) | Behavior (Clean) | Behavior (Corr.) | Behavior (T.O.) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Human | - | - | 47.67 | 38.32 | - | - | - | - | - | - | - | 69.51 | 54.09 | - |
| GPT-4o | - | Commercial | 35.37 | 35.25 | 36.48 | 51.30 | 49.94 | 49.05 | 75.75 | 75.36 | 73.21 | 45.40 | 44.33 | 50.03 |
| LLaVA-1.5 | 7B | Open | 23.22 | 22.95 | 22.31 | 22.02 | 17.54 | 14.64 | 29.15 | 31.51 | 32.45 | 13.60 | 13.62 | 14.91 |
| LLaVA-1.5 | 13B | Open | 23.35 | 23.37 | 22.37 | 36.98 | 37.78 | 23.98 | 34.26 | 34.99 | 38.85 | 32.99 | 32.43 | 32.79 |
| LLaVA-NeXT | 7B | Open | 24.15 | 19.62 | 13.86 | 35.07 | 35.89 | 28.36 | 45.27 | 44.36 | 27.58 | 48.16 | 39.44 | 11.92 |
| InternVL2 | 8B | Open | 32.36 | 32.68 | 33.60 | 45.52 | 37.93 | 48.89 | 53.27 | 55.25 | 34.56 | 54.58 | 40.78 | 20.14 |
| Phi-3 | 4.2B | Open | 22.88 | 23.93 | 28.26 | 40.11 | 37.27 | 22.61 | 60.03 | 61.31 | 46.88 | 45.20 | 44.57 | 28.22 |
| Phi-3.5 | 4.2B | Open | 27.52 | 27.51 | 28.26 | 45.13 | 38.21 | 4.92 | 31.91 | 28.36 | 46.30 | 37.89 | 49.13 | 39.16 |
| Oryx | 7B | Open | 17.02 | 15.97 | 18.47 | 48.13 | 46.63 | 12.77 | 53.57 | 55.76 | 48.26 | 33.92 | 33.81 | 23.94 |
| Qwen2-VL | 7B | Open | 28.99 | 27.85 | 35.16 | 37.89 | 39.55 | 37.77 | 57.04 | 54.78 | 41.66 | 49.07 | 47.68 | 54.48 |
| Qwen2-VL | 72B | Open | 30.13 | 26.92 | 17.70 | 49.35 | 43.49 | 5.57 | 61.30 | 63.07 | 53.35 | 51.26 | 49.78 | 39.46 |
| DriveLM | 7B | Specialist | 16.85 | 16.00 | 8.75 | 44.33 | 39.71 | 4.70 | 68.71 | 67.60 | 65.24 | 42.78 | 40.37 | 27.83 |
| Dolphins | 7B | Specialist | 9.59 | 10.84 | 11.01 | 32.66 | 29.88 | 39.98 | 52.91 | 53.77 | 60.98 | 8.81 | 8.25 | 11.92 |

Clean: clean inputs; Corr.: corrupted inputs; T.O.: text-only inputs.
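
Each task above is evaluated under three input settings: Clean (original frames), Corr. (corrupted frames), and T.O. (text-only, no image). A minimal sketch of that loop, reusing the sample sketch above, where `query_vlm` and `corrupt_fn` are hypothetical placeholders rather than toolkit APIs:

```python
from PIL import Image

def evaluate_sample(query_vlm, sample, corrupt_fn):
    """Ask one question under its input setting.

    `query_vlm(images, prompt)` is a hypothetical stand-in for the model
    call being benchmarked; `corrupt_fn(image, setting)` applies a visual
    corruption (see the sketch after the robustness table below).
    """
    if sample.frame_paths is None:     # T.O.: text-only, no visual input
        images = []
    else:
        images = [Image.open(p).convert("RGB") for p in sample.frame_paths]
        if sample.setting != "clean":  # Corr.: corrupted multi-view frames
            images = [corrupt_fn(img, sample.setting) for img in images]
    return query_vlm(images=images, prompt=sample.question)
```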

Robustness Analysis

| Model | Size | Type | Weather MCQ | Weather VQA | Weather CAP | External MCQ | External VQA | External CAP | Sensor MCQ | Sensor VQA | Sensor CAP | Motion MCQ | Motion VQA | Motion CAP | Transmission MCQ | Transmission VQA | Transmission CAP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | - | Commercial | 57.20 | 57.28 | 54.90 | 29.25 | 56.60 | 61.98 | 44.25 | 54.95 | 56.53 | 34.25 | 59.20 | 56.25 | 36.83 | 53.95 | 57.57 |
| LLaVA-1.5 | 7B | Open | 69.70 | 35.49 | 35.91 | 26.50 | 29.17 | 34.95 | 18.83 | 30.64 | 33.15 | 71.25 | 33.43 | 35.18 | 10.17 | 27.28 | 34.38 |
| LLaVA-1.5 | 13B | Open | 61.60 | 39.76 | 37.76 | 15.50 | 34.55 | 37.83 | 24.08 | 35.48 | 36.08 | 79.75 | 36.46 | 36.42 | 15.50 | 32.53 | 34.33 |
| LLaVA-NeXT | 7B | Open | 69.70 | 36.96 | 48.52 | 48.50 | 30.32 | 57.18 | 21.83 | 30.40 | 44.37 | 66.00 | 34.20 | 50.44 | 11.83 | 29.43 | 53.50 |
| InternVL2 | 8B | Open | 59.90 | 48.72 | 48.60 | 50.75 | 47.74 | 57.82 | 29.92 | 45.06 | 51.14 | 68.25 | 49.51 | 49.67 | 30.00 | 43.42 | 54.24 |
| Phi-3 | 4.2B | Open | 40.00 | 40.59 | 45.61 | 25.00 | 31.44 | 45.99 | 16.83 | 35.58 | 43.71 | 31.25 | 42.92 | 48.43 | 27.67 | 33.04 | 41.35 |
| Phi-3.5 | 4.2B | Open | 60.60 | 41.82 | 45.97 | 21.25 | 36.89 | 30.95 | 25.58 | 34.66 | 39.30 | 33.00 | 46.03 | 49.33 | 39.67 | 33.47 | 39.67 |
| Oryx | 7B | Open | 53.20 | 40.43 | 48.95 | 45.00 | 40.68 | 56.06 | 50.50 | 36.71 | 48.55 | 72.50 | 40.01 | 48.33 | 39.67 | 36.98 | 49.87 |
| Qwen2-VL | 7B | Open | 76.70 | 49.33 | 45.12 | 37.50 | 47.62 | 51.24 | 22.83 | 39.45 | 47.23 | 57.00 | 47.40 | 47.74 | 35.83 | 42.31 | 48.60 |
| Qwen2-VL | 72B | Open | 59.80 | 51.05 | 48.55 | 45.50 | 50.57 | 57.25 | 52.25 | 45.89 | 48.59 | 58.25 | 50.85 | 47.88 | 44.83 | 46.23 | 50.50 |
| DriveLM | 7B | Specialist | 21.20 | 42.86 | 20.04 | 21.25 | 37.49 | 21.92 | 9.00 | 36.68 | 15.56 | 22.25 | 42.05 | 17.07 | 17.50 | 39.56 | 10.37 |
| Dolphins | 7B | Specialist | 54.30 | 30.21 | 31.08 | 3.00 | 30.42 | 29.38 | 9.42 | 26.83 | 26.30 | 9.25 | 29.82 | 28.05 | 21.50 | 28.86 | 27.65 |

MCQ: multiple-choice questions; VQA: open-ended visual question answering; CAP: captioning.
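
The five column groups correspond to families of visual corruptions applied to the camera frames. As a rough illustration only (the benchmark's actual corruption implementations ship with this toolkit and may differ in detail), here are simple stand-ins for a sensor-style and a motion-style corruption:

```python
# Rough stand-ins for two corruption families; illustrative only.
import numpy as np
from PIL import Image, ImageFilter

def sensor_noise(img: Image.Image, sigma: float = 25.0) -> Image.Image:
    """Sensor-style corruption: additive Gaussian pixel noise."""
    arr = np.asarray(img).astype(np.float32)
    arr += np.random.normal(0.0, sigma, arr.shape)
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))

def motion_blur(img: Image.Image, radius: float = 4.0) -> Image.Image:
    """Motion-style corruption, approximated with an isotropic Gaussian blur."""
    return img.filter(ImageFilter.GaussianBlur(radius))
```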

Qualitative Comparisons

Figure: Examples of different VLM responses under the Frame Lost condition. We observe that GPT-4o responds based on visible objects, while LLaVA-NeXT and DriveLM tend to hallucinate objects that cannot be seen in the provided images.

Figure: Examples of different VLM responses under the Water Splash condition. We observe that, under severe visual corruptions, VLMs respond with ambiguous and generic answers based on their learned knowledge, without referring to the visual information. Most responses mention traffic signals and pedestrians, even though they are not visible in the provided images.

Citation

If you find this work helpful, please kindly consider citing our paper:

@article{xie2025drivebench,
  author  = {Xie, Shaoyuan and Kong, Lingdong and Dong, Yuhao and Sima, Chonghao and Zhang, Wenwei and Chen, Qi Alfred and Liu, Ziwei and Pan, Liang},
  title   = {Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives},
  journal = {arXiv preprint arXiv:2501.04003},
  year    = {2025},
}

License

This work is released under the Apache License, Version 2.0, while some specific implementations in this codebase may be under other licenses. Kindly refer to LICENSE.md for details, especially if you intend to use our code for commercial purposes.

Acknowledgments

To be updated.
