
Long-VITA: A Strong Baseline for Open-Source Long-Context Visual Language Model Beyond 1 Million Tokens

🔥 News

  • 2024.12.16 🌟 The training code, deployment code, and model weights have been released. We currently support only the Ascend NPU and are working on adapting the code to Nvidia GPUs.
  • 2024.12.16 🌟 We are very proud to launch Long-VITA, a strong long-context visual language model that supports more than 1 million tokens.

✨ Highlights

  • Long Context. Long-VITA can process more than 4K frames, or over 1M visual tokens (see the back-of-the-envelope sketch after this list). It achieves state-of-the-art performance on Video-MME among models under 20B parameters.
  • Open Source. Long-VITA is trained solely on open-source data, a mix of 17M publicly available samples.
  • Strong Performance. Long-VITA achieves competitive results on image and video understanding benchmarks among cutting-edge models under 20B parameters.
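
As a rough illustration of how the frame count relates to the token budget, here is a minimal back-of-the-envelope sketch. The per-frame visual token count is an assumed figure chosen only to make the arithmetic concrete; it is not specified here and may differ from Long-VITA's actual visual tokenizer.

```python
# Back-of-the-envelope visual-token budget (illustrative only).
# ASSUMPTION: tokens_per_frame is a hypothetical value, not taken from Long-VITA.
tokens_per_frame = 256   # assumed per-frame visual token count
num_frames = 4096        # "more than 4K frames"

visual_tokens = tokens_per_frame * num_frames
print(f"{num_frames} frames x {tokens_per_frame} tokens/frame = {visual_tokens:,} visual tokens")
# 4096 frames x 256 tokens/frame = 1,048,576 visual tokens, i.e. beyond 1M
```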

📈 Experimental Results

  • Comparison of image understanding.

  • Comparison of video understanding.

⭐ Training, Inference and Evaluation

We originally implemented Long-VITA on the Ascend NPU and will adapt it to Nvidia GPUs.
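
For orientation only, below is a minimal inference sketch. It assumes the released weights can be loaded through the Hugging Face transformers Auto classes with `trust_remote_code`, and the model id shown is a placeholder; the officially supported path today is the Ascend NPU deployment code in this repository.

```python
# Minimal inference sketch (assumptions: the Hub model id is a placeholder, and
# loading via transformers Auto classes with trust_remote_code works for the
# released weights; see the repository's deployment code for the supported setup).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "VITA-MLLM/Long-VITA"  # placeholder id, for illustration only

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

inputs = tokenizer("Describe the key events in the video.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```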
