Long-VITA: A Strong Baseline for Open-Source Long-Context Visual Language Model Beyond 1 Million Tokens
2024.12.16 🌟 The training code, deployment code, and model weights have been released. We currently support only the Ascend NPU and are working on adapting to Nvidia GPU.
🌟 We are proud to launch Long-VITA, a strong long-context visual language model that supports more than 1 million tokens.
- Long Context. Long-VITA can process more than 4K frames, or over 1M visual tokens, and achieves state-of-the-art performance on Video-MME among models under 20B parameters (see the token-budget sketch after this list).
- Open Source. Long-VITA is trained solely on open-source data, a mix of 17M publicly available samples.
- Strong Performance. Long-VITA achieves competitive results on image and video understanding benchmarks among cutting-edge models under 20B parameters.
- Comparison of image understanding.
- Comparison of video understanding.
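To make the long-context claim concrete, here is a minimal back-of-the-envelope sketch. The per-frame token budget of 256 is an illustrative assumption, not a figure from the Long-VITA release; it simply shows how 4K sampled frames translate into a context of over 1M visual tokens.

```python
# Back-of-the-envelope token budget (illustrative only; the per-frame
# token count below is an assumption, not a figure from Long-VITA).
frames = 4096              # "more than 4K frames"
tokens_per_frame = 256     # hypothetical visual tokens per sampled frame

visual_tokens = frames * tokens_per_frame
print(f"{frames} frames x {tokens_per_frame} tokens/frame "
      f"= {visual_tokens:,} visual tokens")
# 4096 frames x 256 tokens/frame = 1,048,576 visual tokens (> 1M)
```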
Long-VITA was originally implemented on the Ascend NPU; adaptation to Nvidia GPU is in progress.