Long-VITA: A Strong Baseline for Open-Source Long-Context Visual Language Model Beyond 1 Million Tokens
2024.12.16 🌟 The training code, deployment code, and model weights have been released. We currently support only the Ascend NPU and are working on adapting to Nvidia GPU.
🌟 We are proud to launch Long-VITA, a strong long-context visual language model that supports more than 1 million tokens.
- Long Context. Long-VITA can process more than 4K frames, or over 1M visual tokens, and achieves state-of-the-art performance on Video-MME among models under 20B parameters (see the token-budget sketch after this list).
- Open Source. Long-VITA is trained solely on open-source data, a mix of 17M publicly available samples.
- Strong Performance. Long-VITA achieves competitive results on image and video understanding benchmarks among cutting-edge models under 20B parameters.
- Comparison of image understanding.
- Comparison of video understanding.
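To make the long-context claim concrete, here is a minimal back-of-the-envelope sketch. The per-frame token budget of 256 is an illustrative assumption, not a figure from the Long-VITA release; it simply shows how 4K sampled frames translate into a context of over 1M visual tokens.

```python
# Back-of-the-envelope token budget (illustrative only; the per-frame
# token count below is an assumption, not a figure from Long-VITA).
frames = 4096              # "more than 4K frames"
tokens_per_frame = 256     # hypothetical visual tokens per sampled frame

visual_tokens = frames * tokens_per_frame
print(f"{frames} frames x {tokens_per_frame} tokens/frame "
      f"= {visual_tokens:,} visual tokens")
# 4096 frames x 256 tokens/frame = 1,048,576 visual tokens (> 1M)
```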
Long-VITA was originally implemented on the Ascend NPU; adaptation to Nvidia GPU is in progress.