[arXiv] [HuggingFace]
- A novel architecture for video understanding MLLMs that bypasses the sequence-length limitations of LLMs.
- Easily adaptable and compatible with most MLLMs, such as the widely-used LLaVA architectures.
- Instruction-aware extraction of visual information from uncompressed video representation.
- Capable of analyzing over 1,536 frames during inference without additional compression (64 tokens per frame using a SigLIP encoder, totaling 98k sequence length).
Balancing temporal resolution and spatial detail under limited compute budget remains a key challenge for video-based multi-modal large language models (MLLMs). Existing methods typically compress video representations using predefined rules before feeding them into the LLM, resulting in irreversible information loss and often ignoring input instructions. To address this, we propose a novel slow-fast architecture that naturally circumvents this trade-off, enabling the use of more input frames while preserving spatial details. Inspired by how humans first skim a video before focusing on relevant parts, our slow-fast design employs a dual-token strategy: 1) "fast" visual tokens — a compact set of compressed video features — are fed into the LLM alongside text embeddings to provide a quick overview; 2) "slow" visual tokens — uncompressed video features — are cross-attended by text embeddings through specially designed hybrid decoder layers, enabling instruction-aware extraction of relevant visual details with linear complexity. We conduct systematic exploration to optimize both the overall architecture and key components. Experiments show that our model significantly outperforms self-attention-only baselines, extending the input capacity from 16 to 128 frames with just a 3% increase in LLM's prefilling computation, and achieving a 16% average performance improvement across five video understanding benchmarks.
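For intuition, here is a minimal PyTorch sketch of the dual-token design described above. It is illustrative only: the class name, hidden size, head count, gating, and pooling choice are assumptions for the sketch, not the released implementation.

import torch
import torch.nn as nn

class HybridDecoderLayerSketch(nn.Module):
    """Text hidden states cross-attend to uncompressed ("slow") visual tokens."""

    def __init__(self, hidden_size=3584, num_heads=28):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_size)
        self.gate = nn.Parameter(torch.zeros(1))  # zero-init gate keeps the pretrained LLM unchanged at start

    def forward(self, text_hidden, slow_visual):
        # Query: text states; Key/Value: uncompressed visual features.
        # Cost grows linearly with the number of visual tokens (no visual self-attention).
        attended, _ = self.cross_attn(self.norm(text_hidden), slow_visual, slow_visual)
        return text_hidden + torch.tanh(self.gate) * attended

frames, tokens_per_frame, d = 128, 64, 3584
slow = torch.randn(1, frames * tokens_per_frame, d)           # "slow" tokens: all visual features, uncompressed
fast = slow.view(1, frames, tokens_per_frame, d).mean(dim=2)  # "fast" tokens: e.g. pooled to one token per frame
text = torch.randn(1, 32, d)                                  # text embeddings / hidden states

llm_input = torch.cat([fast, text], dim=1)      # fast pathway: compact overview fed through ordinary self-attention
out = HybridDecoderLayerSketch()(text, slow)    # slow pathway: instruction-aware cross-attention in hybrid layers
print(llm_input.shape, out.shape)               # torch.Size([1, 160, 3584]) torch.Size([1, 32, 3584])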
- [04/08/2025] Evaluation code for different benchmarks.
- [04/08/2025] Online demo.
- [04/08/2025] Model checkpoints and inference code.
We provide pretrained models on the Hugging Face Hub, listed in the table below. All models use Qwen2-7B-Instruct as the base LLM.
Link | Frames | Sampling Stride | Pooling Stride | VideoMME | VideoMME-sub | MLVU | MVBench | EgoSchema | LongVideoBench | Perception Test | Next-QA | ActivityNetQA | TempCompass |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Checkpoint | 64 | 1 | 4 | 60.2 | 63.0 | 67.3 | 68.9 | 59.2 | 56.6 | 70.3 | 83.5 | 54.8 | 68.9 |
Checkpoint | 96 | 1 | 6 | 60.3 | 63.4 | 68.1 | 68.6 | 59.8 | 58.0 | 70.2 | 83.1 | 54.5 | 67.7 |
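If convenient, a checkpoint can also be fetched programmatically with huggingface_hub; the repo id below is the 64-frame model used in the inference examples later in this README.

# Optional: download a checkpoint locally; pass the returned path as the model path.
from huggingface_hub import snapshot_download

local_dir = snapshot_download("shi-labs/slowfast-video-mllm-qwen2-7b-convnext-576-frame64-s1t4")
print(local_dir)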
Name | LLM | VideoMME | VideoMME-sub | MLVU | MVBench | EgoSchema | LongVidBench | Per. Test | Next-QA | ActNetQA | TempCompass |
---|---|---|---|---|---|---|---|---|---|---|---|
GPT-4o | - | 59.9 | 63.3 | - | - | - | - | - | - | 57.0 | 70.9 |
VILA | Yi-34B | 60.1 | 61.1 | 56.7 | - | 58.0 | - | 54.0 | 67.9 | 58.0 | - |
PLLaVA | Yi-34B | - | - | - | 58.1 | - | 53.2 | - | - | 60.9 | - |
LongVA | Qwen2-7B | 52.6 | 54.3 | 56.3 | - | - | - | - | 68.3 | 50.0 | - |
IXC-2.5 | InternLM2-7B | 55.8 | 58.8 | 37.3 | 69.1 | - | - | 34.4 | 71.0 | 52.8 | - |
SlowFast-LLaVA | Qwen2-7B | - | - | - | - | 47.2 | - | - | 64.2 | 55.5 | - |
SlowFast-LLaVA | Yi-34B | - | - | - | - | 55.8 | - | - | 66.2 | 59.2 | - |
LLaVA-OV | Qwen2-7B | 58.2 | 61.5 | 64.7 | 56.7 | 60.1 | 56.5 | 57.1 | 79.4 | 56.6 | 64.8 |
VideoLLaMA2 | Qwen2-7B | 47.9 | 50.3 | 32.7 | 54.6 | 51.7 | - | 51.4 | - | 50.2 | - |
Kangaroo | LLaMA3-8B | 56.0 | 57.6 | 61.0 | 61.1 | 62.7 | 54.8 | - | - | - | 61.3 |
Oryx-MLLM | Qwen2-7B | 58.3 | 62.6 | 67.5 | 63.9 | - | 55.3 | 68.6 | 81.9 | - | - |
mPLUG-Owl3 | Qwen2-7B | 53.5 | - | - | 54.5 | - | 52.1 | - | 78.6 | - | - |
Slow-Fast Video MLLM | |||||||||||
64 Frame | Qwen2-7B | 60.2 | 63.0 | 67.3 | 68.9 | 59.2 | 56.6 | 70.3 | 83.5 | 54.8 | 68.9 |
96 Frame | Qwen2-7B | 60.3 | 63.4 | 68.1 | 68.6 | 59.8 | 58.0 | 70.2 | 83.1 | 54.5 | 67.7 |
Please follow the guide here to prepare the environment on a Linux OS.
- Clone this repository
git clone https://github.com/SHI-Labs/Slow-Fast-Video-Multimodal-LLM.git
cd Slow-Fast-Video-Multimodal-LLM
- Create environment and install package
conda create -n slowfast_mllm python=3.10 -y
conda activate slowfast_mllm
pip install --upgrade pip # enable PEP 660 support
pip install -r requirements.txt
Note that we use CUDA 12.2 to set up the Python environment. Other versions of CUDA, PyTorch, and the transformers library may also work, but we have not tested performance under those configurations.
- Install additional packages for training
pip install flash-attn==2.4.2 --no-build-isolation
pip install deepspeed==0.14.2
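As an optional sanity check (a sketch assuming the steps above completed; it is not part of this repo), you can verify that PyTorch sees a GPU and that flash-attn is importable:

# Optional sanity check for the environment created above (not part of this repo).
import torch

print("torch:", torch.__version__, "| cuda build:", torch.version.cuda)
print("gpu available:", torch.cuda.is_available())

try:
    import flash_attn  # installed by the optional training step
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed")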
We are currently preparing the training dataset for release. In the meantime, you can prepare your own data in the following format to start training. The annotation file should be a single JSON file containing a list of dicts with the following format:
[
    {
        "video": "${VIDEO_PATH}",
        "conversation": [
            {
                "from": "human",
                "value": "Describe this video in detail."
            },
            {
                "from": "gpt",
                "value": "This is a dummy annotation."
            }
        ]
    },
    ...
]
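As a quick check, a small script like the one below can verify that an annotation file matches this layout (check_annotations is a hypothetical helper, not part of this repo):

# Hypothetical helper to sanity-check an annotation file against the layout above.
import json

def check_annotations(path):
    with open(path) as f:
        samples = json.load(f)  # expected: a list of dicts
    assert isinstance(samples, list), "annotation file must contain a JSON list"
    for i, sample in enumerate(samples):
        assert "video" in sample, f"sample {i} is missing the 'video' path"
        for turn in sample["conversation"]:
            assert turn["from"] in ("human", "gpt"), f"sample {i} has an unknown role"
    print(f"{len(samples)} samples look well-formed")

# check_annotations("path/to/annotations.json")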
We provide a simple inference script, `simple_inference_demo.py`, which generates a single-round response based on the input video and question. To use this demo:
python simple_inference_demo.py \
--model-path shi-labs/slowfast-video-mllm-qwen2-7b-convnext-576-frame64-s1t4 \
--conv-mode qwen_1_5 \
--video-path "assets/catinterrupt.mp4" \
--question "Please describe this video in detail." \
--max_frames 64
You need to set `max_frames` according to the training configuration of the corresponding model.
You can also use the following code snippet directly.
import torch
import os
import numpy as np
from decord import VideoReader, cpu
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
from llava.conversation import conv_templates, SeparatorStyle
from llava.model.builder import load_pretrained_model
from llava.mm_utils import tokenizer_image_token, get_model_name_from_path
from llava.utils import disable_torch_init
def load_video(video_path, max_frames_num):
    # Uniformly sample `max_frames_num` frames across the whole video.
    vr = VideoReader(video_path, num_threads=4)
    uniform_sampled_frames = np.linspace(0, len(vr) - 1, max_frames_num, dtype=int)
    frame_idx = uniform_sampled_frames.tolist()
    frames = vr.get_batch(frame_idx).asnumpy()
    return frames
# Model
model_path = "shi-labs/slowfast-video-mllm-qwen2-7b-convnext-576-frame64-s1t4"
video_path = "assets/catinterrupt.mp4"
question = "Please describe this video in detail."
max_frames = 64
disable_torch_init()
model_path = os.path.expanduser(model_path)
model_name = get_model_name_from_path(model_path)
tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, None, model_name, use_flash_attn=True)
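# build the prompt: prepend the image placeholder token(s) to the question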
if model.config.mm_use_im_start_end:
prompt = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN + "\n" + question
else:
prompt = DEFAULT_IMAGE_TOKEN + "\n" + question
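# wrap the prompt in the qwen_1_5 conversation template used at training time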
conv = conv_templates["qwen_1_5"].copy()
conv.append_message(conv.roles[0], prompt)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
# read and process video
video = load_video(video_path, max_frames_num=max_frames)
video_tensor = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].half().cuda()
videos = [video_tensor]
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt')
input_ids = input_ids.to(device='cuda', non_blocking=True).unsqueeze(dim=0)
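# generate the response; the sampled video frames are passed via the `images` argument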
with torch.inference_mode():
output_ids = model.generate(
input_ids,
images=videos,
do_sample=True,
max_new_tokens=1024,
num_beams=1,
temperature=0.2,
top_p=1.0,
use_cache=True)
outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(f"User input: {question}\n")
print(outputs)
We have an online demo available here. You can also run it locally by executing:
python gradio_demo.py \
--model-path ${MODEL_CKPT}
We evaluate model performance using lmms-eval. To ensure consistency, we pin the version and align the system prompt across different benchmarks.
Use the following scripts to run the evaluation:
export HF_HOME=$(realpath ~/.cache/huggingface)
python -m accelerate.commands.launch \
--num_processes=8 \
lmms_eval_evaluate.py \
--model slowfast_videomllm \
--model_args pretrained=${MODEL_PATH},video_decode_backend=pyav,conv_template=qwen_1_5,num_frames=${TEST_FRAMES},device_map='' \
--tasks ${TASK_NAME} \
--batch_size 1 \
--log_samples \
--log_samples_suffix ${TASK_NAME}_slowfastvideomllm_ \
--output_path ./logs/
You can modify `${MODEL_PATH}`, `${TEST_FRAMES}`, and `${TASK_NAME}` to evaluate different models and tasks. All evaluation scripts are organized under `scripts/lmms_eval_evaluate` for reference.
We are currently organizing the code and will release it soon!
Usage and License Notices: This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses, including but not limited to the OpenAI Terms of Use for the datasets and the specific licenses of the base language models. This project does not impose any additional constraints beyond those stipulated in the original licenses. Furthermore, users are reminded to ensure that their use of the datasets and checkpoints complies with all applicable laws and regulations.
- We would like to thank the Hugging Face team for their Zero GPU support for our demo.
- LLaVA: one of the codebases we built upon.
- Eagle: one of the codebases we built upon.
- OpenFlamingo: our hybrid decoder layer implementation borrows some code from OpenFlamingo.
- mPLUG-Owl3: we borrow some of their code to implement the hybrid decoder layer.
- LLaVA-Video-178K: we train our model with data from LLaVA-Video-178K.
- VideoChat2: we train our model with part of the data organized by VideoChat2.
- LMMs-Eval: many thanks to the LMMs-Lab for their wonderful and easy-to-use evaluation tools!