[arXiv] [Project Page] [Dataset]
Jianhong Bai1*, Menghan Xia2†, Xiao Fu3, Xintao Wang2, Lianrui Mu1, Jinwen Cao2,
Zuozhu Liu1, Haoji Hu1†, Xiang Bai4, Pengfei Wan2, Di Zhang2
(*Work done during an internship at KwaiVGI, Kuaishou Technology. †Corresponding authors.)
1Zhejiang University, 2Kuaishou Technology, 3CUHK, 4HUST.
Important Note: This open-source repository is intended to provide a reference implementation. Due to differences in the underlying T2V model's performance, the open-source version may not achieve the same performance as the model in our paper. If you'd like to use the best version of ReCamMaster, please upload your video to this link. Additionally, we are working on an online trial website. Please stay tuned for updates on the Kling website.
- [2025.04.09]: Release the training and inference code and the model checkpoint.
- [2025.03.31]: Release the MultiCamVideo Dataset.
- [2025.03.31]: We have sent the inference results to the first 1,000 trial users.
- [2025.03.17]: Release the project page and the try-out link.
TL;DR: We propose ReCamMaster to re-capture in-the-wild videos with novel camera trajectories. We also release a multi-camera synchronized video dataset rendered with Unreal Engine 5.
TEASER_compressed.mp4
Update: We are actively processing the videos uploaded by users. So far, we have sent the inference results to the email addresses of the first 1,180 testers. You should receive an email titled "Inference Results of ReCamMaster" from either [email protected] or [email protected]. Please also check your spam folder, and let us know if the email has not arrived after some time. If you enjoyed the videos we created, please consider giving us a star ⭐.
You can try out our ReCamMaster by uploading your own video to this link, which will generate a video with camera movements along a new trajectory. We will send the mp4 file generated by ReCamMaster to your inbox as soon as possible. For camera movement trajectories, we offer 10 basic camera trajectories as follows:
Index | Basic Trajectory |
---|---|
1 | Pan Right |
2 | Pan Left |
3 | Tilt Up |
4 | Tilt Down |
5 | Zoom In |
6 | Zoom Out |
7 | Translate Up (with rotation) |
8 | Translate Down (with rotation) |
9 | Arc Left (with rotation) |
10 | Arc Right (with rotation) |
If you would like to use ReCamMaster as a baseline and need qualitative or quantitative comparisons, please feel free to drop an email to [email protected]. We can assist you with batch inference of our model.
The model utilized in our paper is an internally developed T2V model, not Wan2.1. Due to company policy restrictions, we are unable to open-source the model used in the paper. Consequently, we migrated ReCamMaster to Wan2.1 to validate the effectiveness of our method. Due to differences in the underlying T2V model, you may not achieve the same results as demonstrated in the demo.
Step 1: Set up the environment
DiffSynth-Studio requires Rust and Cargo to compile extensions. You can install them using the following commands:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
. "$HOME/.cargo/env"
Install DiffSynth-Studio:
git clone https://github.com/KwaiVGI/ReCamMaster.git
cd ReCamMaster
pip install -e .
Step 2: Download the pretrained checkpoints
- Download the pre-trained Wan2.1 models
cd ReCamMaster
python download_wan2.1.py
- Download the pre-trained ReCamMaster checkpoint
Please download the ReCamMaster checkpoint from Hugging Face and place it in models/ReCamMaster/checkpoints.
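If you prefer to script this step, here is a minimal sketch using huggingface_hub. The repository ID below is a placeholder — replace it with the ReCamMaster checkpoint repository referenced in the step above.

```python
# Hypothetical download helper: REPO_ID is a placeholder, replace it with
# the ReCamMaster checkpoint repository listed on the Hugging Face page.
from pathlib import Path

from huggingface_hub import snapshot_download

REPO_ID = "ORG/ReCamMaster-checkpoint"  # placeholder
TARGET_DIR = Path("models/ReCamMaster/checkpoints")

TARGET_DIR.mkdir(parents=True, exist_ok=True)
# Download every file in the repo into the folder named in the step above.
snapshot_download(repo_id=REPO_ID, local_dir=TARGET_DIR)
```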
Step 3: Test the example videos
python inference_recammaster.py --cam_type 1
Step 4: Test your own videos
If you want to test your own videos, you need to prepare your test data following the structure of the example_test_data
folder. This includes N mp4 videos, each with at least 81 frames, and a metadata.csv
file that stores their paths and corresponding captions. You can refer to the Prompt Extension section in Wan2.1 for guidance on preparing video captions.
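For reference, below is a minimal sketch of assembling such a metadata.csv with pandas. The column names are assumptions for illustration; copy the exact header from the metadata.csv shipped in example_test_data.

```python
# Illustrative only: the column names are assumptions, copy them from the
# metadata.csv in example_test_data before running real inference.
from pathlib import Path

import pandas as pd

video_dir = Path("path/to/your/data")  # hypothetical folder of mp4 files
rows = []
for video_path in sorted(video_dir.glob("*.mp4")):
    rows.append({
        "video_path": str(video_path),                        # assumed column name
        "caption": "A person walks through a sunlit park.",   # replace with a real caption
    })

pd.DataFrame(rows).to_csv(video_dir / "metadata.csv", index=False)
```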
python inference_recammaster.py --cam_type 1 --dataset_path path/to/your/data
We provide several preset camera types, as shown in the table below. You can also generate new trajectories for testing; a sketch of authoring one is given after the table.
cam_type | Trajectory |
---|---|
1 | Pan Right |
2 | Pan Left |
3 | Tilt Up |
4 | Tilt Down |
5 | Zoom In |
6 | Zoom Out |
7 | Translate Up (with rotation) |
8 | Translate Down (with rotation) |
9 | Arc Left (with rotation) |
10 | Arc Right (with rotation) |
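If you want to author a new trajectory, the sketch below generates a simple pan-right sequence of 4x4 camera-to-world matrices with NumPy. It illustrates only the geometry; the on-disk format and coordinate conventions expected by inference_recammaster.py are assumptions, so align the output with the provided camera files (e.g. camera_extrinsics.json in the MultiCamVideo Dataset).

```python
# Illustrative geometry only: the exact file format and coordinate
# conventions expected by the inference script are not reproduced here.
import json

import numpy as np


def pan_right_trajectory(num_frames: int = 81, total_angle_deg: float = 30.0):
    """Return a list of 4x4 camera-to-world matrices that pan to the right."""
    matrices = []
    for i in range(num_frames):
        theta = np.deg2rad(total_angle_deg * i / (num_frames - 1))
        # Rotation about the world up-axis; the camera position stays fixed.
        rotation = np.array([
            [np.cos(theta), 0.0, np.sin(theta)],
            [0.0,           1.0, 0.0],
            [-np.sin(theta), 0.0, np.cos(theta)],
        ])
        c2w = np.eye(4)
        c2w[:3, :3] = rotation
        matrices.append(c2w)
    return matrices


if __name__ == "__main__":
    traj = pan_right_trajectory()
    # Dump as nested lists; rename keys / reshape to whatever the
    # inference code actually expects.
    with open("custom_trajectory.json", "w") as f:
        json.dump({"frames": [m.tolist() for m in traj]}, f)
```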
Step 1: Set up the environment
pip install lightning pandas websockets
Step 2: Prepare the training dataset
- Download the MultiCamVideo dataset.
- Extract VAE features
CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" python train_recammaster.py --task data_process --dataset_path path/to/the/MultiCamVideo/Dataset --output_path ./models --text_encoder_path "models/Wan-AI/Wan2.1-T2V-1.3B/models_t5_umt5-xxl-enc-bf16.pth" --vae_path "models/Wan-AI/Wan2.1-T2V-1.3B/Wan2.1_VAE.pth" --tiled --num_frames 81 --height 480 --width 832 --dataloader_num_workers 2
- Generate Captions for Each Video
You can use video captioning tools such as LLaVA to generate captions for each video and store them in the metadata.csv file.
Step 3: Training
CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" python train_recammaster.py --task train --dataset_path recam_train_data --output_path ./models/train --dit_path "models/Wan-AI/Wan2.1-T2V-1.3B/diffusion_pytorch_model.safetensors" --steps_per_epoch 8000 --max_epochs 100 --learning_rate 1e-4 --accumulate_grad_batches 1 --use_gradient_checkpointing --dataloader_num_workers 4
We did not search for the optimal hyper-parameters and trained with a batch size of 1 on each GPU. You may achieve better performance by tuning hyper-parameters such as the learning rate and by increasing the batch size.
Step 4: Test the model
python inference_recammaster.py --cam_type 1 --ckpt_path path/to/the/checkpoint
TL;DR: The MultiCamVideo Dataset is a multi-camera synchronized video dataset rendered using Unreal Engine 5. It includes synchronized multi-camera videos and their corresponding camera trajectories. The MultiCamVideo Dataset can be valuable in fields such as camera-controlled video generation, synchronized video production, and 3D/4D reconstruction.
datashowcase.mp4
The MultiCamVideo Dataset is a multi-camera synchronized video dataset rendered using Unreal Engine 5. It includes synchronized multi-camera videos and their corresponding camera trajectories. It consists of 13.6K different dynamic scenes, each captured by 10 cameras, resulting in a total of 136K videos. Each dynamic scene is composed of four elements: {3D environment, character, animation, camera}. Specifically, we use an animation to drive a character and position the animated character within a 3D environment; time-synchronized cameras then move along predefined trajectories to render the multi-camera video data.
3D Environment: We collect 37 high-quality 3D environment assets from Fab. To minimize the domain gap between rendered data and real-world videos, we primarily select visually realistic 3D scenes, while choosing a few stylized or surreal 3D scenes as a supplement. To ensure data diversity, the selected scenes cover a variety of indoor and outdoor settings, such as city streets, shopping malls, cafes, office rooms, and the countryside.
Character: We collect 66 different human 3D models as characters from Fab and Mixamo.
Animation: We collect 93 different animations from Fab and Mixamo, including common actions such as waving, dancing, and cheering. We use these animations to drive the collected characters and create diverse datasets through various combinations.
Camera: To ensure camera movements are diverse and closely resemble real-world distributions, we create a wide range of camera trajectories and parameters to cover various situations. We achieve this by designing rules to batch-generate random camera starting positions and movement trajectories:
- Camera Starting Position.
We take the character's position as the center of a hemisphere whose radius is chosen from {3m, 5m, 7m, 10m} according to the size of the 3D scene, and randomly sample the camera's starting point within this hemisphere, ensuring that its distance to the character is greater than 0.5m and that the pitch angle is within 45 degrees (a sampling sketch is given after this list).
- Camera Trajectories.
  - Pan & Tilt: The camera rotation angles are randomly selected within a range, with pan angles from 5 to 45 degrees and tilt angles from 5 to 30 degrees; the direction is randomly chosen as left/right or up/down.
  - Basic Translation: The camera translates along the positive or negative direction of the x, y, or z axis, with the movement distance randomly selected within $[\frac{1}{4}, 1] \times \text{distance2character}$.
  - Basic Arc Trajectory: The camera moves along an arc, with the rotation angle randomly selected within 15 to 75 degrees.
  - Random Trajectories: 1 to 3 points are sampled in space, and the camera moves from its initial position through these points; the total movement distance is randomly selected within $[\frac{1}{4}, 1] \times \text{distance2character}$, and the polyline is smoothed to make the movement more natural.
  - Static Camera: The camera neither translates nor rotates during shooting, maintaining a fixed position.
- Camera Movement Speed.
To further enhance the diversity of trajectories, 50% of the training data uses constant-speed camera trajectories, while the other 50% uses variable-speed trajectories: the camera follows the same path, but its progress along the path is remapped by a nonlinear function of time so that the speed varies over the clip (the sketch after this list shows one generic easing curve for illustration).
- Camera Parameters.
We choose four sets of camera parameters: {focal=18mm, aperture=10}, {focal=24mm, aperture=5}, {focal=35mm, aperture=2.4}, and {focal=50mm, aperture=2.4}.
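To make the sampling rules above concrete, here is a minimal Python sketch. It assumes the character sits at the origin with z as the up-axis and uses a generic smoothstep easing curve for the variable-speed case; the exact nonlinear functions used to build the dataset are not reproduced here.

```python
# Illustrative sketch of the sampling rules described above; the exact
# functions used to build the dataset are not reproduced here.
import numpy as np

rng = np.random.default_rng(0)


def sample_camera_start(radius_options=(3.0, 5.0, 7.0, 10.0),
                        min_distance=0.5, max_pitch_deg=45.0):
    """Sample a camera start point on a hemisphere centered on the character."""
    radius = rng.choice(radius_options)  # hemisphere radius, picked per scene
    while True:
        # Random direction in the upper hemisphere (character at the origin, z up).
        direction = rng.normal(size=3)
        direction[2] = abs(direction[2])
        direction /= np.linalg.norm(direction)
        distance = rng.uniform(min_distance, radius)
        pitch_deg = np.degrees(np.arcsin(direction[2]))
        # Reject samples that violate the distance or pitch constraints.
        if distance > min_distance and pitch_deg <= max_pitch_deg:
            return distance * direction


def eased_progress(num_frames=81):
    """Variable-speed progress curve: smoothstep remap of uniform time."""
    t = np.linspace(0.0, 1.0, num_frames)
    return 3.0 * t**2 - 2.0 * t**3  # accelerates, then decelerates


start = sample_camera_start()
end = start + np.array([1.0, 0.0, 0.0])  # toy end point for illustration
progress = eased_progress()
positions = start[None, :] + progress[:, None] * (end - start)[None, :]
```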
Dataset Statistics:
Number of Dynamic Scenes | Cameras per Scene | Total Videos |
---|---|---|
13,600 | 10 | 136,000 |
Video Configurations:
Resolution | Frame Number | FPS |
---|---|---|
1280x1280 | 81 | 15 |
Note: You can center-crop the videos to adjust the aspect ratio to fit your video generation model, such as 16:9, 9:16, 4:3, or 3:4.
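For example, a minimal center-crop sketch with Pillow (the frame path is hypothetical; 832x480 matches the resolution used in the data-processing command in the training section):

```python
# Illustrative: center-crop a 1280x1280 frame to 16:9, then resize.
from PIL import Image

frame = Image.open("cam01_frame0000.png")      # hypothetical extracted frame
src_w, src_h = frame.size                      # 1280 x 1280

target_aspect = 16 / 9
crop_w, crop_h = src_w, int(round(src_w / target_aspect))  # 1280 x 720
left = (src_w - crop_w) // 2
top = (src_h - crop_h) // 2

cropped = frame.crop((left, top, left + crop_w, top + crop_h))
resized = cropped.resize((832, 480))           # e.g. the training resolution
resized.save("cam01_frame0000_16x9.png")
```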
Camera Configurations:
Focal Length | Aperture | Sensor Height | Sensor Width |
---|---|---|---|
18mm, 24mm, 35mm, 50mm | 10.0, 5.0, 2.4 | 23.76mm | 23.76mm |
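If you need the field of view or pixel-space intrinsics from these parameters, the standard pinhole relations FOV = 2·arctan(sensor / (2·focal)) and fx = focal / sensor_width · image_width apply; a small sketch:

```python
# Field of view and focal length in pixels for the camera configurations above
# (square 23.76mm sensor, so horizontal and vertical values are equal).
import math

SENSOR_MM = 23.76
IMAGE_WIDTH_PX = 1280

for focal_mm in (18, 24, 35, 50):
    fov_deg = math.degrees(2 * math.atan(SENSOR_MM / (2 * focal_mm)))
    fx_px = focal_mm / SENSOR_MM * IMAGE_WIDTH_PX
    print(f"focal {focal_mm}mm -> FOV {fov_deg:.1f} deg, fx {fx_px:.1f} px")
```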
MultiCamVideo-Dataset
├── train
│   ├── f18_aperture10
│   │   ├── scene1                          # one dynamic scene
│   │   │   ├── videos
│   │   │   │   ├── cam01.mp4               # synchronized 81-frame videos at 1280x1280 resolution
│   │   │   │   ├── cam02.mp4
│   │   │   │   ├── ...
│   │   │   │   └── cam10.mp4
│   │   │   └── cameras
│   │   │       └── camera_extrinsics.json  # 81-frame camera extrinsics of the 10 cameras
│   │   ├── ...
│   │   └── scene3400
│   ├── f24_aperture5
│   │   ├── scene1
│   │   ├── ...
│   │   └── scene3400
│   ├── f35_aperture2.4
│   │   ├── scene1
│   │   ├── ...
│   │   └── scene3400
│   └── f50_aperture2.4
│       ├── scene1
│       ├── ...
│       └── scene3400
└── val
    └── 10basic_trajectories
        ├── videos
        │   ├── cam01.mp4                   # example videos corresponding to the validation cameras
        │   ├── cam02.mp4
        │   ├── ...
        │   └── cam10.mp4
        └── cameras
            └── camera_extrinsics.json      # 10 different trajectories for validation
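A minimal sketch for iterating the extracted dataset and loading the per-scene extrinsics, assuming the layout above; the key layout inside camera_extrinsics.json is not assumed here, so the snippet only prints what it finds — inspect one file before processing the full set.

```python
# Illustrative walk over the extracted dataset layout shown above.
import json
from pathlib import Path

root = Path("MultiCamVideo-Dataset/train/f18_aperture10")  # one focal/aperture split

for scene_dir in sorted(root.glob("scene*")):
    videos = sorted((scene_dir / "videos").glob("cam*.mp4"))
    extrinsics_path = scene_dir / "cameras" / "camera_extrinsics.json"
    with open(extrinsics_path) as f:
        extrinsics = json.load(f)
    # Print the first few top-level keys to inspect the JSON structure.
    print(scene_dir.name, len(videos), "videos, keys:", list(extrinsics)[:3])
    break  # remove to process every scene
```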
- Data Extraction
cat MultiCamVideo-Dataset.part* > MultiCamVideo-Dataset.tar.gz
tar -xzvf MultiCamVideo-Dataset.tar.gz
- Camera Visualization
python vis_cam.py
The visualization script is modified from CameraCtrl; thanks for their inspiring work.
Feel free to explore these outstanding related works, including but not limited to:
GCD: GCD synthesizes large-angle novel viewpoints of 4D dynamic scenes from a monocular video.
ReCapture: a method for generating new videos with novel camera trajectories from a single user-provided video.
Trajectory Attention: Trajectory Attention facilitates various tasks like camera motion control on images and videos, and video editing.
GS-DiT: GS-DiT provides 4D video control for a single monocular video.
Diffusion as Shader: a versatile video generation control model for various tasks.
TrajectoryCrafter: TrajectoryCrafter achieves high-fidelity novel views generation from casually captured monocular video.
GEN3C: a generative video model with precise Camera Control and temporal 3D Consistency.
Please leave us a star ⭐ and cite our paper if you find our work helpful.
@misc{bai2025recammaster,
title={ReCamMaster: Camera-Controlled Generative Rendering from A Single Video},
author={Jianhong Bai and Menghan Xia and Xiao Fu and Xintao Wang and Lianrui Mu and Jinwen Cao and Zuozhu Liu and Haoji Hu and Xiang Bai and Pengfei Wan and Di Zhang},
year={2025},
eprint={2503.11647},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.11647},
}