Official GitHub Repo | Website | Model | Demo | Paper
This repository provides tools and scripts for training and fine-tuning Lightricks' LTX-Video (LTXV) model. It allows training LoRAs on top of LTX-Video, as well as fine-tuning the entire model on custom datasets. The repository also includes auxiliary utilities for preprocessing datasets, captioning videos, splitting scenes, etc.
- Getting Started
- Dataset Preparation
- Training Configuration
- Running the Trainer
- Using Trained LoRAs in ComfyUI
- Example LoRAs
- Contributing
- Acknowledgements
First, install uv if you haven't already. Then clone the repository and install the dependencies:
git clone https://github.com/Lightricks/LTX-Video-Trainer
cd LTX-Video-Trainer
uv sync
source .venv/bin/activate
Follow the steps below to prepare your dataset and configure your training job.
This section describes the workflow for preparing and preprocessing your dataset for training. The general flow is:
- (Optional) Split long videos into scenes using split_scenes.py
- (Optional) Generate captions for your videos using caption_videos.py
- Preprocess your dataset using preprocess_dataset.py to compute and cache video latents and text embeddings
- Run the trainer with your preprocessed dataset
If you're starting with long-form videos (e.g., downloaded from YouTube), you should first split them into shorter, coherent scenes:
# Split a long video into scenes
python scripts/split_scenes.py input.mp4 scenes_output_dir/ \
--filter-shorter-than 5s
This will create multiple video clips in scenes_output_dir.
These clips will be the input for the captioning step, if you choose to use it.
If your dataset doesn't include captions, you can automatically generate them using vision-language models. Use the directory containing your video clips (either from step 1, or your own clips):
# Generate captions for all videos in the scenes directory
python scripts/caption_videos.py scenes_output_dir/ \
--output scenes_output_dir/captions.json \
--captioner-type llava_next_7b
This will create a captions.json file containing the video paths and their captions. This JSON file will be used as input for the data preprocessing step.
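As a quick sanity check, you can peek at the generated file with a few lines of Python. This is only a minimal sketch: it assumes captions.json is a list of records carrying the media_path and caption columns used by the preprocessing step below, so open the file to confirm your schema matches.

# Inspect the generated captions (schema assumed: a list of records with
# "media_path" and "caption" fields -- adjust if your file differs)
import json

with open("scenes_output_dir/captions.json") as f:
    records = json.load(f)

print(f"{len(records)} captioned clips")
for record in records[:3]:
    print(record["media_path"], "->", record["caption"][:80])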
This step preprocesses your video dataset by:
- Resizing and cropping videos to fit specified resolution buckets
- Computing and caching video latent representations
- Computing and caching text embeddings for captions
Using the captions.json file generated in step 2:
# Preprocess the dataset using the generated captions.json
python scripts/preprocess_dataset.py scenes_output_dir/captions.json \
--resolution-buckets "768x768x25" \
--caption-column "caption" \
--video-column "media_path"
The preprocessing significantly accelerates training and reduces GPU memory usage.
Videos are organized into "buckets" of specific dimensions (width × height × frames). Each video is assigned to the nearest matching bucket. Currently, the trainer only supports using a single resolution bucket.
The dimensions of each bucket must follow these constraints due to LTX-Video's VAE architecture:
- Spatial dimensions (width and height) must be multiples of 32
- Number of frames must be a multiple of 8 plus 1 (e.g., 25, 33, 41, 49, etc.)
Guidelines for choosing training resolution:
- For high-quality, detailed videos: use larger spatial dimensions (e.g. 768x448) with fewer frames (e.g. 89)
- For longer, motion-focused videos: use smaller spatial dimensions (e.g. 512x512) with more frames (e.g. 121)
- Memory usage increases with both spatial and temporal dimensions
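Before preprocessing, it can help to sanity-check a candidate bucket against the constraints above. A minimal sketch (not part of the repository's scripts):

# Validate a "WxHxF" bucket string: spatial dims must be multiples of 32,
# frame count must be a multiple of 8 plus 1
def validate_bucket(bucket: str) -> None:
    width, height, frames = (int(v) for v in bucket.lower().split("x"))
    assert width % 32 == 0 and height % 32 == 0, "width/height must be multiples of 32"
    assert (frames - 1) % 8 == 0, "frames must be a multiple of 8 plus 1 (25, 33, 41, ...)"

validate_bucket("768x768x25")   # passes
validate_bucket("768x448x89")   # passes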
Example usage:
python scripts/preprocess_dataset.py /path/to/dataset \
--resolution-buckets "768x768x25"
This creates a bucket with:
- 768×768 resolution
- 25 frames
Videos are processed as follows:
- Videos are resized maintaining aspect ratio until either width or height matches the target (768 in this example)
- The larger dimension is center cropped to match the bucket's dimensions
- Frames are sampled uniformly to match the bucket's frame count (25 in this example)
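To make the resize/crop/sampling behavior concrete, here is an illustrative sketch of the arithmetic; the real implementation lives in preprocess_dataset.py and may differ in details such as rounding.

# Illustrative only: compute how a source video maps onto a 768x768x25 bucket
def fit_to_bucket(src_w, src_h, src_frames, bucket_w=768, bucket_h=768, bucket_f=25):
    # Resize, preserving aspect ratio, until one side matches the target
    scale = max(bucket_w / src_w, bucket_h / src_h)
    resized_w, resized_h = round(src_w * scale), round(src_h * scale)

    # Center-crop the larger dimension down to the bucket size
    crop_left = (resized_w - bucket_w) // 2
    crop_top = (resized_h - bucket_h) // 2

    # Sample frame indices uniformly to reach the bucket's frame count
    frame_indices = [round(i * (src_frames - 1) / (bucket_f - 1)) for i in range(bucket_f)]
    return (resized_w, resized_h), (crop_left, crop_top), frame_indices

print(fit_to_bucket(1920, 1080, 120))  # e.g. a 1080p, 120-frame clip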
Note
The sequence length processed by the transformer model can be calculated as:
sequence_length = (H/32) * (W/32) * ((F-1)/8 + 1)
Where:
- H = Height of the video in pixels
- W = Width of the video in pixels
- F = Number of frames
- 32 = VAE's spatial downsampling factor
- 8 = VAE's temporal downsampling factor
For example, a 768×448×89 video would have sequence length:
(768/32) * (448/32) * ((89-1)/8 + 1) = 24 * 14 * 12 = 4,032
Keep this in mind when choosing video dimensions, as longer sequences require more memory and computation power.
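If you want to compare candidate buckets quickly, a small helper implementing the formula above:

# Transformer sequence length for a WxHxF video, per the formula above
def sequence_length(width: int, height: int, frames: int) -> int:
    return (height // 32) * (width // 32) * ((frames - 1) // 8 + 1)

print(sequence_length(768, 448, 89))   # 24 * 14 * 12 = 4032
print(sequence_length(768, 768, 25))   # 24 * 24 * 4  = 2304
print(sequence_length(512, 512, 121))  # 16 * 16 * 16 = 4096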
Warning
While the preprocessing script supports multiple buckets, the trainer currently only works with a single resolution bucket. Please ensure you specify just one bucket in your preprocessing command.
The trainer supports training on either videos or single images. Note that your dataset must be homogeneous: either all videos or all images; mixing is not supported. When using images, follow the same preprocessing steps and format requirements as with videos, simply providing image files instead of video files. Two dataset formats are supported:
- Directory with text files (a short generation sketch follows this list):
dataset/
├── captions.txt # One caption per line
└── video_paths.txt # One video path per line
python scripts/preprocess_dataset.py dataset/ \
--caption-column captions \
--video-column video_paths
- Single metadata file:
# Using CSV/JSON/JSONL, e.g.
python scripts/preprocess_dataset.py dataset.json \
--caption-column "caption" \
--video-column "video_path" \
--model-source "LTXV_2B_0.9.5" # Optional: pin a model version; defaults to the latest
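If you assemble the text-file layout from the first option programmatically, a minimal sketch (the paths and captions here are placeholders; the file name stems must match the --caption-column and --video-column values you pass):

# Write captions.txt and video_paths.txt with matching line order
from pathlib import Path

pairs = [  # placeholder (video path, caption) pairs
    ("videos/clip_001.mp4", "A person decorates a cake with pink frosting"),
    ("videos/clip_002.mp4", "A hand squishes a soft rubber toy"),
]

dataset_dir = Path("dataset")
dataset_dir.mkdir(exist_ok=True)
(dataset_dir / "captions.txt").write_text("\n".join(c for _, c in pairs) + "\n")
(dataset_dir / "video_paths.txt").write_text("\n".join(p for p, _ in pairs) + "\n")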
The preprocessed data is saved in a .precomputed directory:
dataset/
└── .precomputed/
    ├── latents/ # Cached video latents
    └── conditions/ # Cached text embeddings
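To peek at what was cached, you can load one of the .pt latent files with torch. The exact contents of each file (a bare tensor vs. a dictionary of tensors) are an assumption here, so print and inspect:

# Inspect one cached latent file (contents/shape are implementation details,
# so just print whatever is stored)
from pathlib import Path
import torch

latent_file = next(Path("dataset/.precomputed/latents").glob("*.pt"))
data = torch.load(latent_file, map_location="cpu")

if isinstance(data, dict):
    for key, value in data.items():
        print(key, getattr(value, "shape", type(value)))
else:
    print(type(data), getattr(data, "shape", None))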
When training a LoRA, you can specify a trigger token that will be prepended to all captions:
python scripts/preprocess_dataset.py /path/to/dataset \
--resolution-buckets "1024x576x65" \
--id-token "<TOK>"
This acts as a trigger word that activates the LoRA during inference when you include the same token in your prompts.
By providing the --decode-videos flag, the script will also VAE-decode the precomputed latents and save the resulting videos under .precomputed/decoded_videos, so you can visually inspect what the cached latents contain. This is useful for debugging and for ensuring that your dataset is being processed correctly.
# Preprocess dataset and decode videos for verification
python scripts/preprocess_dataset.py /path/to/dataset \
--resolution-buckets "768x768x25" \
--decode-videos
Single-frame images are saved as PNG files instead of MP4.
The trainer uses structured Pydantic models for configuration, making it easy to customize training parameters.
The main configuration class is LtxvTrainerConfig, which includes:
- ModelConfig: Base model and training mode settings
- LoraConfig: LoRA fine-tuning parameters
- OptimizationConfig: Learning rates, batch sizes, and scheduler settings
- AccelerationConfig: Mixed precision and optimization settings
- DataConfig: Data loading parameters
- ValidationConfig: Validation and inference settings
- CheckpointsConfig: Checkpoint saving frequency and retention settings
- FlowMatchingConfig: Timestep sampling parameters
Check out our example configurations in the configs directory. You can use these as templates for your training runs:
- Full Model Fine-tuning Example
- LoRA Fine-tuning Example
- LoRA Fine-tuning Example (Low VRAM) - Optimized for GPUs with 24GB VRAM.
After preprocessing your dataset and preparing a configuration file, you can start training using the trainer script:
# Train a LoRA
python scripts/train.py configs/lora_example.yaml
# Fine-tune the full model
python scripts/train.py configs/full_example.yaml
The trainer loads your configuration, initializes models, applies optimizations, runs the training loop with progress tracking, generates validation videos (if configured), and saves the trained weights.
For LoRA training, the weights will be saved as lora_weights.safetensors in your output directory.
For full model fine-tuning, the weights will be saved as model_weights.safetensors.
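If you want to sanity-check the saved weights before converting or loading them, a minimal sketch using the safetensors library (the parameter names inside the file depend on the model and training mode; adjust the path to your output directory):

# List a few tensors from the trained LoRA checkpoint
from safetensors.torch import load_file

state_dict = load_file("outputs/lora_weights.safetensors")
print(f"{len(state_dict)} tensors")
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape), tensor.dtype)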
For a streamlined experience, you can use run_pipeline.py, which automates the entire training workflow. Provide it with a configuration template, which will be instantiated based on the values you pass and your media files.
python scripts/run_pipeline.py [LORA_BASE_NAME] \
--resolution-buckets "768x768x49" \
--config-template configs/ltxv_2b_lora_template.yaml \
--rank 32
This script will:
- Process raw videos in the [basename]_raw/ directory (if they exist):
  - Split long videos into scenes
  - Save scenes to [basename]_scenes/
- Generate captions for the scenes (if scenes exist):
  - Uses LLaVA-Next-7B for captioning
  - Saves captions to [basename]_scenes/captions.json
- Preprocess the dataset:
  - Computes and caches video latents
  - Computes and caches text embeddings
  - Decodes videos for verification
- Run the training:
  - Uses the provided config template
  - Automatically extracts validation prompts from captions
  - Saves the final model weights
- Convert the LoRA to ComfyUI format:
  - Automatically converts the trained LoRA weights to ComfyUI format
  - Saves the converted weights with a "_comfy" suffix
Required arguments:
- basename: Base name for your project (e.g., "slime")
- --resolution-buckets: Video resolution in the format "WxHxF" (e.g., "768x768x49")
- --config-template: Path to your configuration template file
- --rank: LoRA rank (1-128) for training
The script will create the following directory structure:
[basename]_raw/ # Place your raw videos here
[basename]_scenes/ # Split scenes and captions
└── .precomputed/ # Preprocessed data
    ├── latents/ # Cached video latents
    ├── conditions/ # Cached text embeddings
    └── decoded_videos/ # Decoded videos for verification
outputs/ # Training outputs and checkpoints
└── lora_weights_comfy.safetensors # ComfyUI-compatible LoRA weights
After training your LoRA, you can use it in ComfyUI by following these steps:
- Convert your trained LoRA weights to ComfyUI format using the conversion script:
python scripts/convert_checkpoint.py your_lora_weights.safetensors --to-comfy
- Copy the converted LoRA weights (.safetensors file) to the models/loras folder in your ComfyUI installation.
- In your ComfyUI workflow:
  - Use the built-in "Load LoRA" node to load your LoRA file
  - Connect it to your LTXV nodes to apply the LoRA to your generation
You can find reference Text-to-Video (T2V) and Image-to-Video (I2V) workflows in the official LTXV ComfyUI repository.
Here are some example LoRAs trained using this trainer, along with their training datasets:
The Cakeify LoRA transforms videos to make objects appear as if they're made of cake. The effect was trained on the Cakeify Dataset.
The Squish LoRA creates a playful squishing effect on objects in videos. It was trained on the Squish Dataset, which contains just 5 example videos.
These examples demonstrate how you can train specialized video effects using this trainer. Feel free to use these datasets as references for preparing your own training data.
Using scripts/convert_checkpoint.py, you can convert your saved LoRA file from the diffusers library format to the ComfyUI format.
# Convert from diffusers to ComfyUI format
python scripts/convert_checkpoint.py input.safetensors --to-comfy --output_path output.safetensors
# Convert from ComfyUI to diffusers format
python scripts/convert_checkpoint.py input.safetensors --output_path output.safetensors
If no output path is specified, the script will automatically generate one by adding a _comfy or _diffusers suffix to the input filename.
Using scripts/decode_latents.py, you can decode precomputed video latents back into video files. This is useful for verifying the quality of your preprocessed dataset or debugging the preprocessing pipeline.
# Basic usage
python scripts/decode_latents.py /path/to/latents/dir --output-dir /path/to/output
The script will:
- Load the VAE model from the specified path
- Process all .pt latent files in the input directory
- Decode each latent back into a video using the VAE
- Save the resulting videos as MP4 files in the output directory
We welcome contributions from the community! Here's how you can help:
- Share Your Work: If you've trained interesting LoRAs or achieved cool results, please share them with the community.
- Report Issues: Found a bug or have a suggestion? Open an issue on GitHub.
- Submit PRs: Help improve the codebase with bug fixes or general improvements.
- Feature Requests: Have ideas for new features? Let us know through GitHub issues.
Parts of this project are inspired by and incorporate ideas from several awesome open-source projects.
Happy training!