LTX-Video Community Trainer

Official GitHub Repo | Website | Model | Demo | Paper

This repository provides tools and scripts for training and fine-tuning Lightricks' LTX-Video (LTXV) model. It allows training LoRAs on top of LTX-Video, as well as fine-tuning the entire model on custom datasets. The repository also includes auxiliary utilities for preprocessing datasets, captioning videos, splitting scenes, etc.


📚 Table of Contents

  • Getting Started
  • Dataset Preparation
  • Training Configuration
  • Running the Trainer
  • Running the Complete Pipeline
  • Using Trained LoRAs in ComfyUI
  • Example LoRAs
  • Utility Scripts
  • Contributing
  • Acknowledgements

🚀 Getting Started

First, install uv if you haven't already. Then clone the repository and install the dependencies:

git clone https://github.com/Lightricks/LTX-Video-Trainer
cd LTX-Video-Trainer
uv sync
source .venv/bin/activate

Follow the steps below to prepare your dataset and configure your training job.


🎬 Dataset Preparation

This section describes the workflow for preparing and preprocessing your dataset for training. The general flow is:

  1. (Optional) Split long videos into scenes using split_scenes.py
  2. (Optional) Generate captions for your videos using caption_videos.py
  3. Preprocess your dataset using preprocess_dataset.py to compute and cache video latents and text embeddings
  4. Run the trainer with your preprocessed dataset

1. Split Scenes (split_scenes.py)

If you're starting with long-form videos (e.g., downloaded from YouTube), you should first split them into shorter, coherent scenes:

# Split a long video into scenes
python scripts/split_scenes.py input.mp4 scenes_output_dir/ \
    --filter-shorter-than 5s

This will create multiple video clips in scenes_output_dir. These clips will be the input for the captioning step, if you choose to use it.

2. Caption Videos (caption_videos.py)

If your dataset doesn't include captions, you can automatically generate them using vision-language models. Use the directory containing your video clips (either from step 1, or your own clips):

# Generate captions for all videos in the scenes directory
python scripts/caption_videos.py scenes_output_dir/ \
    --output scenes_output_dir/captions.json \
    --captioner-type llava_next_7b

This will create a captions.json file containing the video paths and their captions. This JSON file will be used as input for the data preprocessing step.
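
If you want a quick sanity check before preprocessing, you can peek at a record with a few lines of Python. This is illustrative only; the exact schema may vary by captioner, and the media_path and caption field names below are inferred from the column flags used in the preprocessing step.

# Peek at the generated captions file (illustrative; field names inferred from
# the --video-column / --caption-column flags used in the next step)
import json

with open("scenes_output_dir/captions.json") as f:
    records = json.load(f)

print(len(records), "captioned clips")
print(records[0])  # expected to look roughly like {"media_path": "...", "caption": "..."}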

3. Dataset Preprocessing (preprocess_dataset.py)

This step preprocesses your video dataset by:

  1. Resizing and cropping videos to fit specified resolution buckets
  2. Computing and caching video latent representations
  3. Computing and caching text embeddings for captions

Using the captions.json file generated in step 2:

# Preprocess the dataset using the generated captions.json
python scripts/preprocess_dataset.py scenes_output_dir/captions.json \
    --resolution-buckets "768x768x25" \
    --caption-column "caption" \
    --video-column "media_path"

The preprocessing significantly accelerates training and reduces GPU memory usage.

Resolution Buckets

Videos are organized into "buckets" of specific dimensions (width × height × frames). Each video is assigned to the nearest matching bucket. Currently, the trainer only supports using a single resolution bucket.

The dimensions of each bucket must follow these constraints due to LTX-Video's VAE architecture:

  • Spatial dimensions (width and height) must be multiples of 32
  • Number of frames must be a multiple of 8 plus 1 (e.g., 25, 33, 41, 49, etc.)

Guidelines for choosing training resolution:

  • For high-quality, detailed videos: use larger spatial dimensions (e.g. 768×448) with fewer frames (e.g. 89)
  • For longer, motion-focused videos: use smaller spatial dimensions (e.g. 512×512) with more frames (e.g. 121)
  • Memory usage increases with both spatial and temporal dimensions
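
As a quick sanity check, the snippet below (illustrative only, not part of the repository's scripts) snaps a candidate bucket to the nearest dimensions that satisfy the constraints above:

# Illustrative helper: snap a candidate WxHxF bucket to valid VAE dimensions
# (spatial dims must be multiples of 32, frame count must be 8*k + 1).
def nearest_valid_bucket(width: int, height: int, frames: int) -> tuple[int, int, int]:
    def snap32(x: int) -> int:
        return max(32, round(x / 32) * 32)                    # nearest multiple of 32
    valid_frames = max(1, round((frames - 1) / 8) * 8 + 1)    # nearest 8*k + 1
    return snap32(width), snap32(height), valid_frames

print(nearest_valid_bucket(768, 448, 89))  # (768, 448, 89) -- already valid
print(nearest_valid_bucket(700, 480, 90))  # (704, 480, 89) -- snapped to valid values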

Example usage:

python scripts/preprocess_dataset.py /path/to/dataset \
    --resolution-buckets "768x768x25"

This creates a bucket with:

  • 768×768 resolution
  • 25 frames

Videos are processed as follows:

  1. Videos are resized maintaining aspect ratio until either width or height matches the target (768 in this example)
  2. The larger dimension is center cropped to match the bucket's dimensions
  3. Frames are sampled uniformly to match the bucket's frame count (25 in this example)
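
For intuition, here is an illustrative sketch of that resize, center-crop, and frame-sampling logic (the trainer's actual implementation may differ in its details):

# Illustrative sketch of the resize / center-crop / uniform-frame-sampling
# steps described above (not the repository's actual code).
def plan_processing(src_w, src_h, src_frames, tgt_w, tgt_h, tgt_frames):
    # 1. Resize while keeping aspect ratio until one side matches the target.
    scale = max(tgt_w / src_w, tgt_h / src_h)
    resized = (round(src_w * scale), round(src_h * scale))
    # 2. Center-crop the larger dimension down to the bucket size.
    crop_offset = ((resized[0] - tgt_w) // 2, (resized[1] - tgt_h) // 2)
    # 3. Sample frames uniformly to reach the bucket's frame count.
    frame_indices = [round(i * (src_frames - 1) / (tgt_frames - 1)) for i in range(tgt_frames)]
    return resized, crop_offset, frame_indices

resized, crop, frames = plan_processing(1920, 1080, 120, 768, 768, 25)
print(resized)  # (1365, 768): height matches the target; width gets center-cropped
print(frames)   # 25 uniformly spaced frame indices from 0 to 119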

Note

The sequence length processed by the transformer model can be calculated as:

sequence_length = (H/32) * (W/32) * ((F-1)/8 + 1)

Where:

  • H = Video height in pixels
  • W = Video width in pixels
  • F = Number of video frames
  • 32 = VAE's spatial downsampling factor
  • 8 = VAE's temporal downsampling factor

For example, a 768×448×89 video would have sequence length:

(768/32) * (448/32) * ((89-1)/8 + 1) = 24 * 14 * 12 = 4,032

Keep this in mind when choosing video dimensions, as longer sequences require more memory and computation power.
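
The same calculation in a couple of lines of Python (illustrative only), reproducing the worked example above:

# Transformer sequence length for a width x height x frames bucket.
def sequence_length(width: int, height: int, frames: int) -> int:
    return (height // 32) * (width // 32) * ((frames - 1) // 8 + 1)

print(sequence_length(768, 448, 89))  # 24 * 14 * 12 = 4032
print(sequence_length(768, 768, 25))  # 24 * 24 * 4  = 2304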

Warning

While the preprocessing script supports multiple buckets, the trainer currently only works with a single resolution bucket. Please ensure you specify just one bucket in your preprocessing command.

Dataset Format

The trainer supports training on either videos or single images. Note that your dataset must be homogeneous: either all videos or all images; mixing the two is not supported. When using images, follow the same preprocessing steps and format requirements as with videos, simply providing image files instead of video files.

  1. Directory with text files:
dataset/
β”œβ”€β”€ captions.txt      # One caption per line
└── video_paths.txt   # One video path per line
python scripts/preprocess_dataset.py dataset/ \
    --caption-column captions \
    --video-column video_paths
  2. Single metadata file:
# Using CSV/JSON/JSONL, e.g.
python scripts/preprocess_dataset.py dataset.json \
    --caption-column "caption" \
    --video-column "video_path" \
    --model-source "LTXV_2B_0.9.5"  # Optional: pin a specific model version; defaults to the latest
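
If your clips and captions are already in memory, a small hypothetical helper like the one below can write either layout; the file and column names mirror the examples above:

# Hypothetical helper (not part of this repo): write a dataset in the
# two-text-file layout or as a single JSON metadata file.
import json
from pathlib import Path

def write_dataset(samples: list[dict], out_dir: str) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Layout 1: parallel text files, one caption / one video path per line
    (out / "captions.txt").write_text("\n".join(s["caption"] for s in samples) + "\n")
    (out / "video_paths.txt").write_text("\n".join(s["video_path"] for s in samples) + "\n")
    # Layout 2: a single JSON metadata file
    (out / "dataset.json").write_text(json.dumps(samples, indent=2))

write_dataset(
    [{"video_path": "clips/clip_001.mp4", "caption": "A cat jumping onto a table"}],
    "dataset",
)
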
Output Structure

The preprocessed data is saved in a .precomputed directory:

dataset/
└── .precomputed/
    ├── latents/     # Cached video latents
    └── conditions/  # Cached text embeddings
LoRA Trigger Words

When training a LoRA, you can specify a trigger token that will be prepended to all captions:

python scripts/preprocess_dataset.py /path/to/dataset \
    --resolution-buckets "1024x576x65" \
    --id-token "<TOK>"

This acts as a trigger word that activates the LoRA during inference when you include the same token in your prompts.
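
For example, with --id-token "<TOK>", a caption such as "A cat jumps onto a table" is stored as "<TOK> A cat jumps onto a table", and at inference time you would start your prompt with "<TOK>" to activate the LoRA (the caption here is purely illustrative).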

Decoding videos

By providing the --decode-videos flag, the script will also VAE-decode the precomputed latents and save the resulting videos under .precomputed/decoded_videos, so you can visually inspect the data stored in the latents. This is useful for debugging and for verifying that your dataset is being processed correctly.

# Preprocess dataset and decode videos for verification
python scripts/preprocess_dataset.py /path/to/dataset \
    --resolution-buckets "768x768x25" \
    --decode-videos

Single-frame images are saved as PNG files instead of MP4.

βš™οΈ Training Configuration

The trainer uses structured Pydantic models for configuration, making it easy to customize training parameters. The main configuration class is LtxvTrainerConfig, which includes:

  • ModelConfig: Base model and training mode settings
  • LoraConfig: LoRA fine-tuning parameters
  • OptimizationConfig: Learning rates, batch sizes, and scheduler settings
  • AccelerationConfig: Mixed precision and optimization settings
  • DataConfig: Data loading parameters
  • ValidationConfig: Validation and inference settings
  • CheckpointsConfig: Checkpoint saving frequency and retention settings
  • FlowMatchingConfig: Timestep sampling parameters

Example Configurations

Check out our example configurations in the configs directory. You can use these as templates for your training runs.
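
For reference, loading and validating a configuration programmatically should look roughly like the sketch below. Only LtxvTrainerConfig is named in this README, so the import path (and anything else not shown above) is an assumption to verify against the code:

# Illustrative sketch: load a YAML config into the Pydantic model.
import yaml  # PyYAML
from ltxv_trainer.config import LtxvTrainerConfig  # import path is an assumption

with open("configs/lora_example.yaml") as f:
    raw = yaml.safe_load(f)

config = LtxvTrainerConfig(**raw)  # standard Pydantic construction + validation
print(config)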


⚡ Running the Trainer

After preprocessing your dataset and preparing a configuration file, you can start training using the trainer script:

# Train a LoRA
python scripts/train.py configs/lora_example.yaml

# Fine-tune the full model
python scripts/train.py configs/full_example.yaml

The trainer loads your configuration, initializes models, applies optimizations, runs the training loop with progress tracking, generates validation videos (if configured), and saves the trained weights.

For LoRA training, the weights will be saved as lora_weights.safetensors in your output directory. For full model fine-tuning, the weights will be saved as model_weights.safetensors.


Fast and simple: Running the Complete Pipeline as one command

For a streamlined experience, you can use run_pipeline.py, which automates the entire training workflow. Provide it with a configuration template, which will be instantiated with the values and media files you supply.

python scripts/run_pipeline.py [LORA_BASE_NAME] \
    --resolution-buckets "768x768x49" \
    --config-template configs/ltxv_2b_lora_template.yaml \
    --rank 32

This script will:

  1. Process raw videos in the [basename]_raw/ directory (if any exist):

    • Split long videos into scenes
    • Save scenes to [basename]_scenes/
  2. Generate captions for the scenes (if scenes exist):

    • Uses LLaVA-Next-7B for captioning
    • Saves captions to [basename]_scenes/captions.json
  3. Preprocess the dataset:

    • Computes and caches video latents
    • Computes and caches text embeddings
    • Decodes videos for verification
  4. Run the training:

    • Uses the provided config template
    • Automatically extracts validation prompts from captions
    • Saves the final model weights
  5. Convert LoRA to ComfyUI format:

    • Automatically converts the trained LoRA weights to ComfyUI format
    • Saves the converted weights with "_comfy" suffix

Required arguments:

  • basename: Base name for your project (e.g., "slime")
  • --resolution-buckets: Video resolution in format "WxHxF" (e.g., "768x768x49")
  • --config-template: Path to your configuration template file
  • --rank: LoRA rank (1-128) for training

The script will create the following directory structure:

[basename]_raw/          # Place your raw videos here
[basename]_scenes/       # Split scenes and captions
└── .precomputed/        # Preprocessed data
    ├── latents/         # Cached video latents
    ├── conditions/      # Cached text embeddings
    └── decoded_videos/  # Decoded videos for verification
outputs/                 # Training outputs and checkpoints
    └── lora_weights_comfy.safetensors  # ComfyUI-compatible LoRA weights

🔌 Using Trained LoRAs in ComfyUI

After training your LoRA, you can use it in ComfyUI by following these steps:

  1. Convert your trained LoRA weights to ComfyUI format using the conversion script:

    python scripts/convert_checkpoint.py your_lora_weights.safetensors --to-comfy
  2. Copy the converted LoRA weights (.safetensors file) to the models/loras folder in your ComfyUI installation.

  3. In your ComfyUI workflow:

    • Use the built-in "Load LoRA" node to load your LoRA file
    • Connect it to your LTXV nodes to apply the LoRA to your generation

You can find reference Text-to-Video (T2V) and Image-to-Video (I2V) workflows in the official LTXV ComfyUI repository.


🍰 Example LoRAs

Here are some example LoRAs trained using this trainer, along with their training datasets:

Cakeify Effect

The Cakeify LoRA transforms videos to make objects appear as if they're made of cake. The effect was trained on the Cakeify Dataset.

Squish Effect

The Squish LoRA creates a playful squishing effect on objects in videos. It was trained on the Squish Dataset, which contains just 5 example videos.

These examples demonstrate how you can train specialized video effects using this trainer. Feel free to use these datasets as references for preparing your own training data.


οΈπŸ”§ Utility Scripts

LoRA Format Converter

Using scripts/convert_checkpoint.py you can convert saved LoRA weights between the Diffusers library format and the ComfyUI format.

# Convert from diffusers to ComfyUI format
python scripts/convert_checkpoint.py input.safetensors --to-comfy --output_path output.safetensors

# Convert from ComfyUI to diffusers format
python scripts/convert_checkpoint.py input.safetensors --output_path output.safetensors

If no output path is specified, the script automatically generates one by appending a _comfy or _diffusers suffix to the input filename.

Latents Decoding Script

Using scripts/decode_latents.py you can decode precomputed video latents back into video files. This is useful for verifying the quality of your preprocessed dataset or debugging the preprocessing pipeline.

# Basic usage
python scripts/decode_latents.py /path/to/latents/dir --output-dir /path/to/output

The script will:

  1. Load the VAE model from the specified path
  2. Process all .pt latent files in the input directory
  3. Decode each latent back into a video using the VAE
  4. Save the resulting videos as MP4 files in the output directory
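
Before decoding a whole directory, you may want to take a quick look at a single cached latent file. A minimal sketch (the filename and the exact contents of the .pt files are assumptions; adjust to whatever preprocess_dataset.py actually stores):

# Inspect a cached latent file before decoding. The filename and the structure
# of the .pt file (plain tensor vs. dict of tensors) are assumptions.
import torch

obj = torch.load("dataset/.precomputed/latents/sample.pt", map_location="cpu")
if isinstance(obj, torch.Tensor):
    print("latent tensor shape:", obj.shape)
else:  # e.g. a dict of tensors and metadata
    for key, value in obj.items():
        print(key, getattr(value, "shape", type(value)))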

🤝 Contributing

We welcome contributions from the community! Here's how you can help:

  • Share Your Work: If you've trained interesting LoRAs or achieved cool results, please share them with the community.
  • Report Issues: Found a bug or have a suggestion? Open an issue on GitHub.
  • Submit PRs: Help improve the codebase with bug fixes or general improvements.
  • Feature Requests: Have ideas for new features? Let us know through GitHub issues.

🫢 Acknowledgements

Parts of this project are inspired by and incorporate ideas from several awesome open-source projects.


Happy training! 🎉
