In this example, we'll be training a Flux.1 LoRA.
Flux requires a lot of system RAM in addition to GPU memory. Simply quantising the model at startup requires about 50GB of system memory. If it takes an excessively long time, you may need to assess your hardware's capabilities and whether any changes are needed.
When you're training every component of a rank-16 LoRA (MLP, projections, multimodal blocks), it ends up using:
- a bit more than 30G VRAM when not quantising the base model
- a bit more than 18G VRAM when quantising to int8 + bf16 base/LoRA weights
- a bit more than 13G VRAM when quantising to int4 + bf16 base/LoRA weights
- a bit more than 9G VRAM when quantising to NF4 + bf16 base/LoRA weights
- a bit more than 9G VRAM when quantising to int2 + bf16 base/LoRA weights
You'll need:
- the absolute minimum is a single 3080 10G
- a realistic minimum is a single 3090 or V100 GPU
- ideally multiple 4090, A6000, L40S, or better
Luckily, these are readily available through providers such as LambdaLabs which provides the lowest available rates, and localised clusters for multi-node training.
Unlike other models, Apple GPUs do not currently work for training Flux.
Make sure that you have python installed; SimpleTuner does well with 3.10 or 3.11. Python 3.12 should not be used.
You can check this by running:
python --version
If you don't have python 3.11 installed on Ubuntu, you can try the following:
apt -y install python3.11 python3.11-venv
For Vast, RunPod, and TensorDock (among others), the following will work on a CUDA 12.2-12.4 image:
apt -y install nvidia-cuda-toolkit libgl1-mesa-glx
If libgl1-mesa-glx
is not found, you might need to use libgl1-mesa-dri
instead. Your mileage may vary.
Clone the SimpleTuner repository and set up the python venv:
git clone --branch=release https://github.com/bghira/SimpleTuner.git
cd SimpleTuner
# if python --version shows 3.11 you can just also use the 'python' command here.
python3.11 -m venv .venv
source .venv/bin/activate
pip install -U poetry pip
# Necessary on some systems to prevent it from deciding it knows better than us.
poetry config virtualenvs.create false
Note: We're currently installing the release
branch here; the main
branch may contain experimental features that might have better results or lower memory use.
Depending on your system, you will run one of 3 commands:
# MacOS
poetry install -C install/apple
# Linux
poetry install
# Linux with ROCM
poetry install -C install/rocm
The following must be executed for an AMD MI300X to be useable:
apt install amd-smi-lib
pushd /opt/rocm/share/amd_smi
python3 -m pip install --upgrade pip
python3 -m pip install .
popd
To run SimpleTuner, you will need to set up a configuration file, the dataset and model directories, and a dataloader configuration file.
An experimental script, configure.py
, may allow you to entirely skip this section through an interactive step-by-step configuration. It contains some safety features that help avoid common pitfalls.
Note: This doesn't configure your dataloader. You will still have to do that manually, later.
To run it:
python configure.py
⚠️ For users located in countries where Hugging Face Hub is not readily accessible, you should addHF_ENDPOINT=https://hf-mirror.com
to your~/.bashrc
or~/.zshrc
depending on which$SHELL
your system uses.
If you prefer to manually configure:
Copy config/config.json.example
to config/config.json
:
cp config/config.json.example config/config.json
There, you will possibly need to modify the following variables:
-
model_type
- Set this tolora
. -
model_family
- Set this toflux
. -
pretrained_model_name_or_path
- Set this toblack-forest-labs/FLUX.1-dev
. -
pretrained_vae_model_name_or_path
- Set this toblack-forest-labs/FLUX.1-dev
.- Note that you will need to log in to Huggingface and be granted access to download this model. We will go over logging in to Huggingface later in this tutorial.
-
output_dir
- Set this to the directory where you want to store your checkpoints and validation images. It's recommended to use a full path here. -
train_batch_size
- this should be kept at 1, especially if you have a very small dataset. -
validation_resolution
- As Flux is a 1024px model, you can set this to1024x1024
.- Additionally, Flux was fine-tuned on multi-aspect buckets, and other resolutions may be specified using commas to separate them:
1024x1024,1280x768,2048x2048
- Additionally, Flux was fine-tuned on multi-aspect buckets, and other resolutions may be specified using commas to separate them:
-
validation_guidance
- Use whatever you are used to selecting at inference time for Flux. -
validation_guidance_real
- Use >1.0 to use CFG for flux inference. Slows validations down, but produces better results. Does best with an emptyVALIDATION_NEGATIVE_PROMPT
. -
validation_num_inference_steps
- Use somewhere around 20 to save time while still seeing decent quality. Flux isn't very diverse, and more steps might just waste time. -
--lora_rank=4
if you wish to substantially reduce the size of the LoRA being trained. This can help with VRAM use. -
If training a Schnell LoRA, you'll have to supply
--flux_fast_schedule=true
manually here as well. -
gradient_accumulation_steps
- Previous guidance was to avoid these with bf16 training since they would degrade the model. Further testing showed this is not necessarily the case for Flux.- This option causes update steps to be accumulated over several steps. This will increase the training runtime linearly, such that a value of 2 will make your training run half as quickly, and take twice as long.
-
optimizer
- Beginners are recommended to stick with adamw_bf16, though optimi-lion and optimi-stableadamw are also good choices. -
mixed_precision
- Beginners should keep this inbf16
-
gradient_checkpointing
- set this to true in practically every situation on every device -
gradient_checkpointing_interval
- this could be set to a value of 2 or higher on larger GPUs to only checkpoint every n blocks. A value of 2 would checkpoint half of the blocks, and 3 would be one-third.
Multi-GPU users can reference this document for information on configuring the number of GPUs to use.
Inside config/config.json
is the "primary validation prompt", which is typically the main instance_prompt you are training on for your single subject or style. Additionally, a JSON file may be created that contains extra prompts to run through during validations.
The example config file config/user_prompt_library.json.example
contains the following format:
{
"nickname": "the prompt goes here",
"another_nickname": "another prompt goes here"
}
The nicknames are the filename for the validation, so keep them short and compatible with your filesystem.
To point the trainer to this prompt library, add it to TRAINER_EXTRA_ARGS by adding a new line at the end of config.json
:
"--user_prompt_library": "config/user_prompt_library.json",
A set of diverse prompt will help determine whether the model is collapsing as it trains. In this example, the word <token>
should be replaced with your subject name (instance_prompt).
{
"anime_<token>": "a breathtaking anime-style portrait of <token>, capturing her essence with vibrant colors and expressive features",
"chef_<token>": "a high-quality, detailed photograph of <token> as a sous-chef, immersed in the art of culinary creation",
"just_<token>": "a lifelike and intimate portrait of <token>, showcasing her unique personality and charm",
"cinematic_<token>": "a cinematic, visually stunning photo of <token>, emphasizing her dramatic and captivating presence",
"elegant_<token>": "an elegant and timeless portrait of <token>, exuding grace and sophistication",
"adventurous_<token>": "a dynamic and adventurous photo of <token>, captured in an exciting, action-filled moment",
"mysterious_<token>": "a mysterious and enigmatic portrait of <token>, shrouded in shadows and intrigue",
"vintage_<token>": "a vintage-style portrait of <token>, evoking the charm and nostalgia of a bygone era",
"artistic_<token>": "an artistic and abstract representation of <token>, blending creativity with visual storytelling",
"futuristic_<token>": "a futuristic and cutting-edge portrayal of <token>, set against a backdrop of advanced technology",
"woman": "a beautifully crafted portrait of a woman, highlighting her natural beauty and unique features",
"man": "a powerful and striking portrait of a man, capturing his strength and character",
"boy": "a playful and spirited portrait of a boy, capturing youthful energy and innocence",
"girl": "a charming and vibrant portrait of a girl, emphasizing her bright personality and joy",
"family": "a heartwarming and cohesive family portrait, showcasing the bonds and connections between loved ones"
}
ℹ️ Flux is a flow-matching model and shorter prompts that have strong similarities will result in practically the same image being produced by the model. Be sure to use longer, more descriptive prompts.
If you wish to enable evaluations to score the model's performance, see this document for information on configuring and interpreting CLIP scores.
Flow-matching models such as Flux and SD3 have a property called "shift" that allows us to shift the trained portion of the timestep schedule using a simple decimal value.
By default, no schedule shift is applied to flux, which results in a sigmoid bell-shape to the timestep sampling distribution. This is unlikely to be the ideal approach for Flux, but it results in a greater amount of learning in a shorter period of time than auto-shift.
A commonly-recommended approach is to follow several recent works and enable resolution-dependent timestep shift, --flow_schedule_auto_shift
which uses higher shift values for larger images, and lower shift values for smaller images. This results in stable but potentially mediocre training results.
Thanks to General Awareness from Discord for the following examples
When using a --flow_schedule_shift
value of 0.1 (a very low value), only the finer details of the image are affected:
When using a --flow_schedule_shift
value of 4.0 (a very high value), the large compositional features and potentially colour space of the model becomes impacted:
Tested on Apple and NVIDIA systems, Hugging Face Optimum-Quanto can be used to reduce the precision and VRAM requirements, training Flux on just 16GB.
Inside your SimpleTuner venv:
pip install optimum-quanto
For config.json
users:
"base_model_precision": "int8-quanto",
"text_encoder_1_precision": "no_change",
"text_encoder_2_precision": "no_change",
"lora_rank": 16,
"max_grad_norm": 1.0,
"base_model_default_dtype": "bf16"
# When training 'mmdit', we find very stable training that makes the model take longer to learn.
# When training 'all', we can easily shift the model distribution, but it is more prone to forgetting and benefits from high quality data.
# When training 'all+ffs', all attention layers are trained in addition to the feed-forward which can help with adapting the model objective for the LoRA.
# - This mode has been reported to lack portability, and platforms such as ComfyUI might not be able to load the LoRA.
# The option to train only the 'context' blocks is offered as well, but its impact is unknown, and is offered as an experimental choice.
# - An extension to this mode, 'context+ffs' is also available, which is useful for pretraining new tokens into a LoRA before continuing finetuning it via `--init_lora`.
# Other options include 'tiny' and 'nano' which train just 1 or 2 layers.
"--flux_lora_target": "all",
# If you want to use LoftQ initialisation, you can't use Quanto to quantise the base model.
# This possibly offers better/faster convergence, but only works on NVIDIA devices and requires Bits n Bytes and is incompatible with Quanto.
# Other options are 'default', 'gaussian' (difficult), and untested options: 'olora' and 'pissa'.
"--lora_init_type": "loftq",
⚠️ Image quality for training is more important for Flux than for most other models, as it will absorb the artifacts in your images first, and then learn the concept/subject.
It's crucial to have a substantial dataset to train your model on. There are limitations on the dataset size, and you will need to ensure that your dataset is large enough to train your model effectively. Note that the bare minimum dataset size is train_batch_size * gradient_accumulation_steps
as well as more than vae_batch_size
. The dataset will not be useable if it is too small.
ℹ️ With few enough images, you might see a message no images detected in dataset - increasing the
repeats
value will overcome this limitation.
Depending on the dataset you have, you will need to set up your dataset directory and dataloader configuration file differently. In this example, we will be using pseudo-camera-10k as the dataset.
Create a --data_backend_config
(config/multidatabackend.json
) document containing this:
[
{
"id": "pseudo-camera-10k-flux",
"type": "local",
"crop": true,
"crop_aspect": "square",
"crop_style": "center",
"resolution": 512,
"minimum_image_size": 512,
"maximum_image_size": 512,
"target_downsample_size": 512,
"resolution_type": "pixel_area",
"cache_dir_vae": "cache/vae/flux/pseudo-camera-10k",
"instance_data_dir": "datasets/pseudo-camera-10k",
"disabled": false,
"skip_file_discovery": "",
"caption_strategy": "filename",
"metadata_backend": "discovery",
"repeats": 0,
"is_regularisation_data": true
},
{
"id": "dreambooth-subject",
"type": "local",
"crop": false,
"resolution": 1024,
"minimum_image_size": 1024,
"maximum_image_size": 1024,
"target_downsample_size": 1024,
"resolution_type": "pixel_area",
"cache_dir_vae": "cache/vae/flux/dreambooth-subject",
"instance_data_dir": "datasets/dreambooth-subject",
"caption_strategy": "instanceprompt",
"instance_prompt": "the name of your subject goes here",
"metadata_backend": "discovery",
"repeats": 1000
},
{
"id": "dreambooth-subject-512",
"type": "local",
"crop": false,
"resolution": 512,
"minimum_image_size": 512,
"maximum_image_size": 512,
"target_downsample_size": 512,
"resolution_type": "pixel_area",
"cache_dir_vae": "cache/vae/flux/dreambooth-subject-512",
"instance_data_dir": "datasets/dreambooth-subject",
"caption_strategy": "instanceprompt",
"instance_prompt": "the name of your subject goes here",
"metadata_backend": "discovery",
"repeats": 1000
},
{
"id": "text-embeds",
"type": "local",
"dataset_type": "text_embeds",
"default": true,
"cache_dir": "cache/text/flux",
"disabled": false,
"write_batch_size": 128
}
]
ℹ️ Running 512px and 1024px datasets concurrently is supported, and could result in better convergence for Flux.
Then, create a datasets
directory:
mkdir -p datasets
pushd datasets
huggingface-cli download --repo-type=dataset bghira/pseudo-camera-10k --local-dir=pseudo-camera-10k
mkdir dreambooth-subject
# place your images into dreambooth-subject/ now
popd
This will download about 10k photograph samples to your datasets/pseudo-camera-10k
directory, which will be automatically created for you.
Your Dreambooth images should go into the datasets/dreambooth-subject
directory.
You'll want to login to WandB and HF Hub before beginning training, especially if you're using --push_to_hub
and --report_to=wandb
.
If you're going to be pushing items to a Git LFS repository manually, you should also run git config --global credential.helper store
Run the following commands:
wandb login
and
huggingface-cli login
Follow the instructions to log in to both services.
From the SimpleTuner directory, one simply has to run:
./train.sh
This will begin the text embed and VAE output caching to disk.
For more information, see the dataloader and tutorial documents.
Note: It's unclear whether training on multi-aspect buckets works correctly for Flux at the moment. It's recommended to use crop_style=random
and crop_aspect=square
.
In ComfyUI, you'll need to put Flux through another node called AdaptiveGuider. One of the members from our community has provided a modified node here:
(external links) IdiotSandwichTheThird/ComfyUI-Adaptive-Guidan... and their example workflow here
Inferencing the CFG-distilled LoRA is as easy as using a lower guidance_scale around the value trained with.
Currently, the lowest VRAM utilisation (9090M) can be attained with:
- OS: Ubuntu Linux 24
- GPU: A single NVIDIA CUDA device (10G, 12G)
- System memory: 50G of system memory approximately
- Base model precision:
bnb-nf4
- Optimiser: Lion 8Bit Paged,
bnb-lion8bit-paged
- Resolution: 512px
- 1024px requires >= 12G VRAM
- Batch size: 1, zero gradient accumulation steps
- DeepSpeed: disabled / unconfigured
- PyTorch: 2.6 Nightly (Sept 29th build)
- Using
--quantize_via=cpu
to avoid outOfMemory error during startup on <=16G cards. - With
--attention_mechanism=sageattention
to further reduce VRAM by 0.1GB and improve training validation image generation speed. - Be sure to enable
--gradient_checkpointing
or nothing you do will stop it from OOMing
NOTE: Pre-caching of VAE embeds and text encoder outputs may use more memory and still OOM. If so, text encoder quantisation and VAE tiling can be enabled.
Speed was approximately 1.4 iterations per second on a 4090.
When using --attention_mechanism=sageattention
, inference can be sped-up at validation time.
Note: This isn't compatible with every model configuration, but it's worth trying.
In simplest terms, NF4 is a 4bit-ish representation of the model, which means training has serious stability concerns to address.
In early tests, the following holds true:
- Lion optimiser causes model collapse but uses least VRAM; AdamW variants help to hold it together; bnb-adamw8bit, adamw_bf16 are great choices
- AdEMAMix didn't fare well, but settings were not explored
--max_grad_norm=0.01
further helps reduce model breakage by preventing huge changes to the model in too short a time- NF4, AdamW8bit, and a higher batch size all help to overcome the stability issues, at the cost of more time spent training or VRAM used
- Upping the resolution from 512px to 1024px slows training down from, for example, 1.4 seconds per step to 3.5 seconds per step (batch size of 1, 4090)
- Anything that's difficult to train on int8 or bf16 becomes harder in NF4
- It's less compatible with options like SageAttention
NF4 does not work with torch.compile, so whatever you get for speed is what you get.
If VRAM is not a concern (eg. 48G or greater) then int8 with torch.compile is your best, fastest option.
If you are training a subject or style and would like to mask one or the other, see the masked loss training section of the Dreambooth guide.
The Dev model arrives guidance-distilled out of the box, which means it does a very straight shot trajectory to the teacher model outputs. This is done through a guidance vector that is fed into the model at training and inference time - the value of this vector greatly impacts what type of resulting LoRA you end up with:
- A value of 1.0 (the default) will preserve the initial distillation done to the Dev model
- This is the most compatible mode
- Inference is just as fast as the original model
- Flow-matching distillation reduces the creativity and output variability of the model, as with the original Flux Dev model (everything keeps the same composition/look)
- A higher value (tested around 3.5-4.5) will reintroduce the CFG objective into the model
- This requires the inference pipeline to have support for CFG
- Inference is 50% slower and 0% VRAM increase or about 20% slower and 20% VRAM increase due to batched CFG inference
- However, this style of training improves creativity and model output variability, which might be required for certain training tasks
We can partially reintroduce distillation to a de-distilled model by continuing tuning your model using a vector value of 1.0. It will never fully recover, but it'll at least be more useable.
- This has the end impact of either:
- Increasing inference latency by 2x when we sequentially calculate the unconditional output, eg. with two separate forward pass
- Increasing the VRAM consumption equivalently to using
num_images_per_prompt=2
and receiving two images at inference time, accompanied by the same percent slowdown.- This is often less extreme slowdown than sequential computation, but the VRAM use might be too much for most consumer training hardware.
- This method is not currently integrated into SimpleTuner, but work is ongoing.
- Inference workflows for ComfyUI or other applications (eg. AUTOMATIC1111) will need to be modified to also enable "true" CFG, which might not be currently possible out of the box.
- Minimum 8bit quantisation is required for a 16G card to train this model
- In bfloat16/float16, a rank-1 LoRA sits at just over 30GB of mem use
- Quantising the model to 8bit doesn't harm training
- It allows you to push higher batch sizes and possibly obtain a better result
- Behaves the same as full-precision training - fp32 won't make your model any better than bf16+int8.
- int8 has hardware acceleration and
torch.compile()
support on newer NVIDIA hardware (3090 or better) - nf4-bnb brings VRAM requirements down to 9GB, fitting on a 10G card (with bfloat16 support)
- When loading the LoRA in ComfyUI later, you must use the same base model precision as you trained your LoRA on.
- int4 is relies on custom bf16 kernels, and will not work if your card does not support bfloat16
- If you get SIGKILL after the text encoders are unloaded, this means you do not have enough system memory to quantise Flux.
- Try loading the
--base_model_precision=bf16
but if that does not work, you might just need more memory.. - Try
--quantize_via=accelerator
to use the GPU instead
- Try loading the
- If you train a LyCORIS LoKr on Dev, it generally works very well on Schnell at just 4 steps later.
- Direct Schnell training really needs a bit more time in the oven - currently, the results do not look good
ℹ️ When merging Schnell with Dev in any way, the license of Dev takes over and it becomes non-commercial. This shouldn't really matter for most users, but it's worth noting.
- LoRA has overall worse performance than LoKr for larger datasets
- It's been reported that Flux LoRA trains similarly to SD 1.5 LoRAs
- However, a model as large as 12B has empirically performed better with lower learning rates.
- LoRA at 1e-3 might totally roast the thing. LoRA at 1e-5 does nearly nothing.
- Ranks as large as 64 through 128 might be undesirable on a 12B model due to general difficulties that scale up with the size of the base model.
- Try a smaller network first (rank-1, rank-4) and work your way up - they'll train faster, and might do everything you need.
- If you're finding that it's excessively difficult to train your concept into the model, you might need a higher rank and more regularisation data.
- Other diffusion transformer models like PixArt and SD3 majorly benefit from
--max_grad_norm
and SimpleTuner keeps a pretty high value for this by default on Flux.- A lower value would keep the model from falling apart too soon, but can also make it very difficult to learn new concepts that venture far from the base model data distribution. The model might get stuck and never improve.
- Higher learning rates are better for LoKr (
1e-3
with AdamW,2e-4
with Lion) - Other algo need more exploration.
- Setting
is_regularisation_data
on such datasets may help preserve / prevent bleed and improve the final resulting model's quality.- This behaves differently from "prior loss preservation" which is known for doubling training batch sizes and not improving the result much
- SimpleTuner's regularisation data implementation provides an efficient manner of preserving the base model
Flux will immediately absorb bad image artifacts. It's just how it is - a final training run on just high quality data may be required to fix it at the end.
When you do these things (among others), some square grid artifacts may begin appearing in the samples:
- Overtrain with low quality data
- Use too high of a learning rate
- Overtraining (in general), a low-capacity network with too many images
- Undertraining (also), a high-capacity network with too few images
- Using weird aspect ratios or training data sizes
- Training for too long on square crops probably won't damage this model too much. Go nuts, it's great and reliable.
- On the other hand, using the natural aspect buckets of your dataset might overly bias these shapes during inference time.
- This could be a desirable quality, as it keeps aspect-dependent styles like cinematic stuff from bleeding into other resolutions too much.
- However, if you're looking to improve results equally across many aspect buckets, you might have to experiment with
crop_aspect=random
which comes with its own downsides.
- Mixing dataset configurations by defining your image directory dataset multiple times has produced really good results and a nicely generalised model.
Some fine-tuned Flux models on Hugging Face Hub (such as Dev2Pro) lack the full directory structure, requiring these specific options be set.
Make sure to set these options flux_guidance_value
, validation_guidance_real
and flux_attention_masked_training
according to the way the creator did as well if that information is available.
{
"model_family": "flux",
"pretrained_model_name_or_path": "black-forest-labs/FLUX.1-dev",
"pretrained_transformer_model_name_or_path": "ashen0209/Flux-Dev2Pro",
"pretrained_vae_model_name_or_path": "black-forest-labs/FLUX.1-dev",
"pretrained_transformer_subfolder": "none",
}
The users of Terminus Research who worked on this probably more than their day jobs to figure it out
Lambda Labs for generous compute allocations that were used for tests and verifications for large scale training runs
Especially @JimmyCarter and @kaibioinfo for coming up with some of the best ideas and putting them into action, offering pull requests and running exhaustive tests for analysis - even daring to use their own faces for DreamBooth experimentation.