UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation
This project accompanies the research paper,
UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation
Rui Tian*, Mingfei Gao*, Mingze Xu*, Jiaming Hu, Jiasen Lu, Zuxuan Wu, Yinfei Yang, Afshin Dehghan
UniGen is a unified multimodal large language model (MLLM) capable of both image understanding and generation. We detail UniGen's full training pipeline from a data-centric perspective, including its multi-stage pre-training, supervised fine-tuning, and direct preference optimization.
More importantly, we introduce Chain-of-Thought Verification (CoT-V), a novel test-time strategy that significantly boosts image generation quality using a simple Best-of-N approach.
- [11/18] 🚀🚀🚀 UniGen-1.5 is on ArXiv!
- [9/19] 🔥🔥🔥 UniGen has been accepted to NeurIPS 2025!
This code requires Python >= 3.10.12, PyTorch >= 2.4.1, and CUDA 12.4.
- [Optional but recommended] Create and activate a new conda environment.

  ```bash
  conda create -n unigen python=3.10.12
  ```

  And activate the environment.

  ```bash
  conda activate unigen
  ```

- Install the required dependencies.

  ```bash
  bash scripts/setup.sh
  ```
- Download the pre-trained weights for Qwen2.5-1.5B, MAGViTv2, and SigLIP from Hugging Face and place them in the `unigen_data/checkpoints` directory.

  ```bash
  huggingface-cli download Qwen/Qwen2.5-1.5B-Instruct --repo-type model --local-dir unigen_data/checkpoints/Qwen2.5-1.5B-Instruct
  huggingface-cli download showlab/magvitv2 --repo-type model --local-dir unigen_data/checkpoints/magvitv2
  huggingface-cli download google/siglip-so400m-patch14-384 --repo-type model --local-dir unigen_data/checkpoints/siglip-so400m-patch14-384
  ```
- Add your OpenAI API key and organization to your environment variables for model evaluation.

  ```bash
  export OPENAI_API_KEY="YOUR_OPENAI_API_KEY"
  export OPENAI_ORG="YOUR_OPENAI_ORG"  # Optional
  ```
- [Optional] Add your Weights & Biases API key to enable logging during training.

  ```bash
  wandb login "YOUR_WANDB_API_KEY"
  ```
Prepare the following datasets and place them in the `unigen_data/datasets` directory.
- Text-only Dataset: Download RefinedWeb from Hugging Face.

  ```bash
  huggingface-cli download tiiuae/falcon-refinedweb --repo-type dataset --local-dir unigen_data/datasets/falcon-refinedweb
  ```
- Image-Text Pair Dataset (for Pre-training): Download CC-12M, CC-3M, Segment-Anything-11M, and ImageNet-21K. Prepare all datasets in the WebDataset format and perform re-captioning using the following prompt template:

  ```
  <|im_start|>system
  You are a helpful assistant.<|im_end|>
  <|im_start|>user
  <|vision_start|><|image_pad|><|vision_end|> What is the content of this image?<|im_end|>
  <|im_start|>assistant
  ```

  The re-annotated caption should be saved under the `.txt` key of the WebDataset sample, and the image under the `.png`, `.jpg`, `.jpeg`, or `.webp` key. A hedged sketch of re-captioning and WebDataset packing is given after this list.
- Supervised Fine-Tuning (SFT) Data:
  - Generation Data: Download JourneyDB and text2image-2M and prepare them in the WebDataset format.
  - Understanding Data: Our paper uses the single-image mixture from SlowFast-LLaVA-1.5. For this open-source release, we use the LLaVA-1.5 instruction tuning data, as specified in the training config.
- Direct Preference Optimization (DPO) Data:
  - Prepare text prompts from various sources.
  - Set up the data annotation environment with `vllm`.

    ```bash
    pip install vllm==0.7.3
    ```

  - Convert prompts into related visual questions using an LLM (an illustrative vLLM sketch follows this list).

    ```bash
    python scripts/dataflows/zeroshot_questions.py --metadata_path /path/to/prompt --out_path /path/to/out --model_name Qwen/Qwen2.5-7B-Instruct
    ```

  - Generate N image samples for each text prompt with the UniGen-SFT model.
  - Run the pseudo-labeling pipeline on each image-question pair.

    ```bash
    python scripts/dataflows/zeroshot_vqa.py --metadata_path /path/to/visual_question --out_path /path/to/out --image_root /path/to/img --model_name Qwen/Qwen2.5-VL-7B-Instruct
    ```
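The image-text pre-training data above must be re-captioned and packaged as WebDataset shards, but the repo does not pin a specific captioning model or packing script. The snippet below is therefore only an illustrative sketch: it assumes a Qwen2.5-VL model served offline through vLLM (the same family used for pseudo-labeling), applies the prompt template shown above, and writes shards with the caption under the `.txt` key and the image under its original extension. All paths are placeholders.

```python
# Hedged sketch only: re-caption images with a VLM and pack them as WebDataset shards.
# The captioning model, prompt handling, and paths are assumptions, not the repo's exact pipeline.
import os

from PIL import Image
from vllm import LLM, SamplingParams
import webdataset as wds

IMAGE_DIR = "/path/to/raw_images"                           # placeholder input directory
OUT_PATTERN = "unigen_data/datasets/recaption/shard-%06d.tar"  # placeholder output shards

# Prompt template from the data-preparation instructions above.
PROMPT = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
    " What is the content of this image?<|im_end|>\n"
    "<|im_start|>assistant\n"
)

llm = LLM(model="Qwen/Qwen2.5-VL-7B-Instruct")   # assumed captioner
params = SamplingParams(temperature=0.2, max_tokens=256)

os.makedirs(os.path.dirname(OUT_PATTERN), exist_ok=True)
with wds.ShardWriter(OUT_PATTERN, maxcount=10_000) as sink:
    for name in sorted(os.listdir(IMAGE_DIR)):
        stem, ext = os.path.splitext(name)
        if ext.lower() not in {".png", ".jpg", ".jpeg", ".webp"}:
            continue
        path = os.path.join(IMAGE_DIR, name)
        image = Image.open(path).convert("RGB")
        # Offline multimodal inference: the chat-formatted prompt plus the decoded image.
        out = llm.generate({"prompt": PROMPT, "multi_modal_data": {"image": image}}, params)
        caption = out[0].outputs[0].text.strip()
        with open(path, "rb") as f:
            image_bytes = f.read()
        # Caption goes under ".txt"; the image keeps its original extension as the key.
        sink.write({"__key__": stem, ext.lstrip(".").lower(): image_bytes, "txt": caption})
```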
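For the DPO step that converts text prompts into verification questions, `scripts/dataflows/zeroshot_questions.py` is the supported entry point. Purely as an illustration of the idea, here is a minimal sketch using vLLM's offline chat API; the instruction wording, toy prompts, and output handling are assumptions and may differ from the script's actual logic.

```python
# Hedged sketch: turn image-generation prompts into yes/no verification questions
# with an instruction-tuned LLM served offline by vLLM. Instruction wording and
# prompts below are illustrative assumptions only.
from vllm import LLM, SamplingParams

# Toy prompts; in practice these come from the metadata file passed to the script.
prompts = [
    "a photo of a red apple to the left of a blue cup",
    "two dogs playing frisbee on a beach at sunset",
]

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
params = SamplingParams(temperature=0.0, max_tokens=256)

# Hypothetical instruction; the wording used by zeroshot_questions.py may differ.
instruction = (
    "Given the following image-generation prompt, write a short list of yes/no "
    "questions that check whether a generated image matches the prompt.\n\nPrompt: {}"
)

conversations = [
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": instruction.format(p)},
    ]
    for p in prompts
]

# LLM.chat applies the model's chat template before generating.
outputs = llm.chat(conversations, params)
for prompt, out in zip(prompts, outputs):
    print(prompt)
    print(out.outputs[0].text.strip())
```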
Run the following script for Stage 1 pre-training on 2x 80GB H100/A100 GPUs.

```bash
bash scripts/run_pretraining.sh \
--experiment_config configs/unigen_1_5b/unigen_pt1.yaml \
--output_dir path_to_your_out \
--train_module train.py
```

Place the final checkpoint from Stage 1 (`unigen_pt1/checkpoint-150000`) in `unigen_data/checkpoints`. Then, run the following command for Stage 2 pre-training on 4x 80GB H100/A100 GPUs.

```bash
bash scripts/run_pretraining.sh \
--experiment_config configs/unigen_1_5b/unigen_pt2.yaml \
--pretrained_model unigen_pt1/checkpoint-150000 \
--output_dir path_to_your_out \
--train_module train.py
```

Place the final checkpoint from Stage 2 (`unigen_pt2/checkpoint-400000`) in `unigen_data/checkpoints`. Then, run the following command for SFT on 1x 80GB H100/A100 GPU.

```bash
bash scripts/run_sft.sh \
--experiment_config configs/unigen_1_5b/unigen_sft.yaml \
--pretrained_model unigen_pt2/checkpoint-400000 \
--train_module train_w_clip_vit.py \
--output_dir path_to_your_out
```

Place the final SFT checkpoint (`unigen_sft/checkpoint-145824`) in `unigen_data/checkpoints`. Then, run the following command for DPO on 1x 80GB H100/A100 GPU.

```bash
bash scripts/run_sft.sh \
--experiment_config configs/unigen_1_5b/unigen_dpo.yaml \
--pretrained_model unigen_sft/checkpoint-145824 \
--train_module train_dpo.py \
--output_dir path_to_your_out
```

Place the final DPO checkpoint (`unigen_dpo/unwrapped_model`) in `unigen_data/checkpoints`. Then, run the following command for CoT-V post-training on 1x 80GB H100/A100 GPU.

```bash
bash scripts/run_cotv.sh \
--experiment_config configs/unigen_1_5b/unigen_cotv_post_sft.yaml \
--pretrained_model unigen_dpo/unwrapped_model \
--train_module train_w_clip_vit.py \
--output_dir path_to_your_out
```

Install the necessary requirements and clone the required repos for evaluating on understanding (lmms-eval) and generation (GenEval, DPG-Bench) benchmarks.

```bash
bash scripts/setup_eval.sh
```

Next, download the checkpoints required for evaluation.

```bash
LOCAL_CHECKPOINT_DIR=unigen_data/checkpoints
python -c $'from modelscope.hub.snapshot_download import snapshot_download\nsnapshot_download("damo/mplug_visual-question-answering_coco_large_en")'
bash third_party/geneval/evaluation/download_models.sh $LOCAL_CHECKPOINT_DIR
```

Evaluate the Stage 1 pre-training checkpoint on the generation benchmarks (GenEval and DPG-Bench).

```bash
bash scripts/run_evaluation.sh \
--config configs/unigen_1_5b/unigen_pt1.yaml \
--eval_modules geneval+dpgbench \
--eval_checkpoint unigen_pt1/checkpoint-150000 \
--output_dir path_to_your_out \
--local_shared_fs unigen_data
```

Evaluate the Stage 2 pre-training checkpoint on GenEval and DPG-Bench.

```bash
bash scripts/run_evaluation.sh \
--config configs/unigen_1_5b/unigen_pt2.yaml \
--eval_modules geneval+dpgbench \
--eval_checkpoint unigen_pt2/checkpoint-400000 \
--output_dir path_to_your_out \
--local_shared_fs unigen_data
```

Evaluate the SFT checkpoint on the understanding benchmarks (MMMU, GQA, AI2D, MME, MathVista, MM-Vet).

```bash
bash scripts/run_evaluation.sh \
--config configs/unigen_1_5b/unigen_sft.yaml \
--lmms_tasks "mmmu_val,gqa,ai2d,mme,mathvista_testmini,mmvet" \
--eval_modules lmms \
--eval_checkpoint unigen_sft/checkpoint-145824 \
--output_dir path_to_your_out \
--local_shared_fs unigen_data
```

Evaluate the SFT checkpoint on the remaining understanding benchmarks (RealWorldQA, ScienceQA, SEED-Bench, POPE) plus GenEval and DPG-Bench.

```bash
bash scripts/run_evaluation.sh \
--config configs/unigen_1_5b/unigen_sft.yaml \
--lmms_tasks "realworldqa,scienceqa_img,seedbench,pope" \
--eval_modules lmms+geneval+dpgbench \
--eval_checkpoint unigen_sft/checkpoint-145824 \
--output_dir path_to_your_out \
--local_shared_fs unigen_data
```

Evaluate the DPO checkpoint on the understanding benchmarks (MMMU, GQA, AI2D, MME, MathVista, MM-Vet).

```bash
bash scripts/run_evaluation.sh \
--config configs/unigen_1_5b/unigen_dpo.yaml \
--lmms_tasks "mmmu_val,gqa,ai2d,mme,mathvista_testmini,mmvet" \
--eval_modules lmms \
--eval_checkpoint unigen_dpo/unwrapped_model \
--output_dir path_to_your_out \
--local_shared_fs unigen_data
```

Evaluate the DPO checkpoint on the remaining understanding benchmarks plus GenEval and DPG-Bench.

```bash
bash scripts/run_evaluation.sh \
--config configs/unigen_1_5b/unigen_dpo.yaml \
--lmms_tasks "realworldqa,scienceqa_img,seedbench,pope" \
--eval_modules lmms+geneval+dpgbench \
--eval_checkpoint unigen_dpo/unwrapped_model \
--output_dir path_to_your_out \
--local_shared_fs unigen_data
```

Evaluate the CoT-V post-trained checkpoint on the understanding benchmarks (MMMU, GQA, AI2D, MME, MathVista, MM-Vet).

```bash
bash scripts/run_evaluation.sh \
--config configs/unigen_1_5b/unigen_cotv_post_sft.yaml \
--lmms_tasks "mmmu_val,gqa,ai2d,mme,mathvista_testmini,mmvet" \
--eval_modules lmms \
--eval_checkpoint unigen/checkpoint-500 \
--output_dir path_to_your_out \
--local_shared_fs unigen_data
```

Evaluate the CoT-V post-trained checkpoint on the remaining understanding benchmarks.

```bash
bash scripts/run_evaluation.sh \
--config configs/unigen_1_5b/unigen_cotv_post_sft.yaml \
--lmms_tasks "realworldqa,scienceqa_img,seedbench,pope" \
--eval_checkpoint unigen/checkpoint-500 \
--output_dir path_to_your_out \
--local_shared_fs unigen_data
```

To perform Best-of-N (where N=5) test-time scaling with CoT-V, set `mmu_rating_style="think"`.
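Conceptually, CoT-V uses the model itself as a verifier: it generates N candidate images per prompt, reasons step by step about how well each candidate matches the prompt, and keeps the highest-rated one. The snippet below is only a schematic of that Best-of-N selection; `generate_image`, `cot_verify_score`, and the scoring scale are hypothetical placeholders rather than this repository's API (the evaluation commands below run the real procedure).

```python
# Schematic only: Best-of-N selection with a chain-of-thought verifier.
# generate_image() and cot_verify_score() stand in for UniGen's generation and
# CoT-V rating calls; they are hypothetical placeholders, not this repo's API.
from typing import Callable, Tuple, TypeVar

ImageT = TypeVar("ImageT")


def best_of_n(
    prompt: str,
    n: int,
    generate_image: Callable[[str], ImageT],
    cot_verify_score: Callable[[str, ImageT], Tuple[str, float]],
) -> ImageT:
    """Generate n candidates and keep the one the CoT verifier rates highest."""
    best_image, best_score = None, float("-inf")
    for _ in range(n):
        image = generate_image(prompt)
        # The verifier reasons step by step about prompt/image consistency
        # ("think"-style rating) and returns its reasoning plus a scalar score.
        _reasoning, score = cot_verify_score(prompt, image)
        if score > best_score:
            best_image, best_score = image, score
    return best_image  # e.g. best_of_n(prompt, n=5, ...) for the N=5 setting above
```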
A. On the GenEval Benchmark

```bash
bash scripts/run_evaluation.sh \
--config configs/unigen_1_5b/unigen_cotv_post_sft.yaml \
--eval_modules cot-gen \
--eval_checkpoint unigen/checkpoint-500 \
--local_shared_fs unigen_data \
--output_dir path_to_your_out \
--mmu_rating_style think
```

B. On the DPG Benchmark

```bash
bash scripts/run_evaluation.sh \
--config configs/unigen_1_5b/unigen_cotv_post_sft.yaml \
--eval_modules cot-dpg \
--eval_checkpoint unigen/checkpoint-500 \
--local_shared_fs unigen_data \
--output_dir path_to_your_out \
--mmu_rating_style think
```

This project is licensed under the Apple Sample Code License.
If you are using the data/code/model provided here in a publication, please cite our paper:
```bibtex
@article{tian2025unigen,
title={UniGen: Enhanced Training \& Test-Time Strategies for Unified Multimodal Understanding and Generation},
author={Tian, Rui and Gao, Mingfei and Xu, Mingze and Hu, Jiaming and Lu, Jiasen and Wu, Zuxuan and Yang, Yinfei and Dehghan, Afshin},
journal={arXiv preprint arXiv:2505.14682},
year={2025}
}
```

```bibtex
@article{tian2025unigen1.5,
title={UniGen-1.5: Enhancing Image Generation and Editing through Reward Unification in Reinforcement Learning},
author={Tian, Rui and Gao, Mingfei and Gang, Haiming and Lu, Jiasen and Gan, Zhe and Yang, Yinfei and Wu, Zuxuan and Dehghan, Afshin},
journal={arXiv preprint arXiv:2511.14760},
year={2025}
}
```
