Factorized Visual Tokenization and Generation

Zechen Bai 1  Jianxiong Gao 2  Ziteng Gao 1 

Pichao Wang 3  Zheng Zhang 3  Tong He 3  Mike Zheng Shou 1 

arXiv 2024

1 Show Lab, National University of Singapore   2 Fudan University  3 Amazon 


News

  • [2024-12-26] We released our code!
  • [2024-11-26] We released our paper on arXiv.

TL;DR

FQGAN is a state-of-the-art visual tokenizer with a novel factorized tokenization design, surpassing VQ and LFQ methods in discrete image reconstruction.

Method Overview

FQGAN addresses the codebook-usage issue that arises with large codebooks by decomposing a single large codebook into multiple independent sub-codebooks. Through disentanglement regularization and representation-learning objectives, the sub-codebooks learn hierarchical, structured, and semantically meaningful representations. FQGAN achieves state-of-the-art performance on discrete image reconstruction, surpassing VQ and LFQ methods.
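
To make the factorized design concrete, below is a minimal PyTorch sketch of the core idea: the encoder latent is split into independent factors, and each factor is quantized against its own sub-codebook with a straight-through estimator. This is an illustrative simplification, not the repository's implementation; class names, shapes, and hyperparameters are assumptions, and the disentanglement and representation-learning losses from the paper are omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SubCodebook(nn.Module):
    # One sub-codebook: nearest-neighbor lookup with a straight-through estimator.
    def __init__(self, num_codes, dim):
        super().__init__()
        self.embedding = nn.Embedding(num_codes, dim)
        self.embedding.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)

    def forward(self, z):
        # z: (B, N, dim) latent vectors assigned to this factor
        flat = z.reshape(-1, z.shape[-1])                     # (B*N, dim)
        dist = torch.cdist(flat, self.embedding.weight)       # (B*N, num_codes)
        idx = dist.argmin(dim=-1).view(z.shape[:-1])          # (B, N) code indices
        z_q = self.embedding(idx)                             # (B, N, dim) quantized vectors
        # Standard VQ codebook + commitment terms (sketch only)
        commit = F.mse_loss(z_q.detach(), z) + F.mse_loss(z_q, z.detach())
        z_q = z + (z_q - z).detach()                          # straight-through gradient
        return z_q, idx, commit

class FactorizedQuantizer(nn.Module):
    # Split the encoder latent into factors; each factor gets its own independent codebook.
    def __init__(self, num_factors=2, num_codes=16384, dim=8):
        super().__init__()
        self.books = nn.ModuleList(SubCodebook(num_codes, dim) for _ in range(num_factors))

    def forward(self, z):
        # z: (B, N, num_factors * dim)
        chunks = z.chunk(len(self.books), dim=-1)
        outs = [book(c) for book, c in zip(self.books, chunks)]
        z_q = torch.cat([o[0] for o in outs], dim=-1)         # recombined quantized latent
        indices = torch.stack([o[1] for o in outs], dim=-1)   # (B, N, num_factors) code ids
        commit = sum(o[2] for o in outs)
        return z_q, indices, commit

quant = FactorizedQuantizer(num_factors=2, num_codes=16384, dim=8)
z = torch.randn(4, 256, 16)          # e.g. a 16x16 latent grid with 2 factors of 8 dims each
z_q, idx, commit = quant(z)
print(z_q.shape, idx.shape, commit.item())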

Getting Started

Pre-trained Models

| Method       | Downsample | rFID (256x256) | Weight                  |
|--------------|------------|----------------|-------------------------|
| FQGAN-Dual   | 16         | 0.94           | fqgan_dual_ds16.pt      |
| FQGAN-Triple | 16         | 0.76           | fqgan_triple_ds16.pt    |
| FQGAN-Dual   | 8          | 0.32           | fqgan_dual_ds8.pt       |
| FQGAN-Triple | 8          | 0.24           | fqgan_triple_ds8_c2i.pt |
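
Once a checkpoint is downloaded, you can quickly inspect it before wiring it into the evaluation commands below. This is a generic sketch: the layout of the saved dictionary is defined by the training script, so the snippet only prints whatever keys are actually stored rather than assuming any.

import torch

ckpt = torch.load("fqgan_dual_ds16.pt", map_location="cpu")
print(type(ckpt))
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))   # top-level entries stored in the checkpoint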

Setup

The main dependencies of this project are PyTorch and Transformers. You may use your existing Python environment.

git clone https://github.com/showlab/FQGAN.git

conda create -n fqgan python=3.10 -y
conda activate fqgan

pip3 install torch==2.1.1+cu121 torchvision==0.16.1+cu121 --extra-index-url https://download.pytorch.org/whl/cu121
pip3 install -r requirements.txt
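
As a quick sanity check that the pinned builds were installed and the GPU is visible (plain PyTorch API, nothing repo-specific):

import torch
import torchvision

print(torch.__version__, torchvision.__version__)   # expect 2.1.1+cu121 / 0.16.1+cu121
print("CUDA available:", torch.cuda.is_available())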

Training

First, please prepare the ImageNet dataset.

# Train FQGAN-Dual Tokenizer (Downsample 16X by default)
bash train_fqgan_dual.sh

# Train FQGAN-Triple Tokenizer (Downsample 16X by default)
bash train_fqgan_triple.sh

To train the FAR Generation Model, please follow the instructions in train_far_dual.sh.

Evaluation

Download the pre-trained tokenizer weights, or train the models yourself.

First, generate the reference .npz file of the validation set. You only need to run this command once.

torchrun \
  --nnodes=1 --nproc_per_node=8 --node_rank=0 \
  --master_port=12343 \
  tokenizer/val_ddp.py \
  --data-path /home/ubuntu/DATA/ImageNet/val \
  --image-size 256 \
  --per-proc-batch-size 128
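
Before running the evaluator, you can optionally inspect the reference batch with NumPy. This assumes the script writes stacked uint8 images into the .npz, as in common ADM-style evaluation files; the array key may differ, so the snippet just lists whatever is stored.

import numpy as np

ref = np.load("reconstructions/val_imagenet.npz")
print(ref.files)                 # names of the arrays stored in the file
arr = ref[ref.files[0]]
print(arr.shape, arr.dtype)      # e.g. (N, 256, 256, 3) uint8 for 256x256 images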

Evaluate FQGAN-Dual model

torchrun \
  --nnodes=1 --nproc_per_node=8 --node_rank=0 \
  --master_port=12344 \
  tokenizer/reconstruction_vq_ddp_dual.py \
  --data-path /home/ubuntu/DATA/ImageNet/val \
  --image-size 256 \
  --vq-model VQ-16 \
  --vq-ckpt results_tokenizer_image/fqgan_dual_ds16.pt \
  --codebook-size 16384 \
  --codebook-embed-dim 8 \
  --per-proc-batch-size 128 \
  --with_clip_supervision \
  --folder-name FQGAN_Dual_DS16

python3 evaluations/evaluator.py \
  reconstructions/val_imagenet.npz \
  reconstructions/FQGAN_Dual_DS16.npz

Evaluate FQGAN-Triple model

torchrun \
  --nnodes=1 --nproc_per_node=8 --node_rank=0 \
  --master_port=12344 \
  tokenizer/reconstruction_vq_ddp_triple.py \
  --data-path /home/ubuntu/DATA/ImageNet/val \
  --image-size 256 \
  --vq-model VQ-16 \
  --vq-ckpt results_tokenizer_image/fqgan_triple_ds16.pt \
  --codebook-size 16384 \
  --codebook-embed-dim 8 \
  --per-proc-batch-size 64 \
  --with_clip_supervision \
  --folder-name FQGAN_Triple_DS16

python3 evaluations/evaluator.py \
  reconstructions/val_imagenet.npz \
  reconstructions/FQGAN_Triple_DS16.npz

To evaluate the FAR Generation Model, please follow the instructions in eval_far.sh.

Comparison with previous visual tokenizers

What has each sub-codebook learned?

Can this tokenizer be used in downstream image generation?

Citation

To cite the paper and model, please use the BibTeX entry below:

@article{bai2024factorized,
  title={Factorized Visual Tokenization and Generation},
  author={Bai, Zechen and Gao, Jianxiong and Gao, Ziteng and Wang, Pichao and Zhang, Zheng and He, Tong and Shou, Mike Zheng},
  journal={arXiv preprint arXiv:2411.16681},
  year={2024}
}

Acknowledgement

This work is based on Taming-Transformers, Open-MAGVIT2, and LlamaGen. Thanks to all the authors for their great work!

License

The code is released under the CC-BY-NC-4.0 license for research purposes only.
