Factorized Visual Tokenization and Generation

Zechen Bai 1  Jianxiong Gao 2  Ziteng Gao 1 

Pichao Wang 3  Zheng Zhang 3  Tong He 3  Mike Zheng Shou 1 

arXiv 2024

1 Show Lab, National University of Singapore   2 Fudan University  3 Amazon 


News

  • [2024-12-26] We released our code!
  • [2024-11-26] We released our paper on arXiv.

TL;DR

FQGAN is a state-of-the-art visual tokenizer with a novel factorized tokenization design, surpassing VQ and LFQ methods in discrete image reconstruction.

Method Overview

FQGAN addresses the codebook-usage issue that arises with large codebooks by decomposing a single large codebook into multiple independent sub-codebooks. Through disentanglement regularization and representation-learning objectives, the sub-codebooks learn hierarchical, structured, and semantically meaningful representations. FQGAN achieves state-of-the-art performance on discrete image reconstruction, surpassing VQ and LFQ methods.
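
To make the factorized design concrete, below is a minimal PyTorch sketch of the core idea: the encoder latent is split into independent factors, and each factor is quantized against its own sub-codebook with a straight-through estimator. This is an illustrative simplification, not the repository's implementation; class names, shapes, and hyperparameters are assumptions, and the disentanglement and representation-learning losses from the paper are omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SubCodebook(nn.Module):
    # One sub-codebook: nearest-neighbor lookup with a straight-through estimator.
    def __init__(self, num_codes, dim):
        super().__init__()
        self.embedding = nn.Embedding(num_codes, dim)
        self.embedding.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)

    def forward(self, z):
        # z: (B, N, dim) latent vectors assigned to this factor
        flat = z.reshape(-1, z.shape[-1])                     # (B*N, dim)
        dist = torch.cdist(flat, self.embedding.weight)       # (B*N, num_codes)
        idx = dist.argmin(dim=-1).view(z.shape[:-1])          # (B, N) code indices
        z_q = self.embedding(idx)                             # (B, N, dim) quantized vectors
        # Standard VQ codebook + commitment terms (sketch only)
        commit = F.mse_loss(z_q.detach(), z) + F.mse_loss(z_q, z.detach())
        z_q = z + (z_q - z).detach()                          # straight-through gradient
        return z_q, idx, commit

class FactorizedQuantizer(nn.Module):
    # Split the encoder latent into factors; each factor gets its own independent codebook.
    def __init__(self, num_factors=2, num_codes=16384, dim=8):
        super().__init__()
        self.books = nn.ModuleList(SubCodebook(num_codes, dim) for _ in range(num_factors))

    def forward(self, z):
        # z: (B, N, num_factors * dim)
        chunks = z.chunk(len(self.books), dim=-1)
        outs = [book(c) for book, c in zip(self.books, chunks)]
        z_q = torch.cat([o[0] for o in outs], dim=-1)         # recombined quantized latent
        indices = torch.stack([o[1] for o in outs], dim=-1)   # (B, N, num_factors) code ids
        commit = sum(o[2] for o in outs)
        return z_q, indices, commit

quant = FactorizedQuantizer(num_factors=2, num_codes=16384, dim=8)
z = torch.randn(4, 256, 16)          # e.g. a 16x16 latent grid with 2 factors of 8 dims each
z_q, idx, commit = quant(z)
print(z_q.shape, idx.shape, commit.item())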

Getting Started

Pre-trained Models

| Method       | Downsample | rFID (256x256) | Weight                  |
|--------------|------------|----------------|-------------------------|
| FQGAN-Dual   | 16         | 0.94           | fqgan_dual_ds16.pt      |
| FQGAN-Triple | 16         | 0.76           | fqgan_triple_ds16.pt    |
| FQGAN-Dual   | 8          | 0.32           | fqgan_dual_ds8.pt       |
| FQGAN-Triple | 8          | 0.24           | fqgan_triple_ds8_c2i.pt |
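
Once a checkpoint is downloaded, you can quickly inspect it before wiring it into the evaluation commands below. This is a generic sketch: the layout of the saved dictionary is defined by the training script, so the snippet only prints whatever keys are actually stored rather than assuming any.

import torch

ckpt = torch.load("fqgan_dual_ds16.pt", map_location="cpu")
print(type(ckpt))
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))   # top-level entries stored in the checkpoint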

Setup

The main dependencies of this project are PyTorch and Transformers. You may use your existing Python environment.

git clone https://github.com/showlab/FQGAN.git

conda create -n fqgan python=3.10 -y
conda activate fqgan

pip3 install torch==2.1.1+cu121 torchvision==0.16.1+cu121 --extra-index-url https://download.pytorch.org/whl/cu121
pip3 install -r requirements.txt
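
As a quick sanity check that the pinned builds were installed and the GPU is visible (plain PyTorch API, nothing repo-specific):

import torch
import torchvision

print(torch.__version__, torchvision.__version__)   # expect 2.1.1+cu121 / 0.16.1+cu121
print("CUDA available:", torch.cuda.is_available())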

Training

First, please prepare the ImageNet dataset.

# Train FQGAN-Dual Tokenizer (Downsample 16X by default)
bash train_fqgan_dual.sh

# Train FQGAN-Triple Tokenizer (Downsample 16X by default)
bash train_fqgan_triple.sh

To train the FAR Generation Model, please follow the instructions in train_far_dual.sh.

Evaluation

Download the pre-trained tokenizer weights, or train the models yourself.

First, generate the reference .npz file of the validation set. You only need to run this command once.

torchrun \
  --nnodes=1 --nproc_per_node=8 --node_rank=0 \
  --master_port=12343 \
  tokenizer/val_ddp.py \
  --data-path /home/ubuntu/DATA/ImageNet/val \
  --image-size 256 \
  --per-proc-batch-size 128
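
Before running the evaluator, you can optionally inspect the reference batch with NumPy. This assumes the script writes stacked uint8 images into the .npz, as in common ADM-style evaluation files; the array key may differ, so the snippet just lists whatever is stored.

import numpy as np

ref = np.load("reconstructions/val_imagenet.npz")
print(ref.files)                 # names of the arrays stored in the file
arr = ref[ref.files[0]]
print(arr.shape, arr.dtype)      # e.g. (N, 256, 256, 3) uint8 for 256x256 images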

Evaluate FQGAN-Dual model

torchrun \
  --nnodes=1 --nproc_per_node=8 --node_rank=0 \
  --master_port=12344 \
  tokenizer/reconstruction_vq_ddp_dual.py \
  --data-path /home/ubuntu/DATA/ImageNet/val \
  --image-size 256 \
  --vq-model VQ-16 \
  --vq-ckpt results_tokenizer_image/fqgan_dual_ds16.pt \
  --codebook-size 16384 \
  --codebook-embed-dim 8 \
  --per-proc-batch-size 128 \
  --with_clip_supervision \
  --folder-name FQGAN_Dual_DS16

python3 evaluations/evaluator.py \
  reconstructions/val_imagenet.npz \
  reconstructions/FQGAN_Dual_DS16.npz

Evaluate FQGAN-Triple model

torchrun \
  --nnodes=1 --nproc_per_node=8 --node_rank=0 \
  --master_port=12344 \
  tokenizer/reconstruction_vq_ddp_triple.py \
  --data-path /home/ubuntu/DATA/ImageNet/val \
  --image-size 256 \
  --vq-model VQ-16 \
  --vq-ckpt results_tokenizer_image/fqgan_triple_ds16.pt \
  --codebook-size 16384 \
  --codebook-embed-dim 8 \
  --per-proc-batch-size 64 \
  --with_clip_supervision \
  --folder-name FQGAN_Triple_DS16

python3 evaluations/evaluator.py \
  reconstructions/val_imagenet.npz \
  reconstructions/FQGAN_Triple_DS16.npz

To evaluate the FAR Generation Model, please follow the instructions in eval_far.sh.

Comparison with previous visual tokenizers

What has each sub-codebook learned?

Can this tokenizer be used in downstream image generation?

Citation

To cite the paper and model, please use the BibTeX entry below:

@article{bai2024factorized,
  title={Factorized Visual Tokenization and Generation},
  author={Bai, Zechen and Gao, Jianxiong and Gao, Ziteng and Wang, Pichao and Zhang, Zheng and He, Tong and Shou, Mike Zheng},
  journal={arXiv preprint arXiv:2411.16681},
  year={2024}
}

Acknowledgement

This work is based on Taming-Transformers, Open-MAGVIT2, and LlamaGen. Thanks to all the authors for their great work!

License

The code is released under the CC-BY-NC-4.0 license for research purposes only.
