<div align="center">
<h1>UniTok: A Unified Tokenizer for Visual Generation and Understanding</h1>

[**Chuofan Ma**](https://machuofan.github.io/)<sup>1,2</sup> · [**Junfeng Wu**](https://wjf5203.github.io/)<sup>2,3</sup> · [**Yi Jiang**](https://enjoyyi.github.io/)<sup>2†</sup> · [**Jihan Yang**](https://jihanyang.github.io/)<sup>1</sup>
<br>
[**Xin Yu**](https://xinyu-andy.github.io/)<sup>1</sup> · [**Zehuan Yuan**](https://shallowyuan.github.io/)<sup>2*</sup> · [**Bingyue Peng**](https://openreview.net/profile?id=~BINGYUE_PENG1)<sup>2</sup> · [**Xiaojuan Qi**](https://xjqi.github.io/)<sup>1†*</sup>

<sup>1</sup>HKU   <sup>2</sup>ByteDance   <sup>3</sup>HUST
<br>
†project lead   *corresponding author

<a href=""><img src='https://img.shields.io/badge/arXiv-UniTok-red' alt='Paper PDF'></a>
<a href=""><img src='https://img.shields.io/badge/Project_Page-UniTok-green' alt='Project Page'></a>
<a href="https://huggingface.co/FoundationVision/UniTok"><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue'></a>

[//]: # (<a href='https://huggingface.co/datasets/depth-anything/DA-2K'><img src='https://img.shields.io/badge/Benchmark-DA--2K-yellow' alt='Benchmark'></a>)
</div>

This repo implements UniTok, a unified visual tokenizer well-suited for both generation and understanding tasks.
It is compatible with autoregressive generative models (e.g., LlamaGen),
multimodal understanding models (e.g., LLaVA), and unified MLLMs (e.g., Chameleon and Liquid).


Built upon UniTok, we construct an MLLM capable of both multimodal generation and understanding,
which sets a new state of the art among unified autoregressive MLLMs.
The code and weights of our MLLM will be released soon.


## News

**2025-02-14:** Paper, code, and model weights for UniTok are all released.


## Performance

<table>
  <thead>
  <tr>
      <th>Method</th>
      <th>#Tokens</th>
      <th>rFID ↓</th>
      <th>ImageNet Zero-shot Acc. ↑</th>
  </tr>
  </thead>
  <tbody>
  <tr>
      <td colspan="4"><i>VQVAE Model</i></td>
  </tr>
  <tr align="center">
      <td>VQ-GAN</td>
      <td>256</td>
      <td>4.98</td>
      <td>--</td>
  </tr>
  <tr align="center">
      <td>RQ-VAE</td>
      <td>256</td>
      <td>1.30</td>
      <td>--</td>
  </tr>
  <tr align="center">
      <td>VAR</td>
      <td>680</td>
      <td>0.90</td>
      <td>--</td>
  </tr>
  <tr>
      <td colspan="4"><i>CLIP Model</i></td>
  </tr>
  <tr align="center">
      <td>CLIP</td>
      <td>256</td>
      <td>--</td>
      <td>76.2</td>
  </tr>
  <tr align="center">
      <td>SigLIP</td>
      <td>256</td>
      <td>--</td>
      <td>80.5</td>
  </tr>
  <tr align="center">
      <td>ViTamin</td>
      <td>256</td>
      <td>--</td>
      <td>81.2</td>
  </tr>
  <tr>
      <td colspan="4"><i>Unified Model</i></td>
  </tr>
  <tr align="center">
      <td>TokenFlow †</td>
      <td>680</td>
      <td>1.37</td>
      <td>--</td>
  </tr>
  <tr align="center">
      <td>VILA-U †</td>
      <td>256</td>
      <td>1.80</td>
      <td>73.3</td>
  </tr>
  <tr align="center">
      <td>UniTok</td>
      <td>256</td>
      <td>0.39</td>
      <td>70.5</td>
  </tr>
  <tr align="center">
      <td>UniTok †</td>
      <td>256</td>
      <td>0.38</td>
      <td>78.6</td>
  </tr>
  </tbody>
</table>

† indicates the model is initialized from pretrained CLIP weights.
<br>**Note:** Although CLIP weight initialization yields higher ImageNet zero-shot accuracy,
we observe that random initialization leads to better downstream understanding performance.
We therefore release the weights of the randomly initialized UniTok.

[//]: # (**Visual Understanding Performance on VQA Benchmarks.**)

[//]: # ()
[//]: # (| Method | LLM | Res. | VQAv2 | GQA | TextVQA | POPE | MME | MM-Vet |)

[//]: # (|:----------:|:--------------:|:-------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|)

[//]: # (| Show-o | Phi-1.5-1.3B | 256 | 59.3 | 48.7 | - | 73.8 | 948 | - |)

[//]: # (| Liquid | Gemma-7B | 512 | 71.3 | 58.4 | 42.4 | 81.1 | 1119 | - |)

[//]: # (| VILA-U | Llama-2-7B | 256 | 75.3 | 58.3 | 48.3 | 83.9 | 1336 | 27.7 |)

[//]: # (| **UniTok** | **Llama-2-7B** | **256** | **76.8** | **61.1** | **51.6** | **83.2** | **1448** | **33.9** |)

[//]: # ()
[//]: # (**Visual Generation Performance on GenAI-Bench.**)

[//]: # ()
[//]: # (<table>)

[//]: # (  <thead>)

[//]: # (  <tr>)

[//]: # (      <th rowspan="2">Method</th>)

[//]: # (      <th rowspan="2">Type</th>)

[//]: # (      <th rowspan="2">Count</th>)

[//]: # (      <th rowspan="2">Differ</th>)

[//]: # (      <th rowspan="2">Compare</th>)

[//]: # (      <th colspan="2">Logical</th>)

[//]: # (      <th rowspan="2">Overall</th>)

[//]: # (  </tr>)

[//]: # (  <tr>)

[//]: # (      <th>Negate</th>)

[//]: # (      <th>Universal</th>)

[//]: # (  </tr>)

[//]: # (  </thead>)

[//]: # (  <tbody>)

[//]: # (  <tr align="center">)

[//]: # (      <td>Show-o</td>)

[//]: # (      <td>Discrete Diff.</td>)

[//]: # (      <td>0.70</td>)

[//]: # (      <td>0.62</td>)

[//]: # (      <td>0.71</td>)

[//]: # (      <td>0.51</td>)

[//]: # (      <td>0.65</td>)

[//]: # (      <td>0.60</td>)

[//]: # (  </tr>)

[//]: # (  <tr align="center">)

[//]: # (      <td>VILA-U</td>)

[//]: # (      <td>Autoregressive</td>)

[//]: # (      <td>0.70</td>)

[//]: # (      <td>0.71</td>)

[//]: # (      <td>0.74</td>)

[//]: # (      <td>0.53</td>)

[//]: # (      <td>0.66</td>)

[//]: # (      <td>0.64</td>)

[//]: # (  </tr>)

[//]: # (  <tr align="center">)

[//]: # (      <td>Liquid</td>)

[//]: # (      <td>Autoregressive</td>)

[//]: # (      <td>0.76</td>)

[//]: # (      <td>0.73</td>)

[//]: # (      <td>0.74</td>)

[//]: # (      <td>0.46</td>)

[//]: # (      <td>0.74</td>)

[//]: # (      <td>0.65</td>)

[//]: # (  </tr>)

[//]: # (  <tr align="center">)

[//]: # (      <th>UniTok</th>)

[//]: # (      <th>Autoregressive</th>)

[//]: # (      <th>0.76</th>)

[//]: # (      <th>0.79</th>)

[//]: # (      <th>0.74</th>)

[//]: # (      <th>0.46</th>)

[//]: # (      <th>0.73</th>)

[//]: # (      <th>0.67</th>)

[//]: # (  </tr>)

[//]: # (  </tbody>)

[//]: # (</table>)


## Model Weights

| Model | Res. | #Token | Code Shape | rFID | Checkpoint |
|:------------:|:----:|:------:|:-------------------------:|:----:|:------------:|
| UniTok-Large | 256 | 256 | 16 $\times$ 16 $\times$ 8 | 0.39 | [Download](https://huggingface.co/FoundationVision/UniTok/blob/main/unitok_tokenizer.pth) |
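
The code shape 16 $\times$ 16 $\times$ 8 means each 256 $\times$ 256 image is encoded into a 16 $\times$ 16 grid of tokens, with 8 discrete sub-codes per position from the multi-codebook quantizer (256 tokens per image). As a minimal sketch, the released checkpoint can be fetched and inspected as below; this assumes the extra `huggingface_hub` dependency (not listed in `requirements.txt`) and that the `.pth` file is a standard PyTorch checkpoint, so the exact key layout is not guaranteed.

```python
import torch
from huggingface_hub import hf_hub_download  # assumption: install separately if needed

# Fetch the released tokenizer weights from the Hugging Face repo linked above.
ckpt_path = hf_hub_download(repo_id="FoundationVision/UniTok",
                            filename="unitok_tokenizer.pth")

# Load on CPU and peek at the top-level structure (the layout may differ).
ckpt = torch.load(ckpt_path, map_location="cpu")
print(type(ckpt))
if isinstance(ckpt, dict):
    print(list(ckpt.keys())[:10])
```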


## Usage

### Requirements
- Python ≥ 3.10
- PyTorch ≥ 2.3.1

### Installation

```bash
git clone https://github.com/FoundationVision/UniTok.git
cd UniTok
pip install -r requirements.txt
```
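
You can quickly confirm that your environment meets the requirements above with a short check (a convenience snippet, not part of the repo):

```python
import sys
import torch

# Sanity-check versions against the requirements listed above.
print("Python:", sys.version.split()[0])        # expected >= 3.10
print("PyTorch:", torch.__version__)            # expected >= 2.3.1
print("CUDA available:", torch.cuda.is_available())
```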

### Inference

Please download the [checkpoint](https://huggingface.co/FoundationVision/UniTok/blob/main/unitok_tokenizer.pth) and pass its path via `--ckpt_path`:
```bash
python inference.py \
    --ckpt_path /path/to/unitok/checkpoint \
    --src_img /path/to/test_img --rec_img /path/to/rec_img
```
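
For programmatic use, the reconstruction round trip conceptually follows the sketch below. The preprocessing shown here (256 $\times$ 256 input with CLIP-style normalization) and the `encode`/`decode` calls are assumptions for illustration only; `inference.py` defines the actual loading, preprocessing, and model API.

```python
import torch
from PIL import Image
from torchvision import transforms

# Preprocessing sketch: UniTok operates on 256x256 inputs (see the table above).
# The normalization constants below are CLIP-style defaults, assumed here for
# illustration -- check inference.py for the values the repo actually uses.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(256),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.48145466, 0.4578275, 0.40821073),
                         std=(0.26862954, 0.26130258, 0.27577711)),
])

img = preprocess(Image.open("test.jpg").convert("RGB")).unsqueeze(0)  # (1, 3, 256, 256)

# Hypothetical round trip -- the real class and method names come from the repo,
# not from this sketch:
#   codes = tokenizer.encode(img)    # discrete indices, e.g. shape (1, 256, 8)
#   recon = tokenizer.decode(codes)  # reconstructed image tensor
```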

### Training

- We train UniTok on [DataComp-1B](https://github.com/mlfoundations/datacomp).
Please follow the [instructions](https://github.com/mlfoundations/datacomp?tab=readme-ov-file#downloading-datacomp-1b) to download and prepare the data.

- Download the [models](https://huggingface.co/FoundationVision/UniTok/tree/main/external) used for loss calculation and put them in `./external`.

- Download the [ImageNet validation set](https://www.image-net.org/) for zero-shot accuracy evaluation.

- Download the ImageNet 256$\times$256 [reference batch](https://huggingface.co/datasets/FoundationVision/imagenet_reference_batch) for FID evaluation.

Configure `nnodes, nproc_per_node, node_rank, master_addr, master_port` in `launch.sh` and run:

```bash
bash launch.sh \
    --output_dir '/path/to/save/checkpoints/' \
    --train_data '/path/to/datacomp/shards/{00000000..00140146}.tar' \
    --imagenet_val '/path/to/imagenet_val/' \
    --fid_eval_src '/path/to/imagenet_reference_batch' \
    --fid_eval_dst '/path/to/save/imagenet_reconstructed_batch'
```
**Note:** For more hyper-parameter configurations, please check `utils/config.py`.
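
The `--train_data` argument is a WebDataset-style brace pattern over the DataComp tar shards. To preview how many shards a pattern covers before launching, you can expand it with the `braceexpand` package (an assumption for illustration; the training pipeline resolves the pattern itself):

```python
from braceexpand import braceexpand  # pip install braceexpand

# Expand the shard pattern passed to --train_data and count the shards it matches.
pattern = "/path/to/datacomp/shards/{00000000..00140146}.tar"
shards = list(braceexpand(pattern))
print(len(shards), "shards")        # 140147 shards for the example range above
print(shards[0], "->", shards[-1])  # first and last shard paths
```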

### Evaluation

We benchmark the understanding performance of UniTok with the [LLaVA](https://github.com/haotian-liu/LLaVA) framework
and its generation performance with the [LlamaGen](https://github.com/FoundationVision/LlamaGen) framework.
Please refer to [EVAL.md](eval/EVAL.md) for more details.



## Acknowledgement
UniTok is built upon the awesome works
[VAR](https://github.com/FoundationVision/VAR),
[DataComp](https://github.com/mlfoundations/datacomp),
[LLaVA](https://github.com/haotian-liu/LLaVA/),
[LlamaGen](https://github.com/FoundationVision/LlamaGen/),
and [ViTamin](https://github.com/Beckschen/ViTamin).


## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.


## Citation

If you find this project useful, please consider citing:

```bibtex
@article{unitok,
  title={UniTok: A Unified Tokenizer for Visual Generation and Understanding},
  author={Ma, Chuofan and Wu, Junfeng and Jiang, Yi and Yang, Jihan and Yu, Xin and Yuan, Zehuan and Peng, Bingyue and Qi, Xiaojuan},
  journal={},
  year={2025}
}
```