
Commit a56697f: init
Author: machuofan
0 parents · 346 files changed · +85152 / -0 lines


.gitignore

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
*.swp
**/__pycache__/**
**/.ipynb_checkpoints/**
.DS_Store
.idea/*
.vscode/*
llava/
_vis_cached/
_auto_*
ckpt/
log/
tb*/
img*/
local_output*
*.pth
*.pth.tar
*.ckpt
*.log
*.txt
*.ipynb

LICENSE

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2024 FoundationVision

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

README.md

Lines changed: 349 additions & 0 deletions
@@ -0,0 +1,349 @@
<div align="center">
<h1>UniTok: A Unified Tokenizer for Visual Generation and Understanding</h1>

[**Chuofan Ma**](https://machuofan.github.io/)<sup>1,2</sup> · [**Junfeng Wu**](https://wjf5203.github.io/)<sup>2,3</sup> · [**Yi Jiang**](https://enjoyyi.github.io/)<sup>2&dagger;</sup> · [**Jihan Yang**](https://jihanyang.github.io/)<sup>1</sup>
<br>
[**Xin Yu**](https://xinyu-andy.github.io/)<sup>1</sup> · [**Zehuan Yuan**](https://shallowyuan.github.io/)<sup>2*</sup> · [**Bingyue Peng**](https://openreview.net/profile?id=~BINGYUE_PENG1)<sup>2</sup> · [**Xiaojuan Qi**](https://xjqi.github.io/)<sup>1&dagger;*</sup>

<sup>1</sup>HKU&emsp;&emsp;&emsp;<sup>2</sup>ByteDance&emsp;&emsp;&emsp;<sup>3</sup>HUST
<br>
&dagger;project lead&emsp;&emsp;&emsp;*corresponding author

<a href=""><img src='https://img.shields.io/badge/arXiv-UniTok-red' alt='Paper PDF'></a>
<a href=""><img src='https://img.shields.io/badge/Project_Page-UniTok-green' alt='Project Page'></a>
<a href="https://huggingface.co/FoundationVision/UniTok"><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue'></a>

[//]: # (<a href='https://huggingface.co/datasets/depth-anything/DA-2K'><img src='https://img.shields.io/badge/Benchmark-DA--2K-yellow' alt='Benchmark'></a>)
</div>

This repo implements UniTok, a unified visual tokenizer well-suited for both generation and understanding tasks.
It is compatible with autoregressive generative models (e.g., LlamaGen),
multimodal understanding models (e.g., LLaVA), and unified MLLMs (e.g., Chameleon and Liquid).

![teaser](assets/teaser.png)

Building on UniTok, we construct an MLLM capable of both multimodal generation and understanding,
which sets a new state of the art among unified autoregressive MLLMs.
The code and weights of our MLLM will be released soon.

![samples](assets/samples.png)

## News

**2025-02-14:** Paper, code, and model weights for UniTok are all released.


## Performance

<table>
  <thead>
    <tr>
      <th>Method</th>
      <th>#Tokens</th>
      <th>rFID &darr;</th>
      <th>Zero-shot Acc. &uarr;</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td colspan="4"><i>VQVAE Model</i></td>
    </tr>
    <tr align="center">
      <td>VQ-GAN</td>
      <td>256</td>
      <td>4.98</td>
      <td>--</td>
    </tr>
    <tr align="center">
      <td>RQ-VAE</td>
      <td>256</td>
      <td>1.30</td>
      <td>--</td>
    </tr>
    <tr align="center">
      <td>VAR</td>
      <td>680</td>
      <td>0.90</td>
      <td>--</td>
    </tr>
    <tr>
      <td colspan="4"><i>CLIP Model</i></td>
    </tr>
    <tr align="center">
      <td>CLIP</td>
      <td>256</td>
      <td>--</td>
      <td>76.2</td>
    </tr>
    <tr align="center">
      <td>SigLIP</td>
      <td>256</td>
      <td>--</td>
      <td>80.5</td>
    </tr>
    <tr align="center">
      <td>ViTamin</td>
      <td>256</td>
      <td>--</td>
      <td>81.2</td>
    </tr>
    <tr>
      <td colspan="4"><i>Unified Model</i></td>
    </tr>
    <tr align="center">
      <td>TokenFlow &dagger;</td>
      <td>680</td>
      <td>1.37</td>
      <td>--</td>
    </tr>
    <tr align="center">
      <td>VILA-U &dagger;</td>
      <td>256</td>
      <td>1.80</td>
      <td>73.3</td>
    </tr>
    <tr align="center">
      <td>UniTok</td>
      <td>256</td>
      <td>0.39</td>
      <td>70.5</td>
    </tr>
    <tr align="center">
      <td>UniTok &dagger;</td>
      <td>256</td>
      <td>0.38</td>
      <td>78.6</td>
    </tr>
  </tbody>
</table>

&dagger; indicates the model uses pretrained CLIP weights for initialization.
<br>**Note:** Although CLIP weight initialization yields better ImageNet zero-shot accuracy,
we find that random initialization leads to better downstream understanding performance.
We thus release the model weights of the randomly initialized UniTok.

[//]: # (**Visual Understanding Performance on VQA Benchmarks.**)

[//]: # ()
[//]: # (| Method | LLM | Res. | VQAv2 | GQA | TextVQA | POPE | MME | MM-Vet |)

[//]: # (|:----------:|:--------------:|:-------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|)

[//]: # (| Show-o | Phi-1.5-1.3B | 256 | 59.3 | 48.7 | - | 73.8 | 948 | - |)

[//]: # (| Liquid | Gemma-7B | 512 | 71.3 | 58.4 | 42.4 | 81.1 | 1119 | - |)

[//]: # (| VILA-U | Llama-2-7B | 256 | 75.3 | 58.3 | 48.3 | 83.9 | 1336 | 27.7 |)

[//]: # (| **UniTok** | **Llama-2-7B** | **256** | **76.8** | **61.1** | **51.6** | **83.2** | **1448** | **33.9** |)

[//]: # ()
[//]: # (**Visual Generation Performance on GenAI-Bench.**)

[//]: # ()
[//]: # (<table>)

[//]: # ( <thead>)

[//]: # ( <tr>)

[//]: # ( <th rowspan="2">Method</th>)

[//]: # ( <th rowspan="2">Type</th>)

[//]: # ( <th rowspan="2">Count</th>)

[//]: # ( <th rowspan="2">Differ</th>)

[//]: # ( <th rowspan="2">Compare</th>)

[//]: # ( <th colspan="2">Logical</th>)

[//]: # ( <th rowspan="2">Overall</th>)

[//]: # ( </tr>)

[//]: # ( <tr>)

[//]: # ( <th>Negate</th>)

[//]: # ( <th>Universal</th>)

[//]: # ( </tr>)

[//]: # ( </thead>)

[//]: # ( <tbody>)

[//]: # ( <tr align="center">)

[//]: # ( <td>Show-o</td>)

[//]: # ( <td>Discrete Diff.</td>)

[//]: # ( <td>0.70</td>)

[//]: # ( <td>0.62</td>)

[//]: # ( <td>0.71</td>)

[//]: # ( <td>0.51</td>)

[//]: # ( <td>0.65</td>)

[//]: # ( <td>0.60</td>)

[//]: # ( </tr>)

[//]: # ( <tr align="center">)

[//]: # ( <td>VILA-U</td>)

[//]: # ( <td>Autoregressive</td>)

[//]: # ( <td>0.70</td>)

[//]: # ( <td>0.71</td>)

[//]: # ( <td>0.74</td>)

[//]: # ( <td>0.53</td>)

[//]: # ( <td>0.66</td>)

[//]: # ( <td>0.64</td>)

[//]: # ( </tr>)

[//]: # ( <tr align="center">)

[//]: # ( <td>Liquid</td>)

[//]: # ( <td>Autoregressive</td>)

[//]: # ( <td>0.76</td>)

[//]: # ( <td>0.73</td>)

[//]: # ( <td>0.74</td>)

[//]: # ( <td>0.46</td>)

[//]: # ( <td>0.74</td>)

[//]: # ( <td>0.65</td>)

[//]: # ( </tr>)

[//]: # ( <tr align="center">)

[//]: # ( <th>UniTok</th>)

[//]: # ( <th>Autoregressive</th>)

[//]: # ( <th>0.76</th>)

[//]: # ( <th>0.79</th>)

[//]: # ( <th>0.74</th>)

[//]: # ( <th>0.46</th>)

[//]: # ( <th>0.73</th>)

[//]: # ( <th>0.67</th>)

[//]: # ( </tr>)

[//]: # ( </tbody>)

[//]: # (</table>)

## Model Weights

| Model | Res. | #Token | Code Shape | rFID | Checkpoint |
|:------------:|:----:|:------:|:-------------------------:|:----:|:------------:|
| UniTok-Large | 256 | 256 | 16 $\times$ 16 $\times$ 8 | 0.39 | [Download](https://huggingface.co/FoundationVision/UniTok/blob/main/unitok_tokenizer.pth) |
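For intuition on the `Code Shape` column: a 256-resolution input presumably maps to a 16 $\times$ 16 grid of positions with 8 discrete sub-codes per position, which gives the 256 image tokens listed under `#Token`. The snippet below is only a bookkeeping illustration; the codebook size and tensor names are assumptions, not the repo's API.

```python
# Bookkeeping illustration of the 16 x 16 x 8 code shape (not UniTok's actual API).
import torch

codebook_size = 4096                               # assumed vocabulary size per sub-codebook
codes = torch.randint(codebook_size, (16, 16, 8))  # discrete code map for one image
tokens = codes.flatten(0, 1)                       # 16 * 16 = 256 spatial positions
print(tokens.shape)                                # torch.Size([256, 8]): 256 tokens, 8 sub-codes each
```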
## Usage

### Requirements
- Python ≥ 3.10
- PyTorch ≥ 2.3.1

### Installation

```bash
git clone https://github.com/FoundationVision/UniTok.git
cd UniTok
pip install -r requirements.txt
```
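Optionally, you can confirm the environment meets the requirements above with a quick standalone check (not a script shipped with this repo):

```python
# Standalone environment check for the stated requirements (not part of UniTok).
import sys
import torch

torch_version = tuple(int(x) for x in torch.__version__.split("+")[0].split(".")[:3])
assert sys.version_info >= (3, 10), "Python >= 3.10 is required"
assert torch_version >= (2, 3, 1), "PyTorch >= 2.3.1 is required"
print(f"Python {sys.version.split()[0]} / PyTorch {torch.__version__}: OK")
```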
### Inference

Please download the [checkpoint](https://huggingface.co/FoundationVision/UniTok/blob/main/unitok_tokenizer.pth) and fill in the `ckpt_path`.
```bash
python inference.py \
    --ckpt_path /path/to/unitok/checkpoint \
    --src_img /path/to/test_img --rec_img /path/to/rec_img
```
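As a quick sanity check on reconstruction quality, you can compare the source and reconstructed images with a generic PSNR computation (plain NumPy/PIL, independent of the repo's code; the paths below are placeholders):

```python
# Generic reconstruction check (not part of UniTok): PSNR between source and reconstruction.
import numpy as np
from PIL import Image

def psnr(src_path: str, rec_path: str) -> float:
    src = np.asarray(Image.open(src_path).convert("RGB"), dtype=np.float64)
    rec = np.asarray(Image.open(rec_path).convert("RGB"), dtype=np.float64)
    assert src.shape == rec.shape, "images must share the same resolution"
    mse = np.mean((src - rec) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)

print(f"PSNR: {psnr('/path/to/test_img', '/path/to/rec_img'):.2f} dB")
```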
### Training

- We train UniTok on [DataComp-1B](https://github.com/mlfoundations/datacomp).
  Please follow the [instructions](https://github.com/mlfoundations/datacomp?tab=readme-ov-file#downloading-datacomp-1b) to download and prepare the data.

- Download the [models](https://huggingface.co/FoundationVision/UniTok/tree/main/external) used for loss calculation and put them in `./external`.

- Download the [ImageNet validation set](https://www.image-net.org/) for zero-shot accuracy evaluation.

- Download the ImageNet 256$\times$256 [reference batch](https://huggingface.co/datasets/FoundationVision/imagenet_reference_batch) for FID evaluation.

Configure `nnodes, nproc_per_node, node_rank, master_addr, master_port` in `launch.sh` and run:

```bash
bash launch.sh \
    --output_dir '/path/to/save/checkpoints/' \
    --train_data '/path/to/datacomp/shards/{00000000..00140146}.tar' \
    --imagenet_val '/path/to/imagenet_val/' \
    --fid_eval_src '/path/to/imagenet_reference_batch' \
    --fid_eval_dst '/path/to/save/imagenet_reconstructed_batch'
```
**Note:** For more hyper-parameter configurations, please check `utils/config.py`.
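Before launching a long run, it can also help to verify that every shard referenced by the `--train_data` brace pattern actually exists on disk. A small standalone check (hypothetical helper, not part of the repo; adjust the directory and range to your setup):

```python
# Standalone shard check for a {start..end} webdataset brace pattern (not shipped with UniTok).
from pathlib import Path

shard_dir = Path("/path/to/datacomp/shards")  # directory referenced by --train_data
start, end = 0, 140146                        # matches {00000000..00140146}

missing = [f"{i:08d}.tar" for i in range(start, end + 1)
           if not (shard_dir / f"{i:08d}.tar").exists()]
print(f"{end - start + 1 - len(missing)} shards present, {len(missing)} missing")
if missing:
    print("first missing shard:", missing[0])
```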
### Evaluation

We benchmark UniTok's understanding performance using the [LLaVA](https://github.com/haotian-liu/LLaVA) framework
and its generation performance using the [LlamaGen](https://github.com/FoundationVision/LlamaGen) framework.
Please refer to [EVAL.md](eval/EVAL.md) for more details.


## Acknowledgement
UniTok is built upon the awesome works
[VAR](https://github.com/FoundationVision/VAR),
[DataComp](https://github.com/mlfoundations/datacomp),
[LLaVA](https://github.com/haotian-liu/LLaVA/),
[LlamaGen](https://github.com/FoundationVision/LlamaGen/),
and [ViTamin](https://github.com/Beckschen/ViTamin).


## License

This project is licensed under the MIT License; see the [LICENSE](LICENSE) file for details.


## Citation

If you find this project useful, please consider citing:

```bibtex
@article{unitok,
  title={UniTok: A Unified Tokenizer for Visual Generation and Understanding},
  author={Ma, Chuofan and Wu, Junfeng and Jiang, Yi and Yang, Jihan and Yu, Xin and Yuan, Zehuan and Peng, Bingyue and Qi, Xiaojuan},
  journal={},
  year={2025}
}
```

assets/samples.png (4.72 MB)
assets/teaser.png (784 KB)
assets/vis_imgs/v0.jpg (2.71 MB)
assets/vis_imgs/v1.jpg (5.61 MB)
assets/vis_imgs/v2.jpg (51.4 KB)
assets/vis_imgs/v3.jpg (98.6 KB)
assets/vis_imgs/v4.jpg (55.1 KB)
