Using Large Multimodal Models for Cross-Modality Compression
Why use LMMs for compression? Large Multimodal Models (LMMs) support conversion between multiple modalities, and text consumes far less space than images. By cascading Image-to-Text (I2T) and Text-to-Image (T2I) models, an image can be compressed into a short description and then reconstructed from that semantic information. This Cross-Modality Compression (CMC) paradigm operates at the semantic level rather than the pixel level where traditional codecs work, and it readily reaches 1,000x compression, or even 10,000x in extreme cases.
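A minimal sketch of the cascade just described, with the I2T and T2I models passed in as plain callables (they are stand-ins for any model pair evaluated in CMC-Bench, not part of the benchmark code):

```python
# Minimal sketch of the CMC cascade: compress an image to text, then reconstruct it.
# The i2t / t2i callables are hypothetical stand-ins for any I2T / T2I model pair.
from typing import Callable
from PIL import Image

def cmc_compress(image_path: str, i2t: Callable[[Image.Image], str]) -> str:
    """I2T stage: the caption is the compressed representation (a few hundred bytes)."""
    return i2t(Image.open(image_path))

def cmc_decompress(text: str, t2i: Callable[[str], Image.Image]) -> Image.Image:
    """T2I stage: regenerate an image from its semantic description."""
    return t2i(text)

# Usage: plug in any I2T/T2I pair from the benchmark, e.g.
#   text  = cmc_compress("GT/AIGI_DALLE3_000.png", i2t=my_captioner)
#   image = cmc_decompress(text, t2i=my_generator)
```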
However, at such low bitrates, CMC raises two issues that cannot be overlooked. CMC-Bench is designed to evaluate: (1) Consistency between the distorted and reference images, and (2) Perception quality of the distorted image alone. CMC-Bench thus identifies where LMMs can be further optimized for the compression task, thereby promoting the evolution of visual signal codec protocols.
- [2024/6/13] 🔥 The GitHub repo for CMC-Bench is online. Follow the instructions to join the I2T or T2I model arena!
- [2024/6/11] 🔥 We release the CMC-Bench data and meta information on Hugging Face.
- [To Do] [ ] Update the subjective labels for the quality assessment task.
- [To Do] [ ] Update all intermediate image and text data for compression.
To provide a comprehensive and high-quality resource for various applications on the Internet, we carefully curated 1,000 images without compression distortion as the ground truth of CMC-Bench, including 400 NSIs, 300 SCIs, and 300 AIGIs. The data selection and annotation details are given in our paper.
We employ 6 I2T and 12 T2I models across four working modes: (1) Text mode, with only the I2T and T2I models; (2) Pixel mode, with a few pixels to guide the T2I model; (3) Image mode, with a compressed image to guide the T2I model but no text from an I2T model; (4) Full mode, with all available side information at the highest cost. An I2T+T2I group is thus evaluated in 4×2=8 dimensions. A rough summary of each mode is sketched below.
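The table below is our reading of the four mode descriptions above, not an official configuration file shipped with the benchmark:

```python
# Rough summary of what guides the T2I decoder in each working mode,
# as read from the description above (not part of the benchmark code).
WORKING_MODES = {
    "text":  ["caption"],                        # I2T text only
    "pixel": ["caption", "guiding pixels"],      # text plus a few pixels
    "image": ["compressed image"],               # image guidance, no I2T text
    "full":  ["caption", "compressed image"],    # all side information, highest cost
}
```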
Radar maps are shown as a quick glance. Among I2T models, GPT-4o shows the best performance. Among T2I models, DiffBIR ranks best in the Image and Full modes but does not support the other two, RealVis is the most full-fledged model for Consistency, and PG25 delivers the most satisfying Perception.
The leaderboards for I2T and T2I models are shown below. The I2T models are combined with RealVis as the T2I model, while the T2I models are combined with GPT-4o as the I2T model. For details on different content types, please check our paper.
I2T | Full-FR↑ | Full-NR↑ | Pixel-FR↑ | Pixel-NR↑ | Text-FR↑ | Text-NR↑ | Overall↑ |
---|---|---|---|---|---|---|---|
GPT-4o | 2.5646 | 2.0415 | 1.9878 | 2.7815 | 1.7805 | 3.4802 | 2.4394 |
ShareGPT | 2.5597 | 2.065 | 1.9872 | 2.7618 | 1.7940 | 3.4221 | 2.4316 |
Qwen | 2.5530 | 2.0031 | 1.9917 | 2.6748 | 1.7832 | 3.3679 | 2.3956 |
MPlugOwl-2 | 2.5556 | 2.0003 | 1.9902 | 2.6413 | 1.7891 | 3.3299 | 2.3844 |
LLAVA | 2.5484 | 1.9747 | 1.9815 | 2.6373 | 1.7766 | 3.3695 | 2.3813 |
InstructBLIP | 2.5489 | 1.9153 | 1.9858 | 2.5593 | 1.7796 | 3.2888 | 2.3463 |
T2I | Full-FR↑ | Full-NR↑ | Image-FR↑ | Image-NR↑ | Pixel-FR↑ | Pixel-NR↑ | Text-FR↑ | Text-NR↑ | Overall↑ |
---|---|---|---|---|---|---|---|---|---|
DiffBIR | 2.9194 | 2.5803 | 2.8630 | 1.7342 | - | - | - | - | 2.6466 |
PASD | 2.7270 | 2.2256 | 2.6378 | 2.0101 | - | - | - | - | 2.4942 |
PG25 | 2.0716 | 2.9194 | 1.9612 | 2.9935 | 1.7418 | 3.6260 | 1.7382 | 3.7299 | 2.3579 |
RealVis | 2.5646 | 2.0415 | 2.5033 | 1.8098 | 1.9878 | 2.7815 | 1.7805 | 3.4802 | 2.3155 |
PG20 | 2.3603 | 2.3695 | 2.2476 | 2.2071 | 1.8883 | 2.6875 | 1.7180 | 3.7438 | 2.2864 |
SSD-1B | 2.4939 | 2.0803 | 2.4147 | 1.9308 | 1.9611 | 2.4828 | 1.7753 | 3.4796 | 2.2720 |
StableSR | 2.6232 | 1.4368 | 2.6088 | 1.4293 | - | - | - | - | 2.2217 |
Dreamlike | 2.5071 | 1.7892 | 2.4226 | 1.5131 | 1.9545 | 2.3038 | 1.7090 | 3.1588 | 2.1626 |
Animate | 2.2985 | 1.8469 | 2.2522 | 1.6148 | 1.8246 | 2.4324 | 1.6983 | 3.4979 | 2.1283 |
SDXL | 2.4184 | 1.6837 | 2.3482 | 1.5586 | 1.9103 | 1.9724 | 1.7471 | 3.4225 | 2.1238 |
SD15 | 2.4895 | 1.7733 | 2.4163 | 1.5574 | 1.9422 | 2.1444 | 1.6832 | 2.5318 | 2.0891 |
InstructPix | 2.1519 | 1.7191 | 2.3457 | 1.2219 | - | - | - | - | 1.9894 |
CMC paradigms demonstrate an advantage on most indicators. The lead in Perception is particularly notable, surpassing traditional codecs at extremely low bitrates. The advantage in Consistency is relatively smaller, amounting to a bitrate reduction of around 30% compared to traditional methods at 0.02 bpp. The DiffBIR decoder generally shows better performance, while RealVis covers a wider range of bitrates.
In summary, we believe that CMC holds a clear advantage over traditional coding. However, to bring LMMs into the next generation of visual signal codecs, further optimization by LMM developers is still required.
First download the images from our Hugging Face page, including:
- Ground Truth: decompress all files into the `GT` folder in this project.
- Pixel Reference: decompress all files into the `Ref/pixel` folder in this project.
- Compressed Image Reference: decompress all files into the `Ref/image` folder in this project.
Then download the Consistency and Perception evaluation model weights:
- Consistency: put it into the `Weight` folder in this project.
- Perception: put it into the `Weight` folder in this project.
After the steps above, please ensure your folder looks like:
CMC-Bench
│
├── GT
│ ├── AIGI_DALLE3_000.png, AIGI_DALLE3_001.png ...
│
├── Ref
│ ├── pixel
│ │ └── AIGI_DALLE3_000.png, AIGI_DALLE3_001.png ...
│ └── image
│ └── AIGI_DALLE3_000.png, AIGI_DALLE3_001.png ...
│
└── Weight
└── topiq-fr.pth, topiq-nr.pth
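A small check you can run from the repository root to confirm this layout is in place (the paths are taken directly from the tree above):

```python
# Quick sanity check that the expected folders and weights are in place.
# Run from the CMC-Bench repository root after downloading the data.
from pathlib import Path

expected = [
    Path("GT"),
    Path("Ref/pixel"),
    Path("Ref/image"),
    Path("Weight/topiq-fr.pth"),
    Path("Weight/topiq-nr.pth"),
]
for p in expected:
    status = "ok" if p.exists() else "MISSING"
    print(f"{status:7s} {p}")

# Each reference folder should mirror the 1,000 ground-truth images.
print("GT images:", len(list(Path("GT").glob("*.png"))))
```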
Use the I2T model to transform the ground-truth images into text.
python script-i2t.py --model_name [your_i2t_name] --model_dir [your_i2t_dictionary]
A CSV file including all text input will be generated in your `Text` folder, named after `your_i2t_name`. The default script uses Qwen for I2T. If you only want to test a T2I model, please skip this step and directly use `Text/gpt4v.csv`.
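If you want to inspect the intermediate text before running T2I, the CSV can be loaded directly; the exact column layout is defined by `script-i2t.py`, so check the schema rather than assuming it:

```python
# Peek at the I2T output before decompression. The column layout is defined
# by script-i2t.py, so print it first instead of assuming particular names.
import pandas as pd

captions = pd.read_csv("Text/gpt4v.csv")   # or Text/<your_i2t_name>.csv
print(captions.columns.tolist())           # verify the actual schema
print(captions.head())
```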
Use the T2I model to reconstruct the text back into images.
python script-t2i.py --mode full --input_path [csv_in_step_1] --model_name [your_t2i_name] --model_dir [your_t2i_dictionary]
python script-t2i.py --mode image --input_path [csv_in_step_1] --model_name [your_t2i_name] --model_dir [your_t2i_dictionary]
python script-t2i.py --mode pixel --input_path [csv_in_step_1] --model_name [your_t2i_name] --model_dir [your_t2i_dictionary]
python script-t2i.py --mode text --input_path [csv_in_step_1] --model_name [your_t2i_name] --model_dir [your_t2i_dictionary]
All decompressed images will be generated in your `Result` folder, named after `your_t2i_name`, with four subfolders corresponding to the four working modes. The default script uses RealVis for T2I. If you only want to test a T2I model, please leave `--input_path` empty; if you only want to test an I2T model, please leave `--model_name` and `--model_dir` empty. The four invocations can also be batched, as sketched below.
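A small wrapper that loops over the four modes with the same arguments as the commands above (the placeholder values are yours to fill in, exactly as on the command line):

```python
# Convenience wrapper around the four T2I invocations above.
# Fill in the placeholders exactly as you would on the command line.
import subprocess

csv_path = "Text/gpt4v.csv"        # [csv_in_step_1]
t2i_name = "your_t2i_name"         # placeholder, as in the commands above
t2i_dir  = "your_t2i_dictionary"   # placeholder, as in the commands above

for mode in ["full", "image", "pixel", "text"]:
    subprocess.run([
        "python", "script-t2i.py",
        "--mode", mode,
        "--input_path", csv_path,
        "--model_name", t2i_name,
        "--model_dir", t2i_dir,
    ], check=True)
```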
Use the fine-tuned quality model to measure the performance. Check the model name in your `Result` folder; all modes in it will be evaluated. The script can still run with incomplete modes, but we recommend evaluating all four modes at once.
python script-evaluate.py --target [t2i_name_in_step_2]
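The Consistency and Perception weights follow the TOPIQ naming (`topiq-fr.pth`, `topiq-nr.pth`). For intuition only, here is a minimal sketch of how such FR/NR scores can be computed with the `pyiqa` library using its stock TOPIQ metrics; `script-evaluate.py` uses the fine-tuned weights in `Weight/`, so the numbers will not match the leaderboard exactly:

```python
# Illustrative only: score one reconstructed image with stock TOPIQ metrics
# from pyiqa. script-evaluate.py uses the fine-tuned weights in Weight/ instead.
import pyiqa
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
fr_metric = pyiqa.create_metric("topiq_fr", device=device)  # Consistency (full-reference)
nr_metric = pyiqa.create_metric("topiq_nr", device=device)  # Perception (no-reference)

dist = "Result/your_t2i_name/full/AIGI_DALLE3_000.png"  # reconstructed image (path assumed)
ref = "GT/AIGI_DALLE3_000.png"                           # ground truth

print("Consistency:", fr_metric(dist, ref).item())
print("Perception :", nr_metric(dist).item())
```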
After finishing validation, you can submit the results via e-mail to get your LMM's results on CMC-Bench! (Note that a valid submission should support at least two of the four modes.)
Please contact any of the first authors of this paper for queries.
- Chunyi Li, [email protected], @lcysyzxdxc
If you find our work interesting, please feel free to cite our paper:
@misc{li2024cmcbench,
title={CMC-Bench: Towards a New Paradigm of Visual Signal Compression},
author={Chunyi Li and Xiele Wu and Haoning Wu and Donghui Feng and Zicheng Zhang and Guo Lu and Xiongkuo Min and Xiaohong Liu and Guangtao Zhai and Weisi Lin},
year={2024},
eprint={2406.09356},
archivePrefix={arXiv}
}