A study on open-ended multi-image quality comparison: a dataset, a model and a benchmark.
<script type="module" src="https://gradio.s3-us-west-2.amazonaws.com/3.50.0/gradio.js" ></script>Several general-purpose open-source LMMs have integrated Co-Instruct into their training, which has as good visual quality comparison abilities while retaining as good general abilities. Please find thme as follows:
We thank the authors of these projects to include our data into their training. Please try to use these models if you need a strong general-purpose LMM with decent open-ended visual quality comparison abilities.
Quick Note: Please use transformers==4.36
or ``transformers==4.37` to seamlessly run on
import torch
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("q-future/co-instruct",
trust_remote_code=True,
torch_dtype=torch.float16,
attn_implementation="eager",
device_map={"":"cuda:0"})
import requests
from PIL import Image
### Single Image
prompt = "USER: The image: <|image|> Which happens in this image: motion-blur, over-exposure, or under-exposure? ASSISTANT:"
url = "https://raw.githubusercontent.com/Q-Future/Q-Align/main/fig/singapore_flyer.jpg"
image = Image.open(requests.get(url,stream=True).raw)
model.chat(prompt, [image], max_new_tokens=200)
## Motion blur
### Double Image Comparison
prompt_cmp = "USER: The first image: <|image|>\nThe second image: <|image|>Which image has better quality, and why? ASSISTANT:"
url = "https://raw.githubusercontent.com/Q-Future/Q-Align/main/fig/boy_colorful.jpg"
image_2 = Image.open(requests.get(url,stream=True).raw)
model.chat(prompt_cmp, [image, image_2], max_new_tokens=200)
## The second image has better quality. The description indicates that the image has accurate exposure, precise focus, clear details, rich colors, and sufficient lighting. Additionally, the texture details are clear, and the composition is centered. In comparison, the first image has good clarity and rich texture details, but the lighting is slightly weak, which can affect the overall quality of the image. Therefore, the second image is of higher quality due to its accurate exposure, precise focus, clear details, rich colors, sufficient lighting, and centered composition.
We have relesed the training data on HuggingFace datasets on LLaVA format.
Please find on the link: https://huggingface.co/datasets/q-future/Co-Instruct-DB or use as follows:
huggingface-cli download q-future/Co-Instruct-DB --local-dir Co-Instruct-DB --repo-type datasets
tar -xf co-insruct-imageds.tar
The extracted data will look as follows:
-- Co-Instruct-DB/
-- -- coinstruct_562k_llava_format.json
-- -- data/
The data in the JSON contains 562K dicts, each corresponding to a piece of SFT data item.
For MICBench, our team notices that there are some cases with NSFW contents, and we may need to distribute it after making sure it is only used for research purpose. Please email [email protected]
to obtain it.
For training, please refer to the Q-Align codebase, which is a modified version of mPLUG-Owl2 that supports multi-image training. Please use the following script for training:
#!/bin/bash
# Use 8 GPUs to replicate the training
LOAD='MAGAer13/mplug-owl2-llama2-7b'
echo 'Converting data format...'
sed 's/<image>/<|image|>/g' Co-Instruct-DB/coinstruct_562k_llava_format.json > Co-Instruct-DB/coinstruct_562k_mplugowl2_format.json
echo 'Start training!'
DATA_FILE=Co-Instruct-DB/coinstruct_562k_mplugowl2_format.json
deepspeed --master_port 25801 q_align/train/train_mem.py \
--deepspeed ./scripts/zero3.json \
--model_name_or_path $LOAD \
--version v1 \
--data_path $DATA_FILE \
--image_folder Co-Instruct-DB/ \
--image_aspect_ratio pad \
--group_by_modality_length True \
--bf16 True \
--output_dir ./coinstruct_replicated \
--num_train_epochs 1 \
--per_device_train_batch_size 24 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1100 \
--save_total_limit 2 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 2048 \
--gradient_checkpointing True \
--tune_visual_abstractor True \
--freeze_vision_model False \
--dataloader_num_workers 4 \
--lazy_preprocess True \
--report_to wandb
@misc{wu2024openended,
title={Towards Open-ended Visual Quality Comparison},
author={Haoning Wu and Hanwei Zhu and Zicheng Zhang and Erli Zhang and Chaofeng Chen and Liang Liao and Chunyi Li and Annan Wang and Wenxiu Sun and Qiong Yan and Xiaohong Liu and Guangtao Zhai and Shiqi Wang and Weisi Lin},
year={2024},
eprint={2402.16641},
archivePrefix={arXiv},
primaryClass={cs.CV}
}