
MLM Filter

Official implementation of our paper "Finetuned Multimodal Language Models are High-Quality Image-Text Data Filters".

Release

  • [2/25] 🔥 We released Finetuned Multimodal Language Models are High-Quality Image-Text Data Filters. We propose to adopt fine-tuned Multimodal Language Models as effective and efficient data filters for selecting high-quality image-text pairs from large-scale web-crawled image-text data. Check out the paper.

Install

We strongly suggest using python==3.10, i.e.,

conda create -n mlm_filter python=3.10
conda activate mlm_filter

Then install the dependencies for quality score generation:

bash setup.sh

Quality Score Generation

Inference on Single Image

python mlm_filter_scoring_single_image.py --image-path /path/to/image --caption "text caption"

Parameters to note:

  • --metric: quality scoring metric for generation; choose from image_text_matching, object_detail_fulfillment, caption_text_quality, semantic_understanding, all
  • --image-path: path to the image file or an image URL
  • --caption: text caption
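
For example, to score one pair on the image-text matching metric (the image path and caption below are placeholders):

python mlm_filter_scoring_single_image.py \
    --metric image_text_matching \
    --image-path ./assets/example.jpg \
    --caption "a dog running on the beach"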

Inference on Webdataset Large-Scale Data

bash run_inference.sh ${GPU_START_ID} ${Metric} ${Model_Path} ${Data_Path} ${Tars_Per_GPU} ${Num_GPU}

Parameters to note:

  • GPU_START_ID: for large-scale score generation across multiple machines, the index of the current machine
  • Metric: quality scoring metric for generation; choose from image_text_matching, object_detail_fulfillment, caption_text_quality, semantic_understanding, all
  • Model_Path: path to the MLM filter model checkpoint
  • Data_Path: path to the webdataset image-text tars
  • Tars_Per_GPU: the number of webdataset image-text tars for a single GPU to run inference on
  • Num_GPU: the number of GPUs per machine, e.g. 1, 8, 16
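
For example, to score all metrics on a single machine with 8 GPUs and 4 tars per GPU (both paths below are placeholders):

bash run_inference.sh 0 all /path/to/mlm-filter-checkpoint /path/to/webdataset-tars 4 8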

Fine-Tuning MLM as Data Filter

  1. Prepare data

Please download the 50k multimodal instruction dataset and save it to ./data/mlm_filter_instruct_50k_gpt4v_cc12m_4k.json.
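
For example, assuming the downloaded file keeps the name above:

mkdir -p ./data
mv mlm_filter_instruct_50k_gpt4v_cc12m_4k.json ./data/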

Please download the images from the constituent datasets: COCO (train2017), GQA, OCR-VQA, TextVQA, Visual Genome (VG_100K and VG_100K_2), and CC12M.

After downloading all of them, organize the data as follows in ./data/images:

├── coco
│   └── train2017
├── gqa
│   └── images
├── ocr_vqa
│   └── images
├── textvqa
│   └── train_images
├── vg
│   ├── VG_100K
│   └── VG_100K_2
└── cc12m

As several images from the OCR-VQA data URLs are no longer available, you can also run check_missed_image.py to filter unavailable images out of the instruction dataset.
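
A minimal sketch of such a run; the flag names below are assumptions, so check check_missed_image.py for its actual arguments:

python check_missed_image.py \
    --data-path ./data/mlm_filter_instruct_50k_gpt4v_cc12m_4k.json \
    --image-folder ./data/images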

  2. Start training!

You may download LLaVA's pretrained projectors from its Model Zoo.

Visual instruction tuning takes around 4 hours for LLaVA-v1.5-13B on 8x A100 (80G) with the sampled 50k instruction dataset.

Training script with DeepSpeed ZeRO-3: LLaVA_ft/scripts/v1_5/finetune.sh.
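
To launch fine-tuning once the data and pretrained projector are in place:

bash LLaVA_ft/scripts/v1_5/finetune.sh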

We open-source our fine-tuned MLM Data Filters at MLM-Filter-GPT4V and MLM-Filter-GPT4.

License

Usage and License Notices: The data and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of LLaMA, Vicuna, and GPT-4. The dataset is released under CC BY-NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.

Contacts

For any question or issue, please feel free to contact [email protected] or submit a GitHub issue.

Citation

Please cite our paper if you find this repository interesting or helpful in your research:

@article{mlm-filter,
    title={Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters},
    author={Wang, Weizhi and Mrini, Khalil and Yang, Linjie and Kumar, Sateesh and Tian, Yu and Yan, Xifeng and Wang, Heng},
    journal={arXiv preprint arXiv:2403.02677},
    year={2024},
}

Credits

MLM-Filter is developed based on:

  • Vicuna: the foundation language model for LLaVA
  • LLaVA: the codebase for fine-tuning LLaVA as image-text data filters
  • DataComp: the codebase for data filtering and CLIP pre-training