"700.000 Vietnamese vision-language samples open-source dataset"
This dataset contains over 700,000 Vietnamese vision-language samples generated with Gemini Pro. We employed several prompt engineering techniques: few-shot learning, caption-based prompting, and image-based prompting.
- For the COCO dataset, we generated data using Llava-style prompts.
- For the ShareGPT4V dataset, we used translation prompts.
- Caption-based prompting: involves using accurate captions and bounding boxes from the original dataset.
- Image-based prompting: uses images to create captions and conversations.
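As an illustration of caption-based prompting, the sketch below assembles a text-only prompt from an image's captions and bounding boxes, optionally preceded by few-shot examples. The instruction wording, the function name `build_caption_prompt`, and the box format (label plus normalized `x1, y1, x2, y2`) are assumptions made for this example; they are not the prompts actually used to build the dataset.

```python
from typing import Sequence, Tuple

# (label, (x1, y1, x2, y2)) with coordinates normalized to [0, 1] (assumed format).
Box = Tuple[str, Tuple[float, float, float, float]]

# Hypothetical instruction text, for illustration only.
INSTRUCTION = (
    "You are given captions and object bounding boxes describing one image. "
    "Write a conversation in Vietnamese between a user and an assistant about it."
)


def build_caption_prompt(
    captions: Sequence[str],
    boxes: Sequence[Box],
    few_shot: Sequence[Tuple[str, str]] = (),
) -> str:
    """Build a text prompt from captions and boxes; the image itself is not sent."""
    parts = [INSTRUCTION, ""]
    # Few-shot examples are (context, response) pairs shown before the real input.
    for context, response in few_shot:
        parts += ["Example context:", context, "Example response:", response, ""]
    parts.append("Captions:")
    parts += [f"- {caption}" for caption in captions]
    parts.append("Objects:")
    parts += [
        f"- {label}: [{x1:.2f}, {y1:.2f}, {x2:.2f}, {y2:.2f}]"
        for label, (x1, y1, x2, y2) in boxes
    ]
    return "\n".join(parts)


prompt = build_caption_prompt(
    captions=["A cat sleeps on a sofa."],
    boxes=[("cat", (0.1, 0.2, 0.5, 0.8))],
)
print(prompt)
```

Image-based prompting would differ mainly in that the image is passed to the model alongside the instruction instead of this textual context.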
The curation process involved removing any Han, Japanese, and Korean characters. The data was further refined by filtering out samples with high perplexity.
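A minimal sketch of the character-based curation step, assuming that samples containing Han, Japanese, or Korean characters are dropped entirely. The choice of Unicode ranges and the helper names `contains_cjk` and `drop_cjk_samples` are assumptions for this example, not the project's actual filter.

```python
import re
from typing import Iterable, List

# Unicode blocks treated as Han, Japanese, or Korean in this sketch (an assumption).
_CJK_PATTERN = re.compile(
    "["
    "\u3400-\u4dbf"  # CJK Unified Ideographs Extension A
    "\u4e00-\u9fff"  # CJK Unified Ideographs (Han)
    "\u3040-\u309f"  # Hiragana
    "\u30a0-\u30ff"  # Katakana
    "\u1100-\u11ff"  # Hangul Jamo
    "\uac00-\ud7af"  # Hangul Syllables
    "]"
)


def contains_cjk(text: str) -> bool:
    """Return True if the text has at least one character in the ranges above."""
    return _CJK_PATTERN.search(text) is not None


def drop_cjk_samples(texts: Iterable[str]) -> List[str]:
    """Keep only the texts without any Han, Japanese, or Korean characters."""
    return [text for text in texts if not contains_cjk(text)]


samples = ["Một con mèo đang ngủ trên ghế sofa.", "Xin chào 你好"]
print(drop_cjk_samples(samples))
```

Vietnamese diacritics live in the Latin blocks, so ordinary Vietnamese text is not affected by these ranges.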
Report: Coming Soon
The dataset is structured into 5 subsets:
Subset | Split | Method | Size |
---|---|---|---|
Vi-LLAVA conversation | train | caption-based | 107,052 |
Vi-LLAVA conversation | validation | caption-based | 4,550 |
Vi-LLAVA complex reasoning | train | caption-based | 112,650 |
Vi-LLAVA complex reasoning | validation | caption-based | 4,771 |
Vi-LLAVA detail description | train | caption-based | 111,153 |
Vi-LLAVA detail description | validation | caption-based | 4,714 |
Vi-ShareGPT4V | | translation | 96,913 |
Vi-WIT | | caption-based, image-based | 264,831 |
Total | | | 706,634 |
Follow the instructions in the Vi-LLAVA/ folder.
bash scripts/translate_shareGPT4V.sh
Follow the instructions in the WIT/ folder.
import os

from datasets import load_dataset
from perplexity.filtering import FilteringPerplexity

# Specify your own dataset
dataset = load_dataset("Specify your dataset", split="train")
# Set up perplexity filtering
perplexity_filtering = FilteringPerplexity(
    sentencepiece_model_path=os.path.join("path to sentencepiece model"),
    kenlm_model_path=os.path.join("path to kenlm model"),
)
# Compute perplexity for every sample
data_contains_perplex = perplexity_filtering.compute(dataset)
# Filter out samples by perplexity
threshold = 100  # Set your own threshold if needed
data_filtered = perplexity_filtering.filter(data_contains_perplex, threshold=threshold)
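For intuition about what the threshold means, the sketch below computes perplexity from per-token log-probabilities and keeps the samples at or below a threshold. It is a generic illustration, not the implementation behind `FilteringPerplexity`: the `log_probs` field, the use of natural logarithms, and the inclusive comparison are all assumptions.

```python
import math
from typing import Dict, List


def perplexity(token_log_probs: List[float]) -> float:
    """Perplexity as exp of the negative mean natural-log token probability."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def keep_below(samples: List[Dict], threshold: float) -> List[Dict]:
    """Keep samples whose perplexity is at or below the threshold."""
    return [s for s in samples if perplexity(s["log_probs"]) <= threshold]


# Toy samples with hypothetical per-token log-probabilities.
samples = [
    {"text": "fluent sentence", "log_probs": [math.log(0.5)] * 4},
    {"text": "noisy sentence", "log_probs": [math.log(0.001)] * 4},
]
kept = keep_below(samples, threshold=100)
print([s["text"] for s in kept])
```

A lower threshold keeps only text the language model finds very predictable, while a higher one is more permissive, so the value is worth tuning on a small inspected sample.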
- The dataset does not contain any personal or sensitive information.
- The dataset may contain biases due to the sources from which the data was collected.
- Users should be aware of these potential biases when using the dataset.
The dataset is released under the MIT license.
- Organization: Vietnamese-VLM
BibTeX:
@article{ViVLM_Vista_2024,
  title={Vista},
  author={Tran, Oanh Ngoc and Bui, Hop Van and Ha, Hoang Huy and Phan, Phuc Van},
  year={2024},
  month={May},
  url={https://huggingface.co/datasets/Vi-VLM/Vista}
}