Vista

"An open-source dataset of 700,000 Vietnamese vision-language samples"

Overview

This dataset contains over 700,000 Vietnamese vision-language samples, created by Gemini Pro. We employed several prompt engineering techniques: few-shot learning, caption-based prompting and image-based prompting.

For the COCO dataset, we generated data using LLaVA-style prompts.
For the ShareGPT4V dataset, we used translation prompts.
Caption-based prompting: uses accurate captions and bounding boxes from the original dataset.
Image-based prompting: uses images to create captions and conversations.

The curation process involved removing any Han, Japanese, and Korean characters. The data was further refined by filtering out samples with high perplexity.
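The character-level cleanup described above could be sketched roughly as follows. This is an illustration only, not the project's actual code: the Unicode ranges, the pattern name `CJK_PATTERN`, and the helpers `contains_cjk` and `strip_cjk` are assumptions. The text does not say whether offending characters were stripped or whole samples were dropped, so the sketch shows both options.

```python
import re

# Assumed Unicode ranges for Han, Japanese kana, and Korean Hangul.
# The ranges actually used by the project may differ.
CJK_PATTERN = re.compile(
    "["
    "\u3400-\u4dbf"  # CJK Unified Ideographs Extension A
    "\u4e00-\u9fff"  # CJK Unified Ideographs (Han)
    "\u3040-\u309f"  # Hiragana
    "\u30a0-\u30ff"  # Katakana
    "\u1100-\u11ff"  # Hangul Jamo
    "\uac00-\ud7af"  # Hangul syllables
    "]"
)


def contains_cjk(text: str) -> bool:
    """Return True if the text has at least one Han, kana, or Hangul character."""
    return CJK_PATTERN.search(text) is not None


def strip_cjk(text: str) -> str:
    """Option 1: delete the offending characters and keep the rest of the text."""
    return CJK_PATTERN.sub("", text)


samples = ["Một con mèo nằm trên ghế", "Một con 猫 nằm trên ghế", "Xin chào 안녕"]

# Option 2: drop every sample that contains such characters.
clean_samples = [s for s in samples if not contains_cjk(s)]
```

Vietnamese diacritics fall outside these ranges, so ordinary Vietnamese text passes through unchanged.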

HuggingFace Dataset

Report: Coming Soon

Dataset Structure

The dataset is structured into 5 subsets:

Subset	Split	Method	Size
Vi-LLAVA conversation	train	caption-based	107,052
	validation		4,550
Vi-LLAVA complex reasoning	train	caption-based	112,650
	validation		4,771
Vi-LLAVA detail description	train	caption-based	111,153
	validation		4,714
Vi-ShareGPT4V		translation	96,913
Vi-WIT		caption-based, image-based	264,831
Total			706,634
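As a quick sanity check, the sizes in the table can be tallied to confirm the stated total. The dictionary below simply transcribes the table; the name `subset_sizes` is illustrative, and `None` marks rows where the table gives no split.

```python
from collections import defaultdict

# (subset, split) -> number of samples, copied from the table above.
subset_sizes = {
    ("Vi-LLAVA conversation", "train"): 107_052,
    ("Vi-LLAVA conversation", "validation"): 4_550,
    ("Vi-LLAVA complex reasoning", "train"): 112_650,
    ("Vi-LLAVA complex reasoning", "validation"): 4_771,
    ("Vi-LLAVA detail description", "train"): 111_153,
    ("Vi-LLAVA detail description", "validation"): 4_714,
    ("Vi-ShareGPT4V", None): 96_913,
    ("Vi-WIT", None): 264_831,
}

# Sum across splits to get one size per subset.
per_subset = defaultdict(int)
for (subset, _split), size in subset_sizes.items():
    per_subset[subset] += size

total = sum(subset_sizes.values())
print(f"{len(per_subset)} subsets, {total:,} samples")
```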

Data processing

Vi-LLAVA

Follow the instructions in the Vi-LLAVA/ folder.

Translate ShareGPT4V

bash scripts/translate_shareGPT4V.sh

WIT

Follow the instructions in the WIT/ folder.

Perplexity filtering

import os

from datasets import load_dataset

from perplexity.filtering import FilteringPerplexity

# Specify your own dataset
dataset = load_dataset("Specify your dataset", split="train")

# Set up perplexity filtering
perplexity_filtering = FilteringPerplexity(
    sentencepiece_model_path=os.path.join("path to sentencepiece model"),
    kenlm_model_path=os.path.join("path to kenlm model"),
)

# Compute the perplexity of each sample
data_contains_perplex = perplexity_filtering.compute(dataset)

# Filter out samples with high perplexity
threshold = 100  # Set your own threshold if needed
data_filtered = perplexity_filtering.filter(data_contains_perplex, threshold=threshold)
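To make the idea behind the threshold concrete, here is a self-contained toy sketch of perplexity-based filtering. It is not the repository's `FilteringPerplexity` implementation, which takes a SentencePiece model and a KenLM model and whose internals are not shown here. The sketch substitutes a whitespace tokenizer and an add-one-smoothed unigram model, and it assumes that samples with perplexity at or below the threshold are kept; all function names are hypothetical.

```python
import math
from collections import Counter


def train_unigram(corpus):
    """Count whitespace tokens in a reference corpus (toy stand-in for a KenLM model)."""
    counts = Counter(token for line in corpus for token in line.split())
    return counts, sum(counts.values())


def perplexity(text, counts, total):
    """Unigram perplexity with add-one smoothing: exp of the mean negative log-probability."""
    tokens = text.split()
    if not tokens:
        return float("inf")
    vocab = len(counts) + 1  # one extra slot shared by all unseen tokens
    log_prob = sum(math.log((counts[t] + 1) / (total + vocab)) for t in tokens)
    return math.exp(-log_prob / len(tokens))


def filter_by_perplexity(samples, counts, total, threshold):
    """Keep samples whose perplexity is at or below the threshold (assumed rule)."""
    return [s for s in samples if perplexity(s, counts, total) <= threshold]


reference = ["con mèo nằm trên ghế", "con chó chạy trên sân"]
counts, total = train_unigram(reference)

samples = ["con mèo chạy trên sân", "xqz qwv zzk"]
kept = filter_by_perplexity(samples, counts, total, threshold=10)
```

Text that resembles the reference corpus scores a low perplexity and is kept, while the gibberish sample consists entirely of unseen tokens and is filtered out. A suitable threshold depends on the language model and should be tuned on your own data.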

Personal and Sensitive Information

The dataset does not contain any personal or sensitive information.

Bias, Risks, and Limitations

The dataset may contain biases due to the sources from which the data was collected.
Users should be aware of these potential biases when using the dataset.

Authors

Licensing Information

The dataset is released under the MIT license.

Additional Information

Organization: Vietnamese-VLM

Citation Information

BibTeX:

@article{ViVLM_Vista_2024,
  title={Vista},
  author={Tran, Oanh Ngoc and Bui, Hop Van and Ha, Hoang Huy and Phan, Phuc Van},
  year={2024},
  month={May},
  url={https://huggingface.co/datasets/Vi-VLM/Vista}
}
