31-million-Southeast-Asian-language-news-text-dataset

Description

This dataset contains multilingual news text from Southeast Asia in four languages: Indonesian, Malay, Thai, and Vietnamese. It holds more than 31 million records in JSONL format, with one record per line for efficient reading and processing. The articles come from a wide range of sources and news topics, reflecting social developments, cultural trends, and economic trends in Southeast Asia. The dataset is intended to help large language models improve their multilingual capabilities and cultural knowledge, to support industry applications in Southeast Asia, and to support cross-lingual research.

For more details, please refer to the link: https://www.nexdata.ai/datasets/llm/1625?source=Github

Specifications

Languages

Indonesian, Malay, Thai, Vietnamese

Data volume

14,447,771 Indonesian; 1,239,420 Malay; 6,467,564 Thai; 8,942,813 Vietnamese; over 31 million records in total
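The per-language counts can be checked against the stated total with a few lines of Python. This is only an arithmetic sketch: the counts are copied from this README, and the names `COUNTS`, `total`, and `shares` are illustrative, not part of the dataset.

```python
# Record counts per language, as stated in this README.
COUNTS = {
    "Indonesian": 14_447_771,
    "Malay": 1_239_420,
    "Thai": 6_467_564,
    "Vietnamese": 8_942_813,
}

# Sum of all per-language counts.
total = sum(COUNTS.values())

# Fraction of the whole dataset contributed by each language.
shares = {lang: n / total for lang, n in COUNTS.items()}

if __name__ == "__main__":
    print(f"total records: {total:,}")  # total records: 31,097,568
    for lang, share in shares.items():
        print(f"{lang}: {share:.1%}")
```

The sum is 31,097,568, which matches the "over 31 million" figure. Indonesian makes up a little under half of the records and Malay is by far the smallest portion.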

Fields

URL, title, published_time, article_content, category

Format

JSONL
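A minimal sketch of reading one of these files line by line, assuming each line is a single JSON object encoded as UTF-8 and carrying the field names listed in this README. The helper names `iter_records` and `missing_fields` are hypothetical, and the README does not confirm that every record contains every field.

```python
import json
from typing import Dict, Iterator, List

# Field names as listed in this README. That every record carries
# all of them is an assumption, hence the missing_fields() check.
EXPECTED_FIELDS = ("URL", "title", "published_time", "article_content", "category")


def iter_records(path: str) -> Iterator[Dict]:
    """Yield one dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                # Report the file and line so a bad record is easy to locate.
                raise ValueError(f"{path}:{line_no}: invalid JSON") from exc


def missing_fields(record: Dict) -> List[str]:
    """Return the expected field names that are absent from a record."""
    return [name for name in EXPECTED_FIELDS if name not in record]
```

Because `iter_records` is a generator, a file with millions of records is never loaded into memory at once. A typical loop would be `for rec in iter_records("Indonesia.jsonl"): ...`, where the path is whichever local copy of the data you have.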

Licensing Information

Commercial License

Files

Indonesia.jsonl, Malaysia.jsonl, Thailand.jsonl, Vietnam.jsonl, README.md
