DALLE-Datasets

Crawling an image-caption pair dataset from Wikimedia Commons for training DALL·E

📚 Data

Download data

Download link: wikimedia_commons.csv
Data count: 30,246,704
File size: 7.9GB

File summary

id: unique identifier
title: file name of the image. The image's detail page is https://commons.wikimedia.org/wiki/File:$title
mime: MIME type, one of image/png, image/jpeg, image/svg+xml, image/gif, image/tiff, image/x-xcf, or image/webp
url: URL the image can be downloaded from
caption: caption of the image

Head of file

head -n 5 wikimedia_commons.csv

id,title,mime,url,caption
5637467,Ph_locator_cotabato_carmen.png,image/png,https://upload.wikimedia.org/wikipedia/commons/a/a8/Ph_locator_cotabato_carmen.png,Map of the Cotabato showing the location of Carmen
5637468,Ph_locator_tarlac_san_clemente.png,image/png,https://upload.wikimedia.org/wikipedia/commons/2/27/Ph_locator_tarlac_san_clemente.png,Map of Tarlac showing the location of San Clemente
5637470,Ph_locator_sultan_kudarat_isulan.png,image/png,https://upload.wikimedia.org/wikipedia/commons/d/d7/Ph_locator_sultan_kudarat_isulan.png,Map of the Sultan Kudarat showing the location of Isulan
5637471,Ph_locator_lanao_del_sur_balabagan.png,image/png,https://upload.wikimedia.org/wikipedia/commons/d/d4/Ph_locator_lanao_del_sur_balabagan.png,Map of the Lanao del Sur showing the location of Balabagan
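The columns above can be read with Python's standard csv module. The sketch below is only an illustration: it parses two rows copied from the head of the file, and the helper names iter_pairs and detail_page are hypothetical. It assumes the file is UTF-8 and uses standard CSV quoting for captions that contain commas, which the source does not state.

```python
import csv
import io
from urllib.parse import quote

# Two rows copied from the head of wikimedia_commons.csv shown above.
SAMPLE = (
    "id,title,mime,url,caption\n"
    "5637467,Ph_locator_cotabato_carmen.png,image/png,"
    "https://upload.wikimedia.org/wikipedia/commons/a/a8/Ph_locator_cotabato_carmen.png,"
    "Map of the Cotabato showing the location of Carmen\n"
    "5637468,Ph_locator_tarlac_san_clemente.png,image/png,"
    "https://upload.wikimedia.org/wikipedia/commons/2/27/Ph_locator_tarlac_san_clemente.png,"
    "Map of Tarlac showing the location of San Clemente\n"
)


def iter_pairs(lines):
    """Yield (url, caption) tuples from an iterable of CSV lines."""
    for row in csv.DictReader(lines):
        yield row["url"], row["caption"]


def detail_page(title):
    """Build the Commons detail page URL described in the file summary."""
    return "https://commons.wikimedia.org/wiki/File:" + quote(title)


pairs = list(iter_pairs(io.StringIO(SAMPLE)))
print(pairs[0][1])
print(detail_page("Ph_locator_cotabato_carmen.png"))

# For the real 7.9GB file, stream it row by row instead of loading it whole:
# with open("wikimedia_commons.csv", newline="", encoding="utf-8") as f:
#     for url, caption in iter_pairs(f):
#         ...
```

Iterating over the open file keeps memory use flat, which matters at 30 million rows.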

⚗️ Steps for crawling

Use MySQL to store the crawled data
Create a config.ini file based on config.ini.example with your MySQL connection settings
Install the dependencies listed in requirements.txt

pip install -r requirements.txt
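As a rough illustration of the config.ini step, the sketch below reads MySQL settings with Python's configparser. The section name [mysql] and every key in it are assumptions made for this example; the real names are whatever config.ini.example defines, so check that file first.

```python
import configparser

# Hypothetical config.ini content. The real section and key names
# come from config.ini.example and may differ from these.
EXAMPLE_CONFIG = """
[mysql]
host = localhost
port = 3306
user = crawler
password = secret
database = dalle
"""


def load_mysql_settings(text):
    """Parse INI text and return the assumed [mysql] section as a dict."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["mysql"]
    return {
        "host": section["host"],
        "port": section.getint("port"),  # convert the port to an int
        "user": section["user"],
        "password": section["password"],
        "database": section["database"],
    }


settings = load_mysql_settings(EXAMPLE_CONFIG)
print(settings["host"], settings["port"])
```

To read an actual file, replace read_string with parser.read("config.ini").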

Create table

python dalle_datasets/create_table.py

Check the table structure

DESC $table;

+---------+---------------+------+-----+---------+----------------+
| Field   | Type          | Null | Key | Default | Extra          |
+---------+---------------+------+-----+---------+----------------+
| id      | int(11)       | NO   | PRI | NULL    | auto_increment |
| title   | varchar(1000) | NO   |     | NULL    |                |
| image   | mediumblob    | YES  |     | NULL    |                |
| mime    | varchar(50)   | YES  |     | NULL    |                |
| url     | varchar(1000) | YES  |     | NULL    |                |
| caption | varchar(2000) | YES  |     | NULL    |                |
+---------+---------------+------+-----+---------+----------------+

Crawl title

python dalle_datasets/crawl_title.py

Check the number of rows in the table

SELECT count(*) FROM $table;

+----------+
| count(*) |
+----------+
| 63024034 |
+----------+

Crawl image url and caption

python dalle_datasets/crawl_caption.py

This script takes too long to run on a single machine, so I split the task across 10 GCP instances.

For example, if you number the instances from 0 to 9 and call that number i, you can run the script on each instance as shown below.

# start = 7879 * i
# end = 7879 * (i+1)
python dalle_datasets/crawl_caption.py -s $start -e $end
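The start and end values for all ten instances can be generated in one go. This sketch only applies the formula from the comments above; the helper name instance_range is hypothetical, and the source does not say what unit -s and -e count in or whether the end value is inclusive.

```python
CHUNK = 7879      # per-instance step, taken from the comments above
INSTANCES = 10    # number of GCP instances


def instance_range(i, chunk=CHUNK):
    """Return (start, end) for instance i using start = chunk * i, end = chunk * (i + 1)."""
    return chunk * i, chunk * (i + 1)


for i in range(INSTANCES):
    start, end = instance_range(i)
    print(f"python dalle_datasets/crawl_caption.py -s {start} -e {end}")
```

Instance 0 gets 0 to 7879 and instance 9 gets 70911 to 78790, so consecutive ranges share a boundary value.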

Delete rows with a NULL caption

DELETE FROM $table WHERE caption IS NULL;
SELECT count(*) FROM $table;

+----------+
| count(*) |
+----------+
| 30246704 |
+----------+

Count rows per mime type

SELECT mime, count(mime) AS count FROM $table GROUP BY mime;

+---------------+----------+
| mime          | count    |
+---------------+----------+
| image/png     |  1469744 |
| image/jpeg    | 27229537 |
| image/svg+xml |  1089487 |
| image/gif     |    71366 |
| image/tiff    |   383288 |
| image/x-xcf   |      680 |
| image/webp    |     2602 |
+---------------+----------+
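As a quick consistency check, the per-type counts above can be summed and compared with the row counts reported earlier. The numbers are copied from the query results in this document; the variable names are only for illustration.

```python
# Counts copied from the GROUP BY result above.
mime_counts = {
    "image/png": 1469744,
    "image/jpeg": 27229537,
    "image/svg+xml": 1089487,
    "image/gif": 71366,
    "image/tiff": 383288,
    "image/x-xcf": 680,
    "image/webp": 2602,
}

rows_after_title_crawl = 63024034   # count(*) after crawling titles
rows_with_caption = 30246704        # count(*) after deleting NULL captions

total = sum(mime_counts.values())
removed = rows_after_title_crawl - rows_with_caption

print(total)    # 30246704, matches the row count after the DELETE
print(removed)  # 32777330 rows had no caption

# Share of each mime type, largest first.
for mime, count in sorted(mime_counts.items(), key=lambda item: item[1], reverse=True):
    print(f"{mime:15s} {count / total:8.3%}")
```

The sum matching 30,246,704 shows that every remaining row has a non-NULL mime value, since count(mime) skips NULLs.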

(WIP) Crawl image

python dalle_datasets/crawl_image.py
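Since crawl_image.py is still a work in progress, the following is not its implementation, only a minimal sketch of how the url column could be used to fetch image bytes with the standard library. The function names, the User-Agent string, and the extension mapping are all assumptions made for this example.

```python
import urllib.request

# Assumed file extensions for the mime types listed in the file summary.
EXTENSIONS = {
    "image/png": ".png",
    "image/jpeg": ".jpg",
    "image/svg+xml": ".svg",
    "image/gif": ".gif",
    "image/tiff": ".tiff",
    "image/x-xcf": ".xcf",
    "image/webp": ".webp",
}


def local_name(row_id, mime):
    """Build a local file name from the id and mime columns."""
    return f"{row_id}{EXTENSIONS[mime]}"


def fetch_image(url, user_agent="dalle-datasets-example/0.1", timeout=30):
    """Download the raw bytes behind a url column value (hypothetical helper)."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


print(local_name(5637467, "image/png"))

# Example use (performs a network request, so it is left commented out):
# data = fetch_image("https://upload.wikimedia.org/wikipedia/commons/a/a8/Ph_locator_cotabato_carmen.png")
# with open(local_name(5637467, "image/png"), "wb") as f:
#     f.write(data)
```

If the bytes are stored in the image column instead of on disk, note that a MySQL mediumblob holds at most 16,777,215 bytes, so larger files would need to be skipped or stored elsewhere.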
