How to download a single parquet file from HuggingFace for use in your recipe? #1000

touma-I · 2025-01-30T13:35:06Z

touma-I
Jan 30, 2025
Maintainer

When building a new recipe, it is often desirable to test the notebook with an existing dataset from Hugging Face. What is the easiest way to do that?

touma-I · 2025-01-30T13:35:17Z

touma-I
Jan 30, 2025
Maintainer Author

The following code snippet shows how to download a single parquet file from the FineWeb dataset:

import os
import urllib.request

url = 'https://huggingface.co/datasets/HuggingFaceFW/fineweb/resolve/main/data/CC-MAIN-2013-20/000_00000.parquet'
# Create the target folder if it does not exist yet.
os.makedirs("input-folder", exist_ok=True)
# Download the file into the target folder.
urllib.request.urlretrieve(url, "input-folder/000_00000.parquet")
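
When a notebook is re-run often, the snippet above can be wrapped in a small helper that skips the download if the file is already on disk. This is only a sketch: the `download_file` name, its `dest_dir` default, and the skip-if-exists behavior are assumptions made for illustration and are not part of the original post.

```python
import os
import urllib.request
from urllib.parse import urlparse


def download_file(url: str, dest_dir: str = "input-folder") -> str:
    """Download `url` into `dest_dir` and return the local path.

    Hypothetical helper: the file name is taken from the last segment of
    the URL path, and an existing file is reused instead of fetched again.
    """
    os.makedirs(dest_dir, exist_ok=True)
    filename = os.path.basename(urlparse(url).path)
    dest_path = os.path.join(dest_dir, filename)
    if not os.path.exists(dest_path):
        urllib.request.urlretrieve(url, dest_path)
    return dest_path
```

Calling `download_file(url)` with the FineWeb URL above would place the file at `input-folder/000_00000.parquet`. The existence check is deliberately naive: an interrupted download can leave a partial file behind that this helper would treat as complete.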

1 reply

touma-I Feb 3, 2025
Maintainer Author

The preferred method is to use the Hugging Face Hub download API, since it caches files locally and skips the download when the file is already in the cache:

# Run in a notebook cell (the leading "!" is notebook shell syntax).
!pip install --upgrade huggingface_hub
from huggingface_hub import hf_hub_download

REPO_ID = "HuggingFaceFW/fineweb"
FILENAME = "data/CC-MAIN-2013-20/000_00000.parquet"

# hf_hub_download returns the local path of the cached file.
local_path = hf_hub_download(repo_id=REPO_ID, filename=FILENAME, repo_type="dataset")
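
The `REPO_ID` and `FILENAME` values map directly onto the direct-download URL used in the first snippet of this thread. As a rough illustration, the hypothetical helper below (it is not part of `huggingface_hub`) rebuilds that URL, assuming the `main` revision and the `datasets/<repo>/resolve/<revision>/<path>` layout visible in that URL:

```python
def dataset_file_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build a direct-download URL for one file in a Hub dataset repo.

    Hypothetical helper: it assumes the URL layout seen in the urllib
    example, i.e. datasets/<repo_id>/resolve/<revision>/<filename>.
    """
    return f"https://huggingface.co/datasets/{repo_id}/resolve/{revision}/{filename}"


fineweb_url = dataset_file_url(
    "HuggingFaceFW/fineweb", "data/CC-MAIN-2013-20/000_00000.parquet"
)
```

Here `fineweb_url` matches the URL passed to `urllib.request.urlretrieve` earlier. The library also provides `hf_hub_url` for building such URLs, which is likely the safer choice in real code.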

touma-I · 2025-01-30T13:46:41Z

touma-I
Jan 30, 2025
Maintainer Author

Another approach is to use the Hugging Face Hub API, as follows:

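from huggingface_hub import hf_hub_download
import pandas as pd

REPO_ID = "wikimedia/wikipedia"
FILENAME = "20231101.en/train-00000-of-00041.parquet"

# hf_hub_download returns the local path of the cached file.
local_path = hf_hub_download(repo_id=REPO_ID, filename=FILENAME, repo_type="dataset")
# Load the file with pandas (needs a parquet engine such as pyarrow).
df = pd.read_parquet(local_path)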

0 replies
