Commit cd55ad0

Update docs once dataset scripts transferred to the Hub (huggingface#5136)
* Update docs about Datasets on GitHub (legacy)
* Delete ADD_NEW_DATASET.md
* Update the issue template chooser
* Update docs
* Update docstrings
* Update ADD_NEW_DATASET.md
1 parent 2699593 commit cd55ad0

9 files changed: +48 -406 lines changed

.github/ISSUE_TEMPLATE/add-dataset.md

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+ ---
+ name: "Add Dataset"
+ about: Request the addition of a specific dataset to the library.
+ title: ''
+ labels: 'dataset request'
+ assignees: ''
+
+ ---
+
+ ## Adding a Dataset
+ - **Name:** *name of the dataset*
+ - **Description:** *short description of the dataset (or link to social media or blog post)*
+ - **Paper:** *link to the dataset paper if available*
+ - **Data:** *link to the Github repository or current dataset location*
+ - **Motivation:** *what are some good reasons to have this dataset*

.github/ISSUE_TEMPLATE/config.yml

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
  contact_links:
  - name: Datasets on the Hugging Face Hub
  url: https://huggingface.co/datasets
- about: Open a Pull request / Discussion related to a specific dataset on the Hugging Face Hub (PRs for datasets with no namespace still have to be on GitHub though)
+ about: Please use the "Community" tab of the dataset on the Hugging Face Hub to open a discussion or a pull request
  - name: Forum
  url: https://discuss.huggingface.co/c/datasets/10
  about: Please ask and answer questions here, and engage with other community members

ADD_NEW_DATASET.md

Lines changed: 8 additions & 357 deletions
Large diffs are not rendered by default.

README.md

Lines changed: 1 addition & 3 deletions
@@ -32,7 +32,7 @@

  [🎓 **Documentation**](https://huggingface.co/docs/datasets/) [🕹 **Colab tutorial**](https://colab.research.google.com/github/huggingface/datasets/blob/main/notebooks/Overview.ipynb)

- [🔎 **Find a dataset in the Hub**](https://huggingface.co/datasets) [🌟 **Add a new dataset to the Hub**](https://github.com/huggingface/datasets/blob/main/ADD_NEW_DATASET.md)
+ [🔎 **Find a dataset in the Hub**](https://huggingface.co/datasets) [🌟 **Add a new dataset to the Hub**](https://huggingface.co/docs/datasets/share.html)

  <h3 align="center">
  <a href="https://hf.co/course"><img src="https://raw.githubusercontent.com/huggingface/datasets/main/docs/source/imgs/course_banner.png"></a>
@@ -127,8 +127,6 @@ We have a very detailed step-by-step guide to add a new dataset to the ![number

  You will find [the step-by-step guide here](https://huggingface.co/docs/datasets/share.html) to add a dataset on the Hub.

- However if you prefer to add your dataset in this repository, you can find the guide [here](https://github.com/huggingface/datasets/blob/main/ADD_NEW_DATASET.md).
-
  # Main differences between 🤗 Datasets and `tfds`

  If you are familiar with the great TensorFlow Datasets, here are the main differences between 🤗 Datasets and `tfds`:

docs/source/about_dataset_load.mdx

Lines changed: 2 additions & 2 deletions
@@ -102,12 +102,12 @@ To ensure a dataset is complete, [`load_dataset`] will perform a series of tests
  If the dataset doesn't pass the verifications, it is likely that the original host of the dataset made some changes in the data files.
  In this case, an error is raised to alert that the dataset has changed.
  To ignore the error, one needs to specify `ignore_verifications=True` in [`load_dataset`].
- Anytime you see a verification error, feel free to [open an issue on GitHub](https://github.com/huggingface/datasets/issues) so that we can update the integrity checks for this dataset.
+ Anytime you see a verification error, feel free to open a discussion or pull request in the corresponding dataset "Community" tab, so that the integrity checks for that dataset are updated.

  ## Security

  The dataset repositories on the Hub are scanned for malware, see more information [here](https://huggingface.co/docs/hub/security#malware-scanning).

- Moreover the datasets that were constributed on our GitHub repository have all been reviewed by our maintainers.
+ Moreover the datasets without a namespace (originally contributed on our GitHub repository) have all been reviewed by our maintainers.
  The code of these datasets is considered **safe**.
  It concerns datasets that are not under a namespace, e.g. "squad" or "glue", unlike the other datasets that are named "username/dataset_name" or "org/dataset_name".
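
Reviewer note: the namespace distinction described in this hunk can be sketched in a few lines. This is only an illustration under stated assumptions: `has_namespace` is a hypothetical helper written for this note, not a function of the `datasets` library, and it only inspects the shape of the identifier string.

```python
def has_namespace(dataset_id: str) -> bool:
    """Return True for ids shaped like "username/dataset_name" or "org/dataset_name".

    Hypothetical helper for illustration; it checks the string only and does
    not contact the Hub.
    """
    parts = dataset_id.split("/")
    # Exactly one "/" with a non-empty name on each side.
    return len(parts) == 2 and all(parts)


for dataset_id in ("squad", "glue", "username/dataset_name"):
    kind = "namespaced" if has_namespace(dataset_id) else "no namespace (originally contributed on GitHub)"
    print(f"{dataset_id}: {kind}")
```

Per the text above, only the identifiers without a namespace fall under the "reviewed by our maintainers" statement.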

docs/source/loading.mdx

Lines changed: 1 addition & 1 deletion
@@ -281,7 +281,7 @@ An object data type in [pandas.Series](https://pandas.pydata.org/docs/reference/

  ## Offline

- Even if you don't have an internet connection, it is still possible to load a dataset. As long as you've downloaded a dataset from the Hub or 🤗 Datasets GitHub repository before, it should be cached. This means you can reload the dataset from the cache and use it offline.
+ Even if you don't have an internet connection, it is still possible to load a dataset. As long as you've downloaded a dataset from the Hub repository before, it should be cached. This means you can reload the dataset from the cache and use it offline.

  If you know you won't have internet access, you can run 🤗 Datasets in full offline mode. This saves time because instead of waiting for the Dataset builder download to time out, 🤗 Datasets will look directly in the cache. Set the environment variable `HF_DATASETS_OFFLINE` to `1` to enable full offline mode.
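
Reviewer note: a minimal sketch of enabling the offline mode described in this hunk from Python. The variable name `HF_DATASETS_OFFLINE` and the value `1` come from the documentation text; `enable_offline_mode` is a hypothetical helper, and setting the variable before importing the library is an assumption about when the value is read.

```python
import os


def enable_offline_mode() -> None:
    """Set HF_DATASETS_OFFLINE=1 for the current process (hypothetical helper)."""
    os.environ["HF_DATASETS_OFFLINE"] = "1"


# Assumption: the library reads the variable when it is imported,
# so the helper is called before any `import datasets`.
enable_offline_mode()
print(os.environ["HF_DATASETS_OFFLINE"])
```

Exporting the variable in the shell before starting Python should have the same effect and avoids any dependence on import order.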

docs/source/share.mdx

Lines changed: 5 additions & 9 deletions
@@ -144,18 +144,14 @@ Members of the Hugging Face team will be happy to review your dataset script and
  ## Datasets on GitHub (legacy)

  Datasets used to be hosted on our GitHub repository, but all datasets have now been migrated to the Hugging Face Hub.
- The legacy GitHub datasets were added originally on our GitHub repository and therefore don't have a namespace: "squad", "glue", etc. unlike the other datasets that are named "username/dataset_name" or "org/dataset_name".
- Those datasets are still maintained on GitHub, and if you'd like to edit them, please open a Pull Request on the huggingface/datasets repository.
- Sharing your dataset to the Hub is the recommended way of adding a dataset.
+
+ The legacy GitHub datasets were added originally on our GitHub repository and therefore don't have a namespace on the Hub: "squad", "glue", etc. unlike the other datasets that are named "username/dataset_name" or "org/dataset_name".

  <Tip>

- The distinction between a Hub dataset and a dataset from GitHub only comes from the legacy sharing workflow. It does not involve any ranking, decisioning, or opinion regarding the contents of the dataset itself.
+ The distinction between a Hub dataset within or without a namespace only comes from the legacy sharing workflow. It does not involve any ranking, decisioning, or opinion regarding the contents of the dataset itself.

  </Tip>

- The code of these datasets are reviewed by the Hugging Face team, and they require test data in order to be regularly tested.
-
- In some rare cases it makes more sense to open a PR on GitHub. For example when you are not the author of the dataset and there is no clear organization / namespace that you can put the dataset under.
-
- For more info, please take a look at the documentation on [How to add a new dataset in the huggingface/datasets repository](https://github.com/huggingface/datasets/blob/main/ADD_NEW_DATASET.md).
+ Those datasets are mow maintained on the Hub: if you think a fix is needed, please use their "Community" tab to open a discussion or create a Pull Request.
+ The code of these datasets is reviewed by the Hugging Face team.

src/datasets/inspect.py

Lines changed: 6 additions & 12 deletions
@@ -341,12 +341,9 @@ def get_dataset_config_info(
  data_files (:obj:`str` or :obj:`Sequence` or :obj:`Mapping`, optional): Path(s) to source data file(s).
  download_config (:class:`~download.DownloadConfig`, optional): Specific download configuration parameters.
  download_mode (:class:`DownloadMode`, default ``REUSE_DATASET_IF_EXISTS``): Download/generate mode.
- revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load:
-
- - For datasets in the `huggingface/datasets` library on GitHub like "squad", the default version of the module is the local version of the lib.
- You can specify a different version from your local version of the lib (e.g. "main" or "1.2.0") but it might cause compatibility issues.
- - For community datasets like "lhoestq/squad" that have their own git repository on the Datasets Hub, the default version "main" corresponds to the "main" branch.
- You can specify a different version that the default "main" by using a commit sha or a git tag of the dataset repository.
+ revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load.
+ As datasets have their own git repository on the Datasets Hub, the default version "main" corresponds to their "main" branch.
+ You can specify a different version than the default "main" by using a commit SHA or a git tag of the dataset repository.
  use_auth_token (``str`` or :obj:`bool`, optional): Optional string or boolean to use as Bearer token for remote files on the Datasets Hub.
  If True, will get token from `"~/.huggingface"`.
  **config_kwargs (additional keyword arguments): optional attributes for builder class which will override the attributes if supplied.
@@ -405,12 +402,9 @@ def get_dataset_split_names(
  data_files (:obj:`str` or :obj:`Sequence` or :obj:`Mapping`, optional): Path(s) to source data file(s).
  download_config (:class:`~download.DownloadConfig`, optional): Specific download configuration parameters.
  download_mode (:class:`DownloadMode`, default ``REUSE_DATASET_IF_EXISTS``): Download/generate mode.
- revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load:
-
- - For datasets in the `huggingface/datasets` library on GitHub like "squad", the default version of the module is the local version of the lib.
- You can specify a different version from your local version of the lib (e.g. "main" or "1.2.0") but it might cause compatibility issues.
- - For community datasets like "lhoestq/squad" that have their own git repository on the Datasets Hub, the default version "main" corresponds to the "main" branch.
- You can specify a different version that the default "main" by using a commit sha or a git tag of the dataset repository.
+ revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load.
+ As datasets have their own git repository on the Datasets Hub, the default version "main" corresponds to their "main" branch.
+ You can specify a different version than the default "main" by using a commit SHA or a git tag of the dataset repository.
  use_auth_token (``str`` or :obj:`bool`, optional): Optional string or boolean to use as Bearer token for remote files on the Datasets Hub.
  If True, will get token from `"~/.huggingface"`.
  **config_kwargs (additional keyword arguments): optional attributes for builder class which will override the attributes if supplied.

src/datasets/load.py

Lines changed: 9 additions & 21 deletions
@@ -1079,12 +1079,9 @@ def dataset_module_factory(
  -> load the dataset builder from the dataset script in the dataset repository
  e.g. ``glue``, ``squad``, ``'username/dataset_name'``, a dataset repository on the HF hub containing a dataset script `'dataset_name.py'`.

- revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load:
-
- - For datasets in the `huggingface/datasets` library on GitHub like "squad", the default version of the module is the local version of the lib.
- You can specify a different version from your local version of the lib (e.g. "main" or "1.2.0") but it might cause compatibility issues.
- - For community datasets like "lhoestq/squad" that have their own git repository on the Datasets Hub, the default version "main" corresponds to the "main" branch.
- You can specify a different version that the default "main" by using a commit sha or a git tag of the dataset repository.
+ revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load.
+ As datasets have their own git repository on the Datasets Hub, the default version "main" corresponds to their "main" branch.
+ You can specify a different version than the default "main" by using a commit SHA or a git tag of the dataset repository.
  download_config (:class:`DownloadConfig`, optional): Specific download configuration parameters.
  download_mode (:class:`DownloadMode`, default ``REUSE_DATASET_IF_EXISTS``): Download/generate mode.
  dynamic_modules_path (Optional str, defaults to HF_MODULES_CACHE / "datasets_modules", i.e. ~/.cache/huggingface/modules/datasets_modules):
@@ -1121,9 +1118,6 @@ def dataset_module_factory(
  # - if path is a local directory (but no python file)
  # -> use a packaged module (csv, text etc.) based on content of the directory
  #
- # - if path has no "/" and is a module on GitHub (in /datasets)
- # -> use the module from the python file on GitHub
- # Note that this case will be removed in favor of loading from the HF Hub instead eventually
  # - if path has one "/" and is dataset repository on the HF hub with a python file
  # -> the module from the python file in the dataset repository
  # - if path has one "/" and is dataset repository on the HF hub without a python file
@@ -1459,12 +1453,9 @@ def load_dataset_builder(
  features (:class:`Features`, optional): Set the features type to use for this dataset.
  download_config (:class:`~utils.DownloadConfig`, optional): Specific download configuration parameters.
  download_mode (:class:`DownloadMode`, default ``REUSE_DATASET_IF_EXISTS``): Download/generate mode.
- revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load:
-
- - For datasets in the `huggingface/datasets` library on GitHub like "squad", the default version of the module is the local version of the lib.
- You can specify a different version from your local version of the lib (e.g. "main" or "1.2.0") but it might cause compatibility issues.
- - For community datasets like "lhoestq/squad" that have their own git repository on the Datasets Hub, the default version "main" corresponds to the "main" branch.
- You can specify a different version that the default "main" by using a commit sha or a git tag of the dataset repository.
+ revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load.
+ As datasets have their own git repository on the Datasets Hub, the default version "main" corresponds to their "main" branch.
+ You can specify a different version than the default "main" by using a commit SHA or a git tag of the dataset repository.
  use_auth_token (``str`` or :obj:`bool`, optional): Optional string or boolean to use as Bearer token for remote files on the Datasets Hub.
  If True, will get token from `"~/.huggingface"`.
  **config_kwargs (additional keyword arguments): Keyword arguments to be passed to the :class:`BuilderConfig`
@@ -1629,12 +1620,9 @@ def load_dataset(
  will not be copied in-memory unless explicitly enabled by setting `datasets.config.IN_MEMORY_MAX_SIZE` to
  nonzero. See more details in the :ref:`load_dataset_enhancing_performance` section.
  save_infos (:obj:`bool`, default ``False``): Save the dataset information (checksums/size/splits/...).
- revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load:
-
- - For datasets in the `huggingface/datasets` library on GitHub like "squad", the default version of the module is the local version of the lib.
- You can specify a different version from your local version of the lib (e.g. "main" or "1.2.0") but it might cause compatibility issues.
- - For community datasets like "lhoestq/squad" that have their own git repository on the Datasets Hub, the default version "main" corresponds to the "main" branch.
- You can specify a different version that the default "main" by using a commit sha or a git tag of the dataset repository.
+ revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load.
+ As datasets have their own git repository on the Datasets Hub, the default version "main" corresponds to their "main" branch.
+ You can specify a different version than the default "main" by using a commit SHA or a git tag of the dataset repository.
  use_auth_token (``str`` or :obj:`bool`, optional): Optional string or boolean to use as Bearer token for remote files on the Datasets Hub.
  If True, will get token from `"~/.huggingface"`.
  task (``str``): The task to prepare the dataset for during training and evaluation. Casts the dataset's :class:`Features` to standardized column names and types as detailed in :py:mod:`datasets.tasks`.
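
Reviewer note: the updated `revision` docstring can be illustrated with a short sketch. `load_at_revision` is a hypothetical wrapper written for this note; it assumes the `datasets` package is installed and simply forwards `revision` to `load_dataset`, defaulting to the "main" branch as the docstring describes.

```python
def load_at_revision(path: str, revision: str = "main"):
    """Load a Hub dataset at a given branch, tag, or commit SHA (hypothetical wrapper)."""
    # Imported lazily so that defining the helper does not require the library.
    from datasets import load_dataset

    return load_dataset(path, revision=revision)


# Example call (needs network access and a real repository; the values are placeholders):
# dataset = load_at_revision("username/dataset_name", revision="<commit-sha-or-tag>")
```

Pinning a commit SHA or tag rather than relying on "main" keeps the loaded dataset script stable when the repository is later updated.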
