-   about: Open a Pull request / Discussion related to a specific dataset on the Hugging Face Hub (PRs for datasets with no namespace still have to be on GitHub though)
+   about: Please use the "Community" tab of the dataset on the Hugging Face Hub to open a discussion or a pull request
  - name: Forum
    url: https://discuss.huggingface.co/c/datasets/10
    about: Please ask and answer questions here, and engage with other community members
- [🔎 **Find a dataset in the Hub**](https://huggingface.co/datasets)[🌟 **Add a new dataset to the Hub**](https://github.com/huggingface/datasets/blob/main/ADD_NEW_DATASET.md)
+ [🔎 **Find a dataset in the Hub**](https://huggingface.co/datasets)[🌟 **Add a new dataset to the Hub**](https://huggingface.co/docs/datasets/share.html)
@@ -127,8 +127,6 @@ We have a very detailed step-by-step guide to add a new dataset to the  to add a dataset on the Hub.
- However if you prefer to add your dataset in this repository, you can find the guide [here](https://github.com/huggingface/datasets/blob/main/ADD_NEW_DATASET.md).
-
  # Main differences between 🤗 Datasets and `tfds`

  If you are familiar with the great TensorFlow Datasets, here are the main differences between 🤗 Datasets and `tfds`:
docs/source/about_dataset_load.mdx (2 additions & 2 deletions)
@@ -102,12 +102,12 @@ To ensure a dataset is complete, [`load_dataset`] will perform a series of tests
  If the dataset doesn't pass the verifications, it is likely that the original host of the dataset made some changes in the data files.
  In this case, an error is raised to alert that the dataset has changed.
  To ignore the error, one needs to specify `ignore_verifications=True` in [`load_dataset`].
- Anytime you see a verification error, feel free to [open an issue on GitHub](https://github.com/huggingface/datasets/issues) so that we can update the integrity checks for this dataset.
+ Anytime you see a verification error, feel free to open a discussion or pull request in the corresponding dataset "Community" tab, so that the integrity checks for that dataset are updated.

  ## Security

  The dataset repositories on the Hub are scanned for malware, see more information [here](https://huggingface.co/docs/hub/security#malware-scanning).

- Moreover the datasets that were constributed on our GitHub repository have all been reviewed by our maintainers.
+ Moreover the datasets without a namespace (originally contributed on our GitHub repository) have all been reviewed by our maintainers.
  The code of these datasets is considered **safe**.
  It concerns datasets that are not under a namespace, e.g. "squad" or "glue", unlike the other datasets that are named "username/dataset_name" or "org/dataset_name".
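For readers of this hunk, the `ignore_verifications=True` flag named in the context lines could be used roughly as below. This is a minimal sketch, not part of the PR: the wrapper name `load_skipping_verifications` is hypothetical, and the flag is taken as described in the documentation text above (newer releases of the library may expose this option differently).

```python
def load_skipping_verifications(path: str, **kwargs):
    """Load a dataset while skipping the integrity checks described in the docs.

    Hypothetical wrapper for illustration only.
    """
    # Imported lazily so the sketch can be read or imported without
    # the `datasets` package installed.
    from datasets import load_dataset

    # `ignore_verifications=True` is the flag named in the documentation text;
    # treat its availability in your installed version as an assumption.
    return load_dataset(path, ignore_verifications=True, **kwargs)


# Example call (requires `datasets` and either network access or a warm cache):
# dataset = load_skipping_verifications("squad")
```

Skipping the checks hides the symptom only; the updated docs still ask users to report the mismatch in the dataset's "Community" tab.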
docs/source/loading.mdx (1 addition & 1 deletion)
@@ -281,7 +281,7 @@ An object data type in [pandas.Series](https://pandas.pydata.org/docs/reference/
  ## Offline

- Even if you don't have an internet connection, it is still possible to load a dataset. As long as you've downloaded a dataset from the Hub or 🤗 Datasets GitHub repository before, it should be cached. This means you can reload the dataset from the cache and use it offline.
+ Even if you don't have an internet connection, it is still possible to load a dataset. As long as you've downloaded a dataset from the Hub repository before, it should be cached. This means you can reload the dataset from the cache and use it offline.

  If you know you won't have internet access, you can run 🤗 Datasets in full offline mode. This saves time because instead of waiting for the Dataset builder download to time out, 🤗 Datasets will look directly in the cache. Set the environment variable `HF_DATASETS_OFFLINE` to `1` to enable full offline mode.
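The offline mode described in the last context line can be sketched from Python as follows. This is an illustrative example, not part of the PR: the helper names `enable_offline_mode` and `load_from_cache` are hypothetical, and the sketch assumes the variable is set before `datasets` is first imported, so that the library sees it when its configuration is read.

```python
import os


def enable_offline_mode() -> None:
    """Set HF_DATASETS_OFFLINE=1 for the current process (hypothetical helper)."""
    os.environ["HF_DATASETS_OFFLINE"] = "1"


def load_from_cache(path: str, **kwargs):
    """Load a previously downloaded dataset without contacting the Hub."""
    enable_offline_mode()
    # Imported after the variable is set; assumes `datasets` was not
    # already imported elsewhere in the process.
    from datasets import load_dataset

    return load_dataset(path, **kwargs)


# Example call (only works if "squad" is already in the local cache):
# dataset = load_from_cache("squad")
```

Setting the variable in the shell before launching Python (`HF_DATASETS_OFFLINE=1 python script.py`) achieves the same thing without any code changes.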
docs/source/share.mdx (5 additions & 9 deletions)
@@ -144,18 +144,14 @@ Members of the Hugging Face team will be happy to review your dataset script and
  ## Datasets on GitHub (legacy)

  Datasets used to be hosted on our GitHub repository, but all datasets have now been migrated to the Hugging Face Hub.
- The legacy GitHub datasets were added originally on our GitHub repository and therefore don't have a namespace: "squad", "glue", etc. unlike the other datasets that are named "username/dataset_name" or "org/dataset_name".
- Those datasets are still maintained on GitHub, and if you'd like to edit them, please open a Pull Request on the huggingface/datasets repository.
- Sharing your dataset to the Hub is the recommended way of adding a dataset.
+
+ The legacy GitHub datasets were added originally on our GitHub repository and therefore don't have a namespace on the Hub: "squad", "glue", etc. unlike the other datasets that are named "username/dataset_name" or "org/dataset_name".

  <Tip>

- The distinction between a Hub dataset and a dataset from GitHub only comes from the legacy sharing workflow. It does not involve any ranking, decisioning, or opinion regarding the contents of the dataset itself.
+ The distinction between a Hub dataset within or without a namespace only comes from the legacy sharing workflow. It does not involve any ranking, decisioning, or opinion regarding the contents of the dataset itself.

  </Tip>

- The code of these datasets are reviewed by the Hugging Face team, and they require test data in order to be regularly tested.
-
- In some rare cases it makes more sense to open a PR on GitHub. For example when you are not the author of the dataset and there is no clear organization / namespace that you can put the dataset under.
-
- For more info, please take a look at the documentation on [How to add a new dataset in the huggingface/datasets repository](https://github.com/huggingface/datasets/blob/main/ADD_NEW_DATASET.md).
+ Those datasets are now maintained on the Hub: if you think a fix is needed, please use their "Community" tab to open a discussion or create a Pull Request.
+ The code of these datasets is reviewed by the Hugging Face team.
- revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load:
-
- - For datasets in the `huggingface/datasets` library on GitHub like "squad", the default version of the module is the local version of the lib.
- You can specify a different version from your local version of the lib (e.g. "main" or "1.2.0") but it might cause compatibility issues.
- - For community datasets like "lhoestq/squad" that have their own git repository on the Datasets Hub, the default version "main" corresponds to the "main" branch.
- You can specify a different version that the default "main" by using a commit sha or a git tag of the dataset repository.
+ revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load.
+ As datasets have their own git repository on the Datasets Hub, the default version "main" corresponds to their "main" branch.
+ You can specify a different version than the default "main" by using a commit SHA or a git tag of the dataset repository.
  use_auth_token (``str`` or :obj:`bool`, optional): Optional string or boolean to use as Bearer token for remote files on the Datasets Hub.
  If True, will get token from `"~/.huggingface"`.
  **config_kwargs (additional keyword arguments): optional attributes for builder class which will override the attributes if supplied.
@@ -405,12 +402,9 @@ def get_dataset_split_names(
  data_files (:obj:`str` or :obj:`Sequence` or :obj:`Mapping`, optional): Path(s) to source data file(s).
  download_config (:class:`~download.DownloadConfig`, optional): Specific download configuration parameters.
- revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load:
-
- - For datasets in the `huggingface/datasets` library on GitHub like "squad", the default version of the module is the local version of the lib.
- You can specify a different version from your local version of the lib (e.g. "main" or "1.2.0") but it might cause compatibility issues.
- - For community datasets like "lhoestq/squad" that have their own git repository on the Datasets Hub, the default version "main" corresponds to the "main" branch.
- You can specify a different version that the default "main" by using a commit sha or a git tag of the dataset repository.
+ revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load.
+ As datasets have their own git repository on the Datasets Hub, the default version "main" corresponds to their "main" branch.
+ You can specify a different version than the default "main" by using a commit SHA or a git tag of the dataset repository.
  use_auth_token (``str`` or :obj:`bool`, optional): Optional string or boolean to use as Bearer token for remote files on the Datasets Hub.
  If True, will get token from `"~/.huggingface"`.
  **config_kwargs (additional keyword arguments): optional attributes for builder class which will override the attributes if supplied.
  -> load the dataset builder from the dataset script in the dataset repository
  e.g. ``glue``, ``squad``, ``'username/dataset_name'``, a dataset repository on the HF hub containing a dataset script `'dataset_name.py'`.

- revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load:
-
- - For datasets in the `huggingface/datasets` library on GitHub like "squad", the default version of the module is the local version of the lib.
- You can specify a different version from your local version of the lib (e.g. "main" or "1.2.0") but it might cause compatibility issues.
- - For community datasets like "lhoestq/squad" that have their own git repository on the Datasets Hub, the default version "main" corresponds to the "main" branch.
- You can specify a different version that the default "main" by using a commit sha or a git tag of the dataset repository.
+ revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load.
+ As datasets have their own git repository on the Datasets Hub, the default version "main" corresponds to their "main" branch.
+ You can specify a different version than the default "main" by using a commit SHA or a git tag of the dataset repository.
  download_config (:class:`DownloadConfig`, optional): Specific download configuration parameters.
- revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load:
-
- - For datasets in the `huggingface/datasets` library on GitHub like "squad", the default version of the module is the local version of the lib.
- You can specify a different version from your local version of the lib (e.g. "main" or "1.2.0") but it might cause compatibility issues.
- - For community datasets like "lhoestq/squad" that have their own git repository on the Datasets Hub, the default version "main" corresponds to the "main" branch.
- You can specify a different version that the default "main" by using a commit sha or a git tag of the dataset repository.
+ revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load.
+ As datasets have their own git repository on the Datasets Hub, the default version "main" corresponds to their "main" branch.
+ You can specify a different version than the default "main" by using a commit SHA or a git tag of the dataset repository.
  use_auth_token (``str`` or :obj:`bool`, optional): Optional string or boolean to use as Bearer token for remote files on the Datasets Hub.
  If True, will get token from `"~/.huggingface"`.
  **config_kwargs (additional keyword arguments): Keyword arguments to be passed to the :class:`BuilderConfig`
@@ -1629,12 +1620,9 @@ def load_dataset(
  will not be copied in-memory unless explicitly enabled by setting `datasets.config.IN_MEMORY_MAX_SIZE` to
  nonzero. See more details in the :ref:`load_dataset_enhancing_performance` section.
  save_infos (:obj:`bool`, default ``False``): Save the dataset information (checksums/size/splits/...).
- revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load:
-
- - For datasets in the `huggingface/datasets` library on GitHub like "squad", the default version of the module is the local version of the lib.
- You can specify a different version from your local version of the lib (e.g. "main" or "1.2.0") but it might cause compatibility issues.
- - For community datasets like "lhoestq/squad" that have their own git repository on the Datasets Hub, the default version "main" corresponds to the "main" branch.
- You can specify a different version that the default "main" by using a commit sha or a git tag of the dataset repository.
+ revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load.
+ As datasets have their own git repository on the Datasets Hub, the default version "main" corresponds to their "main" branch.
+ You can specify a different version than the default "main" by using a commit SHA or a git tag of the dataset repository.
  use_auth_token (``str`` or :obj:`bool`, optional): Optional string or boolean to use as Bearer token for remote files on the Datasets Hub.
  If True, will get token from `"~/.huggingface"`.
  task (``str``): The task to prepare the dataset for during training and evaluation. Casts the dataset's :class:`Features` to standardized column names and types as detailed in :py:mod:`datasets.tasks`.
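The updated `revision` docstring above says a dataset can be pinned to a commit SHA or a git tag instead of the default "main". A minimal sketch of how a caller might use it follows. It is not part of the PR: the helper names `describe_revision` and `load_at_revision` are hypothetical, and the 40-hex-character check is only a heuristic for a full SHA-1 commit hash.

```python
import re

# Heuristic: a full git SHA-1 commit hash is 40 lowercase hex characters.
_FULL_SHA_RE = re.compile(r"[0-9a-f]{40}")


def describe_revision(revision: str = "main") -> str:
    """Classify a revision string as a commit SHA or a branch/tag name."""
    if _FULL_SHA_RE.fullmatch(revision):
        return "commit"
    return "branch-or-tag"


def load_at_revision(path: str, revision: str = "main", **kwargs):
    """Load a dataset pinned to a given revision of its Hub repository."""
    # Imported lazily so the sketch can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(path, revision=revision, **kwargs)


# Example call (requires `datasets` and network access):
# dataset = load_at_revision("lhoestq/squad", revision="main")
```

Pinning to a commit SHA gives the most reproducible result, since branches and tags can be moved after the fact.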