Update docs once dataset scripts transferred to the Hub (huggingface#5136)

albertvillanova · web-flow · commit cd55ad067a89 · 2022-10-20T10:10:00.000+02:00
* Update docs about Datasets on GitHub (legacy)

* Delete ADD_NEW_DATASET.md

* Update the issue template chooser

* Update docs

* Update docstrings

* Update ADD_NEW_DATASET.md
diff --git a/.github/ISSUE_TEMPLATE/add-dataset.md b/.github/ISSUE_TEMPLATE/add-dataset.md
@@ -0,0 +1,15 @@
+---
+name: "Add Dataset"
+about: Request the addition of a specific dataset to the library.
+title: ''
+labels: 'dataset request'
+assignees: ''
+
+---
+
+## Adding a Dataset
+- **Name:** *name of the dataset*
+- **Description:** *short description of the dataset (or link to social media or blog post)*
+- **Paper:** *link to the dataset paper if available*
+- **Data:** *link to the Github repository or current dataset location*
+- **Motivation:** *what are some good reasons to have this dataset*
diff --git a/.github/ISSUE_TEMPLATE/config.yml b/.github/ISSUE_TEMPLATE/config.yml
@@ -1,7 +1,7 @@
 contact_links:
   - name: Datasets on the Hugging Face Hub
     url: https://huggingface.co/datasets
-    about: Open a Pull request / Discussion related to a specific dataset on the Hugging Face Hub (PRs for datasets with no namespace still have to be on GitHub though)
+    about: Please use the "Community" tab of the dataset on the Hugging Face Hub to open a discussion or a pull request
   - name: Forum
     url: https://discuss.huggingface.co/c/datasets/10
     about: Please ask and answer questions here, and engage with other community members
diff --git a/ADD_NEW_DATASET.md b/ADD_NEW_DATASET.md
diff --git a/README.md b/README.md
@@ -32,7 +32,7 @@
 
 [🎓 **Documentation**](https://huggingface.co/docs/datasets/) [🕹 **Colab tutorial**](https://colab.research.google.com/github/huggingface/datasets/blob/main/notebooks/Overview.ipynb)
 
-[🔎 **Find a dataset in the Hub**](https://huggingface.co/datasets) [🌟 **Add a new dataset to the Hub**](https://github.com/huggingface/datasets/blob/main/ADD_NEW_DATASET.md)
+[🔎 **Find a dataset in the Hub**](https://huggingface.co/datasets) [🌟 **Add a new dataset to the Hub**](https://huggingface.co/docs/datasets/share.html)
 
 <h3 align="center">
     <a href="https://hf.co/course"><img src="https://raw.githubusercontent.com/huggingface/datasets/main/docs/source/imgs/course_banner.png"></a>
@@ -127,8 +127,6 @@ We have a very detailed step-by-step guide to add a new dataset to the ![number
 
 You will find [the step-by-step guide here](https://huggingface.co/docs/datasets/share.html) to add a dataset on the Hub.
 
-However if you prefer to add your dataset in this repository, you can find the guide [here](https://github.com/huggingface/datasets/blob/main/ADD_NEW_DATASET.md).
-
 # Main differences between 🤗 Datasets and `tfds`
 
 If you are familiar with the great TensorFlow Datasets, here are the main differences between 🤗 Datasets and `tfds`:
diff --git a/docs/source/about_dataset_load.mdx b/docs/source/about_dataset_load.mdx
@@ -102,12 +102,12 @@ To ensure a dataset is complete, [`load_dataset`] will perform a series of tests
 If the dataset doesn't pass the verifications, it is likely that the original host of the dataset made some changes in the data files.
 In this case, an error is raised to alert that the dataset has changed.
 To ignore the error, one needs to specify `ignore_verifications=True` in [`load_dataset`].
-Anytime you see a verification error, feel free to [open an issue on GitHub](https://github.com/huggingface/datasets/issues) so that we can update the integrity checks for this dataset.
+Anytime you see a verification error, feel free to open a discussion or pull request in the corresponding dataset "Community" tab, so that the integrity checks for that dataset are updated.
 
 ## Security
 
 The dataset repositories on the Hub are scanned for malware, see more information [here](https://huggingface.co/docs/hub/security#malware-scanning).
 
-Moreover the datasets that were constributed on our GitHub repository have all been reviewed by our maintainers.
+Moreover, the datasets without a namespace (originally contributed on our GitHub repository) have all been reviewed by our maintainers.
 The code of these datasets is considered **safe**.
 It concerns datasets that are not under a namespace, e.g. "squad" or "glue", unlike the other datasets that are named "username/dataset_name" or "org/dataset_name".
diff --git a/docs/source/loading.mdx b/docs/source/loading.mdx
@@ -281,7 +281,7 @@ An object data type in [pandas.Series](https://pandas.pydata.org/docs/reference/
 
 ## Offline
 
-Even if you don't have an internet connection, it is still possible to load a dataset. As long as you've downloaded a dataset from the Hub or 🤗 Datasets GitHub repository before, it should be cached. This means you can reload the dataset from the cache and use it offline.
+Even if you don't have an internet connection, it is still possible to load a dataset. As long as you've downloaded a dataset from the Hub repository before, it should be cached. This means you can reload the dataset from the cache and use it offline.
 
 If you know you won't have internet access, you can run 🤗 Datasets in full offline mode. This saves time because instead of waiting for the Dataset builder download to time out, 🤗 Datasets will look directly in the cache. Set the environment variable `HF_DATASETS_OFFLINE` to `1` to enable full offline mode.
 
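Reviewer note on the `loading.mdx` hunk above: a minimal sketch of the full offline mode the context lines describe. `load_cached` is a hypothetical helper, not part of 🤗 Datasets, and the sketch assumes the environment variable is read when the library is imported, so it is set before the import.

```python
import os

# Enable full offline mode. Assumption: the variable is read when `datasets`
# is imported, so it is set here, before any import of the library.
os.environ["HF_DATASETS_OFFLINE"] = "1"


def load_cached(path):
    """Hypothetical helper: reload a previously downloaded dataset from the cache."""
    # Lazy import so the environment variable above is already in place.
    from datasets import load_dataset

    return load_dataset(path)
```

Calling `load_cached("squad")` would only be expected to succeed if "squad" had been downloaded on this machine beforehand.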
diff --git a/docs/source/share.mdx b/docs/source/share.mdx
@@ -144,18 +144,14 @@ Members of the Hugging Face team will be happy to review your dataset script and
 ## Datasets on GitHub (legacy)
 
 Datasets used to be hosted on our GitHub repository, but all datasets have now been migrated to the Hugging Face Hub.
-The legacy GitHub datasets were added originally on our GitHub repository and therefore don't have a namespace: "squad", "glue", etc. unlike the other datasets that are named "username/dataset_name" or "org/dataset_name".
-Those datasets are still maintained on GitHub, and if you'd like to edit them, please open a Pull Request on the huggingface/datasets repository.
-Sharing your dataset to the Hub is the recommended way of adding a dataset.
+
+The legacy GitHub datasets were added originally on our GitHub repository and therefore don't have a namespace on the Hub: "squad", "glue", etc. unlike the other datasets that are named "username/dataset_name" or "org/dataset_name".
 
 <Tip>
 
-The distinction between a Hub dataset and a dataset from GitHub only comes from the legacy sharing workflow. It does not involve any ranking, decisioning, or opinion regarding the contents of the dataset itself.
+The distinction between a Hub dataset with or without a namespace only comes from the legacy sharing workflow. It does not involve any ranking, decisioning, or opinion regarding the contents of the dataset itself.
 
 </Tip>
 
-The code of these datasets are reviewed by the Hugging Face team, and they require test data in order to be regularly tested.
-
-In some rare cases it makes more sense to open a PR on GitHub. For example when you are not the author of the dataset and there is no clear organization / namespace that you can put the dataset under.
-
-For more info, please take a look at the documentation on [How to add a new dataset in the huggingface/datasets repository](https://github.com/huggingface/datasets/blob/main/ADD_NEW_DATASET.md).
+Those datasets are now maintained on the Hub: if you think a fix is needed, please use their "Community" tab to open a discussion or create a Pull Request.
+The code of these datasets is reviewed by the Hugging Face team.
diff --git a/src/datasets/inspect.py b/src/datasets/inspect.py
@@ -341,12 +341,9 @@ def get_dataset_config_info(
         data_files (:obj:`str` or :obj:`Sequence` or :obj:`Mapping`, optional): Path(s) to source data file(s).
         download_config (:class:`~download.DownloadConfig`, optional): Specific download configuration parameters.
         download_mode (:class:`DownloadMode`, default ``REUSE_DATASET_IF_EXISTS``): Download/generate mode.
-        revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load:
-
-            - For datasets in the `huggingface/datasets` library on GitHub like "squad", the default version of the module is the local version of the lib.
-              You can specify a different version from your local version of the lib (e.g. "main" or "1.2.0") but it might cause compatibility issues.
-            - For community datasets like "lhoestq/squad" that have their own git repository on the Datasets Hub, the default version "main" corresponds to the "main" branch.
-              You can specify a different version that the default "main" by using a commit sha or a git tag of the dataset repository.
+        revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load.
+            As datasets have their own git repository on the Datasets Hub, the default version "main" corresponds to their "main" branch.
+            You can specify a different version than the default "main" by using a commit SHA or a git tag of the dataset repository.
         use_auth_token (``str`` or :obj:`bool`, optional): Optional string or boolean to use as Bearer token for remote files on the Datasets Hub.
             If True, will get token from `"~/.huggingface"`.
         **config_kwargs (additional keyword arguments): optional attributes for builder class which will override the attributes if supplied.
@@ -405,12 +402,9 @@ def get_dataset_split_names(
         data_files (:obj:`str` or :obj:`Sequence` or :obj:`Mapping`, optional): Path(s) to source data file(s).
         download_config (:class:`~download.DownloadConfig`, optional): Specific download configuration parameters.
         download_mode (:class:`DownloadMode`, default ``REUSE_DATASET_IF_EXISTS``): Download/generate mode.
-        revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load:
-
-            - For datasets in the `huggingface/datasets` library on GitHub like "squad", the default version of the module is the local version of the lib.
-              You can specify a different version from your local version of the lib (e.g. "main" or "1.2.0") but it might cause compatibility issues.
-            - For community datasets like "lhoestq/squad" that have their own git repository on the Datasets Hub, the default version "main" corresponds to the "main" branch.
-              You can specify a different version that the default "main" by using a commit sha or a git tag of the dataset repository.
+        revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load.
+            As datasets have their own git repository on the Datasets Hub, the default version "main" corresponds to their "main" branch.
+            You can specify a different version than the default "main" by using a commit SHA or a git tag of the dataset repository.
         use_auth_token (``str`` or :obj:`bool`, optional): Optional string or boolean to use as Bearer token for remote files on the Datasets Hub.
             If True, will get token from `"~/.huggingface"`.
         **config_kwargs (additional keyword arguments): optional attributes for builder class which will override the attributes if supplied.
diff --git a/src/datasets/load.py b/src/datasets/load.py
@@ -1079,12 +1079,9 @@ def dataset_module_factory(
               -> load the dataset builder from the dataset script in the dataset repository
               e.g. ``glue``, ``squad``, ``'username/dataset_name'``, a dataset repository on the HF hub containing a dataset script `'dataset_name.py'`.
 
-        revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load:
-
-            - For datasets in the `huggingface/datasets` library on GitHub like "squad", the default version of the module is the local version of the lib.
-              You can specify a different version from your local version of the lib (e.g. "main" or "1.2.0") but it might cause compatibility issues.
-            - For community datasets like "lhoestq/squad" that have their own git repository on the Datasets Hub, the default version "main" corresponds to the "main" branch.
-              You can specify a different version that the default "main" by using a commit sha or a git tag of the dataset repository.
+        revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load.
+            As datasets have their own git repository on the Datasets Hub, the default version "main" corresponds to their "main" branch.
+            You can specify a different version than the default "main" by using a commit SHA or a git tag of the dataset repository.
         download_config (:class:`DownloadConfig`, optional): Specific download configuration parameters.
         download_mode (:class:`DownloadMode`, default ``REUSE_DATASET_IF_EXISTS``): Download/generate mode.
         dynamic_modules_path (Optional str, defaults to HF_MODULES_CACHE / "datasets_modules", i.e. ~/.cache/huggingface/modules/datasets_modules):
@@ -1121,9 +1118,6 @@ def dataset_module_factory(
     # - if path is a local directory (but no python file)
     #   -> use a packaged module (csv, text etc.) based on content of the directory
     #
-    # - if path has no "/" and is a module on GitHub (in /datasets)
-    #   -> use the module from the python file on GitHub
-    #   Note that this case will be removed in favor of loading from the HF Hub instead eventually
     # - if path has one "/" and is dataset repository on the HF hub with a python file
     #   -> the module from the python file in the dataset repository
     # - if path has one "/" and is dataset repository on the HF hub without a python file
@@ -1459,12 +1453,9 @@ def load_dataset_builder(
         features (:class:`Features`, optional): Set the features type to use for this dataset.
         download_config (:class:`~utils.DownloadConfig`, optional): Specific download configuration parameters.
         download_mode (:class:`DownloadMode`, default ``REUSE_DATASET_IF_EXISTS``): Download/generate mode.
-        revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load:
-
-            - For datasets in the `huggingface/datasets` library on GitHub like "squad", the default version of the module is the local version of the lib.
-              You can specify a different version from your local version of the lib (e.g. "main" or "1.2.0") but it might cause compatibility issues.
-            - For community datasets like "lhoestq/squad" that have their own git repository on the Datasets Hub, the default version "main" corresponds to the "main" branch.
-              You can specify a different version that the default "main" by using a commit sha or a git tag of the dataset repository.
+        revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load.
+            As datasets have their own git repository on the Datasets Hub, the default version "main" corresponds to their "main" branch.
+            You can specify a different version than the default "main" by using a commit SHA or a git tag of the dataset repository.
         use_auth_token (``str`` or :obj:`bool`, optional): Optional string or boolean to use as Bearer token for remote files on the Datasets Hub.
             If True, will get token from `"~/.huggingface"`.
         **config_kwargs (additional keyword arguments): Keyword arguments to be passed to the :class:`BuilderConfig`
@@ -1629,12 +1620,9 @@ def load_dataset(
             will not be copied in-memory unless explicitly enabled by setting `datasets.config.IN_MEMORY_MAX_SIZE` to
             nonzero. See more details in the :ref:`load_dataset_enhancing_performance` section.
         save_infos (:obj:`bool`, default ``False``): Save the dataset information (checksums/size/splits/...).
-        revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load:
-
-            - For datasets in the `huggingface/datasets` library on GitHub like "squad", the default version of the module is the local version of the lib.
-              You can specify a different version from your local version of the lib (e.g. "main" or "1.2.0") but it might cause compatibility issues.
-            - For community datasets like "lhoestq/squad" that have their own git repository on the Datasets Hub, the default version "main" corresponds to the "main" branch.
-              You can specify a different version that the default "main" by using a commit sha or a git tag of the dataset repository.
+        revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load.
+            As datasets have their own git repository on the Datasets Hub, the default version "main" corresponds to their "main" branch.
+            You can specify a different version than the default "main" by using a commit SHA or a git tag of the dataset repository.
         use_auth_token (``str`` or :obj:`bool`, optional): Optional string or boolean to use as Bearer token for remote files on the Datasets Hub.
             If True, will get token from `"~/.huggingface"`.
         task (``str``): The task to prepare the dataset for during training and evaluation. Casts the dataset's :class:`Features` to standardized column names and types as detailed in :py:mod:`datasets.tasks`.
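Reviewer note on the updated `revision` docstrings above: a minimal sketch of pinning a dataset script to a branch, commit SHA, or git tag of its Hub repository. The helper names `has_namespace`, `build_load_kwargs`, and `load_pinned` are hypothetical and only illustrate the wording of the docstring; `path` and `revision` are the `load_dataset` parameters documented in this diff.

```python
def has_namespace(dataset_id: str) -> bool:
    """Return True for "username/dataset_name" IDs, False for legacy IDs such as "squad"."""
    return dataset_id.count("/") == 1


def build_load_kwargs(dataset_id: str, revision: str = "main") -> dict:
    """Collect the keyword arguments that pin a dataset to a given revision."""
    # `revision` may be a branch name, a commit SHA, or a git tag of the dataset repository.
    return {"path": dataset_id, "revision": revision}


def load_pinned(dataset_id: str, revision: str = "main"):
    """Load a dataset at a specific revision (requires the library and network or cache)."""
    # Lazy import so the two helpers above work without the library installed.
    from datasets import load_dataset

    return load_dataset(**build_load_kwargs(dataset_id, revision))
```

With this change, the same `revision` semantics apply whether or not the dataset ID has a namespace, which is why the two-case docstring collapses into one.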