Docs details (#2690)
* docs: ✏️ format, update numbers, add link to datasets viewer

* docs: ✏️ add a missing item in the list of documentation parts

* docs: ✏️ fix link format (rst, not md)

* docs: ✏️ update number of datasets + sample

* docs: ✏️ newline at EOF

* docs: ✏️ fix typo

* docs: ✏️ update numbers

* docs: ✏️ wording details

* docs: ✏️ fix typos, and update cli output

* docs: ✏️ typos and details

* docs: ✏️ typos

* docs: ✏️ add an empty line so that the copy/pasted code is OK

* docs: ✏️ small corrections

* docs: ✏️ typos

* docs: ✏️ add an external link

* docs: ✏️ add a code example for shard

* docs: ✏️ fix code example

The verbose option has been removed in df94a7c.

Now there is no easy way to remove the progress bar. Using the hack in #2651 (comment) would make the code snippet too complicated.

* docs: ✏️ fix copy/paste error

* docs: ✏️ fix link

* docs: ✏️ disable the progress bar (replaces verbose=False)

* docs: ✏️ code snippets format

* docs: ✏️ typos and small content edits

* docs: ✏️ details

* docs: ✏️ typos and details

* docs: ✏️ details

* Update docs/source/index.rst

Co-authored-by: Quentin Lhoest <[email protected]>

Co-authored-by: Quentin Lhoest <[email protected]>
severo and lhoestq authored Jul 27, 2021
1 parent fbe4ad9 commit 7d0bd0f
Showing 9 changed files with 118 additions and 103 deletions.
15 changes: 8 additions & 7 deletions docs/source/exploring.rst
@@ -9,11 +9,11 @@ The :class:`datasets.Dataset` object that you get when you execute for instance
>>> from datasets import load_dataset
>>> dataset = load_dataset('glue', 'mrpc', split='train')
-behaves like a normal python container. You can query its length, get rows, columns and also lot of metadata on the dataset (description, citation, split sizes, etc).
+behaves like a normal python container. You can query its length, get rows, columns and also a lot of metadata on the dataset (description, citation, split sizes, etc).

In this guide we will detail what's in this object and how to access all the information.

-An :class:`datasets.Dataset` is a python container with a length coresponding to the number of examples in the dataset. You can access a single example by its index. Let's query the first sample in the dataset:
+A :class:`datasets.Dataset` is a python container with a length corresponding to the number of examples in the dataset. You can access a single example by its index. Let's query the first sample in the dataset:

.. code-block::
@@ -76,9 +76,9 @@ More details on the ``features`` can be found in the guide on :doc:`features` an
Metadata
------------------------------------------------------

-The :class:`datasets.Dataset` object also host many important metadata on the dataset which are all stored in ``dataset.info``. Many of these metadata are also accessible on the lower level, i.e. directly as attributes of the Dataset for shorter access (e.g. ``dataset.info.features`` is also available as ``dataset.features``).
+The :class:`datasets.Dataset` object also hosts many important metadata on the dataset which are all stored in ``dataset.info``. Many of these metadata are also accessible on the lower level, i.e. directly as attributes of the Dataset for shorter access (e.g. ``dataset.info.features`` is also available as ``dataset.features``).

-All these attributes are listed in the package reference on :class:`datasets.DatasetInfo`. The most important metadata are ``split``, ``description``, ``citation``, ``homepage`` (and ``licence`` when this one is available).
+All these attributes are listed in the package reference on :class:`datasets.DatasetInfo`. The most important metadata are ``split``, ``description``, ``citation``, ``homepage`` (and ``license`` when this one is available).

.. code-block::
@@ -168,7 +168,7 @@ You can also get a full column by querying its name as a string. This will retur
As you can see depending on the object queried (single row, batch of rows or column), the returned object is different:

- a single row like ``dataset[0]`` will be returned as a python dictionary of values,
-- a batch like ``dataset[5:10]``) will be returned as a python dictionary of lists of values,
+- a batch like ``dataset[5:10]`` will be returned as a python dictionary of lists of values,
- a column like ``dataset['sentence1']`` will be returned as a python lists of values.

This may seems surprising at first but in our experiments it's actually easier to use these various format for data processing than returning the same format for each of these views on the dataset.
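As an aside (not part of the commit), here is a minimal sketch of the three access patterns listed above, reusing the ``glue``/``mrpc`` dataset from the surrounding examples:

from datasets import load_dataset

dataset = load_dataset('glue', 'mrpc', split='train')

row = dataset[0]               # a python dict of values for one example
batch = dataset[5:10]          # a python dict of lists of values
column = dataset['sentence1']  # a python list of values, one per example

print(type(row), type(batch['sentence1']), type(column))
# <class 'dict'> <class 'list'> <class 'list'>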
@@ -201,12 +201,12 @@ A specific format can be activated with :func:`datasets.Dataset.set_format`.
- :obj:`type` (``Union[None, str]``, default to ``None``) defines the return type for the dataset :obj:`__getitem__` method and is one of ``[None, 'numpy', 'pandas', 'torch', 'tensorflow', 'jax']`` (``None`` means return python objects),
- :obj:`columns` (``Union[None, str, List[str]]``, default to ``None``) defines the columns returned by :obj:`__getitem__` and takes the name of a column in the dataset or a list of columns to return (``None`` means return all columns),
- :obj:`output_all_columns` (``bool``, default to ``False``) controls whether the columns which cannot be formatted (e.g. a column with ``string`` cannot be cast in a PyTorch Tensor) are still outputted as python objects.
-- :obj:`format_kwargs` can be used to provide additional keywords arguments that will be forwarded to the convertiong function like ``np.array``, ``torch.tensor``, ``tensorflow.ragged.constant`` or ``jnp.array``. For instance, to create ``torch.Tensor`` directly on the GPU you can specify ``device='cuda'``.
+- :obj:`format_kwargs` can be used to provide additional keywords arguments that will be forwarded to the converting function like ``np.array``, ``torch.tensor``, ``tensorflow.ragged.constant`` or ``jnp.array``. For instance, to create ``torch.Tensor`` directly on the GPU you can specify ``device='cuda'``.

.. note::

The format is only applied to a single row or batches of rows (i.e. when querying :obj:`dataset[0]` or :obj:`dataset[10:20]`). Querying a column (e.g. :obj:`dataset['sentence1']`) will return the column even if it's filtered by the format. In this case the un-formatted column is returned.
-    This design choice was made because it's quite rare to use column-only access when working with deep-learning frameworks and it's quite usefull to be able to access column even when they are masked by the format.
+    This design choice was made because it's quite rare to use column-only access when working with deep-learning frameworks and it's quite useful to be able to access column even when they are masked by the format.

Here is an example:

@@ -239,6 +239,7 @@ Here is an example to tokenize and pad tokens on-the-fly when accessing the samp
>>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
>>> def encode(batch):
>>> return tokenizer(batch["sentence1"], padding="longest", truncation=True, max_length=512, return_tensors="pt")
+>>>
>>> dataset.set_transform(encode)
>>> dataset.format
{'type': 'custom', 'format_kwargs': {'transform': <function __main__.encode(batch)>}, 'columns': ['idx', 'label', 'sentence1', 'sentence2'], 'output_all_columns': False}
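To make the ``set_format`` arguments discussed in the hunks above concrete, here is a minimal sketch (not part of the commit; it assumes PyTorch is installed and reuses the ``glue``/``mrpc`` dataset):

from datasets import load_dataset

dataset = load_dataset('glue', 'mrpc', split='train')

# Return 'label' and 'idx' as torch tensors, but still output the string
# columns (which cannot be cast to tensors) as python objects.
dataset.set_format(type='torch', columns=['label', 'idx'], output_all_columns=True)

print(dataset[0]['label'])      # a torch.Tensor
print(dataset[0]['sentence1'])  # a plain python string

# Extra format_kwargs are forwarded to the converting function, e.g.:
# dataset.set_format(type='torch', columns=['label', 'idx'], device='cuda')

# Clear the format again.
dataset.reset_format()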
4 changes: 2 additions & 2 deletions docs/source/faiss_and_ea.rst
@@ -3,7 +3,7 @@ Adding a FAISS or Elastic Search index to a Dataset

It is possible to do document retrieval in a dataset.

-For example, one way to do Open Domain Question Answering, one way to do that is to first retrieve documents that may be relevant to answer a question, and then we can use a model to generate an answer given the retrieved documents.
+For example, one way to do Open Domain Question Answering is to first retrieve documents that may be relevant to answer a question, and then we can use a model to generate an answer given the retrieved documents.

FAISS is a library for dense retrieval. It means that it retrieves documents based on their vector representations, by doing a nearest neighbors search.
As we now have models that can generate good semantic vector representations of documents, this has become an interesting tool for document retrieval.
@@ -29,7 +29,7 @@ Adding a FAISS index

The :func:`datasets.Dataset.add_faiss_index` method is in charge of building, training and adding vectors to a FAISS index.

-One way to get good vector representations for text passages is to use the DPR model. We'll compute the representations of only 100 examples just to give you the idea of how it works.
+One way to get good vector representations for text passages is to use the `DPR model <https://huggingface.co/transformers/model_doc/dpr.html>`_. We'll compute the representations of only 100 examples just to give you the idea of how it works.

.. code-block::
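Since the DPR code block referenced above is collapsed in this view, here is a minimal sketch of the pattern the guide describes (assumptions: ``transformers``, ``torch`` and ``faiss`` are installed; the ``crime_and_punish`` dataset and its ``line`` column are used purely for illustration):

import torch
from datasets import load_dataset
from transformers import (DPRContextEncoder, DPRContextEncoderTokenizer,
                          DPRQuestionEncoder, DPRQuestionEncoderTokenizer)

ctx_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

# Compute vector representations for only 100 examples to keep it quick.
ds = load_dataset("crime_and_punish", split="train[:100]")
with torch.no_grad():
    ds_with_embeddings = ds.map(
        lambda example: {
            "embeddings": ctx_encoder(**ctx_tokenizer(example["line"], return_tensors="pt"))[0][0].numpy()
        }
    )

# Build, train and add the vectors to a FAISS index on the 'embeddings' column.
ds_with_embeddings.add_faiss_index(column="embeddings")

# Retrieve the 10 nearest passages for an embedded question.
q_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
with torch.no_grad():
    question_embedding = q_encoder(**q_tokenizer("Is it serious?", return_tensors="pt"))[0][0].numpy()
scores, retrieved_examples = ds_with_embeddings.get_nearest_examples("embeddings", question_embedding, k=10)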
10 changes: 5 additions & 5 deletions docs/source/filesystems.rst
@@ -4,7 +4,7 @@ FileSystems Integration for cloud storages
Supported Filesystems
---------------------

-Currenlty ``datasets`` offers an s3 filesystem implementation with :class:`datasets.filesystems.S3FileSystem`. ``S3FileSystem`` is a subclass of `s3fs.S3FileSystem <https://s3fs.readthedocs.io/en/latest/api.html>`_, which is a known implementation of ``fsspec``.
+Currently ``datasets`` offers an s3 filesystem implementation with :class:`datasets.filesystems.S3FileSystem`. ``S3FileSystem`` is a subclass of `s3fs.S3FileSystem <https://s3fs.readthedocs.io/en/latest/api.html>`_, which is a known implementation of ``fsspec``.

Furthermore ``datasets`` supports all ``fsspec`` implementations. Currently known implementations are:

@@ -24,15 +24,15 @@ Example using :class:`datasets.filesystems.S3FileSystem` within ``datasets``.

.. code-block::
->>> pip install datasets[s3]
+>>> pip install "datasets[s3]"
Listing files from a public s3 bucket.

.. code-block::
>>> import datasets
>>> s3 = datasets.filesystems.S3FileSystem(anon=True) # doctest: +SKIP
->>> s3.ls('public-datasets/imdb/train') # doctest: +SKIP
+>>> s3.ls('some-public-datasets/imdb/train') # doctest: +SKIP
['dataset_info.json.json','dataset.arrow','state.json']
Listing files from a private s3 bucket using ``aws_access_key_id`` and ``aws_secret_access_key``.
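The credentialed example that follows this sentence is collapsed in the diff; a minimal sketch (bucket name and credential values are placeholders) would look like:

>>> import datasets
>>> s3 = datasets.filesystems.S3FileSystem(key='<aws_access_key_id>', secret='<aws_secret_access_key>')  # doctest: +SKIP
>>> s3.ls('my-private-datasets/imdb/train')  # doctest: +SKIP
['dataset_info.json', 'dataset.arrow', 'state.json']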
@@ -129,8 +129,8 @@ Loading ``encoded_dataset`` from a public s3 bucket.
>>> # create S3FileSystem without credentials
>>> s3 = S3FileSystem(anon=True) # doctest: +SKIP
>>>
->>> # load encoded_dataset to from s3 bucket
->>> dataset = load_from_disk('s3://a-public-datasets/imdb/train',fs=s3) # doctest: +SKIP
+>>> # load encoded_dataset from s3 bucket
+>>> dataset = load_from_disk('s3://some-public-datasets/imdb/train',fs=s3) # doctest: +SKIP
>>>
>>> print(len(dataset))
>>> # 25000
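Rounding out the filesystems examples, a minimal sketch of saving and re-loading a dataset through a private bucket (the bucket name and credentials are placeholders; ``fs=`` is the argument shown in the diff above):

from datasets import load_dataset, load_from_disk
from datasets.filesystems import S3FileSystem

aws_access_key_id = '<your access key id>'          # placeholder
aws_secret_access_key = '<your secret access key>'  # placeholder
s3 = S3FileSystem(key=aws_access_key_id, secret=aws_secret_access_key)

dataset = load_dataset('glue', 'mrpc', split='train')
dataset.save_to_disk('s3://my-private-datasets/glue/mrpc/train', fs=s3)

reloaded = load_from_disk('s3://my-private-datasets/glue/mrpc/train', fs=s3)
print(len(reloaded))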
15 changes: 8 additions & 7 deletions docs/source/index.rst
@@ -9,22 +9,23 @@ Compatible with NumPy, Pandas, PyTorch and TensorFlow

🤗 Datasets has many interesting features (beside easy sharing and accessing datasets/metrics):

-Built-in interoperability with Numpy, Pandas, PyTorch and Tensorflow 2
-Lightweight and fast with a transparent and pythonic API
-Strive on large datasets: 🤗 Datasets naturally frees the user from RAM memory limitation, all datasets are memory-mapped on drive by default.
-Smart caching: never wait for your data to process several times
-🤗 Datasets currently provides access to ~100 NLP datasets and ~10 evaluation metrics and is designed to let the community easily add and share new datasets and evaluation metrics. You can browse the full set of datasets with the live 🤗 Datasets viewer.
+- Built-in interoperability with Numpy, Pandas, PyTorch and Tensorflow 2
+- Lightweight and fast with a transparent and pythonic API
+- Strive on large datasets: 🤗 Datasets naturally frees the user from RAM memory limitation, all datasets are memory-mapped on drive by default.
+- Smart caching: never wait for your data to process several times
+- 🤗 Datasets currently provides access to ~1,000 datasets and ~30 evaluation metrics and is designed to let the community easily add and share new datasets and evaluation metrics. You can browse the full set of datasets with the live `🤗 Datasets viewer <https://huggingface.co/datasets/viewer/>`_.

🤗 Datasets originated from a fork of the awesome TensorFlow Datasets and the HuggingFace team want to deeply thank the TensorFlow Datasets team for building this amazing library. More details on the differences between 🤗 Datasets and `tfds` can be found in the section Main differences between 🤗 Datasets and `tfds`.

Contents
---------------------------------

-The documentation is organized in five parts:
+The documentation is organized in six parts:

- **GET STARTED** contains a quick tour and the installation instructions.
- **USING DATASETS** contains general tutorials on how to use and contribute to the datasets in the library.
- **USING METRICS** contains general tutorials on how to use and contribute to the metrics in the library.
+- **ADDING NEW DATASETS/METRICS** explains how to create your own dataset or metric loading script.
- **ADVANCED GUIDES** contains more advanced guides that are more specific to a part of the library.
- **PACKAGE REFERENCE** contains the documentation of each public class and function.

@@ -79,4 +80,4 @@ The documentation is organized in five parts:
package_reference/builder_classes
package_reference/table_classes
package_reference/logging_methods
-package_reference/task_templates
\ No newline at end of file
+package_reference/task_templates

1 comment on commit 7d0bd0f

@github-actions

Automated benchmark report (new vs. old timings) for PyArrow==3.0.0 and PyArrow==latest, covering benchmark_array_xd.json, benchmark_getitem_100B.json, benchmark_indices_mapping.json, benchmark_iterating.json and benchmark_map_filter.json.
