Docs details #2690

Merged: 26 commits, merged on Jul 27, 2021
Commits
8fa45e6
docs: ✏️ format, update numbers, add link to datasets viewer
severo Jul 21, 2021
68f3526
docs: ✏️ add a missing item in the list of documentation parts
severo Jul 21, 2021
71972e1
docs: ✏️ fix link format (rst, not md)
severo Jul 21, 2021
a81e5bf
docs: ✏️ update number of datasets + sample
severo Jul 21, 2021
8e7bfb6
docs: ✏️ newline at EOF
severo Jul 21, 2021
faed653
docs: ✏️ fix typo
severo Jul 21, 2021
e43376c
docs: ✏️ update numbers
severo Jul 21, 2021
925fbbc
docs: ✏️ redaction details
severo Jul 21, 2021
13cffc6
docs: ✏️ fix typos, and update cli output
severo Jul 21, 2021
5d14d2f
docs: ✏️ typos and details
severo Jul 22, 2021
6bd0b16
docs: ✏️ typos
severo Jul 22, 2021
4dc7a76
docs: ✏️ add an empty line so that the copy/pasted code is OK
severo Jul 22, 2021
75f43bb
docs: ✏️ small corrections
severo Jul 22, 2021
1b93513
docs: ✏️ typos
severo Jul 23, 2021
d199e40
docs: ✏️ add an external link
severo Jul 23, 2021
9857a98
docs: ✏️ add a code example for shard
severo Jul 23, 2021
9090033
docs: ✏️ fix code example
severo Jul 23, 2021
f996924
docs: ✏️ fix copy/paste error
severo Jul 23, 2021
3517b62
docs: ✏️ fix link
severo Jul 23, 2021
ff0c4b0
docs: ✏️ disable the progress bar (replaces verbose=False)
severo Jul 23, 2021
329b0a2
docs: ✏️ code snippets format
severo Jul 23, 2021
721af09
docs: ✏️ typos and small content edits
severo Jul 23, 2021
703d275
docs: ✏️ details
severo Jul 23, 2021
e08f5f7
docs: ✏️ typos and details
severo Jul 23, 2021
921f946
docs: ✏️ details
severo Jul 23, 2021
a8ced7c
Update docs/source/index.rst
severo Jul 26, 2021
15 changes: 8 additions & 7 deletions docs/source/exploring.rst
@@ -9,11 +9,11 @@ The :class:`datasets.Dataset` object that you get when you execute for instance
>>> from datasets import load_dataset
>>> dataset = load_dataset('glue', 'mrpc', split='train')

behaves like a normal python container. You can query its length, get rows, columns and also lot of metadata on the dataset (description, citation, split sizes, etc).
behaves like a normal python container. You can query its length, get rows, columns and also a lot of metadata on the dataset (description, citation, split sizes, etc).

In this guide we will detail what's in this object and how to access all the information.

An :class:`datasets.Dataset` is a python container with a length coresponding to the number of examples in the dataset. You can access a single example by its index. Let's query the first sample in the dataset:
A :class:`datasets.Dataset` is a python container with a length corresponding to the number of examples in the dataset. You can access a single example by its index. Let's query the first sample in the dataset:

.. code-block::

@@ -76,9 +76,9 @@ More details on the ``features`` can be found in the guide on :doc:`features` an
Metadata
------------------------------------------------------

The :class:`datasets.Dataset` object also host many important metadata on the dataset which are all stored in ``dataset.info``. Many of these metadata are also accessible on the lower level, i.e. directly as attributes of the Dataset for shorter access (e.g. ``dataset.info.features`` is also available as ``dataset.features``).
The :class:`datasets.Dataset` object also hosts many important metadata on the dataset which are all stored in ``dataset.info``. Many of these metadata are also accessible on the lower level, i.e. directly as attributes of the Dataset for shorter access (e.g. ``dataset.info.features`` is also available as ``dataset.features``).

All these attributes are listed in the package reference on :class:`datasets.DatasetInfo`. The most important metadata are ``split``, ``description``, ``citation``, ``homepage`` (and ``licence`` when this one is available).
All these attributes are listed in the package reference on :class:`datasets.DatasetInfo`. The most important metadata are ``split``, ``description``, ``citation``, ``homepage`` (and ``license`` when this one is available).

.. code-block::

@@ -168,7 +168,7 @@ You can also get a full column by querying its name as a string. This will retur
As you can see depending on the object queried (single row, batch of rows or column), the returned object is different:

- a single row like ``dataset[0]`` will be returned as a python dictionary of values,
- a batch like ``dataset[5:10]``) will be returned as a python dictionary of lists of values,
- a batch like ``dataset[5:10]`` will be returned as a python dictionary of lists of values,
- a column like ``dataset['sentence1']`` will be returned as a python lists of values.

This may seems surprising at first but in our experiments it's actually easier to use these various format for data processing than returning the same format for each of these views on the dataset.
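As a quick illustration, here is a minimal sketch (reusing the ``glue``/``mrpc`` dataset loaded above) of the three access patterns and the python types they return:

.. code-block::

>>> row = dataset[0]               # single row: a python dict of values
>>> batch = dataset[5:10]          # batch of rows: a python dict of lists of values
>>> column = dataset['sentence1']  # column: a python list of values
>>> type(row), type(batch), type(column)
(<class 'dict'>, <class 'dict'>, <class 'list'>)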
@@ -201,12 +201,12 @@ A specific format can be activated with :func:`datasets.Dataset.set_format`.
- :obj:`type` (``Union[None, str]``, default to ``None``) defines the return type for the dataset :obj:`__getitem__` method and is one of ``[None, 'numpy', 'pandas', 'torch', 'tensorflow', 'jax']`` (``None`` means return python objects),
- :obj:`columns` (``Union[None, str, List[str]]``, default to ``None``) defines the columns returned by :obj:`__getitem__` and takes the name of a column in the dataset or a list of columns to return (``None`` means return all columns),
- :obj:`output_all_columns` (``bool``, default to ``False``) controls whether the columns which cannot be formatted (e.g. a column with ``string`` cannot be cast in a PyTorch Tensor) are still outputted as python objects.
- :obj:`format_kwargs` can be used to provide additional keywords arguments that will be forwarded to the convertiong function like ``np.array``, ``torch.tensor``, ``tensorflow.ragged.constant`` or ``jnp.array``. For instance, to create ``torch.Tensor`` directly on the GPU you can specify ``device='cuda'``.
- :obj:`format_kwargs` can be used to provide additional keywords arguments that will be forwarded to the converting function like ``np.array``, ``torch.tensor``, ``tensorflow.ragged.constant`` or ``jnp.array``. For instance, to create ``torch.Tensor`` directly on the GPU you can specify ``device='cuda'``.
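A minimal sketch of these parameters in action (reusing the ``glue``/``mrpc`` dataset loaded above; the column names come from that dataset):

.. code-block::

>>> # return only 'idx' and 'label', converted to PyTorch tensors
>>> dataset.set_format(type='torch', columns=['idx', 'label'])
>>> dataset[0]  # doctest: +SKIP
>>> # go back to the default: python objects, all columns
>>> dataset.reset_format()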

.. note::

The format is only applied to a single row or batches of rows (i.e. when querying :obj:`dataset[0]` or :obj:`dataset[10:20]`). Querying a column (e.g. :obj:`dataset['sentence1']`) will return the column even if it's filtered by the format. In this case the un-formatted column is returned.
This design choice was made because it's quite rare to use column-only access when working with deep-learning frameworks and it's quite usefull to be able to access column even when they are masked by the format.
This design choice was made because it's quite rare to use column-only access when working with deep-learning frameworks and it's quite useful to be able to access column even when they are masked by the format.

Here is an example:

@@ -239,6 +239,7 @@ Here is an example to tokenize and pad tokens on-the-fly when accessing the samp
>>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
>>> def encode(batch):
>>> return tokenizer(batch["sentence1"], padding="longest", truncation=True, max_length=512, return_tensors="pt")
>>>
>>> dataset.set_transform(encode)
>>> dataset.format
{'type': 'custom', 'format_kwargs': {'transform': <function __main__.encode(batch)>}, 'columns': ['idx', 'label', 'sentence1', 'sentence2'], 'output_all_columns': False}
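>>> # usage sketch: the transform runs on the fly when rows are accessed, so this
>>> # returns the tokenizer's output (PyTorch tensors) for the first two examples
>>> dataset[:2]["input_ids"]  # doctest: +SKIP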
4 changes: 2 additions & 2 deletions docs/source/faiss_and_ea.rst
@@ -3,7 +3,7 @@ Adding a FAISS or Elastic Search index to a Dataset

It is possible to do document retrieval in a dataset.

For example, one way to do Open Domain Question Answering, one way to do that is to first retrieve documents that may be relevant to answer a question, and then we can use a model to generate an answer given the retrieved documents.
For example, one way to do Open Domain Question Answering is to first retrieve documents that may be relevant to answer a question, and then we can use a model to generate an answer given the retrieved documents.

FAISS is a library for dense retrieval. It means that it retrieves documents based on their vector representations, by doing a nearest neighbors search.
As we now have models that can generate good semantic vector representations of documents, this has become an interesting tool for document retrieval.
@@ -29,7 +29,7 @@ Adding a FAISS index

The :func:`datasets.Dataset.add_faiss_index` method is in charge of building, training and adding vectors to a FAISS index.
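As a rough sketch of the API (assuming a dataset that already holds an ``embeddings`` column of float32 vectors; the variable name, column name and query vector below are placeholders), adding an index and querying it looks like:

.. code-block::

>>> import numpy as np
>>> dataset_with_embeddings.add_faiss_index(column='embeddings')  # doctest: +SKIP
>>> query = np.random.rand(768).astype('float32')  # stand-in for a real query embedding
>>> scores, retrieved_examples = dataset_with_embeddings.get_nearest_examples('embeddings', query, k=5)  # doctest: +SKIP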

One way to get good vector representations for text passages is to use the DPR model. We'll compute the representations of only 100 examples just to give you the idea of how it works.
One way to get good vector representations for text passages is to use the `DPR model <https://huggingface.co/transformers/model_doc/dpr.html>`_. We'll compute the representations of only 100 examples just to give you the idea of how it works.

.. code-block::

10 changes: 5 additions & 5 deletions docs/source/filesystems.rst
@@ -4,7 +4,7 @@ FileSystems Integration for cloud storages
Supported Filesystems
---------------------

Currenlty ``datasets`` offers an s3 filesystem implementation with :class:`datasets.filesystems.S3FileSystem`. ``S3FileSystem`` is a subclass of `s3fs.S3FileSystem <https://s3fs.readthedocs.io/en/latest/api.html>`_, which is a known implementation of ``fsspec``.
Currently ``datasets`` offers an s3 filesystem implementation with :class:`datasets.filesystems.S3FileSystem`. ``S3FileSystem`` is a subclass of `s3fs.S3FileSystem <https://s3fs.readthedocs.io/en/latest/api.html>`_, which is a known implementation of ``fsspec``.

Furthermore ``datasets`` supports all ``fsspec`` implementations. Currently known implementations are:

Expand All @@ -24,15 +24,15 @@ Example using :class:`datasets.filesystems.S3FileSystem` within ``datasets``.

.. code-block::

>>> pip install datasets[s3]
>>> pip install "datasets[s3]"

Listing files from a public s3 bucket.

.. code-block::

>>> import datasets
>>> s3 = datasets.filesystems.S3FileSystem(anon=True) # doctest: +SKIP
>>> s3.ls('public-datasets/imdb/train') # doctest: +SKIP
>>> s3.ls('some-public-datasets/imdb/train') # doctest: +SKIP
['dataset_info.json.json','dataset.arrow','state.json']

Listing files from a private s3 bucket using ``aws_access_key_id`` and ``aws_secret_access_key``.
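A minimal sketch (the bucket name is a placeholder; ``key`` and ``secret`` are the credential arguments inherited from ``s3fs``):

.. code-block::

>>> import datasets
>>> s3 = datasets.filesystems.S3FileSystem(key=aws_access_key_id, secret=aws_secret_access_key)  # doctest: +SKIP
>>> s3.ls('my-private-datasets/imdb/train')  # doctest: +SKIP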
@@ -129,8 +129,8 @@ Loading ``encoded_dataset`` from a public s3 bucket.
>>> # create S3FileSystem without credentials
>>> s3 = S3FileSystem(anon=True) # doctest: +SKIP
>>>
>>> # load encoded_dataset to from s3 bucket
>>> dataset = load_from_disk('s3://a-public-datasets/imdb/train',fs=s3) # doctest: +SKIP
>>> # load encoded_dataset from s3 bucket
>>> dataset = load_from_disk('s3://some-public-datasets/imdb/train',fs=s3) # doctest: +SKIP
>>>
>>> print(len(dataset))
>>> # 25000
15 changes: 8 additions & 7 deletions docs/source/index.rst
@@ -9,22 +9,23 @@ Compatible with NumPy, Pandas, PyTorch and TensorFlow

🤗 Datasets has many interesting features (beside easy sharing and accessing datasets/metrics):

Built-in interoperability with Numpy, Pandas, PyTorch and Tensorflow 2
Lightweight and fast with a transparent and pythonic API
Strive on large datasets: 🤗 Datasets naturally frees the user from RAM memory limitation, all datasets are memory-mapped on drive by default.
Smart caching: never wait for your data to process several times
🤗 Datasets currently provides access to ~100 NLP datasets and ~10 evaluation metrics and is designed to let the community easily add and share new datasets and evaluation metrics. You can browse the full set of datasets with the live 🤗 Datasets viewer.
- Built-in interoperability with Numpy, Pandas, PyTorch and Tensorflow 2
- Lightweight and fast with a transparent and pythonic API
- Strive on large datasets: 🤗 Datasets naturally frees the user from RAM memory limitation, all datasets are memory-mapped on drive by default.
- Smart caching: never wait for your data to process several times
- 🤗 Datasets currently provides access to ~1,000 datasets and ~30 evaluation metrics and is designed to let the community easily add and share new datasets and evaluation metrics. You can browse the full set of datasets with the live `🤗 Datasets viewer <https://huggingface.co/datasets/viewer/>`.

🤗 Datasets originated from a fork of the awesome TensorFlow Datasets and the HuggingFace team want to deeply thank the TensorFlow Datasets team for building this amazing library. More details on the differences between 🤗 Datasets and `tfds` can be found in the section Main differences between 🤗 Datasets and `tfds`.

Contents
---------------------------------

The documentation is organized in five parts:
The documentation is organized in six parts:

- **GET STARTED** contains a quick tour and the installation instructions.
- **USING DATASETS** contains general tutorials on how to use and contribute to the datasets in the library.
- **USING METRICS** contains general tutorials on how to use and contribute to the metrics in the library.
- **ADDING NEW DATASETS/METRICS** explains how to create your own dataset or metric loading script.
- **ADVANCED GUIDES** contains more advanced guides that are more specific to a part of the library.
- **PACKAGE REFERENCE** contains the documentation of each public class and function.

@@ -79,4 +80,4 @@ The documentation is organized in five parts:
package_reference/builder_classes
package_reference/table_classes
package_reference/logging_methods
package_reference/task_templates
package_reference/task_templates