Docs details (#2690)
* docs: ✏️ format, update numbers, add link to datasets viewer

* docs: ✏️ add a missing item in the list of documentation parts

* docs: ✏️ fix link format (rst, not md)

* docs: ✏️ update number of datasets + sample

* docs: ✏️ newline at EOF

* docs: ✏️ fix typo

* docs: ✏️ update numbers

* docs: ✏️ wording details

* docs: ✏️ fix typos, and update cli output

* docs: ✏️ typos and details

* docs: ✏️ typos

* docs: ✏️ add an empty line so that the copy/pasted code is OK

* docs: ✏️ small corrections

* docs: ✏️ typos

* docs: ✏️ add an external link

* docs: ✏️ add a code example for shard

* docs: ✏️ fix code example

The verbose option has been removed in df94a7c.

Now there is no easy way to remove the progress bar. Using the hack in #2651 (comment) would make the code snippet too complicated.

* docs: ✏️ fix copy/paste error

* docs: ✏️ fix link

* docs: ✏️ disable the progress bar (replaces verbose=False)

* docs: ✏️ code snippets format

* docs: ✏️ typos and small content edits

* docs: ✏️ details

* docs: ✏️ typos and details

* docs: ✏️ details

* Update docs/source/index.rst

Co-authored-by: Quentin Lhoest <[email protected]>

Co-authored-by: Quentin Lhoest <[email protected]>
severo and lhoestq authored Jul 27, 2021
1 parent fbe4ad9 commit 7d0bd0f
Showing 9 changed files with 118 additions and 103 deletions.
15 changes: 8 additions & 7 deletions docs/source/exploring.rst
@@ -9,11 +9,11 @@ The :class:`datasets.Dataset` object that you get when you execute for instance
>>> from datasets import load_dataset
>>> dataset = load_dataset('glue', 'mrpc', split='train')
-behaves like a normal python container. You can query its length, get rows, columns and also lot of metadata on the dataset (description, citation, split sizes, etc).
+behaves like a normal python container. You can query its length, get rows, columns and also a lot of metadata on the dataset (description, citation, split sizes, etc).

In this guide we will detail what's in this object and how to access all the information.

-An :class:`datasets.Dataset` is a python container with a length coresponding to the number of examples in the dataset. You can access a single example by its index. Let's query the first sample in the dataset:
+A :class:`datasets.Dataset` is a python container with a length corresponding to the number of examples in the dataset. You can access a single example by its index. Let's query the first sample in the dataset:

.. code-block::
@@ -76,9 +76,9 @@ More details on the ``features`` can be found in the guide on :doc:`features` an
Metadata
------------------------------------------------------

-The :class:`datasets.Dataset` object also host many important metadata on the dataset which are all stored in ``dataset.info``. Many of these metadata are also accessible on the lower level, i.e. directly as attributes of the Dataset for shorter access (e.g. ``dataset.info.features`` is also available as ``dataset.features``).
+The :class:`datasets.Dataset` object also hosts many important metadata on the dataset which are all stored in ``dataset.info``. Many of these metadata are also accessible on the lower level, i.e. directly as attributes of the Dataset for shorter access (e.g. ``dataset.info.features`` is also available as ``dataset.features``).

-All these attributes are listed in the package reference on :class:`datasets.DatasetInfo`. The most important metadata are ``split``, ``description``, ``citation``, ``homepage`` (and ``licence`` when this one is available).
+All these attributes are listed in the package reference on :class:`datasets.DatasetInfo`. The most important metadata are ``split``, ``description``, ``citation``, ``homepage`` (and ``license`` when this one is available).

.. code-block::
@@ -168,7 +168,7 @@ You can also get a full column by querying its name as a string. This will retur
As you can see depending on the object queried (single row, batch of rows or column), the returned object is different:

- a single row like ``dataset[0]`` will be returned as a python dictionary of values,
-- a batch like ``dataset[5:10]``) will be returned as a python dictionary of lists of values,
+- a batch like ``dataset[5:10]`` will be returned as a python dictionary of lists of values,
- a column like ``dataset['sentence1']`` will be returned as a python lists of values.

This may seems surprising at first but in our experiments it's actually easier to use these various format for data processing than returning the same format for each of these views on the dataset.
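As an aside (not part of the commit), here is a minimal sketch of the three access patterns listed above, reusing the ``glue``/``mrpc`` dataset from the surrounding examples:

from datasets import load_dataset

dataset = load_dataset('glue', 'mrpc', split='train')

row = dataset[0]               # a python dict of values for one example
batch = dataset[5:10]          # a python dict of lists of values
column = dataset['sentence1']  # a python list of values, one per example

print(type(row), type(batch['sentence1']), type(column))
# <class 'dict'> <class 'list'> <class 'list'>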
@@ -201,12 +201,12 @@ A specific format can be activated with :func:`datasets.Dataset.set_format`.
- :obj:`type` (``Union[None, str]``, default to ``None``) defines the return type for the dataset :obj:`__getitem__` method and is one of ``[None, 'numpy', 'pandas', 'torch', 'tensorflow', 'jax']`` (``None`` means return python objects),
- :obj:`columns` (``Union[None, str, List[str]]``, default to ``None``) defines the columns returned by :obj:`__getitem__` and takes the name of a column in the dataset or a list of columns to return (``None`` means return all columns),
- :obj:`output_all_columns` (``bool``, default to ``False``) controls whether the columns which cannot be formatted (e.g. a column with ``string`` cannot be cast in a PyTorch Tensor) are still outputted as python objects.
-- :obj:`format_kwargs` can be used to provide additional keywords arguments that will be forwarded to the convertiong function like ``np.array``, ``torch.tensor``, ``tensorflow.ragged.constant`` or ``jnp.array``. For instance, to create ``torch.Tensor`` directly on the GPU you can specify ``device='cuda'``.
+- :obj:`format_kwargs` can be used to provide additional keywords arguments that will be forwarded to the converting function like ``np.array``, ``torch.tensor``, ``tensorflow.ragged.constant`` or ``jnp.array``. For instance, to create ``torch.Tensor`` directly on the GPU you can specify ``device='cuda'``.

.. note::

The format is only applied to a single row or batches of rows (i.e. when querying :obj:`dataset[0]` or :obj:`dataset[10:20]`). Querying a column (e.g. :obj:`dataset['sentence1']`) will return the column even if it's filtered by the format. In this case the un-formatted column is returned.
-    This design choice was made because it's quite rare to use column-only access when working with deep-learning frameworks and it's quite usefull to be able to access column even when they are masked by the format.
+    This design choice was made because it's quite rare to use column-only access when working with deep-learning frameworks and it's quite useful to be able to access column even when they are masked by the format.

Here is an example:

@@ -239,6 +239,7 @@ Here is an example to tokenize and pad tokens on-the-fly when accessing the samp
>>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
>>> def encode(batch):
>>> return tokenizer(batch["sentence1"], padding="longest", truncation=True, max_length=512, return_tensors="pt")
+>>>
>>> dataset.set_transform(encode)
>>> dataset.format
{'type': 'custom', 'format_kwargs': {'transform': <function __main__.encode(batch)>}, 'columns': ['idx', 'label', 'sentence1', 'sentence2'], 'output_all_columns': False}
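To make the ``set_format`` arguments discussed in the hunks above concrete, here is a minimal sketch (not part of the commit; it assumes PyTorch is installed and reuses the ``glue``/``mrpc`` dataset):

from datasets import load_dataset

dataset = load_dataset('glue', 'mrpc', split='train')

# Return 'label' and 'idx' as torch tensors, but still output the string
# columns (which cannot be cast to tensors) as python objects.
dataset.set_format(type='torch', columns=['label', 'idx'], output_all_columns=True)

print(dataset[0]['label'])      # a torch.Tensor
print(dataset[0]['sentence1'])  # a plain python string

# Extra format_kwargs are forwarded to the converting function, e.g.:
# dataset.set_format(type='torch', columns=['label', 'idx'], device='cuda')

# Clear the format again.
dataset.reset_format()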
4 changes: 2 additions & 2 deletions docs/source/faiss_and_ea.rst
@@ -3,7 +3,7 @@ Adding a FAISS or Elastic Search index to a Dataset

It is possible to do document retrieval in a dataset.

-For example, one way to do Open Domain Question Answering, one way to do that is to first retrieve documents that may be relevant to answer a question, and then we can use a model to generate an answer given the retrieved documents.
+For example, one way to do Open Domain Question Answering is to first retrieve documents that may be relevant to answer a question, and then we can use a model to generate an answer given the retrieved documents.

FAISS is a library for dense retrieval. It means that it retrieves documents based on their vector representations, by doing a nearest neighbors search.
As we now have models that can generate good semantic vector representations of documents, this has become an interesting tool for document retrieval.
@@ -29,7 +29,7 @@ Adding a FAISS index

The :func:`datasets.Dataset.add_faiss_index` method is in charge of building, training and adding vectors to a FAISS index.

-One way to get good vector representations for text passages is to use the DPR model. We'll compute the representations of only 100 examples just to give you the idea of how it works.
+One way to get good vector representations for text passages is to use the `DPR model <https://huggingface.co/transformers/model_doc/dpr.html>`_. We'll compute the representations of only 100 examples just to give you the idea of how it works.

.. code-block::
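Since the DPR code block referenced above is collapsed in this view, here is a minimal sketch of the pattern the guide describes (assumptions: ``transformers``, ``torch`` and ``faiss`` are installed; the ``crime_and_punish`` dataset and its ``line`` column are used purely for illustration):

import torch
from datasets import load_dataset
from transformers import (DPRContextEncoder, DPRContextEncoderTokenizer,
                          DPRQuestionEncoder, DPRQuestionEncoderTokenizer)

ctx_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

# Compute vector representations for only 100 examples to keep it quick.
ds = load_dataset("crime_and_punish", split="train[:100]")
with torch.no_grad():
    ds_with_embeddings = ds.map(
        lambda example: {
            "embeddings": ctx_encoder(**ctx_tokenizer(example["line"], return_tensors="pt"))[0][0].numpy()
        }
    )

# Build, train and add the vectors to a FAISS index on the 'embeddings' column.
ds_with_embeddings.add_faiss_index(column="embeddings")

# Retrieve the 10 nearest passages for an embedded question.
q_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
with torch.no_grad():
    question_embedding = q_encoder(**q_tokenizer("Is it serious?", return_tensors="pt"))[0][0].numpy()
scores, retrieved_examples = ds_with_embeddings.get_nearest_examples("embeddings", question_embedding, k=10)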
10 changes: 5 additions & 5 deletions docs/source/filesystems.rst
@@ -4,7 +4,7 @@ FileSystems Integration for cloud storages
Supported Filesystems
---------------------

-Currenlty ``datasets`` offers an s3 filesystem implementation with :class:`datasets.filesystems.S3FileSystem`. ``S3FileSystem`` is a subclass of `s3fs.S3FileSystem <https://s3fs.readthedocs.io/en/latest/api.html>`_, which is a known implementation of ``fsspec``.
+Currently ``datasets`` offers an s3 filesystem implementation with :class:`datasets.filesystems.S3FileSystem`. ``S3FileSystem`` is a subclass of `s3fs.S3FileSystem <https://s3fs.readthedocs.io/en/latest/api.html>`_, which is a known implementation of ``fsspec``.

Furthermore ``datasets`` supports all ``fsspec`` implementations. Currently known implementations are:

@@ -24,15 +24,15 @@ Example using :class:`datasets.filesystems.S3FileSystem` within ``datasets``.

.. code-block::
->>> pip install datasets[s3]
+>>> pip install "datasets[s3]"
Listing files from a public s3 bucket.

.. code-block::
>>> import datasets
>>> s3 = datasets.filesystems.S3FileSystem(anon=True) # doctest: +SKIP
->>> s3.ls('public-datasets/imdb/train') # doctest: +SKIP
+>>> s3.ls('some-public-datasets/imdb/train') # doctest: +SKIP
['dataset_info.json.json','dataset.arrow','state.json']
Listing files from a private s3 bucket using ``aws_access_key_id`` and ``aws_secret_access_key``.
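The credentialed example that follows this sentence is collapsed in the diff; a minimal sketch (bucket name and credential values are placeholders) would look like:

>>> import datasets
>>> s3 = datasets.filesystems.S3FileSystem(key='<aws_access_key_id>', secret='<aws_secret_access_key>')  # doctest: +SKIP
>>> s3.ls('my-private-datasets/imdb/train')  # doctest: +SKIP
['dataset_info.json', 'dataset.arrow', 'state.json']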
@@ -129,8 +129,8 @@ Loading ``encoded_dataset`` from a public s3 bucket.
>>> # create S3FileSystem without credentials
>>> s3 = S3FileSystem(anon=True) # doctest: +SKIP
>>>
->>> # load encoded_dataset to from s3 bucket
->>> dataset = load_from_disk('s3://a-public-datasets/imdb/train',fs=s3) # doctest: +SKIP
+>>> # load encoded_dataset from s3 bucket
+>>> dataset = load_from_disk('s3://some-public-datasets/imdb/train',fs=s3) # doctest: +SKIP
>>>
>>> print(len(dataset))
>>> # 25000
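Rounding out the filesystems examples, a minimal sketch of saving and re-loading a dataset through a private bucket (the bucket name and credentials are placeholders; ``fs=`` is the argument shown in the diff above):

from datasets import load_dataset, load_from_disk
from datasets.filesystems import S3FileSystem

aws_access_key_id = '<your access key id>'          # placeholder
aws_secret_access_key = '<your secret access key>'  # placeholder
s3 = S3FileSystem(key=aws_access_key_id, secret=aws_secret_access_key)

dataset = load_dataset('glue', 'mrpc', split='train')
dataset.save_to_disk('s3://my-private-datasets/glue/mrpc/train', fs=s3)

reloaded = load_from_disk('s3://my-private-datasets/glue/mrpc/train', fs=s3)
print(len(reloaded))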
15 changes: 8 additions & 7 deletions docs/source/index.rst
@@ -9,22 +9,23 @@ Compatible with NumPy, Pandas, PyTorch and TensorFlow

🤗 Datasets has many interesting features (beside easy sharing and accessing datasets/metrics):

-Built-in interoperability with Numpy, Pandas, PyTorch and Tensorflow 2
-Lightweight and fast with a transparent and pythonic API
-Strive on large datasets: 🤗 Datasets naturally frees the user from RAM memory limitation, all datasets are memory-mapped on drive by default.
-Smart caching: never wait for your data to process several times
-🤗 Datasets currently provides access to ~100 NLP datasets and ~10 evaluation metrics and is designed to let the community easily add and share new datasets and evaluation metrics. You can browse the full set of datasets with the live 🤗 Datasets viewer.
+- Built-in interoperability with Numpy, Pandas, PyTorch and Tensorflow 2
+- Lightweight and fast with a transparent and pythonic API
+- Strive on large datasets: 🤗 Datasets naturally frees the user from RAM memory limitation, all datasets are memory-mapped on drive by default.
+- Smart caching: never wait for your data to process several times
+- 🤗 Datasets currently provides access to ~1,000 datasets and ~30 evaluation metrics and is designed to let the community easily add and share new datasets and evaluation metrics. You can browse the full set of datasets with the live `🤗 Datasets viewer <https://huggingface.co/datasets/viewer/>`_.

🤗 Datasets originated from a fork of the awesome TensorFlow Datasets and the HuggingFace team want to deeply thank the TensorFlow Datasets team for building this amazing library. More details on the differences between 🤗 Datasets and `tfds` can be found in the section Main differences between 🤗 Datasets and `tfds`.

Contents
---------------------------------

-The documentation is organized in five parts:
+The documentation is organized in six parts:

- **GET STARTED** contains a quick tour and the installation instructions.
- **USING DATASETS** contains general tutorials on how to use and contribute to the datasets in the library.
- **USING METRICS** contains general tutorials on how to use and contribute to the metrics in the library.
+- **ADDING NEW DATASETS/METRICS** explains how to create your own dataset or metric loading script.
- **ADVANCED GUIDES** contains more advanced guides that are more specific to a part of the library.
- **PACKAGE REFERENCE** contains the documentation of each public class and function.

@@ -79,4 +80,4 @@ The documentation is organized in five parts:
package_reference/builder_classes
package_reference/table_classes
package_reference/logging_methods
-package_reference/task_templates
\ No newline at end of file
+package_reference/task_templates

1 comment on commit 7d0bd0f

@github-actions

Automated benchmark report (new vs. old timings) for PyArrow==3.0.0 and PyArrow==latest, covering benchmark_array_xd.json, benchmark_getitem_100B.json, benchmark_indices_mapping.json, benchmark_iterating.json and benchmark_map_filter.json.
