Docs details #2690

severo · 2021-07-21T10:43:14Z

Some comments here:

the code samples assume the expected libraries have already been installed. Maybe add a section at start, or add it to every code sample. Something like pip install datasets transformers torch 'datasets[streaming]' (maybe just link to https://huggingface.co/docs/datasets/installation.html + a one-liner that installs all the requirements / alternatively a requirements.txt file)
"If you’d like to play with the examples, you must install it from source." in https://huggingface.co/docs/datasets/installation.html: it's not clear to me what this means (what are these "examples"?)
in https://huggingface.co/docs/datasets/loading_datasets.html: "or AWS bucket if it’s not already stored in the library". It's the only place in the doc (aside from the docstring https://huggingface.co/docs/datasets/package_reference/loading_methods.html?highlight=aws bucket#datasets.list_datasets) where the "AWS bucket" is mentioned. It's not easy to understand what this means. Maybe explain more, and link to https://s3.amazonaws.com/datasets.huggingface.co and/or https://huggingface.co/docs/datasets/filesystems.html.
example in https://huggingface.co/docs/datasets/loading_datasets.html#manually-downloading-files is obsoleted by Enable auto-download for PAN-X / Wikiann domain in XTREME #2326. Also: see xtreme / pan-x cannot be downloaded #2691 for a bug on this specific dataset.
in https://huggingface.co/docs/datasets/loading_datasets.html#manually-downloading-files the doc says "After you’ve downloaded the files, you can point to the folder hosting them locally with the data_dir argument as follows:", but the following example does not show how to use data_dir
in https://huggingface.co/docs/datasets/loading_datasets.html#csv-files, it would be nice to have a URL to the csv loader reference (but I'm not sure there is one in the API reference). This comment applies in many places in the doc: I would want the API reference to contain doc for all the code/functions/classes... and I would want a lot more links inside the doc pointing to the API entries.
in the API reference (docstrings) I would prefer "SOURCE" to link to github instead of a copy of the code inside the docs site (eg. https://github.com/huggingface/datasets/blob/master/src/datasets/load.py#L711 instead of https://huggingface.co/docs/datasets/_modules/datasets/load.html#load_dataset)
it seems like not all the API is exposed in the doc. For example, there is no doc for disable_progress_bar, see https://huggingface.co/docs/datasets/search.html?q=disable_progress_bar, even if the code contains docstrings. Does it mean that the function is not officially supported? (otherwise, maybe it also deserves a mention in https://huggingface.co/docs/datasets/package_reference/logging_methods.html)
in https://huggingface.co/docs/datasets/loading_datasets.html?highlight=most%20efficient%20format%20have%20json%20files%20consisting%20multiple%20json%20objects#json-files, "The most efficient format is to have JSON files consisting of multiple JSON objects, one per line, representing individual data rows:", maybe link to https://en.wikipedia.org/wiki/JSON_streaming#Line-delimited_JSON and give it a name ("line-delimited JSON"? "JSON Lines" as in https://huggingface.co/docs/datasets/processing.html#exporting-a-dataset-to-csv-json-parquet-or-to-python-objects ?)
in https://huggingface.co/docs/datasets/loading_datasets.html, for the local files sections, it would be nice to provide sample csv / json / text files to download, so that it's easier for the reader to try to load them (instead: they won't try)
the doc explains how to shard a dataset, but does not explain why and when a dataset should be sharded (I have no idea... for parallelizing?). Nor does it give an idea of how many shards a dataset should typically have, and why.
the code example in https://huggingface.co/docs/datasets/processing.html#mapping-in-a-distributed-setting does not work, because training_args has not been defined before in the doc.


lhoestq · 2021-07-26T14:00:57Z

Thanks for all the comments and for the corrections in the docs!

About all the points you mentioned:

the code samples assume the expected libraries have already been installed. Maybe add a section at start, or add it to every code sample. Something like pip install datasets transformers torch 'datasets[streaming]' (maybe just link to https://huggingface.co/docs/datasets/installation.html + a one-liner that installs all the requirements / alternatively a requirements.txt file)

Yes good idea

"If you’d like to play with the examples, you must install it from source." in https://huggingface.co/docs/datasets/installation.html: it's not clear to me what this means (what are these "examples"?)

It refers to example scripts inside the git repository of the library; see the examples folder in the transformers repo.
Unlike transformers, the datasets git repo doesn't have examples yet, so the sentence currently points at nothing. Maybe we can just remove it from the docs for now.

in https://huggingface.co/docs/datasets/loading_datasets.html: "or AWS bucket if it’s not already stored in the library". It's the only place in the doc (aside from the docstring https://huggingface.co/docs/datasets/package_reference/loading_methods.html?highlight=aws bucket#datasets.list_datasets) where the "AWS bucket" is mentioned. It's not easy to understand what this means. Maybe explain more, and link to https://s3.amazonaws.com/datasets.huggingface.co and/or https://huggingface.co/docs/datasets/filesystems.html.

This is outdated and must be replaced by

or from the Hugging Face Hub if it’s not already stored in the library

example in https://huggingface.co/docs/datasets/loading_datasets.html#manually-downloading-files is obsoleted by Enable auto-download for PAN-X / Wikiann domain in XTREME #2326. Also: see xtreme / pan-x cannot be downloaded #2691 for a bug on this specific dataset.

We can replace the XTREME PAN-X dataset with matinf, for example.

in https://huggingface.co/docs/datasets/loading_datasets.html#manually-downloading-files the doc says "After you’ve downloaded the files, you can point to the folder hosting them locally with the data_dir argument as follows:", but the following example does not show how to use data_dir

Let's add data_dir="path/to/your/downloaded/data" for example

in https://huggingface.co/docs/datasets/loading_datasets.html#csv-files, it would be nice to have a URL to the csv loader reference (but I'm not sure there is one in the API reference). This comment applies in many places in the doc: I would want the API reference to contain doc for all the code/functions/classes... and I would want a lot more links inside the doc pointing to the API entries.

Currently there's no documentation for the CSV loader config. Maybe we can add the docstrings to the CsvConfig class to explain the parameters and how it works, and then redirect to the doc of this class in this section of the documentation.

in the API reference (docstrings) I would prefer "SOURCE" to link to github instead of a copy of the code inside the docs site (eg. https://github.com/huggingface/datasets/blob/master/src/datasets/load.py#L711 instead of https://huggingface.co/docs/datasets/_modules/datasets/load.html#load_dataset)

This is the same as in transformers; I'm not sure it's a big issue.

it seems like not all the API is exposed in the doc. For example, there is no doc for disable_progress_bar, see https://huggingface.co/docs/datasets/search.html?q=disable_progress_bar, even if the code contains docstrings. Does it mean that the function is not officially supported? (otherwise, maybe it also deserves a mention in https://huggingface.co/docs/datasets/package_reference/logging_methods.html)

The function disable_progress_bar should definitely be in the docs, thanks. We can add it to the logging methods

in https://huggingface.co/docs/datasets/loading_datasets.html?highlight=most%20efficient%20format%20have%20json%20files%20consisting%20multiple%20json%20objects#json-files, "The most efficient format is to have JSON files consisting of multiple JSON objects, one per line, representing individual data rows:", maybe link to https://en.wikipedia.org/wiki/JSON_streaming#Line-delimited_JSON and give it a name ("line-delimited JSON"? "JSON Lines" as in https://huggingface.co/docs/datasets/processing.html#exporting-a-dataset-to-csv-json-parquet-or-to-python-objects ?)

Yes, good idea!
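To make the format concrete, here is a minimal sketch of "JSON Lines" (one JSON object per line, each line being one data row) using only the Python standard library. The helper names `write_jsonl` and `read_jsonl`, the file name, and the sample rows are made up for illustration; they are not part of the datasets API.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sample rows; each dict becomes one line of the file.
rows = [
    {"text": "first example", "label": 0},
    {"text": "second example", "label": 1},
]


def write_jsonl(path, records):
    """Write one JSON object per line (the JSON Lines layout)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Parse each non-empty line as an independent JSON object."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


sample_path = Path(tempfile.mkdtemp()) / "sample.jsonl"
write_jsonl(sample_path, rows)
loaded = read_jsonl(sample_path)
```

Because every line is a complete JSON document, a reader can process the file row by row without parsing it as a whole.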

in https://huggingface.co/docs/datasets/loading_datasets.html, for the local files sections, it would be nice to provide sample csv / json / text files to download, so that it's easier for the reader to try to load them (instead: they won't try)

Sure, why not. Moreover, the csv loader now supports remote files, so you could just run the code and pass a URL to the sample csv file.
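Until sample files are published, a reader could generate a small one locally. The sketch below uses only the standard library; the column names, values, and file name are invented for illustration. Assuming the datasets library is installed, the resulting path could then be passed to `load_dataset("csv", data_files=...)`, which is not run here.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical sample rows; csv reads every value back as a string.
sample_rows = [
    {"id": "1", "text": "hello"},
    {"id": "2", "text": "world"},
]

csv_path = Path(tempfile.mkdtemp()) / "sample.csv"

# Write a header line followed by one line per row.
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "text"])
    writer.writeheader()
    writer.writerows(sample_rows)

# Read the file back to check that it is well-formed.
with open(csv_path, newline="", encoding="utf-8") as f:
    read_back = list(csv.DictReader(f))

# With the datasets library installed, the file could be loaded with:
#   from datasets import load_dataset
#   dataset = load_dataset("csv", data_files=str(csv_path))
```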

the doc explains how to shard a dataset, but does not explain why and when a dataset should be sharded (I have no idea... for parallelizing?). Nor does it give an idea of how many shards a dataset should typically have, and why.

This can be used for distributed processing, or just to use a percentage of the data. We can definitely give examples of use cases.
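As a rough illustration of both use cases, here is a plain-Python sketch of what sharding means. It is not the library's implementation: the `shard` function below is a hypothetical stand-in that operates on a list and only mirrors the `num_shards` / `index` / `contiguous` idea behind `Dataset.shard`.

```python
def shard(rows, num_shards, index, contiguous=False):
    """Return the `index`-th of `num_shards` pieces of `rows`.

    Illustrative stand-in for the concept, working on a plain list.
    """
    if not 0 <= index < num_shards:
        raise ValueError("index must satisfy 0 <= index < num_shards")
    if contiguous:
        # Consecutive blocks; the first `mod` shards get one extra row.
        div, mod = divmod(len(rows), num_shards)
        start = div * index + min(index, mod)
        end = start + div + (1 if index < mod else 0)
        return rows[start:end]
    # Strided split: row i goes to shard i % num_shards.
    return rows[index::num_shards]


rows = list(range(10))

# Distributed processing: each of 4 workers takes a different shard.
shards = [shard(rows, num_shards=4, index=i) for i in range(4)]

# Using a percentage of the data: one shard out of 10 is about 10%.
ten_percent = shard(rows, num_shards=10, index=0)
```

In the distributed case each worker would call this with its own rank as `index`, so the shards are disjoint and together cover the whole dataset.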

the code example in https://huggingface.co/docs/datasets/processing.html#mapping-in-a-distributed-setting does not work, because training_args has not been defined before in the doc.

training_args comes from transformers; it's a practical way to define all the arguments for training a model. Maybe we can just import it from transformers and use it with the default values.


lhoestq

Thanks for the corrections.
Though not all your comments have been addressed in this PR, we can already merge it.
As you may know we'll have a new documentation soon anyway ;)

severo added 9 commits July 21, 2021 17:22

docs: ✏️ format, update numbers, add link to datasets viewer

8fa45e6

docs: ✏️ add a missing item in the list of documentation parts

68f3526

docs: ✏️ fix link format (rst, not md)

71972e1

docs: ✏️ update number of datasets + sample

a81e5bf

docs: ✏️ newline at EOF

8e7bfb6

docs: ✏️ fix typo

faed653

docs: ✏️ update numbers

e43376c

docs: ✏️ redaction details

925fbbc

docs: ✏️ fix typos, and update cli output

13cffc6

severo force-pushed the docs-details branch from 6225ed3 to 13cffc6 Compare July 21, 2021 15:22

severo added 16 commits July 22, 2021 12:45

docs: ✏️ typos and details

5d14d2f

docs: ✏️ typos

6bd0b16

docs: ✏️ add an empty line so that the copy/pasted code is OK

4dc7a76

docs: ✏️ small corrections

75f43bb

docs: ✏️ typos

1b93513

docs: ✏️ add an external link

d199e40

docs: ✏️ add a code example for shard

9857a98

docs: ✏️ fix code example

9090033

The verbose option has been removed in df94a7c. Now there is no easy way to remove the progress bar. Using the hack in #2651 (comment) would make the code snippet too complicated.

docs: ✏️ fix copy/paste error

f996924

docs: ✏️ fix link

3517b62

docs: ✏️ disable the progress bar (replaces verbose=False)

ff0c4b0

docs: ✏️ code snippets format

329b0a2

docs: ✏️ typos and small content edits

721af09

docs: ✏️ details

703d275

docs: ✏️ typos and details

e08f5f7

docs: ✏️ details

921f946

lhoestq reviewed Jul 26, 2021

docs/source/index.rst (outdated, resolved)

Update docs/source/index.rst

a8ced7c

Co-authored-by: Quentin Lhoest <[email protected]>

severo marked this pull request as ready for review July 27, 2021 15:48

severo requested a review from lhoestq July 27, 2021 15:48

lhoestq approved these changes Jul 27, 2021

lhoestq merged commit 7d0bd0f into master Jul 27, 2021

lhoestq deleted the docs-details branch July 27, 2021 18:40
