Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Docs details #2690

Merged
merged 26 commits into from
Jul 27, 2021
Merged

Docs details #2690

merged 26 commits into from
Jul 27, 2021

Conversation

severo
Copy link
Collaborator

@severo severo commented Jul 21, 2021

Some comments here:

docs/source/index.rst Outdated Show resolved Hide resolved
@lhoestq
Copy link
Member

lhoestq commented Jul 26, 2021

Thanks for all the comments and for the corrections in the docs !

About all the points you mentioned:

  • the code samples assume the expected libraries have already been installed. Maybe add a section at start, or add it to every code sample. Something like pip install datasets transformers torch 'datasets[streaming]' (maybe just link to https://huggingface.co/docs/datasets/installation.html + a one-liner that installs all the requirements / alternatively a requirements.txt file)

Yes good idea

It refers to examples scripts inside the git repository of the library, see the examples folder in the transformers repo.
We don't have examples yet in the git repo of datasets as in transformers. So currently there are no examples. Maybe we can just remove this sentence from the docs for now

This is outdated and must be replaced by

or from the Hugging Face Hub if it’s not already stored in the library

We can replace the XTREME PANX dataste by matinf instead for example

Let's add data_dir="path/to/your/downloaded/data" for example

  • in https://huggingface.co/docs/datasets/loading_datasets.html#csv-files, it would be nice to have an URL to the csv loader reference (but I'm not sure there is one in the API reference). This comment applies in many places in the doc: I would want the API reference to contain doc for all the code/functions/classes... and I would want a lot more links inside the doc pointing to the API entries.

Currently there's no documentation for the CSV loader config. Maybe we can add the docstrings to the CsvConfig class to explain the parameters and how it works, and then redirect to the doc of this class in this section of the documentation.

This is the same as in transformers, not sure if this is a big issue

The function disable_progress_bar should definitely be in the docs, thanks. We can add it to the logging methods

Yes good idea !

Sure why not. Moreover the csv loader now supports remote files so you could just run the code pass an an URL to the sample csv file.

  • the doc explains how to shard a dataset, but does not explain why and when a dataset should be sharded (I have no idea... for parallelizing?). It does neither give an idea of the number of shards a dataset typically should have and why.

This can be used for distributed processing or just to use a percentage of the data. We can definitely give example of use cases

training_args comes from transformers, it's a practical way to define all your arguments to train a model. Maybe we can just import it from transformers and use it with the default values

Co-authored-by: Quentin Lhoest <[email protected]>
@severo severo marked this pull request as ready for review July 27, 2021 15:48
@severo severo requested a review from lhoestq July 27, 2021 15:48
Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the corrections.
Though not all your comments have been addressed in this PR, we can already merge it.
As you may know we'll have a new documentation soon anyway ;)

@lhoestq lhoestq merged commit 7d0bd0f into master Jul 27, 2021
@lhoestq lhoestq deleted the docs-details branch July 27, 2021 18:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants