Commit 13cffc6

docs: ✏️ fix typos, and update cli output
1 parent 925fbbc commit 13cffc6

1 file changed

docs/source/loading_datasets.rst

Lines changed: 40 additions & 23 deletions
@@ -52,11 +52,11 @@ This call to :func:`datasets.load_dataset` does the following steps under the ho
 
 Processing scripts are small python scripts which define the info (citation, description) and format of the dataset and contain the URL to the original SQuAD JSON files and the code to load examples from the original SQuAD JSON files. You can find the SQuAD processing script `here <https://github.com/huggingface/datasets/tree/master/datasets/squad/squad.py>`__ for instance.
 
-2. Run the SQuAD python processing script which will download the SQuAD dataset from the original URL (if it's not already downloaded and cached) and process and cache all SQuAD in a cache Arrow table for each standard splits stored on the drive.
+2. Run the SQuAD python processing script which will download the SQuAD dataset from the original URL (if it's not already downloaded and cached) and process and cache all SQuAD in a cache Arrow table for each standard split stored on the drive.
 
 .. note::
 
-    An Apache Arrow Table is the internal storing format for 🤗 Datasets. It allows to store arbitrarily long dataframe,
+    An Apache Arrow Table is the internal storing format for 🤗 Datasets. It allows to store an arbitrarily long dataframe,
     typed with potentially complex nested types that can be mapped to numpy/pandas/python types. Apache Arrow allows you
     to map blobs of data on-drive without doing any deserialization. So caching the dataset directly on disk can use
     memory-mapping and pay effectively zero cost with O(1) random access. Alternatively, you can copy it in CPU memory
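
As a hedged illustration of the two-step behaviour described in this hunk (the feature names come from the SQuAD output shown later in this diff):

.. code-block:: python

    from datasets import load_dataset

    # First call: downloads the processing script and the data, then caches
    # everything as Arrow tables on disk. Subsequent calls reuse the cache.
    squad = load_dataset('squad')

    # Random access is cheap because the Arrow cache is memory-mapped.
    print(squad['train'][0]['question'])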
@@ -80,9 +80,16 @@ If you don't provide a :obj:`split` argument to :func:`datasets.load_dataset`, t
     >>> from datasets import load_dataset
     >>> datasets = load_dataset('squad')
     >>> print(datasets)
-    {'train': Dataset(schema: {'id': 'string', 'title': 'string', 'context': 'string', 'question': 'string', 'answers': 'struct<text: list<item: string>, answer_start: list<item: int32>>'}, num_rows: 87599),
-    'validation': Dataset(schema: {'id': 'string', 'title': 'string', 'context': 'string', 'question': 'string', 'answers': 'struct<text: list<item: string>, answer_start: list<item: int32>>'}, num_rows: 10570)
-    }
+    DatasetDict({
+        train: Dataset({
+            features: ['id', 'title', 'context', 'question', 'answers'],
+            num_rows: 87599
+        })
+        validation: Dataset({
+            features: ['id', 'title', 'context', 'question', 'answers'],
+            num_rows: 10570
+        })
+    })
 
 The :obj:`split` argument can actually be used to control extensively the generated dataset split. You can use this argument to build a split from only a portion of a split in absolute number of examples or in proportion (e.g. :obj:`split='train[:10%]'` will load only the first 10% of the train split) or to mix splits (e.g. :obj:`split='train[:100]+validation[:100]'` will create a split from the first 100 examples of the train split and the first 100 examples of the validation split).
 
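A minimal sketch exercising the :obj:`split` syntax quoted in the context line above:

.. code-block:: python

    from datasets import load_dataset

    # Only the first 10% of the train split.
    train_10pct = load_dataset('squad', split='train[:10%]')

    # Mix splits: 100 train examples followed by 100 validation examples.
    mixed = load_dataset('squad', split='train[:100]+validation[:100]')
    print(len(mixed))  # 200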
@@ -91,12 +98,12 @@ You can find more details on the syntax for using :obj:`split` on the :doc:`dedi
 Selecting a configuration
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-Some datasets comprise several :obj:`configurations`. A Configuration define a sub-part of a dataset which can be selected. Unlike split, you have to select a single configuration for the dataset, you cannot mix several configurations. Examples of dataset with several configurations are:
+Some datasets comprise several :obj:`configurations`. A Configuration defines a sub-part of a dataset which can be selected. Unlike split, you have to select a single configuration for the dataset, you cannot mix several configurations. Examples of dataset with several configurations are:
 
 - the **GLUE** dataset which is an agregated benchmark comprised of 10 subsets: COLA, SST2, MRPC, QQP, STSB, MNLI, QNLI, RTE, WNLI and the diagnostic subset AX.
 - the **wikipedia** dataset which is provided for several languages.
 
-When a dataset is provided with more than one :obj:`configurations`, you will be requested to explicitely select a configuration among the possibilities.
+When a dataset is provided with more than one :obj:`configuration`, you will be requested to explicitely select a configuration among the possibilities.
 
 Selecting a configuration is done by providing :func:`datasets.load_dataset` with a :obj:`name` argument. Here is an example for **GLUE**:
 
@@ -115,10 +122,20 @@ Selecting a configuration is done by providing :func:`datasets.load_dataset` wit
     Downloading: 100%|██████████████████████████████████████████████████████████████| 7.44M/7.44M [00:01<00:00, 7.03MB/s]
     Dataset glue downloaded and prepared to /Users/huggignface/.cache/huggingface/datasets/glue/sst2/1.0.0. Subsequent calls will reuse this data.
     >>> print(dataset)
-    {'train': Dataset(schema: {'sentence': 'string', 'label': 'int64', 'idx': 'int32'}, num_rows: 67349),
-    'validation': Dataset(schema: {'sentence': 'string', 'label': 'int64', 'idx': 'int32'}, num_rows: 872),
-    'test': Dataset(schema: {'sentence': 'string', 'label': 'int64', 'idx': 'int32'}, num_rows: 1821)
-    }
+    DatasetDict({
+        train: Dataset({
+            features: ['sentence', 'label', 'idx'],
+            num_rows: 67349
+        })
+        validation: Dataset({
+            features: ['sentence', 'label', 'idx'],
+            num_rows: 872
+        })
+        test: Dataset({
+            features: ['sentence', 'label', 'idx'],
+            num_rows: 1821
+        })
+    })
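
A minimal sketch of configuration selection via the :obj:`name` argument (the config names come from the GLUE list above):

.. code-block:: python

    from datasets import load_dataset

    # Each GLUE subset is a separate configuration, selected via the
    # second positional (`name`) argument; configurations cannot be mixed.
    sst2 = load_dataset('glue', 'sst2')
    mrpc = load_dataset('glue', 'mrpc')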
 
 Manually downloading files
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -156,9 +173,9 @@ Generic loading scripts are provided for:
 - text files (read as a line-by-line dataset with the :obj:`text` script),
 - pandas pickled dataframe (with the :obj:`pandas` script).
 
-If you want to control better how you files are loaded, or if you have a file format exactly reproducing the file format for one of the datasets provided on the `HuggingFace Hub <https://huggingface.co/datasets>`__, it can be more flexible and simpler to create **your own loading script**, from scratch or by adapting one of the provided loading scripts. In this case, please go check the :doc:`add_dataset` chapter.
+If you want to control better how your files are loaded, or if you have a file format exactly reproducing the file format for one of the datasets provided on the `HuggingFace Hub <https://huggingface.co/datasets>`__, it can be more flexible and simpler to create **your own loading script**, from scratch or by adapting one of the provided loading scripts. In this case, please go check the :doc:`add_dataset` chapter.
 
-The :obj:`data_files` argument in :func:`datasets.load_dataset` is used to provide paths to one or several files. This arguments currently accept three types of inputs:
+The :obj:`data_files` argument in :func:`datasets.load_dataset` is used to provide paths to one or several files. This argument currently accepts three types of inputs:
 
 - :obj:`str`: a single string as the path to a single file (considered to constitute the `train` split by default)
 - :obj:`List[str]`: a list of strings as paths to a list of files (also considered to constitute the `train` split by default)
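
A minimal sketch of the first two :obj:`data_files` input types (the filenames are hypothetical):

.. code-block:: python

    from datasets import load_dataset

    # str: a single file, loaded as the default `train` split.
    dataset = load_dataset('text', data_files='my_corpus.txt')

    # List[str]: several files, also combined into the `train` split.
    dataset = load_dataset('text', data_files=['part1.txt', 'part2.txt'])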
@@ -176,13 +193,13 @@ Let's see an example of all the various ways you can provide files to :func:`dat
 
 .. note::
 
-    The :obj:`split` argument will work similarly to what we detailed above for the datasets on the Hub and you can find more details on the syntax for using :obj:`split` on the :doc:`dedicated tutorial on split <./splits>`. The only specific behavior related to loading local files is that if you don't indicate which split each files is realted to, the provided files are assumed to belong to the **train** split.
+    The :obj:`split` argument will work similarly to what we detailed above for the datasets on the Hub and you can find more details on the syntax for using :obj:`split` on the :doc:`dedicated tutorial on split <./splits>`. The only specific behavior related to loading local files is that if you don't indicate which split each files is related to, the provided files are assumed to belong to the **train** split.
 
 
 CSV files
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-🤗 Datasets can read a dataset made of on or several CSV files.
+🤗 Datasets can read a dataset made of one or several CSV files.
 
 All the CSV files in the dataset should have the same organization and in particular the same datatypes for the columns.
 
@@ -205,11 +222,11 @@ The ``csv`` loading script provides a few simple access options to control parsi
 
 - :obj:`skiprows` (int) - Number of first rows in the file to skip (default is 0)
 - :obj:`column_names` (list, optional) – The column names of the target table. If empty, fall back on autogenerate_column_names (default: empty).
-- :obj:`delimiter` (1-character string) – The character delimiting individual cells in the CSV data (default ``','``).
-- :obj:`quotechar` (1-character string) – The character used optionally for quoting CSV values (default '"').
-- :obj:`quoting` (bool) – Control quoting behavior (default 0, setting this to 3 disables quoting, refer to pandas.read_csv documentation for more details).
+- :obj:`delimiter` (1-character string) – The character delimiting individual cells in the CSV data (default ``,``).
+- :obj:`quotechar` (1-character string) – The character used optionally for quoting CSV values (default ``"``).
+- :obj:`quoting` (int) – Control quoting behavior (default 0, setting this to 3 disables quoting, refer to `pandas.read_csv documentation <https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html>` for more details).
 
-If you want more control, the ``csv`` script provide full control on reading, parsong and convertion through the Apache Arrow `pyarrow.csv.ReadOptions <https://arrow.apache.org/docs/python/generated/pyarrow.csv.ReadOptions.html>`__, `pyarrow.csv.ParseOptions <https://arrow.apache.org/docs/python/generated/pyarrow.csv.ParseOptions.html>`__ and `pyarrow.csv.ConvertOptions <https://arrow.apache.org/docs/python/generated/pyarrow.csv.ConvertOptions.html>`__
+If you want more control, the ``csv`` script provides full control on reading, parsing and converting through the Apache Arrow `pyarrow.csv.ReadOptions <https://arrow.apache.org/docs/python/generated/pyarrow.csv.ReadOptions.html>`__, `pyarrow.csv.ParseOptions <https://arrow.apache.org/docs/python/generated/pyarrow.csv.ParseOptions.html>`__ and `pyarrow.csv.ConvertOptions <https://arrow.apache.org/docs/python/generated/pyarrow.csv.ConvertOptions.html>`__
 
 - :obj:`read_options` — Can be provided with a `pyarrow.csv.ReadOptions <https://arrow.apache.org/docs/python/generated/pyarrow.csv.ReadOptions.html>`__ to control all the reading options. If :obj:`skiprows`, :obj:`column_names` or :obj:`autogenerate_column_names` are also provided (see above), they will take priority over the attributes in :obj:`read_options`.
 - :obj:`parse_options` — Can be provided with a `pyarrow.csv.ParseOptions <https://arrow.apache.org/docs/python/generated/pyarrow.csv.ParseOptions.html>`__ to control all the parsing options. If :obj:`delimiter` or :obj:`quote_char` are also provided (see above), they will take priority over the attributes in :obj:`parse_options`.
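
A hedged sketch combining the simple options listed above (the file and column names are hypothetical):

.. code-block:: python

    from datasets import load_dataset

    # The simple options are passed as keyword arguments to the csv script.
    dataset = load_dataset('csv', data_files='my_file.csv',
                           skiprows=1,
                           column_names=['text', 'label'],
                           delimiter=';')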
@@ -219,7 +236,7 @@ If you want more control, the ``csv`` script provide full control on reading, pa
 JSON files
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-🤗 Datasets supports building a dataset from JSON files in various format.
+🤗 Datasets supports building a dataset from JSON files in various formats.
 
 The most efficient format is to have JSON files consisting of multiple JSON objects, one per line, representing individual data rows:
 
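A minimal sketch of loading that line-per-object format (hypothetical file):

.. code-block:: python

    from datasets import load_dataset

    # my_data.jsonl contains one JSON object per line, e.g.
    #   {"text": "first example", "label": 0}
    #   {"text": "second example", "label": 1}
    dataset = load_dataset('json', data_files='my_data.jsonl')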
@@ -285,7 +302,7 @@ In this case you can use the :obj:`features` arguments to :func:`datasets.load_d
 From in-memory data
 -----------------------------------------------------------
 
-Eventually, it's also possible to instantiate a :class:`datasets.Dataset` directly from in-memory data, currently one or:
+Eventually, it's also possible to instantiate a :class:`datasets.Dataset` directly from in-memory data, currently:
 
 - a python dict, or
 - a pandas dataframe.
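
A hedged sketch of both in-memory constructors, assuming the library's :obj:`Dataset.from_dict` and :obj:`Dataset.from_pandas` entry points:

.. code-block:: python

    import pandas as pd
    from datasets import Dataset

    # From a python dict: keys become column names.
    d1 = Dataset.from_dict({'text': ['hello', 'world'], 'label': [0, 1]})

    # From a pandas dataframe.
    d2 = Dataset.from_pandas(pd.DataFrame({'text': ['hello', 'world'],
                                           'label': [0, 1]}))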
@@ -333,7 +350,7 @@ Using a custom dataset loading script
 
 If the provided loading scripts for Hub dataset or for local files are not adapted for your use case, you can also easily write and use your own dataset loading script.
 
-You can use a local loading script just by providing its path instead of the usual shortcut name:
+You can use a local loading script by providing its path instead of the usual shortcut name:
 
 .. code-block::
 
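A minimal sketch of such a call, with a hypothetical script path:

.. code-block:: python

    from datasets import load_dataset

    # Hypothetical path to a local processing script.
    dataset = load_dataset('path/to/my_dataset_script.py')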
@@ -448,7 +465,7 @@ For example, run the following to skip integrity verifications when loading the
 Loading datasets offline
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-Each dataset builder (e.g. "squad") is a python script that is downloaded and cached from either from the 🤗 Datasets GitHub repository or from the `HuggingFace Hub <https://huggingface.co/datasets>`__.
+Each dataset builder (e.g. "squad") is a python script that is downloaded and cached either from the 🤗 Datasets GitHub repository or from the `HuggingFace Hub <https://huggingface.co/datasets>`__.
 Only the ``text``, ``csv``, ``json`` and ``pandas`` builders are included in ``datasets`` without requiring external downloads.
 
 Therefore if you don't have an internet connection you can't load a dataset that is not packaged with ``datasets``, unless the dataset is already cached.
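
A hedged sketch of what works offline (hypothetical local file):

.. code-block:: python

    from datasets import load_dataset

    # The packaged builders (text, csv, json, pandas) require no download,
    # so this works without a connection:
    local = load_dataset('json', data_files='local_data.jsonl')

    # A Hub dataset like "squad" only loads offline if a previous
    # online call already cached it.
    squad = load_dataset('squad')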
