Commit 13cffc6

docs: ✏️ fix typos, and update cli output
1 parent 925fbbc commit 13cffc6

1 file changed

docs/source/loading_datasets.rst

Lines changed: 40 additions & 23 deletions
@@ -52,11 +52,11 @@ This call to :func:`datasets.load_dataset` does the following steps under the ho
 
 Processing scripts are small python scripts which define the info (citation, description) and format of the dataset and contain the URL to the original SQuAD JSON files and the code to load examples from the original SQuAD JSON files. You can find the SQuAD processing script `here <https://github.com/huggingface/datasets/tree/master/datasets/squad/squad.py>`__ for instance.
 
-2. Run the SQuAD python processing script which will download the SQuAD dataset from the original URL (if it's not already downloaded and cached) and process and cache all SQuAD in a cache Arrow table for each standard splits stored on the drive.
+2. Run the SQuAD python processing script which will download the SQuAD dataset from the original URL (if it's not already downloaded and cached) and process and cache all SQuAD in a cache Arrow table for each standard split stored on the drive.
 
 .. note::
 
-    An Apache Arrow Table is the internal storing format for 🤗 Datasets. It allows to store arbitrarily long dataframe,
+    An Apache Arrow Table is the internal storing format for 🤗 Datasets. It allows to store an arbitrarily long dataframe,
     typed with potentially complex nested types that can be mapped to numpy/pandas/python types. Apache Arrow allows you
     to map blobs of data on-drive without doing any deserialization. So caching the dataset directly on disk can use
     memory-mapping and pay effectively zero cost with O(1) random access. Alternatively, you can copy it in CPU memory
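
As a hedged illustration of the two-step behaviour described in this hunk (the feature names come from the SQuAD output shown later in this diff):

.. code-block:: python

    from datasets import load_dataset

    # First call: downloads the processing script and the data, then caches
    # everything as Arrow tables on disk. Subsequent calls reuse the cache.
    squad = load_dataset('squad')

    # Random access is cheap because the Arrow cache is memory-mapped.
    print(squad['train'][0]['question'])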
@@ -80,9 +80,16 @@ If you don't provide a :obj:`split` argument to :func:`datasets.load_dataset`, t
     >>> from datasets import load_dataset
     >>> datasets = load_dataset('squad')
     >>> print(datasets)
-    {'train': Dataset(schema: {'id': 'string', 'title': 'string', 'context': 'string', 'question': 'string', 'answers': 'struct<text: list<item: string>, answer_start: list<item: int32>>'}, num_rows: 87599),
-    'validation': Dataset(schema: {'id': 'string', 'title': 'string', 'context': 'string', 'question': 'string', 'answers': 'struct<text: list<item: string>, answer_start: list<item: int32>>'}, num_rows: 10570)
-    }
+    DatasetDict({
+        train: Dataset({
+            features: ['id', 'title', 'context', 'question', 'answers'],
+            num_rows: 87599
+        })
+        validation: Dataset({
+            features: ['id', 'title', 'context', 'question', 'answers'],
+            num_rows: 10570
+        })
+    })
 
 The :obj:`split` argument can actually be used to control extensively the generated dataset split. You can use this argument to build a split from only a portion of a split in absolute number of examples or in proportion (e.g. :obj:`split='train[:10%]'` will load only the first 10% of the train split) or to mix splits (e.g. :obj:`split='train[:100]+validation[:100]'` will create a split from the first 100 examples of the train split and the first 100 examples of the validation split).
 
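A minimal sketch exercising the :obj:`split` syntax quoted in the context line above:

.. code-block:: python

    from datasets import load_dataset

    # Only the first 10% of the train split.
    train_10pct = load_dataset('squad', split='train[:10%]')

    # Mix splits: 100 train examples followed by 100 validation examples.
    mixed = load_dataset('squad', split='train[:100]+validation[:100]')
    print(len(mixed))  # 200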
@@ -91,12 +98,12 @@ You can find more details on the syntax for using :obj:`split` on the :doc:`dedi
 Selecting a configuration
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-Some datasets comprise several :obj:`configurations`. A Configuration define a sub-part of a dataset which can be selected. Unlike split, you have to select a single configuration for the dataset, you cannot mix several configurations. Examples of dataset with several configurations are:
+Some datasets comprise several :obj:`configurations`. A Configuration defines a sub-part of a dataset which can be selected. Unlike split, you have to select a single configuration for the dataset, you cannot mix several configurations. Examples of dataset with several configurations are:
 
 - the **GLUE** dataset which is an agregated benchmark comprised of 10 subsets: COLA, SST2, MRPC, QQP, STSB, MNLI, QNLI, RTE, WNLI and the diagnostic subset AX.
 - the **wikipedia** dataset which is provided for several languages.
 
-When a dataset is provided with more than one :obj:`configurations`, you will be requested to explicitely select a configuration among the possibilities.
+When a dataset is provided with more than one :obj:`configuration`, you will be requested to explicitely select a configuration among the possibilities.
 
 Selecting a configuration is done by providing :func:`datasets.load_dataset` with a :obj:`name` argument. Here is an example for **GLUE**:
 
@@ -115,10 +122,20 @@ Selecting a configuration is done by providing :func:`datasets.load_dataset` wit
     Downloading: 100%|██████████████████████████████████████████████████████████████| 7.44M/7.44M [00:01<00:00, 7.03MB/s]
     Dataset glue downloaded and prepared to /Users/huggignface/.cache/huggingface/datasets/glue/sst2/1.0.0. Subsequent calls will reuse this data.
     >>> print(dataset)
-    {'train': Dataset(schema: {'sentence': 'string', 'label': 'int64', 'idx': 'int32'}, num_rows: 67349),
-    'validation': Dataset(schema: {'sentence': 'string', 'label': 'int64', 'idx': 'int32'}, num_rows: 872),
-    'test': Dataset(schema: {'sentence': 'string', 'label': 'int64', 'idx': 'int32'}, num_rows: 1821)
-    }
+    DatasetDict({
+        train: Dataset({
+            features: ['sentence', 'label', 'idx'],
+            num_rows: 67349
+        })
+        validation: Dataset({
+            features: ['sentence', 'label', 'idx'],
+            num_rows: 872
+        })
+        test: Dataset({
+            features: ['sentence', 'label', 'idx'],
+            num_rows: 1821
+        })
+    })
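
A minimal sketch of configuration selection via the :obj:`name` argument (the config names come from the GLUE list above):

.. code-block:: python

    from datasets import load_dataset

    # Each GLUE subset is a separate configuration, selected via the
    # second positional (`name`) argument; configurations cannot be mixed.
    sst2 = load_dataset('glue', 'sst2')
    mrpc = load_dataset('glue', 'mrpc')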
 
 Manually downloading files
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -156,9 +173,9 @@ Generic loading scripts are provided for:
 - text files (read as a line-by-line dataset with the :obj:`text` script),
 - pandas pickled dataframe (with the :obj:`pandas` script).
 
-If you want to control better how you files are loaded, or if you have a file format exactly reproducing the file format for one of the datasets provided on the `HuggingFace Hub <https://huggingface.co/datasets>`__, it can be more flexible and simpler to create **your own loading script**, from scratch or by adapting one of the provided loading scripts. In this case, please go check the :doc:`add_dataset` chapter.
+If you want to control better how your files are loaded, or if you have a file format exactly reproducing the file format for one of the datasets provided on the `HuggingFace Hub <https://huggingface.co/datasets>`__, it can be more flexible and simpler to create **your own loading script**, from scratch or by adapting one of the provided loading scripts. In this case, please go check the :doc:`add_dataset` chapter.
 
-The :obj:`data_files` argument in :func:`datasets.load_dataset` is used to provide paths to one or several files. This arguments currently accept three types of inputs:
+The :obj:`data_files` argument in :func:`datasets.load_dataset` is used to provide paths to one or several files. This argument currently accepts three types of inputs:
 
 - :obj:`str`: a single string as the path to a single file (considered to constitute the `train` split by default)
 - :obj:`List[str]`: a list of strings as paths to a list of files (also considered to constitute the `train` split by default)
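
A minimal sketch of the first two :obj:`data_files` input types (the filenames are hypothetical):

.. code-block:: python

    from datasets import load_dataset

    # str: a single file, loaded as the default `train` split.
    dataset = load_dataset('text', data_files='my_corpus.txt')

    # List[str]: several files, also combined into the `train` split.
    dataset = load_dataset('text', data_files=['part1.txt', 'part2.txt'])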
@@ -176,13 +193,13 @@ Let's see an example of all the various ways you can provide files to :func:`dat
 
 .. note::
 
-    The :obj:`split` argument will work similarly to what we detailed above for the datasets on the Hub and you can find more details on the syntax for using :obj:`split` on the :doc:`dedicated tutorial on split <./splits>`. The only specific behavior related to loading local files is that if you don't indicate which split each files is realted to, the provided files are assumed to belong to the **train** split.
+    The :obj:`split` argument will work similarly to what we detailed above for the datasets on the Hub and you can find more details on the syntax for using :obj:`split` on the :doc:`dedicated tutorial on split <./splits>`. The only specific behavior related to loading local files is that if you don't indicate which split each files is related to, the provided files are assumed to belong to the **train** split.
 
 
 CSV files
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-🤗 Datasets can read a dataset made of on or several CSV files.
+🤗 Datasets can read a dataset made of one or several CSV files.
 
 All the CSV files in the dataset should have the same organization and in particular the same datatypes for the columns.
 
@@ -205,11 +222,11 @@ The ``csv`` loading script provides a few simple access options to control parsi
 
 - :obj:`skiprows` (int) - Number of first rows in the file to skip (default is 0)
 - :obj:`column_names` (list, optional) – The column names of the target table. If empty, fall back on autogenerate_column_names (default: empty).
-- :obj:`delimiter` (1-character string) – The character delimiting individual cells in the CSV data (default ``','``).
-- :obj:`quotechar` (1-character string) – The character used optionally for quoting CSV values (default '"').
-- :obj:`quoting` (bool) – Control quoting behavior (default 0, setting this to 3 disables quoting, refer to pandas.read_csv documentation for more details).
+- :obj:`delimiter` (1-character string) – The character delimiting individual cells in the CSV data (default ``,``).
+- :obj:`quotechar` (1-character string) – The character used optionally for quoting CSV values (default ``"``).
+- :obj:`quoting` (int) – Control quoting behavior (default 0, setting this to 3 disables quoting, refer to `pandas.read_csv documentation <https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html>` for more details).
 
-If you want more control, the ``csv`` script provide full control on reading, parsong and convertion through the Apache Arrow `pyarrow.csv.ReadOptions <https://arrow.apache.org/docs/python/generated/pyarrow.csv.ReadOptions.html>`__, `pyarrow.csv.ParseOptions <https://arrow.apache.org/docs/python/generated/pyarrow.csv.ParseOptions.html>`__ and `pyarrow.csv.ConvertOptions <https://arrow.apache.org/docs/python/generated/pyarrow.csv.ConvertOptions.html>`__
+If you want more control, the ``csv`` script provides full control on reading, parsing and converting through the Apache Arrow `pyarrow.csv.ReadOptions <https://arrow.apache.org/docs/python/generated/pyarrow.csv.ReadOptions.html>`__, `pyarrow.csv.ParseOptions <https://arrow.apache.org/docs/python/generated/pyarrow.csv.ParseOptions.html>`__ and `pyarrow.csv.ConvertOptions <https://arrow.apache.org/docs/python/generated/pyarrow.csv.ConvertOptions.html>`__
 
 - :obj:`read_options` — Can be provided with a `pyarrow.csv.ReadOptions <https://arrow.apache.org/docs/python/generated/pyarrow.csv.ReadOptions.html>`__ to control all the reading options. If :obj:`skiprows`, :obj:`column_names` or :obj:`autogenerate_column_names` are also provided (see above), they will take priority over the attributes in :obj:`read_options`.
 - :obj:`parse_options` — Can be provided with a `pyarrow.csv.ParseOptions <https://arrow.apache.org/docs/python/generated/pyarrow.csv.ParseOptions.html>`__ to control all the parsing options. If :obj:`delimiter` or :obj:`quote_char` are also provided (see above), they will take priority over the attributes in :obj:`parse_options`.
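
A hedged sketch combining the simple options listed above (the file and column names are hypothetical):

.. code-block:: python

    from datasets import load_dataset

    # The simple options are passed as keyword arguments to the csv script.
    dataset = load_dataset('csv', data_files='my_file.csv',
                           skiprows=1,
                           column_names=['text', 'label'],
                           delimiter=';')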
@@ -219,7 +236,7 @@ If you want more control, the ``csv`` script provide full control on reading, pa
 JSON files
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-🤗 Datasets supports building a dataset from JSON files in various format.
+🤗 Datasets supports building a dataset from JSON files in various formats.
 
 The most efficient format is to have JSON files consisting of multiple JSON objects, one per line, representing individual data rows:
 
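A minimal sketch of loading that line-per-object format (hypothetical file):

.. code-block:: python

    from datasets import load_dataset

    # my_data.jsonl contains one JSON object per line, e.g.
    #   {"text": "first example", "label": 0}
    #   {"text": "second example", "label": 1}
    dataset = load_dataset('json', data_files='my_data.jsonl')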
@@ -285,7 +302,7 @@ In this case you can use the :obj:`features` arguments to :func:`datasets.load_d
 From in-memory data
 -----------------------------------------------------------
 
-Eventually, it's also possible to instantiate a :class:`datasets.Dataset` directly from in-memory data, currently one or:
+Eventually, it's also possible to instantiate a :class:`datasets.Dataset` directly from in-memory data, currently:
 
 - a python dict, or
 - a pandas dataframe.
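
A hedged sketch of both in-memory constructors, assuming the library's :obj:`Dataset.from_dict` and :obj:`Dataset.from_pandas` entry points:

.. code-block:: python

    import pandas as pd
    from datasets import Dataset

    # From a python dict: keys become column names.
    d1 = Dataset.from_dict({'text': ['hello', 'world'], 'label': [0, 1]})

    # From a pandas dataframe.
    d2 = Dataset.from_pandas(pd.DataFrame({'text': ['hello', 'world'],
                                           'label': [0, 1]}))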
@@ -333,7 +350,7 @@ Using a custom dataset loading script
 
 If the provided loading scripts for Hub dataset or for local files are not adapted for your use case, you can also easily write and use your own dataset loading script.
 
-You can use a local loading script just by providing its path instead of the usual shortcut name:
+You can use a local loading script by providing its path instead of the usual shortcut name:
 
 .. code-block::
 
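A minimal sketch of such a call, with a hypothetical script path:

.. code-block:: python

    from datasets import load_dataset

    # Hypothetical path to a local processing script.
    dataset = load_dataset('path/to/my_dataset_script.py')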
@@ -448,7 +465,7 @@ For example, run the following to skip integrity verifications when loading the
 Loading datasets offline
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-Each dataset builder (e.g. "squad") is a python script that is downloaded and cached from either from the 🤗 Datasets GitHub repository or from the `HuggingFace Hub <https://huggingface.co/datasets>`__.
+Each dataset builder (e.g. "squad") is a python script that is downloaded and cached either from the 🤗 Datasets GitHub repository or from the `HuggingFace Hub <https://huggingface.co/datasets>`__.
 Only the ``text``, ``csv``, ``json`` and ``pandas`` builders are included in ``datasets`` without requiring external downloads.
 
 Therefore if you don't have an internet connection you can't load a dataset that is not packaged with ``datasets``, unless the dataset is already cached.
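
A hedged sketch of what works offline (hypothetical local file):

.. code-block:: python

    from datasets import load_dataset

    # The packaged builders (text, csv, json, pandas) require no download,
    # so this works without a connection:
    local = load_dataset('json', data_files='local_data.jsonl')

    # A Hub dataset like "squad" only loads offline if a previous
    # online call already cached it.
    squad = load_dataset('squad')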
