docs/source/loading_datasets.rst
This call to :func:`datasets.load_dataset` does the following steps under the hood:

1. Download and import in the library the SQuAD Python processing script from the Hugging Face GitHub repository (or AWS bucket) if it's not already stored in the library.
   Processing scripts are small Python scripts which define the info (citation, description) and format of the dataset, and contain the URL to the original SQuAD JSON files as well as the code to load examples from them. You can find the SQuAD processing script `here <https://github.com/huggingface/datasets/tree/master/datasets/squad/squad.py>`__ for instance.
2. Run the SQuAD Python processing script, which will download the SQuAD dataset from the original URL (if it's not already downloaded and cached) and process and cache SQuAD in an Arrow table on the drive for each standard split.
.. note::
    An Apache Arrow Table is the internal storing format for 🤗 Datasets. It allows storing arbitrarily long dataframes,
    typed with potentially complex nested types that can be mapped to numpy/pandas/python types. Apache Arrow allows you
    to map blobs of data on-drive without doing any deserialization, so caching the dataset directly on disk can use
    memory-mapping and pay effectively zero cost with O(1) random access. Alternatively, you can copy it into CPU memory (RAM).
If you don't provide a :obj:`split` argument to :func:`datasets.load_dataset`, this method will return a dictionary containing a dataset for each split in the dataset.
The :obj:`split` argument can be used to extensively control the generated dataset split. You can use this argument to build a split from only a portion of a split in absolute number of examples or in proportion (e.g. :obj:`split='train[:10%]'` will load only the first 10% of the train split) or to mix splits (e.g. :obj:`split='train[:100]+validation[:100]'` will create a split from the first 100 examples of the train split and the first 100 examples of the validation split).
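
For instance, a quick sketch of these two patterns, reusing the SQuAD dataset from above:

.. code-block::

    >>> from datasets import load_dataset
    >>> # only the first 10% of the train split
    >>> train_10pct = load_dataset('squad', split='train[:10%]')
    >>> # the first 100 train examples mixed with the first 100 validation examples
    >>> mixed = load_dataset('squad', split='train[:100]+validation[:100]')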
You can find more details on the syntax for using :obj:`split` on the :doc:`dedicated tutorial on split <./splits>`.
Selecting a configuration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Some datasets comprise several :obj:`configurations`. A configuration defines a sub-part of a dataset which can be selected. Unlike splits, you have to select a single configuration for the dataset; you cannot mix several configurations. Examples of datasets with several configurations are:
- the **GLUE** dataset, which is an aggregated benchmark comprising 10 subsets: COLA, SST2, MRPC, QQP, STSB, MNLI, QNLI, RTE, WNLI and the diagnostic subset AX.
- the **wikipedia** dataset which is provided for several languages.
When a dataset is provided with more than one :obj:`configuration`, you will be requested to explicitly select a configuration among the possibilities.
Selecting a configuration is done by providing :func:`datasets.load_dataset` with a :obj:`name` argument. Here is an example for **GLUE**:
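
.. code-block::

    >>> from datasets import load_dataset
    >>> # a sketch: 'mrpc' is just one of the GLUE configurations listed above
    >>> dataset = load_dataset('glue', 'mrpc', split='train')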
From local files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Generic loading scripts are provided for:

- CSV files (with the :obj:`csv` script),
- JSON files (with the :obj:`json` script),
- text files (read as a line-by-line dataset with the :obj:`text` script),
- pandas pickled dataframes (with the :obj:`pandas` script).
If you want better control over how your files are loaded, or if you have a file format that exactly reproduces the file format of one of the datasets provided on the `HuggingFace Hub <https://huggingface.co/datasets>`__, it can be more flexible and simpler to create **your own loading script**, from scratch or by adapting one of the provided loading scripts. In this case, please check the :doc:`add_dataset` chapter.
The :obj:`data_files` argument in :func:`datasets.load_dataset` is used to provide paths to one or several files. This argument currently accepts three types of inputs:
- :obj:`str`: a single string as the path to a single file (considered to constitute the `train` split by default)
- :obj:`List[str]`: a list of strings as paths to a list of files (also considered to constitute the `train` split by default)
- :obj:`Dict[str, Union[str, List[str]]]`: a dictionary mapping split names to a single file or a list of files.

Let's see an example of all the various ways you can provide files to :func:`datasets.load_dataset`:
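
.. code-block::

    >>> from datasets import load_dataset
    >>> # a sketch of the three input types with the 'text' script; file names are placeholders
    >>> # str: a single file, loaded as the train split by default
    >>> dataset = load_dataset('text', data_files='my_file.txt')
    >>> # List[str]: several files, also loaded as the train split by default
    >>> dataset = load_dataset('text', data_files=['my_file_1.txt', 'my_file_2.txt'])
    >>> # Dict[str, Union[str, List[str]]]: an explicit mapping from split names to files
    >>> dataset = load_dataset('text', data_files={'train': 'my_train.txt', 'test': 'my_test.txt'})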
.. note::
    The :obj:`split` argument will work similarly to what we detailed above for the datasets on the Hub, and you can find more details on the syntax for using :obj:`split` on the :doc:`dedicated tutorial on split <./splits>`. The only specific behavior related to loading local files is that if you don't indicate which split each file is related to, the provided files are assumed to belong to the **train** split.
CSV files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
🤗 Datasets can read a dataset made of one or several CSV files.
All the CSV files in the dataset should have the same organization and in particular the same datatypes for the columns.
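
For example, a sketch with placeholder file names:

.. code-block::

    >>> from datasets import load_dataset
    >>> dataset = load_dataset('csv', data_files=['my_file_1.csv', 'my_file_2.csv'])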
The ``csv`` loading script provides a few simple access options to control parsing and reading the CSV files:

- :obj:`skiprows` (int) – Number of first rows in the file to skip (default is 0)
- :obj:`column_names` (list, optional) – The column names of the target table. If empty, fall back on autogenerate_column_names (default: empty).
- :obj:`delimiter` (1-character string) – The character delimiting individual cells in the CSV data (default ``,``).
- :obj:`quotechar` (1-character string) – The character used optionally for quoting CSV values (default ``"``).
- :obj:`quoting` (int) – Control quoting behavior (default 0, setting this to 3 disables quoting, refer to the `pandas.read_csv documentation <https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html>`__ for more details).
If you want more control, the ``csv`` script provides full control over reading, parsing and converting through the Apache Arrow `pyarrow.csv.ReadOptions <https://arrow.apache.org/docs/python/generated/pyarrow.csv.ReadOptions.html>`__, `pyarrow.csv.ParseOptions <https://arrow.apache.org/docs/python/generated/pyarrow.csv.ParseOptions.html>`__ and `pyarrow.csv.ConvertOptions <https://arrow.apache.org/docs/python/generated/pyarrow.csv.ConvertOptions.html>`__:
- :obj:`read_options` — Can be provided with a `pyarrow.csv.ReadOptions <https://arrow.apache.org/docs/python/generated/pyarrow.csv.ReadOptions.html>`__ to control all the reading options. If :obj:`skiprows`, :obj:`column_names` or :obj:`autogenerate_column_names` are also provided (see above), they will take priority over the attributes in :obj:`read_options`.
- :obj:`parse_options` — Can be provided with a `pyarrow.csv.ParseOptions <https://arrow.apache.org/docs/python/generated/pyarrow.csv.ParseOptions.html>`__ to control all the parsing options. If :obj:`delimiter` or :obj:`quote_char` are also provided (see above), they will take priority over the attributes in :obj:`parse_options`.
- :obj:`convert_options` — Can be provided with a `pyarrow.csv.ConvertOptions <https://arrow.apache.org/docs/python/generated/pyarrow.csv.ConvertOptions.html>`__ to control all the conversion options.
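
A sketch of passing one of these option objects (``my_file.csv`` and the ``;`` delimiter are placeholders):

.. code-block::

    >>> import pyarrow.csv as pac
    >>> from datasets import load_dataset
    >>> # override the cell delimiter through pyarrow's ParseOptions
    >>> dataset = load_dataset('csv', data_files='my_file.csv',
    ...                        parse_options=pac.ParseOptions(delimiter=';'))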
JSON files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
🤗 Datasets supports building a dataset from JSON files in various formats.
The most efficient format is to have JSON files consisting of multiple JSON objects, one per line, representing individual data rows:
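
.. code-block::

    >>> from datasets import load_dataset
    >>> # a sketch: 'my_file.json' is a placeholder; each of its lines
    >>> # holds one JSON object such as {"a": 1, "b": "foo"}
    >>> dataset = load_dataset('json', data_files='my_file.json')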
From in-memory data
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Finally, it's also possible to instantiate a :class:`datasets.Dataset` directly from in-memory data, currently:
- a python dict, or
291
308
- a pandas dataframe.
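
A sketch of the two corresponding constructors (column names are placeholders):

.. code-block::

    >>> import pandas as pd
    >>> from datasets import Dataset
    >>> # from a python dict
    >>> dataset = Dataset.from_dict({'id': [0, 1], 'text': ['foo', 'bar']})
    >>> # from a pandas dataframe
    >>> dataset = Dataset.from_pandas(pd.DataFrame({'id': [0, 1], 'text': ['foo', 'bar']}))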
Using a custom dataset loading script
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If the provided loading scripts for Hub datasets or for local files are not suited to your use case, you can also easily write and use your own dataset loading script.
You can use a local loading script by providing its path instead of the usual shortcut name:
.. code-block::
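    >>> from datasets import load_dataset
    >>> # a sketch: the path below is a placeholder for your own script
    >>> dataset = load_dataset('PATH/TO/MY/LOADING/SCRIPT.py')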
Loading datasets offline
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Each dataset builder (e.g. "squad") is a Python script that is downloaded and cached either from the 🤗 Datasets GitHub repository or from the `HuggingFace Hub <https://huggingface.co/datasets>`__.
Only the ``text``, ``csv``, ``json`` and ``pandas`` builders are included in ``datasets`` without requiring external downloads.
Therefore, if you don't have an internet connection, you can't load a dataset that is not packaged with ``datasets``, unless the dataset is already cached.
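
For instance, a sketch assuming "squad" was downloaded during an earlier online session:

.. code-block::

    >>> from datasets import load_dataset
    >>> # both the builder script and the data are served from the local cache
    >>> dataset = load_dataset('squad')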