docs/source/processing.rst
8 additions & 11 deletions
@@ -362,7 +362,7 @@ To this aim, the :obj:`remove_columns=List[str]` argument can be used and provid
Columns to remove are removed **after** the example has been provided to the mapped function so that the mapped function can use the content of these columns before they are removed.
-Here is an example removing the ``sentence1`` column while adding a ``new_sentence`` column with the content of the ``new_sentence``. Said more simply, we are renaming the ``sentence1`` column as ``new_sentence``:
+Here is an example removing the ``sentence1`` column while adding a ``new_sentence`` column with the content of ``sentence1``. Said more simply, we are renaming the ``sentence1`` column as ``new_sentence``:
.. code-block::
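For reference, the ``.. code-block::`` elided in this hunk illustrates the rename with :func:`datasets.Dataset.map`. A minimal sketch, assuming the GLUE/MRPC dataset used throughout this guide:

.. code-block:: python

    from datasets import load_dataset

    dataset = load_dataset("glue", "mrpc", split="train")

    # Copy ``sentence1`` into a new ``new_sentence`` column and drop the original.
    # ``remove_columns`` is applied *after* the mapped function has run, so the
    # function can still read ``sentence1``.
    dataset = dataset.map(
        lambda example: {"new_sentence": example["sentence1"]},
        remove_columns=["sentence1"],
    )

    print(dataset.column_names)  # ``new_sentence`` has replaced ``sentence1``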
@@ -396,7 +396,7 @@ Processing data in batches
This is particularly interesting if you have a mapped function which can efficiently handle batches of inputs like the tokenizers of the fast `HuggingFace tokenizers library <https://github.com/huggingface/tokenizers>`__.
-To operate on batch of example, just set :obj:`batched=True` when calling :func:`datasets.Dataset.map` and provide a function with the following signature: :obj:`function(examples: Dict[List]) -> Dict[List]` or, if you use indices (:obj:`with_indices=True`): :obj:`function(examples: Dict[List], indices: List[int]) -> Dict[List])`.
+To operate on batches of examples, just set :obj:`batched=True` when calling :func:`datasets.Dataset.map` and provide a function with the following signature: :obj:`function(examples: Dict[List]) -> Dict[List]` or, if you use indices (:obj:`with_indices=True`): :obj:`function(examples: Dict[List], indices: List[int]) -> Dict[List]`.
In other words, the mapped function should accept an input with the format of a slice of the dataset: :obj:`function(dataset[:10])`.
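The batched signature pairs naturally with a fast tokenizer, which is the use case this hunk calls out. A minimal sketch, assuming the ``bert-base-cased`` checkpoint and the MRPC columns from above:

.. code-block:: python

    from datasets import load_dataset
    from transformers import AutoTokenizer

    dataset = load_dataset("glue", "mrpc", split="train")
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

    # With batched=True the function receives a slice of the dataset: a dict
    # mapping each column name to a list of values, and returns a dict of lists.
    def tokenize(examples):
        return tokenizer(examples["sentence1"], truncation=True)

    dataset = dataset.map(tokenize, batched=True)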
@@ -531,11 +531,11 @@ Since the Roberta model is quite large to run on a small laptop CPU, we will res
This concludes our chapter on data processing with 🤗 Datasets (and 🤗 Transformers).
-
Concatenate several datasets
----------------------------
@@ -641,8 +639,7 @@ This is possible thanks to a custom hashing function that works with most python
Fingerprinting
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-The fingerprint of a dataset in a given state is an internal value computed by combining the fingerprint of the previous state and a hash of the latest transform that was applied. (Transforms are all the processing method for transforming a dataset that we listed in this chapter (:func:`datasets.Dataset.map`, :func:`datasets.Dataset.shuffle`, etc)
-The initial fingerprint is computed using a hash of the arrow table, or a hash of the arrow files if the dataset lives on disk.
+The fingerprint of a dataset in a given state is an internal value computed by combining the fingerprint of the previous state and a hash of the latest transform that was applied (transforms are all the processing methods for transforming a dataset that we listed in this chapter: :func:`datasets.Dataset.map`, :func:`datasets.Dataset.shuffle`, etc). The initial fingerprint is computed using a hash of the arrow table, or a hash of the arrow files if the dataset lives on disk.
-The new fingerprint is a combination of the previous fingerprint and the hash of the given transform. For a transform to be hashable, it needs to be picklable using dill or pickle. In particular for :func:`datasets.Dataset.map`, you need to provide a picklable processing method to apply on the dataset so that a determinist fingerprint can be computed by hashing the full state of the provided method (the fingerprint is computed taking into account all the dependencies of the method you provide).
+The new fingerprint is a combination of the previous fingerprint and the hash of the given transform. For a transform to be hashable, it needs to be pickleable using `dill <https://dill.readthedocs.io/en/latest/>`_ or `pickle <https://docs.python.org/3/library/pickle.html>`_. In particular for :func:`datasets.Dataset.map`, you need to provide a pickleable processing method to apply on the dataset so that a deterministic fingerprint can be computed by hashing the full state of the provided method (the fingerprint is computed taking into account all the dependencies of the method you provide).
For non-hashable transforms, a random fingerprint is used and a warning is raised.
Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work.
If you reuse a non-hashable transform, the caching mechanism will consider it to be different from the previous calls and recompute everything.
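To see what the hashing machinery computes for a transform, you can hash a function directly. A minimal sketch, assuming the internal ``Hasher`` helper in ``datasets.fingerprint``:

.. code-block:: python

    from datasets.fingerprint import Hasher

    def add_length(example):
        return {"length": len(example["sentence1"])}

    # A pickleable function hashes to a stable value, so reapplying the same
    # transform can reuse the same cache file across sessions.
    print(Hasher.hash(add_length))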
@@ -669,12 +666,12 @@ It is also possible to disable caching globally with :func:`datasets.set_caching
If the caching is disabled, the library will no longer reload cached dataset files when applying transforms to the datasets.
More precisely, if the caching is disabled:
+
- cache files are always recreated
- cache files are written to a temporary directory that is deleted when the session closes
- cache files are named using a random hash instead of the dataset fingerprint
- use :func:`datasets.Dataset.save_to_disk` to save a transformed dataset or it will be deleted when the session closes
-- caching doesn't affect :func:`datasets.load_dataset`. If you want to regenerate a dataset from scratch you should use
-  the ``download_mode`` parameter in :func:`datasets.load_dataset`.
+- caching doesn't affect :func:`datasets.load_dataset`. If you want to regenerate a dataset from scratch you should use the ``download_mode`` parameter in :func:`datasets.load_dataset`.
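The behaviour described in this hunk can be exercised end to end. A minimal sketch, assuming ``datasets.set_caching_enabled`` (the helper this version exposes; newer releases ship :func:`datasets.disable_caching`) and the string form of ``download_mode``:

.. code-block:: python

    import datasets
    from datasets import load_dataset

    # Disable caching globally: transformed datasets now live in temporary files.
    datasets.set_caching_enabled(False)

    dataset = load_dataset("glue", "mrpc", split="train")
    dataset = dataset.map(lambda example: {"length": len(example["sentence1"])})

    # Persist the transformed dataset before the session ends, or it is deleted.
    dataset.save_to_disk("mrpc_with_lengths")

    # Caching does not affect load_dataset itself; force a rebuild explicitly:
    dataset = load_dataset("glue", "mrpc", split="train",
                           download_mode="force_redownload")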