
Commit 721af09

docs: ✏️ typos and small content edits
1 parent 329b0a2 commit 721af09

File tree

1 file changed: +8 -11 lines changed

docs/source/processing.rst

Lines changed: 8 additions & 11 deletions
@@ -362,7 +362,7 @@ To this aim, the :obj:`remove_columns=List[str]` argument can be used and provid

 Columns to remove are removed **after** the example has been provided to the mapped function so that the mapped function can use the content of these columns before they are removed.

-Here is an example removing the ``sentence1`` column while adding a ``new_sentence`` column with the content of the ``new_sentence``. Said more simply, we are renaming the ``sentence1`` column as ``new_sentence``:
+Here is an example removing the ``sentence1`` column while adding a ``new_sentence`` column with the content of the ``sentence1``. Said more simply, we are renaming the ``sentence1`` column as ``new_sentence``:

 .. code-block::
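
A minimal sketch of the renaming pattern this hunk describes, assuming the GLUE/MRPC dataset used elsewhere in the chapter:

>>> from datasets import load_dataset
>>> dataset = load_dataset('glue', 'mrpc', split='train')
>>> # the mapped function can still read 'sentence1' before remove_columns drops it
>>> dataset = dataset.map(lambda example: {'new_sentence': example['sentence1']}, remove_columns=['sentence1'])
>>> 'new_sentence' in dataset.column_names
True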
@@ -396,7 +396,7 @@ Processing data in batches

 This is particularly interesting if you have a mapped function which can efficiently handle batches of inputs like the tokenizers of the fast `HuggingFace tokenizers library <https://github.com/huggingface/tokenizers>`__.

-To operate on batch of example, just set :obj:`batched=True` when calling :func:`datasets.Dataset.map` and provide a function with the following signature: :obj:`function(examples: Dict[List]) -> Dict[List]` or, if you use indices (:obj:`with_indices=True`): :obj:`function(examples: Dict[List], indices: List[int]) -> Dict[List])`.
+To operate on batches of examples, just set :obj:`batched=True` when calling :func:`datasets.Dataset.map` and provide a function with the following signature: :obj:`function(examples: Dict[List]) -> Dict[List]` or, if you use indices (:obj:`with_indices=True`): :obj:`function(examples: Dict[List], indices: List[int]) -> Dict[List]`.

 In other words, the mapped function should accept an input with the format of a slice of the dataset: :obj:`function(dataset[:10])`.
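
A minimal sketch of this batched signature, assuming the same dataset as above and a fast tokenizer from 🤗 Transformers:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
>>> # examples['sentence1'] is a list of strings; the tokenizer returns a dict of lists
>>> encoded_dataset = dataset.map(lambda examples: tokenizer(examples['sentence1']), batched=True)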

@@ -531,11 +531,11 @@ Since the Roberta model is quite large to run on a small laptop CPU, we will res
 ...         outputs += [sentence] + augmented_sequences
 ...
 ...     return {'data': outputs}
-...
+
 >>> augmented_dataset = smaller_dataset.map(augment_data, batched=True, remove_columns=dataset.column_names, batch_size=8)
 >>> len(augmented_dataset)
 400
->>> augmented_dataset[:9]['data']
+>>> augmented_dataset[:8]['data']
 ['Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'Amrozi accused his brother, whom he called " the witness ", of deliberately withholding his evidence.',
 'Amrozi accused his brother, whom he called " the witness ", of deliberately suppressing his evidence.',
@@ -573,8 +573,6 @@ You can directly call map, filter, shuffle, and sort directly on a :obj:`dataset
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
 }

-This concludes our chapter on data processing with 🤗 Datasets (and 🤗 Transformers).
-
 Concatenate several datasets
 ----------------------------
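
The section opened by this hunk covers :func:`datasets.concatenate_datasets`; a minimal sketch of its use, with placeholder dataset names:

>>> from datasets import concatenate_datasets
>>> # the datasets must share the same column types (features)
>>> combined = concatenate_datasets([dataset1, dataset2])
>>> len(combined) == len(dataset1) + len(dataset2)
True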

@@ -641,8 +639,7 @@ This is possible thanks to a custom hashing function that works with most python
 Fingerprinting
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-The fingerprint of a dataset in a given state is an internal value computed by combining the fingerprint of the previous state and a hash of the latest transform that was applied. (Transforms are all the processing method for transforming a dataset that we listed in this chapter (:func:`datasets.Dataset.map`, :func:`datasets.Dataset.shuffle`, etc)
-The initial fingerprint is computed using a hash of the arrow table, or a hash of the arrow files if the dataset lives on disk.
+The fingerprint of a dataset in a given state is an internal value computed by combining the fingerprint of the previous state and a hash of the latest transform that was applied (transforms are all the processing methods for transforming a dataset that we listed in this chapter: :func:`datasets.Dataset.map`, :func:`datasets.Dataset.shuffle`, etc). The initial fingerprint is computed using a hash of the arrow table, or a hash of the arrow files if the dataset lives on disk.

 For example:
@@ -654,7 +651,7 @@ For example:
 >>> print(dataset1._fingerprint, dataset2._fingerprint)
 d19493523d95e2dc 5b86abacd4b42434

-The new fingerprint is a combination of the previous fingerprint and the hash of the given transform. For a transform to be hashable, it needs to be picklable using dill or pickle. In particular for :func:`datasets.Dataset.map`, you need to provide a picklable processing method to apply on the dataset so that a determinist fingerprint can be computed by hashing the full state of the provided method (the fingerprint is computed taking into account all the dependencies of the method you provide).
+The new fingerprint is a combination of the previous fingerprint and the hash of the given transform. For a transform to be hashable, it needs to be pickleable using `dill <https://dill.readthedocs.io/en/latest/>`_ or `pickle <https://docs.python.org/3/library/pickle.html>`_. In particular for :func:`datasets.Dataset.map`, you need to provide a pickleable processing method to apply on the dataset so that a deterministic fingerprint can be computed by hashing the full state of the provided method (the fingerprint is computed taking into account all the dependencies of the method you provide).
 For a non-hashable transform, a random fingerprint is used and a warning is raised.
 Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work.
 If you reuse a non-hashable transform, the caching mechanism will consider it to be different from the previous calls and recompute everything.
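
As a sketch of the distinction, a plain top-level function pickles cleanly and so yields a deterministic fingerprint (the function name and column here are illustrative):

>>> def add_prefix(example):
...     return {'sentence1': 'My sentence: ' + example['sentence1']}
>>> updated_dataset = dataset.map(add_prefix)  # pickleable transform: deterministic fingerprint, cacheable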
@@ -669,12 +666,12 @@ It is also possible to disable caching globally with :func:`datasets.set_caching

 If the caching is disabled, the library will no longer reload cached dataset files when applying transforms to the datasets.
 More precisely, if the caching is disabled:
+
 - cache files are always recreated
 - cache files are written to a temporary directory that is deleted when session closes
 - cache files are named using a random hash instead of the dataset fingerprint
 - use :func:`datasets.Dataset.save_to_disk` to save a transformed dataset or it will be deleted when session closes
-- caching doesn't affect :func:`datasets.load_dataset`. If you want to regenerate a dataset from scratch you should use
-  the ``download_mode`` parameter in :func:`datasets.load_dataset`.
+- caching doesn't affect :func:`datasets.load_dataset`. If you want to regenerate a dataset from scratch you should use the ``download_mode`` parameter in :func:`datasets.load_dataset`.

 To disable caching you can run:
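
The snippet following "To disable caching you can run:" falls outside this hunk; given the :func:`datasets.set_caching_enabled` function named in the hunk header, it is presumably along these lines:

>>> from datasets import set_caching_enabled
>>> set_caching_enabled(False)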
