docs: ✏️ typos and small content edits
severo committed Jul 23, 2021
1 parent 329b0a2 commit 721af09
Showing 1 changed file with 8 additions and 11 deletions.
docs/source/processing.rst: 19 changes (8 additions & 11 deletions)
@@ -362,7 +362,7 @@ To this aim, the :obj:`remove_columns=List[str]` argument can be used and provid

Columns to remove are removed **after** the example has been provided to the mapped function so that the mapped function can use the content of these columns before they are removed.

- Here is an example removing the ``sentence1`` column while adding a ``new_sentence`` column with the content of the ``new_sentence``. Said more simply, we are renaming the ``sentence1`` column as ``new_sentence``:
+ Here is an example removing the ``sentence1`` column while adding a ``new_sentence`` column with the content of the ``sentence1``. Said more simply, we are renaming the ``sentence1`` column as ``new_sentence``:

.. code-block::
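    # Editor's sketch: the body of this example is collapsed in the diff view above.
    # It copies ``sentence1`` into ``new_sentence`` and drops ``sentence1`` in the same call;
    # the column list shown below assumes the GLUE/MRPC columns used earlier on this page.
    >>> updated_dataset = dataset.map(lambda example: {'new_sentence': example['sentence1']}, remove_columns=['sentence1'])
    >>> updated_dataset.column_names
    ['sentence2', 'label', 'idx', 'new_sentence']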
@@ -396,7 +396,7 @@ Processing data in batches

This is particularly interesting if you have a mapped function which can efficiently handle batches of inputs like the tokenizers of the fast `HuggingFace tokenizers library <https://github.com/huggingface/tokenizers>`__.

- To operate on batch of example, just set :obj:`batched=True` when calling :func:`datasets.Dataset.map` and provide a function with the following signature: :obj:`function(examples: Dict[List]) -> Dict[List]` or, if you use indices (:obj:`with_indices=True`): :obj:`function(examples: Dict[List], indices: List[int]) -> Dict[List])`.
+ To operate on batch of examples, just set :obj:`batched=True` when calling :func:`datasets.Dataset.map` and provide a function with the following signature: :obj:`function(examples: Dict[List]) -> Dict[List]` or, if you use indices (:obj:`with_indices=True`): :obj:`function(examples: Dict[List], indices: List[int]) -> Dict[List])`.

In other words, the mapped function should accept an input with the format of a slice of the dataset: :obj:`function(dataset[:10])`.
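
Not part of this commit: a minimal sketch of such a batched function, assuming a 🤗 Transformers fast tokenizer and the ``sentence1`` column used earlier on this page, could look like this:

.. code-block::

    >>> from transformers import AutoTokenizer
    >>> tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
    >>> # each value in ``examples`` is a list covering the whole batch, which the fast tokenizer encodes in one call
    >>> encoded_dataset = dataset.map(lambda examples: tokenizer(examples['sentence1'], truncation=True), batched=True)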

@@ -531,11 +531,11 @@ Since the Roberta model is quite large to run on a small laptop CPU, we will res
... outputs += [sentence] + augmented_sequences
...
... return {'data': outputs}
...
>>> augmented_dataset = smaller_dataset.map(augment_data, batched=True, remove_columns=dataset.column_names, batch_size=8)
>>> len(augmented_dataset)
400
- >>> augmented_dataset[:9]['data']
+ >>> augmented_dataset[:8]['data']
['Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
'Amrozi accused his brother, whom he called " the witness ", of deliberately withholding his evidence.',
'Amrozi accused his brother, whom he called " the witness ", of deliberately suppressing his evidence.',
@@ -573,8 +573,6 @@ You can directly call map, filter, shuffle, and sort directly on a :obj:`dataset
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
}
- This concludes our chapter on data processing with 🤗 Datasets (and 🤗 Transformers).

Concatenate several datasets
----------------------------
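
The body of this section is collapsed in the diff view above. As a rough sketch (not part of this commit), two datasets with identical features can be combined with :func:`datasets.concatenate_datasets`:

.. code-block::

    >>> from datasets import concatenate_datasets
    >>> # assumes dataset1 and dataset2 share the same columns and types
    >>> combined = concatenate_datasets([dataset1, dataset2])
    >>> len(combined) == len(dataset1) + len(dataset2)
    True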

@@ -641,8 +639,7 @@ This is possible thanks to a custom hashing function that works with most python
Fingerprinting
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- The fingerprint of a dataset in a given state is an internal value computed by combining the fingerprint of the previous state and a hash of the latest transform that was applied. (Transforms are all the processing method for transforming a dataset that we listed in this chapter (:func:`datasets.Dataset.map`, :func:`datasets.Dataset.shuffle`, etc)
- The initial fingerprint is computed using a hash of the arrow table, or a hash of the arrow files if the dataset lives on disk.
+ The fingerprint of a dataset in a given state is an internal value computed by combining the fingerprint of the previous state and a hash of the latest transform that was applied (transforms are all the processing methods for transforming a dataset that we listed in this chapter: :func:`datasets.Dataset.map`, :func:`datasets.Dataset.shuffle`, etc). The initial fingerprint is computed using a hash of the arrow table, or a hash of the arrow files if the dataset lives on disk.

For example:

@@ -654,7 +651,7 @@ For example:
>>> print(dataset1._fingerprint, dataset2._fingerprint)
d19493523d95e2dc 5b86abacd4b42434
- The new fingerprint is a combination of the previous fingerprint and the hash of the given transform. For a transform to be hashable, it needs to be picklable using dill or pickle. In particular for :func:`datasets.Dataset.map`, you need to provide a picklable processing method to apply on the dataset so that a determinist fingerprint can be computed by hashing the full state of the provided method (the fingerprint is computed taking into account all the dependencies of the method you provide).
+ The new fingerprint is a combination of the previous fingerprint and the hash of the given transform. For a transform to be hashable, it needs to be pickleable using `dill <https://dill.readthedocs.io/en/latest/>`_ or `pickle <https://docs.python.org/3/library/pickle.html>`_. In particular for :func:`datasets.Dataset.map`, you need to provide a pickleable processing method to apply on the dataset so that a determinist fingerprint can be computed by hashing the full state of the provided method (the fingerprint is computed taking into account all the dependencies of the method you provide).
For non-hashable transform, a random fingerprint is used and a warning is raised.
Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work.
If you reuse a non-hashable transform, the caching mechanism will consider it to be different from the previous calls and recompute everything.
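
As an aside (not part of this commit), one way to check whether a processing function can be hashed deterministically is the library's internal ``Hasher`` helper; the function below is purely illustrative:

.. code-block::

    >>> from datasets.fingerprint import Hasher
    >>> def add_length(example):
    ...     return {'length': len(example['sentence1'])}
    ...
    >>> fingerprint = Hasher.hash(add_length)  # succeeds only if the function is pickleable with dill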
@@ -669,12 +666,12 @@ It is also possible to disable caching globally with :func:`datasets.set_caching

If the caching is disabled, the library will no longer reload cached dataset files when applying transforms to the datasets.
More precisely, if the caching is disabled:

- cache files are always recreated
- cache files are written to a temporary directory that is deleted when session closes
- cache files are named using a random hash instead of the dataset fingerprint
- use :func:`datasets.Dataset.save_to_disk` to save a transformed dataset or it will be deleted when session closes
- - caching doesn't affect :func:`datasets.load_dataset`. If you want to regenerate a dataset from scratch you should use
- the ``download_mode`` parameter in :func:`datasets.load_dataset`.
+ - caching doesn't affect :func:`datasets.load_dataset`. If you want to regenerate a dataset from scratch you should use the ``download_mode`` parameter in :func:`datasets.load_dataset`.

To disable caching you can run:
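
The original snippet is collapsed in this diff view; the sketch below simply calls the :func:`datasets.set_caching_enabled` function referenced above.

.. code-block::

    >>> from datasets import set_caching_enabled
    >>> set_caching_enabled(False)  # disables caching globally for this session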


1 comment on commit 721af09

@github-actions

PyArrow==3.0.0


Benchmark: benchmark_array_xd.json

new / old (diff) per metric:
read_batch_formatted_as_numpy after write_array2d: 0.009992 / 0.011353 (-0.001361)
read_batch_formatted_as_numpy after write_flattened_sequence: 0.004106 / 0.011008 (-0.006903)
read_batch_formatted_as_numpy after write_nested_sequence: 0.035962 / 0.038508 (-0.002546)
read_batch_unformated after write_array2d: 0.039883 / 0.023109 (0.016774)
read_batch_unformated after write_flattened_sequence: 0.353014 / 0.275898 (0.077116)
read_batch_unformated after write_nested_sequence: 0.380789 / 0.323480 (0.057309)
read_col_formatted_as_numpy after write_array2d: 0.008727 / 0.007986 (0.000741)
read_col_formatted_as_numpy after write_flattened_sequence: 0.005152 / 0.004328 (0.000824)
read_col_formatted_as_numpy after write_nested_sequence: 0.010330 / 0.004250 (0.006080)
read_col_unformated after write_array2d: 0.044618 / 0.037052 (0.007566)
read_col_unformated after write_flattened_sequence: 0.350191 / 0.258489 (0.091701)
read_col_unformated after write_nested_sequence: 0.385611 / 0.293841 (0.091770)
read_formatted_as_numpy after write_array2d: 0.025790 / 0.128546 (-0.102756)
read_formatted_as_numpy after write_flattened_sequence: 0.008841 / 0.075646 (-0.066805)
read_formatted_as_numpy after write_nested_sequence: 0.289390 / 0.419271 (-0.129881)
read_unformated after write_array2d: 0.052324 / 0.043533 (0.008791)
read_unformated after write_flattened_sequence: 0.347933 / 0.255139 (0.092794)
read_unformated after write_nested_sequence: 0.369767 / 0.283200 (0.086567)
write_array2d: 0.094021 / 0.141683 (-0.047662)
write_flattened_sequence: 1.867924 / 1.452155 (0.415769)
write_nested_sequence: 1.865563 / 1.492716 (0.372846)

Benchmark: benchmark_getitem_100B.json

new / old (diff) per metric:
get_batch_of_1024_random_rows: 0.014184 / 0.018006 (-0.003822)
get_batch_of_1024_rows: 0.475498 / 0.000490 (0.475009)
get_first_row: 0.003472 / 0.000200 (0.003272)
get_last_row: 0.000078 / 0.000054 (0.000023)

Benchmark: benchmark_indices_mapping.json

new / old (diff) per metric:
select: 0.042120 / 0.037411 (0.004709)
shard: 0.025803 / 0.014526 (0.011277)
shuffle: 0.029022 / 0.176557 (-0.147535)
sort: 0.143695 / 0.737135 (-0.593440)
train_test_split: 0.030383 / 0.296338 (-0.265956)

Benchmark: benchmark_iterating.json

new / old (diff) per metric:
read 5000: 0.414951 / 0.215209 (0.199742)
read 50000: 4.100894 / 2.077655 (2.023240)
read_batch 50000 10: 2.044853 / 1.504120 (0.540733)
read_batch 50000 100: 1.830762 / 1.541195 (0.289567)
read_batch 50000 1000: 1.866089 / 1.468490 (0.397599)
read_formatted numpy 5000: 0.353101 / 4.584777 (-4.231676)
read_formatted pandas 5000: 5.104265 / 3.745712 (1.358553)
read_formatted tensorflow 5000: 4.741792 / 5.269862 (-0.528070)
read_formatted torch 5000: 1.708775 / 4.565676 (-2.856902)
read_formatted_batch numpy 5000 10: 0.041719 / 0.424275 (-0.382556)
read_formatted_batch numpy 5000 1000: 0.006199 / 0.007607 (-0.001408)
shuffled read 5000: 0.527401 / 0.226044 (0.301357)
shuffled read 50000: 5.317721 / 2.268929 (3.048793)
shuffled read_batch 50000 10: 2.589681 / 55.444624 (-52.854943)
shuffled read_batch 50000 100: 2.146349 / 6.876477 (-4.730128)
shuffled read_batch 50000 1000: 2.160564 / 2.142072 (0.018491)
shuffled read_formatted numpy 5000: 0.473972 / 4.805227 (-4.331255)
shuffled read_formatted_batch numpy 5000 10: 0.110462 / 6.500664 (-6.390202)
shuffled read_formatted_batch numpy 5000 1000: 0.059266 / 0.075469 (-0.016204)

Benchmark: benchmark_map_filter.json

new / old (diff) per metric:
filter: 13.986707 / 1.841788 (12.144919)
map fast-tokenizer batched: 14.622879 / 8.074308 (6.548571)
map identity: 29.908833 / 10.191392 (19.717441)
map identity batched: 0.850423 / 0.680424 (0.169999)
map no-op batched: 0.595247 / 0.534201 (0.061046)
map no-op batched numpy: 0.255807 / 0.579283 (-0.323476)
map no-op batched pandas: 0.559372 / 0.434364 (0.125008)
map no-op batched pytorch: 0.199113 / 0.540337 (-0.341225)
map no-op batched tensorflow: 1.047718 / 1.386936 (-0.339218)
PyArrow==latest

Benchmark: benchmark_array_xd.json

new / old (diff) per metric:
read_batch_formatted_as_numpy after write_array2d: 0.009905 / 0.011353 (-0.001448)
read_batch_formatted_as_numpy after write_flattened_sequence: 0.003966 / 0.011008 (-0.007042)
read_batch_formatted_as_numpy after write_nested_sequence: 0.036010 / 0.038508 (-0.002498)
read_batch_unformated after write_array2d: 0.040063 / 0.023109 (0.016954)
read_batch_unformated after write_flattened_sequence: 0.336213 / 0.275898 (0.060315)
read_batch_unformated after write_nested_sequence: 0.375605 / 0.323480 (0.052125)
read_col_formatted_as_numpy after write_array2d: 0.008491 / 0.007986 (0.000506)
read_col_formatted_as_numpy after write_flattened_sequence: 0.005074 / 0.004328 (0.000745)
read_col_formatted_as_numpy after write_nested_sequence: 0.010060 / 0.004250 (0.005809)
read_col_unformated after write_array2d: 0.043739 / 0.037052 (0.006686)
read_col_unformated after write_flattened_sequence: 0.334292 / 0.258489 (0.075803)
read_col_unformated after write_nested_sequence: 0.374242 / 0.293841 (0.080402)
read_formatted_as_numpy after write_array2d: 0.025580 / 0.128546 (-0.102966)
read_formatted_as_numpy after write_flattened_sequence: 0.008889 / 0.075646 (-0.066757)
read_formatted_as_numpy after write_nested_sequence: 0.289091 / 0.419271 (-0.130181)
read_unformated after write_array2d: 0.052406 / 0.043533 (0.008873)
read_unformated after write_flattened_sequence: 0.341014 / 0.255139 (0.085875)
read_unformated after write_nested_sequence: 0.364056 / 0.283200 (0.080856)
write_array2d: 0.089132 / 0.141683 (-0.052551)
write_flattened_sequence: 1.829669 / 1.452155 (0.377514)
write_nested_sequence: 1.838078 / 1.492716 (0.345362)

Benchmark: benchmark_getitem_100B.json

new / old (diff) per metric:
get_batch_of_1024_random_rows: 0.057872 / 0.018006 (0.039866)
get_batch_of_1024_rows: 0.473829 / 0.000490 (0.473340)
get_first_row: 0.029401 / 0.000200 (0.029201)
get_last_row: 0.000405 / 0.000054 (0.000350)

Benchmark: benchmark_indices_mapping.json

new / old (diff) per metric:
select: 0.041655 / 0.037411 (0.004244)
shard: 0.026674 / 0.014526 (0.012148)
shuffle: 0.029720 / 0.176557 (-0.146837)
sort: 0.146442 / 0.737135 (-0.590693)
train_test_split: 0.031469 / 0.296338 (-0.264869)

Benchmark: benchmark_iterating.json

new / old (diff) per metric:
read 5000: 0.394637 / 0.215209 (0.179428)
read 50000: 3.974200 / 2.077655 (1.896545)
read_batch 50000 10: 2.052238 / 1.504120 (0.548118)
read_batch 50000 100: 1.852734 / 1.541195 (0.311540)
read_batch 50000 1000: 1.890648 / 1.468490 (0.422158)
read_formatted numpy 5000: 0.347110 / 4.584777 (-4.237667)
read_formatted pandas 5000: 5.068504 / 3.745712 (1.322792)
read_formatted tensorflow 5000: 4.408059 / 5.269862 (-0.861803)
read_formatted torch 5000: 1.634636 / 4.565676 (-2.931040)
read_formatted_batch numpy 5000 10: 0.039790 / 0.424275 (-0.384485)
read_formatted_batch numpy 5000 1000: 0.005549 / 0.007607 (-0.002058)
shuffled read 5000: 0.520871 / 0.226044 (0.294827)
shuffled read 50000: 5.189787 / 2.268929 (2.920858)
shuffled read_batch 50000 10: 2.526393 / 55.444624 (-52.918232)
shuffled read_batch 50000 100: 2.145414 / 6.876477 (-4.731063)
shuffled read_batch 50000 1000: 2.164252 / 2.142072 (0.022179)
shuffled read_formatted numpy 5000: 0.487811 / 4.805227 (-4.317416)
shuffled read_formatted_batch numpy 5000 10: 0.111414 / 6.500664 (-6.389250)
shuffled read_formatted_batch numpy 5000 1000: 0.057479 / 0.075469 (-0.017991)

Benchmark: benchmark_map_filter.json

new / old (diff) per metric:
filter: 14.371215 / 1.841788 (12.529428)
map fast-tokenizer batched: 14.533559 / 8.074308 (6.459251)
map identity: 29.810827 / 10.191392 (19.619435)
map identity batched: 0.913046 / 0.680424 (0.232623)
map no-op batched: 0.621684 / 0.534201 (0.087483)
map no-op batched numpy: 0.257075 / 0.579283 (-0.322208)
map no-op batched pandas: 0.568349 / 0.434364 (0.133985)
map no-op batched pytorch: 0.194392 / 0.540337 (-0.345945)
map no-op batched tensorflow: 1.011431 / 1.386936 (-0.375505)
