
Commit 329b0a2

docs: ✏️ code snippets format
1 parent ff0c4b0 commit 329b0a2

File tree

1 file changed: +6 -6 lines changed


docs/source/processing.rst

Lines changed: 6 additions & 6 deletions
@@ -304,7 +304,7 @@ Let's add a prefix ``'My sentence: '`` to each ``sentence1`` value in our small
     >>> def add_prefix(example):
     ...     example['sentence1'] = 'My sentence: ' + example['sentence1']
     ...     return example
-    ...
+
     >>> updated_dataset = small_dataset.map(add_prefix)
     >>> updated_dataset['sentence1'][:5]
     ['My sentence: Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
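For anyone reading this hunk out of context, the per-example pattern above can be tried end to end with a toy dataset. This is only a sketch: the two-row dataset built with ``Dataset.from_dict`` is an assumption standing in for the MRPC split used in the guide.

    from datasets import Dataset

    # Toy stand-in for the small MRPC dataset used in the guide (assumption).
    small_dataset = Dataset.from_dict({"sentence1": ["He went home .", "She stayed ."]})

    def add_prefix(example):
        # map() passes one example (a dict) at a time; the keys of the
        # returned dict overwrite or extend the corresponding columns.
        example["sentence1"] = "My sentence: " + example["sentence1"]
        return example

    updated_dataset = small_dataset.map(add_prefix)
    print(updated_dataset["sentence1"])
    # ['My sentence: He went home .', 'My sentence: She stayed .']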
@@ -485,7 +485,7 @@ We will also remove all the columns of the dataset and only keep the chunks in o
     ...     for sentence in examples['sentence1']:
     ...         chunks += [sentence[i:i + 50] for i in range(0, len(sentence), 50)]
     ...     return {'chunks': chunks}
-    ...
+
     >>> chunked_dataset = dataset.map(chunk_examples, batched=True, remove_columns=dataset.column_names)
     >>> chunked_dataset
     Dataset(schema: {'chunks': 'string'}, num_rows: 10470)
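A self-contained version of the chunking snippet, since the hunk only shows part of the function. The two toy strings are assumptions; the 50-character chunk size comes from the diff above.

    from datasets import Dataset

    # Toy stand-in for the dataset used in the guide (assumption).
    dataset = Dataset.from_dict({"sentence1": ["a" * 120, "b" * 75]})

    def chunk_examples(examples):
        # With batched=True, `examples` maps each column name to a list of
        # values, and the function may return more (or fewer) rows than it got.
        chunks = []
        for sentence in examples["sentence1"]:
            chunks += [sentence[i:i + 50] for i in range(0, len(sentence), 50)]
        return {"chunks": chunks}

    chunked_dataset = dataset.map(
        chunk_examples, batched=True, remove_columns=dataset.column_names
    )
    print(chunked_dataset.num_rows)  # 5 chunks produced from 2 input rows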
@@ -607,7 +607,6 @@ Saving a dataset creates a directory with various files:
 .. code-block::
 
     >>> encoded_dataset.save_to_disk("path/of/my/dataset/directory")
-    >>> ...
     >>> from datasets import load_from_disk
     >>> reloaded_encoded_dataset = load_from_disk("path/of/my/dataset/directory")
 
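The save/reload round trip touched by this hunk, written out as a runnable sketch; the toy dataset and the temporary directory are assumptions standing in for the encoded dataset and path used in the guide.

    import tempfile
    from datasets import Dataset, load_from_disk

    # Toy stand-in for the encoded dataset from the guide (assumption).
    encoded_dataset = Dataset.from_dict({"a": [0, 1, 2]})

    with tempfile.TemporaryDirectory() as path:
        # save_to_disk() writes the Arrow data plus metadata files into the directory.
        encoded_dataset.save_to_disk(path)

        # load_from_disk() reconstructs the dataset from that directory later on.
        reloaded_encoded_dataset = load_from_disk(path)
        assert reloaded_encoded_dataset.num_rows == encoded_dataset.num_rows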
@@ -695,17 +694,18 @@ In a distributed setting, you may use caching and a :func:`torch.distributed.barrier`:
 
     >>> from datasets import Dataset
     >>> import torch.distributed
-    >>>
+
     >>> dataset1 = Dataset.from_dict({"a": [0, 1, 2]})
-    >>>
+
     >>> if training_args.local_rank > 0:
     ...     print("Waiting for main process to perform the mapping")
     ...     torch.distributed.barrier()
-    >>>
+
     >>> dataset2 = dataset1.map(lambda x: {"a": x["a"] + 1})
     >>>
     >>> if training_args.local_rank == 0:
     ...     print("Loading results from main process")
     ...     torch.distributed.barrier()
+
 
When a process encounters a barrier, it stops until all other processes have reached the same barrier. The non-main processes reach the barrier first, before the mapping, and wait there. The main process creates the cache for the processed dataset, then reaches the barrier, at which point the other processes resume and load the cache instead of performing the processing themselves.
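For completeness, here is a self-contained sketch of that barrier pattern. It swaps the guide's ``training_args.local_rank`` for ``torch.distributed.get_rank()`` (an assumption that holds on a single node) and assumes the process group has already been initialised, e.g. with ``torch.distributed.init_process_group``.

    import torch.distributed
    from datasets import Dataset

    # Assumes torch.distributed.init_process_group(...) has already run;
    # on a single node the global rank stands in for training_args.local_rank.
    rank = torch.distributed.get_rank()

    dataset1 = Dataset.from_dict({"a": [0, 1, 2]})

    if rank > 0:
        # Non-main processes wait here while the main process builds the cache.
        print("Waiting for main process to perform the mapping")
        torch.distributed.barrier()

    # Rank 0 computes and writes the cache; the other ranks, released by the
    # barrier below, reuse that cache instead of recomputing the map.
    dataset2 = dataset1.map(lambda x: {"a": x["a"] + 1})

    if rank == 0:
        # The cache now exists, so let the waiting processes continue.
        print("Loading results from main process")
        torch.distributed.barrier()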
