
Commit

docs: ✏️ add an empty line so that the copy/pasted code is OK
severo committed Jul 22, 2021
1 parent 6bd0b16 commit 4dc7a76
Showing 1 changed file with 1 addition and 0 deletions.
1 change: 1 addition & 0 deletions docs/source/exploring.rst
@@ -239,6 +239,7 @@ Here is an example to tokenize and pad tokens on-the-fly when accessing the samp
>>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
>>> def encode(batch):
>>>     return tokenizer(batch["sentence1"], padding="longest", truncation=True, max_length=512, return_tensors="pt")
>>>
>>> dataset.set_transform(encode)
>>> dataset.format
{'type': 'custom', 'format_kwargs': {'transform': <function __main__.encode(batch)>}, 'columns': ['idx', 'label', 'sentence1', 'sentence2'], 'output_all_columns': False}
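
A minimal sketch of why the added empty line can matter, assuming the snippet is pasted line by line into the classic interactive interpreter: a `def` block stays open until a blank line is entered, so a statement that follows it immediately is rejected. The `QuietConsole` class and `paste` helper are hypothetical names used only for this illustration, and the `encode` body is simplified so that no tokenizer is needed.

```python
import code


class QuietConsole(code.InteractiveConsole):
    """Interactive console that collects error text instead of printing it."""

    def __init__(self):
        self.namespace = {}
        self.errors = []
        super().__init__(locals=self.namespace)

    def write(self, data):
        # Called by the console when it reports a syntax error or traceback.
        self.errors.append(data)


def paste(lines):
    """Feed lines one at a time, as a paste into the REPL would."""
    console = QuietConsole()
    for line in lines:
        console.push(line)
    return console.namespace


# Without the blank line, the statement after the function body is a syntax error.
without_blank = paste([
    "def encode(batch):",
    "    return batch",
    "result = encode([1, 2])",
])

# With the blank line, the function definition is closed before the next statement.
with_blank = paste([
    "def encode(batch):",
    "    return batch",
    "",
    "result = encode([1, 2])",
])

print("result" in without_blank)  # False
print(with_blank.get("result"))   # [1, 2]
```

Whether this is the exact failure the commit addresses is an assumption; the commit message only says the pasted code should be OK.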

1 comment on commit 4dc7a76

@github-actions


PyArrow==3.0.0


Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
| --- | --- |
| read_batch_formatted_as_numpy after write_array2d | 0.008515 / 0.011353 (-0.002838) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.003578 / 0.011008 (-0.007431) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.032733 / 0.038508 (-0.005775) |
| read_batch_unformated after write_array2d | 0.034951 / 0.023109 (0.011842) |
| read_batch_unformated after write_flattened_sequence | 0.286002 / 0.275898 (0.010104) |
| read_batch_unformated after write_nested_sequence | 0.317653 / 0.323480 (-0.005827) |
| read_col_formatted_as_numpy after write_array2d | 0.007684 / 0.007986 (-0.000302) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004814 / 0.004328 (0.000485) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.009132 / 0.004250 (0.004882) |
| read_col_unformated after write_array2d | 0.040141 / 0.037052 (0.003089) |
| read_col_unformated after write_flattened_sequence | 0.290102 / 0.258489 (0.031613) |
| read_col_unformated after write_nested_sequence | 0.324546 / 0.293841 (0.030705) |
| read_formatted_as_numpy after write_array2d | 0.022647 / 0.128546 (-0.105899) |
| read_formatted_as_numpy after write_flattened_sequence | 0.007763 / 0.075646 (-0.067883) |
| read_formatted_as_numpy after write_nested_sequence | 0.269254 / 0.419271 (-0.150017) |
| read_unformated after write_array2d | 0.045080 / 0.043533 (0.001547) |
| read_unformated after write_flattened_sequence | 0.297767 / 0.255139 (0.042628) |
| read_unformated after write_nested_sequence | 0.313951 / 0.283200 (0.030751) |
| write_array2d | 0.081180 / 0.141683 (-0.060503) |
| write_flattened_sequence | 1.660462 / 1.452155 (0.208307) |
| write_nested_sequence | 1.686759 / 1.492716 (0.194042) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
| --- | --- |
| get_batch_of_1024_random_rows | 0.014889 / 0.018006 (-0.003117) |
| get_batch_of_1024_rows | 0.505589 / 0.000490 (0.505099) |
| get_first_row | 0.002016 / 0.000200 (0.001816) |
| get_last_row | 0.000262 / 0.000054 (0.000208) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
| --- | --- |
| select | 0.035288 / 0.037411 (-0.002123) |
| shard | 0.022906 / 0.014526 (0.008380) |
| shuffle | 0.024471 / 0.176557 (-0.152086) |
| sort | 0.125269 / 0.737135 (-0.611867) |
| train_test_split | 0.025569 / 0.296338 (-0.270770) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
| --- | --- |
| read 5000 | 0.341855 / 0.215209 (0.126646) |
| read 50000 | 3.446258 / 2.077655 (1.368603) |
| read_batch 50000 10 | 1.716801 / 1.504120 (0.212681) |
| read_batch 50000 100 | 1.537518 / 1.541195 (-0.003677) |
| read_batch 50000 1000 | 1.551105 / 1.468490 (0.082615) |
| read_formatted numpy 5000 | 0.308429 / 4.584777 (-4.276347) |
| read_formatted pandas 5000 | 4.281022 / 3.745712 (0.535310) |
| read_formatted tensorflow 5000 | 2.808004 / 5.269862 (-2.461858) |
| read_formatted torch 5000 | 1.016796 / 4.565676 (-3.548880) |
| read_formatted_batch numpy 5000 10 | 0.036438 / 0.424275 (-0.387837) |
| read_formatted_batch numpy 5000 1000 | 0.005426 / 0.007607 (-0.002181) |
| shuffled read 5000 | 0.446733 / 0.226044 (0.220688) |
| shuffled read 50000 | 4.473051 / 2.268929 (2.204123) |
| shuffled read_batch 50000 10 | 2.155401 / 55.444624 (-53.289223) |
| shuffled read_batch 50000 100 | 1.815793 / 6.876477 (-5.060684) |
| shuffled read_batch 50000 1000 | 1.831784 / 2.142072 (-0.310288) |
| shuffled read_formatted numpy 5000 | 0.415239 / 4.805227 (-4.389989) |
| shuffled read_formatted_batch numpy 5000 10 | 0.095325 / 6.500664 (-6.405339) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.049656 / 0.075469 (-0.025813) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
| --- | --- |
| filter | 12.788965 / 1.841788 (10.947177) |
| map fast-tokenizer batched | 12.545688 / 8.074308 (4.471380) |
| map identity | 27.038605 / 10.191392 (16.847213) |
| map identity batched | 0.743792 / 0.680424 (0.063368) |
| map no-op batched | 0.518637 / 0.534201 (-0.015564) |
| map no-op batched numpy | 0.225462 / 0.579283 (-0.353821) |
| map no-op batched pandas | 0.473270 / 0.434364 (0.038906) |
| map no-op batched pytorch | 0.173396 / 0.540337 (-0.366942) |
| map no-op batched tensorflow | 0.883558 / 1.386936 (-0.503378) |

PyArrow==latest

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
| --- | --- |
| read_batch_formatted_as_numpy after write_array2d | 0.008169 / 0.011353 (-0.003184) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.003330 / 0.011008 (-0.007679) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.030736 / 0.038508 (-0.007772) |
| read_batch_unformated after write_array2d | 0.035131 / 0.023109 (0.012021) |
| read_batch_unformated after write_flattened_sequence | 0.293379 / 0.275898 (0.017481) |
| read_batch_unformated after write_nested_sequence | 0.320585 / 0.323480 (-0.002895) |
| read_col_formatted_as_numpy after write_array2d | 0.007185 / 0.007986 (-0.000800) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.003489 / 0.004328 (-0.000839) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.008674 / 0.004250 (0.004423) |
| read_col_unformated after write_array2d | 0.039339 / 0.037052 (0.002286) |
| read_col_unformated after write_flattened_sequence | 0.300011 / 0.258489 (0.041522) |
| read_col_unformated after write_nested_sequence | 0.322638 / 0.293841 (0.028797) |
| read_formatted_as_numpy after write_array2d | 0.022315 / 0.128546 (-0.106231) |
| read_formatted_as_numpy after write_flattened_sequence | 0.007728 / 0.075646 (-0.067919) |
| read_formatted_as_numpy after write_nested_sequence | 0.252521 / 0.419271 (-0.166751) |
| read_unformated after write_array2d | 0.044491 / 0.043533 (0.000958) |
| read_unformated after write_flattened_sequence | 0.292299 / 0.255139 (0.037160) |
| read_unformated after write_nested_sequence | 0.335712 / 0.283200 (0.052512) |
| write_array2d | 0.078912 / 0.141683 (-0.062771) |
| write_flattened_sequence | 1.572351 / 1.452155 (0.120196) |
| write_nested_sequence | 1.603646 / 1.492716 (0.110929) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
| --- | --- |
| get_batch_of_1024_random_rows | 0.027713 / 0.018006 (0.009707) |
| get_batch_of_1024_rows | 0.512703 / 0.000490 (0.512213) |
| get_first_row | 0.010441 / 0.000200 (0.010241) |
| get_last_row | 0.000257 / 0.000054 (0.000203) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
| --- | --- |
| select | 0.035409 / 0.037411 (-0.002002) |
| shard | 0.023115 / 0.014526 (0.008589) |
| shuffle | 0.024140 / 0.176557 (-0.152416) |
| sort | 0.125993 / 0.737135 (-0.611142) |
| train_test_split | 0.025867 / 0.296338 (-0.270471) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
| --- | --- |
| read 5000 | 0.341183 / 0.215209 (0.125974) |
| read 50000 | 3.406799 / 2.077655 (1.329144) |
| read_batch 50000 10 | 1.711196 / 1.504120 (0.207076) |
| read_batch 50000 100 | 1.532619 / 1.541195 (-0.008576) |
| read_batch 50000 1000 | 1.545853 / 1.468490 (0.077363) |
| read_formatted numpy 5000 | 0.307142 / 4.584777 (-4.277635) |
| read_formatted pandas 5000 | 4.306413 / 3.745712 (0.560701) |
| read_formatted tensorflow 5000 | 2.785971 / 5.269862 (-2.483890) |
| read_formatted torch 5000 | 1.014053 / 4.565676 (-3.551623) |
| read_formatted_batch numpy 5000 10 | 0.036631 / 0.424275 (-0.387644) |
| read_formatted_batch numpy 5000 1000 | 0.005191 / 0.007607 (-0.002416) |
| shuffled read 5000 | 0.450730 / 0.226044 (0.224686) |
| shuffled read 50000 | 4.504761 / 2.268929 (2.235833) |
| shuffled read_batch 50000 10 | 2.181341 / 55.444624 (-53.263283) |
| shuffled read_batch 50000 100 | 1.861221 / 6.876477 (-5.015256) |
| shuffled read_batch 50000 1000 | 1.849858 / 2.142072 (-0.292215) |
| shuffled read_formatted numpy 5000 | 0.423197 / 4.805227 (-4.382030) |
| shuffled read_formatted_batch numpy 5000 10 | 0.098642 / 6.500664 (-6.402022) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.052184 / 0.075469 (-0.023285) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
| --- | --- |
| filter | 12.736907 / 1.841788 (10.895120) |
| map fast-tokenizer batched | 12.866288 / 8.074308 (4.791980) |
| map identity | 26.379690 / 10.191392 (16.188298) |
| map identity batched | 0.727453 / 0.680424 (0.047029) |
| map no-op batched | 0.495246 / 0.534201 (-0.038955) |
| map no-op batched numpy | 0.225083 / 0.579283 (-0.354200) |
| map no-op batched pandas | 0.471778 / 0.434364 (0.037414) |
| map no-op batched pytorch | 0.174351 / 0.540337 (-0.365987) |
| map no-op batched tensorflow | 0.880146 / 1.386936 (-0.506790) |
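
For readers who want to work with these figures, here is a rough sketch of how one `new / old (diff)` cell can be read, assuming `diff` is simply `new - old` (which matches the reported values up to rounding in the last digit). The helper names `parse_cell` and `relative_change` are hypothetical and not part of the benchmark tooling.

```python
import re

# One cell looks like "0.008515 / 0.011353 (-0.002838)".
CELL = re.compile(r"(\d+\.\d+) / (\d+\.\d+) \((-?\d+\.\d+)\)")


def parse_cell(cell):
    """Split a 'new / old (diff)' cell into three floats."""
    match = CELL.fullmatch(cell.strip())
    if match is None:
        raise ValueError(f"unrecognised cell: {cell!r}")
    new, old, diff = (float(group) for group in match.groups())
    return new, old, diff


def relative_change(new, old):
    """Change of the new value relative to the old baseline."""
    return (new - old) / old


new, old, diff = parse_cell("0.008515 / 0.011353 (-0.002838)")

# The reported diff agrees with new - old within rounding.
assert abs((new - old) - diff) < 1e-5

print(f"{relative_change(new, old):+.1%}")  # -25.0%
```

A negative diff means the new value is lower than the old baseline for that metric.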

