Update Modin join benchmark to current state #162

Open · wants to merge 2 commits into master

Conversation

gshimansky

I updated the Modin implementation of the join benchmark to the current state. Most of the code is copied from the Pandas version, but there are some differences; a sketch of the typical porting pattern follows.
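For context, Modin is designed as a drop-in replacement for pandas, so a port like this mostly keeps the pandas-style code and swaps the import. A minimal sketch of that pattern, illustrative only — the engine choice and file names are assumptions, not the actual diff in this PR:

    import os
    os.environ.setdefault("MODIN_ENGINE", "ray")  # assumed engine; Modin also supports Dask

    import modin.pandas as pd  # drop-in for `import pandas as pd`

    # hypothetical file names, following the db-benchmark J1_* naming scheme
    x = pd.read_csv("J1_1e7_NA_0_0.csv")
    small = pd.read_csv("J1_1e7_1e1_0_0.csv")

    ans = x.merge(small, on="id1")  # same pandas merge API, executed by Modin's engine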

@jangorecki
Contributor

Thank you for contributing this script. I am now running the Modin join benchmark and will report back when it finishes.

@jangorecki
Contributor

jangorecki commented Oct 16, 2020

Below are the timings made on this PR (precisely speaking, on https://github.com/h2oai/db-benchmark/tree/modin-join-dev).

The most obvious observation is that there is a performance problem with join question 5, the big-to-big join: 1e7 rows joined to 1e7 rows, 1e8 to 1e8, 1e9 to 1e9. That is a quite common problem for software that works in a distributed manner, since a big-to-big join generally forces both tables to be repartitioned and shuffled between workers; you may find this video interesting: https://www.youtube.com/watch?v=5X7h1rZGVs0

Another, actually more disturbing, thing is the timing values in the chk_time_sec column, see below. This field is briefly described in the https://github.com/h2oai/db-benchmark/blob/master/_docs/maintenance.md#timecsv document as:

chk_time_sec - time taken to compute the chk field, used to track lazyness of evaluation

We generally expect this value to be very low, much lower than the value of time_sec, so that we can be sure that computation "A", the one measured by time_sec, has actually been computed fully, and not deferred until a computation "B" where the results of "A" are first used.
From the values we can observe here, we could reason that some of the join computation might actually be happening later on, when we try to use the not yet fully computed values.
What is necessary here to ensure the benchmark is fair is to investigate the laziness of those operations.
What would eventually help is a comment from a Modin maintainer saying whether evaluation is lazy or not, and explaining why "computation B" takes so much time.
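To make the distinction concrete, below is a minimal sketch of the measurement pattern being described (not the actual db-benchmark harness; the toy data and column names are assumptions). If the engine defers the join, the deferred work leaks from time_sec into chk_time_sec:

    import time
    import pandas as pd  # the pattern is identical with `import modin.pandas as pd`

    # toy stand-ins for the benchmark tables
    x = pd.DataFrame({"id1": range(1000), "v1": range(1000)})
    small = pd.DataFrame({"id1": range(0, 1000, 10), "v2": range(100)})

    t0 = time.time()
    ans = x.merge(small, on="id1")            # computation "A", reported as time_sec
    time_sec = time.time() - t0

    t0 = time.time()
    chk = [ans["v1"].sum(), ans["v2"].sum()]  # computation "B", forces full materialization
    chk_time_sec = time.time() - t0           # expected to be much smaller than time_sec

    print(time_sec, chk_time_sec, chk)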

1e7

Timings for all 5 questions:

                  question run time_sec chk_time_sec
 1:     small inner on int   1    4.470        2.188
 2:     small inner on int   2    1.290        2.278
 3:    medium inner on int   1    3.724        2.460
 4:    medium inner on int   2    2.282        2.561
 5:    medium outer on int   1    2.267        2.438
 6:    medium outer on int   2    2.276        2.567
 7: medium inner on factor   1    2.414        2.561
 8: medium inner on factor   2    2.775        2.559
 9:       big inner on int   1  754.830       95.274
10:       big inner on int   2  714.218       95.663

All join queries successfully finished in 1859s.

1e8

When trying to do the first run of q5, the python process gets Killed (presumably out of memory).

Timings of q1-q4:

                 question run time_sec chk_time_sec
1:     small inner on int   1   50.147       21.417
2:     small inner on int   2   12.247       21.553
3:    medium inner on int   1   22.070       22.748
4:    medium inner on int   2   21.020       22.955
5:    medium outer on int   1   20.398       22.280
6:    medium outer on int   2   22.891       23.420
7: medium inner on factor   1   22.467       23.794
8: medium inner on factor   2   22.616       24.169

1e9

In the case of the 1e9 rows data, the script already fails while loading the data. Unless Modin can handle out-of-memory data, this is expected. If Modin is able to handle out-of-memory data (does it?), then we should enable that just for the 1e9 data size; a sketch of that idea follows the traceback below.

Traceback (most recent call last):
  File "./modin/join-modin.py", line 35, in <module>
    x = pd.read_csv(src_jn_x)
  File "/home/jan/git/db-benchmark/modin/py-modin/lib/python3.6/site-packages/modin/pandas/io.py", line 112, in parser_func
    return _read(**kwargs)
  File "/home/jan/git/db-benchmark/modin/py-modin/lib/python3.6/site-packages/modin/pandas/io.py", line 127, in _read
    pd_obj = EngineDispatcher.read_csv(**kwargs)
  File "/home/jan/git/db-benchmark/modin/py-modin/lib/python3.6/site-packages/modin/data_management/factories/dispatcher.py", line 113, in read_csv
    return cls.__engine._read_csv(**kwargs)
  File "/home/jan/git/db-benchmark/modin/py-modin/lib/python3.6/site-packages/modin/data_management/factories/factories.py", line 87, in _read_csv
    return cls.io_cls.read_csv(**kwargs)
  File "/home/jan/git/db-benchmark/modin/py-modin/lib/python3.6/site-packages/modin/engines/base/io/file_reader.py", line 29, in read
    query_compiler = cls._read(*args, **kwargs)
  File "/home/jan/git/db-benchmark/modin/py-modin/lib/python3.6/site-packages/modin/engines/base/io/text/csv_reader.py", line 159, in _read
    is_quoting=is_quoting,
  File "/home/jan/git/db-benchmark/modin/py-modin/lib/python3.6/site-packages/modin/engines/base/io/text/text_file_reader.py", line 104, in offset
    chunk = f.read(chunk_size_bytes)
MemoryError
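To make the "enable that just for 1e9" idea concrete, a hypothetical gate in the solution script could look like the sketch below. Everything in it is an assumption: the SRC_DATANAME environment variable mirrors what other db-benchmark scripts read, and the flag is a pure placeholder, since (per the reply below) Modin has no supported out-of-core mode to switch on:

    import os

    data_name = os.environ["SRC_DATANAME"]  # e.g. "J1_1e9_NA_0_0"
    if "_1e9_" in data_name:
        # placeholder only: marks where an out-of-core switch would go
        # if Modin offered one; it does not change Modin's behavior
        os.environ["HYPOTHETICAL_OUT_OF_CORE"] = "True"

    import modin.pandas as pd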

@gshimansky
Author

I checked with Modin developer @YarShev, who knows the details of the merge operation: we don't have any lazy computation for it. The performance there is a subject for investigation, because I see these problems too, but we haven't figured out the reason for this behavior yet.

As for memory, it looks like no solution is currently able to pass the 1e9 runs (https://h2oai.github.io/db-benchmark/join/J1_1e9_NA_0_0_basic.png), so it is no wonder that Modin behaves like this too. Modin doesn't do any out-of-memory operations; all data has to fit into memory.
