Update Modin join benchmark to current state #162
Signed-off-by: Gregory Shimansky <[email protected]>
Thank you for contributing this script. I am now running the modin join benchmark and will report back when it finishes.
Below I am presenting timings made on this PR (precisely speaking, on https://github.com/h2oai/db-benchmark/tree/modin-join-dev). A quite obvious observation is that there is a problem with the performance of join question 5, the big-to-big join: 1e7 rows joined to 1e7 rows, 1e8 to 1e8, 1e9 to 1e9. That is quite a common problem for software that works in a distributed manner; you may find this video interesting: https://www.youtube.com/watch?v=5X7h1rZGVs0

Another thing, more disturbing actually, are the timing values in
We generally expect this value to be very low, much lower than the value of

1e7
Timings for all 5 questions:
All join queries successfully finished in 1859s.

1e8
When trying to do the first run of q5, python is being
Timings of q1-q4:

1e9
In the case of the 1e9 rows data, the script already fails while loading the data. Unless modin can handle out-of-memory data, this is expected. If modin is able to handle out-of-memory data (does it?), then we should enable that just for the 1e9 data size.
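
For readers who want to reproduce the shape of the big-to-big case at a small scale, here is a minimal sketch of timing such a join. It is not the benchmark script from this PR: the helper names (`make_tables`, `time_big_to_big_join`), the column names (`id`, `v1`, `v2`) and the row count are hypothetical, and it uses plain pandas. The assumption is that swapping the import for `import modin.pandas as pd` leaves the `merge` call unchanged, since Modin mirrors the pandas API.

```python
import time

# Assumption: replacing this with `import modin.pandas as pd` keeps the
# same DataFrame.merge call, because Modin mirrors the pandas API.
import pandas as pd


def make_tables(n):
    """Build two hypothetical n-row tables that share an integer key 'id'."""
    left = pd.DataFrame({"id": range(n), "v1": [i * 0.5 for i in range(n)]})
    # The right table holds the same keys in reverse order, so every key
    # matches exactly once but rows are not already aligned.
    right = pd.DataFrame(
        {"id": range(n - 1, -1, -1), "v2": [i * 2.0 for i in range(n)]}
    )
    return left, right


def time_big_to_big_join(n, runs=2):
    """Time an inner join of two n-row tables, repeated `runs` times."""
    left, right = make_tables(n)
    timings = []
    rows = 0
    for _ in range(runs):
        start = time.perf_counter()
        ans = left.merge(right, on="id", how="inner")
        # Read the row count inside the timed region so the measurement
        # covers producing a result, not only issuing the call.
        rows = ans.shape[0]
        timings.append(time.perf_counter() - start)
    return rows, timings


rows, timings = time_big_to_big_join(10_000)
print(rows, [round(t, 4) for t in timings])
```

Each key appears once on both sides, so the joined table has as many rows as each input. Raising `n` toward 1e7 is where the behaviour described above would be expected to show up, memory permitting.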
I checked with Modin developer @YarShev, who knows the details of the merge operation, and he confirmed that we don't have any lazy computation for it. Performance there is a subject for investigation because I see these problems too, but we haven't figured out the reason for this behavior yet. As for memory, it looks like no configurations are able to pass
I updated the Modin implementation of the join benchmark to its current state. Most of the code is copied from the Pandas version, but there are some differences.