Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Join and merge benchmarks. #19

Open
racinmat opened this issue Feb 14, 2019 · 2 comments
Open

Join and merge benchmarks. #19

racinmat opened this issue Feb 14, 2019 · 2 comments

Comments

@racinmat
Copy link

Is there benchmark for join and merge operation on indexed columns? According to official Dask performance guide, this operation should be fast compared to pandas, but when benchmarking on my data, the speed up when using 8 partitions was ~7% which was quite disappointing, and I could not find any benchmarks for this.

@mrocklin
Copy link
Member

mrocklin commented Feb 14, 2019 via email

@erolosty
Copy link

I'm considering putting in some effort on shuffle.

A natural part of this work would be producing such benchmarks

I've read the shuffle docs, and looked into the shuffle.py (I'm mainly interested in tasks shuffle, but am not against potentially using partd as part of a solution)

I'm not familiar with the algorithm being used currently, for staged group split join.

Is the idea that by interleaving merge tasks (which in dask can execute concurrently with the map phase), that we can outperform something like spark or Hadoop which starts the reduce phase when all mapping tasks are complete?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants