Is there a benchmark for join and merge operations on indexed columns? According to the official Dask performance guide, these operations should be fast compared to pandas, but when I benchmarked on my own data, the speedup with 8 partitions was only ~7%, which was quite disappointing, and I could not find any published benchmarks for this.
On Thu, Feb 14, 2019 at 7:05 AM Matěj Račinský wrote:
Is there benchmark for join and merge operation on indexed columns?
According to official Dask performance guide, this operation should be fast
compared to pandas, but when benchmarking on my data, the speed up when
using 8 partitions was ~7% which was quite disappointing, and I could not
find any benchmarks for this.
I'm considering putting in some effort on shuffle.
A natural part of this work would be producing such benchmarks.
I've read the shuffle docs and looked into shuffle.py (I'm mainly interested in the task-based shuffle, but am not against potentially using partd as part of a solution).
I'm not familiar with the algorithm currently used for the staged group/split/join.
Is the idea that by interleaving merge tasks (which in Dask can execute concurrently with the map phase), we can outperform something like Spark or Hadoop, which start the reduce phase only once all mapping tasks are complete?
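To make the question concrete, the general idea of a staged shuffle can be sketched in plain Python. This illustrates the technique only (each stage fixes one base-`k` digit of an item's target partition, so every intermediate bucket reads from at most `k` inputs); it is not Dask's actual code. The function names and the `k` parameter are hypothetical.

```python
from collections import defaultdict


def digit(n, s, k):
    """Return digit number s (least significant first) of n written in base k."""
    return (n // k**s) % k


def staged_shuffle(partitions, n_out, k=4, key=hash):
    """Move every item to partition key(item) % n_out, one base-k digit per stage.

    Returns (output_partitions, number_of_stages).
    """
    n = max(len(partitions), n_out)
    stages = 1
    while k**stages < n:  # smallest number of stages whose bucket space covers n
        stages += 1

    # Input partition i starts out in bucket i.
    buckets = {i: list(p) for i, p in enumerate(partitions)}
    for s in range(stages):
        nxt = defaultdict(list)
        for b, items in buckets.items():
            for item in items:
                target = key(item) % n_out
                # Overwrite digit s of the current bucket id with digit s of the target.
                dest = b + (digit(target, s, k) - digit(b, s, k)) * k**s
                nxt[dest].append(item)
        buckets = nxt

    # After the last stage every digit matches, so bucket id == target partition.
    return [buckets.get(i, []) for i in range(n_out)], stages


if __name__ == "__main__":
    parts = [list(range(i, 100, 7)) for i in range(7)]
    out, stages = staged_shuffle(parts, n_out=10, k=3, key=lambda x: x)
    print(stages, [len(p) for p in out])
```

In this sketch each stage is a full barrier. In a task graph, a stage-`s+1` bucket depends only on the `k` stage-`s` buckets that feed it, which is what would allow the kind of interleaving asked about above.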