Consider a tunable to increase work splitting in parallel iterators #1217

Open · HadrienG2 opened this issue Dec 12, 2024 · 0 comments

HadrienG2 (Contributor) commented Dec 12, 2024
What we have today

Rayon's current parallel iterator API provides several ways to reduce the amount of work splitting performed by the rayon implementation. For example, one can use methods like by_(exponential|uniform)_blocks, fold_chunks, and friends on IndexedParallelIterator. If the collection does not have indexed iterators, one can also use e.g. rayon::iter::split() with a cutoff imposed by the user-provided callback refusing to split beyond a certain depth.
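
For illustration, here is a minimal sketch of the fold_chunks approach, which caps splitting granularity by folding fixed-size chunks sequentially (assumes a rayon version recent enough to provide fold_chunks):

```rust
use rayon::prelude::*;

fn main() {
    let data: Vec<u64> = (0..1_000_000).collect();

    // Each chunk of 4096 elements is folded sequentially, so rayon never
    // splits the work more finely than this; only the per-chunk partial
    // sums are then combined in parallel.
    let total: u64 = data
        .par_iter()
        .fold_chunks(4096, || 0u64, |acc, &x| acc + x)
        .sum();

    assert_eq!(total, data.iter().sum::<u64>());
}
```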

These concurrency-reduction tools are useful in situations where rayon's heuristics overestimate the optimal amount of work splitting, resulting in extra task spawning and joining overhead without a meaningful improvement in parallel execution efficiency. Indeed, in the worst case, both wall-clock time and CPU time will regress.

Overall, the need for these tweaks most often comes up when one wants to improve the performance of rayon-based processing of small, fine-grained datasets. They are best combined with a fully sequential fast path (with no rayon call at all) for the smallest datasets, along the lines of the sketch below.
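
A minimal sketch of such a fast path; the cutoff value is illustrative and would need per-workload tuning:

```rust
use rayon::prelude::*;

/// Sum a slice, staying fully sequential below a cutoff size.
fn adaptive_sum(data: &[u64]) -> u64 {
    const SEQUENTIAL_CUTOFF: usize = 10_000; // illustrative, workload-dependent
    if data.len() < SEQUENTIAL_CUTOFF {
        data.iter().sum() // no rayon call at all
    } else {
        data.par_iter().sum()
    }
}
```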

What I am proposing

Someone recently brought to my attention that we can also get the opposite problem, where rayon performs too little work splitting. This arises when the application cannot generate a balanced work-splitting tree, as advised by the docs of rayon::iter::split() and friends, because the shape of the work is not known in advance.

The precise use case that was described to me is a graph traversal algorithm that should work on graphs of arbitrary shape, however badly balanced. In this situation, producing balanced chunks for split() or join() would require sequentially traversing the graph to estimate the processing cost of each outgoing edge, which defeats the point of parallel traversal. A sketch of the problem follows.
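
Here is a sketch of such a traversal with rayon::iter::split(); the Node type is hypothetical, and the key point is that the splitter callback cannot know how much work each fragment carries without traversing it first:

```rust
use rayon::iter::{split, ParallelIterator};

// Hypothetical graph node type, for illustration only.
struct Node {
    value: u64,
    children: Vec<Node>,
}

fn subtree_sum(node: &Node) -> u64 {
    node.value + node.children.iter().map(subtree_sum).sum::<u64>()
}

fn parallel_sum(root: &Node) -> u64 {
    // Partition the top-level frontier in half whenever it holds at least
    // two nodes. rayon decides how deep to keep splitting, but the
    // resulting fragments may hide arbitrarily different amounts of work,
    // since we cannot know subtree sizes without traversing them.
    root.value
        + split(root.children.iter().collect::<Vec<_>>(), |mut frontier| {
            if frontier.len() >= 2 {
                let right = frontier.split_off(frontier.len() / 2);
                (frontier, Some(right))
            } else {
                (frontier, None)
            }
        })
        .map(|frontier| frontier.into_iter().map(subtree_sum).sum::<u64>())
        .sum::<u64>()
}
```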

When the split is too imbalanced, rayon's work-splitting heuristics, which AFAIK are based on the idea of keeping thread work queues filled with more than N pieces of work, are likely to spawn too few concurrent tasks before switching to sequential processing. Ineffective load balancing ensues. This is why the docs of split() and friends warn that the resulting fragments should be reasonably balanced.

In this situation, it would be good to have a way to hint the rayon implementation so that it splits work into more chunks than it normally would.
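
Purely as a strawman, such a hint could look like the following; with_splitting_factor is hypothetical and does not exist in rayon today, nor are its exact semantics specified here:

```rust
// Hypothetical API sketch (does not compile against current rayon):
// ask the scheduler to split roughly 4x more finely than its default
// heuristics would before falling back to sequential processing.
let total: u64 = frontier
    .par_iter()
    .with_splitting_factor(4) // hypothetical method
    .map(|node| subtree_sum(node))
    .sum();
```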

Alternatives

Of course, whenever possible, one should instead favor balanced task splitting: it reduces the number of tasks that need to be spawned and joined in order to achieve a given degree of load balancing, increases the granularity of each task, and thus minimizes the parallelization-related CPU overhead.

When that is not possible, an alternative to the proposed tunable is to drop down to low-level join(), but that is a fair amount of work (see the sketch below). It would be great if we could instead try something without leaving the comfort of the high-level parallel iterator API, with its ability to automatically adjust the amount of work splitting based on the workloads already executing in the rayon thread pool.
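
A sketch of that manual join()-based alternative, reusing the hypothetical Node type from above and forcing splits down to an arbitrary recursion depth regardless of rayon's heuristics:

```rust
use rayon::join;

// Keep splitting unconditionally until the depth budget runs out, then
// fall back to sequential traversal of the remaining subtree.
fn forced_parallel_sum(node: &Node, depth: u32) -> u64 {
    if depth == 0 || node.children.is_empty() {
        return subtree_sum(node);
    }
    let mid = (node.children.len() / 2).max(1);
    let (left, right) = node.children.split_at(mid);
    let (a, b) = join(
        || left.iter().map(|c| forced_parallel_sum(c, depth - 1)).sum::<u64>(),
        || right.iter().map(|c| forced_parallel_sum(c, depth - 1)).sum::<u64>(),
    );
    node.value + a + b
}
```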

Variations and extensions

The tunable could either be global to the rayon thread pool (easiest to implement, lowest overhead) or specific to a given parallel iterator (useful for more complex applications that run multiple rayon parallel iteration jobs concurrently).
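
In the global variant, the hint might live on the thread pool builder; again a purely hypothetical sketch, as splitting_factor does not exist on ThreadPoolBuilder today:

```rust
// Hypothetical pool-wide variant of the tunable.
let pool = rayon::ThreadPoolBuilder::new()
    .num_threads(8)
    .splitting_factor(4) // hypothetical method
    .build()
    .unwrap();
```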

It could also be bidirectional (allowing the scheduler's cost model / safety margin to be tuned towards either more or less iterator splitting), which might reduce the need for users to resort to manual concurrency reductions like fold_chunks() in some easy cases.
