What we have today
Rayon's current parallel iterator API provides several ways to reduce the amount of work-splitting performed by the rayon implementation. For example, one can use methods like by_(exponential|uniform)_blocks, fold_chunks and friends on IndexedParallelIterator. If the collection does not have indexed iterators, one can also use e.g. rayon::iter::split() with a cutoff imposed by the user-provided callback, which can refuse to split beyond a certain depth.
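To make this concrete, here is a minimal sketch of two of these tools in action (the chunk size and depth cutoff are arbitrary illustrative values that one would tune by benchmarking):

```rust
use rayon::prelude::*;

// fold_chunks on an indexed iterator: accumulate 1024 items per
// sequential chunk instead of letting rayon split further down
// (1024 is an arbitrary illustrative value).
fn sum_indexed(data: &[u64]) -> u64 {
    data.par_iter()
        .fold_chunks(1024, || 0u64, |acc, &x| acc + x)
        .sum()
}

// rayon::iter::split() with a depth cutoff: the user-provided
// callback refuses to split once MAX_DEPTH is reached.
const MAX_DEPTH: u32 = 5; // arbitrary: at most 2^5 = 32 fragments

fn sum_split(data: &[u64]) -> u64 {
    rayon::iter::split((data, 0u32), |(data, depth)| {
        if depth >= MAX_DEPTH || data.len() < 2 {
            ((data, depth), None)
        } else {
            let (left, right) = data.split_at(data.len() / 2);
            ((left, depth + 1), Some((right, depth + 1)))
        }
    })
    .map(|(chunk, _)| chunk.iter().sum::<u64>())
    .sum()
}
```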
These concurrency-reduction tools are useful in situations where rayon's heuristics overestimate the optimal amount of work-splitting, resulting in more task spawning and joining overhead without a meaningful improvement in parallel execution efficiency or execution time. Indeed, in the worst case, both wall-clock time and CPU time will regress.
Overall, the need for these tweaks most often comes up when one wants to improve the performance of rayon-based processing on small, fine-grained datasets. They are best combined with a fully sequential fast path (with no rayon call at all) for the smallest datasets.
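A sketch of what that combination might look like, with a made-up threshold that one would tune per workload:

```rust
use rayon::prelude::*;

// Made-up cutoff: below this size, parallelism is assumed not to pay off.
const PAR_THRESHOLD: usize = 10_000;

fn sum(data: &[u64]) -> u64 {
    if data.len() < PAR_THRESHOLD {
        // Fully sequential fast path: no rayon call at all.
        data.iter().sum()
    } else {
        data.par_iter().sum()
    }
}
```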
What I am proposing
Someone recently brought to my attention that we can also get the opposite problem, of rayon performing too little work-splitting, in the scenario where it is not possible for the application to generate a balanced work-splitting tree, as advised by the docs of rayon::iter::split() and friends, because the shape of the work is not known in advance.
The precise use case that was described to me is a graph traversal algorithm that should work on graphs of arbitrary shape, however badly balanced. In this situation, producing balanced chunks for split() or join() would require sequentially traversing the graph to evaluate the processing cost of each outgoing edge, which defeats the point of a parallel traversal.
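To illustrate the problem (using a tree rather than a general graph to keep the sketch short), each split point only knows how many children a node has, not how much work hides behind each of them:

```rust
use rayon::prelude::*;

// Arbitrarily shaped tree: subtree sizes are not known in advance.
struct Node {
    value: u64,
    children: Vec<Node>,
}

fn sum(node: &Node) -> u64 {
    // The split is by child count, but one child may hide a huge
    // subtree while its siblings are leaves, so the resulting
    // fragments can be arbitrarily imbalanced.
    node.value + node.children.par_iter().map(sum).sum::<u64>()
}
```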
When the split is too imbalanced, the work-splitting heuristics of Rayon, which AFAIK are based on the idea of keeping thread work queues filled with more than N pieces of work, are likely to end up spawning too few pieces of concurrent work before switching to sequential processing. As a result, load balancing will be ineffective. This is why the docs of split() and friends warn that the resulting fragments should be reasonably balanced.
In this situation, it would be good to have a way to hint the rayon implementation so that it splits work into more chunks than it normally would.
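To sketch what such a hint could look like at the call site (this with_split_factor() adapter is purely hypothetical and does not exist in rayon today):

```rust
use rayon::prelude::*;

// Purely hypothetical sketch: with_split_factor(), NodeId and
// process_node() are all made up for illustration. The idea is a
// per-iterator hint that biases the scheduler towards producing
// more fragments than its heuristics normally would.
fn traverse(frontier: Vec<NodeId>) -> u64 {
    frontier
        .into_par_iter()
        .with_split_factor(4) // hypothetical: split ~4x more than usual
        .map(process_node)
        .sum()
}
```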
Alternatives
Of course, whenever possible, one should instead favor balanced task splitting, because it will reduce the number of tasks that need to be spawned and joined in order to achieve a certain degree of load balancing, increase the granularity of each task, and thus minimize the parallelization-related CPU overhead.
When that is not possible, an alternative to the proposed tunable is to drop down to the low-level join() API, but that's a fair amount of work. It would be great if we could instead try something without leaving the comfort of the high-level parallel iterator API, with its ability to automatically adjust the amount of work splitting based on workloads already executing in the Rayon thread pool.
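For reference, the join()-based escape hatch looks something like the following sketch, where the caller fully controls the split policy (here forcing splits down to a fixed depth regardless of rayon's heuristics; the depth is an arbitrary illustrative value):

```rust
// Low-level alternative: recursive rayon::join() with a caller-chosen
// split policy that ignores rayon's cost model entirely.
const FORCE_DEPTH: u32 = 8; // arbitrary: force up to 2^8 = 256 tasks

fn sum(data: &[u64], depth: u32) -> u64 {
    if depth >= FORCE_DEPTH || data.len() < 2 {
        return data.iter().sum();
    }
    let (left, right) = data.split_at(data.len() / 2);
    let (l, r) = rayon::join(
        || sum(left, depth + 1),
        || sum(right, depth + 1),
    );
    l + r
}
```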
Variations and extensions
The tunable could either be global to the rayon thread pool (easiest to implement, lowest overhead) or specific to a certain parallel iterator (useful for more complex applications that have multiple concurrent rayon parallel iteration jobs).
It could also be bidirectional (allow tuning the scheduler cost model / safety margin towards more or less iterator splitting), which might reduce the need for users to resort to manual reductions of concurrency like fold_chunks() in some easy cases.