
Shard AllocMap Lock #136115

Open · wants to merge 1 commit into master

Conversation

Mark-Simulacrum (Member)

This improves performance on many-seed parallel (-Zthreads=32) Miri executions from managing to use only ~8 cores to using 27-28 cores, which is about the same as what I see with the data structure proposed in #136105. I haven't analyzed it, but I suspect the sharding might actually work out better if we commonly insert "densely", since sharding spreads the locks across cache lines while the OnceVec packs them close together. Of course, we could do something similar with the bitset lock too.

Either way, this seems like a very reasonable starting point that solves the problem ~equally well on what I can test locally.

r? @RalfJung
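
For context, here is a minimal sketch of the lock-sharding idea, using only std types. This is not the implementation used by this PR (the compiler has its own sharded lock types); `ShardedMap`, the shard count, and the method names are illustrative. The point is that each key hashes to one of a fixed set of independently locked shards, so threads working on different `AllocId`s usually take different mutexes instead of serializing on a single map-wide lock.

```rust
// Hypothetical sketch of lock sharding; not the rustc_data_structures code.
use std::collections::hash_map::RandomState;
use std::collections::HashMap;
use std::hash::{BuildHasher, Hash};
use std::sync::{Mutex, MutexGuard};

const SHARDS: usize = 32; // power of two so the hash can be masked

pub struct ShardedMap<K, V> {
    hasher: RandomState,
    shards: Vec<Mutex<HashMap<K, V>>>,
}

impl<K: Hash + Eq, V> ShardedMap<K, V> {
    pub fn new() -> Self {
        ShardedMap {
            hasher: RandomState::new(),
            shards: (0..SHARDS).map(|_| Mutex::new(HashMap::new())).collect(),
        }
    }

    /// Lock only the shard that `key` hashes to; the other shards remain
    /// available to other threads.
    pub fn lock_shard_by_value(&self, key: &K) -> MutexGuard<'_, HashMap<K, V>> {
        let idx = (self.hasher.hash_one(key) as usize) & (SHARDS - 1);
        self.shards[idx].lock().unwrap()
    }

    pub fn insert(&self, key: K, value: V) -> Option<V> {
        self.lock_shard_by_value(&key).insert(key, value)
    }

    pub fn get(&self, key: &K) -> Option<V>
    where
        V: Clone,
    {
        self.lock_shard_by_value(key).get(key).cloned()
    }
}
```

Under a scheme like this, dense insertions of consecutive ids get scattered across shards (and thus across cache lines), which is one way to read the cache-line comparison with the lock layout of #136105 above.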

Commit message:

This improves performance on many-seed parallel (-Zthreads=32) miri
executions from managing to use ~8 cores to using 27-28 cores. That's
pretty reasonable scaling for the simplicity of this solution.
rustbot added the S-waiting-on-review (Status: Awaiting review from the assignee but also interested parties.) and T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue.) labels on Jan 27, 2025
rustbot (Collaborator) commented on Jan 27, 2025

Some changes occurred to the CTFE / Miri interpreter

cc @rust-lang/miri, @rust-lang/wg-const-eval

Mark-Simulacrum (Member, Author)

@bors try @rust-timer queue


rustbot added the S-waiting-on-perf (Status: Waiting on a perf run to be completed.) label on Jan 27, 2025
bors added a commit to rust-lang-ci/rust that referenced this pull request on Jan 27, 2025: Shard AllocMap Lock
bors (Contributor) commented on Jan 27, 2025

⌛ Trying commit b2bff4f with merge e402369...

bors (Contributor) commented on Jan 27, 2025

☀️ Try build successful - checks-actions
Build commit: e402369 (e4023695dcabc9a5f89d120c866679e546717831)


rust-timer (Collaborator)

Finished benchmarking commit (e402369): comparison URL.

Overall result: ❌ regressions - no action needed

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never
@rustbot label: -S-waiting-on-perf -perf-regression

Instruction count

This is the most reliable metric that we have; it was used to determine the overall result at the top of this comment. However, even this metric can sometimes exhibit noise.

                            mean     range            count
Regressions ❌ (primary)     -        -                0
Regressions ❌ (secondary)   0.2%     [0.2%, 0.3%]     5
Improvements ✅ (primary)    -        -                0
Improvements ✅ (secondary)  -        -                0
All ❌✅ (primary)            -        -                0

Max RSS (memory usage)

Results (primary -2.0%, secondary 2.1%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

                            mean     range            count
Regressions ❌ (primary)     -        -                0
Regressions ❌ (secondary)   2.1%     [2.1%, 2.1%]     1
Improvements ✅ (primary)    -2.0%    [-2.9%, -1.0%]   2
Improvements ✅ (secondary)  -        -                0
All ❌✅ (primary)            -2.0%    [-2.9%, -1.0%]   2

Cycles

Results (primary 2.6%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

                            mean     range            count
Regressions ❌ (primary)     2.6%     [2.6%, 2.6%]     1
Regressions ❌ (secondary)   -        -                0
Improvements ✅ (primary)    -        -                0
Improvements ✅ (secondary)  -        -                0
All ❌✅ (primary)            2.6%     [2.6%, 2.6%]     1

Binary size

This benchmark run did not return any relevant results for this metric.

Bootstrap: 772.928s -> 772.728s (-0.03%)
Artifact size: 328.22 MiB -> 328.21 MiB (-0.00%)

rustbot removed the S-waiting-on-perf (Status: Waiting on a perf run to be completed.) label on Jan 27, 2025
Mark-Simulacrum (Member, Author)

Perf results look neutral enough that I'm okay moving forward, given the gains for parallel executions measured outside of the perf suite.

@rustbot label: perf-regression-triaged

rustbot added the perf-regression-triaged (The performance regression has been triaged.) label on Jan 27, 2025
@@ -389,35 +391,37 @@ pub const CTFE_ALLOC_SALT: usize = 0;

    pub(crate) struct AllocMap<'tcx> {
        /// Maps `AllocId`s to their corresponding allocations.
        alloc_map: FxHashMap<AllocId, GlobalAlloc<'tcx>>,
        // Note that this map on rustc workloads seems to be rather dense. In #136105 we considered

Suggested change:

    - // Note that this map on rustc workloads seems to be rather dense. In #136105 we considered
    + // Note that this map on rustc workloads seems to be rather dense, but
    + // in Miri workloads it is expected to be quite sparse. In #136105 we considered

    assert!(
        self.alloc_map
            .to_alloc
            .lock_shard_by_value(&id)

This locks to_alloc while dedup is locked. Seems worth documenting the lock order (in the AllocMap type, I guess) to avoid deadlocks.
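
To illustrate the reviewer's point, here is a hedged sketch of a documented lock order, using std types as stand-ins; the real `AllocMap` fields, their types, and the interning logic differ. The invariant "take `dedup` before any `to_alloc` shard" rules out the ABBA deadlock in which one thread holds `dedup` while waiting for `to_alloc` and another holds `to_alloc` while waiting for `dedup`.

```rust
// Illustrative only: a documented lock order of the kind suggested in the review.
// `dedup` and `to_alloc` here are std stand-ins, not the actual rustc field types.
use std::collections::HashMap;
use std::sync::Mutex;

struct AllocMapSketch {
    /// Lock order: `dedup` is always acquired *before* `to_alloc`.
    /// Taking them in the opposite order on another thread could deadlock:
    /// thread A holds `dedup` and waits for `to_alloc`, thread B the reverse.
    dedup: Mutex<HashMap<String, u64>>,
    to_alloc: Mutex<HashMap<u64, String>>,
}

impl AllocMapSketch {
    /// Interns `alloc`, respecting the documented order: `dedup`, then `to_alloc`.
    fn intern(&self, alloc: String, fresh_id: u64) -> u64 {
        let mut dedup = self.dedup.lock().unwrap();
        if let Some(&id) = dedup.get(&alloc) {
            return id;
        }
        // Still holding `dedup`; taking `to_alloc` now matches the documented
        // order, so no reverse-order acquisition should exist elsewhere.
        self.to_alloc.lock().unwrap().insert(fresh_id, alloc.clone());
        dedup.insert(alloc, fresh_id);
        fresh_id
    }
}
```

Code paths that only need one of the two locks are unaffected; a documented order only constrains the paths that hold both at once.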
