Significant Performance Disparity Between Arm64 and x64 Write Barriers #106051
Tagging subscribers to this area: @dotnet/gc
Is it certain that the write barrier is to blame?
```csharp
var foo = new Foo();
for (long i = 0; i < 200_000_000; i++) {
    foo.x = foo;
}

class Foo {
    public volatile Foo? x;
}
```

```
$ time ./wbcost
________________________________________________________
Executed in  425.01 millis    fish          external
   usr time  404.48 millis    0.07 millis   404.41 millis
   sys time   18.57 millis    1.02 millis    17.55 millis
```
@EgorBot -arm64 -amd -perf -commit 5598791 vs 5598791 --envvars DOTNET_TieredCompilation:0 DOTNET_ReadyToRun:0

```csharp
using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class Bench
{
    [Benchmark]
    public void WB()
    {
        Foo foo = new Foo();
        for (long i = 0; i < 200_000_000; i++)
            foo.x = foo;
    }
}

internal class Foo
{
    public volatile Foo x;
}
```
Benchmark results on Amd
Flame graphs: Main vs PR 🔥
Benchmark results on Arm64
Flame graphs: Main vs PR 🔥
I cannot reproduce your numbers; I suspect you might be measuring OSR pacing differences (consider running with tiered compilation disabled). Although, arm64 is still slower due to:
Also, we might want to have a more complicated benchmark where objects aren't ephemeral as well?
@jkotas @cshung If you're not busy - do you have any idea why the "is card table already updated" check is so expensive on arm64? 🙂 Can it be some false sharing, etc.?

Another thing I noticed is that the arm64 WB is so expensive that we could add yet another branch ("is the object reference null? Exit") and the regression would be <1% (while giving us a 2X improvement when we actually write null).
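For illustration, here is a minimal C# pseudocode sketch of a checked write barrier with that proposed early exit. This is not the runtime's real barrier (which is hand-written assembly); the global names and the card-shift constant below are assumptions made for the sketch:

```csharp
// NOT the actual CoreCLR barrier: a pseudocode sketch of its rough shape.
static unsafe class WriteBarrierSketch
{
    static byte* g_card_table;                      // hypothetical card table base
    static nuint g_ephemeral_low, g_ephemeral_high; // hypothetical ephemeral range

    public static void Barrier(nuint* dst, nuint src)
    {
        *dst = src;                     // the reference store itself
        if (src == 0)
            return;                     // proposed extra branch: storing null
                                        // can never require a card update
        if (src < g_ephemeral_low || src >= g_ephemeral_high)
            return;                     // target object is not ephemeral
        byte* card = g_card_table + ((nuint)dst >> 11); // assumed card granularity
        if (*card != 0xFF)              // the "is card table already updated?" check
            *card = 0xFF;
    }
}
```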
Yes, we should totally understand the performance of the write barrier function under other execution paths - for example, when a cache miss occurs, when branching away because of heap range, generations, and so on. The initial benchmark was designed to be easy to understand. For example, I wanted to make sure the cache always hits and we read exactly the same location, so that we don't hit any cache issues. As we can see, even in this trivial scenario, the data shows surprising results; making it more varied would only make it harder to interpret.

I doubt it is false sharing. Since we aren't allocating, the GC should not be running, and no other thread should be accessing the card table, so the core should have exclusive access to the cache entry. Besides the obvious fact that this "slow load" uses a different instruction, it is also loading from a computed address; does the Arm architecture do anything special with respect to loading from a hard-coded address? I don't know. I wonder if tools like this can give us more insight into what is going on.
My bet would be sampling bias or some micro-architecture issue. I think it would be best to ask Arm hardware engineers to replicate this on a simulator and tell us what's actually going on.
On Arm64 this could be done by having 4 `mov`/`movk` instructions materialize the constant.

Probably minor, but also spotted: this bit of code checks whether the card byte is already set before writing it, which could be simplified to unconditionally storing 0xFF.
I am not sure whether they need to be patched atomically; presumably not, since the constants are updated during a GC stop, but it needs to be checked. Also, the data it loads is located just near the function, so it's supposed to not be terribly slow?

I've tried that and it either didn't improve anything or even regressed; I don't remember exactly.
The constants are not always updated during a GC stop. Notice that the x64 implementation of the write barrier has padding nops to make the constants aligned so that they can be patched atomically.
This is an intentional optimization to avoid cache contention. Simplifying it to `val = 0xff` would likely show up as a regression.
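To illustrate the trade-off in rough C# terms (the array and method names here are illustrative, not the runtime's code): the checked update leaves the card-table cache line clean when the card is already set, while the unconditional store dirties it on every barrier and can ping-pong the line between cores under concurrent mutator threads.

```csharp
static void UpdateCardChecked(byte[] cards, int index)
{
    // Read first, write only if needed: when the card is already 0xFF the
    // cache line is merely read, so concurrent barriers hitting the same
    // line on other cores are not forced to invalidate it.
    if (cards[index] != 0xFF)
        cards[index] = 0xFF;
}

static void UpdateCardUnconditional(byte[] cards, int index)
{
    // Always writes: the line is dirtied on every barrier, which is where
    // the cache contention would come from.
    cards[index] = 0xFF;
}
```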
I recreated the two tests in the top comment using BenchmarkDotNet. C# and Arm64 asm code is here: https://gist.github.com/a74nh/c8e06132b2d7c33a373a88f567ef8ef8

I ran it on a bunch of machines, all with Ubuntu 22.04:

- x86 Gold 5120T
- x86 Platinum 8480
- Altra (Neoverse N1)
- Altra Max
- Cobalt 100 (Neoverse N2)
- Graviton 3 c7g.2xlarge (Neoverse V1)
- Graviton 3 c7g.16xlarge (Neoverse V1)
- Grace (Neoverse V2)
Curiously, the … Running perf on some of these (see the gist), … For …
You can move the assignments to separate no-inline methods to make sure the rest of the codegen around the loops is the same. To be fair, I also don't think there's a problem here. My benchmarks also don't show terrible differences between arm64 and comparable x64. I observe high contention on card-table loads in OrchardCMS (high level of concurrency), but it doesn't look to be a bottleneck.
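As a minimal sketch of that isolation technique (the type and method names are illustrative, not from the issue):

```csharp
using System.Runtime.CompilerServices;

class Node          // illustrative type with one ref field and one int field
{
    public Node? Ref;
    public int Val;
}

static class IsolatedStores
{
    // Each store lives in its own no-inline method, so both benchmark loops
    // get identical codegen and only the called body differs.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static void AssignRef(Node n) => n.Ref = n;        // emits the GC write barrier

    [MethodImpl(MethodImplOptions.NoInlining)]
    public static void AssignInt(Node n, int v) => n.Val = v; // plain store, no barrier
}
```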
Added Cobalt 100 figures to the set of results. |
@EgorBot -linux_azure_cobalt100 -linux_azure_milano -profiler -commit main

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);

public class Bench
{
    static object Src = new ();
    static object Dst;

    // Collect once up front so Src is promoted out of gen0 before WB_Gen2 runs.
    static Bench() => GC.Collect();

    [Benchmark]
    public void WB_Gen0()
    {
        // Stores a reference to a freshly allocated (gen0) object.
        var src = new object();
        for (int i = 0; i < 1_000_000; i++)
            Dst = src;
    }

    [Benchmark]
    public void WB_Gen2()
    {
        // Stores a reference to the promoted, older-generation object.
        var obj = Src;
        for (int i = 0; i < 1_000_000; i++)
            Dst = obj;
    }
}
```
On a Windows 16-core Cobalt machine, here are the results, which match the Linux numbers:
@a74nh - After looking at the disassembly, the … To compare with …:

```csharp
internal class DataClass
{
    public DataClass x;
    public int y;
}

[Benchmark]
public void NoBarrier_v2() => TestNoBarrier_v2(data, 5);

[MethodImpl(MethodImplOptions.NoInlining)]
static unsafe DataClass TestNoBarrier_v2(DataClass d, int _y)
{
    // 'data' and 'iter' are fields defined elsewhere in the benchmark (see the gist).
    for (long i = 0; i < iter; i++)
    {
        d.y = _y;   // int store: no write barrier is emitted
    }
    return d;
}
```
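For contrast, a hypothetical counterpart that stores a reference instead of an int, so the JIT emits the write barrier on every iteration (the method name and `_x` parameter are illustrative, not from the issue):

```csharp
[MethodImpl(MethodImplOptions.NoInlining)]
static DataClass TestBarrier_v2(DataClass d, DataClass _x)
{
    for (long i = 0; i < iter; i++)
    {
        d.x = _x;   // reference store: the JIT emits a call to the GC write barrier
    }
    return d;
}
```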
Description
We observed a significant performance disparity between the Arm64 and x64 write barriers. When running a program without the write barrier, Arm64 was 3x slower than x64. However, with the write barrier enabled, Arm64 became 10x slower. This suggests that Arm64's handling of the write barrier is less optimized than x64's.
Data
Performance Counter Stats without the Write Barrier
To test the performance of the write barrier, we used Crank to run a simple program 10 times on each of the two machines. Notice that even when we do not access the write barrier, the program is approximately 3x slower on the Arm64 machine.

This is the simple program, measured using Crank, that does not access the write barrier:
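The listing itself was not preserved here; based on the no-barrier variant discussed in the comments above, it was presumably an int-store loop along these lines (a sketch, not the exact original):

```csharp
// No write barrier: storing an int never touches the card table.
var foo = new Foo();
for (long i = 0; i < 200_000_000; i++)
{
    foo.y = 5;
}

class Foo
{
    public int y;
}
```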
Table 1: Average Performance Counter Stats without the write barrier.
Performance Counter Stats with the Write Barrier
When we do access the write barrier, performance degrades further, with the Arm64 machine becoming 10x slower.
This is the simple program, measured using Crank, that accesses the write barrier:
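The listing was likewise stripped; it presumably matches the reference-store loop from the top comment (a sketch):

```csharp
// Write barrier: every reference store goes through the GC write barrier.
var foo = new Foo();
for (long i = 0; i < 200_000_000; i++)
{
    foo.x = foo;
}

class Foo
{
    public volatile Foo? x;
}
```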
Table 2: Performance Counter Stats with the write barrier.