
Have CSE correctly track float/simd and mask registers for their own enregistration #117118

Open
wants to merge 2 commits into main

Conversation

tannergooding
Member

@tannergooding tannergooding commented Jun 28, 2025

This updates CSE to correctly track float/simd and mask registers for their own enregistration.

Previously the logic either attempted to skip handling these types entirely or misclassified simd/mask under the integer enregistration limits. This could lead to odd diffs and poor register selection in some cases due to suboptimal CSE decisions.

As is typical with these changes, it's "difficult" to assess whether it's an improvement or not just by looking at the diffs, because the CSE decisions impact other optimizations, register selection, etc. The diffs primarily impact methods that use floating-point or SIMD and generally look positive overall.

In general, we end up with fewer things marked as CSE moderate and more marked as CSE aggressive. This is because we're no longer eating up the CSE budget by classifying everything as integer, and it leads to many cases with more total CSEs:

-;  V94 cse0         [V94,T11] (  3,  1.50)    long  ->   x1         "CSE #05: moderate"
-;  V95 cse1         [V95,T02] (  3, 12   )    long  ->   x5         "CSE #02: moderate"
-;  V96 cse2         [V96,T14] (  4, 15.84)   simd8  ->  d18         "CSE #07: moderate"
-;  V97 cse3         [V97,T15] (  4, 15.84)   simd8  ->  d19         "CSE #08: moderate"
+;  V94 cse0         [V94,T12] (  3,  1.50)    long  ->   x1         "CSE #05: moderate"
+;  V95 cse1         [V95,T02] (  3, 12   )    long  ->   x5         "CSE #02: aggressive"
+;  V96 cse2         [V96,T15] (  4, 15.84)   simd8  ->  d18         "CSE #07: aggressive"
+;  V97 cse3         [V97,T16] (  4, 15.84)   simd8  ->  d19         "CSE #08: aggressive"
+;  V98 cse4         [V98,T03] (  3, 11.88)    long  ->   x4         "CSE #06: aggressive"

This in turn shifts some temp numbers and impacts everything else in the method. Most of these look like positive changes, such as:

-						;; size=236 bbWeight=1 PerfScore 81.00
+						;; size=220 bbWeight=1 PerfScore 73.00

This is because we no longer reload some constants or redo other computations unnecessarily:

; gcrRegs +[x19]
             ldr     d16, [@RWD00]
             ldr     d17, [@RWD08]
-            fmin    d16, d16, d17
-            str     d16, [fp, #0x30]	// [V08 cse0]
-            str     d16, [x19, #0x08]
-            ldr     d17, [@RWD16]
-            ldr     d18, [@RWD24]
-            fmin    d17, d17, d18
-            str     d17, [fp, #0x28]	// [V09 cse1]
-            str     d17, [x19, #0x10]
-            ldr     d18, [@RWD00]
-            ldr     d19, [@RWD08]
-            fmax    d18, d18, d19
-            str     d18, [fp, #0x20]	// [V10 cse2]
-            str     d18, [x19, #0x18]
+            fmin    d18, d16, d17
+            str     d18, [fp, #0x30]	// [V08 cse0]
+            str     d18, [x19, #0x08]
             ldr     d19, [@RWD16]
             ldr     d20, [@RWD24]
-            fmax    d19, d19, d20
-            str     d19, [fp, #0x18]	// [V11 cse3]
-            str     d19, [x19, #0x20]
+            fmin    d21, d19, d20
+            str     d21, [fp, #0x28]	// [V09 cse1]
+            str     d21, [x19, #0x10]
+            fmax    d16, d16, d17
+            str     d16, [fp, #0x20]	// [V10 cse2]
+            str     d16, [x19, #0x18]
+            fmax    d17, d19, d20
+            str     d17, [fp, #0x18]	// [V11 cse3]
+            str     d17, [x19, #0x20]

There's some tuning still to be done, however. There are places where things like zero constants are being CSE'd; it would be better to have morph mark those against CSE so we can still optimize accordingly, which should be a simple fix (a sketch of one possible approach follows the listing below):

             mov     w3, wzr
 						;; size=16 bbWeight=0.50 PerfScore 1.00
 G_M48160_IG04:        ; bbWeight=3.96, gcrefRegs=0000 {}, byrefRegs=0005 {x0 x2}, byref, isz
-            ldr     d18, [x0, w3, UXTW #3]
-            ldr     d19, [x2, w3, UXTW #3]
+            mov     w4, w3
+            ldr     d18, [x0, x4, LSL #3]
+            ldr     d19, [x2, x4, LSL #3]
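
As a rough illustration of that "simple fix" (not what this PR does), morph could tag such constants with the existing GTF_DONT_CSE flag so they never become CSE candidates; fgMarkZeroConstDontCse is a hypothetical helper name used only for this sketch:

void Compiler::fgMarkZeroConstDontCse(GenTree* tree)
{
    // Sketch only: keep integral zero constants out of CSE so codegen can keep
    // materializing them directly (e.g. from wzr/xzr on arm64).
    if (tree->IsIntegralConst(0))
    {
        // GTF_DONT_CSE excludes this node from CSE candidacy.
        tree->gtFlags |= GTF_DONT_CSE;
    }
}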

There are many similar cases on x64, along with similar fine tuning that could be done there.

@tannergooding tannergooding added the NO-REVIEW Experimental/testing PR, do NOT review it label Jun 28, 2025
@github-actions github-actions bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jun 28, 2025
@tannergooding tannergooding changed the title Don't have CSE track simd or mask registers as part of the integer enregCount Have CSE correctly track simd and mask registers for their own enregistration Jun 28, 2025
@tannergooding tannergooding changed the title Have CSE correctly track simd and mask registers for their own enregistration Have CSE correctly track float/simd and mask registers for their own enregistration Jun 28, 2025
Comment on lines +43 to +66
// Set the cut off values to use for deciding when we want to use aggressive, moderate or conservative
//
// The values of aggressiveRefCnt and moderateRefCnt start off as zero, and
// when enregCount reaches a certain value we assign the current LclVar's
// (weighted) ref count to aggressiveRefCnt or moderateRefCnt.
//
// On Windows x64 this yields:
// CNT_AGGRESSIVE_ENREG == 12 and CNT_MODERATE_ENREG == 38
// Thus we will typically set the cutoff values for
// aggressiveRefCnt based upon the weight of T13 (the 13th tracked LclVar)
// moderateRefCnt based upon the weight of T39 (the 39th tracked LclVar)
//
// For other architecture and platforms these values dynamically change
// based upon the number of callee saved and callee scratch registers.
//
#define CNT_AGGRESSIVE_ENREG ((CNT_CALLEE_ENREG * 3) / 2)
#define CNT_MODERATE_ENREG ((CNT_CALLEE_ENREG * 3) + (CNT_CALLEE_TRASH * 2))

#define CNT_AGGRESSIVE_ENREG_FLT ((CNT_CALLEE_ENREG_FLOAT * 3) / 2)
#define CNT_MODERATE_ENREG_FLT ((CNT_CALLEE_ENREG_FLOAT * 3) + (CNT_CALLEE_TRASH_FLOAT * 2))

#define CNT_AGGRESSIVE_ENREG_MSK ((CNT_CALLEE_ENREG_MASK * 3) / 2)
#define CNT_MODERATE_ENREG_MSK ((CNT_CALLEE_ENREG_MASK * 3) + (CNT_CALLEE_TRASH_MASK * 2))
Member Author

This just hoists these defines for broader reuse and avoids duplicating the computation in a few different places, helping ensure they stay consistent.

Comment on lines +3155 to +3157
if (varTypeUsesIntReg(varTyp))
{
enregCount++; // The primitive types, including TYP_SIMD types use one register
enregCountInt++;
Member Author

This is an example of where we would previously skip floating-point values but still count simd and mask values as integer, despite simd using the same register set as float.

This meant the enregistration count would get thrown off in a method utilizing SIMD, which is undesirable.

The new logic correctly tracks the enregistration count for each of the three register files that exist today.
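
For illustration, the classification now looks roughly like the following; the counter names here are illustrative, while the varTypeUses*Reg predicates are the existing JIT queries:

if (varTypeUsesIntReg(varTyp))
{
    enregCountInt++; // integers, refs, and byrefs live in general-purpose registers
}
else if (varTypeUsesMaskReg(varTyp))
{
    enregCountMsk++; // TYP_MASK values live in predicate/mask registers
}
else
{
    assert(varTypeUsesFloatReg(varTyp));
    enregCountFlt++; // float/double and TYP_SIMD* share the floating-point/SIMD register file
}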

Comment on lines +4017 to +4019
unsigned regAvailEstimateInt = CNT_MODERATE_ENREG + 1;
unsigned regAvailEstimateFlt = CNT_MODERATE_ENREG_FLT + 1;
unsigned regAvailEstimateMsk = CNT_MODERATE_ENREG_MSK + 1;
Member Author

This logic is already using CNT_CALLEE_ENREG for each register file. However, I'm not convinced it is "optimal".

Really we have CALLEE_SAVED and CALLEE_TRASH sets for each register file, and which is ideal to use depends in part on whether or not there are calls in the method.

If there are no calls, for example, then it is better to use CALLEE_TRASH, which gives us a bigger set of registers to work with. This is because there is no need to save them, and the caller will have already "spent" the cost of spill/restore if it was using them.

However, if there are calls, then we want to prefer CALLEE_TRASH for anything that doesn't live across a call boundary and CALLEE_SAVED for anything that does. This lets us take advantage of who needs to pay which costs and whether something is already presumed trashed by a method higher in the stack, and it gives us access to the "whole" register set for enregistration purposes, so the most values can live in registers without spills.

I kept this "as is" for now, though, and think we should look at this more deeply in the future.
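
To make the idea concrete, a call-aware estimate might look something like this sketch. It is illustrative only and not part of this change; compMethodHasCalls and the blend are assumptions, while the CNT_CALLEE_* macros are the existing target counts:

unsigned EstimateAvailableFloatRegs(bool compMethodHasCalls)
{
    if (!compMethodHasCalls)
    {
        // No calls: callee-trash registers cost nothing extra to use, so effectively
        // the whole floating-point register file is available.
        return CNT_CALLEE_TRASH_FLOAT + CNT_CALLEE_ENREG_FLOAT;
    }

    // With calls: values live across a call want callee-saved registers, while
    // short-lived values want callee-trash, so some blended estimate is more appropriate.
    return CNT_CALLEE_ENREG_FLOAT + (CNT_CALLEE_TRASH_FLOAT / 2);
}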

@@ -4016,12 +4072,12 @@ void CSE_Heuristic::Initialize()
}

#ifdef TARGET_X86
// Treat floating point and 64 bit integers as always on the stack
if (varTypeIsFloating(varDsc->TypeGet()) || varTypeIsLong(varDsc->TypeGet()))
Member Author

This (for floating-point) is a consideration that hasn't been true for years. We've long required SSE/SSE2 on x86, prefer its registers, and only fall back to the x87 stack where required. So we really want to treat floating-point like any other enregisterable value, the same as on x64.
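
A minimal sketch of the narrowed x86 check this implies (illustrative; the counter name is hypothetical):

#ifdef TARGET_X86
    // Only 64-bit integers still warrant the "always on the stack" treatment on x86;
    // floating-point enregisters in SSE/SSE2 registers just like on x64.
    if (varTypeIsLong(varDsc->TypeGet()))
    {
        onStackCount++; // hypothetical counter for values assumed to stay on the stack
    }
#endif // TARGET_X86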

Comment on lines -4133 to -4141
// enregCount only tracks the uses of integer registers.
//
// We could track floating point register usage separately
// but it isn't worth the additional complexity as floating point CSEs
// are rare and we typically have plenty of floating point register available.
//
if (!varTypeIsFloating(varTyp))
unsigned enregCount;
unsigned cntAggressiveEnreg;
unsigned cntModerateEnreg;

if (varTypeUsesIntReg(varTyp))
{
enregCount++; // The primitive types, including TYP_SIMD types use one register
Member Author

This is the other place where we said we were tracking integer registers only, but were actually counting SIMD and mask values as well, which pessimizes things for no good reason.
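
For illustration, the per-register-file selection now looks roughly like this; the m_enregCount* member names are illustrative, while the CNT_* macros are the defines hoisted earlier in this PR:

unsigned enregCount;
unsigned cntAggressiveEnreg;
unsigned cntModerateEnreg;

if (varTypeUsesIntReg(varTyp))
{
    enregCount         = m_enregCountInt;
    cntAggressiveEnreg = CNT_AGGRESSIVE_ENREG;
    cntModerateEnreg   = CNT_MODERATE_ENREG;
}
else if (varTypeUsesMaskReg(varTyp))
{
    enregCount         = m_enregCountMsk;
    cntAggressiveEnreg = CNT_AGGRESSIVE_ENREG_MSK;
    cntModerateEnreg   = CNT_MODERATE_ENREG_MSK;
}
else
{
    assert(varTypeUsesFloatReg(varTyp));
    enregCount         = m_enregCountFlt;
    cntAggressiveEnreg = CNT_AGGRESSIVE_ENREG_FLT;
    cntModerateEnreg   = CNT_MODERATE_ENREG_FLT;
}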

Comment on lines 4548 to 4553
if (varTypeIsFloating(candidate->Expr()->TypeGet()))
{
// floating point loads/store encode larger
cse_def_cost += 2;
cse_use_cost += 1;
}
Member Author

I don't think these costs are "accurate", and they likely deserve more investigation in the future. They also aren't accounting for SIMD/mask. -- We have a similar issue with some of the costs in the gen tree computation, where a lot of them were modeled around the x87 FPU and never updated to account for modern SIMD considerations.

Here are the rough costs of loads:

  • a 32-bit integer move: encoding: 2 bytes; execution: 2-5 cycles
  • a 64-bit integer move: encoding: 3 bytes; execution: 2-5 cycles
  • a 32-bit floating-point move: encoding 4 bytes; execution: 4-7 cycles
  • a 64-bit floating-point move: encoding 4 bytes; execution: 4-7 cycles
  • a 128-bit simd move: encoding 3-4 bytes; execution: 4-7 cycles
  • a 256-bit simd move: encoding 4-5 bytes; execution: 5-8 cycles
  • a 512-bit simd move: encoding 6 bytes; execution: 5-9 cycles

Stores tend to be somewhat more expensive than loads: roughly 4-10 cycles for simd and floating-point, and roughly 2-10 cycles for integers.
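
A rough sketch of how the def/use cost bump could be made type-aware, following the encoding sizes above; the numbers are illustrative and not what the PR or the existing code does:

var_types type = candidate->Expr()->TypeGet();

if (varTypeUsesFloatReg(type))
{
    // floating-point and SIMD loads/stores encode a byte or two larger than GPR moves
    cse_def_cost += 2;
    cse_use_cost += 1;

    if (type == TYP_SIMD64)
    {
        // EVEX-encoded 512-bit moves are larger still
        cse_def_cost += 1;
    }
}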

Comment on lines -4617 to -4647
#ifdef FEATURE_SIMD
// SIMD types may cause a SIMD register to be spilled/restored in the prolog and epilog.
//
if (varTypeIsSIMD(candidate->Expr()->TypeGet()))
{
// We don't have complete information about when these extra spilled/restore will be needed.
// Instead we are conservative and assume that each SIMD CSE that is live across a call
// will cause an additional spill/restore in the prolog and epilog.
//
int spillSimdRegInProlog = 1;

#if defined(TARGET_XARCH)
// If we have a SIMD32/64 that is live across a call we have even higher spill costs
//
if (candidate->Expr()->TypeIs(TYP_SIMD32, TYP_SIMD64))
{
// Additionally for a simd32 CSE candidate we assume that and second spilled/restore will be needed.
// (to hold the upper half of the simd32 register that isn't preserved across the call)
//
spillSimdRegInProlog++;

// We also increase the CSE use cost here to because we may have to generate instructions
// to move the upper half of the simd32 before and after a call.
//
cse_use_cost += 2;
}
#endif // TARGET_XARCH

extra_yes_cost = (BB_UNITY_WEIGHT_UNSIGNED * spillSimdRegInProlog) * 3;
}
#endif // FEATURE_SIMD
Member Author

This logic hasn't really been applicable in a long time. We have single-instruction spill/reload regardless of SIMD size; it's just that TYP_SIMD32/64 require spilling around calls since the upper bits are volatile, and where possible we only spill 128 bits instead of the full register (which is no more expensive to do).

A long time ago we did have more complex logic for upper save/restore, but now we can just follow the normal callee save/restore consideration instead.

@tannergooding tannergooding removed the NO-REVIEW Experimental/testing PR, do NOT review it label Jun 28, 2025
@tannergooding
Member Author

CC. @dotnet/jit-contrib. This should be ready for review.

There's a minor (+0.02% for most, but up to +0.04% on x64 Windows) throughput hit, but this gets us more into a correct shape for CSE considerations and removes some historical quirks where we were eating the integer CSE budget due to floating-point and mask types being in the method. So I think it's worth taking and then doing some follow-up targeted fixes (likely around not CSE'ing certain containable constants) where possible.

@tannergooding tannergooding marked this pull request as ready for review June 28, 2025 21:34
@risc-vv

risc-vv commented Jun 28, 2025

@dotnet/samsung Could you please take a look? These changes may be related to riscv64.

@risc-vv

risc-vv commented Jun 28, 2025

RISC-V Release-CLR-QEMU: 9086 / 9116 (99.67%)
=======================
      passed: 9086
      failed: 2
     skipped: 600
      killed: 28
------------------------
 TOTAL tests: 9716
VIRTUAL time: 35h 20min 51s 144ms
   REAL time: 36min 7s 681ms
=======================

report.xml, report.md, failures.xml, testclr_details.tar.zst

RISC-V Release-CLR-VF2: 9087 / 9117 (99.67%)
=======================
      passed: 9087
      failed: 2
     skipped: 600
      killed: 28
------------------------
 TOTAL tests: 9717
VIRTUAL time: 10h 43min 39s 782ms
   REAL time: 43min 36s 331ms
=======================

report.xml, report.md, failures.xml, testclr_details.tar.zst

RISC-V Release-FX-QEMU: 335773 / 336836 (99.68%)
=======================
      passed: 335773
      failed: 1054
     skipped: 39
      killed: 9
------------------------
 TOTAL tests: 336875
VIRTUAL time: 32h 34min 59s 596ms
   REAL time: 1h 11min 18s 986ms
=======================

report.xml, report.md, failures.xml, testclr_details.tar.zst

RISC-V Release-FX-VF2: 499778 / 501520 (99.65%)
=======================
      passed: 499778
      failed: 1734
     skipped: 39
      killed: 8
------------------------
 TOTAL tests: 501559
VIRTUAL time: 20h 55min 7s 21ms
   REAL time: 2h 18min 24s 878ms
=======================

report.xml, report.md, failures.xml, testclr_details.tar.zst

Build information and commands

GIT: 8cde53693255d7d0fa117f8c1bae90bfd35974f2
CI: bf2351b4b801509d156b00d7a6d660aed10fd64f
REPO: dotnet/runtime
BRANCH: main
CONFIG: Release
LIB_CONFIG: Release
