
Have CSE correctly track float/simd and mask registers for their own enregistration #117118

Open
wants to merge 2 commits into main

Conversation

tannergooding
Member

@tannergooding tannergooding commented Jun 28, 2025

This updates CSE to correctly track float/simd and mask registers for their own enregistration.

Previously the logic either attempted to skip handling these types entirely or misclassified simd/mask under the integer enregistration limits. This could lead to odd diffs and poor register selection in some cases due to suboptimal CSE decisions.

As is typical with these changes, it's "difficult" to assess whether it's an improvement or not just by looking at the diffs, because the CSE decisions impact other optimizations, register selection, etc. The diffs primarily impact methods that use floating-point or SIMD and generally look positive overall.

In general, we end up with fewer things marked as CSE moderate and more marked as CSE aggressive. This is because we're no longer eating up the CSE budget by classifying everything as integer, and it leads to many cases with more total CSEs:

-;  V94 cse0         [V94,T11] (  3,  1.50)    long  ->   x1         "CSE #05: moderate"
-;  V95 cse1         [V95,T02] (  3, 12   )    long  ->   x5         "CSE #02: moderate"
-;  V96 cse2         [V96,T14] (  4, 15.84)   simd8  ->  d18         "CSE #07: moderate"
-;  V97 cse3         [V97,T15] (  4, 15.84)   simd8  ->  d19         "CSE #08: moderate"
+;  V94 cse0         [V94,T12] (  3,  1.50)    long  ->   x1         "CSE #05: moderate"
+;  V95 cse1         [V95,T02] (  3, 12   )    long  ->   x5         "CSE #02: aggressive"
+;  V96 cse2         [V96,T15] (  4, 15.84)   simd8  ->  d18         "CSE #07: aggressive"
+;  V97 cse3         [V97,T16] (  4, 15.84)   simd8  ->  d19         "CSE #08: aggressive"
+;  V98 cse4         [V98,T03] (  3, 11.88)    long  ->   x4         "CSE #06: aggressive"

This in turn shifts some temp numbers and impacts everything else in the method. Most of these look like positive changes, such as:

-						;; size=236 bbWeight=1 PerfScore 81.00
+						;; size=220 bbWeight=1 PerfScore 73.00

This is because we no longer reload some constants or redo other computations unnecessarily:

; gcrRegs +[x19]
             ldr     d16, [@RWD00]
             ldr     d17, [@RWD08]
-            fmin    d16, d16, d17
-            str     d16, [fp, #0x30]	// [V08 cse0]
-            str     d16, [x19, #0x08]
-            ldr     d17, [@RWD16]
-            ldr     d18, [@RWD24]
-            fmin    d17, d17, d18
-            str     d17, [fp, #0x28]	// [V09 cse1]
-            str     d17, [x19, #0x10]
-            ldr     d18, [@RWD00]
-            ldr     d19, [@RWD08]
-            fmax    d18, d18, d19
-            str     d18, [fp, #0x20]	// [V10 cse2]
-            str     d18, [x19, #0x18]
+            fmin    d18, d16, d17
+            str     d18, [fp, #0x30]	// [V08 cse0]
+            str     d18, [x19, #0x08]
             ldr     d19, [@RWD16]
             ldr     d20, [@RWD24]
-            fmax    d19, d19, d20
-            str     d19, [fp, #0x18]	// [V11 cse3]
-            str     d19, [x19, #0x20]
+            fmin    d21, d19, d20
+            str     d21, [fp, #0x28]	// [V09 cse1]
+            str     d21, [x19, #0x10]
+            fmax    d16, d16, d17
+            str     d16, [fp, #0x20]	// [V10 cse2]
+            str     d16, [x19, #0x18]
+            fmax    d17, d19, d20
+            str     d17, [fp, #0x18]	// [V11 cse3]
+            str     d17, [x19, #0x20]

There's some tuning still to be done, however. There are places where things like zero constants are being CSE'd; it would be better to have morph mark those against CSE so we can still optimize accordingly, which should be a simple fix (a sketch of one possible approach follows the listing below):

             mov     w3, wzr
 						;; size=16 bbWeight=0.50 PerfScore 1.00
 G_M48160_IG04:        ; bbWeight=3.96, gcrefRegs=0000 {}, byrefRegs=0005 {x0 x2}, byref, isz
-            ldr     d18, [x0, w3, UXTW #3]
-            ldr     d19, [x2, w3, UXTW #3]
+            mov     w4, w3
+            ldr     d18, [x0, x4, LSL #3]
+            ldr     d19, [x2, x4, LSL #3]
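
As a rough illustration of that "simple fix" (not what this PR does), morph could tag such constants with the existing GTF_DONT_CSE flag so they never become CSE candidates; fgMarkZeroConstDontCse is a hypothetical helper name used only for this sketch:

void Compiler::fgMarkZeroConstDontCse(GenTree* tree)
{
    // Sketch only: keep integral zero constants out of CSE so codegen can keep
    // materializing them directly (e.g. from wzr/xzr on arm64).
    if (tree->IsIntegralConst(0))
    {
        // GTF_DONT_CSE excludes this node from CSE candidacy.
        tree->gtFlags |= GTF_DONT_CSE;
    }
}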

There are many similar cases on x64, along with similar fine tuning that could be done there.

@tannergooding tannergooding added the NO-REVIEW Experimental/testing PR, do NOT review it label Jun 28, 2025
@github-actions github-actions bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jun 28, 2025
@tannergooding tannergooding changed the title Don't have CSE track simd or mask registers as part of the integer enregCount Have CSE correctly track simd and mask registers for their own enregistration Jun 28, 2025
@tannergooding tannergooding changed the title Have CSE correctly track simd and mask registers for their own enregistration Have CSE correctly track float/simd and mask registers for their own enregistration Jun 28, 2025
Comment on lines +43 to +66
// Set the cut off values to use for deciding when we want to use aggressive, moderate or conservative
//
// The values of aggressiveRefCnt and moderateRefCnt start off as zero, and
// when enregCount reaches a certain value we assign the current LclVar's
// (weighted) ref count to aggressiveRefCnt or moderateRefCnt.
//
// On Windows x64 this yields:
// CNT_AGGRESSIVE_ENREG == 12 and CNT_MODERATE_ENREG == 38
// Thus we will typically set the cutoff values for
// aggressiveRefCnt based upon the weight of T13 (the 13th tracked LclVar)
// moderateRefCnt based upon the weight of T39 (the 39th tracked LclVar)
//
// For other architecture and platforms these values dynamically change
// based upon the number of callee saved and callee scratch registers.
//
#define CNT_AGGRESSIVE_ENREG ((CNT_CALLEE_ENREG * 3) / 2)
#define CNT_MODERATE_ENREG ((CNT_CALLEE_ENREG * 3) + (CNT_CALLEE_TRASH * 2))

#define CNT_AGGRESSIVE_ENREG_FLT ((CNT_CALLEE_ENREG_FLOAT * 3) / 2)
#define CNT_MODERATE_ENREG_FLT ((CNT_CALLEE_ENREG_FLOAT * 3) + (CNT_CALLEE_TRASH_FLOAT * 2))

#define CNT_AGGRESSIVE_ENREG_MSK ((CNT_CALLEE_ENREG_MASK * 3) / 2)
#define CNT_MODERATE_ENREG_MSK ((CNT_CALLEE_ENREG_MASK * 3) + (CNT_CALLEE_TRASH_MASK * 2))
Member Author

This just hoists these defines for broader reuse and avoids duplicating the computation in a few different places, helping ensure they stay consistent.

Comment on lines +3155 to +3157
if (varTypeUsesIntReg(varTyp))
{
enregCount++; // The primitive types, including TYP_SIMD types use one register
enregCountInt++;
Member Author

This is an example of where we would previously skip floating-point values but still count simd and mask values as integer, despite simd using the same register set as float.

This meant the enregistration count would get thrown off in a method utilizing SIMD, which is undesirable.

The new logic correctly tracks the enregistration count for each of the three register files that exist today.
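
For illustration, the classification now looks roughly like the following; the counter names here are illustrative, while the varTypeUses*Reg predicates are the existing JIT queries:

if (varTypeUsesIntReg(varTyp))
{
    enregCountInt++; // integers, refs, and byrefs live in general-purpose registers
}
else if (varTypeUsesMaskReg(varTyp))
{
    enregCountMsk++; // TYP_MASK values live in predicate/mask registers
}
else
{
    assert(varTypeUsesFloatReg(varTyp));
    enregCountFlt++; // float/double and TYP_SIMD* share the floating-point/SIMD register file
}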

Comment on lines +4017 to +4019
unsigned regAvailEstimateInt = CNT_MODERATE_ENREG + 1;
unsigned regAvailEstimateFlt = CNT_MODERATE_ENREG_FLT + 1;
unsigned regAvailEstimateMsk = CNT_MODERATE_ENREG_MSK + 1;
Member Author

This logic is already using CNT_CALLEE_ENREG for each register file. However, I'm not convinced it is "optimal".

Really we have CALLEE_SAVED and CALLEE_TRASH sets for each register file, and which is ideal to use depends in part on whether or not there are calls in the method.

If there are no calls, for example, then it is better to use CALLEE_TRASH, which gives us a bigger set of registers to work with. This is because there is no need to save them, and the caller will have already "spent" the cost of spill/restore if it was using them.

However, if there are calls, then we want to prefer CALLEE_TRASH for anything that doesn't live across a call boundary and CALLEE_SAVED for anything that does. This lets us take advantage of who needs to pay which costs and whether something is already presumed trashed by a method higher in the stack, and it gives us access to the "whole" register set for enregistration purposes, so the most values can live in registers without spills.

I kept this "as is" for now, though, and think we should look at this more deeply in the future.
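
To make the idea concrete, a call-aware estimate might look something like this sketch. It is illustrative only and not part of this change; compMethodHasCalls and the blend are assumptions, while the CNT_CALLEE_* macros are the existing target counts:

unsigned EstimateAvailableFloatRegs(bool compMethodHasCalls)
{
    if (!compMethodHasCalls)
    {
        // No calls: callee-trash registers cost nothing extra to use, so effectively
        // the whole floating-point register file is available.
        return CNT_CALLEE_TRASH_FLOAT + CNT_CALLEE_ENREG_FLOAT;
    }

    // With calls: values live across a call want callee-saved registers, while
    // short-lived values want callee-trash, so some blended estimate is more appropriate.
    return CNT_CALLEE_ENREG_FLOAT + (CNT_CALLEE_TRASH_FLOAT / 2);
}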

@@ -4016,12 +4072,12 @@ void CSE_Heuristic::Initialize()
}

#ifdef TARGET_X86
// Treat floating point and 64 bit integers as always on the stack
if (varTypeIsFloating(varDsc->TypeGet()) || varTypeIsLong(varDsc->TypeGet()))
Member Author

This (for floating-point) is a consideration that hasn't been true for years. We've long required SSE/SSE2 on x86, prefer its registers, and only fall back to the x87 stack where required. So we really want to treat floating-point like any other enregisterable value, the same as on x64.
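
A minimal sketch of the narrowed x86 check this implies (illustrative; the counter name is hypothetical):

#ifdef TARGET_X86
    // Only 64-bit integers still warrant the "always on the stack" treatment on x86;
    // floating-point enregisters in SSE/SSE2 registers just like on x64.
    if (varTypeIsLong(varDsc->TypeGet()))
    {
        onStackCount++; // hypothetical counter for values assumed to stay on the stack
    }
#endif // TARGET_X86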

Comment on lines -4133 to -4141
// enregCount only tracks the uses of integer registers.
//
// We could track floating point register usage separately
// but it isn't worth the additional complexity as floating point CSEs
// are rare and we typically have plenty of floating point register available.
//
if (!varTypeIsFloating(varTyp))
unsigned enregCount;
unsigned cntAggressiveEnreg;
unsigned cntModerateEnreg;

if (varTypeUsesIntReg(varTyp))
{
enregCount++; // The primitive types, including TYP_SIMD types use one register
Member Author

This is the other place where we said we were tracking integer registers only, but were actually counting SIMD and mask values as well, which pessimizes things for no good reason.
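
For illustration, the per-register-file selection now looks roughly like this; the m_enregCount* member names are illustrative, while the CNT_* macros are the defines hoisted earlier in this PR:

unsigned enregCount;
unsigned cntAggressiveEnreg;
unsigned cntModerateEnreg;

if (varTypeUsesIntReg(varTyp))
{
    enregCount         = m_enregCountInt;
    cntAggressiveEnreg = CNT_AGGRESSIVE_ENREG;
    cntModerateEnreg   = CNT_MODERATE_ENREG;
}
else if (varTypeUsesMaskReg(varTyp))
{
    enregCount         = m_enregCountMsk;
    cntAggressiveEnreg = CNT_AGGRESSIVE_ENREG_MSK;
    cntModerateEnreg   = CNT_MODERATE_ENREG_MSK;
}
else
{
    assert(varTypeUsesFloatReg(varTyp));
    enregCount         = m_enregCountFlt;
    cntAggressiveEnreg = CNT_AGGRESSIVE_ENREG_FLT;
    cntModerateEnreg   = CNT_MODERATE_ENREG_FLT;
}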

Comment on lines 4548 to 4553
if (varTypeIsFloating(candidate->Expr()->TypeGet()))
{
// floating point loads/store encode larger
cse_def_cost += 2;
cse_use_cost += 1;
}
Member Author

I don't think these costs are "accurate", and they likely deserve more investigation in the future. They also aren't accounting for SIMD/mask. -- We have a similar issue with some of the costs in the gen tree computation, where a lot of them were modeled around the x87 FPU and never updated to account for modern SIMD considerations.

Here are the rough costs of loads:

  • a 32-bit integer move: encoding: 2 bytes; execution: 2-5 cycles
  • a 64-bit integer move: encoding: 3 bytes; execution: 2-5 cycles
  • a 32-bit floating-point move: encoding 4 bytes; execution: 4-7 cycles
  • a 64-bit floating-point move: encoding 4 bytes; execution: 4-7 cycles
  • a 128-bit simd move: encoding 3-4 bytes; execution: 4-7 cycles
  • a 256-bit simd move: encoding 4-5 bytes; execution: 5-8 cycles
  • a 512-bit simd move: encoding 6 bytes; execution: 5-9 cycles

Stores tend to be somewhat more expensive than loads: roughly 4-10 cycles for simd and floating-point, and roughly 2-10 cycles for integers.
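
A rough sketch of how the def/use cost bump could be made type-aware, following the encoding sizes above; the numbers are illustrative and not what the PR or the existing code does:

var_types type = candidate->Expr()->TypeGet();

if (varTypeUsesFloatReg(type))
{
    // floating-point and SIMD loads/stores encode a byte or two larger than GPR moves
    cse_def_cost += 2;
    cse_use_cost += 1;

    if (type == TYP_SIMD64)
    {
        // EVEX-encoded 512-bit moves are larger still
        cse_def_cost += 1;
    }
}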

Comment on lines -4617 to -4647
#ifdef FEATURE_SIMD
// SIMD types may cause a SIMD register to be spilled/restored in the prolog and epilog.
//
if (varTypeIsSIMD(candidate->Expr()->TypeGet()))
{
// We don't have complete information about when these extra spilled/restore will be needed.
// Instead we are conservative and assume that each SIMD CSE that is live across a call
// will cause an additional spill/restore in the prolog and epilog.
//
int spillSimdRegInProlog = 1;

#if defined(TARGET_XARCH)
// If we have a SIMD32/64 that is live across a call we have even higher spill costs
//
if (candidate->Expr()->TypeIs(TYP_SIMD32, TYP_SIMD64))
{
// Additionally for a simd32 CSE candidate we assume that and second spilled/restore will be needed.
// (to hold the upper half of the simd32 register that isn't preserved across the call)
//
spillSimdRegInProlog++;

// We also increase the CSE use cost here to because we may have to generate instructions
// to move the upper half of the simd32 before and after a call.
//
cse_use_cost += 2;
}
#endif // TARGET_XARCH

extra_yes_cost = (BB_UNITY_WEIGHT_UNSIGNED * spillSimdRegInProlog) * 3;
}
#endif // FEATURE_SIMD
Member Author

This logic hasn't really been applicable in a long time. We have single-instruction spill/reload regardless of SIMD size; it's just that TYP_SIMD32/64 require spilling around calls since the upper bits are volatile, and where possible we only spill 128 bits instead of the full register (which is no more expensive to do).

A long time ago we did have more complex logic for upper save/restore, but now we can just follow the normal callee save/restore consideration instead.

@tannergooding tannergooding removed the NO-REVIEW Experimental/testing PR, do NOT review it label Jun 28, 2025
@tannergooding
Member Author

CC. @dotnet/jit-contrib. This should be ready for review.

There's a minor (+0.02% for most, but up to +0.04% on x64 Windows) throughput hit, but this gets us more into a correct shape for CSE considerations and removes some historical quirks where we were eating the integer CSE budget due to floating-point and mask types being in the method. So I think it's worth taking and then doing some follow-up targeted fixes (likely around not CSE'ing certain containable constants) where possible.

@tannergooding tannergooding marked this pull request as ready for review June 28, 2025 21:34
@risc-vv

risc-vv commented Jun 28, 2025

@dotnet/samsung Could you please take a look? These changes may be related to riscv64.

@risc-vv

risc-vv commented Jun 28, 2025

RISC-V Release-CLR-QEMU: 9086 / 9116 (99.67%)
=======================
      passed: 9086
      failed: 2
     skipped: 600
      killed: 28
------------------------
 TOTAL tests: 9716
VIRTUAL time: 35h 20min 51s 144ms
   REAL time: 36min 7s 681ms
=======================

report.xml, report.md, failures.xml, testclr_details.tar.zst

RISC-V Release-CLR-VF2: 9087 / 9117 (99.67%)
=======================
      passed: 9087
      failed: 2
     skipped: 600
      killed: 28
------------------------
 TOTAL tests: 9717
VIRTUAL time: 10h 43min 39s 782ms
   REAL time: 43min 36s 331ms
=======================

report.xml, report.md, failures.xml, testclr_details.tar.zst

RISC-V Release-FX-QEMU: 335773 / 336836 (99.68%)
=======================
      passed: 335773
      failed: 1054
     skipped: 39
      killed: 9
------------------------
 TOTAL tests: 336875
VIRTUAL time: 32h 34min 59s 596ms
   REAL time: 1h 11min 18s 986ms
=======================

report.xml, report.md, failures.xml, testclr_details.tar.zst

RISC-V Release-FX-VF2: 499778 / 501520 (99.65%)
=======================
      passed: 499778
      failed: 1734
     skipped: 39
      killed: 8
------------------------
 TOTAL tests: 501559
VIRTUAL time: 20h 55min 7s 21ms
   REAL time: 2h 18min 24s 878ms
=======================

report.xml, report.md, failures.xml, testclr_details.tar.zst

Build information and commands

GIT: 8cde53693255d7d0fa117f8c1bae90bfd35974f2
CI: bf2351b4b801509d156b00d7a6d660aed10fd64f
REPO: dotnet/runtime
BRANCH: main
CONFIG: Release
LIB_CONFIG: Release
