Have CSE correctly track float/simd and mask registers for their own enregistration #117118
```cpp
// Set the cutoff values to use for deciding when we want to use aggressive, moderate or conservative
//
// The values of aggressiveRefCnt and moderateRefCnt start off as zero and
// when enregCount reaches a certain value we assign the current LclVar
// (weighted) ref count to aggressiveRefCnt or moderateRefCnt.
//
// On Windows x64 this yields:
// CNT_AGGRESSIVE_ENREG == 12 and CNT_MODERATE_ENREG == 38
// Thus we will typically set the cutoff values for
//   aggressiveRefCnt based upon the weight of T13 (the 13th tracked LclVar)
//   moderateRefCnt based upon the weight of T39 (the 39th tracked LclVar)
//
// For other architectures and platforms these values dynamically change
// based upon the number of callee-saved and callee-scratch registers.
//
#define CNT_AGGRESSIVE_ENREG ((CNT_CALLEE_ENREG * 3) / 2)
#define CNT_MODERATE_ENREG ((CNT_CALLEE_ENREG * 3) + (CNT_CALLEE_TRASH * 2))

#define CNT_AGGRESSIVE_ENREG_FLT ((CNT_CALLEE_ENREG_FLOAT * 3) / 2)
#define CNT_MODERATE_ENREG_FLT ((CNT_CALLEE_ENREG_FLOAT * 3) + (CNT_CALLEE_TRASH_FLOAT * 2))

#define CNT_AGGRESSIVE_ENREG_MSK ((CNT_CALLEE_ENREG_MASK * 3) / 2)
#define CNT_MODERATE_ENREG_MSK ((CNT_CALLEE_ENREG_MASK * 3) + (CNT_CALLEE_TRASH_MASK * 2))
```
This just hoists these defines for broader reuse and avoids duplicating the computation in a few different places, helping ensure they stay consistent.
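As a quick sanity check of the arithmetic, here is a standalone sketch using the Windows x64 integer register counts implied by the comment above (the concrete counts of 8 callee-saved and 7 callee-trash registers are assumptions for illustration):

```cpp
// Standalone sketch verifying the cutoff arithmetic from the hoisted defines.
// The register counts are assumptions matching the "Windows x64" numbers in
// the comment above (8 callee-saved enregisterable, 7 callee-trash int regs).
#include <cstdio>

#define CNT_CALLEE_ENREG 8 // assumed: RBX, RBP, RSI, RDI, R12-R15
#define CNT_CALLEE_TRASH 7 // assumed: RAX, RCX, RDX, R8-R11

#define CNT_AGGRESSIVE_ENREG ((CNT_CALLEE_ENREG * 3) / 2)
#define CNT_MODERATE_ENREG ((CNT_CALLEE_ENREG * 3) + (CNT_CALLEE_TRASH * 2))

int main()
{
    // Prints "aggressive=12 moderate=38", matching the values quoted above.
    printf("aggressive=%d moderate=%d\n", CNT_AGGRESSIVE_ENREG, CNT_MODERATE_ENREG);
}
```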
```cpp
if (varTypeUsesIntReg(varTyp))
{
    enregCount++; // The primitive types, including TYP_SIMD types use one register
    enregCountInt++;
}
```
This is an example of a place where we would previously skip floating-point values, but the check still counted `simd` and `mask` values, despite `simd` using the same register set as `float`. This meant the enregistration count would get thrown off in a method utilizing SIMD, which is undesirable. The new logic correctly tracks the enregistration count for the 3 different register types that exist today.
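A minimal standalone sketch of the shape of that counting (the stand-in predicates mirror the JIT's `varTypeUsesIntReg`/`varTypeUsesMaskReg`/`varTypeUsesFloatReg`; the enum, predicates, and counter names here are illustrative, not the PR's actual code):

```cpp
// Sketch: classify each tracked local into one of the three register files
// and bump the matching counter. Names are stand-ins for illustration.
enum var_types_sketch { SK_INT, SK_FLOAT, SK_SIMD, SK_MASK };

static bool usesIntReg(var_types_sketch t)   { return t == SK_INT; }
static bool usesMaskReg(var_types_sketch t)  { return t == SK_MASK; }
static bool usesFloatReg(var_types_sketch t) { return t == SK_FLOAT || t == SK_SIMD; }

struct Counts { unsigned intRegs = 0, fltRegs = 0, mskRegs = 0; };

static void trackLocal(Counts& c, var_types_sketch t)
{
    if (usesIntReg(t))       c.intRegs++; // GPRs: ints, refs, byrefs
    else if (usesMaskReg(t)) c.mskRegs++; // predicate/kmask registers
    else                     c.fltRegs++; // float and simd share one register file
}
```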
```cpp
unsigned regAvailEstimateInt = CNT_MODERATE_ENREG + 1;
unsigned regAvailEstimateFlt = CNT_MODERATE_ENREG_FLT + 1;
unsigned regAvailEstimateMsk = CNT_MODERATE_ENREG_MSK + 1;
```
This logic is already using `CNT_CALLEE_ENREG` for each. However, I'm not convinced it is "optimal". Really we have `CALLEE_SAVED` and `CALLEE_TRASH` sets for each register file, and which is ideal to use depends in part on whether or not there are calls in the method.

If there are no calls, then it is better to use `CALLEE_TRASH`, for example, which gives us a bigger set of registers to use. There is no consideration of needing to save them, and the caller will have already "spent" the cost of spill/restore if it was using them.

If there are calls, however, then we want to prefer `CALLEE_TRASH` for anything that doesn't live across a call boundary and `CALLEE_SAVED` for anything that does. This lets us take advantage of who needs to pay which costs, account for registers already presumed trashed by a method higher in the stack, and gives us access to the "whole" register set for enregistration purposes, so the most values can live in registers without spills.
I kept this "as is" for now though, and think we should look at this more deeply in the future.
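For illustration, here is a hypothetical shape of that call-aware budget selection. This is not what the PR implements (it keeps the existing `CNT_CALLEE_ENREG`-based cutoffs); the register counts and the `methodHasCalls`/`liveAcrossCall` parameters are assumptions:

```cpp
// Hypothetical sketch of a call-aware integer register budget, per the
// discussion above. NOT what this PR does. The counts below are the
// assumed Windows x64 integer register counts.
#define CNT_CALLEE_ENREG 8 // assumed callee-saved, enregisterable
#define CNT_CALLEE_TRASH 7 // assumed callee-trash (caller-saved)

unsigned estimateIntRegBudget(bool methodHasCalls, bool liveAcrossCall)
{
    if (!methodHasCalls)
    {
        // No calls: callee-trash registers cost nothing extra to use, and
        // the caller already paid any spill/restore for registers it was
        // using, so effectively the whole register file is available.
        return CNT_CALLEE_ENREG + CNT_CALLEE_TRASH;
    }

    // With calls: values that die before any call can live in callee-trash
    // registers for free; values live across a call should prefer the
    // callee-saved set so they survive calls without per-call spills.
    return liveAcrossCall ? CNT_CALLEE_ENREG : CNT_CALLEE_TRASH;
}
```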
```cpp
@@ -4016,12 +4072,12 @@ void CSE_Heuristic::Initialize()
}

#ifdef TARGET_X86
// Treat floating point and 64 bit integers as always on the stack
if (varTypeIsFloating(varDsc->TypeGet()) || varTypeIsLong(varDsc->TypeGet()))
```
This (for floating-point) is a consideration that hasn't been true for years. On x86 we've long required and preferred the SSE/SSE2 registers, and we only use the x87 stack space where required. So we really want to just treat floating-point like any other enregisterable value, same as on x64.
```cpp
// enregCount only tracks the uses of integer registers.
//
// We could track floating point register usage separately
// but it isn't worth the additional complexity as floating point CSEs
// are rare and we typically have plenty of floating point registers available.
//
if (!varTypeIsFloating(varTyp))

unsigned enregCount;
unsigned cntAggressiveEnreg;
unsigned cntModerateEnreg;

if (varTypeUsesIntReg(varTyp))
{
    enregCount++; // The primitive types, including TYP_SIMD types use one register
```
This is the other place where we said we were tracking integer registers only, but were then actually tracking SIMD and mask values too, which pessimizes things for no good reason.
```cpp
if (varTypeIsFloating(candidate->Expr()->TypeGet()))
{
    // floating point loads/store encode larger
    cse_def_cost += 2;
    cse_use_cost += 1;
}
```
I don't think these costs are "accurate" and they likely deserve more investigation in the future. The logic also isn't accounting for SIMD/mask. We have a similar issue with some of the costs in the gen tree computation, where a lot of them were modeled around the x87 FPU and never updated to account for modern SIMD considerations.

Here are the rough costs of loads:
- a 32-bit integer move: encoding: 2 bytes; execution: 2-5 cycles
- a 64-bit integer move: encoding: 3 bytes; execution: 2-5 cycles
- a 32-bit floating-point move: encoding: 4 bytes; execution: 4-7 cycles
- a 64-bit floating-point move: encoding: 4 bytes; execution: 4-7 cycles
- a 128-bit simd move: encoding: 3-4 bytes; execution: 4-7 cycles
- a 256-bit simd move: encoding: 4-5 bytes; execution: 5-8 cycles
- a 512-bit simd move: encoding: 6 bytes; execution: 5-9 cycles

Stores tend to be more expensive than loads: roughly 4-10 cycles for simd and floating-point, and nearly 2-10 cycles for integers.
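For illustration only, here is a sketch of how those rough numbers could be tabulated. The enum, struct, and values are placeholders built from the estimates above, not the JIT's actual cost model:

```cpp
// Hypothetical width-aware load-cost table built from the rough estimates
// quoted above. Purely illustrative; the numbers are ballpark, not measured.
struct LoadCost
{
    int encodingBytes; // typical instruction encoding size
    int minCycles;     // low end of the quoted latency range
    int maxCycles;     // high end of the quoted latency range
};

enum MoveKind { Int32, Int64, Float32, Float64, Simd128, Simd256, Simd512 };

LoadCost estimateLoadCost(MoveKind kind)
{
    switch (kind)
    {
        case Int32:   return {2, 2, 5}; // 32-bit integer move
        case Int64:   return {3, 2, 5}; // 64-bit integer move
        case Float32: return {4, 4, 7}; // 32-bit floating-point move
        case Float64: return {4, 4, 7}; // 64-bit floating-point move
        case Simd128: return {4, 4, 7}; // 128-bit simd move (3-4 bytes)
        case Simd256: return {5, 5, 8}; // 256-bit simd move (4-5 bytes)
        case Simd512: return {6, 5, 9}; // 512-bit simd move
    }
    return {0, 0, 0};
}
```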
```cpp
#ifdef FEATURE_SIMD
    // SIMD types may cause a SIMD register to be spilled/restored in the prolog and epilog.
    //
    if (varTypeIsSIMD(candidate->Expr()->TypeGet()))
    {
        // We don't have complete information about when these extra spills/restores will be needed.
        // Instead we are conservative and assume that each SIMD CSE that is live across a call
        // will cause an additional spill/restore in the prolog and epilog.
        //
        int spillSimdRegInProlog = 1;

#if defined(TARGET_XARCH)
        // If we have a SIMD32/64 that is live across a call we have even higher spill costs
        //
        if (candidate->Expr()->TypeIs(TYP_SIMD32, TYP_SIMD64))
        {
            // Additionally for a simd32 CSE candidate we assume that a second spill/restore will be needed
            // (to hold the upper half of the simd32 register that isn't preserved across the call).
            //
            spillSimdRegInProlog++;

            // We also increase the CSE use cost here because we may have to generate instructions
            // to move the upper half of the simd32 before and after a call.
            //
            cse_use_cost += 2;
        }
#endif // TARGET_XARCH

        extra_yes_cost = (BB_UNITY_WEIGHT_UNSIGNED * spillSimdRegInProlog) * 3;
    }
#endif // FEATURE_SIMD
```
This logic hasn't really been applicable in a long time. We have single-instruction spill/reload regardless of SIMD size; it's just that TYP_SIMD32/64 require spilling since their upper bits are volatile, and we may spill only 128 bits instead of all bits when possible (which is no more expensive to do).

A long time ago we did have more complex logic for upper save/restore, but now we can just follow the normal callee save/restore considerations instead.
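To make that concrete, a short illustration in comment form (the stack offsets are placeholders; the mnemonics are the usual x64 SIMD moves):

```cpp
// Illustrative only: a spill today is a single instruction regardless of
// SIMD width, so SIMD CSEs don't need an extra per-width penalty:
//
//   vmovaps xmmword ptr [rsp+0x20], xmm6   ; 128-bit spill, one instruction
//   vmovups ymmword ptr [rsp+0x20], ymm6   ; 256-bit spill, one instruction
//   vmovups zmmword ptr [rsp+0x20], zmm6   ; 512-bit spill, one instruction
//
// TYP_SIMD32/64 values live across a call must be spilled (their upper bits
// are volatile), but the spill itself is no more expensive to emit.
```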
CC @dotnet/jit-contrib. This should be ready for review. There's a minor (+0.02% for most, but up to +0.04% on x64 Windows) throughput hit, but this gets us more into a correct shape for CSE considerations and removes some historical quirks where we were eating the integer CSE budget due to floating-point and mask types being in the method. So I think it's worth taking and then doing some follow-up targeted fixes (likely around not CSE'ing certain containable constants) where possible.
@dotnet/samsung Could you please take a look? These changes may be related to riscv64.
RISC-V Release-CLR-QEMU: 9086 / 9116 (99.67%)
  report.xml, report.md, failures.xml, testclr_details.tar.zst
RISC-V Release-CLR-VF2: 9087 / 9117 (99.67%)
  report.xml, report.md, failures.xml, testclr_details.tar.zst
RISC-V Release-FX-QEMU: 335773 / 336836 (99.68%)
  report.xml, report.md, failures.xml, testclr_details.tar.zst
RISC-V Release-FX-VF2: 499778 / 501520 (99.65%)
  report.xml, report.md, failures.xml, testclr_details.tar.zst

Build information and commands

GIT:
This updates CSE to correctly track float/simd and mask registers for their own enregistration.

Previously the logic attempted to skip the handling of these types, or even misclassified simd/mask under the integer enregistration limits. This could lead to weird diffs and register selection in some cases due to suboptimal CSE considerations.

As is typical with these changes, it's "difficult" to assess whether it's an improvement or not just by looking at the diffs, because the CSE decisions impact other optimizations, register selection, etc. The diffs primarily impact methods that use floating-point or SIMD and generally look positive overall.

In general, we're ending up with fewer things marked as `CSE moderate` and more marked as `CSE aggressive`, because we're no longer eating up the CSE budget by classifying everything as integer. This leads to many cases with more total CSEs, which in turn shifts some temp numbers and impacts everything else in the method. Most of these look to be positive changes, where we no longer reload some constants or do other computations unnecessarily.

There's some tuning to be done, however, because there are some places where things like zero constants are being CSE'd when it'd be better to have `morph` mark against that so we can still optimize accordingly (which should be a simple fix). There are many similar cases on x64 where some similar fine tuning could be done.