JIT: Add reasoning about loop trip counts and optimize counted loops into downwards counted loops #102261
Conversation
This builds out some initial reasoning about trip counts of loops and utilizes it to convert upwards counted loops into downwards counted loops when beneficial.

The trip count of a loop is defined to be the number of times the header block is entered from a back edge. When this value can be computed, the loop is called counted. The computation here is symbolic and can reason in terms of variables, such as array or span lengths.

Computing the trip count requires the JIT to reason about overflow and to prove various conditions related to the start and end values of the loop. For example, a loop `for (int i = 0; i <= n; i++)` only has a determinable trip count if we can prove that `n < int.MaxValue`. The implementation here utilizes the logic provided by RBO to prove these conditions. In many cases we aren't able to prove them and thus must give up, but this should be improvable in an incremental fashion to handle common cases.

Converting a counted loop to a downwards counting loop is beneficial if the index is not being used for anything else but the loop test. In those cases our target platforms are able to combine the decrement with the exit test into a single instruction.

This transformation does not have that many hits (as you may imagine, the indices of loops are usually used for something else). However, once strength reduction is implemented we expect this transformation to be significantly more important, since strength reduction is in many cases going to remove all uses of an index except the mutation and the loop test. The reasoning about trip counts is itself also needed by strength reduction, which uses it to prove the absence of overflow in various cases.
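To make the overflow condition concrete, here is a minimal standalone sketch (not the JIT's actual API; `TripCountInclusive` is a hypothetical helper) of when the trip count of `for (int i = start; i <= end; i++)` is computable:

```cpp
#include <cstdint>
#include <optional>

// Hypothetical helper: trip count of `for (int i = start; i <= end; i++)`,
// i.e. the number of times the body runs. It is only computable when we can
// prove end < INT32_MAX; otherwise i <= end holds for every int value of i
// once start <= end, and the loop never exits.
std::optional<int64_t> TripCountInclusive(int32_t start, int32_t end)
{
    if (end == INT32_MAX)
        return std::nullopt; // i <= INT32_MAX is always true: no finite count

    if (start > end)
        return 0; // zero-trip: the test fails on entry

    // Widen to 64 bits so the subtraction itself cannot overflow.
    return static_cast<int64_t>(end) - start + 1;
}
```

The same shape of reasoning, done symbolically over SCEVs rather than concrete values, is what the conditions proved via RBO enable.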
Example:

```csharp
private static int Foo(int[] arr, int start, int count)
{
    int sum = 0;
    for (int i = 0; i < count; i++)
    {
        sum += arr[start++];
    }

    return sum;
}
```

```diff
@@ -1,20 +1,18 @@
 G_M42127_IG02:        ;; offset=0x0004
        xor      eax, eax
-       xor      r10d, r10d
        test     r8d, r8d
        jle      SHORT G_M42127_IG05
-                       ;; size=10 bbWeight=1 PerfScore 1.75
-G_M42127_IG03:        ;; offset=0x000E
-       mov      r9d, dword ptr [rcx+0x08]
+                       ;; size=7 bbWeight=1 PerfScore 1.50
+G_M42127_IG03:        ;; offset=0x000B
+       mov      r10d, dword ptr [rcx+0x08]
                        ;; size=4 bbWeight=0.25 PerfScore 0.50
-G_M42127_IG04:        ;; offset=0x0012
-       lea      r9d, [rdx+0x01]
+G_M42127_IG04:        ;; offset=0x000F
+       lea      r10d, [rdx+0x01]
        cmp      edx, dword ptr [rcx+0x08]
        jae      SHORT G_M42127_IG06
        mov      edx, edx
        add      eax, dword ptr [rcx+4*rdx+0x10]
-       inc      r10d
-       cmp      r10d, r8d
-       mov      edx, r9d
-       jl       SHORT G_M42127_IG04
-                       ;; size=26 bbWeight=4 PerfScore 38.00
+       dec      r8d
+       mov      edx, r10d
+       jne      SHORT G_M42127_IG04
+                       ;; size=23 bbWeight=4 PerfScore 37.00
```

Fix dotnet#100915
/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress

Azure Pipelines successfully started running 2 pipeline(s).
```cpp
newValue = FoldBinop<uint32_t>(binop->Oper, static_cast<uint32_t>(cns1->Value),
                               static_cast<uint32_t>(cns2->Value));
```
Changed this since signed overflow is undefined behavior while we want to have wraparound behavior.
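The wraparound-vs-UB point can be illustrated with a minimal standalone sketch (the `FoldAddWrapping` helper here is hypothetical, not JIT code): folding is done in unsigned arithmetic, where overflow is defined to wrap modulo 2^32, instead of signed arithmetic, where overflow is undefined behavior in C++.

```cpp
#include <cstdint>

// Fold a 32-bit addition with two's complement wraparound semantics.
// Signed overflow (e.g. INT32_MAX + 1 done in int32_t) would be UB, so the
// operands round-trip through uint32_t, where wrapping is well defined.
int32_t FoldAddWrapping(int32_t a, int32_t b)
{
    uint32_t result = static_cast<uint32_t>(a) + static_cast<uint32_t>(b);
    return static_cast<int32_t>(result);
}
```

The conversion back to `int32_t` is well defined as two's complement in C++20 (and behaves that way on the compilers the JIT builds with).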
cc @dotnet/jit-contrib

PTAL @EgorBo @AndyAyersMS (when you are back). Diffs. Actually more hits than I expected.

Somewhat large-ish TP regressions since this computes the DFS, loops and possible dominators in the IV opts phase for all targets now. I don't expect many actual perf improvements from a transformation like this, but it has the knock-on effect of often freeing up a register inside the loop, which is especially important on x64. In many cases this transformation is going to mean that strength reduction does not increase register pressure when it kicks in.

I collected some metrics for this transformation and the reasons we give up on it. They are ordered from the checks we do early to late, but note that in many cases resolving earlier reasons would just cause us to give up due to a later one. The stats are from
The locals we try to remove are the ones that have phis in the header. We consider them removable if their only uses are in a "self update" and in the exit test. More specifically, the reasons why we do not consider locals removable break down as:
We should be able to handle loops with multiple exits, and we could also handle cases where the loop test has side effects by extracting the side effects (and validating that the local does not feed into them). We only manage to compute the trip count for 281/807 loops that pass all the earlier checks. I would expect most of the remaining loops to have computable trip counts with some more work on the symbolic reasoning. So this might be something good to follow up on.
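The removability criterion described above can be sketched abstractly (this is a toy model, not the JIT's representation; `UseKind` and `IsRemovableIV` are hypothetical):

```cpp
#include <vector>

// Toy classification of the uses of an IV local inside the loop.
enum class UseKind
{
    SelfUpdate, // the step, e.g. i = i + 1
    ExitTest,   // the loop's exit condition, e.g. i < n
    Other       // any other use: array index, call argument, ...
};

// An IV is a candidate for removal (and hence for down-counting conversion)
// only if every use is either the self-update or the exit test.
bool IsRemovableIV(const std::vector<UseKind>& uses)
{
    for (UseKind u : uses)
    {
        if (u == UseKind::Other)
            return false; // the IV's value is needed for something else
    }

    return true;
}
```

Once strength reduction rewrites memory accesses to use their own IVs, the `Other` uses disappear in many loops, which is why the transformation is expected to hit far more often then.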
Just for completeness, I took a look at the reasons we fail to compute the trip count:
The second-to-last property is normally ensured by loop inversion introducing a zero-trip test, by existing zero-trip tests in the code, or by properties due to standard invariants (like the non-negativity of array/span lengths). I would expect almost all of these to be provable with some stronger inference logic.
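For illustration, this is the rotated loop shape that loop inversion produces (a source-level sketch in C++, not JIT IR): the zero-trip test up front guarantees `i < n` on entry to the body, which is exactly the property that makes the trip count `n - i` provably positive inside the loop.

```cpp
#include <cstdint>

int64_t SumUpTo(const int32_t* arr, int32_t n)
{
    int64_t sum = 0;
    int32_t i   = 0;

    if (i < n) // zero-trip test introduced by inversion
    {
        do
        {
            sum += arr[i];
            i++;
        } while (i < n); // exit test at the bottom of the loop
    }

    return sum;
}
```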
@AndyAyersMS @EgorBo Can you please take a look?
Overall this looks good. Left a few comments that you can address subsequently.
```cpp
if (visitResult == BasicBlockVisit::Abort)
{
    // Live into an exit.
```
Maybe a todo here? We should be able to eliminate these uses by materializing the final value.
Makes sense. I guess it requires some analysis to figure out if the values required to compute the final IV are available in the exit.
```cpp
}
```
```cpp
GenTree* rootNode = stmt->GetRootNode();
if (!rootNode->OperIsLocalStore())
```
If there are other uses does it make sense to consider rewriting them in terms of the new down-counting IV?
I'm not sure... seems like at that point we're just moving the costs around. Maybe if some of the uses are on very rarely executed paths.
```cpp
Scev*    steppedVal   = NewBinop(ScevOper::Add, rhs, step);
steppedVal            = Simplify(steppedVal);
ValueNum steppedValVN = MaterializeVN(steppedVal);
```
It seems a little roundabout to go from SCEV -> IR -> VN, though I think it's fine as long as we're not doing too much speculative IR creation this way.
I think you've talked about integrating scev and vn more closely?
`MaterializeVN` does not create any IR; it directly creates the VN from the SCEV. `Materialize` will always create both IR and VN (to be able to attach the VN to the tree). I wasn't totally sure whether or not that's the behavior we want, but we don't really materialize a lot of IR, so it seemed like it wouldn't be expensive, and it's nice to try to keep the invariant that IR has proper VNs at this point.

It would be nice to express SCEVs directly using VNs. The main problem to solve there is that VNs are reduced "too much", making it non-trivial to materialize them into IR. OTOH SCEVs have the property that they can always be materialized in a straightforward way in the preheader. Well, at least that's my current belief.
Thanks. Looking again it seems clear that VNs are computed first, then (optionally) IR, then the two are attached. Not sure what I thought I saw the first time.
src/coreclr/jit/scev.cpp
```cpp
// The SCEV returned here is equal to the trip count when the exiting block
// dominates all backedges and when it is the only exit of the loop.
//
// The trip count of the loop is defined as the number of times the header
```
Nit -- I think the usual definition is the number of times the header executes, so maybe consider calling this the backedge count?
Ah, good point. Let me fix that.
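The terminology difference can be made concrete with a toy model of a rotated loop (a sketch, not JIT code): for a loop whose body runs `n` times, the header is entered `n` times in total, but only `n - 1` of those entries come via the back edge, so the backedge count is one less than the number of header executions whenever the loop runs at all.

```cpp
// Count header entries vs. backedge-taken entries for a rotated loop
// whose body executes n times.
struct LoopCounts
{
    int headerEntries;
    int backedgeTaken;
};

LoopCounts CountEntries(int n)
{
    LoopCounts c{0, 0};
    bool fromBackedge = false; // the first entry comes from the preheader

    for (int i = 0; i < n; i++)
    {
        c.headerEntries++; // the header block executes
        if (fromBackedge)
            c.backedgeTaken++; // this entry arrived via the back edge

        fromBackedge = true;
    }

    return c;
}
```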
…into downwards counted loops (dotnet#102261)

This builds out some initial reasoning about trip counts of loops and utilizes it to convert upwards counted loops into downwards counted loops when beneficial.

The trip count of a loop is defined to be the number of times the header block is entered. When this value can be computed the loop is called counted. The computation here is symbolic and can reason in terms of variables, such as array or span lengths.

Computing the trip count requires the JIT to reason about overflow and to prove various conditions related to the start and end values of the loop. For example, a loop `for (int i = 0; i <= n; i++)` only has a determinable trip count if we can prove that `n < int.MaxValue`. The implementation here utilizes the logic provided by RBO to prove these conditions. In many cases we aren't able to prove them and thus must give up, but this should be improvable in an incremental fashion to handle common cases.

Converting a counted loop to a downwards counting loop is beneficial if the induction variable is not being used for anything else but the loop test. In those cases our target platforms are able to combine the decrement with the exit test into a single instruction. More importantly, this usually frees up a register inside the loop.

This transformation does not have that many hits (as one can imagine, the IVs of loops are usually used for something else). However, once strength reduction is implemented we expect this transformation to be significantly more important, since strength reduction is in many cases going to remove all uses of an IV except the mutation and the loop test.

The reasoning about trip counts is itself also needed by strength reduction, which uses it to prove the absence of overflow in various cases.
TP regressions are going to be pretty large for this change:

- This enables DFS tree/loop finding in the IV opts phase outside win-x64, which has a cost around 0.4% TP on its own
- This optimization furthermore requires us to build dominators, which comes with its own TP cost

Long term we could remove these costs if we could avoid changing control flow in assertion prop and move RBO to the end of the opts loop (letting all control flow changes happen there). But for now I think we just have to pay some of the costs to allow us to do these optimizations.

Example:

```csharp
private static int Foo(int[] arr, int start, int count)
{
    int sum = 0;
    for (int i = 0; i < count; i++)
    {
        sum += arr[start];
        start++;
    }

    return sum;
}
```

```diff
@@ -1,19 +1,17 @@
 G_M42127_IG02:        ;; offset=0x0004
        xor      eax, eax
-       xor      r10d, r10d
        test     r8d, r8d
        jle      SHORT G_M42127_IG05
-                       ;; size=10 bbWeight=1 PerfScore 1.75
-G_M42127_IG03:        ;; offset=0x000E
-       mov      r9d, dword ptr [rcx+0x08]
+                       ;; size=7 bbWeight=1 PerfScore 1.50
+G_M42127_IG03:        ;; offset=0x000B
+       mov      r10d, dword ptr [rcx+0x08]
        mov      edx, edx
                        ;; size=6 bbWeight=0.25 PerfScore 0.56
-G_M42127_IG04:        ;; offset=0x0014
+G_M42127_IG04:        ;; offset=0x0011
        cmp      edx, dword ptr [rcx+0x08]
        jae      SHORT G_M42127_IG06
        add      eax, dword ptr [rcx+4*rdx+0x10]
        inc      edx
-       inc      r10d
-       cmp      r10d, r8d
-       jl       SHORT G_M42127_IG04
-                       ;; size=19 bbWeight=4 PerfScore 35.00
+       dec      r8d
+       jne      SHORT G_M42127_IG04
+                       ;; size=16 bbWeight=4 PerfScore 34.00
```

Fix dotnet#100915