JIT: Add reasoning about loop trip counts and optimize counted loops into downwards counted loops #102261

Merged · 21 commits merged into dotnet:main · May 30, 2024

Conversation

jakobbotsch (Member) commented on May 15, 2024:

This builds out some initial reasoning about trip counts of loops and utilizes it to convert upwards counted loops into downwards counted loops when beneficial.

The trip count of a loop is defined to be the number of times the header block is entered. When this value can be computed the loop is called counted. The computation here is symbolic and can reason in terms of variables, such as array or span lengths.

Computing the trip count requires the JIT to reason about overflow and to prove various conditions related to the start and end values of the loop. For example, a loop `for (int i = 0; i <= n; i++)` only has a determinable trip count if we can prove that `n < int.MaxValue`. The implementation here utilizes the logic provided by RBO to prove these conditions. In many cases we aren't able to prove them and thus must give up, but this should be improvable incrementally to handle common cases.
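As a rough source-level illustration of that side condition (an invented helper, not the JIT's actual analysis, which works on IR and value numbers):

```csharp
// For `for (int i = 0; i <= n; i++)` the body runs n + 1 times -- but only when
// n < int.MaxValue. If n == int.MaxValue, `i <= n` never becomes false (i wraps
// around in unchecked C#), so there is no finite trip count, and the value n + 1
// itself would overflow a 32-bit computation.
static long InclusiveLoopTripCount(int n)
{
    if (n == int.MaxValue)
        throw new System.ArgumentException("no finite trip count", nameof(n));

    return n < 0 ? 0 : (long)n + 1; // zero-trip when n < 0
}
```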

Converting a counted loop to a downwards counting loop is beneficial if the induction variable is not used for anything other than the loop test. In those cases our target platforms are able to combine the decrement with the exit test into a single instruction. More importantly, this usually frees up a register inside the loop.
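Conceptually, the rewrite looks roughly like the following at the source level (a sketch; `DoWork` is a placeholder, and the JIT performs this on IR rather than on C#):

```csharp
static void Before(int count)
{
    // Upwards counted; `i` is used only by the exit test.
    for (int i = 0; i < count; i++)
    {
        DoWork();
    }
}

static void After(int count)
{
    if (count <= 0)
    {
        return; // zero-trip test: mirrors the guard the JIT relies on
    }

    // Downwards counted: the separate compare against `count` disappears. In the
    // x64 diff in the example below, `inc`/`cmp`/`jl` become `dec`/`jne`, and the
    // register that held `i` is freed.
    for (int remaining = count; remaining != 0; remaining--)
    {
        DoWork();
    }
}

static void DoWork() { } // placeholder body
```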

This transformation does not have that many hits (as you may imagine, the IVs of loops are usually used for something else). However, once strength reduction is implemented, we expect this transformation to become significantly more important, since strength reduction will in many cases remove all uses of an IV except the mutation and the loop test.

The reasoning about trip counts is itself also needed by strength reduction, which uses it to prove the absence of overflow in various cases.

TP regressions are going to be pretty large for this change:

  • This enables DFS tree/loop finding in the IV opts phase outside win-x64, which costs around 0.4% TP on its own
  • This optimization furthermore requires us to build dominators, which comes with its own TP cost

Long term we could remove these costs if we could avoid changing control flow in assertion prop and move RBO to the end of the opts loop (letting all control flow changes happen there). But for now I think we just have to pay some of the costs to allow us to do these optimizations.

Example:

```csharp
private static int Foo(int[] arr, int start, int count)
{
    int sum = 0;
    for (int i = 0; i < count; i++)
    {
        sum += arr[start];
        start++;
    }

    return sum;
}
```
```diff
@@ -1,19 +1,17 @@
 G_M42127_IG02:  ;; offset=0x0004
        xor      eax, eax
-       xor      r10d, r10d
        test     r8d, r8d
        jle      SHORT G_M42127_IG05
-						;; size=10 bbWeight=1 PerfScore 1.75
-G_M42127_IG03:  ;; offset=0x000E
-       mov      r9d, dword ptr [rcx+0x08]
+						;; size=7 bbWeight=1 PerfScore 1.50
+G_M42127_IG03:  ;; offset=0x000B
+       mov      r10d, dword ptr [rcx+0x08]
        mov      edx, edx
 						;; size=6 bbWeight=0.25 PerfScore 0.56
-G_M42127_IG04:  ;; offset=0x0014
+G_M42127_IG04:  ;; offset=0x0011
        cmp      edx, dword ptr [rcx+0x08]
        jae      SHORT G_M42127_IG06
        add      eax, dword ptr [rcx+4*rdx+0x10]
        inc      edx
-       inc      r10d
-       cmp      r10d, r8d
-       jl       SHORT G_M42127_IG04
-						;; size=19 bbWeight=4 PerfScore 35.00
+       dec      r8d
+       jne      SHORT G_M42127_IG04
+						;; size=16 bbWeight=4 PerfScore 34.00
```

Fix #100915

The dotnet-issue-labeler bot added the area-CodeGen-coreclr label on May 15, 2024.
jakobbotsch (Member, Author) commented:

/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress


Azure Pipelines successfully started running 2 pipeline(s).

Comment on lines +1024 to +1025
newValue = FoldBinop<uint32_t>(binop->Oper, static_cast<uint32_t>(cns1->Value), static_cast<uint32_t>(cns2->Value));
jakobbotsch (Member, Author):

Changed this since signed overflow is undefined behavior while we want to have wraparound behavior.
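For reference, a C# sketch of the same wraparound idea (the JIT code above is C++, where overflow of signed arithmetic is undefined behavior, hence the fold on `uint32_t`; the helper name here is invented):

```csharp
// Fold an addition with two's-complement wraparound semantics by doing the
// arithmetic on unsigned values and reinterpreting the result.
static int FoldAddWrapping(int a, int b)
    => unchecked((int)((uint)a + (uint)b));

// e.g. FoldAddWrapping(int.MaxValue, 1) == int.MinValue
```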

src/coreclr/jit/scev.cpp (outdated review thread, resolved)
jakobbotsch (Member, Author) commented:

/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress


Azure Pipelines successfully started running 2 pipeline(s).

jakobbotsch (Member, Author) commented:

/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress


Azure Pipelines successfully started running 2 pipeline(s).

jakobbotsch marked this pull request as ready for review on May 21, 2024 at 13:12.
jakobbotsch (Member, Author) commented:

cc @dotnet/jit-contrib PTAL @EgorBo @AndyAyersMS (when you are back)

Diffs. Actually more hits than I expected. Somewhat large-ish TP regressions since this now computes the DFS, loops, and possibly dominators in the IV opts phase for all targets.

I don't expect many actual perf improvements from a transformation like this, but it has the knock-on effect of often freeing up a register inside the loop, which is especially important on x64. In many cases this transformation is going to mean that strength reduction does not increase register pressure when it kicks in.

I collected some metrics for this transformation and the reasons we give up on it. They are ordered from the earliest check to the latest, but note that in many cases resolving an earlier reason would just cause us to give up due to a later one. The stats are from libraries_tests.run.windows.x64.Release.mch.

  • Loop has multiple exits: 75480
  • Loop exit is not BBJ_COND: 14
  • Exit test node has side effects: 13318
  • Exit condition already has a 0 operand: 7999
  • Loop has no removable locals when doing the transformation: 27965
  • Exit does not dominate all backedges: 5
  • Could not compute trip count: 526
  • Could not materialize IR for trip count: 0
  • Final number of loops transformed: 281

The locals we consider for removal are the ones that have phis in the header. We consider them removable if their only uses are a "self update" and the exit test (see the sketch after the list below). More specifically, the reasons why we do not consider locals removable break down as:

  • Local has a non-store/non-test use: 13768
  • Local is used as the source of a store to a different local: 12343
  • Self store data node involves a side effect (which the local potentially could feed into): 132
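To illustrate the distinction (an invented source-level example; the analysis itself runs on the header phis and their uses in IR):

```csharp
// `i` is removable: besides its self-update (i++), it is only used by the exit
// test, so nothing needs it once the loop counts down instead.
static int OnlyCounts(int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
    {
        sum += 3;
    }
    return sum;
}

// `i` is not removable: it has a use other than the self-update and the exit test
// (the array index), so removing it would require rewriting that use.
static int AlsoIndexes(int[] arr)
{
    int sum = 0;
    for (int i = 0; i < arr.Length; i++)
    {
        sum += arr[i];
    }
    return sum;
}
```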

We should be able to handle loops with multiple exits, and we could also handle cases where the loop test has side effects by extracting the side effects (and validating that the local does not feed into them).

We only manage to compute the trip count for 281/807 loops that pass all the earlier checks. I would expect most of the remaining loops to have computable trip counts with some more work on the symbolic reasoning. So this might be something good to follow up on.

jakobbotsch (Member, Author) commented:

> We only manage to compute the trip count for 281/807 loops that pass all the earlier checks. I would expect most of the remaining loops to have computable trip counts with some more work on the symbolic reasoning. So this might be something good to follow up on.

Just for completeness, I took a look at the reasons we fail to compute the trip count:

  • Unsupported exit condition (not GT_LT, GT_LE, GT_GT or GT_GE): 2
  • Operands not integral: 0
  • Could not analyze operands into SCEV: 167
  • Operands are GC types: 0
  • Operands have no add recurrence: 0
  • Operands are both variant or both invariant: 0
  • Trip count may overflow before the loop exits: 1
  • Could not prove that the analyzed lower bound is <= the analyzed upper bound: 356
  • The step is not 1 or -1: 0

The second-to-last property is normally ensured by loop inversion introducing a zero-trip test, by existing zero-trip tests in the code, or by standard invariants (like the non-negativity of array/span lengths). I would expect almost all of these to be provable with some stronger inference logic.
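As a source-level sketch of where that fact usually comes from (invented example; loop inversion happens in the JIT's IR, not in C#):

```csharp
// After loop inversion the loop is bottom-tested and guarded by a zero-trip test.
// Inside the guard the JIT knows count > 0, i.e. the analyzed lower bound (0) is
// <= the analyzed upper bound (count), which is exactly the fact needed here.
static int Sum(int[] arr, int count)
{
    int sum = 0;
    if (count > 0) // zero-trip test
    {
        int i = 0;
        do
        {
            sum += arr[i];
            i++;
        } while (i < count);
    }
    return sum;
}
```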

jakobbotsch (Member, Author) commented:

@AndyAyersMS @EgorBo Can you please take a look?

AndyAyersMS (Member) left a comment:

Overall this looks good. Left a few comments that you can address subsequently.


if (visitResult == BasicBlockVisit::Abort)
{
// Live into an exit.
AndyAyersMS (Member):

Maybe a todo here? We should be able to eliminate these uses by materializing the final value.

jakobbotsch (Member, Author):

Makes sense. I guess it requires some analysis to figure out if the values required to compute the final IV are available in the exit.
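A rough source-level picture of the idea (invented example):

```csharp
// `i` is live after the loop, which currently blocks the transformation. Because
// the loop is counted, the final value of `i` is known symbolically: it is n when
// n > 0 and 0 otherwise. Rewriting the post-loop use to that expression would make
// `i` removable inside the loop.
static int FinalIndex(int n)
{
    int i = 0;
    for (; i < n; i++)
    {
        // ... body that does not otherwise use i ...
    }
    return i; // could become `n > 0 ? n : 0`, materialized in the exit block
}
```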

}

GenTree* rootNode = stmt->GetRootNode();
if (!rootNode->OperIsLocalStore())
AndyAyersMS (Member):

If there are other uses, does it make sense to consider rewriting them in terms of the new down-counting IV?

jakobbotsch (Member, Author):

I'm not sure... seems like at that point we're just moving the costs around. Maybe if some of the uses are on very rarely executed paths.
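For illustration (invented example, `Consume` is a placeholder): reconstructing the original IV from the down-counter is possible, but each surviving use then pays for a subtraction, which is the cost-shuffling referred to above:

```csharp
static void Original(int[] data, int count)
{
    for (int i = 0; i < count; i++)
    {
        Consume(data[i]);
    }
}

static void RewrittenWithDownCounter(int[] data, int count)
{
    // Assumes count >= 0, as the trip-count reasoning would have established.
    for (int remaining = count; remaining != 0; remaining--)
    {
        int i = count - remaining; // every surviving use of `i` now costs a subtract
        Consume(data[i]);
    }
}

static void Consume(int value) { } // placeholder
```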


Scev* steppedVal = NewBinop(ScevOper::Add, rhs, step);
steppedVal = Simplify(steppedVal);
ValueNum steppedValVN = MaterializeVN(steppedVal);
AndyAyersMS (Member):

It seems a little roundabout to go from SCEV -> IR -> VN, though I think it's fine as long as we're not doing too much speculative IR creation this way.

I think you've talked about integrating scev and vn more closely?

jakobbotsch (Member, Author):

MaterializeVN does not create any IR, it directly creates the VN from the SCEV. Materialize will always create both IR and VN (to be able to attach the VN to the tree). I wasn't totally sure whether or not that's the behavior we want, but we don't really materialize a lot of IR so it seemed like it wouldn't be expensive, and it's nice to try to keep the invariant that IR has proper VNs at this point.

It would be nice to express SCEVs directly using VNs. The main problem to solve there is that VNs are reduced "too much", making it non-trivial to materialize them into IR. OTOH SCEVs have the property that they can always be materialized in a straightforward way in the preheader. Well, at least that's my current belief.

AndyAyersMS (Member):

Thanks. Looking again it seems clear that VNs are computed first, then (optionally) IR, then the two are attached. Not sure what I thought I saw the first time.

// The SCEV returned here is equal to the trip count when the exiting block
// dominates all backedges and when it is the only exit of the loop.
//
// The trip count of the loop is defined as the number of times the header
AndyAyersMS (Member):

Nit -- I think the usual definition is the number of times the header executes, so maybe consider calling this the backedge count?

jakobbotsch (Member, Author):

Ah, good point. Let me fix that.

jakobbotsch merged commit 1ded19e into dotnet:main on May 30, 2024.
109 of 113 checks passed
jakobbotsch deleted the trip-count branch on May 30, 2024 at 09:01.
Ruihan-Yin pushed a commit to Ruihan-Yin/runtime that referenced this pull request May 30, 2024
…into downwards counted loops (dotnet#102261)

Labels
area-CodeGen-coreclr: CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Development

Successfully merging this pull request may close these issues.

JIT: Optimize counted loops by reversing them
2 participants