JIT: Add reasoning about loop trip counts and optimize counted loops into downwards counted loops #102261

Merged: 21 commits merged into dotnet:main from the trip-count branch on May 30, 2024

Conversation

@jakobbotsch (Member) commented May 15, 2024

This builds out some initial reasoning about trip counts of loops and utilizes it to convert upwards counted loops into downwards counted loops when beneficial.

The trip count of a loop is defined to be the number of times the header block is entered. When this value can be computed the loop is called counted. The computation here is symbolic and can reason in terms of variables, such as array or span lengths.

Computing the trip count requires the JIT to reason about overflow and to prove various conditions related to the start and end values of the loop. For example, a loop for (int i = 0; i <= n; i++) only has a determinable trip count if we can prove that n < int.MaxValue. The implementation here utilizes the logic provided by RBO to prove these conditions. In many cases we aren't able to prove them and thus must give up, but this should be improvable in an incremental fashion to handle common cases.
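
To make the overflow condition concrete, a minimal C# illustration (not JIT code) of why the inclusive bound needs that proof:

```csharp
// Illustrative only: why "i <= n" needs the proof that n < int.MaxValue.
static long CountIterations(int n)
{
    long iterations = 0;
    // If n == int.MaxValue, "i <= n" holds for every int value, so i wraps
    // around past int.MaxValue and the loop never terminates: there is no
    // well-defined trip count. If n < int.MaxValue, the trip count is
    // n + 1 for n >= 0, and 0 otherwise.
    for (int i = 0; i <= n; i++)
    {
        iterations++;
    }
    return iterations;
}
```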

Converting a counted loop to a downwards counting loop is beneficial if the induction variable is not used for anything other than the loop test. In those cases our target platforms are able to combine the decrement with the exit test into a single instruction. More importantly this usually frees up a register inside the loop.
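
As a source-level sketch of what the rewrite effectively does when the IV has no other uses (illustrative C#, not the JIT's actual output; the method names are made up):

```csharp
static int IterationsUp(int count)
{
    int executed = 0;
    // Upward-counted: "i" is kept alive across the whole loop just for the exit test.
    for (int i = 0; i < count; i++)
    {
        executed++;
    }
    return executed;
}

static int IterationsDown(int count)
{
    int executed = 0;
    // Downward-counted equivalent. The trip count of the loop above is
    // max(count, 0); starting the down-counter there keeps the behavior
    // identical (in the JIT the count > 0 case is typically already
    // guaranteed by the zero-trip test). The decrement itself sets the
    // flags for the exit branch, so the separate compare disappears and
    // the register that held "i" is freed.
    for (int remaining = System.Math.Max(count, 0); remaining != 0; remaining--)
    {
        executed++;
    }
    return executed;
}
```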

This transformation does not have that many hits (as you may imagine, the IVs of loops are usually used for something else). However, once strength reduction is implemented we expect that this transformation will be significantly more important since strength reduction in many cases is going to remove all uses of an IV except the mutation and the loop test.

The reasoning about trip counts is itself also needed by strength reduction, which uses it to prove the absence of overflow in various cases.

TP regressions are going to be pretty large for this change:

  • This enables DFS tree/loop finding in IV opts phase outside win-x64, which has cost around 0.4% TP on its own
  • This optimization furthermore requires us to build dominators, which comes with its own TP cost

Long term we could remove these costs if we could avoid changing control flow in assertion prop and move RBO to the end of the opts loop (letting all control flow changes happen there). But for now I think we just have to pay some of the costs to allow us to do these optimizations.

Example:

```csharp
private static int Foo(int[] arr, int start, int count)
{
    int sum = 0;
    for (int i = 0; i < count; i++)
    {
        sum += arr[start];
        start++;
    }

    return sum;
}
```
```diff
@@ -1,19 +1,17 @@
 G_M42127_IG02:  ;; offset=0x0004
        xor      eax, eax
-       xor      r10d, r10d
        test     r8d, r8d
        jle      SHORT G_M42127_IG05
-						;; size=10 bbWeight=1 PerfScore 1.75
-G_M42127_IG03:  ;; offset=0x000E
-       mov      r9d, dword ptr [rcx+0x08]
+						;; size=7 bbWeight=1 PerfScore 1.50
+G_M42127_IG03:  ;; offset=0x000B
+       mov      r10d, dword ptr [rcx+0x08]
        mov      edx, edx
 						;; size=6 bbWeight=0.25 PerfScore 0.56
-G_M42127_IG04:  ;; offset=0x0014
+G_M42127_IG04:  ;; offset=0x0011
        cmp      edx, dword ptr [rcx+0x08]
        jae      SHORT G_M42127_IG06
        add      eax, dword ptr [rcx+4*rdx+0x10]
        inc      edx
-       inc      r10d
-       cmp      r10d, r8d
-       jl       SHORT G_M42127_IG04
-						;; size=19 bbWeight=4 PerfScore 35.00
+       dec      r8d
+       jne      SHORT G_M42127_IG04
+						;; size=16 bbWeight=4 PerfScore 34.00
```

Fix #100915

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 15, 2024
@dotnet dotnet deleted a comment from azure-pipelines bot May 16, 2024
@dotnet dotnet deleted a comment from azure-pipelines bot May 16, 2024
@dotnet dotnet deleted a comment from azure-pipelines bot May 16, 2024
@jakobbotsch (Member Author) commented:

/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress


Azure Pipelines successfully started running 2 pipeline(s).

Comment on lines +1024 to +1025
newValue = FoldBinop<uint32_t>(binop->Oper, static_cast<uint32_t>(cns1->Value),
                               static_cast<uint32_t>(cns2->Value));
Member Author:

Changed this since signed overflow is undefined behavior while we want to have wraparound behavior.
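
For reference, the wraparound semantics the folding should produce, expressed as a C# sketch (the real implementation is the C++ above):

```csharp
// Fold an addition of two 32-bit constants with two's-complement wraparound.
// Going through uint mirrors the C++ change: unsigned arithmetic wraps
// modulo 2^32 by definition, whereas signed overflow is undefined behavior in C++.
static int FoldAddWrapping(int a, int b)
{
    return unchecked((int)((uint)a + (uint)b));
}

// FoldAddWrapping(int.MaxValue, 1) == int.MinValue
```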

src/coreclr/jit/scev.cpp (outdated review thread, resolved)
@jakobbotsch (Member Author) commented:

/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress


Azure Pipelines successfully started running 2 pipeline(s).

@jakobbotsch (Member Author) commented:

/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress


Azure Pipelines successfully started running 2 pipeline(s).

@jakobbotsch jakobbotsch marked this pull request as ready for review May 21, 2024 13:12
@jakobbotsch (Member Author) commented:

cc @dotnet/jit-contrib PTAL @EgorBo @AndyAyersMS (when you are back)

Diffs. Actually more hits than I expected. Somewhat large-ish TP regressions since this now computes the DFS tree, loops, and possibly dominators in the IV opts phase for all targets.

I don't expect many actual perf improvements from a transformation like this, but it has the knock-on effect of often freeing up a register inside the loop, which is especially important on x64. In many cases this transformation is going to mean that strength reduction does not increase register pressure when it kicks in.

I collected some metrics for this transformation and the reasons we give up on it. They are ordered from the checks we do earliest to latest, but note that in many cases resolving an earlier reason would just cause us to give up due to a later one. The stats are from libraries_tests.run.windows.x64.Release.mch.

  • Loop has multiple exits: 75480
  • Loop exit is not BBJ_COND: 14
  • Exit test node has side effects: 13318
  • Exit condition already has a 0 operand: 7999
  • Loop has no removable locals when doing the transformation: 27965
  • Exit does not dominate all backedges: 5
  • Could not compute trip count: 526
  • Could not materialize IR for trip count: 0
  • Final number of loops transformed: 281

The locals we try to remove are the ones that have phis in the header. We consider them removable if their only uses are a "self update" and the exit test (a small sketch follows the list below). More specifically, the reasons why we do not consider locals removable break down as:

  • Local has a non-store/non-test use: 13768
  • Local is used as the source of a store to a different local: 12343
  • Self store data node involves a side effect (which the local potentially could feed into): 132
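
To make "removable" concrete, a hand-written sketch (hypothetical example, not taken from the stats above):

```csharp
// "i" is removable: besides its self-update (i++), it is only used in the
// exit test, so the down-counting rewrite can delete it entirely.
static int RemovableIv(int count)
{
    int acc = 0;
    for (int i = 0; i < count; i++)
    {
        acc += 3;
    }
    return acc;
}

// "i" is NOT removable: the indexing "arr[i]" is a non-store/non-test use
// (the first bucket above), so rewriting the loop would not free it.
static int NonRemovableIv(int[] arr)
{
    int acc = 0;
    for (int i = 0; i < arr.Length; i++)
    {
        acc += arr[i];
    }
    return acc;
}
```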

We should be able to handle loops with multiple exits, and we could also handle cases where the loop test has side effects by extracting the side effects (and validating that the local does not feed into them).

We only manage to compute the trip count for 281/807 loops that pass all the earlier checks. I would expect most of the remaining loops to have computable trip counts with some more work on the symbolic reasoning. So this might be something good to follow up on.

@jakobbotsch (Member Author) commented:

> We only manage to compute the trip count for 281/807 loops that pass all the earlier checks. I would expect most of the remaining loops to have computable trip counts with some more work on the symbolic reasoning. So this might be something good to follow up on.

Just for completeness, I took a look at the reasons we fail to compute the trip count:

  • Unsupported exit condition (not GT_LT, GT_LE, GT_GT or GT_GE): 2
  • Operands not integral: 0
  • Could not analyze operands into SCEV: 167
  • Operands are GC types: 0
  • Operands have no add recurrence: 0
  • Operands are both variant or both invariant: 0
  • Trip count may overflow before the loop exits: 1
  • Could not prove that the analyzed lower bound is <= the analyzed upper bound: 356
  • The step is not 1 or -1: 0

The second-to-last property (that the analyzed lower bound is <= the analyzed upper bound) is normally ensured by loop inversion introducing a zero-trip test, by existing zero-trip tests in the code, or by standard invariants (like the non-negativity of array/span lengths). I would expect almost all of these to be provable with some stronger inference logic.
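
For example, the shape that usually makes this provable (illustrative sketch):

```csharp
static int ZeroTripTestShape(int[] arr)
{
    int sum = 0;
    int i = 0;
    // Loop inversion rewrites "for (i = 0; i < arr.Length; i++)" into a
    // do-while guarded by a zero-trip test. Inside the loop, the dominating
    // "arr.Length > 0" check (together with array lengths being non-negative)
    // is what lets RBO prove that the analyzed lower bound (0) is <= the
    // analyzed upper bound (arr.Length).
    if (arr.Length > 0)
    {
        do
        {
            sum += arr[i];
            i++;
        } while (i < arr.Length);
    }
    return sum;
}
```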

@jakobbotsch (Member Author) commented:

@AndyAyersMS @EgorBo Can you please take a look?

@AndyAyersMS (Member) left a comment

Overall this looks good. Left a few comments that you can address subsequently.


if (visitResult == BasicBlockVisit::Abort)
{
// Live into an exit.
Member:

Maybe a todo here? We should be able to eliminate these uses by materializing the final value.

Member Author:

Makes sense. I guess it requires some analysis to figure out if the values required to compute the final IV are available in the exit.
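
Roughly this shape (hypothetical example):

```csharp
static int UseAfterLoop(int count)
{
    int sum = 0;
    int i = 0;
    for (; i < count; i++)
    {
        sum += 2;
    }
    // "i" is live into the exit here, which currently blocks the transformation.
    // Its final value is max(count, 0) (just "count" under a zero-trip test),
    // so this use could be rewritten in terms of "count", provided "count"
    // is still available at the exit.
    return sum + i;
}
```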

}

GenTree* rootNode = stmt->GetRootNode();
if (!rootNode->OperIsLocalStore())
Member:

If there are other uses does it make sense to consider rewriting them in terms of the new down-counting IV?

Member Author:

I'm not sure... seems like at that point we're just moving the costs around. Maybe if some of the uses are on very rarely executed paths.
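
For the record, that rewrite would look something like this at source level (hypothetical sketch):

```csharp
// "i" has a body use in addition to the exit test.
static long SumIndicesUp(int count)
{
    long sum = 0;
    for (int i = 0; i < count; i++)
    {
        sum += i;
    }
    return sum;
}

// Down-counted form with the remaining use rewritten as "count - remaining":
// the inc/cmp pair is gone, but a subtraction shows up in the body instead,
// which is why it mostly just moves the cost around.
static long SumIndicesDown(int count)
{
    long sum = 0;
    for (int remaining = System.Math.Max(count, 0); remaining != 0; remaining--)
    {
        sum += count - remaining;
    }
    return sum;
}
```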


Scev* steppedVal = NewBinop(ScevOper::Add, rhs, step);
steppedVal = Simplify(steppedVal);
ValueNum steppedValVN = MaterializeVN(steppedVal);
Member:

It seems a little roundabout to go from SCEV -> IR -> VN, though I think it's fine as long as we're not doing too much speculative IR creation this way.

I think you've talked about integrating scev and vn more closely?

Member Author:

MaterializeVN does not create any IR, it directly creates the VN from the SCEV. Materialize will always create both IR and VN (to be able to attach the VN to the tree). I wasn't totally sure whether or not that's the behavior we want, but we don't really materialize a lot of IR so it seemed like it wouldn't be expensive, and it's nice to try to keep the invariant that IR has proper VNs at this point.

It would be nice to express SCEVs directly using VNs. The main problem to solve there is that VNs are reduced "too much", making it non-trivial to materialize them into IR. OTOH SCEVs have the property that they can always be materialized in a straightforward way in the preheader. Well, at least that's my current belief.

Member:

Thanks. Looking again it seems clear that VNs are computed first, then (optionally) IR, then the two are attached. Not sure what I thought I saw the first time.

// The SCEV returned here is equal to the trip count when the exiting block
// dominates all backedges and when it is the only exit of the loop.
//
// The trip count of the loop is defined as the number of times the header
Member:

Nit -- I think the usual definition is the number of times the header executes, so maybe consider calling this the backedge count?

Member Author:

Ah, good point. Let me fix that.
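
A concrete way to see the off-by-one (hypothetical example):

```csharp
// For n == 3: the header/body executes 3 times (trip count 3), but the
// backedge at the bottom is taken only 2 times (backedge count 2).
static int HeaderExecutions(int n)
{
    int headerExecutions = 0;
    int i = n;          // assume n > 0, e.g. guaranteed by a zero-trip test
    do
    {
        headerExecutions++;
        i--;
    } while (i != 0);   // the backedge is taken whenever this is true
    return headerExecutions;
}
```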

@jakobbotsch jakobbotsch merged commit 1ded19e into dotnet:main May 30, 2024
109 of 113 checks passed
@jakobbotsch jakobbotsch deleted the trip-count branch May 30, 2024 09:01
Ruihan-Yin pushed a commit to Ruihan-Yin/runtime that referenced this pull request May 30, 2024: JIT: Add reasoning about loop trip counts and optimize counted loops into downwards counted loops (dotnet#102261)
@github-actions github-actions bot locked and limited conversation to collaborators Jun 30, 2024
Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI)
Linked issue: JIT: Optimize counted loops by reversing them (#100915)