Reduction #105

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 7 commits into
base: master
Choose a base branch
from
Draft
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
226 changes: 226 additions & 0 deletions text/XXX-reduction.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,226 @@
# Reduction Strategies

## Summary Table

| Strategy | Performance | Normalization | Arguments | Reduction Flags | Refolding | Effect on Qed |
|---------------|----------------|---------------|---------------------------|--------------------------|------------------|----------------------------------------------------------------------------------------|
| vm_compute | ✅✅✅ | Full | ❌ | ❌ | ❌ | ✅✅✅ Bespoke Cast honored by Qed |
| lazy | ✅✅ | Full / Head | ❌ | ✅✅ | ❌ | ✅✅ Default cast does not store reduction flags |
| cbv,compute | ✅ | Full / Head | ❌ | ❌ | ❌ | ✅ Default cast coincides but will use kernel reduction/conversion instead of cbv |
| hnf | ❔ | Head | ❔ | ❌ | ✅ | ❔ |
| simpl | ⛔⛔ | Full / Head | ✅/⛔ | ⛔ (only head) | ✅ | ⛔ Uses default cast |
| cbn | ⛔⛔⛔ | Full / Head | ✅✅ | ✅✅ | ✅✅ | ⛔ Uses default cast |
| **Proposals** |
| [kred](#kred) | ✅ | Full / Head | ✅✅ | ✅✅ | ✅ | ⛔ Uses default cast |
| [MSP](#msp) | ✅✅ (= lazy) | Any (quoting) | ❌ (not necessary) | ❌ (not necessary) | ❌ | ✅✅✅ Fully integrated into kernel reduction/conversion |

## Details

* `vm_compute`
* Fully normalizes Rocq terms
* Translates Rocq terms to OCaml bytecode for efficiency
* Generally faster than all strategies listed below
* Toplevel definitions have their weakly reduced values cached
* Translation does have a cost that is offset by precompiling toplevel definitions
* Does not support `Arguments`
* Does not support any reduction flags (not even `head`)
* Does not support fixpoint refolding
* Has a dedicated cast AST node that is honored by `Qed` (illustrated after this list)
* `lazy`
* Kernel reduction used in typechecking / conversion
* The most efficient Rocq reduction engine written in OCaml
* Full normalization by default
* Does not support `Arguments`
* Supports all reduction flags (illustrated after this list)
* Does not support fixpoint refolding ([\[Experiment\] Hack a global fixpoint expansion in conversion. by ppedrot · Pull Request #18672 · coq/coq](https://github.com/coq/coq/pull/18672))
* Largely coincides with the default cast, except that reduction flags are
not recorded in the cast.
* `cbv`, `compute`
* ???
* Eager reduction
* Does not support `Arguments`
* Supports all reduction flags
* Does not support fixpoint refolding
* Default cast will coincide in terms of the result but will use kernel
reduction/conversion to get there.
* `hnf`
* Head normal form
* Does not support `Arguments` (❔)
* Does not support reduction flags (❔)
* Supports fixpoint refolding
* `simpl`
* Rocq’s premier simplification engine
* Does not unfold aliases unless they uncover reduction steps
* Performance is mixed
* Less efficient than `lazy` across the board but usually better than `cbn`
* Issues with primitive projections ([Primitive projections introduce slowdown in cbn and simpl · Issue #15720 · coq/coq](https://github.com/coq/coq/issues/15720))
* Issues with nested fixpoints ([simpl is exponential in the number of nested fixpoints · Issue #13772 · coq/coq](https://github.com/coq/coq/issues/13772))
* Issues even without nested fixpoints or primitive projections ([simpl scales exponentially even without nested fixpoints · Issue #16397 · coq/coq](https://github.com/coq/coq/issues/16397))
* Partial support for `Arguments`
* `simpl` will ignore `Arguments : simpl never` in `match <HERE> with` (illustrated after this list)
* Only supports `head` reduction flag
* Supports fixpoint refolding
* No dedicated cast, relies on default cast and thus kernel
reduction/conversion at `Qed` time which could lead to performance
issues
* `cbn`
* An alternative to `simpl`
* Eagerly reduces aliases even when they uncover no reduction steps
* Performance is bad
* Slower than any other reduction mechanism, often by orders of magnitude
* Similar issues as with `simpl`, just worse
* Full support for `Arguments`
* Supports fixpoint refolding
* Also refolds aliases (❔)
* No dedicated cast, relies on default cast and thus kernel
reduction/conversion at `Qed` time which could lead to performance
issues
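
For illustration, a minimal sketch of the dedicated `vm_compute` cast mentioned
above (plain Rocq, no extra assumptions): the tactic records its goal change as
a VM cast, so `Qed` replays the reduction in the VM instead of re-doing it with
kernel conversion.

```coq
(* [vm_compute] rewrites the goal and records the change as a dedicated
   VM cast in the proof term, so [Qed] rechecks it with the VM. *)
Goal 50 * 50 = 2500.
Proof.
  vm_compute.   (* goal becomes [2500 = 2500] *)
  reflexivity.
Qed.
```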
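
A sketch of `lazy`'s reduction flags, here a delta whitelist (standard tactic
syntax):

```coq
(* [lazy [qualid]] keeps beta/iota/zeta but restricts delta to the listed
   constants, so only [Nat.add] is unfolded here. *)
Goal 1 + 1 = 2.
Proof.
  lazy [Nat.add].   (* goal becomes [2 = 2] *)
  reflexivity.
Qed.
```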
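
And a standalone sketch of the `simpl never` behavior referenced under `simpl`;
the `match <HERE> with` caveat above is about scrutinee positions where this
protection is reportedly not honored. (`cbn` honors the same attribute, which is
why it is the safe choice in the "make goal look nice" section below.)

```coq
(* [Arguments ... : simpl never] keeps [simpl] from reducing applications
   of [Nat.add].  Standalone example; the attribute persists in a file. *)
Arguments Nat.add : simpl never.

Goal 1 + 1 = 2.
Proof.
  simpl.         (* goal stays [1 + 1 = 2] because of [simpl never] *)
  reflexivity.   (* kernel conversion still closes the goal *)
Qed.
```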


# Problematic Reduction Examples

## <a id="reif"></a> Reification with User Terms
Iris Proof Mode and BlueRock’s automation use reflection and computation to
perform changes to the shallowly embedded Iris goal.

### Requirements

* Must only reduce computation introduced by the automation tactics
* Can be addressed with whitelist reduction flags, **and**
* by creating fresh copies of all functions used in the tactics.
* Both solutions are incredibly tedious and error-prone (see the sketch below)
* Needs a reasonably performant strategy (lazy or better)
* Reduction should ideally be captured by a bespoke cast so that it can be
replayed or coincide with kernel reduction for Qed performance
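
To make the whitelist approach concrete, here is a minimal, self-contained
sketch. `expr` and `denote` are hypothetical stand-ins for an automation's
reified syntax and its interpreter; they are not part of Iris or BlueRock's
code.

```coq
(* Hypothetical reified syntax for list expressions and its interpreter. *)
Inductive expr (A : Type) : Type :=
  | Atom (l : list A)         (* an opaque user term *)
  | App (e1 e2 : expr A).     (* computation introduced by the automation *)
Arguments Atom {A} l.
Arguments App {A} e1 e2.

Fixpoint denote {A} (e : expr A) : list A :=
  match e with
  | Atom l => l
  | App e1 e2 => denote e1 ++ denote e2
  end.

Goal forall P Q : list nat, denote (App (Atom P) (Atom Q)) = P ++ Q.
Proof.
  intros P Q.
  (* Whitelist reduction: unfold only the interpreter, never user terms. *)
  cbv [denote].
  reflexivity.
Qed.
```

The tedium mentioned above comes from maintaining such whitelists: every helper
that should compute must be listed, or freshly duplicated so that user
occurrences of the same function are not touched.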

## <a id="rep"></a> Avoiding Term Repetition in Proof Terms

BlueRock's automation avoids term repetition in the proof term by accumulating
intermediate results inside the context of an ever-growing function type. The
arguments of the function are newly injected terms that cannot be computed from
existing values that have been passed to the function.

This approach works wonders for term size but Qed performance depends entirely
on the ability to force terms across the forall boundary of this function type
of dynamic arity. For example, in the following type, `1 + 1` is reduced twice:
once in checking the type of `f` and once in checking the return type.

```coq
let n := 1 + 1 in
forall (f : Fin.t n), Fin.t n
```

Unfortunately, not all values can be forced, and types such as telescopes
will produce terms of cubic size without any way to share their reduction.

### Requirements

* The solution must integrate into Qed, i.e. the kernel.
* It must support whitelists or otherwise allow specific selection of terms to
be reduced (see “Reification with User Terms” above)
* This eliminates vm_compute


## <a id="nice"></a> General “make goal look nice” Strategy

Any kind of longer-running automation that handles user terms probably needs to
integrate a generic simplification strategy to perform the same kind of
“normalization” that users rely on.

`simpl` and `cbn` are the obvious candidates, but only `cbn` is safe to call
on arbitrary terms in arbitrary contexts without violating `simpl never`.
Unfortunately, this eliminates the faster of the two strategies and leaves us
with occasional catastrophic slowdowns in `cbn`.

### Requirements

* Must honor `simpl never`
* Must be quite fast so that it can be called after every goal change
* Must support `Arguments`
* Must support fixpoint refolding (as goals will otherwise become unreadable)
* Should be somewhat principled so that users can predict its results
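
As a small illustration of the refolding requirement (plain Rocq): `simpl`
refolds the fixpoint behind `+`, keeping the goal readable, whereas `lazy` or
`cbv` would expose the underlying `fix` term.

```coq
Goal forall n m : nat, S n + m = S (n + m).
Proof.
  intros n m.
  (* [simpl] reduces [S n + m] one step and refolds the recursive call, so
     the goal shows [S (n + m)]; [lazy] would leave a raw [fix] here. *)
  simpl.
  reflexivity.
Qed.
```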

# Proposed Reduction Strategies

## <a id="kred"></a> [kred: A faster, somewhat compatible cbn alternative](https://github.com/coq/coq/pull/19309)

`kred` is a half-finished re-implementation of `cbn` based on the kernel’s
reduction machine. It is essentially `lazy` with support for `Arguments` and
fixpoint refolding. It tries to be as faithful as possible to `cbn` but will
sometimes deviate, especially when fixpoint refolding opportunities are too
expensive to find.

Another way to look at it: `kred` is a slightly less ambitious and slightly
less aggressively refolding version of `cbn`, with the benefit of delayed
substitution.

### Advantages over cbn

`kred` can be arbitrarily faster than `cbn`, and its performance will often
coincide with `lazy` when terms can be fully reduced. It is always faster than
`cbn`, even when terms are not fully reduced.

These properties make `kred` a promising candidate for [a General “make goal look nice” Strategy](#nice).

### Disadvantages

As mentioned, `kred` is not as aggressive about refolding and will sometimes
generate slightly different goals.

`kred` is unfinished. The current code has at least one bug that introduces
incorrect de Bruijn indices in some specific cases.

`kred.ml` is basically a copy of `cClosure.ml`. Its refolding mechanism cannot
be integrated into the kernel because it is simply not sound
(https://github.com/coq/coq/pull/18672). Handling `Arguments` requires
substantial changes that are unlikely to be ported to `cClosure.ml`. So `kred`
essentially doubles the amount of kernel code, with the added complexity that
the copy is not verbatim.


## <a id="msp"></a> [MSP: Multi-Stage Programming](https://github.com/coq/coq/pull/17839)

The MSP prototype provides primitives (currently implemented using actual
“primitives” but a next version will use dedicated AST nodes) that freeze or
force reduction of terms. Those primitives can be nested to perform fine-grained
reduction of some but not other terms.

The very basic example below force-appends two lists containing user terms
without touching those user terms, which are guarded by the `block` primitive.

```coq
run ([block P; block Q] ++ [block R; block S])
```

Blocked terms may contain unblocked sub-terms, which will be reduced up to any
blocked sub-sub-terms, and so on.

`run` is what triggers MSP to work. `block` and `unblock` have no meaning
outside of `run`.
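
A very rough typing model for the primitives, borrowed from the "simplified
view" in the discussion below. `Blocked`, `mkBlock`, and `get` are illustrative
names only: the real primitives are kernel-level, a plain record does not
actually stop reduction, and `run`'s staging behavior has no counterpart here.

```coq
(* Naive typing model only: [Blocked T] as a box around [T].  Unlike the
   real primitive, this record does nothing to prevent reduction. *)
Record Blocked (T : Type) : Type := mkBlock { get : T }.
Arguments mkBlock {T} _.
Arguments get {T} _.

Definition block {T} (t : T) : Blocked T := mkBlock t.
Definition unblock {T} (b : Blocked T) : T := get b.
```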

MSP addresses the use cases [Reification with User Terms](#reif) and [Avoiding Term
Repetition in Proof Terms](#rep) completely and drastically improves on all existing
reduction strategies in the precision it offers for reducing specific terms.

### Advantages

The current MSP prototype has already been tested on examples representing
these use cases, and it is able to overcome existing shortfalls both in
complexity of use and in performance.

### Disadvantages

MSP must live in the kernel for it to be useful at Qed time. It might be
possible to make it opt-in but that hardly changes anything.

---

**Review discussion**

**Reviewer** (suggested change):

> It is incompatible with `vm_compute` and `native_compute`.

**Author:**

Thanks for looking at the draft!

I just remembered and re-read your comment in rocq-prover/rocq#13881 (comment). I suppose there really is no way to return unreduced terms from vm_compute, is there?

In any case, I'll probably try to expand on "incompatible" a little bit. It seems to me that at least some vm casts could be preserved in terms that use MSP with enough care.

**Reviewer:**

Currently no, but that does not mean that it is that far-fetched.

First, the VM is already able to manipulate and return axioms, existential variables, and free variables (which are fully normalized values, but from an operational point of view, they are fully unreduced terms). So, extending accumulators to support arbitrary Gallina terms is trivial. One has to be a bit careful with open terms so that bound values are properly kept around. But, accumulators are already able to accumulate information (hence their name), so no major change here either.

Second, if the reduction is only delayed rather than just fully blocked, one has to make sure that the bytecode is still around to restart the reduction. But again, the VM already supports similar patterns, e.g., for the return clause of match constructs. The kernel even has some mechanisms to ensure that Gallina terms and their corresponding bytecode stay synchronized, because this is needed when caching weakly-reduced toplevel definitions (though, this fragile piece of code is probably not exercised as much as it should be).

The hardest part will be to make sure that the Gallina term in the accumulator and the corresponding bytecode agree on what the bound variables are. Currently, there is absolutely no similar mechanism in the bytecode compiler. To give a concrete example, consider

```coq
Definition foo x := let y := x in block (x + y).
```

The blocked Gallina term has two bound variables (`x` and `y`), while the corresponding bytecode has only one bound value `x` (`y` was optimized away by the compiler). Restarting the reduction by pushing two values on the VM stack would quickly lead to memory corruption.

**Author:**

Thank you very much for the detailed explanation. Are there already plans/ideas for handling the binder discrepancies?

Somewhat orthogonal: It just occurred to me while trying to explain this to coworkers that both vm_compute and native_compute are currently trivially compatible with MSP. MSP must not (and does not) change the normal forms of terms and both vm_compute and native_compute perform full normalization. The only difference from MSP is in the result of weak head reduction (and even there it only changes the shape of arguments to the head constructor/constant/etc, not to the head itself). I should expand on this a bit in this document to avoid confusion.

So until vm_compute/native_compute support weak-head reduction, I don't think they are affected at all. All we need is to treat the primitives as boxing/unboxing operators of some box type.

**Reviewer:**

> Are there already plans/ideas for handling the binder discrepancies?

No. In fact, last time I suggested that the bytecode compiler should handle let-in constructs in a slightly more predictable way, the idea did not live long: rocq-prover/rocq#13234

**Reviewer:**

> MSP must not (and does not) change the normal forms of terms and both vm_compute and native_compute perform full normalization

If this is the intention, then support in `vm_compute` is Very Easy™ (I don't remember how native compute works, but I don't see why it would be a problem). You simply reserve one extra tag, and otherwise compile `block t` like `fun x => t` (better: compile it as a 0-argument closure), so that `block t` is a weak-head normal form. When it comes to reconstruction, force the closure and evaluate `t` (which is exactly what you do with functions). This should let you control the order of reduction as you need (I say “should” because your document doesn't really specify the semantics of block/unblock/run (it should!)).

You need to use a different tag for `block t` than for `fun x => t` because it's going to behave differently (e.g. the normal form of `fun x => t` is `fun x => t'`, but if I understand correctly, the normal form of `block t` is `t'` (in both cases I take `t'` to be the normal form of `t`)).

Supporting weak-head reduction in the VM is essentially the same problem as @silene describes above. Normalisation-by-evaluation really is a surprising sweet spot: you simplify the problem a lot when you're only asking to reconstruct normal-form terms.

**Author:**

> (I say “should” because your document doesn't really specify the semantics of block/unblock/run (it should!)).

Sorry for that. I actually opened this PR way too early by accident, and that's why the document is still in pretty rudimentary shape. The actual idea was to pass around a draft (in the form of a PR against a fork of this repo, not this repo directly) in order to collect additional challenging use cases for reduction before we (possibly) resume work on discussing/specifying/implementing MSP (and kred). But I somehow ended up targeting the wrong repo, and now I am getting super useful feedback here, so I really want to keep this PR open.

I'll update the document with some more details on MSP next week. For now, the best summary is the old prototype itself. The PR description is probably slightly outdated, but the code was very close to the desired semantics. The primitives are in theories/Force/Force.v, and the tests in test-suite/success/force.v should help illustrate the intended spec.

Your description of supporting MSP in the absence of weak-head reduction sounds correct to me, except that `block t` simply reduces to `block t'`. `@block T t` has type `Blocked T`, which can be thought of as a record containing a `T`. Both `unblock` and `run` function as projections in this simplified view. I don't remember the exact reason, but `Blocked T` cannot be convertible to `T`. It probably involves canonicity.

> This should let you control the order of reduction as you need

I really wonder if this could end up helping simplify the discussion around bytecode compilation/optimization. I see that rocq-prover/rocq#13234 contains strong arguments for pattern matching being the only viable solution for predictable reduction performance and that also matches my personal experience (although I can almost never use vm_compute so I don't know much about the performance pitfalls there). MSP would indeed be one possible way to specify the order of reduction even in vm_compute. If that could help simplify the bytecode compilation by making some "optimizations" unnecessary, maybe it could also make implementing weak-head reduction easier in the long run by keeping the bytecode very close to the original terms. (Weak-head reduction in vm_compute would be incredibly useful independently of anything discussed here.)

**Reviewer:**

> MSP would indeed be one possible way to specify the order of reduction even in vm_compute. If that could help simplify the bytecode compilation by making some "optimizations" unnecessary
MSP does not make the above optimization unnecessary. (The only purpose of this optimization is to allow the garbage collector to free memory earlier in some very rare cases. This optimization is not even part of OCaml's bytecode compiler, for good reasons. It is specific to Coq.) And except for peephole ones, there is no other optimization in the bytecode compiler anyway. So, MSP and bytecode optimizations are two completely separate issues.

---

Automation-relevant theory must be rewritten to incorporate `block`, `unblock`,
and `run` in all the necessary places. This cannot be avoided with this
approach.

The current prototype introduces a sizeable performance regression for
developments that do not use the feature (+7% or so in developments that stress
the kernel) because of the way it has to find applications of the new primitives
before they are (mis-)handled in the usual way by the rest of the machine.

The next iteration will use dedicated `constr` nodes, which is a far more
invasive change but should allow us to avoid any performance impact on
developments that do not use the feature.