Idea: convenient caching in LNode #72

qwertie · 2018-12-07T02:56:59Z

qwertie
Dec 7, 2018
Maintainer

A Loyc tree node (LNode) typically represents some specific concept, such as a data type name or a class definition. I've been thinking for a long time that we need some convenient way to obtain a "wrapper" around an LNode that would make it act like some specific type. Now I suspect it's worthwhile to go a step further and provide caching functionality.

Here are two specific things that people might want to do:

One: validate a node as being of a specific type and get a wrapper that provides convenient access to it. For example, I could define a structure like this to encapsulate a class definition:

    struct TypeNode {
      public LNode Node { get; private set; }
      public TypeNode(LNode n) { Node = n; }
      public LNode Name { get { return Node.Args[0]; } }
      public VList<LNode> BaseTypes { get { return Node.Args[1].Args; } }
      public LNode Body { get { return Node.Args[2]; } }
      public Symbol Kind { get { return Node.Name; } }
      public bool IsValid { get { return EcsValidators.SpaceDefinitionKind(Node) != null; } }
      public static explicit operator TypeNode(LNode n) {
        var r = new TypeNode(n);
        if (!r.IsValid) throw new ArgumentException();
        return r;
      }
    }

Okay, you can already do this today, and you could use LeMP to generate structures for many node types quickly. However, perhaps in some scenarios you would convert the same Loyc tree to a wrapper many times, requiring validation each time.

Two: compute something from an LNode. We could add caching functionality of some sort to Loyc trees for this. Since nodes are immutable, the idea would be to associate immutable data with the nodes.

My idea is to add some kind of IDictionary<Identification,object> to each node, with restricted access: the dictionary doesn't exist at first, it is created only when needed, and the values associated with each Identification are computed automatically using some kind of global "registrar" or function table. What is Identification? I'm uncertain; it might just be Symbol. Anyway, if you have an LNode N then N[Id] gets the object associated with Id. If there is no object associated with Id, LNode will look up the helper function h associated with Id in the registrar and call h(N). h returns a value or object that is returned from N[Id], and also cached, so that if you call N[Id] again, you get the same object.

At any time it should be possible to discard all cached info, even recursively, e.g. N.ClearCache().
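To make the idea concrete, here is a rough sketch of the lookup described above. It is only an illustration under assumptions: `CachingNode` is a hypothetical stand-in for LNode, plain strings stand in for Identification, and `Register` is an invented name for adding a helper to the registrar. The per-node cache is not thread-safe in this sketch.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for LNode: an immutable name plus child nodes.
public sealed class CachingNode
{
    // Global "registrar": maps an identification key to its helper function h.
    private static readonly Dictionary<string, Func<CachingNode, object>> Registrar =
        new Dictionary<string, Func<CachingNode, object>>();

    public static void Register(string id, Func<CachingNode, object> helper)
    {
        lock (Registrar) Registrar[id] = helper;
    }

    public string Name { get; private set; }
    public IReadOnlyList<CachingNode> Args { get; private set; }

    // The dictionary does not exist until the first value is cached.
    private Dictionary<string, object> _cache;

    public CachingNode(string name, params CachingNode[] args)
    {
        Name = name;
        Args = args;
    }

    // N[Id]: return the cached object, or compute it with the registered helper.
    public object this[string id]
    {
        get
        {
            object value;
            if (_cache != null && _cache.TryGetValue(id, out value))
                return value;

            Func<CachingNode, object> helper;
            lock (Registrar)
            {
                if (!Registrar.TryGetValue(id, out helper))
                    throw new KeyNotFoundException("No helper registered for " + id);
            }

            value = helper(this);
            if (_cache == null)
                _cache = new Dictionary<string, object>();
            _cache[id] = value;
            return value;
        }
    }

    // Discard cached values, optionally for the whole subtree.
    public void ClearCache(bool recursive = true)
    {
        _cache = null;
        if (recursive)
            foreach (var child in Args)
                child.ClearCache(true);
    }
}
```

After `CachingNode.Register("ArgCount", n => n.Args.Count)`, two consecutive reads of `node["ArgCount"]` return the same boxed object, and the helper runs again only after `node.ClearCache()`.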

So, let's say you're implementing a Java compiler with Loyc trees. You might create a series of Identification objects, one for each kind of node in Java: Node.Package, Node.Expr, Node.ForLoop etc. Then you can call N[Node.ForLoop] to obtain the "for loop" representation of the node.

Ahh, but it's an object - no type information. Can we do something about that? Well, I don't think it's possible to avoid a typecast, but maybe we can make the cast more convenient. First of all, I can't recall at this moment whether node<ForLoop>[Node.ForLoop] is legal C# syntax, but if so, we can allow that. But it would be better if we could allow node<ForLoop>[], or if that's not legal syntax, node.Get<ForLoop>() (yuck). One way to do this would be to auto-generate an Identification object for each type as it is requested. Then, LNode could use reflection to find a static ForLoop.From(LNode) function which would be used to obtain the ForLoop object. However, it would be nice to avoid reflection because sometimes .NET AOT compilation doesn't support it.
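One possible way to get `node.Get<ForLoop>()` without a reflection-based search for `From` is to keep one static factory slot per view type and fill it in explicitly. The sketch below assumes that approach; `TypedNode`, `ViewFactory<T>` and `ForLoopView` are hypothetical names, not part of Loyc.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for LNode.
public sealed class TypedNode
{
    public string Name { get; private set; }
    public TypedNode(string name) { Name = name; }

    private Dictionary<Type, object> _views; // allocated on first use

    // node.Get<T>(): the cast still happens, but it is hidden in one place.
    public T Get<T>() where T : class
    {
        object view;
        if (_views != null && _views.TryGetValue(typeof(T), out view))
            return (T)view;

        Func<TypedNode, T> factory = ViewFactory<T>.Create;
        if (factory == null)
            throw new InvalidOperationException("No factory registered for " + typeof(T).Name);

        T created = factory(this);
        if (_views == null)
            _views = new Dictionary<Type, object>();
        _views[typeof(T)] = created;
        return created;
    }
}

// One static slot per view type, assigned explicitly at startup,
// so no reflection-based lookup of a From method is needed.
public static class ViewFactory<T> where T : class
{
    public static Func<TypedNode, T> Create;
}

// Hypothetical view type playing the role of ForLoop with its From function.
public sealed class ForLoopView
{
    public TypedNode Node { get; private set; }
    private ForLoopView(TypedNode node) { Node = node; }

    public static ForLoopView From(TypedNode node)
    {
        if (node.Name != "for")
            throw new ArgumentException("Not a for loop");
        return new ForLoopView(node);
    }
}
```

Usage would be `ViewFactory<ForLoopView>.Create = ForLoopView.From;` once, then `node.Get<ForLoopView>()` wherever the view is needed. The trade-off is that every view type has to be registered before its first use.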

Then there's the matter of efficiency. Sometimes it is cheap to compute an associated value for a node; other times it is expensive. Sometimes the expense depends on the node (this is an issue for computing hashcodes - hashing an identifier node is trivial, hashing a large class is not.) It would be nice to have some kind of mechanism to avoid caching things that are cheap to compute, particularly if it means avoiding the creation of a relatively expensive dictionary of cached things. So, instead of simply returning a value, the helper function h could return two values: one object to be returned and one integer representing how willing LNode should be to cache the value (typically proportional to how expensive it was to compute the value or validate that the node syntax was correct). The caching strategy could be user-customizable, within reason... changing strategies should not affect nodes that already exist and have caches.
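Here is a minimal sketch of the "value plus willingness-to-cache" idea, assuming a single global threshold as the caching strategy. `Computed`, `CostAwareNode`, `CacheThreshold` and `Helpers.CountNodes` are all hypothetical names. Because the threshold is consulted only on a cache miss, changing it later leaves values that are already cached untouched.

```csharp
using System;
using System.Collections.Generic;

// What a helper returns: the value plus a hint about how costly it was to compute.
public struct Computed
{
    public readonly object Value;
    public readonly int Cost;
    public Computed(object value, int cost) { Value = value; Cost = cost; }
}

// Hypothetical stand-in for LNode.
public sealed class CostAwareNode
{
    // Assumed policy knob: results cheaper than this are recomputed on demand.
    public static int CacheThreshold = 10;

    public string Name { get; private set; }
    public IReadOnlyList<CostAwareNode> Args { get; private set; }
    private Dictionary<string, object> _cache; // stays null for cheap nodes

    public CostAwareNode(string name, params CostAwareNode[] args)
    {
        Name = name;
        Args = args;
    }

    public bool HasCache { get { return _cache != null; } }

    public object Get(string id, Func<CostAwareNode, Computed> helper)
    {
        object cached;
        if (_cache != null && _cache.TryGetValue(id, out cached))
            return cached;

        Computed result = helper(this);
        if (result.Cost >= CacheThreshold)
        {
            if (_cache == null)
                _cache = new Dictionary<string, object>();
            _cache[id] = result.Value;
        }
        return result.Value;
    }
}

public static class Helpers
{
    // Counts the nodes in a subtree; the reported cost is the subtree size.
    public static Computed CountNodes(CostAwareNode node)
    {
        int count = 1;
        foreach (var arg in node.Args)
            count += (int)arg.Get("count", CountNodes);
        return new Computed(count, count);
    }
}
```

With this policy a leaf node never allocates a dictionary, while a node with a dozen children caches its count on first request.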

The other element of efficiency is memory usage. A likely optimization is to avoid allocating an entire dictionary when there is only one cached value. In any case, this feature would require adding 1 or 2 words (references) to every LNode.
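The single-value optimization could look roughly like the following, assuming one reference slot that holds nothing, a single entry, or a full dictionary. In the actual proposal the slot would be a field of LNode itself; `CompactCache` is a separate hypothetical class here only to keep the sketch self-contained.

```csharp
using System;
using System.Collections.Generic;

public sealed class CompactCache
{
    private sealed class Entry
    {
        public string Id;
        public object Value;
    }

    // null, a single Entry, or a Dictionary<string, object>.
    private object _slot;

    public bool UsesDictionary
    {
        get { return _slot is Dictionary<string, object>; }
    }

    public bool TryGet(string id, out object value)
    {
        var single = _slot as Entry;
        if (single != null)
        {
            if (single.Id == id) { value = single.Value; return true; }
            value = null;
            return false;
        }
        var dict = _slot as Dictionary<string, object>;
        if (dict != null)
            return dict.TryGetValue(id, out value);
        value = null;
        return false;
    }

    public void Set(string id, object value)
    {
        if (_slot == null)
        {
            // First cached value: one small object, no dictionary.
            _slot = new Entry { Id = id, Value = value };
            return;
        }
        var single = _slot as Entry;
        if (single != null)
        {
            if (single.Id == id) { single.Value = value; return; }
            // Second distinct key: upgrade to a dictionary.
            _slot = new Dictionary<string, object>
            {
                { single.Id, single.Value },
                { id, value }
            };
            return;
        }
        ((Dictionary<string, object>)_slot)[id] = value;
    }
}
```

This costs one word per node plus a small entry object when exactly one value is cached; a second word could hold the value inline instead, which is the "1 or 2 words" trade-off.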

So @jonathanvdc, does this sound like a useful feature? Is it worth a cost in memory-per-node to support the mere possibility of it?

jonathanvdc · 2018-12-07T14:46:09Z

jonathanvdc
Dec 7, 2018

This sounds like an incredibly useful feature! Your first point seems especially relevant to my situation. Probably my biggest issue with the current state of Loyc is that the rules governing LNode-based data structures are implicit. Whenever I write a non-trivial macro, I inevitably resort to using LeMP's built-in editor and/or LeMP-repl in order to discover how data is stored in the LNode representation of some C# construct. It would be incredibly convenient if all of this information was encoded in type definitions whose correct usage can be enforced by the type checker.

I'm not entirely in agreement with regard to the caching implementation, though. I feel like the primary use case for this feature would be to have a different "view" of an LNode so you can treat it as more rigidly structured data, exactly like your TypeNode example. I agree that such a view might also want to cache some information for efficiency reasons.

But why bake (cached) properties of the view into the LNode? Why not just implement the view as a regular type, as in your TypeNode example, and cache the view instead? That'd have the extra benefit of being type-safe. A ConditionalWeakTable could accommodate this use case quite well.

Here's what that would look like in practice:

    class TypeNode {
      private TypeNode(LNode n) {
        Name = n.Args[0];
        // ...
      }

      public readonly LNode Name;

      // ...

      private static ConditionalWeakTable<LNode, TypeNode> views
        = new ConditionalWeakTable<LNode, TypeNode>();

      public static explicit operator TypeNode(LNode n) {
        TypeNode r;
        if (views.TryGetValue(n, out r)) return r;
        r = new TypeNode(n);
        if (!r.IsValid) throw new ArgumentException();
        views.Add(n, r); // ConditionalWeakTable has no indexer
        return r;
      }
    }

One of the key advantages of this design is that it's very much a "pay for what you use" scheme. Views only cost memory when you actually use them. Also, this design interacts nicely with garbage collection: once you're done using a node, the garbage collector will just collect it as well as its views. The code's not incredibly pretty, but it shouldn't be too hard to write a macro for defining new views.
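A slightly more compact variant, sketched under the assumption that validation can live inside the factory: `ConditionalWeakTable.GetValue` takes a creation callback and only stores the value if the callback returns, so an invalid node never ends up in the table. `SimpleNode` and `ClassView` are hypothetical stand-ins, and the view copies what it needs instead of holding on to the node.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical stand-in for LNode.
public sealed class SimpleNode
{
    public string Name { get; private set; }
    public int ArgCount { get; private set; }
    public SimpleNode(string name, int argCount)
    {
        Name = name;
        ArgCount = argCount;
    }
}

public sealed class ClassView
{
    // The view copies the data it exposes rather than referencing the node.
    public readonly string Kind;
    public readonly int MemberCount;

    private ClassView(SimpleNode n)
    {
        Kind = n.Name;
        MemberCount = n.ArgCount;
    }

    private static readonly ConditionalWeakTable<SimpleNode, ClassView> Views =
        new ConditionalWeakTable<SimpleNode, ClassView>();

    // Runs only on a cache miss; if it throws, nothing is added to the table.
    private static ClassView Create(SimpleNode n)
    {
        if (n.Name != "class")
            throw new ArgumentException("Not a class definition");
        return new ClassView(n);
    }

    public static explicit operator ClassView(SimpleNode n)
    {
        return Views.GetValue(n, Create);
    }
}
```

Casting the same node twice, as in `(ClassView)node`, yields the same view instance, and the table entry goes away once the node itself is collected.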

IMO, the only real downside to this design is that values in a ConditionalWeakTable are strong references, so you can't have a view that refers to its own LNode or a parent LNode. Those'll cause a memory leak. A different data structure could fix that issue, though. The WeakCache I created for my Flame rewrite looks like it fits the bill rather nicely: it stores both keys and values as weak references. A design based on WeakCache would also have the additional advantage of allowing views to be garbage collected even as their corresponding nodes stay live.

This is what a view based on WeakCache would look like in terms of code:

    class TypeNode {
      public LNode Node { get; private set; }
      private TypeNode(LNode n) { Node = n; }
      public LNode Name { get { return Node.Args[0]; } }
      public VList<LNode> BaseTypes { get { return Node.Args[1].Args; } }
      public LNode Body { get { return Node.Args[2]; } }
      public Symbol Kind { get { return Node.Name; } }
      public bool IsValid { get { return EcsValidators.SpaceDefinitionKind(Node) != null; } }

      private static WeakCache<LNode, TypeNode> views
        = new WeakCache<LNode, TypeNode>();

      private static TypeNode CreateView(LNode n) {
        return new TypeNode(n);
      }

      public static explicit operator TypeNode(LNode n) {
        var r = views.Get(n, CreateView);
        if (!r.IsValid) throw new ArgumentException();
        return r;
      }
    }

qwertie · 2018-12-07T16:30:21Z

qwertie
Dec 7, 2018
Maintainer Author

Interesting, I did not know about ConditionalWeakTable. Other than its very strange name and description, I don't see how it is different than the WeakKeyDictionary in Loyc.Essentials. Similarly I don't see how WeakCache is different from a hypothetical WeakDictionary (well, there's that unusual Get() method, but a conventional dictionary could have that.... I'm curious what the deal is with those concurrency domain thingies, but I have no time to think about that.)

The nice thing about your ideas is that they require no changes to LNode at all :)

I guess it's a question of how often people want to cache data. So I'd like to look at the memory cost. Let's say I optimize for up to two cached items. If, say, you want to cache one or two things on 25% of nodes... and LNode is larger by one word... then 3/4 of the words would be wasted, and for each of the remaining 1/4 there would be a special-purpose object to hold two cache items (probably 7 words)... total: 10 words per used cache entry. There is a cast, but no other significant costs.

How does this compare to an independent WeakCache? The WeakCache object itself costs O(1); if WeakCache is similar to a normal Dictionary (no time to check) it'll consume 4 or 5 words per pair plus waste (IIRC 4x64-bit or 5x64-bit). I assume the pairs will waste ~30% of their memory (a typical amount for dynamically sized collections), so that's probably like 2 words per used pair. Then there are the weak references, which are a wildcard in both memory and execution time, as I don't know how they are implemented; I will guess that each weak reference takes 4 words, for a total of 8+6=14 64-bit words ... so it looks more expensive in terms of memory for 25% utilization.

You avoid the cost of a cast, but you gain the cost of a full hashtable lookup (probably with a % division), and the cost of frequently checking whether the weak references are live, and the cost of extra cleaning up after a GC. But at least you only pay for what you use.

I also wonder if there's any valuable feature that would be more than just caching - associating a value permanently with a node, no deletions? Hmm, not sure that that's a good idea.

qwertie · 2018-12-07T16:32:48Z

qwertie
Dec 7, 2018
Maintainer Author

And let's keep ILNode in mind which allows avoiding the use of LNode entirely in some cases. I wrote the LESv3 printer to support printing ILNodes.

jonathanvdc · 2018-12-07T16:53:18Z

jonathanvdc
Dec 7, 2018

Hmmm. I don't fundamentally disagree with your analysis, but isn't it kind of biased in the sense that it takes into account the memory cost of a feature of my suggestion (weak references) that your original idea doesn't have? That's kind of comparing apples to oranges. Surely adding weak references to your solution changes the math?

Permanently associating values with nodes would be... invasive. Especially if the values being associated are semantically relevant. That'd essentially add a third type of semantically relevant storage to LNodes (in addition to attributes and arguments).

qwertie · 2018-12-07T17:19:26Z

qwertie
Dec 7, 2018
Maintainer Author

Oh right, I guess I should count only one weak reference (the keys, e.g. using WeakKeyDictionary) rather than two. Then the two versions would be essentially equivalent, right? Then if 25% of nodes use a single cache entry, the two implementations have roughly the same memory cost.
