Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

What do we do about equality? #16

Closed
jackfirth opened this issue Jul 15, 2019 · 135 comments
Closed

What do we do about equality? #16

jackfirth opened this issue Jul 15, 2019 · 135 comments
Labels
libraries This is for issues related to Rhombus libraries moving on

Comments

@jackfirth
Copy link
Collaborator

jackfirth commented Jul 15, 2019

Racket newcomers are immediately confronted with four equality operators:

  • equal?, which does what you'd expect most of the time
  • =, which does what you'd expect if you're used to IEEE numbers
  • eq?, which looks like "the magic fast pointer comparison / reference equality operator"
  • eqv?, which just looks bizarre

I think we should simplify this in racket2 Rhombus. Here's a hypothetical straw proposal:

  • Keep equal? and =
  • Get rid of eqv?
  • Rename eq? to something like same-object-reference? and tuck it away in the corner of the docs, to make it clearer that you should only use this if you have a very good reason to

Is this feasible? Can we do better?

@AlexKnauth
Copy link
Member

It might be worth considering a notion of equality that recurs through immutable data structures like equal? does, but uses reference equality on mutable data structures.

Same idea as egal? from Rackjure and Equal Rights for Functional Objects

@jeapostrophe
Copy link
Collaborator

There's a syntax part of this, a semantics part, and a library part.

Regarding semantics of eq?, Robby has long advocated the idea that programs should have the same functional correctness if all (eq? x y) is replaced with #f. The idea is that eq? is purely a kind of optimistic optimization where the slow path is required to be present.

@spdegabrielle
Copy link
Member

spdegabrielle commented Jul 15, 2019

regarding eq? would same? be better?
the racket reference uses same to describe eq?;

Return #t if v1 and v2 refer to the same object, #f otherwise.

https://docs.racket-lang.org/reference/booleans.html?q=eq%3F#%28def._%28%28quote._~23~25kernel%29._eq~3f%29%29

@jackfirth
Copy link
Collaborator Author

regarding eq? would same? be better?
the racket reference uses same to describe eq?;

Return #t if v1 and v2 refer to the same object, #f otherwise.

https://docs.racket-lang.org/reference/booleans.html?q=eq%3F#%28def._%28%28quote._~23~25kernel%29._eq~3f%29%29

The name we pick for eq? ought to be longer / more complex than equal?, to nudge people towards using equal? by default. So while I like using the word "same" for this, I think same-object? (or something similar) would be much clearer than same?.

@spdegabrielle
Copy link
Member

+1 for same-object?. Much better than same?.

@sorawee
Copy link
Contributor

sorawee commented Jul 16, 2019

  1. Why don't we make it consistent: use =? instead of =? We do have things like free-identifier=? already anyway (yes, I'm also up for <?, <=?, etc.).
  2. Is there a case where (equal? a b) and (= a b) don't behave the same when a and b are numbers? If not, why don't we also merge them into one. I.e., in general use =? (which is currently equal?). If you want a contracted version, just prefix it: number=?, string=?, etc.
  3. object-equal? or reference-equal? also works for eq?'s new name. No need to invent a new name (same).

@AlexKnauth
Copy link
Member

To answer (2), yes, for (equal? 1 1.0) => #f vs (= 1 1.0) => #t. Exact-integers and floats-that-happen-to-be-integers are different racket values, but are equivalent numbers under =

@LiberalArtist
Copy link

My primary use for eq? is with symbols, especially as keys in eq?-based hash-tables. My impression is that there is a notable performance benefit.

@dedbox
Copy link

dedbox commented Jul 16, 2019

My primary use for eq? is with symbols, especially as keys in eq?-based hash-tables. My impression is that there is a notable performance benefit.

Here, here. eq? (and intern) are in the original Scheme paper, and I'm having a hard time understanding (Jay's account of) Robby's idea.

Is he saying eq? is misused so often that it is effectively useless, or that it is actually useless for some technical reason?

@jeapostrophe
Copy link
Collaborator

eq? interferes with optimizations because it exposes pointer identities, so you cannot, for example, create copies of immutable objects. Similarly, it interferes with providing safety because you can tell the difference between an object and its chaperone.

@gus-massa
Copy link

Something that I think is weird is that an immutable string can be equal? to a mutable string, but an immutable hash can't be equal? to a mutable hash. (I understand the reason at the C implementation, but I think that the difference should not leak to the racket level.)

@jackfirth
Copy link
Collaborator Author

Mutable objects are a strange case. If I have two vectors v and w, it's very important to know whether or not they refer to the same underlying storage: meaning whether or not vector-set! on v changes the result of vector-ref on w. This is different from knowing if v and w are eq?, because a vector may not be eq? to its own chaperone but they both refer to the same storage. Today, equal? on mutable objects just considers whether they have the same contents - not the same identity (which, due to chaperones, is not the same as just being eq?). I think this is very confusing, and that we should change equal? on mutable objects to only return true if they refer to the same mutable storage. This would communicate that the identity of a mutable object is an important part of its semantics.

@LiberalArtist
Copy link

As another point in the design space, Pyret has several notions of equality. In summary:

Name Operator Partial Predicate Total Predicate Similar To
Equal Now =~ equal-now equal-now3 equal? (Racket) == (Python, Ruby)
Always Equal == equal-always equal-always3 = (Ocaml)
Identical <=> identical identical3 eq? (Scheme) == (Ocaml) === (JavaScript) is (Python) == (Java)

IIUC, Racket lacks a general "Always Equal" for mutable values: a test to identify that mutations to one value will also mutate the other. Currently, eq? can only say "yes" or "idk": it is foiled by chaperones and impersonators. The default equal? for opaque structs does this, but sacrifices "Equal Now".

@LiberalArtist
Copy link

Re @jeapostrophe's comment on eq?: I agree that we should avoid exposing pointer identities (at least in the safe/high-level language) and we should allow the implementation to be aggressive as it wants at optimizing storage for immutable objects. I hope we can find a way to do that without degrading performance for things that are now idiomatic and correct uses of eq?, particularly with symbols.

I have a vague notion that the cost of equal? vs. eq? is especially notable in hash tables. With a statically-visible symbol, equal? can just be optimized to eq?, but AIUI it triggers more expensive double hashing in hash tables.

Another use-case is weak eq?-based hash tables to cache expensive computations that might be repeated: if the hash-table lookup becomes more expensive, that might negate the benefit of using the cache in the first place.

But maybe we will come up with an equality test that avoids the problems with eq? but still has good performance.

@rocketnia
Copy link

eq? interferes with optimizations because it exposes pointer identities, so you cannot, for example, create copies of immutable objects. Similarly, it interferes with providing safety because you can tell the difference between an object and its chaperone.

For this reason exactly, I think it'd be good to introduce a prop:has-hashable-object-identity property that new user-defined data structures don't have to implement (even if every value up to now has done so).

And, uh, I think this would eventually motivate migrating to yet another list representation, this time using cons cells that are not mutable and do not have object identity. But maybe one step at a time...

With that in mind, this is how I figure the equality functions can be renamed and recontextualized:

  • eq? -> equal-cheap, an equality check that only works on values that can be compared without traversing them. All values that have object identity support this since the object identity is the only part that needs to be checked.
  • (did not exist) -> equal, a new deep equality check that not all values support, but that is actually guaranteed to tell the truth about indistinguishability for values that do. Again, this any value that has object identity can be compared without traversing into it, so this operation is equivalent to eq? in Racket 1, and that's why it doesn't exist yet.
  • equal? -> custom-hashable-equal, which no longer accepts all values, but just those values which have object identity, have support for the new equal, or have prop:equal+hash. The name might tip people off to the gotchas of using this kind of equality, since programmatic behavior in an untyped language is unlikely to enforce algebraic laws.
  • = -> numbers-are-actual-numbers-and-have-been-rounded-to-be-equal. This rather elaborate name clarifies that this operation is only for numbers, that it returns #f if there are any NaNs, and that it isn't comparing real numbers, but only roundings thereof, which may have become arbitrarily imprecise.
  • eqv? -> (retired, possibly renamed to number-content-or-character-content-or-other-value-object-identity-is-equal)

@jackfirth
Copy link
Collaborator Author

I think in nearly all cases where eq?-based hash tables are used, they're used with quoted symbol keys. In these cases, instead of exposing pointer equality we can just point users to a special purpose data structure that assumes symbol-like keys such as Rebellion's records. This also makes it easier to optimize further than what is possible with a generic hasheq data structure that assumes nothing about its keys.

I really think we should avoid introducing additional equality operators and concentrate on changing or removing the existing ones. I'd much rather change equal? to treat mutable objects with different identity as unequal than introduce a new equality operator with this behavior. If I wanted to compare two mutable vectors on their contents alone, the first approach I'd reach for would be (equal? (vector->immutable-vector vec1) (vector->immutable-vector vec2)) rather than read through the docs on the various equality operators until I'm sure I have the one that does what I want.

@AlexKnauth
Copy link
Member

AlexKnauth commented May 17, 2020

For an equality function that uses structural equality on immutable values, but reference equality on mutable values, is chaperone-of? the function that currently does that?

I would propose making equal? more like Rackjure's egal? or Pyret's equal-always, while renaming the existing structural-on-mutable equal? to equal-now?.

However if chaperone-of? already behaves like an "equal-always" predicate, then that might be as simple as renaming chaperone-of? to equal? while renaming the old equal? to equal-now?. Is this right?

If chaperone-of? behaves like an "equal-always", then the only difference I can think of between chaperone-of? and egal? is on streams, where egal? says streams are egal if they have egal elements (potentially diverging on infinite streams), where chaperone-of? would return false.

(Edit: okay it wouldn't quite be that simple, there would also need to be chaperone-of-based hash-tables, renamed to equal-based hash-tables along with it)
(Edit 2: apparently chaperone-of? isn't symmetric, so I suppose the new "equal-always" version of equal? should be equivalent to (or (chaperone-of? a b) (chaperone-of? b a)), except it shouldn't have to traverse the data twice)
(Edit 3: so just or-flipping on the outside won't be enough, the or-flip needs to happen on every nested place because it could need different directions in different parts of the data)
(Edit 4: so just or-fliiping-and-traversing won't be enough, because if x2 and x3 are both chaperones of x1, they're unrelated by chaperone-of? in both directions)

@sorawee
Copy link
Contributor

sorawee commented Nov 19, 2021

In the current Rhombus:

  • = is for variable/object mutation (set!, vector-set!, etc.).
  • == is for number comparison (=)
  • === is for equal?

I think this is not ideal.

  • No other languages use == strictly for number comparison, so it would be very confusing to people who come from other languages.
  • === is not ergonomic. It's used everywhere, and three keystrokes cost too much.
    • = for mutation seems like a good target to be removed, since it's much less frequently used compared to other operations.

My preference is:

  • := for mutation (from Discord, @plane suggests that <- is also another good candidate)
  • = for number equality (from Discord, @soegaard suggests that this could be generalized to eqv?, though I personally think eqv? should not exist)
  • == for equal?
  • === for eq?

@mflatt
Copy link
Member

mflatt commented Nov 19, 2021

One potential confusion with == as equal? is that 1 is not equal? to 1.0. (That's one reason why == isn't currently equal?.)

Maybe == could be numeric equality when both arguments are numbers and equal? otherwise.

@jackfirth
Copy link
Collaborator Author

What if instead of making == mean numeric equality on numbers, we just made numbers read as exact decimals by default? Then 1 == 1.0 would be true even if == means equal?.

@sorawee
Copy link
Contributor

sorawee commented Nov 26, 2021

@jackfirth What should be the answer of

1 == exact_to_inexact(1)

?

@jackfirth
Copy link
Collaborator Author

I'd be fine with that being false, and telling people if they want to mix comparisons of exact and inexact values, they could use this:

is_numerically_equal(1, exact_to_inexact(1))

Exactness-insensitive numerical equality tests make up a very tiny fraction of all equality tests. I'm not too worried about them requiring some explicitness.

@Metaxal
Copy link

Metaxal commented Nov 26, 2021

Hmm, I'm not too sure that's a good thing, as quite likely a large portion of the users (in particular students) will expect this result to be true. This can lead to serious silent bugs too.

I would be okay with 1 == exact_to_inexact(1) raising an exception instead.

@jackfirth
Copy link
Collaborator Author

I think of it as a type mismatch. Two things of different disjoint types can't be equal, similar to how some_number == some_string is allowed, but pointless. It's important that it's allowed so that things like x == some_gensym work correctly. But we could lean on static analysis to detect cases where equality tests are guaranteed to be false.

@Metaxal
Copy link

Metaxal commented Nov 26, 2021

My current(*) personal preference:

  • set, := or <- for assignment
  • = for numeric equality (1 = 1. is true).
  • == for value equality within the same type, and exception raised if types are different
  • equal? for type equality, then value equality
  • object-equal? or ref=? for eq? (these names aren't great, but at least it's intuitive)
  • eqv? does not exist

Roughly based on the principle of least surprise: names should be intuitive enough and not do what you would not expect them to do (which is less stringent than "they should do what you expect them to do").

(*) Ask me next month for different answers.

@jackfirth
Copy link
Collaborator Author

jackfirth commented Nov 26, 2021

=, ==, and equal? all having slightly different semantics is not a good state of affairs. I think there should only be one symbolic equality operator and it should have the same meaning as whatever is_equal means in Rhombus (which may or may not be the same as equal? in Racket.) Any other forms of equality should have descriptive names like is_numerically_equal or is_same_object and should not have symbolic operators associated with them. We want equality to have a single clear default meaning, with other notions of equality given names that say how they're different from the default meaning.

@mflatt
Copy link
Member

mflatt commented Nov 26, 2021

What if instead of making == mean numeric equality on numbers, we just made numbers read as exact decimals by default?

That would definitely create surprises, given how much more expensive exact arithmetic is.

Exactness-insensitive numerical equality tests make up a very tiny fraction of all equality tests.

I'm not at all sure this is true, and I think this must depend on what kind of program you're writing.

While there are pitfalls to the convention that . means a floating point number and to conventions for implicit coercions between them, it's a pretty good compromise for the most part between performance and real arithmetic. I haven't seen the combination as a big source of bugs in our libraries. (It is a source of confusion for beginners in practically all languages, and this is one of those places where a different choice may be appropriate in a pedagogical language like BSL.)

More generally, I'm surprised at how often opinions in this thread dismiss the status quo, as if the only reason Racket has =, eq?, and equal? the way they are and in widespread use is because we've been too lazy to try anything else. Equality is tricky. Picking a small number of equality operators and allocating names to encourage the best ones — that sounds tractable and a good idea from our past experience. Trying to boil it down to just one equality operator seems at odds with experience across languages.

@mflatt
Copy link
Member

mflatt commented Nov 26, 2021

My current(*) perference:

  • == is Racket's equal? (because that's usually what you want, and it's a good choice for match)
  • === is Racket's eq? (because it really is a common operation, however much we might prefer a different universe)
  • .= is Racket's = (because I think traditional number comparison is common enough to merit a compact name)

Meanwhile:

  • := is assignment (because Pascal was my second language)
  • = is not a predefined operator, although it is used in places like providing the default for an optional argument

I still worry that the behavior of == as equal? on numbers will end surprising to some, but it's the least-bad effect I see in various compromises. Also, ending up with Racket's set of equality operators seems like a big benefit.

(*) Ask me again after you ask @Metaxal.

@sorawee
Copy link
Contributor

sorawee commented Jan 1, 2023

Right. The key function interface doesn't provide a theoretical guarantee either.

But to be fair, practically, you need to work much harder to break the invariant when you use the key function interface. In contrast, for binary relation interface, it can be broken easily, sometimes even unintentionally.

That being said, we could treat the binary relation interface as "unsafe operations" and the key function interface as "safe operations".

  • Unsafe operations are discouraged from being used, except when it is really needed.
  • Safe operations are built on top of core unsafe operations
  • Unsafe operations doesn't preserve invariant.

The existing unsafe operations in Racket seems to indicate that the approach is viable.

@jackfirth
Copy link
Collaborator Author

jackfirth commented Jan 1, 2023

Currently, all classes in Rhombus automatically get structural equality by default unless they contain mutable fields. That's what 99% of use cases need anyway.

@countvajhula
Copy link
Contributor

Re: nondeterminism, we could consider it analogous to streams where a user could write a random stream. In this case, due to the memoization that is part of the Scheme standard for streams, the stream will still be deterministic in actual use. Similarly, if memoization is considered part of the specification rather than just an optimization, we could guarantee determinism.

@jackfirth
Copy link
Collaborator Author

Mandating memoization would double the memory consumption of all collections at the very least, if not all values in the program.

@countvajhula
Copy link
Contributor

Yes that's true, it would certainly increase the memory requirement. There may be ways to mitigate this in practice, e.g. see "meh-less not meh-mo". It may even be possible to do this sharing of memoized representatives proactively in some way, maybe in cooperation with or in a process analogous to garbage collection. E.g. at the time when a representative for a value is computed, it can see if that representative is already "pointed to" by another value, and if so, reuse it. The garbage collector would not release the representative as long as at least one value is represented by it.

@jackfirth
Copy link
Collaborator Author

That might be feasible, but I'm still not totally sure what goal it accomplishes. Are we trying to make it easier to make correct equality implementations, or impossible to make incorrect equality implementations? If it's the former, that can be done on top of a normal binary relation interface quite easily without burdening all data structures with extra memory and performance costs. If it's the latter, fundamentally that's not possible to do in Rhombus because we want interop with Racket to be possible and because the key function interface (even with memoization) still permits buggy key function implementations.

@countvajhula
Copy link
Contributor

Yes, I would say both of those. For the former, the key interface is much simpler. For the latter, I'd recommend that we explore designs in their ideal form to see whether we can bring the whole language ecosystem to a better place, what that would take, and whether it's worth it. If it came to it, a "python 3" style transition for the Racket language might be a good thing both from the perspective of Rhombus as well as Racket.

buggy key function implementations

Could you give an example of what you mean here?

@AlexKnauth
Copy link
Member

After some discussion with @rocketnia, the unreduced rational number case isn't quite as bad as I first thought. Greatest-Common-Divisor algorithms exist that are faster than the factoring I had in mind. So the key function could be O(n^2), which in practical use, isn't much worse than the binary equal-proc function would be, at least is asymptotic complexity.

The key version would still require more memory allocation, and wouldn't have the same easy short-circuiting behavior that the equal-proc allows... and I'm still worried there might be other data structures with equality behavior I haven't thought about...

@sorawee
Copy link
Contributor

sorawee commented Jan 3, 2023

What key function would I specify to compute if two graphs are isomorphic?

@countvajhula
Copy link
Contributor

countvajhula commented Jan 3, 2023

A nonlinear example is a good choice to scrutinize. We could construct a hash representing the adjacency list representation of the graph, and that would then reduce to hash equality, assuming hashes are a key type. edit: Ah but you are asking about isomorphism -- not sure about that. One consideration is that using isomorphism as the definition for equality would prevent us from checking the stronger claim of equality using the generic equality predicate. So in this case, it may be reasonable to define a custom binary isomorphism predicate rather than use =.

@countvajhula
Copy link
Contributor

Not remotely efficient but we could construct the set of all isomorphic graphs, assuming sets are a key type. All isomorphic graphs would construct the same set. But looks like this problem is known to be hard.

@rocketnia
Copy link

rocketnia commented Jan 3, 2023

After chatting with @AlexKnauth, I'd like to catch up a bit.


@AlexKnauth: Hash tables and Hash sets will try to hash it, but most other collections won't.

Alex and I talked about this a bit between us, but for everyone else's sake, I actually was referring to other containers here (collections and just plain structs). The issue I was talking about was that the way these container types exist in Racket today with gen:equal-mode+hash, they only support one way of being used as a key in an efficient map. They don't support ordering for AVL trees or traversal for trie lookup, only hashing for hash tables. So whenever they're used as map keys, their contents are effectively limited to values that can be hashed.


@jackfirth: I'm against building sets into the core datatypes special-cased by the equality interface. Part of the advantage of the generic collection interfaces in Rhombus is supposed to be that we don't have to build special privileged data structures into every nook and cranny of the language core.

I sympathize with this concern, but I hope it's clear that the built-in types in the key approach don't actually show up in the API. You can always use any value as a key method result; it's just that the abstraction tower of key methods hits bedrock somewhere. This is similar to macros expanding until they hit special forms, or functions calling each other until they hit built-in procedures.

The kind of decoupling you're going for is similar to the value I see in the key approach. You're talking about replacing overly specific concrete types with generic ones, and I'm talking about deemphasizing the overly specific gen:equal-mode+hash interface in favor of one that hides its implementation details better.


@AlexKnauth: I wouldn't expect that we'd add new key types frequently (or ever). If indeed there is such a special type that it doesn't map to a key type, then it is such a fringe case that the user might as well define and use a custom equality predicate at that point, instead of relying on a generic one.

Alex and I did some thinking about whether we could come up with a more interesting general class of counterexamples.

One example Alex found was that comparing lambda calculus terms for alpha-equivalence might have a quadratic cost if key results are in a locally nameless de Bruijn style. Each individual key computation could have to traverse the whole term, and if every subterm ends up needing to have its key computed, then that could be a quadratic cost overall. We probably shouldn't have a whole lambda calculus term language be a built-in type, or should we?

More generally, the question of whether the set of available key types is are complete enough is a lot like the question of whether a programming language is complete enough for its purpose. I believe it can indeed be designed like a language, not like the "ad-hoc workarounds" @sorawee is worried about. I think it actually gets rather complete very fast -- in fact, as soon as you have a built-in orderless map type, which lets you build a lot of other things set-theory-style. After that point, I think the benefit of adding more key types will be mainly for optimizations.

That said, there might be some stragglers. Few programming languages have every possible feature.

I don't know if what I'm about to say will make sense to anyone but me, but once we have a pretty complete language of key types, the rest of the stragglers could be almost like the ambient side effects of that language. This analogy shows up particularly well if we consider comparison mechanisms which have extra outputs or extra inputs.

  • For instance, if I wanted a slightly new comparison mechanism where comparing two values would give me a report about how symbolic values (like free variables of a term) on one side were matched up with symbolic values on another, then I might need some new key types that can express operations to generate this report.
  • Likewise, if I wanted a new comparison mechanism that allowed me to compare up to a tolerance (like @AlexKnauth has pointed out with check-within) or to compare up to a provided substitution vocabulary, then somehow I'd need new key types that knew how to respond to these new input options.

So there might be a general way of finding counterexamples by asking what it would mean to have an extensible effect system in which users define creative new comparison mechanisms.

Maybe the users who define these new mechanisms will just need to use "unsafe" binary comparison techniques, thus justifying the existence of those techniques for them to use. But also, the same users might appreciate that they don't have to migrate every type in the language to have it support their new entrypoints, because many of the types will just be key method abstractions that inherit their support automatically from the things they expand into -- thus an emphasis on the key method approach is justified too.

In this sense, if key methods are like macros and "built-in" types are like special forms, then a crude way to get custom side effects might be something like local-expand with a stop list. (Crude enough that it exposes implementation details! It also makes me a liar in what I said just now about built-in types not being part of the API surface.) Another way might be to use actual side effects from the procedural comparison code, like finding the extra input in a parameter. (That way also exposes implementation details, since I bet some gen:equal-mode+hash implementations haven't been written to have an API-stable ordering of recur calls.)


@jackfirth: Putting the key function concept in the core interface of equality instead of making it a convenience layer on top of a more traditional binary relation interface doesn't buy us any additional guarantees.

Yeah, I think the guarantees @countvajhula is going for aren't much to hope for. Rhombus doesn't exactly look like it's going to have a sophisticated type system for expressing mathematical proofs of algebraic laws, so all of its generic instances, not just this one, are going to allow a lot of misbehavior. There was at least one person early on proposing for Rhombus to be an ocap-secure language with good control of side effects, but that's not the direction most of us seem to be expecting it to go.

The "guarantees" are nice in the limited sense of helping guide people toward writing well-behaved code. Beyond that, the only reason I would consider the key interface to be worth imposing on people who wouldn't want it is because of its decoupling from the details of the gen:equal-mode+hash mechanisms and any other potential comparison mechanisms like those.


@jackfirth: Currently, all classes in Rhombus automatically get structural equality by default unless they contain mutable fields. That's what 99% of use cases need anyway.

Absolutely. And that's honestly a great way to avoid having every value in the program need a memoized key; one kind of key function is so common that it gets its own special optimization and doesn't even have to be written out at all. :)


@sorawee: What key function would I specify to compute if two graphs are isomorphic?

One possible cop-out answer is that graphs could be a built-in type. :)

Another possible cop-out answer is that people probably expect the generic equality interface to take somewhere around linear time, maybe quadratic time at worst, so hiding a whole graph comparison behind the generic equality operation might actually be a bad idea.

But in the end, I think that in the Racket ecosystem, something that has an intense need for optimization like that should be able to use the FFI. So if you end up opting for something like binary comparison -- not quite as easy to get right as key functions but safer than the FFI -- I don't see a problem with that.

@rfindler
Copy link
Member

rfindler commented Jan 3, 2023 via email

@AlexKnauth
Copy link
Member

AlexKnauth commented Jan 3, 2023

Converting to deBruijin for a single term is fast. I don't know the actual asymptotic complexity off the top of my head but I expect it's pretty close to linear.

But converting to deBruijin for all the sub-terms of a term, at the time of construction, from the bottom up, is more expensive.

(lambda (x) (lambda (y) (x y)))
-> deBruijn
(lambda (lambda (1 0)))

That can be fast because the conversion can use an environment as it goes from the top down. But from the bottom up it's not as clean:

a = x
b = y
c = (a b)
d = (lambda (y) c)
e = (lambda (x) d)
-> deBruijn
a = x  ; free variables stay named, only bound variables get indices
b = y
c = (a b)
d = (lambda (x 0))  
; couldn't just reuse the deBruijn for c, it had to traverse it to find the y
e = (lambda (lambda (1 0)))
; couldn't just reuse the deBruijn for d, it had to traverse it again to find the x

This is an example where it doesn't make sense to compute the key as part of constructing the data from the bottom up.

@sorawee
Copy link
Contributor

sorawee commented Jan 3, 2023

Yeah, @mflatt said on Discord:

If the O(n)-sized thing can be extend to create a new, larger value, it's important that O(n) work is not done again/separately for the larger value, though.

The De Bruijn index canonicalization is a counterexample where this is not easily done because the process is not context-free.

@sorawee
Copy link
Contributor

sorawee commented Jan 3, 2023

Thanks to friends from my lab, here are two more examples:

  • Checking for equivalence of finite automata. There's a classic algorithm to check for equivalence in linear time. But canonicalization algorithm currently known takes O(n log n) (see: https://ecommons.cornell.edu/handle/1813/5958)
  • Define a funny list to be a list of length >= 2 whose elements are all integers >= 2. Two funny lists are equivalent when their products are the same. Checking for equivalence of two funny lists is straightforward. Canonicalization of a funny list is more difficult, since you probably need to do integer factorization.

@sorawee
Copy link
Contributor

sorawee commented Jan 4, 2023

Oh, the funny list doesn't work, since the key function doesn't need to produce a funny list. It could just produce the product.

@sorawee
Copy link
Contributor

sorawee commented Jan 4, 2023

Some papers to read (thanks to Josh Horowitz and Remy Wang):

On page 2 of COMPLEXITY CLASSES OF EQUIVALENCE PROBLEMS REVISITED, Ker is the set of eqv rels whose key function interface is efficient. PEq is the set of eqv rels whose binary comparator interface is efficient. It's known that Ker is a subset of PEq (obviously). But it's not known if Ker = PEq.

@countvajhula
Copy link
Contributor

That's fascinating and very relevant. Thank you Josh and Remy! 😆

@sorawee
Copy link
Contributor

sorawee commented Jan 4, 2023

The results on page 10-11 are also interesting.

Subgroup equality

Given two sets S and T, which are subsets of a finite group G, we wish to determine if the subgroup generated by S is equal to the subgroup generated by T.

This can be done in polynomial time. It suffices to check generating elements. I.e., see if for each element in S (resp T), the element is a member of the subgroup generated by T (resp S).

Checking for membership of a subgroup generated by a set can be done in polynomial time when G is given by its multiplication table representation (see Complete problems for deterministic polynomial time; https://www.sciencedirect.com/science/article/pii/0304397576900682)

There's no known polynomial time complete invariant for subgroup-generating subset.

@countvajhula
Copy link
Contributor

I need to step away from this discussion for a bit to organize the upcoming ABE Congress next weekend on Jan 14 (which you are all invited to). But for the meeting tomorrow, I'd like to summarize some points and suggest action items for consideration:

  • Mathematically, the key approach is not less expressive than using a binary comparator (see this theorem). To have an intuition for why, it's because of transitivity -- if a and b are both equal to the same thing, then they are equal to each other.
  • In practice this requires that the set of key types be rich enough to have faithful representatives for arbitrary values. This does not appear to require a large number of key types.
  • Some of the new collections being added could further expand the key types if needed.
  • The papers from @sorawee can help us understand what is known theoretically about this problem from the perspective of efficiency (not just expressiveness). I'm getting some serious P vs NP vibes here.
  • Yet, these efficiency analyses are dependent on what optimizations we employ in practice -- e.g. memoization, hash preverification are big ones we've talked about that make a difference here.
  • So in order to make informed assessments, we need some data. I'd suggest:
  1. an algorithmic complexity analysis of the two approaches in different cases
  2. expansion of the benchmarks in the POC since performance in practice will be usecase dependent
  3. a collection of examples of gen:equal+hash implementations in the wild from a Github search to study (suggested by @jackfirth ), including any others we'd like to include such as @AlexKnauth 's deBruijin example -- we can compare the implementations for each of these examples side by side and that would give us a sense of the experience of working with these interfaces.

This appendix could be a starting point to follow in tabulating the analyses and benchmark results.

See you all tomorrow.

@sorawee
Copy link
Contributor

sorawee commented Jan 5, 2023

Re "Mathematically, the key approach is not less expressive than using a binary comparator", that depends on the definition of expressiveness that you are using. Expressiveness via (unrestricted) Turing reduction allows you to use unbounded resource. But there are other kinds of reductions that take resources into account. E.g., polynomial-time reduction and log-space reduction.

Language-wise, expressiveness could also take "ergonomics" into account (see e.g. "essential features" in https://www.sciencedirect.com/science/article/pii/016764239190036W). "Ergonomics" is another thing that is worth considering, too.

@sorawee
Copy link
Contributor

sorawee commented Jan 6, 2023

This example is due to David Eppstein (https://cstheory.stackexchange.com/a/10704/68269). I find this to be very convincing and easy to understand than other examples.

For any positive integers x and y, define x == y to be: x = y OR x = pq and y = pr where p, q, r are all primes with p < q and p < r.

These are equivalence classes in this relation:

class 1: 1
class 2: 2
class 3: 3
class 4: 4
class 5: 5
class 2*: 6, 10, 14, 22, 26, ... (2 * p for p > 2)
class 7: 7
class 8: 8
class 9: 9
class 11: 11
class 12: 12
class 13: 13
class 3*: 15, 21, 33, ... (3 * p for p > 3)
...
class 5*: 35, 55, ... (5 * p for p > 5)
...
class 7*: 77, ... (7 * p for p > 7)
...

A naive way to compute whether x == y is to compute the class labels and compare them (i.e., class label is the key function)

fun equals(x, y):
  class-label(x) == class-label(y)

However, the class-label key function probably will require integer factorization, which is supposedly difficult (unless P = NP).

A better way is to utilize gcd and prime?. These all can be computed in polynomial time.

fun equals(x, y):
  if x = y:
    #true
  else:
    let g = gcd(x, y)
    g < x/g and g < y/g and prime?(g) and prime?(x/g) and prime?(y/g)

@rocketnia
Copy link

Awesome example, and a highly relevant StackExchange thread for seeing other examples as well!

is supposedly difficult (unless P = NP).

A nitpick: Integer factorization may not be known to be in P, but isn't known to be NP-complete either. I think that's still difficult enough to make the point. :)

@countvajhula
Copy link
Contributor

That's a great example and it certainly makes a point. But we have to ask ourselves what we are designing for -- is it to support the ability to have maximum performance in all cases, never mind that in practice these cases will be slower than they could be due to poorly written hash functions, high boilerplate and documentation-lookup related tax on the developer (a cost that many in the community have brought up, despite the high quality of docs -- this speaks to the complexity of the language rather than any failing on the part of documentation authors), and possibly even error-prone and invalid equality relations? Or would we prefer to have potentially better performance in the vast majority of cases (good by default, professionally written hash functions, high rate of success in hash preverification due to soundness and uniformity, highly optimized key type comparison used by 99% of values in the language -- remember that even without many of the proposed optimizations, and despite all this talk of performance, the key approach is currently ahead on benchmarks), more robust guarantees that may be assumed in the language (equivalence relation invariants that could support optimizations we haven't even thought of), and low developer overhead from not having to worry about writing hash functions and reading complicated docs? The key approach is dramatically simpler and comes with stronger guarantees. And after all, nothing is preventing users in these corner cases from writing binary equality predicates for their own use -- writing a simple function is all it takes. They don't need to extend the generic equality operator for that.

Yet, these characterizations are only intended to make a different point -- that there are many ways of looking at something, and looking at things in one way vs another way might produce different conclusions. And these are the kinds of discussions and considerations I'd encourage more of, as I think that would help us reach the best solutions and, equally importantly, feel confident that we are making the right decisions -- whatever those decisions turn out to be!

Having found myself often in the role of defending a particular position (though it isn't a role I've sought), I think I've now said and done all I should on that front. I have great faith in the core developers and everyone on this thread, so unless there are any objections, I will step down as active editor of the equality extension proposal. I am happy to resume the role if I am called upon to do so, and, as always, remain committed to helping in any way I can 🙂

@mflatt
Copy link
Member

mflatt commented Jul 29, 2024

#527

@mflatt mflatt closed this as completed Jul 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
libraries This is for issues related to Rhombus libraries moving on
Projects
None yet
Development

No branches or pull requests