Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-investigate string and sequence solvers #189

Open
pschanely opened this issue Nov 21, 2022 · 7 comments
Open

Re-investigate string and sequence solvers #189

pschanely opened this issue Nov 21, 2022 · 7 comments

Comments

@pschanely
Copy link
Owner

Noted in #187:

Hello, I was digging into this trying to isolate the problem that part of the problem might be how the 'in' operator is mapped to z3.

Crosshair watch returns something on this function after 55 seconds, meaning (if I understand correctly this part) it takes that time to generate an example given the pre-conditions. Because post-conditions are never met it should fail in the first example. This happens also with dictionaries and Sets.

def keys_in_an_array(arr: List[str]) -> bool:
    """
    pre: "a" in arr
    pre: "b" in arr
    pre: "c" in arr
    pre: "d" in arr
    pre: "e" in arr
    post: __return__ == True
    """
    return False if "a" in arr else True

But solving the same constraints in z3 takes 0.1s

In [15]: import z3
    ...: import time
    ...: start_time = time.time()
    ...: s = z3.Solver()
    ...: sseq = z3.Const('sseq', z3.SeqSort(z3.StringSort()))
    ...: s.add(z3.Contains(sseq, z3.Unit(z3.StringVal("a"))))
    ...: s.add(z3.Contains(sseq, z3.Unit(z3.StringVal("b"))))
    ...: s.add(z3.Contains(sseq, z3.Unit(z3.StringVal("c"))))
    ...: s.add(z3.Contains(sseq, z3.Unit(z3.StringVal("d"))))
    ...: s.add(z3.Contains(sseq, z3.Unit(z3.StringVal("e"))))
    ...: s.add(z3.Contains(sseq, z3.Unit(z3.StringVal("f"))))
    ...: s.check()
    ...: print(s.model())
    ...: print("--- %s seconds ---" % (time.time() - start_time))
[sseq = Concat(Unit("c"),
               Concat(Unit(""),
                      Concat(Unit("f"),
                             Concat(Unit(""),
                                    Concat(Unit("d"),
                                        Concat(Unit(""),
                                        Concat(Unit("e"),
                                        Concat(Unit(""),
                                        Concat(Unit("a"),
                                        Concat(Unit(""),
                                        Unit("b")))))))))))]
--- 0.01843857765197754 seconds ---

Sorry if I misunderstood something, I am still trying to get my head around the internals of it.

Originally posted by @petrusboniatus in #187 (comment)

@petrusboniatus
Copy link

petrusboniatus commented Nov 27, 2022

I was investigating more on this, because efficient "in" operations turned out to be unavoidable for the data frame issue.

  1. Sequence of Sequences, for example a Sequence of strings, are very unstable, I get unknowns in z3 if:
  • I put a length in a constraint
  • I put distinct in a constraint
  • Even scaling up the example I created above.
    So they are discarded for now.
  1. If all "contains" operation are compressed in one you get decent performance on Sequence of Ints, This is very usefull to model sets because order does not matter.
    This is an example of a set of ints with more than 150 elements an 100 in constraints, solved in less than 2 seconds
   s = z3.Solver()
   l = z3.Const("int_seq", z3.SeqSort(z3.IntSort()))
   sub = z3. Empty(z3.SeqSort(z3.IntSort()))
   for i in range(100):
       sub = z3.Concat(sub, z3.Unit(z3.IntVal(i)))

   s.add(z3.Contains(l, sub))
   s.add(z3.Distinct(l))
   s.add(z3.Length(l) > 150)
   s.check()
   print(s.model())
  1. To maintain the performance and have a Set List of Strings for example can be achieved by mapping the integer values through a z3 "array":
  import IPython; IPython.embed()
  s = z3.Solver()
  l = z3.Const("string_seq", z3.SeqSort(z3.IntSort()))
  idx = z3.Const("map", z3.ArraySort(z3.IntSort(), z3.StringSort()))

  sub = z3.Empty(z3.SeqSort(z3.IntSort()))
  for i in range(100):
      sub = z3.Concat(sub, z3.Unit(z3.IntVal(i)))
      s.add(idx[i] == f"{i}")

  s.add(z3.Length(l) > 150)
  s.add(idx[l[120]] == "potato")
  s.check()
  print(s.model())

My proposal is to use the principle in 3th to model Sets and Lists in crosshair. I still will try to make an implementation of a crosshair proxy of at least Set[Int] to check that it translates as well to actual performance.

Edit: changed code messed up by the spell checker, and corrected 1 error.

@pschanely
Copy link
Owner Author

Interesting stuff! I've got a bunch of things to talk about here. I'm travelling right at the moment, but will circle back in a day or two!

@pschanely
Copy link
Owner Author

pschanely commented Nov 28, 2022

CrossHair's current implementation makes surprising choices for representation in Z3. In most cases, I started with the "obvious" choice and later ended up moving towards something dumber (trading more execution paths for less work during solving).

To keep maintenance low and quality high, I've largely tried to keep only one kind of implementation for each Python type. But in the ideal world, CrossHair probably would try different modelings for the same type in parallel. I very much want this kind of "representation forking" for floats, as it happens. At any rate, it would be great to see whether some of your work can stand alongside the existing implementations. (and we could also easily run some evaluations - hopefully over time, the SMT solvers get better and we can ditch some of CrossHair's simpler representations)

Okay, let's get into some of the details. Starting simple, let's just talk about regular old strings. I'll try to follow up with additional mini-writeups about sets and lists here soon.

string

Current modeling (source): A variable Z3 integer length, and a lazily created/expanded list of Z3 integers, one per codepoint.
Obvious question: why not use a Z3 string?

  • In the past, I've run into practical problems with the Python API and unicode characters / null bytes. Hopefully most of these problems don't exist today, though.
  • Python's full regex capabilities significantly exceed those of Z3's. To support all features, we ended up going with a fairly basic linear search with backtracking, so a smarter representation isn't helping us with regexes today.
  • Many common operations aren't supported in Z3 - str.lower(), str.isspace(), int(<str>). In some cases, even if Z3 did support them, the semantics wouldn't match Python's semantics. Using integer codepoints lets us use numeric range comparisons to help implement such capabilities.
  • The unicode database itself depends on the Python version; correspondingly some methods on strings actually vary by Python version.

Now, there are lots of reasons to prefer a native SMT string. A variety of simplistic string scenarios could be handled much more effectively (e.g. just find a string in another string). And handling for complex cases may now be better than I recall (historically, it seemed that the solving can easily get slow; nesting str.substr calls was problematic at one point).

@petrusboniatus
Copy link

petrusboniatus commented Nov 29, 2022

Hello,
I will be a little busy this week. I still plan to have the Set[I] by the end of the week.

Just wanted to comment maybe a maybe stupid idea that is to have extra Interfaces for a subset of the builting api that we know that works well. One example might be a dictionary without Iterate or length that more closely resemble an array in z3. Jax follows a similar 'same api with restrictions' approach and I think it works good. This might come very handy for the user to build his symbolic representation / type system for his classes that are too problematic to be directly mapped by z3.

@pschanely
Copy link
Owner Author

Just wanted to comment maybe a maybe stupid idea that is to have extra Interfaces for a subset of the builting api that we know that works well. One example might be a dictionary without Iterate or length that more closely resemble an array in z3. Jax follows a similar 'same api with restrictions' approach and I think it works good. This might come very handy for the user to build his symbolic representation / type system for his classes that are too problematic to be directly mapped by z3.

That doesn't sound stupid to me!

@petrusboniatus
Copy link

Hi again.
I have an implementation of the Set[int] here only trying in for FrozenSets. I've not tested the actual performance yet (I am more confident with the List sequence), but passes the pre-existing unit tests.

In the meantime I have a question: I had to do a little modification to the core to be able to register a proxy for specific type with generics (FrozenSet[int]), as it was the fastest way to start testing this, not sure if it is the proper way because it feels like this dictionary is tracking origin type for some specific reason.

@pschanely
Copy link
Owner Author

Oh wow, you got quite far on your own. Let me answer your immediate question, and then will try to find a little time to review and say more things later this week.

We look up symbolic factories using the origin of the type rather than the actual type (see here). There isn't an important reason for this, except for that's USUALLY what we want. (one factory for all types of sets, lists, etc) If you want to keep that logic as-is, you can just add your set implementation under new branches in make_set(), which is probably what I'd advise for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants