Re-investigate string and sequence solvers #189

pschanely · 2022-11-21T13:10:03Z

Noted in #187:

Hello, I was digging into this trying to isolate the problem that part of the problem might be how the 'in' operator is mapped to z3.

Crosshair watch returns something on this function after 55 seconds, meaning (if I understand correctly this part) it takes that time to generate an example given the pre-conditions. Because post-conditions are never met it should fail in the first example. This happens also with dictionaries and Sets.

def keys_in_an_array(arr: List[str]) -> bool:
    """
    pre: "a" in arr
    pre: "b" in arr
    pre: "c" in arr
    pre: "d" in arr
    pre: "e" in arr
    post: __return__ == True
    """
    return False if "a" in arr else True

But solving the same constraints in z3 takes 0.1s

In [15]: import z3
    ...: import time
    ...: start_time = time.time()
    ...: s = z3.Solver()
    ...: sseq = z3.Const('sseq', z3.SeqSort(z3.StringSort()))
    ...: s.add(z3.Contains(sseq, z3.Unit(z3.StringVal("a"))))
    ...: s.add(z3.Contains(sseq, z3.Unit(z3.StringVal("b"))))
    ...: s.add(z3.Contains(sseq, z3.Unit(z3.StringVal("c"))))
    ...: s.add(z3.Contains(sseq, z3.Unit(z3.StringVal("d"))))
    ...: s.add(z3.Contains(sseq, z3.Unit(z3.StringVal("e"))))
    ...: s.add(z3.Contains(sseq, z3.Unit(z3.StringVal("f"))))
    ...: s.check()
    ...: print(s.model())
    ...: print("--- %s seconds ---" % (time.time() - start_time))
[sseq = Concat(Unit("c"),
               Concat(Unit(""),
                      Concat(Unit("f"),
                             Concat(Unit(""),
                                    Concat(Unit("d"),
                                        Concat(Unit(""),
                                        Concat(Unit("e"),
                                        Concat(Unit(""),
                                        Concat(Unit("a"),
                                        Concat(Unit(""),
                                        Unit("b")))))))))))]
--- 0.01843857765197754 seconds ---

Sorry if I misunderstood something, I am still trying to get my head around the internals of it.

Originally posted by @petrusboniatus in #187 (comment)

petrusboniatus · 2022-11-27T13:59:16Z

I was investigating more on this, because efficient "in" operations turned out to be unavoidable for the data frame issue.

Sequence of Sequences, for example a Sequence of strings, are very unstable, I get unknowns in z3 if:

I put a length in a constraint
I put distinct in a constraint
Even scaling up the example I created above.
So they are discarded for now.

If all "contains" operation are compressed in one you get decent performance on Sequence of Ints, This is very usefull to model sets because order does not matter.
This is an example of a set of ints with more than 150 elements an 100 in constraints, solved in less than 2 seconds

   s = z3.Solver()
   l = z3.Const("int_seq", z3.SeqSort(z3.IntSort()))
   sub = z3. Empty(z3.SeqSort(z3.IntSort()))
   for i in range(100):
       sub = z3.Concat(sub, z3.Unit(z3.IntVal(i)))

   s.add(z3.Contains(l, sub))
   s.add(z3.Distinct(l))
   s.add(z3.Length(l) > 150)
   s.check()
   print(s.model())

To maintain the performance and have a ~~Set~~ List of Strings for example can be achieved by mapping the integer values through a z3 "array":

  import IPython; IPython.embed()
  s = z3.Solver()
  l = z3.Const("string_seq", z3.SeqSort(z3.IntSort()))
  idx = z3.Const("map", z3.ArraySort(z3.IntSort(), z3.StringSort()))

  sub = z3.Empty(z3.SeqSort(z3.IntSort()))
  for i in range(100):
      sub = z3.Concat(sub, z3.Unit(z3.IntVal(i)))
      s.add(idx[i] == f"{i}")

  s.add(z3.Length(l) > 150)
  s.add(idx[l[120]] == "potato")
  s.check()
  print(s.model())

My proposal is to use the principle in 3th to model Sets and Lists in crosshair. I still will try to make an implementation of a crosshair proxy of at least Set[Int] to check that it translates as well to actual performance.

Edit: changed code messed up by the spell checker, and corrected 1 error.

pschanely · 2022-11-27T18:50:25Z

Interesting stuff! I've got a bunch of things to talk about here. I'm travelling right at the moment, but will circle back in a day or two!

pschanely · 2022-11-28T19:08:09Z

CrossHair's current implementation makes surprising choices for representation in Z3. In most cases, I started with the "obvious" choice and later ended up moving towards something dumber (trading more execution paths for less work during solving).

To keep maintenance low and quality high, I've largely tried to keep only one kind of implementation for each Python type. But in the ideal world, CrossHair probably would try different modelings for the same type in parallel. I very much want this kind of "representation forking" for floats, as it happens. At any rate, it would be great to see whether some of your work can stand alongside the existing implementations. (and we could also easily run some evaluations - hopefully over time, the SMT solvers get better and we can ditch some of CrossHair's simpler representations)

Okay, let's get into some of the details. Starting simple, let's just talk about regular old strings. I'll try to follow up with additional mini-writeups about sets and lists here soon.

string

Current modeling (source): A variable Z3 integer length, and a lazily created/expanded list of Z3 integers, one per codepoint.
Obvious question: why not use a Z3 string?

In the past, I've run into practical problems with the Python API and unicode characters / null bytes. Hopefully most of these problems don't exist today, though.
Python's full regex capabilities significantly exceed those of Z3's. To support all features, we ended up going with a fairly basic linear search with backtracking, so a smarter representation isn't helping us with regexes today.
Many common operations aren't supported in Z3 - str.lower(), str.isspace(), int(<str>). In some cases, even if Z3 did support them, the semantics wouldn't match Python's semantics. Using integer codepoints lets us use numeric range comparisons to help implement such capabilities.
The unicode database itself depends on the Python version; correspondingly some methods on strings actually vary by Python version.

Now, there are lots of reasons to prefer a native SMT string. A variety of simplistic string scenarios could be handled much more effectively (e.g. just find a string in another string). And handling for complex cases may now be better than I recall (historically, it seemed that the solving can easily get slow; nesting str.substr calls was problematic at one point).

petrusboniatus · 2022-11-29T16:28:40Z

Hello,
I will be a little busy this week. I still plan to have the Set[I] by the end of the week.

Just wanted to comment maybe a maybe stupid idea that is to have extra Interfaces for a subset of the builting api that we know that works well. One example might be a dictionary without Iterate or length that more closely resemble an array in z3. Jax follows a similar 'same api with restrictions' approach and I think it works good. This might come very handy for the user to build his symbolic representation / type system for his classes that are too problematic to be directly mapped by z3.

pschanely · 2022-11-29T20:58:25Z

Just wanted to comment maybe a maybe stupid idea that is to have extra Interfaces for a subset of the builting api that we know that works well. One example might be a dictionary without Iterate or length that more closely resemble an array in z3. Jax follows a similar 'same api with restrictions' approach and I think it works good. This might come very handy for the user to build his symbolic representation / type system for his classes that are too problematic to be directly mapped by z3.

That doesn't sound stupid to me!

petrusboniatus · 2022-12-03T22:14:29Z

Hi again.
I have an implementation of the Set[int] here only trying in for FrozenSets. I've not tested the actual performance yet (I am more confident with the List sequence), but passes the pre-existing unit tests.

In the meantime I have a question: I had to do a little modification to the core to be able to register a proxy for specific type with generics (FrozenSet[int]), as it was the fastest way to start testing this, not sure if it is the proper way because it feels like this dictionary is tracking origin type for some specific reason.

pschanely · 2022-12-05T00:58:17Z

Oh wow, you got quite far on your own. Let me answer your immediate question, and then will try to find a little time to review and say more things later this week.

We look up symbolic factories using the origin of the type rather than the actual type (see here). There isn't an important reason for this, except for that's USUALLY what we want. (one factory for all types of sets, lists, etc) If you want to keep that logic as-is, you can just add your set implementation under new branches in make_set(), which is probably what I'd advise for now.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Re-investigate string and sequence solvers #189

Re-investigate string and sequence solvers #189

pschanely commented Nov 21, 2022

petrusboniatus commented Nov 27, 2022 •

edited

pschanely commented Nov 27, 2022

pschanely commented Nov 28, 2022 •

edited

petrusboniatus commented Nov 29, 2022 •

edited

pschanely commented Nov 29, 2022

petrusboniatus commented Dec 3, 2022

pschanely commented Dec 5, 2022

Re-investigate string and sequence solvers #189

Re-investigate string and sequence solvers #189

Comments

pschanely commented Nov 21, 2022

petrusboniatus commented Nov 27, 2022 • edited

pschanely commented Nov 27, 2022

pschanely commented Nov 28, 2022 • edited

string

petrusboniatus commented Nov 29, 2022 • edited

pschanely commented Nov 29, 2022

petrusboniatus commented Dec 3, 2022

pschanely commented Dec 5, 2022

petrusboniatus commented Nov 27, 2022 •

edited

pschanely commented Nov 28, 2022 •

edited

petrusboniatus commented Nov 29, 2022 •

edited