Faster disjoint/isSubsetOf for Set via unbalanced splitting. #865

Open
wants to merge 1 commit into
base: master

alexfmpe
Contributor

Unlike, say, union/intersection, disjoint doesn't return a new structure. It can skip the re-balancing work because the trees produced by each split are inspected once and then discarded. This allows significant constant-factor speedups.

  member:                    OK (0.14s)
    133  μs ±  12 μs,   0 B  allocated,   0 B  copied, 8.0 MB peak memory,       same as baseline
  insert:                    OK (0.18s)
    678  μs ±  47 μs, 2.6 MB allocated,  55 KB copied, 8.0 MB peak memory,       same as baseline
  map:                       OK (0.22s)
    106  μs ± 6.0 μs, 557 KB allocated,  16 KB copied, 8.0 MB peak memory,       same as baseline
  filter:                    OK (0.17s)
    80.2 μs ± 6.4 μs,  80 KB allocated, 1.2 KB copied, 8.0 MB peak memory,       same as baseline
  partition:                 OK (0.16s)
    148  μs ±  15 μs, 239 KB allocated, 4.0 KB copied, 8.0 MB peak memory,       same as baseline
  fold:                      OK (0.31s)
    34.9 ns ± 1.5 ns, 695 B  allocated,   0 B  copied, 8.0 MB peak memory,       same as baseline
  delete:                    OK (0.12s)
    454  μs ±  44 μs, 1.3 MB allocated, 372 B  copied, 8.0 MB peak memory,       same as baseline
  findMin:                   OK (0.27s)
    15.7 ns ± 784 ps,  31 B  allocated,   0 B  copied, 8.0 MB peak memory,       same as baseline
  findMax:                   OK (0.30s)
    17.1 ns ± 842 ps,  31 B  allocated,   0 B  copied, 8.0 MB peak memory,       same as baseline
  deleteMin:                 OK (0.25s)
    114  ns ± 7.0 ns, 471 B  allocated,   0 B  copied, 8.0 MB peak memory,       same as baseline
  deleteMax:                 OK (0.25s)
    117  ns ± 7.0 ns, 511 B  allocated,   0 B  copied, 8.0 MB peak memory,       same as baseline
  unions:                    OK (0.31s)
    147  μs ± 7.2 μs, 472 KB allocated,  10 KB copied, 8.0 MB peak memory,       same as baseline
  union:                     OK (0.16s)
    148  μs ±  11 μs, 467 KB allocated,  10 KB copied, 8.0 MB peak memory,       same as baseline
  difference:                OK (0.21s)
    102  μs ± 7.2 μs, 239 KB allocated, 2.4 KB copied, 8.0 MB peak memory,       same as baseline
  intersection:              OK (0.20s)
    49.8 μs ± 3.5 μs,  80 KB allocated, 835 B  copied, 8.0 MB peak memory,       same as baseline
  fromList:                  OK (0.28s)
    67.4 μs ± 4.2 μs, 271 KB allocated, 6.9 KB copied, 8.0 MB peak memory,       same as baseline
  fromList-desc:             OK (0.30s)
    600  μs ±  25 μs, 2.6 MB allocated,  56 KB copied, 8.0 MB peak memory,       same as baseline
  fromAscList:               OK (0.27s)
    127  μs ± 8.9 μs, 414 KB allocated, 8.3 KB copied, 8.0 MB peak memory,       same as baseline
  fromDistinctAscList:       OK (0.21s)
    50.2 μs ± 2.9 μs, 159 KB allocated, 3.2 KB copied, 8.0 MB peak memory,       same as baseline
  disjoint:false:            OK (0.22s)
    24.9 ns ± 1.7 ns,  31 B  allocated,   0 B  copied, 8.0 MB peak memory,       same as baseline
  disjoint:true:             OK (0.12s)
    111  μs ±  10 μs, 153 KB allocated,  61 B  copied, 8.0 MB peak memory, 23% less than baseline
  isSubsetOf:true:           OK (0.16s)
    18.6 μs ± 1.6 μs,   0 B  allocated,   0 B  copied, 8.0 MB peak memory,       same as baseline
  isSubsetOf:false:          OK (0.43s)
    100  ns ± 4.6 ns, 463 B  allocated,   0 B  copied, 8.0 MB peak memory, 63% less than baseline
  null.intersection:false:   OK (0.21s)
    49.5 μs ± 3.6 μs,  80 KB allocated, 834 B  copied, 8.0 MB peak memory,       same as baseline
  null.intersection:true:    OK (0.15s)
    154  μs ±  13 μs, 293 KB allocated, 174 B  copied, 8.0 MB peak memory,       same as baseline
  alterF:member:             OK (0.21s)
    801  μs ±  48 μs, 2.4 MB allocated, 402 B  copied, 8.0 MB peak memory,       same as baseline
  alterF:insert:             OK (0.23s)
    864  μs ±  52 μs, 3.6 MB allocated, 106 KB copied, 8.0 MB peak memory,       same as baseline
  alterF:delete:             OK (0.14s)
    536  μs ±  46 μs, 1.9 MB allocated, 432 B  copied, 8.0 MB peak memory,       same as baseline
  alterF:four:               OK (0.20s)
    813  μs ±  74 μs, 2.4 MB allocated, 388 B  copied, 8.0 MB peak memory,       same as baseline
  alterF:four:strings:       OK (0.22s)
    1.68 ms ± 120 μs, 2.5 MB allocated, 413 B  copied, 8.0 MB peak memory,       same as baseline
  alterF_naive:four:         OK (0.17s)
    1.39 ms ±  88 μs, 2.0 MB allocated, 327 B  copied, 8.0 MB peak memory,       same as baseline
  alterF_naive:four:strings: OK (0.27s)
    4.16 ms ± 351 μs, 2.1 MB allocated, 397 B  copied, 8.0 MB peak memory,       same as baseline

All 32 tests passed (7.14s)
Benchmark set-benchmarks: FINISH
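
As a rough illustration of the idea described above (not the code from this PR), the following sketch splits with a plain size-preserving constructor instead of the balancing link. The names binU, splitMemberUnbalanced and disjointU are hypothetical, and the sketch assumes the Set constructors exported by Data.Set.Internal.

```haskell
import Data.Set.Internal (Set (..))
import qualified Data.Set as Set

-- Hypothetical helper: stores the correct size but performs no rotations,
-- so the resulting tree may violate the balance invariant.
binU :: a -> Set a -> Set a -> Set a
binU x l r = Bin (Set.size l + Set.size r + 1) x l r

-- Split around a pivot, rebuilding both sides with 'binU' rather than a
-- balancing 'link'. The pieces are only meant to be inspected, never passed
-- to operations that rebalance.
splitMemberUnbalanced :: Ord a => a -> Set a -> (Set a, Bool, Set a)
splitMemberUnbalanced _ Tip = (Tip, False, Tip)
splitMemberUnbalanced x (Bin _ y l r) = case compare x y of
  LT -> let (lt, found, gt) = splitMemberUnbalanced x l in (lt, found, binU y gt r)
  GT -> let (lt, found, gt) = splitMemberUnbalanced x r in (binU y l lt, found, gt)
  EQ -> (l, True, r)

-- Divide and conquer on the first set, splitting the second at each root.
-- Only a Bool is returned, so the unbalanced pieces never escape.
disjointU :: Ord a => Set a -> Set a -> Bool
disjointU Tip _ = True
disjointU _ Tip = True
disjointU (Bin _ x l r) t =
  not found && disjointU l lt && disjointU r gt
  where
    (lt, found, gt) = splitMemberUnbalanced x t
```

Each piece is built along a single search path, so it is never deeper than the tree it was split from, but it can be deep relative to its own size. That is the property the time-bound discussion below is concerned with.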

@alexfmpe
Contributor Author

The same should be applicable to Map, but there disjoint and isSubmapOf use different splits: splitMember and splitLookup, respectively. Should I duplicate both to replace link with bin?
FWIW, replacing splitMember with splitMember k0 m = let (a,b,c) = splitLookup k0 m in (a, isJust b, c) seems to cause no change in the benchmarks, but I don't know how fragile the optimizations involved are.
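
For reference, the derived definition mentioned above can be written against the public Data.Map API as follows. The name splitMemberViaLookup is made up for this sketch, and it uses the ordinary balanced Map.splitLookup, not an unbalanced variant.

```haskell
import qualified Data.Map as Map
import Data.Maybe (isJust)

-- splitMember expressed through splitLookup: the looked-up value is
-- discarded and only its presence is reported.
splitMemberViaLookup :: Ord k => k -> Map.Map k v -> (Map.Map k v, Bool, Map.Map k v)
splitMemberViaLookup k0 m =
  let (a, b, c) = Map.splitLookup k0 m
   in (a, isJust b, c)
```

Whether GHC optimizes this to the same code as a hand-written splitMember is exactly the fragility question raised in the comment; the sketch only shows that the two agree semantically.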

@treeowl
Contributor

treeowl commented Oct 29, 2022

I'm going to need to see a proof that the stated time bounds still hold. Or if they don't, some reasonably tight new bound. We need to be sure that this doesn't introduce some nasty cases that our benchmarks don't happen to catch.

-- Same as 'splitMember' but skips re-balancing by using 'bin' instead of 'link'.
-- Attempting to build new trees out of these will error when re-balancing but
-- this can improve performance when the resulting trees are disposable.
splitMemberUnbalanced :: Ord a => a -> Set a -> (Set a,Bool,Set a)
Contributor Author

I tried to factor out bin/link to avoid code duplication, but that made the functions involved 2-3x slower.

@treeowl
Contributor

treeowl commented Oct 29, 2022

Did you see my question? The bounds we have for these operations lean on the highly nontrivial published proofs for bounds on intersection and difference. My concern is that by allowing one set of trees to become unbalanced (hence potentially deep for their size), you could break those proofs (and bounds).

@alexfmpe
Contributor Author

> Did you see my question?

My bad, hadn't noticed you had already replied by the time I finished commenting.

> My concern is that by allowing one set of trees to become unbalanced (hence potentially deep for their size), you could break those proofs (and bounds).

Right. The idea here is that the balancing link performs has a cost of its own, since it needs to navigate the unbalanced bits to balance them, but we don't get to amortize that cost because we only do one operation on the tree and then throw it away. It's a bit like how linearly scanning an array is faster than building a binomial queue in O(n) and then doing a single lookup.
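
The analogy can be made concrete with a small sketch. It swaps in Data.Set for the binomial queue mentioned above, and the function names are invented for illustration.

```haskell
import qualified Data.Set as Set

-- One-off query answered with a single linear pass; no structure is built.
memberScan :: Eq a => a -> [a] -> Bool
memberScan = elem

-- Builds a balanced set first and then performs one lookup. The cost of
-- building the set is never amortized over further queries.
memberViaSet :: Ord a => a -> [a] -> Bool
memberViaSet x xs = Set.member x (Set.fromList xs)
```

Both functions return the same answers; the second only pays off when the set is queried repeatedly, which is the situation the balanced split is designed for and the one disjoint is not in.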

That said, I don't currently have a proof, only a gut feeling backed by no more than a specific benchmark. I'll peek at the published proofs.

@meooow25
Contributor

meooow25 commented Jan 7, 2023

I think this is interesting, and it can be applied to union/intersection/difference and others too. I implemented it and can see some improvements in set-operations-set:

Results
  union-block_nn:              OK (0.39s)
    419  μs ±  25 μs, 32% less than baseline
  union-block_nn_swap:         OK (0.39s)
    410  μs ±  32 μs, 35% less than baseline
  union-block_ns:              OK (0.42s)
    38.5 μs ± 3.0 μs, 44% less than baseline
  union-block_sn_swap:         OK (0.47s)
    48.0 μs ± 3.5 μs, 35% less than baseline
  union-common_nn:             OK (0.77s)
    512  μs ±  34 μs,       same as baseline
  union-common_nn_swap:        OK (0.84s)
    1.23 ms ± 113 μs, 13% more than baseline
  union-common_ns:             OK (0.42s)
    404  μs ±  28 μs, 43% less than baseline
  union-common_nt:             OK (0.55s)
    34.7 μs ± 1.8 μs, 27% less than baseline
  union-common_sn_swap:        OK (0.32s)
    1.49 ms ± 146 μs,       same as baseline
  union-common_tn_swap:        OK (0.42s)
    95.5 μs ± 5.5 μs, 14% more than baseline
  union-disj_nn:               OK (0.53s)
    2.97 μs ± 174 ns, 36% less than baseline
  union-disj_nn_swap:          OK (0.54s)
    2.82 μs ± 197 ns, 43% less than baseline
  union-disj_ns:               OK (0.49s)
    2.11 μs ± 177 ns, 40% less than baseline
  union-disj_nt:               OK (0.54s)
    1.23 μs ±  89 ns, 44% less than baseline
  union-disj_sn_swap:          OK (0.51s)
    2.41 μs ± 195 ns, 38% less than baseline
  union-disj_tn_swap:          OK (0.47s)
    1.72 μs ± 166 ns, 32% less than baseline
  union-mix_nn:                OK (1.16s)
    16.7 ms ± 624 μs,  6% less than baseline
  union-mix_nn_swap:           OK (0.58s)
    16.6 ms ± 565 μs,       same as baseline
  union-mix_ns:                OK (0.48s)
    1.18 ms ±  42 μs, 31% less than baseline
  union-mix_nt:                OK (0.36s)
    66.1 μs ± 5.6 μs, 16% less than baseline
  union-mix_sn_swap:           OK (0.46s)
    2.25 ms ± 111 μs, 16% more than baseline
  union-mix_tn_swap:           OK (0.46s)
    97.9 μs ± 7.5 μs, 14% more than baseline
  difference-block_nn:         OK (0.40s)
    191  μs ±  16 μs, 56% less than baseline
  difference-block_nn_swap:    OK (0.42s)
    188  μs ±  11 μs, 57% less than baseline
  difference-block_ns:         OK (0.45s)
    18.4 μs ± 1.5 μs, 64% less than baseline
  difference-block_sn_swap:    OK (0.42s)
    17.7 μs ± 1.5 μs, 65% less than baseline
  difference-common_nn:        OK (0.53s)
    3.03 ms ± 189 μs, 14% less than baseline
  difference-common_nn_swap:   OK (0.35s)
    577  μs ±  44 μs, 17% less than baseline
  difference-common_ns:        OK (0.31s)
    1.33 ms ± 104 μs, 46% less than baseline
  difference-common_nt:        OK (0.44s)
    92.2 μs ± 7.3 μs, 29% less than baseline
  difference-common_sn_swap:   OK (0.42s)
    453  μs ±  25 μs, 55% less than baseline
  difference-common_tn_swap:   OK (0.45s)
    43.4 μs ± 3.4 μs, 49% less than baseline
  difference-disj_nn:          OK (0.56s)
    1.53 μs ±  90 ns, 56% less than baseline
  difference-disj_nn_swap:     OK (0.56s)
    1.55 μs ±  85 ns, 47% less than baseline
  difference-disj_ns:          OK (0.51s)
    1.19 μs ±  84 ns, 55% less than baseline
  difference-disj_nt:          OK (0.58s)
    772  ns ±  49 ns, 54% less than baseline
  difference-disj_sn_swap:     OK (0.51s)
    1.19 μs ±  87 ns, 50% less than baseline
  difference-disj_tn_swap:     OK (0.56s)
    736  ns ±  42 ns, 55% less than baseline
  difference-mix_nn:           OK (0.31s)
    3.11 ms ± 213 μs, 49% less than baseline
  difference-mix_nn_swap:      OK (0.58s)
    3.23 ms ± 147 μs, 47% less than baseline
  difference-mix_ns:           OK (0.37s)
    833  μs ±  55 μs, 40% less than baseline
  difference-mix_nt:           OK (0.36s)
    68.5 μs ± 6.1 μs, 28% less than baseline
  difference-mix_sn_swap:      OK (0.33s)
    562  μs ±  47 μs, 61% less than baseline
  difference-mix_tn_swap:      OK (0.45s)
    50.1 μs ± 3.1 μs, 44% less than baseline
  intersection-block_nn:       OK (0.40s)
    191  μs ±  16 μs, 66% less than baseline
  intersection-block_nn_swap:  OK (0.42s)
    189  μs ±  13 μs, 66% less than baseline
  intersection-block_ns:       OK (0.44s)
    18.4 μs ± 1.5 μs, 73% less than baseline
  intersection-block_sn_swap:  OK (0.41s)
    17.8 μs ± 1.5 μs, 74% less than baseline
  intersection-common_nn:      OK (0.27s)
    1.06 ms ±  90 μs, 32% less than baseline
  intersection-common_nn_swap: OK (0.20s)
    545  μs ±  42 μs, 33% less than baseline
  intersection-common_ns:      OK (0.26s)
    975  μs ±  87 μs, 46% less than baseline
  intersection-common_nt:      OK (0.38s)
    77.0 μs ± 6.2 μs, 40% less than baseline
  intersection-common_sn_swap: OK (0.43s)
    430  μs ±  22 μs, 67% less than baseline
  intersection-common_tn_swap: OK (0.45s)
    43.3 μs ± 3.6 μs, 61% less than baseline
  intersection-disj_nn:        OK (0.58s)
    1.54 μs ±  85 ns, 62% less than baseline
  intersection-disj_nn_swap:   OK (0.55s)
    1.55 μs ±  92 ns, 65% less than baseline
  intersection-disj_ns:        OK (0.51s)
    1.20 μs ±  85 ns, 64% less than baseline
  intersection-disj_nt:        OK (0.58s)
    780  ns ±  47 ns, 66% less than baseline
  intersection-disj_sn_swap:   OK (0.51s)
    1.19 μs ±  84 ns, 65% less than baseline
  intersection-disj_tn_swap:   OK (0.57s)
    750  ns ±  47 ns, 65% less than baseline
  intersection-mix_nn:         OK (0.54s)
    3.16 ms ± 183 μs, 60% less than baseline
  intersection-mix_nn_swap:    OK (0.31s)
    3.06 ms ± 300 μs, 62% less than baseline
  intersection-mix_ns:         OK (0.40s)
    839  μs ±  54 μs, 58% less than baseline
  intersection-mix_nt:         OK (0.51s)
    65.5 μs ± 3.1 μs, 47% less than baseline
  intersection-mix_sn_swap:    OK (0.31s)
    585  μs ±  48 μs, 65% less than baseline
  intersection-mix_tn_swap:    OK (0.48s)
    56.1 μs ± 3.0 μs, 53% less than baseline

union and intersection are changed only in that they use the unbalanced split. For difference I had to make a larger change, so it is not a good direct comparison. There are also a handful of regressions in union; I'm not sure why.

Anyway, this seems useful, so I'll also try to understand the proofs and see if they still apply with this change.

@treeowl
Contributor

treeowl commented Jan 7, 2023

One option to consider is to switch to an unbalanced split (or the "hedge" algorithms we used to use) when the sets/maps get small enough (below some fixed size). That will avoid breaking big O while getting a lot of the performance benefits in the cases where it's good.
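
One way to read this suggestion, sketched here for disjoint only and under stated assumptions: use the library's balanced Set.splitMember while the set being split is large, and an unbalanced split once it has at most smallCutoff elements. The cutoff value and the names smallCutoff, binU, splitMemberUnbalanced and disjointHybrid are all hypothetical.

```haskell
import Data.Set.Internal (Set (..))
import qualified Data.Set as Set

-- Hypothetical cutoff; a real value would have to come from benchmarks.
smallCutoff :: Int
smallCutoff = 16

-- Size-preserving constructor that does no rebalancing.
binU :: a -> Set a -> Set a -> Set a
binU x l r = Bin (Set.size l + Set.size r + 1) x l r

splitMemberUnbalanced :: Ord a => a -> Set a -> (Set a, Bool, Set a)
splitMemberUnbalanced _ Tip = (Tip, False, Tip)
splitMemberUnbalanced x (Bin _ y l r) = case compare x y of
  LT -> let (lt, found, gt) = splitMemberUnbalanced x l in (lt, found, binU y gt r)
  GT -> let (lt, found, gt) = splitMemberUnbalanced x r in (binU y l lt, found, gt)
  EQ -> (l, True, r)

disjointHybrid :: Ord a => Set a -> Set a -> Bool
disjointHybrid Tip _ = True
disjointHybrid _ Tip = True
disjointHybrid (Bin _ x l r) t =
  not found && disjointHybrid l lt && disjointHybrid r gt
  where
    -- Pieces of a small set are themselves small, so once the unbalanced
    -- branch is taken, no unbalanced tree reaches Set.splitMember.
    (lt, found, gt) =
      if Set.size t <= smallCutoff
        then splitMemberUnbalanced x t
        else Set.splitMember x t
```

Because the unbalanced pieces come from a tree of bounded size, their depth is bounded by a constant, which is why a fixed cutoff should not affect the asymptotic bounds.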

@alexfmpe
Contributor Author

alexfmpe commented Jan 9, 2023

I'm surprised it's not always worse for operations that return sets, since they must return balanced sets in the end to preserve invariants, no? Are you doing a single call to balance at the very end?

I haven't been in the headspace to look at this in a while, but one thing I'd been meaning to do is try to make this allocation-free. It sounds plausible to me, since without re-balancing the triple that's returned is immediately consumed. I had tried to do this via CPS, but having functions as arguments seemed to kill performance, and I'm always a bit lost when trying to reason about the Core that comes out.
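
For readers unfamiliar with the CPS approach mentioned here, a minimal sketch of what such a split could look like is given below. It is not the code that was tried; splitMemberK, disjointK and binU are hypothetical names, and the sketch makes no claim of being allocation-free, since it still allocates closures and the rebuilt nodes.

```haskell
import Data.Set.Internal (Set (..))
import qualified Data.Set as Set

-- Size-preserving constructor that does no rebalancing.
binU :: a -> Set a -> Set a -> Set a
binU x l r = Bin (Set.size l + Set.size r + 1) x l r

-- Unbalanced split in continuation-passing style: the three results are
-- handed to the continuation instead of being returned in a tuple.
splitMemberK :: Ord a => a -> Set a -> (Set a -> Bool -> Set a -> r) -> r
splitMemberK _ Tip k = k Tip False Tip
splitMemberK x (Bin _ y l r) k = case compare x y of
  LT -> splitMemberK x l (\lt found gt -> k lt found (binU y gt r))
  GT -> splitMemberK x r (\lt found gt -> k (binU y l lt) found gt)
  EQ -> k l True r

disjointK :: Ord a => Set a -> Set a -> Bool
disjointK Tip _ = True
disjointK _ Tip = True
disjointK (Bin _ x l r) t =
  splitMemberK x t $ \lt found gt ->
    not found && disjointK l lt && disjointK r gt
```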

@treeowl
Contributor

treeowl commented Jan 9, 2023

Not all the reconstructed pieces end up getting incorporated. For intersection, none of them do. (I wish that were true of difference as well, but we follow the exact algorithm proved optimal; I don't know how to adapt the proof to work with difference going the other way.)

@meooow25
Contributor

> Are you doing a single call to balance at the very end?

Only for union, where I had to. For difference I changed the algorithm so that we split t2 instead.
To avoid confusion, here are the modified versions I tested.

-- A possibly unbalanced set.
-- Invariant: A Bin with non-zero size is balanced.
--            To construct an unbalanced set: Unbalanced (Bin 0 x l r)
newtype Unbalanced a = Unbalanced (Set a)

fromUnbalanced :: Unbalanced a -> Set a
fromUnbalanced (Unbalanced s0) = go s0
  where
    go (Bin 0 x l r) = link x (go l) (go r)
    go s = s

splitSUnbalanced :: Ord a => a -> Unbalanced a -> StrictPair (Unbalanced a) (Unbalanced a)
splitMemberUnbalanced :: Ord a => a -> Unbalanced a -> (Unbalanced a,Bool,Unbalanced a)

union :: Ord a => Set a -> Set a -> Set a
union t10 t20 = go t10 (Unbalanced t20)
  where
    go t1 (Unbalanced Tip) = t1
    go t1 (Unbalanced (Bin _ x Tip Tip)) = insertR x t1
    go (Bin 1 x _ _) t2 = insert x (fromUnbalanced t2)
    go Tip t2 = fromUnbalanced t2
    go t1@(Bin _ x l1 r1) t2 = case splitSUnbalanced x t2 of
      (l2 :*: r2)
        | l1l2 `ptrEq` l1 && r1r2 `ptrEq` r1 -> t1
        | otherwise -> link x l1l2 r1r2
        where !l1l2 = go l1 l2
              !r1r2 = go r1 r2

difference :: Ord a => Set a -> Set a -> Set a
difference t10 t20 = go t10 (Unbalanced t20)
  where
    go Tip _ = Tip
    go t1 (Unbalanced Tip) = t1
    go t1@(Bin _ x l1 r1) t2 = case splitMemberUnbalanced x t2 of
      (l2,b,r2)
        | b -> merge l1l2 r1r2
        | l1l2 `ptrEq` l1 && r1r2 `ptrEq` r1 -> t1
        | otherwise -> link x l1l2 r1r2
        where !l1l2 = go l1 l2
              !r1r2 = go r1 r2

intersection :: Ord a => Set a -> Set a -> Set a
intersection t10 t20 = go t10 (Unbalanced t20)
  where
    go Tip _ = Tip
    go _ (Unbalanced Tip) = Tip
    go t1@(Bin _ x l1 r1) t2
      | b = if l1l2 `ptrEq` l1 && r1r2 `ptrEq` r1
            then t1
            else link x l1l2 r1r2
      | otherwise = merge l1l2 r1r2
      where
        !(l2, b, r2) = splitMemberUnbalanced x t2
        !l1l2 = go l1 l2
        !r1r2 = go r1 r2
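
The comment above gives only the type of splitMemberUnbalanced. The sketch below is one possible body that is consistent with the stated Bin 0 marker convention; it is a guess for illustration, not the code that was benchmarked. It assumes that Data.Set.Internal exports link, and it repeats fromUnbalanced so that the block is self-contained.

```haskell
import Data.Set.Internal (Set (..), link)
import qualified Data.Set as Set

-- A possibly unbalanced set; nodes rebuilt by the split carry size 0.
newtype Unbalanced a = Unbalanced (Set a)

-- Rebalance by re-linking every marked node, bottom-up.
fromUnbalanced :: Unbalanced a -> Set a
fromUnbalanced (Unbalanced s0) = go s0
  where
    go (Bin 0 x l r) = link x (go l) (go r)
    go s = s

-- Assumed implementation: nodes on the search path are rebuilt with the
-- size-0 marker, while untouched subtrees are shared as they are.
splitMemberUnbalanced :: Ord a => a -> Unbalanced a -> (Unbalanced a, Bool, Unbalanced a)
splitMemberUnbalanced x (Unbalanced s0) =
  case go s0 of
    (lt, found, gt) -> (Unbalanced lt, found, Unbalanced gt)
  where
    go Tip = (Tip, False, Tip)
    go (Bin _ y l r) = case compare x y of
      LT -> let (lt, found, gt) = go l in (lt, found, Bin 0 y gt r)
      GT -> let (lt, found, gt) = go r in (Bin 0 y l lt, found, gt)
      EQ -> (l, True, r)
```

With this body, only nodes created by the split are marked, so a Bin with non-zero size is always an untouched, balanced subtree, which matches the invariant stated for Unbalanced.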


3 participants