Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make Set/Map balance calls faster #1053

Open
meooow25 opened this issue Oct 12, 2024 · 8 comments · May be fixed by #1056
Open

Make Set/Map balance calls faster #1053

meooow25 opened this issue Oct 12, 2024 · 8 comments · May be fixed by #1056

Comments

@meooow25
Copy link
Contributor

meooow25 commented Oct 12, 2024

I've been looking at #980 which improves insertion times by ~22% in the no-rebalance case. This made me wonder: why is rebalancing slow?

Let's look at Set because it is simpler. Map is similar.
From the code, we can see that balanceL and balanceR are marked as
NOINLINE.

balanceR :: a -> Set a -> Set a -> Set a
balanceR x l r = case l of
Tip -> case r of
Tip -> Bin 1 x Tip Tip
(Bin _ _ Tip Tip) -> Bin 2 x Tip r
(Bin _ rx Tip rr@(Bin _ _ _ _)) -> Bin 3 rx (Bin 1 x Tip Tip) rr
(Bin _ rx (Bin _ rlx _ _) Tip) -> Bin 3 rlx (Bin 1 x Tip Tip) (Bin 1 rx Tip Tip)
(Bin rs rx rl@(Bin rls rlx rll rlr) rr@(Bin rrs _ _ _))
| rls < ratio*rrs -> Bin (1+rs) rx (Bin (1+rls) x Tip rl) rr
| otherwise -> Bin (1+rs) rlx (Bin (1+size rll) x Tip rll) (Bin (1+rrs+size rlr) rx rlr rr)
(Bin ls _ _ _) -> case r of
Tip -> Bin (1+ls) x l Tip
(Bin rs rx rl rr)
| rs > delta*ls -> case (rl, rr) of
(Bin rls rlx rll rlr, Bin rrs _ _ _)
| rls < ratio*rrs -> Bin (1+ls+rs) rx (Bin (1+ls+rls) x l rl) rr
| otherwise -> Bin (1+ls+rs) rlx (Bin (1+ls+size rll) x l rll) (Bin (1+rrs+size rlr) rx rlr rr)
(_, _) -> error "Failure in Data.Set.balanceR"
| otherwise -> Bin (1+ls+rs) x l r
{-# NOINLINE balanceR #-}

The NOINLINEs were added in b96ffeb citing code bloat issues. I think that's reasonable, balanceL and balanceR are moderately large functions. But let's try inlining them to see what happens.

We'll use the existing insert benchmark, which is pretty good for this purpose. It repeatedly inserts a maximum element into the set. This is close enough to typical usage (repeated insertion) but also a bad case (because it always makes one branch heavier). Benchmarks use GHC 9.6.6.

Current
  insert: OK
    684  μs ±  33 μs, 2.6 MB allocated,  55 KB copied, 7.0 MB peak memory

INLINE balanceR
  insert: OK
    463  μs ±  20 μs, 2.6 MB allocated,  56 KB copied, 7.0 MB peak memory, 32% less than baseline

That's a significant improvement, but it might come at the cost of code bloat.

But, there is a compromise:
Most calls to balanceR will be on already balanced trees. Performing rebalancing is the rare case. Using some unsafe IO, I gathered some counts to show this:

L.foldl' (flip insert) empty [1..100]
-- 1 to 50
-- rebalances
0 0 1 0 1 1 1 0 1 1 1 0 1 2 1 0 1 1 1 0 1 2 1 0 1 1 1 0 1 3 1 0 1 1 1 0 1 2 1 0 1 1 1 0 1 3 1 0 1 1
0 1 2 2 3 3 3 3 4 4 4 4 5 5 4 4 5 5 5 5 6 6 5 5 6 6 6 6 7 7 5 5 6 6 6 6 7 7 6 6 7 7 7 7 8 8 6 6 7 7
-- calls to balanceR

-- 51 to 100
-- rebalances
1 0 1 2 1 0 1 1 1 0 1 4 1 0 1 1 1 0 1 2 1 0 1 1 1 0 1 3 1 0 1 1 1 0 1 2 1 0 1 1 1 0  1  4 1 0 1 1 1 0
7 7 8 8 7 7 8 8 8 8 9 9 6 6 7 7 7 7 8 8 7 7 8 8 8 8 9 9 7 7 8 8 8 8 9 9 8 8 9 9 9 9 10 10 7 7 8 8 8 8
-- calls to balanceR

Putting it another way, the number of calls to balanceR is O(n log n), but the number of rebalancings is O(n) (amortized O(1) per insert).

Which brings me to the suggestion: only inline the fast no-rebalance branch of balanceR. And to make that smaller, only the no-rebalance Bin-Bin branch.

--- a/containers/src/Data/Set/Internal.hs
+++ b/containers/src/Data/Set/Internal.hs
@@ -1881,7 +1881,14 @@ balanceL x l r = case r of
 -- balanceR is called when right subtree might have been inserted to or when
 -- left subtree might have been deleted from.
 balanceR :: a -> Set a -> Set a -> Set a
-balanceR x l r = case l of
+balanceR x l r = case (l, r) of
+  (Bin ls _ _ _, Bin rs _ _ _)
+    | rs <= delta*ls -> Bin (1+ls+rs) x l r
+  _ -> balanceR' x l r
+{-# INLINE balanceR #-}
+
+balanceR' :: a -> Set a -> Set a -> Set a
+balanceR' x l r = case l of
   Tip -> case r of
            Tip -> Bin 1 x Tip Tip
            (Bin _ _ Tip Tip) -> Bin 2 x Tip r
@@ -1894,14 +1901,12 @@ balanceR x l r = case l of
   (Bin ls _ _ _) -> case r of
            Tip -> Bin (1+ls) x l Tip

-           (Bin rs rx rl rr)
-              | rs > delta*ls  -> case (rl, rr) of
+           (Bin rs rx rl rr) -> case (rl, rr) of
                    (Bin rls rlx rll rlr, Bin rrs _ _ _)
                      | rls < ratio*rrs -> Bin (1+ls+rs) rx (Bin (1+ls+rls) x l rl) rr
                      | otherwise -> Bin (1+ls+rs) rlx (Bin (1+ls+size rll) x l rll) (Bin (1+rrs+size rlr) rx rlr rr)
                    (_, _) -> error "Failure in Data.Set.balanceR"
-              | otherwise -> Bin (1+ls+rs) x l r
-{-# NOINLINE balanceR #-}
+{-# NOINLINE balanceR' #-}

With this, we get

INLINE Bin-Bin no-rebalance
  insert: OK
    500  μs ±  30 μs, 2.6 MB allocated,  55 KB copied, 7.0 MB peak memory, 26% less than baseline

TL;DR:

To make balanceL/R calls faster, we can

  1. Inline them. This might cause some code bloat. How bad would this be, I don't know. This improves insert by 31%.
  2. Or, inline the "fast path" of Bin-Bin no-rebalance. A small increase in code size for a large share of the improvement above (26%).

Edit: I should mention that I expect many Set/Map functions will benefit from this, but at the moment I have tested out the idea only on insert.

@treeowl
Copy link
Contributor

treeowl commented Oct 12, 2024

Interesting, and clever. I think you're right that slowing down the slow path is generally okay because it's relatively rare. How does this affect sets with one or two elements?

@meooow25
Copy link
Contributor Author

How does this affect sets with one or two elements?

insDelTiny :: Int -> S.Set Int
insDelTiny n = foldl' f (S.fromList [0,1]) (take n ops)
  where
    f s (Add x) = S.insert x s
    f s (Del x) = S.delete x s
    ops = cycle [Del 0, Add 0, Del 1, Add 1]

data Op = Add !Int | Del !Int
Current
  insDelTiny: OK
    45.1 μs ± 1.6 μs, 191 KB allocated,   9 B  copied, 7.0 MB peak memory

INLINE balanceL and balanceR
  insDelTiny: OK
    35.3 μs ± 1.4 μs, 191 KB allocated,   9 B  copied, 7.0 MB peak memory, 22% less than baseline

INLINE Bin-Bin no-rebalance of balanceL and balanceR
  insDelTiny: OK
    46.9 μs ± 1.4 μs, 191 KB allocated,   9 B  copied, 7.0 MB peak memory,       same as baseline

@treeowl
Copy link
Contributor

treeowl commented Oct 13, 2024

Looks pretty darn good then.

@meooow25
Copy link
Contributor Author

Do you mean inlining the whole thing? We're not getting improvements on the small sets if we only inline the Bin-Bin branch.

@treeowl
Copy link
Contributor

treeowl commented Oct 13, 2024

We're not slowing them down, so I think we're good to do the partial inlining. We'd need more code size stats to justify inlining everything, I imagine.

@meooow25
Copy link
Contributor Author

We're not slowing them down, so I think we're good to do the partial inlining.

Awesome, I'll benchmark and see what else gets affected.

We'd need more code size stats to justify inlining everything, I imagine.

How would one gather stats like that to make the decision?

@treeowl
Copy link
Contributor

treeowl commented Oct 13, 2024

Very good question. Maybe that was a bad way to put it.... The most important effect of code size is on caching, so benchmarking more complex algorithms that use sets should give us useful info.

@meooow25
Copy link
Contributor Author

I see, we should try adding such algorithms at some point then. Right now we don't really have complex benchmarks.

@meooow25 meooow25 linked a pull request Oct 26, 2024 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants