
Issue #455: Next in line Rollup (rebased) #514

Open
wants to merge 4 commits into main

Conversation

cugni
Member

@cugni cugni commented Dec 11, 2024

Fixes #454.

As the code base has changed considerably in recent months and rebasing PR #455 has proved difficult, I'm opening a new PR.

@Jiaweihu08 if you want to work on this issue, you should work on this PR.

@Jiaweihu08 Jiaweihu08 marked this pull request as ready for review December 13, 2024 15:21
@Jiaweihu08 Jiaweihu08 requested a review from osopardo1 December 13, 2024 16:42
@Jiaweihu08 Jiaweihu08 mentioned this pull request Dec 13, 2024
@Jiaweihu08 Jiaweihu08 changed the title from "Rebasing the changes form #455: Next in line Rollup" to "Issue #455: Next in line Rollup (rebased)" Dec 13, 2024
@Jiaweihu08
Member

Here is a simple illustration of how the two rollup implementations differ.

The existing implementation maps a small cube to its parent cube, while the one proposed here first checks the next sibling cube.

If the number of indexing columns, d, is three, each inner cube has 2^d = 8 possible child cubes.

The existing implementation can therefore theoretically put up to 8 * (cubeSize - 1) + cubeSize records in a single file: 8 almost-full child cubes, each with (cubeSize - 1) records, all rolled up to their full parent cube.

Conversely, this number is reduced to 2 * (cubeSize - 1) + cubeSize for the new implementation: a cube c is merged with at most one sibling cube from the left and one child cube.
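
As a quick sanity check with the cubeSize of 10000 used in the example below, the two bounds work out to:

8 * (10000 - 1) + 10000 = 89,992 records per file (existing rollup)
2 * (10000 - 1) + 10000 = 29,998 records per file (next-in-line)

which is consistent with all 85,000 records of the example fitting in a single file under the existing rollup.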

The example below shows how the old version merges all child cubes with their parent, while the new implementation first checks the sibling and produces more balanced file sizes (element counts).

// Indexing three columns
import org.apache.spark.sql.functions.rand

val path = "/tmp/qbeast-rollup-example" // any writable location
val df = (spark
	.range(85000)
	.toDF("id_1")
	.withColumn("id_2", rand())
	.withColumn("id_3", rand())
)
(df
	.write
	.mode("overwrite")
	.format("qbeast")
	.option("columnsToIndex", "id_1,id_2,id_3")
	.option("cubeSize", "10000")
	.save(path)
)

Old Rollup

Path   | Block Element Counts                                    | Element Count | Num Blocks
file_1 | [9278, 9371, 9483, 9415, 9370, 9300, 9343, 10112, 9328] | 85000         | 9

New Rollup

Path   | Block Element Counts | Element Count | Num Blocks
file_1 | [9365, 9449]         | 18814         | 2
file_2 | [9305, 9416]         | 18721         | 2
file_3 | [9324, 9342]         | 18666         | 2
file_4 | [9409, 9242]         | 18651         | 2
file_5 | [10148]              | 10148         | 1
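
For reference (not part of the PR, just standard Spark), a rough way to reproduce the per-file element counts above, though not the per-block ones, assuming the table was written to path as in the snippet above:

import org.apache.spark.sql.functions.input_file_name

(spark.read
	.format("qbeast")
	.load(path)
	.groupBy(input_file_name().as("file"))
	.count()
	.show(truncate = false)
)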

@Jiaweihu08
Member

Jiaweihu08 commented Jan 21, 2025

File Size Upper Bound (in terms of element count)

Implementation         | Element Count Upper Bound
Current Implementation | (2^d + 1) * cubeSize
Next-in-line           | 3 * cubeSize

Note: d is the number of indexing dimensions.

The upper bound for the existing implementation increases exponentially with the number of indexing dimensions, while the proposed solution has a fixed upper bound.

An important thing to remember is that while Rollup imposes strict conditions on when to map a cube to a receiving cube, the cube element counts are approximations during writes. The final element count of the files will depend on the quality of the cube domain size and max weight estimations, so do expect some files to exceed the values presented here. However, during optimization, we compute element counts with 100% certainty.

Rollup Basics

Both implementations iterate from the bottom up and check the size of the current cube to determine if it should be mapped to a receiver cube; a cube c is mapped to a receiver cube if and only if its (accumulated) element count is < cubeSize. The mapping is done regardless of the element count of the receiver cube, and the only difference between the two implementations is how the receiver cube is selected.
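
As a rough sketch of this shared loop (hypothetical types and names, not the actual qbeast code), both variants only differ in the selectReceiver function passed in; the per-variant receiver functions are sketched in the two sections below.

// Hypothetical cube identifier: the path of child positions from the root.
// The real qbeast CubeId is more involved; this is only for illustration.
final case class CubeId(path: Vector[Int]) {
	def depth: Int = path.length
	def isRoot: Boolean = path.isEmpty
	def parent: Option[CubeId] =
		if (isRoot) None else Some(CubeId(path.dropRight(1)))
	// The sibling right after this cube in the parent's child ordering, if any.
	def nextSibling(maxChildren: Int): Option[CubeId] =
		if (isRoot || path.last == maxChildren - 1) None
		else Some(CubeId(path.dropRight(1) :+ (path.last + 1)))
}

// Bottom-up rollup: visit the deepest cubes first (siblings left to right) and
// push every under-sized cube into the receiver chosen by selectReceiver,
// regardless of how many elements the receiver already holds.
def rollup(
		elementCounts: Map[CubeId, Long],
		cubeSize: Long,
		selectReceiver: CubeId => Option[CubeId]): Map[CubeId, CubeId] = {
	import scala.math.Ordering.Implicits._
	val accumulated = scala.collection.mutable.Map(elementCounts.toSeq: _*)
	val receivers = scala.collection.mutable.Map.empty[CubeId, CubeId]
	val bottomUp = elementCounts.keys.toSeq.sortBy(c => (-c.depth, c.path))
	for (cube <- bottomUp) {
		val count = accumulated(cube)
		if (count < cubeSize) {
			selectReceiver(cube).foreach { receiver =>
				accumulated(receiver) = accumulated.getOrElse(receiver, 0L) + count
				accumulated(cube) = 0L
				receivers(cube) = receiver
			}
		}
	}
	// Maps each rolled-up cube to its immediate receiver; resolving chains of
	// receivers into final files is omitted in this sketch.
	receivers.toMap
}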

Existing Rollup Implementation

The receiver for the existing implementation is the parent cube. All cubes are either left where they are or mapped to their parent cube.

Given d indexing columns, all cubes can have up to 2^d children. The worst-case scenario happens when all 2^d child cubes have cubeSize - 1 elements, and the parent is full. In this case, the parent cube will have accumulated approximately (2^d + 1)*cubeSize records. This value grows exponentially with the number of indexing columns.
[Figure: worst case of the existing rollup, 2^d nearly full children rolled up into a full parent]
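
In terms of the sketch above, the existing behaviour simply always picks the parent (hypothetical names again):

// Existing rollup: every under-sized cube is pushed into its parent cube.
def parentReceiver(cube: CubeId): Option[CubeId] = cube.parent

// e.g. rollup(elementCounts, cubeSize = 10000, parentReceiver)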

Proposed Rollup Implementation (next-in-line)

Instead of mapping cubes directly to their parent, the proposed implementation merges them with their next sibling cube; the parent cube is the receiver only when a cube has no next sibling.

The worst-case scenario for this implementation happens when a cube c, which has cubeSize elements, receives data from a left sibling cube and a child cube, both with cubeSize - 1 elements. The accumulated size of c is then approximately 3 * cubeSize, regardless of the number of dimensions d.

[Figure: worst case of the next-in-line rollup, a full cube receiving one nearly full sibling and one nearly full child]

The accumulated size of c cannot be larger than this number because it can only receive data from two other cubes; this is true regardless of the number of possible sibling or child cubes, making the upper bound independent of d. Also, c cannot receive more than cubeSize - 1 elements from any single cube, because a cube with cubeSize or more elements does not satisfy the rollup condition and is never merged.
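
In the same sketch, next-in-line only changes the receiver selection: try the sibling that comes right after the cube and fall back to the parent when there is none (again hypothetical, assuming 2^d possible children per cube):

// Next-in-line rollup: prefer the next sibling, fall back to the parent.
def nextInLineReceiver(dimensions: Int)(cube: CubeId): Option[CubeId] = {
	val maxChildren = 1 << dimensions // 2^d possible children per cube
	cube.nextSibling(maxChildren).orElse(cube.parent)
}

// e.g. rollup(elementCounts, cubeSize = 10000, nextInLineReceiver(3))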

@fpj
Contributor

fpj commented Jan 22, 2025

The new proposal impacts even lower dimensions, like 2 and 3, which are more common at the moment, but it has a higher impact with a larger number of dimensions.

One aspect that is not discussed at all, in either the current approach or the new one, is checking the size of the recipient before merging cubes. There seems to be an implicit assumption that having larger files is better than having some small files. I'd agree that a lot of small files is bad, but a lot of large files, multiple times the cube size, is not our target either. This approach seems prone to producing many large files in some scenarios (I'd probably need to work through an example to convince myself of that).

We might want to apply some concept like hysteresis to determine when to merge cubes if we really want to strike a good balance. It depends on the target and we don't seem to have a clear target defined, though.

On a small note, if we do start merging sideways, then roll-up makes less sense as a name.

@cugni
Member Author

cugni commented Jan 22, 2025

Those are very good points. Indeed, we are mixing two concepts here. We use desiredCubeSize to target the ideal cube size, and we use the same value as the minimum file size during the rollup process. The rationale behind having minFileSize equal to desiredCubeSize is that in the ideal index state, the one that minimizes the files' column min-max ranges, every non-leaf cube is stored in a single file on its own (with its leaf children).

Because we follow this strict no-files-smaller-than-cubeSize approach, we push data into the next cube in line even if that cube already has the desired number of elements.

We could be using more complex methods like:

  1. Merge into the next cube in line only if I'm less than X% (e.g., 80%) of the desired cube size (see the sketch after this list). This softens the issue a bit, but it doesn't solve it. It would be the equivalent of having minFileSize = 0.8 * desiredCubeSize. While simple, it is not clear to me how to find the proper value for this parameter other than testing different values for each use case.
  2. Find the ideal combination of sibling cubes that minimizes the file size difference. This approach can be expensive in high dimensions and might require some heuristics. At the same time, it might lead to worse file skipping, as the optimal packing might combine cubes that contain very diverse data (for instance, covering opposite halves of each dimension's partition). With the current approach, we prioritize packing cubes that have more space in common, using a z-order heuristic locally.
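
A minimal sketch of option 1 (the parameter name is hypothetical, not an existing qbeast option):

// Hypothetical "softer" rollup condition: only roll a cube up while it is
// below a fraction of the desired cube size, instead of below the full
// desiredCubeSize. minFileSizeFraction is not an actual qbeast option.
val desiredCubeSize: Long = 10000L
val minFileSizeFraction: Double = 0.8
val minFileSize: Long = (minFileSizeFraction * desiredCubeSize).toLong

def shouldRollup(accumulatedElementCount: Long): Boolean =
	accumulatedElementCount < minFileSize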

TL;DR: The rollup (roll-next?) prioritizes having files bigger than desiredCubeSize and producing better min-max intervals over minimizing the difference between the individual file sizes and desiredCubeSize.

Development

Successfully merging this pull request may close these issues.

Rollup performance on high dimensional space