
Issue #455: Next in line Rollup (rebased) #514

Open
wants to merge 4 commits into main

Conversation

cugni
Member

@cugni cugni commented Dec 11, 2024

Fixes #454.

As the code base has changed considerably in recent months and rebasing PR #455 has proved difficult, I'm opening a new PR.

@Jiaweihu08 if you want to work on this issue, you should work on this PR.

@Jiaweihu08 Jiaweihu08 marked this pull request as ready for review December 13, 2024 15:21
@Jiaweihu08 Jiaweihu08 requested a review from osopardo1 December 13, 2024 16:42
@Jiaweihu08 Jiaweihu08 mentioned this pull request Dec 13, 2024
@Jiaweihu08 Jiaweihu08 changed the title from "Rebasing the changes form #455: Next in line Rollup" to "Issue #455: Next in line Rollup (rebased)" Dec 13, 2024
@Jiaweihu08
Member

Here is a simple illustration of how the two rollup implementations differ.

The existing implementation maps a small cube to its parent cube, while the one proposed here first checks the next sibling cube.

If the number of indexing columns, d, is three, each inner cube has 2^d = 8 possible child cubes.

The existing implementation can therefore theoretically put up to 8 * (cubeSize - 1) + cubeSize records in a single file: 8 almost-full child cubes, each with (cubeSize - 1) records, all rolled up to their full parent cube.

Conversely, this number is reduced to 2 * (cubeSize - 1) + cubeSize for the new implementation: a cube c is merged with at most one sibling cube from the left and one child cube.
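
As a quick sanity check with the cubeSize of 10000 used in the example below, the two bounds work out to:

8 * (10000 - 1) + 10000 = 89,992 records per file (existing rollup)
2 * (10000 - 1) + 10000 = 29,998 records per file (next-in-line)

which is consistent with all 85,000 records of the example fitting in a single file under the existing rollup.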

The example below shows how the old version merges all child cubes with their parent, while the new implementation first checks the sibling and produces more balanced file sizes (element counts).

// Indexing three columns
import org.apache.spark.sql.functions.rand

val path = "/tmp/qbeast-rollup-example" // any writable location
val df = (spark
	.range(85000)
	.toDF("id_1")
	.withColumn("id_2", rand())
	.withColumn("id_3", rand())
)
(df
	.write
	.mode("overwrite")
	.format("qbeast")
	.option("columnsToIndex", "id_1,id_2,id_3")
	.option("cubeSize", "10000")
	.save(path)
)

Old Rollup

Path   | Block Element Counts                                    | Element Count | Num Blocks
file_1 | [9278, 9371, 9483, 9415, 9370, 9300, 9343, 10112, 9328] | 85000         | 9

New Rollup

Path   | Block Element Counts | Element Count | Num Blocks
file_1 | [9365, 9449]         | 18814         | 2
file_2 | [9305, 9416]         | 18721         | 2
file_3 | [9324, 9342]         | 18666         | 2
file_4 | [9409, 9242]         | 18651         | 2
file_5 | [10148]              | 10148         | 1
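
For reference (not part of the PR, just standard Spark), a rough way to reproduce the per-file element counts above, though not the per-block ones, assuming the table was written to path as in the snippet above:

import org.apache.spark.sql.functions.input_file_name

(spark.read
	.format("qbeast")
	.load(path)
	.groupBy(input_file_name().as("file"))
	.count()
	.show(truncate = false)
)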

@Jiaweihu08
Member

Jiaweihu08 commented Jan 21, 2025

File Size Upper Bound (in terms of element count)

Implementation         | Element Count Upper Bound
Current Implementation | (2^d + 1) * cubeSize
Next-in-line           | 3 * cubeSize

Note: d is the number of indexing dimensions.

The upper bound for the existing implementation increases exponentially with the number of indexing dimensions, while the proposed solution has a fixed upper bound.

An important thing to remember is that while Rollup imposes strict conditions on when to map a cube to a receiving cube, the cube element counts are approximations during writes. The final element count of the files will depend on the quality of the cube domain size and max weight estimations, so do expect some files to exceed the values presented here. However, during optimization, we compute element counts with 100% certainty.

Rollup Basics

Both implementations iterate from the bottom up and check the size of the current cube to determine if it should be mapped to a receiver cube; a cube c is mapped to a receiver cube if and only if its (accumulated) element count is < cubeSize. The mapping is done regardless of the element count of the receiver cube, and the only difference between the two implementations is how the receiver cube is selected.
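
As a rough sketch of this shared loop (hypothetical types and names, not the actual qbeast code), both variants only differ in the selectReceiver function passed in; the per-variant receiver functions are sketched in the two sections below.

// Hypothetical cube identifier: the path of child positions from the root.
// The real qbeast CubeId is more involved; this is only for illustration.
final case class CubeId(path: Vector[Int]) {
	def depth: Int = path.length
	def isRoot: Boolean = path.isEmpty
	def parent: Option[CubeId] =
		if (isRoot) None else Some(CubeId(path.dropRight(1)))
	// The sibling right after this cube in the parent's child ordering, if any.
	def nextSibling(maxChildren: Int): Option[CubeId] =
		if (isRoot || path.last == maxChildren - 1) None
		else Some(CubeId(path.dropRight(1) :+ (path.last + 1)))
}

// Bottom-up rollup: visit the deepest cubes first (siblings left to right) and
// push every under-sized cube into the receiver chosen by selectReceiver,
// regardless of how many elements the receiver already holds.
def rollup(
		elementCounts: Map[CubeId, Long],
		cubeSize: Long,
		selectReceiver: CubeId => Option[CubeId]): Map[CubeId, CubeId] = {
	import scala.math.Ordering.Implicits._
	val accumulated = scala.collection.mutable.Map(elementCounts.toSeq: _*)
	val receivers = scala.collection.mutable.Map.empty[CubeId, CubeId]
	val bottomUp = elementCounts.keys.toSeq.sortBy(c => (-c.depth, c.path))
	for (cube <- bottomUp) {
		val count = accumulated(cube)
		if (count < cubeSize) {
			selectReceiver(cube).foreach { receiver =>
				accumulated(receiver) = accumulated.getOrElse(receiver, 0L) + count
				accumulated(cube) = 0L
				receivers(cube) = receiver
			}
		}
	}
	// Maps each rolled-up cube to its immediate receiver; resolving chains of
	// receivers into final files is omitted in this sketch.
	receivers.toMap
}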

Existing Rollup Implementation

The receiver for the existing implementation is the parent cube. All cubes are either left where they are or mapped to their parent cube.

Given d indexing columns, all cubes can have up to 2^d children. The worst-case scenario happens when all 2^d child cubes have cubeSize - 1 elements, and the parent is full. In this case, the parent cube will have accumulated approximately (2^d + 1)*cubeSize records. This value grows exponentially with the number of indexing columns.
[Figure: worst case of the existing rollup, 2^d nearly full children rolled up into a full parent]
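
In terms of the sketch above, the existing behaviour simply always picks the parent (hypothetical names again):

// Existing rollup: every under-sized cube is pushed into its parent cube.
def parentReceiver(cube: CubeId): Option[CubeId] = cube.parent

// e.g. rollup(elementCounts, cubeSize = 10000, parentReceiver)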

Proposed Rollup Implementation (next-in-line)

Instead of mapping cubes directly to their parent, the proposed implementation merges them with their next sibling cube; the parent cube is the receiver only when a cube has no next sibling.

The worst-case scenario for this implementation happens when a cube c, which has cubeSize elements, receives data from a left sibling cube and a child cube, both with cubeSize - 1 elements. The accumulated size of c is then approximately 3 * cubeSize, regardless of the number of dimensions d.

[Figure: worst case of the next-in-line rollup, a full cube receiving one nearly full sibling and one nearly full child]

The accumulated size of c cannot be larger than this number because it can only receive data from two other cubes; this is true regardless of the number of possible sibling or child cubes, making the upper bound independent of d. Also, c cannot receive more than cubeSize - 1 elements from any single cube, because a cube with cubeSize or more elements does not satisfy the rollup condition and is never merged.
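
In the same sketch, next-in-line only changes the receiver selection: try the sibling that comes right after the cube and fall back to the parent when there is none (again hypothetical, assuming 2^d possible children per cube):

// Next-in-line rollup: prefer the next sibling, fall back to the parent.
def nextInLineReceiver(dimensions: Int)(cube: CubeId): Option[CubeId] = {
	val maxChildren = 1 << dimensions // 2^d possible children per cube
	cube.nextSibling(maxChildren).orElse(cube.parent)
}

// e.g. rollup(elementCounts, cubeSize = 10000, nextInLineReceiver(3))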

@fpj
Contributor

fpj commented Jan 22, 2025

The new proposal impacts even lower dimensions, like 2 and 3, which are more common at the moment, but it has a higher impact with a larger number of dimensions.

One aspect that is not discussed at all, in either the current approach or the new one, is checking the size of the recipient before merging cubes. There seems to be an implicit assumption that having larger files is better than having some small files. I'd agree that a lot of small files is bad, but a lot of large files, multiple times the cube size, is not our target either. This approach seems prone to producing many large files in some scenarios (I'd probably need to work through an example to convince myself of that).

We might want to apply some concept like hysteresis to determine when to merge cubes if we really want to strike a good balance. It depends on the target and we don't seem to have a clear target defined, though.

On a small note, if we do start merging sideways, then roll-up makes less sense as a name.

@cugni
Member Author

cugni commented Jan 22, 2025

Those are very good points. Indeed, we are mixing two concepts here. We use desiredCubeSize to target the ideal cube size, and we use the same value as the minimum file size during the rollup process. The rationale behind having minFileSize equal to desiredCubeSize is that in the ideal index state, the one that minimizes the files' column min-max ranges, every non-leaf cube is stored in a single file on its own (with its leaf children).

Because we follow this strict no-files-smaller-than-cubeSize approach, we push data into the next cube in line even if that cube already has the desired number of elements.

We could be using more complex methods like:

  1. Merge into the next cube in line only if I'm less than X% (e.g., 80%) of the desired cube size (see the sketch after this list). This softens the issue a bit, but it doesn't solve it. It would be the equivalent of having minFileSize = 0.8 * desiredCubeSize. While simple, it is not clear to me how to find the proper value for this parameter other than testing different values for each use case.
  2. Find the ideal combination of sibling cubes that minimizes the file size difference. This approach can be expensive in high dimensions and might require some heuristics. At the same time, it might lead to worse file skipping, as the optimal packing might combine cubes that contain very diverse data (for instance, covering opposite halves of each dimension's partition). With the current approach, we prioritize packing cubes that have more space in common, using a z-order heuristic locally.
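
A minimal sketch of option 1 (the parameter name is hypothetical, not an existing qbeast option):

// Hypothetical "softer" rollup condition: only roll a cube up while it is
// below a fraction of the desired cube size, instead of below the full
// desiredCubeSize. minFileSizeFraction is not an actual qbeast option.
val desiredCubeSize: Long = 10000L
val minFileSizeFraction: Double = 0.8
val minFileSize: Long = (minFileSizeFraction * desiredCubeSize).toLong

def shouldRollup(accumulatedElementCount: Long): Boolean =
	accumulatedElementCount < minFileSize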

TL;DR: The rollup (roll-next?) prioritizes having files bigger than desiredCubeSize and producing better min-max intervals over minimizing the difference between the individual file sizes and desiredCubeSize.

Development

Successfully merging this pull request may close these issues.

Rollup performance on high dimensional space