Issue #455: Next in line Rollup (rebased) #514
base: main
Conversation
Here is a simple illustration of how the two rollup implementations differ. The existing implementation maps a small cube to its parent cube, while the one proposed here first checks the next sibling cube. With d indexing columns, a cube has up to 2^d children, so the existing implementation can theoretically merge all of them into the parent and produce files several times the desired cube size; checking the next sibling first keeps the merged files much closer to the target. The example here shows how the old version merges all child cubes with their parent, while the new implementation first checks the sibling and outputs more balanced file sizes (element counts).
// Indexing three columns
import org.apache.spark.sql.functions.rand

val df = (spark
  .range(85000)
  .toDF("id_1")
  .withColumn("id_2", rand())
  .withColumn("id_3", rand())
)
(df
  .write
  .mode("overwrite")
  .format("qbeast")
  .option("columnsToIndex", "id_1,id_2,id_3")
  .option("cubeSize", "10000")
  .save(path)
)

Old Rollup
New Rollup
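The destination choice can be sketched with a minimal, hypothetical model: the Cube class and its parent and nextSibling fields below are simplified stand-ins rather than the actual qbeast-spark types, and the sketch only shows the order in which merge destinations are considered for an undersized cube.

// Hypothetical, simplified model of the rollup destination choice.
case class Cube(id: String, size: Long, parent: Option[Cube], nextSibling: Option[Cube])

val desiredCubeSize = 10000L

// Old rollup: an undersized cube is always merged into its parent, so a parent
// with up to 2^d children can end up holding many times the desired cube size.
def oldDestination(cube: Cube): Option[Cube] =
  if (cube.size < desiredCubeSize) cube.parent else None

// New rollup: try the next sibling first and only fall back to the parent,
// which spreads the merged records over more, similarly sized files.
def newDestination(cube: Cube): Option[Cube] =
  if (cube.size < desiredCubeSize) cube.nextSibling.orElse(cube.parent) else None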
The new proposal affects even lower dimensions, like 2 and 3, which are more common at the moment, but it has a higher impact with a larger number of dimensions.

One aspect that is not discussed at all, neither in the current approach nor in the new one, is checking the size of the recipient before merging cubes. There seems to be an implicit assumption that having larger files is better than having some small files. I'd think that a lot of small files is bad, but a lot of large files, multiple times the cube size, is not our target either. This approach seems prone to producing many large files in some scenarios (I'd probably need to work through an example to convince myself of that). If we really want to strike a good balance, we might want to apply some concept like hysteresis to determine when to merge cubes, as sketched below. It depends on the target, though, and we don't seem to have a clear target defined.

On a small note, if we do start merging sideways, then calling it a roll-up makes less sense.
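A hysteresis-style rule along those lines could look roughly like the sketch below; the watermark names and the particular thresholds are invented for illustration and are not part of either implementation.

// Hypothetical sketch of a hysteresis-style merge rule: only keep merging while
// the recipient is below a low watermark, and never let the merged file grow
// past a high watermark (here a small multiple of the desired cube size).
val desiredCubeSize = 10000L
val lowWatermark  = (desiredCubeSize * 0.8).toLong  // start merging below this
val highWatermark = (desiredCubeSize * 1.5).toLong  // never grow a file past this

def shouldMerge(recipientSize: Long, donorSize: Long): Boolean =
  recipientSize < lowWatermark && recipientSize + donorSize <= highWatermark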
Those are very good points. Indeed, we are mixing two concepts here: we use the desiredCubeSize to target the ideal cube size, and we use the same value as the minimum file size during the rollup process. The rationale behind having the minFileSize equal to the desiredCubeSize is that in the ideal index state, the one that minimizes the files' column min-max ranges, every non-leaf cube is stored alone in a single file (together with its leaf children). Because we follow this strict no-files-smaller-than-cube-size approach, we push in the next cube in line even if it already has the desired number of elements. We could be using more complex methods like:
TL;DR: The roll-to-next approach prioritizes having files bigger than the desiredCubeSize and producing better min-max intervals over minimizing the difference between the various file sizes and the desiredCubeSize.
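As a rough sketch of the strict rule described above (the function and the list-of-sizes model are purely illustrative, not the actual implementation):

// Hypothetical sketch of the strict no-files-smaller-than-cube-size rule:
// while the file being built is still below desiredCubeSize, the next cube in
// line is pushed in, even if that cube already holds the desired number of
// elements, so the resulting file can noticeably exceed the target.
def rollUp(cubeSizesInLine: List[Long], desiredCubeSize: Long): List[Long] =
  cubeSizesInLine.foldLeft(List.empty[Long]) {
    case (current :: done, next) if current < desiredCubeSize =>
      (current + next) :: done  // file still too small: push in the next cube in line
    case (files, next) =>
      next :: files             // previous file is large enough: start a new one
  }.reverse

// Example with desiredCubeSize = 10000: the 11000-element cube is pushed into
// the 4000-element file even though it already holds the desired number of
// elements, producing one 15000-element file.
// rollUp(List(4000L, 11000L, 12000L), 10000L) == List(15000L, 12000L)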
Fixes #454.
The code base has changed considerably in the last few months and rebasing PR #455 has proved difficult, so I'm opening a new PR.
@Jiaweihu08 if you want to work on this issue, you should work on this PR.