-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Manifest split design document #593
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks pretty thorough to me but I am a little concerned about complexity.
I left a few more detailed comments near the end, will think more about this tomorrow
# max-manifest-size and arrays-per-manifest are | ||
# mutually exclusive | ||
|
||
overflow-to: coord2 # if an array cannot fit in the set, send it to set coord2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dang this is complex!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not entirely sure we need it, but it's so easy to implement (si if/else in the algorithm pseudo-code). We can skip it, replace it, etc .... just a gathering of ideas for now, trying to go for maximum power before we cut
|
||
- path: .*/(latitude|longitude|time) # an optional regex matching on path | ||
|
||
metadata-chunks: [0, 500] # arrays having number of chunks in this range |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
metadata-chunks: [0, 500] # arrays having number of chunks in this range | |
metadata-chunk-size: [0, 500] # arrays having number of chunks calculated as Σ(shape/chunk-shape) in this range |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(not committing this changes to apply the change everywhere)
Manifest 1 has arrays: a, b | ||
Manifest 2 has arrays: a, b, c | ||
Manifest 3 has arrays: c |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feels like we'd gain a large amount of simplicity by disallowing that kind of packing.
For example,
Manifest 1 has arrays: a, b
Manifest 2 has arrays: a, b
Manifest 3 has arrays: a # b is over; c is not allowed
Manifest 4 has arrays: c
Manifest 5 has arrays: c
I'm not sure we gain a lot by allowing the packing of a,c
in Manifest 3 (assuming that matches the rules). Indeed we might be better off relaxing the threshold to fit a
between 1,2 by being looser about enforcing max-manifest-size
. Something like if the overflow of a
in Manifest 3 is <20% of max-manifest-size
then just redistribute over preceding manifests.
To understand how the feature works, we show an example configuration. Comments | ||
in the yaml file explain what the different settings do. | ||
|
||
```yaml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should write out some "optimal" configs for a few use-cases:
- ERA5 ingest : 250 arrays with 10 million chunks each.
- updating forecast dataset like HRRR.
- A sparsely populated datacube where updates occur in specific geographic regions.
That might help constrain the design space.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great, yes!
Co-authored-by: Deepak Cherian <[email protected]>
Co-authored-by: Deepak Cherian <[email protected]>
Co-authored-by: Deepak Cherian <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an extremely thoughtful exploration of this complex issue.
My initial reaction was the same as Deepak's...this is creating a lot of complexity for us to implement and maintain. I'm trying to imagine the test suite for the features described here, and it's making my head hurt!
However, if you feel like you can see the path forward here, I'm happy to support it. I would encourage trying to whittle this down to the minimum possible set of capabilities needed to meet the requirements.
a specific manifest set. These rules can be based on array paths or number | ||
of chunks. In the future we could add more power to the rules system. | ||
|
||
Other new feature we provide is the ability to preload certain manifests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this will help significantly
|
||
Manifest sets, array rules, and manifest prefetch, are configured in the | ||
persistent configuration of the repository and can be overloaded on open as | ||
any other configuration value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is the default configuration provided?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And what happens if the configuration is changed? Will all of the manifests be rewritten?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll explain in the document, i missed this part. No, we won't rewrite until they are rewritten by a commit. In the future we could offer repacking the manifest as a explicit optimization operation.
* Icechunk to be fast for interactive usage, when a user is exploring a dataset | ||
* Small reads not to parse the whole manifest | ||
* Small writes not to rewrite the whole manifest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good set of requirements.
|
||
|
||
# should we actually do this by default or is no-preload a better default? | ||
preload: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, can we allow the user to preload the coord1
set? Seems silly to duplicate info like the path
regex in both preload
and rules
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could ... but currently the manifest sets are purely tacit. They don't persist in any way, it's purely a way to group them during flush. But of course, we could persist them.
But, it feels less powerful? How about having a condition on manifest size too? so you could say "preload up to 5 small manifest" or "preload manifest for this array"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another future use-case is to preload the most recent manifest for multiple arrays (in the case of an updating forecast dataset)
No description provided.