Manifest split design document #593

paraseba · 2025-01-19T01:38:08Z

No description provided.

dcherian

This looks pretty thorough to me but I am a little concerned about complexity.

I left a few more detailed comments near the end; I'll think more about this tomorrow.

design-docs/005-manifest-split.md

dcherian · 2025-01-19T04:17:55Z

design-docs/005-manifest-split.md

+                                  # max-manifest-size and arrays-per-manifest are
+                                  # mutually exclusive
+
+        overflow-to: coord2       # if an array cannot fit in the set, send it to set coord2


dang this is complex!

I'm not entirely sure we need it, but it's so easy to implement (an if/else in the algorithm pseudo-code). We can skip it, replace it, etc. This is just a gathering of ideas for now, trying to go for maximum power before we cut.
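To make the "if/else" concrete, here is a minimal Python sketch of how `overflow-to` and the mutually exclusive limits could behave. Everything in it is an assumption for illustration: the `ManifestSet` class, the `choose_set` helper, and the reading of "cannot fit" as "the array has more chunk references than `max-manifest-size`" are not taken from the design document.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ManifestSet:
    """Hypothetical in-memory form of one manifest set from the YAML draft."""

    name: str
    max_manifest_size: Optional[int] = None    # assumed unit: chunk refs per manifest
    arrays_per_manifest: Optional[int] = None  # max arrays packed into one manifest
    overflow_to: Optional[str] = None          # set that receives arrays that do not fit

    def __post_init__(self) -> None:
        # The draft states that the two limits are mutually exclusive.
        if self.max_manifest_size is not None and self.arrays_per_manifest is not None:
            raise ValueError(
                f"set {self.name!r}: max-manifest-size and arrays-per-manifest "
                "are mutually exclusive"
            )


def choose_set(sets: Dict[str, ManifestSet], start: str, array_chunks: int) -> str:
    """Follow overflow-to links from `start` until a set can hold the array."""
    seen = set()
    name: Optional[str] = start
    while name is not None and name not in seen:
        seen.add(name)  # guards against overflow-to cycles
        current = sets[name]
        limit = current.max_manifest_size
        if limit is None or array_chunks <= limit:
            return name
        name = current.overflow_to
    raise ValueError(f"no set can hold an array with {array_chunks} chunks")
```

With a `coord1` set limited to 500 references that overflows to an unlimited `coord2`, a 100-chunk array stays in `coord1` and a 1000-chunk array is routed to `coord2`.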

dcherian · 2025-01-19T04:43:32Z

design-docs/005-manifest-split.md

+
+    - path: .*/(latitude|longitude|time)  # an optional regex matching on path
+
+      metadata-chunks: [0, 500]           # arrays having number of chunks in this range


Suggested change

- metadata-chunks: [0, 500] # arrays having number of chunks in this range
+ metadata-chunk-size: [0, 500] # arrays having number of chunks calculated as Σ(shape/chunk-shape) in this range

(not committing this change, so I can apply it everywhere)
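For readers following the rule syntax, the matching logic could be sketched roughly as below. This is only an illustration under several assumptions that the thread does not settle: the regex must match the full array path, the chunk-count range is inclusive, a rule matches only if all of its conditions hold, and the first matching rule wins. `ArrayRule`, `assign_set`, and the `"default"` set name are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ArrayRule:
    """Hypothetical rule that sends matching arrays to `target_set`."""

    target_set: str
    path: Optional[str] = None                     # optional regex on the array path
    chunk_range: Optional[Tuple[int, int]] = None  # optional inclusive [low, high]

    def matches(self, array_path: str, num_chunks: int) -> bool:
        # Every condition that is present must hold (assumed AND semantics).
        if self.path is not None and re.fullmatch(self.path, array_path) is None:
            return False
        if self.chunk_range is not None:
            low, high = self.chunk_range
            if not (low <= num_chunks <= high):
                return False
        return True


def assign_set(rules: List[ArrayRule], array_path: str, num_chunks: int,
               default: str = "default") -> str:
    """Return the target set of the first matching rule, else `default`."""
    for rule in rules:
        if rule.matches(array_path, num_chunks):
            return rule.target_set
    return default
```

Under these assumptions, a rule with `path=r".*/(latitude|longitude|time)"` and `chunk_range=(0, 500)` captures a small `/group/latitude` array, but not a `latitude` array with 10,000 chunks and not `/group/temperature`.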

dcherian · 2025-01-19T05:03:18Z

design-docs/005-manifest-split.md

+Manifest 1 has arrays: a, b
+Manifest 2 has arrays: a, b, c
+Manifest 3 has arrays: c


Feels like we'd gain a large amount of simplicity by disallowing that kind of packing.

For example,

Manifest 1 has arrays: a, b
Manifest 2 has arrays: a, b
Manifest 3 has arrays: a     # b is over; c is not allowed
Manifest 4 has arrays: c
Manifest 5 has arrays: c

I'm not sure we gain a lot by allowing the packing of a, c in Manifest 3 (assuming that matches the rules). Indeed, we might be better off relaxing the threshold so that a fits in Manifests 1 and 2, by being looser about enforcing max-manifest-size. Something like: if the overflow of a into Manifest 3 is <20% of max-manifest-size, then just redistribute it over the preceding manifests.
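The "looser threshold" idea can be sketched for the simplest case, a single array split across manifests. This is a hedged illustration, not the proposed algorithm: the function name `split_sizes`, the `slack` parameter, and the choice to spread the remainder evenly are all assumptions, and sizes are assumed to be counted in chunk references.

```python
from typing import List


def split_sizes(total_refs: int, max_size: int, slack: float = 0.2) -> List[int]:
    """Return the number of chunk refs per manifest for one array.

    If the last manifest would hold less than `slack * max_size` refs,
    fold it into the preceding manifests instead of writing a tiny one.
    """
    if total_refs <= 0:
        return []
    full, remainder = divmod(total_refs, max_size)
    if full == 0:
        return [remainder]         # everything fits in one manifest
    if remainder == 0:
        return [max_size] * full   # exact multiple, nothing overflows
    if remainder < slack * max_size:
        # Spread the small overflow as evenly as possible over the full manifests.
        extra, leftover = divmod(remainder, full)
        return [max_size + extra + (1 if i < leftover else 0) for i in range(full)]
    return [max_size] * full + [remainder]
```

With `max_size=1000`, an array with 2100 references yields two manifests of 1050 references each, while an array with 2500 references still gets a third manifest of 500. The cost of the relaxation is that a manifest can exceed `max_size` by up to the slack fraction.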

dcherian · 2025-01-19T05:11:23Z

design-docs/005-manifest-split.md

+To understand how the feature works, we show an example configuration. Comments
+in the yaml file explain what the different settings do.
+
+```yaml


I think we should write out some "optimal" configs for a few use-cases:

* ERA5 ingest: 250 arrays with 10 million chunks each.
* An updating forecast dataset like HRRR.
* A sparsely populated datacube where updates occur in specific geographic regions.

That might help constrain the design space.

great, yes!

Co-authored-by: Deepak Cherian <[email protected]>

rabernat

This is an extremely thoughtful exploration of this complex issue.

My initial reaction was the same as Deepak's...this is creating a lot of complexity for us to implement and maintain. I'm trying to imagine the test suite for the features described here, and it's making my head hurt!

However, if you feel like you can see the path forward here, I'm happy to support it. I would encourage trying to whittle this down to the minimum possible set of capabilities needed to meet the requirements.

rabernat · 2025-01-20T13:29:49Z

design-docs/005-manifest-split.md

+a specific manifest set. These rules can be based on array paths or number
+of chunks. In the future we could add more power to the rules system.
+
+Other new feature we provide is the ability to preload certain manifests.


Interesting.

I think this will help significantly

rabernat · 2025-01-20T13:35:43Z

design-docs/005-manifest-split.md

+
+Manifest sets, array rules, and manifest prefetch, are configured in the
+persistent configuration of the repository and can be overloaded on open as
+any other configuration value.


How is the default configuration provided?

And what happens if the configuration is changed? Will all of the manifests be rewritten?

I'll explain this in the document; I missed this part. No, manifests won't be rewritten until a commit rewrites them. In the future we could offer repacking the manifests as an explicit optimization operation.
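As a rough illustration of "persistent configuration that can be overridden on open", the sketch below layers open-time overrides on top of a stored configuration. It is not Icechunk's API: `effective_config`, the nested-dict representation, the key names in the usage note, and the choice of a deep merge (rather than replacing whole sections) are assumptions made for the example.

```python
import copy
from typing import Any, Dict, Optional


def effective_config(persistent: Dict[str, Any],
                     overrides: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Return a new config: `persistent` with `overrides` applied on top.

    Nested dictionaries are merged key by key; any other value in
    `overrides` replaces the persistent value. Inputs are not modified.
    """
    merged = copy.deepcopy(persistent)
    for key, value in (overrides or {}).items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = effective_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged
```

For example, opening with `{"manifest": {"preload": {"max-manifests": 0}}}` would turn off preloading for that session only, leaving the stored rules and the persistent configuration untouched. Consistent with the reply above, a changed configuration in this model would only affect manifests written by later commits.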

rabernat · 2025-01-20T13:40:23Z

design-docs/005-manifest-split.md

+* Icechunk to be fast for interactive usage, when a user is exploring a dataset
+* Small reads not to parse the whole manifest
+* Small writes not to rewrite the whole manifest


Good set of requirements.

dcherian · 2025-01-20T16:23:30Z

design-docs/005-manifest-split.md

+
+
+  # should we actually do this by default or is no-preload a better default?
+  preload:


Actually, can we allow the user to preload the coord1 set? Seems silly to duplicate info like the path regex in both preload and rules

We could... but currently the manifest sets are purely tacit. They don't persist in any way; it's purely a way to group them during flush. But of course, we could persist them.

But it feels less powerful? How about having a condition on manifest size too? Then you could say "preload up to 5 small manifests" or "preload the manifest for this array".

Another future use-case is to preload the most recent manifest for multiple arrays (in the case of an updating forecast dataset)
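The preload conditions floated in this thread ("up to 5 small manifests", "the manifest for this array") could be combined along the lines of the sketch below. All names are hypothetical (`ManifestInfo`, `PreloadCondition`, `select_preload`), and the sketch assumes that manifest size is counted in chunk references, that the path regex must fully match at least one array in the manifest, and that candidates are considered in the order given.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ManifestInfo:
    """Hypothetical summary of one manifest, as seen at open time."""

    manifest_id: str
    array_paths: List[str]  # arrays with chunk refs in this manifest
    num_refs: int           # total chunk refs stored in the manifest


@dataclass
class PreloadCondition:
    """Hypothetical preload condition; unset fields place no restriction."""

    path: Optional[str] = None           # regex on array path
    max_refs: Optional[int] = None       # only manifests at most this large
    max_manifests: Optional[int] = None  # preload at most this many


def select_preload(manifests: List[ManifestInfo],
                   condition: PreloadCondition) -> List[str]:
    """Return ids of manifests to preload, in input order."""
    selected: List[str] = []
    for m in manifests:
        if condition.max_manifests is not None and len(selected) >= condition.max_manifests:
            break  # budget exhausted
        if condition.max_refs is not None and m.num_refs > condition.max_refs:
            continue  # too large to count as a "small" manifest
        if condition.path is not None and not any(
            re.fullmatch(condition.path, p) for p in m.array_paths
        ):
            continue  # no array in this manifest matches the path regex
        selected.append(m.manifest_id)
    return selected
```

Here `PreloadCondition(max_refs=500, max_manifests=5)` expresses "preload up to 5 small manifests", and `PreloadCondition(path=r".*/temperature")` expresses "preload the manifest for this array". Preloading the most recent manifest per array would need an additional ordering or recency field that this sketch does not model.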

Manifest split design document

2ef723d

paraseba requested review from rabernat and dcherian January 19, 2025 01:38

dcherian reviewed Jan 19, 2025

paraseba and others added 3 commits January 19, 2025 14:12

Update design-docs/005-manifest-split.md

6c06028

Co-authored-by: Deepak Cherian <[email protected]>

Update design-docs/005-manifest-split.md

ae3d7d7

Co-authored-by: Deepak Cherian <[email protected]>

Apply suggestions from code review

15bfd8a

Co-authored-by: Deepak Cherian <[email protected]>

rabernat reviewed Jan 20, 2025

dcherian reviewed Jan 20, 2025

dcherian mentioned this pull request Jan 20, 2025

split out manifests for smaller "coordinate" arrays #539

Closed

jhamman and others added 2 commits January 24, 2025 21:34

Merge branch 'main' into push-rmqlpwwzmlsz

c31863e

Merge branch 'main' into push-rmqlpwwzmlsz

13575f6

paraseba merged commit d1eba13 into main Jan 31, 2025
2 checks passed

paraseba deleted the push-rmqlpwwzmlsz branch January 31, 2025 20:28
