I've done some local tests committing the same dataset to a local icechunk repo across multiple commits. The repo size appears to grow linearly, roughly the number of commits times the dataset size. I assume this is because Content Addressable Storage (https://docs.earthmover.io/concepts/version-control-system#content-addressable-chunk-storage) isn't implemented.

Will icechunk implement CAS in the future?
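For reference, here is a minimal sketch of this kind of test (not the exact script used, and the numbers are illustrative). It assumes icechunk's current Python API (`local_filesystem_storage`, `Repository.create`, `writable_session`, `Session.commit`) and zarr v3; names may differ between releases.

```python
# Sketch of the repo-growth test described above. Assumes icechunk's Python API
# (local_filesystem_storage, Repository.create, writable_session, commit) and
# zarr v3 -- adjust for your installed versions. Use a fresh path for the repo.
from pathlib import Path

import icechunk
import numpy as np
import zarr

repo_path = Path("/tmp/icechunk-growth-test")
repo = icechunk.Repository.create(icechunk.local_filesystem_storage(str(repo_path)))

data = np.random.default_rng(0).random((1000, 1000))  # ~8 MB of float64 chunks


def dir_size(path: Path) -> int:
    """Total size in bytes of all files under `path`."""
    return sum(p.stat().st_size for p in path.rglob("*") if p.is_file())


for i in range(5):
    session = repo.writable_session("main")
    root = zarr.group(store=session.store, overwrite=True)
    arr = root.create_array("x", shape=data.shape, dtype=data.dtype, chunks=(250, 250))
    arr[:] = data  # identical bytes on every iteration
    session.commit(f"commit {i}: rewrite the same data")
    print(f"after commit {i}: repo size = {dir_size(repo_path) / 1e6:.1f} MB")

# Without content-addressable chunk storage, the printed size grows roughly
# linearly: each commit stores a fresh copy of every chunk it wrote.
```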
Thanks for your question @sjperkins! You're correct that this is the current expected behavior of Icechunk.

The most immediate way we can address the growth in repo size is expiration of old versions, followed by garbage collection of the expired chunks. This will be implemented soon.

While developing Arraylake, we realized that there are some tricky challenges around CAS and garbage collection: it's hard to know whether a content-addressed chunk is ever safe to delete, because another write could reference the same content at any moment. Moreover, we looked at over 1 PB of existing customer data and determined that CAS was saving only about 1% of storage, so your example is fairly artificial in terms of real-world usage.

So it is conceivable that we may find a way to bring CAS back at some point, but this is not on the near-term roadmap.
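Purely for illustration, here is a hypothetical sketch of what version expiration followed by garbage collection might look like from Python. The method names `expire_snapshots` and `garbage_collect` (and their signatures) are assumptions, not a confirmed icechunk API, since this feature is described above as not yet implemented.

```python
# Hypothetical sketch only: expire old versions, then garbage-collect chunks
# that are no longer reachable. expire_snapshots / garbage_collect are assumed
# names and signatures, not a confirmed icechunk API.
from datetime import datetime, timedelta, timezone

import icechunk

repo = icechunk.Repository.open(
    icechunk.local_filesystem_storage("/tmp/icechunk-growth-test")
)

cutoff = datetime.now(timezone.utc) - timedelta(days=30)

# 1. Expire snapshots older than the cutoff so branches/tags no longer reach them.
expired = repo.expire_snapshots(older_than=cutoff)  # assumed name/signature
print(f"expired {len(expired)} snapshot(s)")

# 2. Delete chunks and manifests that no remaining snapshot references.
summary = repo.garbage_collect(cutoff)  # assumed name/signature
print(summary)
```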