[r] support for stream deduplication? #23
Comments
Well, at this project, we are definitely fans of @klauspost's work! 😄 It looks like it would mainly involve wrapping the reader/writer for tar. I don't know how well this would work for zip, but for tar, possibly. A lot of users may not find the memory/size tradeoff beneficial for them. That said, are you interested in putting together a PoC?
I don't have the time to do this right now, since I'm a student, but I would like to see it in the project; I think it'd be beneficial and add a lot of value. I can try, though :)
Honestly, I haven't been putting much effort into this, as the interest (probably understandably) is quite low. I was looking at what a "modern" implementation would look like. It wouldn't use SHA1 by default, but there is no really good option for adaptive block splitting. restic/chunker is similar in speed to the zpaq splitter, which is inherently single-core, per-byte processing. But let me know if you have any questions, and I will try to remember the best I can :)
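For readers unfamiliar with adaptive (content-defined) block splitting, here is a minimal sketch of the idea. It is not the algorithm used by restic/chunker or zpaq; it is a simple gear-style rolling hash, and the constants `minSize`, `maxSize`, `maskBits` and the table seed are assumptions picked for illustration. The loop also shows why this kind of splitter processes one byte at a time: each hash value depends on the previous one.

```go
package main

import "fmt"

// All three values are assumptions for this sketch, not tuned settings.
const (
	minSize  = 64   // never cut before this many bytes
	maxSize  = 4096 // always cut once a chunk reaches this many bytes
	maskBits = 8    // cut when the top 8 bits of the rolling hash are zero
)

// gear holds one pseudo-random 64-bit value per possible byte value.
var gear = makeGear()

func makeGear() [256]uint64 {
	var t [256]uint64
	x := uint64(0x9E3779B97F4A7C15) // arbitrary fixed seed, so boundaries are reproducible
	for i := range t {
		// One xorshift64 step per table entry.
		x ^= x << 13
		x ^= x >> 7
		x ^= x << 17
		t[i] = x
	}
	return t
}

// split cuts data into chunks whose boundaries depend on the content
// itself rather than on fixed offsets.
func split(data []byte) [][]byte {
	var chunks [][]byte
	start := 0
	var h uint64
	for i, b := range data {
		// Shifting left by one means bytes older than 64 positions drop out of h.
		h = (h << 1) + gear[b]
		n := i - start + 1
		if (n >= minSize && h>>(64-maskBits) == 0) || n >= maxSize {
			chunks = append(chunks, data[start:i+1])
			start = i + 1
			h = 0
		}
	}
	if start < len(data) {
		chunks = append(chunks, data[start:]) // trailing partial chunk
	}
	return chunks
}

func main() {
	// Deterministic pseudo-random input (a small LCG), plus a copy with
	// one byte inserted at the front.
	data := make([]byte, 1<<16)
	x := uint32(1)
	for i := range data {
		x = x*1664525 + 1013904223
		data[i] = byte(x >> 24)
	}
	shifted := append([]byte{0xFF}, data...)

	seen := make(map[string]bool)
	for _, c := range split(data) {
		seen[string(c)] = true
	}
	chunks := split(shifted)
	shared := 0
	for _, c := range chunks {
		if seen[string(c)] {
			shared++
		}
	}
	fmt.Printf("%d of %d chunks are unchanged after a one-byte insertion\n", shared, len(chunks))
}
```

With fixed-size blocks, a single inserted byte shifts every later block and defeats matching; with content-defined boundaries the chunks can realign after the edit, which is what the `main` function measures.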
Thanks @klauspost! @xplshn I know how busy it can be as a student. There's no rush. If you find the time, feel free to spike it together and see how it goes! Could be a fun project.
I was reading through sbase's tar.c, because I find it tidy and simple. I'm trying to find a way to implement this so that it is compatible with other Any ideas that come to mind would be helpful :)
Would it be in the scope for this project to add stream deduplication to some formats?
It could substantially reduce archive sizes when the input contains repeated data. Basically, blocks that are identical are marked as such, and only a reference to the first occurrence is included in the header. Not all formats may be compatible, though.
This would come as a step before compression.
So it could be implemented as a format option, basically "tar+dedup".compressionFormat.
More about the technique can be read at https://github.com/klauspost/dedup, which is also a stream-deduplication library.
There's an article explaining everything in great detail here too: https://blog.klauspost.com/fast-stream-deduplication-in-go/
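The block-marking idea described above can be sketched in a few lines of Go. This is only an illustration under stated assumptions, not the klauspost/dedup API or its stream format: the `entry` type, the fixed block size, and the choice of SHA-256 as the block hash are all hypothetical, and the sketch trusts the hash without comparing block contents.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// entry is one element of the deduplicated stream (hypothetical layout):
// either a literal block or a reference to an earlier literal block.
type entry struct {
	ref  int    // index of the earlier literal entry, or -1 for a literal
	data []byte // block contents when ref == -1
}

// dedup splits src into fixed-size blocks and replaces every repeated
// block with a reference to its first occurrence.
func dedup(src []byte, blockSize int) []entry {
	seen := make(map[[sha256.Size]byte]int) // block hash -> index of first occurrence
	var out []entry
	for len(src) > 0 {
		n := blockSize
		if n > len(src) {
			n = len(src)
		}
		block := src[:n]
		src = src[n:]

		sum := sha256.Sum256(block)
		if first, ok := seen[sum]; ok {
			out = append(out, entry{ref: first})
			continue
		}
		seen[sum] = len(out)
		out = append(out, entry{ref: -1, data: block})
	}
	return out
}

// restore rebuilds the original bytes by following references.
func restore(entries []entry) []byte {
	var buf bytes.Buffer
	for _, e := range entries {
		if e.ref >= 0 {
			buf.Write(entries[e.ref].data)
		} else {
			buf.Write(e.data)
		}
	}
	return buf.Bytes()
}

// countLiterals reports how many blocks are actually stored.
func countLiterals(entries []entry) int {
	n := 0
	for _, e := range entries {
		if e.ref < 0 {
			n++
		}
	}
	return n
}

func main() {
	src := bytes.Repeat([]byte("abcdefgh"), 4) // four identical 8-byte blocks
	src = append(src, []byte("tail")...)

	entries := dedup(src, 8)
	fmt.Println("entries:", len(entries))
	fmt.Println("stored blocks:", countLiterals(entries))
	fmt.Println("round trip ok:", bytes.Equal(restore(entries), src))
}
```

The output of `dedup` would then be serialized and handed to the compressor, which matches the "step before compression" ordering proposed here. The memory cost is the `seen` map, which grows with the number of distinct blocks.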