[r] support for stream deduplication? #23

Open
xplshn opened this issue Mar 23, 2025 · 5 comments
Labels
enhancement New feature or request

Comments

@xplshn

xplshn commented Mar 23, 2025

Would it be in scope for this project to add stream deduplication to some formats?

It could drastically reduce archive file sizes when the input contains repeated data. Basically, blocks that are identical are marked as such, and only a reference to the first of them is included in the header. Not all formats may be compatible, though.

This would come as a step before compression.
So it could be implemented as a format option, basically "tar+dedup".compressionFormat

More about the technique can be read at https://github.com/klauspost/dedup, which is also a stream-deduplication library.

There's an article explaining everything in great detail here too: https://blog.klauspost.com/fast-stream-deduplication-in-go/

@mholt
Owner

mholt commented Mar 24, 2025

Well, at this project, we are definitely fans of @klauspost's work! 😄

It looks like it would mainly involve wrapping the reader/writer for tar... yeah I dunno how well this would work for zip. But tar, possibly.

A lot of users may not find the memory/size tradeoff beneficial for them.

That said, are you interested in putting together a PoC?

@mholt mholt added the enhancement New feature or request label Mar 24, 2025
@xplshn
Author

xplshn commented Mar 25, 2025

> Well, at this project, we are definitely fans of @klauspost's work! 😄
>
> It looks like it would mainly involve wrapping the reader/writer for tar... yeah I dunno how well this would work for zip. But tar, possibly.
>
> A lot of users may not find the memory/size tradeoff beneficial for them.
>
> That said, are you interested in putting together a PoC?

I don't have the time to do this right now since I'm a student, but I would like to see it in the project; I think it'd be beneficial and add a lot of value. I can try, though :)

@klauspost

Honestly I haven't really been putting too much effort into this, as the interest (probably understandably) is quite low.

I was looking at what a "modern" implementation would look like. I wouldn't use SHA1 by default, but there is no really good option for adaptive block splitting. restic/chunker is similar in speed to the zpaq splitter, which is inherently single-core, per-byte processing.

But let me know if you have any questions, and I will try to remember the best I can :)

@mholt
Owner

mholt commented Mar 25, 2025

Thanks @klauspost !

@xplshn I know how busy it can be as a student. There's no rush. If you find the time, feel free to spike it together and see how it goes! Could be a fun project.

@xplshn
Author

xplshn commented Mar 26, 2025

I was reading through sbase's tar.c because I find it tidy and simple. I'm trying to find a way to implement this so that it stays compatible with other tar implementations.

Any ideas that come to mind would be helpful :)
