s3git snapshots read entire (potentially huge) data files into memory with probability 1 in 64 #21
To reproduce the behavior described above:

```sh
mkdir s3git_test
cd s3git_test
s3git init
# Create a 4GB file
head -c $((4*1024*1024*1024)) /dev/zero > test.bin
# Now create a snapshot
/usr/bin/time -l s3git snapshot create . -m 'BOOM'
```

Under macOS this requires ~17GB of memory to complete:
If you modify the test to produce a file that is 4GB-1 bytes in length, you get a very different result:

```sh
mkdir s3git_test
cd s3git_test
s3git init
# Create a 4GB - 1 file
head -c $((4*1024*1024*1024-1)) /dev/zero > test.bin
# Now create a snapshot
/usr/bin/time -l s3git snapshot create . -m 'boom?'
```

In this case it completes using only 23 MB of memory!
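The 4GB vs 4GB-1 difference follows directly from the length-based prefilter in the snapshot check (discussed further down in this thread). A minimal sketch of that check as I understand it — the function name here is hypothetical:

```go
package main

import "fmt"

// looksLikeDeduped is the length-only prefilter: at least two 64-byte
// hashes, and a total length that is a multiple of 64. Any large file
// whose size happens to be a multiple of 64 (a 1 in 64 chance) passes
// and is then read entirely into memory.
func looksLikeDeduped(size int64) bool {
	return size >= 128 && size&63 == 0
}

func main() {
	fmt.Println(looksLikeDeduped(4 * 1024 * 1024 * 1024)) // 4GB: true -> read into memory
	fmt.Println(looksLikeDeduped(4*1024*1024*1024 - 1))   // 4GB-1: false -> skipped
}
```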
Current behaviour is a "quick" implementation; obviously a streaming approach is the correct way to do this. I will create an issue for this.

Regarding 2): overall I think the upper limit in combination with an increased chunk size is the best approach (and not difficult to implement). With BLAKE2 hashing speeds of up to 1GB/sec/core this should still be pretty performant.
Hi @fwessels, thanks for your response! Having spent some time in the code and thinking about this, I'm curious why the "deduped" files need to be as large/complicated as they are.

I'm sure I haven't yet thought of all of the complexities, but isn't the "root hash" alone sufficient to point into the backing store to be able to retrieve the hydrated file? Other than validating that the deduped file is indeed a deduped file by recreating the root hash, it seems that it would be just as easy to use the root hash to retrieve the chunk hashes from the backing store. So something like:

```sh
s3git snapshot checkout [dir] [commit] --dedup
```

would write a file in the target directory for every file in the snapshot that contains only the root hash, and perhaps some kind of check mechanism to distinguish it from any other 64-byte file (perhaps appended with a hash of the root hash?). Then all deduped files would be small, very inexpensive to check, and wouldn't need to be transmitted from the server. They could simply be created from the equivalent of `s3git snapshot ls [commit]`.

Retrieving the hydrated contents of a deduped file would then only need to add the extra step of retrieving the chunk hashes before catting them together (as `s3git cat [key]` already does).

What am I missing? How does storing/revalidating all of the chunk hashes in the deduped files really help? Thanks!
You are saying: "just as easy to use the root hash to retrieve the chunk hashes from the backing store". This is exactly what is already happening.

So you fetch 18e622….28d7053 (size=64), which gives the pointer to the chunk with the actual data.

So where is the complicated part (other than the fact that an object can be represented in two formats, deduped & hydrated: https://github.com/s3git/s3git/blob/master/BLAKE2.md#hydrated)?
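The two-level lookup described here (root hash → list of chunk hashes → chunk data) can be sketched with a toy in-memory content-addressed store. SHA-256 is just a stand-in for s3git's BLAKE2b keys, and the function names are hypothetical:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// A toy in-memory content-addressed store (key = SHA-256 of value),
// standing in for s3git's BLAKE2b-keyed object storage.
var store = map[[sha256.Size]byte][]byte{}

func put(data []byte) [sha256.Size]byte {
	k := sha256.Sum256(data)
	store[k] = data
	return k
}

// hydrate resolves a root key in two steps: fetch the root object to
// get the list of chunk keys, then fetch and concatenate each chunk.
func hydrate(root [sha256.Size]byte) []byte {
	keys := store[root]
	var out []byte
	for i := 0; i < len(keys); i += sha256.Size {
		var k [sha256.Size]byte
		copy(k[:], keys[i:i+sha256.Size])
		out = append(out, store[k]...)
	}
	return out
}

func main() {
	c1 := put([]byte("chunk one, "))
	c2 := put([]byte("chunk two"))
	root := put(append(c1[:], c2[:]...)) // the "deduped" object: concatenated chunk keys
	fmt.Println(string(hydrate(root)))   // chunk one, chunk two
}
```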
I probably didn't express myself very clearly... When you clone a snapshot, why are deduped files: 1)

and not just: 2)

That way they can be generated and checked in constant time/space. What advantage is there to dedup format 1) over format 2)? They seem equivalent in terms of functionality, but 1) is more work and complexity (as this issue demonstrates) when deduped files are never actually hydrated in the clone.
To be clear, format 1) would still be used for deduped files in the actual repository (under the root hash key), just not in local clones (when under the original filename).
Using 1) you can directly get any portion of the file, e.g. if you need (given 5 MB chunks) the range 10-15 MB, you would get the 3rd hash (0:2) and read this chunk out of object storage. Otherwise you would first need to get the (list of) child hashes and only then be able to read the contents. As for 2): what would CHK be? A checksum?
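The seek arithmetic here is simple given a fixed chunk size — a sketch (the function name is hypothetical):

```go
package main

import "fmt"

const chunkSize = 5 * 1024 * 1024 // s3git's fixed 5 MB chunk size

// chunkRange maps a byte range onto the indices of the chunk hashes
// (within a format-1 deduped file) needed to serve it.
func chunkRange(start, end int64) (first, last int64) {
	return start / chunkSize, (end - 1) / chunkSize
}

func main() {
	// The 10-15 MB range falls entirely within the 3rd chunk (index 2),
	// so a single HTTP-RANGE GET of that one chunk suffices.
	first, last := chunkRange(10*1024*1024, 15*1024*1024)
	fmt.Println(first, last) // 2 2
}
```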
Right, I guess my thought was that if you want all (or part) of the hydrated data, it is no big deal to use the root hash to retrieve that information from the server. Using 1), the chunk hashes are effectively cached locally, saving a round-trip request when you want file data. For small files, that is a negligible amount of data, and for large files, it is still small relative to all of the requests you will need to do to retrieve all of the needed chunks. But 1) comes at the cost of:
Using 2) every dedup file is the same length, and creating dedup clones and commits is extremely cheap (O(n) work). CHK could be:

Basically CHK is something inexpensive (and constant time/memory) to verify that this is a valid, uncorrupted dedup file pointer and not a hydrated file of length 256.

So, 1) makes hydrating deduped files marginally cheaper (saves one access per file) at the time of hydration, but at the cost of performing an extra lookup per file when cloning, a little extra storage, added complexity, and more expensive clone/commit operations. Option 2) gives deduped files of all sizes the same storage and performance on clone and commit that single-chunk files currently have, but at the cost of an extra disk/network request when hydrated file data is needed.

In my opinion, 2) is preferable because it is simpler and performs better in what seems like the most common case for deduped clones: most files will not be hydrated multiple times. If files are hydrated (on average) less than once per clone, then I believe 2) is better in every respect. The only case where 1) is clearly better is when files will be hydrated many times each from a single clone, and commits are relatively infrequent.
Regarding 1): the amount of data would not be so much the issue; it is more that, for every read access to the object, you would first need to get the hashes for the chunks (since you have no place to store the chunk hashes, you would need to do this for every read). Given huge file sizes, a single GET is probably not always what you want, so making more GETs using HTTP-RANGE makes more sense (and you can do this in parallel, massively speeding up the download -- this is similar to a multipart upload, but for a download). The downside is having to rehash the file upon a commit.

Regarding 2): I can certainly see the usefulness here for certain scenarios, but I would not like the checksum to break the "CAS" nature of the content. However, if the checksum were another BLAKE2b hash (of the root hash) in hierarchy mode, then this would be retained. And it would be ideal if you could even differentiate this from the format used in scenario 1). Let me think about that...
Over the weekend, I thought of one other issue with the current dedup format: it does not communicate any information about chunk size. That is, if it is valid to chunk files using something other than the current 5MB default (either a different fixed length, or a variable-length content-defined chunking scheme), then it is not possible to effectively "seek" to the chunk that contains specific data you would like to access. You can't know which chunk it is in without assuming the default fixed chunk size.
Also, I should add, if you aren't familiar with them, you should examine how the Dat and Noms projects are addressing very similar issues. They each have somewhat different goals than s3git (and each is more complex for those reasons), but the fundamental underlying architectures are similar, and the functionality of each of those projects significantly overlaps with s3git. Specifically, each of them structures its data store as a Git-like Merkle tree, each implements Rabin-like content-defined chunking, and each uses content-addressed storage with pluggable backends that include S3. They also each provide some kind of "sparse checkout" functionality like the deduped files in s3git.

The Dat project differs from s3git in that it is less focused on providing a "git-like" workflow to its users (it is more Dropbox/GoogleDrive-like), and it is heavily focused on decentralized peer-to-peer sharing of files rather than cloud-backed repos (although S3 is an optional backend for cloud-based peers). The Noms project is very similar to s3git in that it very much exposes a git-like workflow and anticipates cloud (and specifically S3-backed) hosting, but it adds to that a rich database-like model for the information being stored, which includes the ability to store unstructured binary blobs.

Anyway, I have studied each of these projects in quite a bit of detail, and understand pretty well how they work and why they've made the choices they have. My comments here are informed by that. However, I'm interested in s3git because it is simpler and the project seems more stable, while still meeting the needs of my immediate project. Dat and Noms are each still developing at a rapid pace, and I can't deploy software that is changing so rapidly without maintaining my own stable fork, which would be a lot of work for such ambitious projects with much larger teams behind them. Anyway, that's my motivation here.
Regarding the chunk size, it is correct that this has to be a "known" property. In case it would ever get lost, you could still figure it out by doing a kind of brute-force search to find the correct root hash.
I am familiar with Dat being more of a peer-to-peer system; Noms is a newer project that I haven't studied much.

Especially with content-defined chunking (#20 (comment)) it would make sense (or is even necessary) to be more flexible in the deduped format. Also, regarding a new deduped format, the BLAKE2 NodeDepth maybe offers something to take advantage of (see https://github.com/s3git/bt2sum/blob/master/bt2sum.go#L67). Currently the root hash is computed at NodeDepth = 1; if a CHK were computed over the root hash with NodeDepth = 2 (unused as of yet), you would get a unique hash key, and you would be able to uniquely detect a deduped file. What do you think?
When used to commit snapshots, s3git checks whether each file is already stored in the snapshot deduped format (described here).

The code that performs this check for each file in the snapshot implements a quick "prefilter" using the length of each file:

```go
len >= 128 && (len & 63) == 0
```

Any file that meets these criteria (a 1 in 64 chance) is then read entirely into memory to attempt to verify that the presumed root hash (the last 64 bytes of the file) is the hash of the concatenation of all of the bytes that precede it.

Although it may seem safe to read a valid deduped file into memory like this (an 8TB file will only produce a 100MB deduped file), the problem is that the very purpose of this check is to determine whether this is in fact a deduped file or not. If it is not, then it may be a huge file (e.g. an exactly 8TB file) that meets the length criteria above.

This is obviously bad, and probably two things need to change to fix it.
For safety and stability, the memory used by this process needs to be limited. The use of `ioutil.ReadFile(filename)` (usually a bad code smell) needs to be replaced with functionality that reads the file in chunks, streaming them into the hash calculation, so the memory use is safely bounded.

For performance, some more stringent test needs to be made to inexpensively reject virtually all non-deduped data files, to avoid unnecessarily reading/hashing huge files.

I'm sure there are other possible solutions I haven't immediately thought of...

Anyway, I consider this to be a serious issue that makes the current "snapshot" function unsuitable for any dataset with large-ish files (say... bigger than 5-10% of available RAM).