You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Related to #789, many of the corruption issues stem from reading freelists with bogus values. Most of the time we're catching the corruption, e.g. if the page id is suddenly out of the bounds of the memory mapped file, or it's allocated and free at the same time. We currently cannot detect silent bit flips where we suddenly turn pageid 9 into pageid 8, which then later find their way into a corrupted freelist.
The current serialization mechanism is a very simple run-length encoding of a list of sorted uint64 that denote the page identifier being free/pending. The data is directly written to the memory address of the freelist page(s) as a slice, which is backed by the memory mapped file.
While we assume that we have a bug somewhere that corrupts certain pages in some (virtualized?) environments and crash cases, we should be proactive here and start adding additional safeguards to detect and potentially even correct errors.
Efficiency
The amount of free pages and the size of the individual page ids rarely require the full 64 bit width of the page id type.
Which is great, as this opens up the possibility for varint/vint encoding, which is also done by protobuf [1].
This could save significant amount of space and deserialization time, as the current encoding can only store 512 ids per page, with vint compression we could easily store twice that.
Alternatively, we can encode the free page id as a bit in a sparse bitset like [2], which allows us even further compress this up to about 2048 ids per page.
To start off very simple, we should add an error detection code (e.g. a CRC checksum) to the serialization process, if not even to the Page struct itself to also safeguard the data stored in the btree.
A separate topic is how we're reacting to a page/freelist being invalid based on their checksum, I'm happy to hear your suggestions on this.
Error Correction
This is probably a longer project, yet we should evaluate error correcting codes like Reed-Solomon [3] here as well. This would allow us to reconstruct faulty pages with only a little more space allocated.
Given that we already have a serialization protocol in place, we need to be mindful about backward compatibility.
I propose to reserve the first integer with value math.MaxUint64 to denote the new "schema", which we should start to version properly for any future changes we may want to make. It's unlikely to be used as the real length of a freelist, given current physical memory and disk constraints.
One example encoding I could think of would be:
(PageCount uint16 = 0xFFFF)
(ids[0] aka. Length = math.MaxUint64)
version uint8
< encoded list of ids, varint/bitsets >
checksum uint32/uint64
I would prefer to have a more battle tested and upgradable schema like protobuf, even at the expense of parsing overhead, but I wouldn't want to introduce a dependency that might be conflicting with etcd.
Thanks for reading that far, happy to hear your thoughts.
The text was updated successfully, but these errors were encountered:
Background
Related to #789, many of the corruption issues stem from reading freelists with bogus values. Most of the time we're catching the corruption, e.g. if the page id is suddenly out of the bounds of the memory mapped file, or it's allocated and free at the same time. We currently cannot detect silent bit flips where we suddenly turn pageid 9 into pageid 8, which then later find their way into a corrupted freelist.
The current serialization mechanism is a very simple run-length encoding of a list of sorted
uint64
that denote the page identifier being free/pending. The data is directly written to the memory address of the freelist page(s) as a slice, which is backed by the memory mapped file.While we assume that we have a bug somewhere that corrupts certain pages in some (virtualized?) environments and crash cases, we should be proactive here and start adding additional safeguards to detect and potentially even correct errors.
Efficiency
The amount of free pages and the size of the individual page ids rarely require the full 64 bit width of the page id type.
Which is great, as this opens up the possibility for varint/vint encoding, which is also done by protobuf [1].
This could save significant amount of space and deserialization time, as the current encoding can only store 512 ids per page, with vint compression we could easily store twice that.
Alternatively, we can encode the free page id as a bit in a sparse bitset like [2], which allows us even further compress this up to about 2048 ids per page.
[1] https://protobuf.dev/programming-guides/encoding/#varints
[2] https://github.com/RoaringBitmap/roaring
Error Detection
To start off very simple, we should add an error detection code (e.g. a CRC checksum) to the serialization process, if not even to the
Page
struct itself to also safeguard the data stored in the btree.A separate topic is how we're reacting to a page/freelist being invalid based on their checksum, I'm happy to hear your suggestions on this.
Error Correction
This is probably a longer project, yet we should evaluate error correcting codes like Reed-Solomon [3] here as well. This would allow us to reconstruct faulty pages with only a little more space allocated.
[3] https://github.com/klauspost/reedsolomon
Compatibility
Given that we already have a serialization protocol in place, we need to be mindful about backward compatibility.
I propose to reserve the first integer with value
math.MaxUint64
to denote the new "schema", which we should start to version properly for any future changes we may want to make. It's unlikely to be used as the real length of a freelist, given current physical memory and disk constraints.One example encoding I could think of would be:
I would prefer to have a more battle tested and upgradable schema like protobuf, even at the expense of parsing overhead, but I wouldn't want to introduce a dependency that might be conflicting with etcd.
Thanks for reading that far, happy to hear your thoughts.
The text was updated successfully, but these errors were encountered: