Inefficient file formats #44

Open
scottcarey opened this issue Nov 21, 2019 · 0 comments
Labels
question (Need more discussion)

@scottcarey

It appears that the file formats have a lot of redundancy.

For example, every Record, Tombstone, and Index entry has an individual crc32, a version byte, and a 4-byte record size.

Let's take IndexFileEntry as an example:

/**
 * checksum         - 4 bytes. 
 * version          - 1 byte.
 * Key size         - 1 bytes.
 * record size      - 4 bytes.
 * record offset    - 4 bytes.
 * sequence number  - 8 bytes
 */
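
Summing those fields: 4 + 1 + 1 + 4 + 4 + 8 = 22 bytes of fixed metadata per index entry before the key itself, and the version byte is identical for every entry in the file.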

A few things come to mind.

  1. The file could have a header containing the version number, since it is identical for all entries. Since the file is only read sequentially, and truncated at the first corrupted entry found, the header could also hold the first sequenceNumber, and subsequent values could be stored as deltas relative to it using a variable-length encoding. Record offset is similar -- the values are monotonically increasing and could be delta-encoded as variable-length integers.
  2. As for the checksum, it could be written per small 'block' rather than per record. This would also speed up recovery from a crash: each block could be something like (2-byte size, 8-byte xxHash checksum, then that many bytes of index entries), so validating the file only has to proceed one block at a time until a checksum fails. As long as a block holds at least 3 entries, it saves space. I suspect flushing a block every ~32 entries or 2k bytes (whichever comes first) would work well -- roughly 9% as many bytes spent on checksums, yet chunks small enough that they shouldn't significantly increase the chance of data failing to reach disk before a crash.
  3. Also, unless it is hardware accelerated, crc32 is much slower than xxHash and also more prone to collisions. A rough sketch combining these three points follows.
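
To make this concrete, here is a rough sketch in Java of how such a block could be written. This is only an illustration of the idea, not HaloDB's actual on-disk format; it assumes the xxHash64 implementation from the lz4-java library (net.jpountz.xxhash), and IndexEntry is a hypothetical holder for the same fields IndexFileEntry stores today. The base offset and sequence number would come from the file header described in point 1.

```java
import net.jpountz.xxhash.XXHash64;
import net.jpountz.xxhash.XXHashFactory;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.List;

final class BlockedIndexWriter {

    // Hypothetical stand-in for the fields currently stored per IndexFileEntry.
    static final class IndexEntry {
        final byte[] key;            // key size fits in 1 byte, as today
        final int recordSize;
        final long recordOffset;     // monotonically increasing within a file
        final long sequenceNumber;   // monotonically increasing within a file

        IndexEntry(byte[] key, int recordSize, long recordOffset, long sequenceNumber) {
            this.key = key;
            this.recordSize = recordSize;
            this.recordOffset = recordOffset;
            this.sequenceNumber = sequenceNumber;
        }
    }

    private static final XXHash64 XXHASH = XXHashFactory.fastestInstance().hash64();
    private static final long SEED = 0;

    /**
     * Block layout per point 2: 2-byte payload size, 8-byte xxHash64 of the
     * payload, then the delta/varint-encoded entries (point 1).
     */
    static void writeBlock(OutputStream out, long baseOffset, long baseSequenceNumber,
                           List<IndexEntry> entries) throws IOException {
        ByteArrayOutputStream payload = new ByteArrayOutputStream();
        long prevOffset = baseOffset;          // base values come from the file header
        long prevSeq = baseSequenceNumber;
        for (IndexEntry e : entries) {
            payload.write(e.key.length);                          // 1 byte
            writeVarLong(payload, e.recordSize);                  // varint
            writeVarLong(payload, e.recordOffset - prevOffset);   // delta + varint
            writeVarLong(payload, e.sequenceNumber - prevSeq);    // delta + varint
            payload.write(e.key, 0, e.key.length);
            prevOffset = e.recordOffset;
            prevSeq = e.sequenceNumber;
        }

        byte[] body = payload.toByteArray();
        long checksum = XXHASH.hash(body, 0, body.length, SEED);

        out.write((body.length >>> 8) & 0xFF);                    // 2-byte size
        out.write(body.length & 0xFF);
        for (int shift = 56; shift >= 0; shift -= 8) {            // 8-byte checksum
            out.write((int) (checksum >>> shift) & 0xFF);
        }
        out.write(body);
    }

    // Unsigned LEB128-style varint; the deltas are non-negative because
    // offsets and sequence numbers only increase.
    private static void writeVarLong(OutputStream out, long v) throws IOException {
        while ((v & ~0x7FL) != 0) {
            out.write((int) (v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.write((int) v);
    }
}
```

A reader would verify each block's checksum before decoding it and stop at the first mismatch, which gives the same truncate-at-first-corruption behavior the current per-entry crc32 provides.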
@bellofreedom added the question label on Dec 26, 2019