
Write latest finalized state in chunks #9026

Draft
tbenr wants to merge 4 commits into master from write-latest-finalized-state-in-chunks

Conversation

tbenr
Contributor

@tbenr tbenr commented Jan 22, 2025

  • serializes the beaconState into chunks via SszByteArrayChunksWriter (currently hardcoded to an 8 MiB chunk size)
  • on the DB side, introduces a KvStoreUnchunkedVariable which stores chunks similarly to what we do with columns:
    • the base key (id) -> byte stores a single byte with the number of chunks (max 255 chunks)
    • chunks are stored as (id, chunkIndex) -> chunk bytes

There is automatic backward-compatibility behaviour: if the base (id) -> byte lookup returns something bigger than a single byte, it is treated as one big chunk (essentially the previous storage mode).
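As a rough illustration of the read path under this layout (the accessor interface below is hypothetical and only for illustration; the PR builds on the existing KvStore abstractions):

import java.util.Optional;
import org.apache.tuweni.bytes.Bytes;

// Hypothetical raw accessor, purely to illustrate the key layout described above.
interface RawKvStore {
  Optional<Bytes> get(Bytes key);
}

final class ChunkedVariableReadSketch {
  static Optional<Bytes> read(final RawKvStore db, final Bytes id) {
    final Optional<Bytes> base = db.get(id);
    if (base.isEmpty()) {
      return Optional.empty();
    }
    if (base.get().size() > 1) {
      // Backward compatibility: anything bigger than one byte is the old single-blob format.
      return base;
    }
    final int chunkCount = base.get().get(0) & 0xFF; // base key holds the chunk count (max 255)
    Bytes value = Bytes.EMPTY;
    for (int chunkIndex = 0; chunkIndex < chunkCount; chunkIndex++) {
      final Bytes chunkKey = Bytes.concatenate(id, Bytes.of((byte) chunkIndex)); // (id, chunkIndex)
      value = Bytes.concatenate(value, db.get(chunkKey).orElseThrow());
    }
    return Optional.of(value);
  }
}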

TODO:

  • verify that this actually improves memory allocations (no more single big ~256 MB byte array)
  • implement an iterator when reading chunks
  • trimming: if the variable becomes smaller, the previous extra chunks remain in the DB (we may not care for BeaconState, since it always grows)
  • rocksDB implementation
  • tests

fixes #9018

Documentation

  • I thought about documentation and added the doc-change-required label to this PR if updates are required.

Changelog

  • I thought about adding a changelog entry, and added one if I deemed necessary.

@Nashatyrev
Contributor

Looks like a great idea!

However:

  • When writing, you still keep all chunks in memory while serializing
  • Not sure how you would implement (non-blocking) reading, since chunk boundaries would be at random places?

From my understanding, the ideal approach would be to make SszWriter/Reader asynchronous so they could flush/read chunks on the fly. But I'm not sure how much overhead it would introduce, as we would need to return and handle a SafeFuture on every tiny read/write operation.
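As a purely illustrative example of what such an asynchronous writer could look like (this is not an existing Teku interface, just a hypothetical shape built around SafeFuture):

import org.apache.tuweni.bytes.Bytes;
import tech.pegasys.teku.infrastructure.async.SafeFuture;

// Hypothetical asynchronous SSZ writer that could flush completed chunks on the fly
// instead of buffering the entire serialized state in memory.
interface AsyncSszWriter {
  // Each write returns a future so the backing store can flush a full chunk
  // (and apply back-pressure) before accepting more bytes.
  SafeFuture<Void> write(Bytes bytes);

  // Serialization is complete; flush any partially filled final chunk.
  SafeFuture<Void> complete();
}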

@tbenr
Contributor Author

tbenr commented Jan 23, 2025

@Nashatyrev
I hear you on those points. I did it that way because our current interface (at least towards LevelDB) requires all the data to be in memory for a given transaction. It boils down to the usage of org.iq80.leveldb.WriteBatch.
If we break the write down into multiple transactions it would work, but we lose atomicity.
I'll double-check and will look into whether RocksDB has a way to stream data into transactions.

@Nashatyrev
Contributor

If you only care about atomicity of a single state record, then one option could be to stream all chunks besides the first one and write the first chunk only at the very end. You would then either have committed the first chunk together with all the others, or have no first chunk at all, which would be treated as absence of the state.
Does that make sense?
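A minimal sketch of that write order, mapped onto the layout from the PR description (here the base (id) -> count entry plays the role of the "first chunk"; the put-style accessor is hypothetical, the real code would go through the updater/transaction APIs):

import java.util.List;
import org.apache.tuweni.bytes.Bytes;

// Hypothetical raw accessor, for illustration only.
interface MutableRawKvStore {
  void put(Bytes key, Bytes value);
}

final class FirstChunkLastSketch {
  static void write(final MutableRawKvStore db, final Bytes id, final List<Bytes> chunks) {
    // Stream the chunk entries first: without the base entry they are never read.
    for (int i = 0; i < chunks.size(); i++) {
      db.put(Bytes.concatenate(id, Bytes.of((byte) i)), chunks.get(i));
    }
    // Publish the chunk count under the base key only at the very end; a crash before
    // this point leaves no base entry, which reads back as absence of the state.
    db.put(id, Bytes.of((byte) chunks.size()));
  }
}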

@tbenr
Contributor Author

tbenr commented Jan 23, 2025

Well, it is not only about atomicity when writing the state: we update that state within a bigger transaction, which contains several updates.

try (final HotUpdater updater = hotUpdater()) {
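  // ... several related updates (blocks, checkpoints, the latest finalized state, ...)
  // are applied through the updater here and committed together as one batch
}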

@Nashatyrev
Contributor

Could it be generalized to a rule like this: if there are KvStoreChunkedVariables updates in a batch, then all of them are written first (without their first chunks), and only the first chunks are included in the final batch.
Do you think that would work?

@tbenr
Contributor Author

tbenr commented Jan 24, 2025

Could it be generalized to a rule like this: if there are KvStoreChunkedVariables updates in a batch, then all of them are written first (without their first chunks), and only the first chunks are included in the final batch.
Do you think that would work?

This would work from the atomicity standpoint, but it won't be transactional. Once we start writing there won't be any rollback, i.e. if the client shuts down in the middle of the sequence we lose the previous state too, and the client won't be able to recover.

An approach could be to have two versions of a given variable (let's say A and B) plus an additional meta value tracking which one is the "current" (updated only after all chunks have been written), ping-ponging between A and B on each write.
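Roughly, with hypothetical key names and the chunked writes elided:

import java.util.Optional;
import org.apache.tuweni.bytes.Bytes;

// Illustration of the A/B "ping-pong" idea: a tiny meta entry selects the current copy
// and is flipped only after the inactive copy has been fully rewritten.
final class PingPongVariableSketch {
  interface MutableRawKvStore {
    Optional<Bytes> get(Bytes key);
    void put(Bytes key, Bytes value);
  }

  private static final Bytes A = Bytes.of((byte) 'A');
  private static final Bytes B = Bytes.of((byte) 'B');

  static void write(final MutableRawKvStore db, final Bytes variableId, final Bytes value) {
    final Bytes metaKey = Bytes.concatenate(variableId, Bytes.of((byte) 'm'));
    final Bytes current = db.get(metaKey).orElse(A);
    final Bytes inactive = current.equals(A) ? B : A;
    // 1. Write all chunks of the new value under the inactive slot (possibly many writes).
    db.put(Bytes.concatenate(variableId, inactive), value); // chunked writes elided for brevity
    // 2. Flip the meta entry last; readers always follow it to a fully written copy.
    db.put(metaKey, inactive);
  }
}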

Doable (maybe complicated?), but if we end up moving away from LevelDB and focusing on RocksDB again as the primary DB, it could be a waste of time.

@tbenr tbenr force-pushed the write-latest-finalized-state-in-chunks branch from f9100dd to 16e0b33 on January 27, 2025 at 10:35
@Nashatyrev
Contributor

Oh, right! I forgot it's the DB variable.
I'm sure there is a suitable solution, but yeah, it could make things complicated.

@tbenr
Contributor Author

tbenr commented Jan 27, 2025

The other problem is that we currently assume the data in the DB (hot and finalized) is consistent, because we update it in one big transaction (updating tables and variables all together).
If we split it into multiple transactions we could break this consistency, and thus break those assumptions.

@Nashatyrev
Contributor

Nashatyrev commented Jan 28, 2025

I would suggest the following approach for chunked variable transactions:

  1. You write and commit data chunks with keys hash_root(new_data) + chunk_idx
  2. Execute a batch (transaction) which:
    • deletes the old chunks referenced by hash_root(old_data)
    • sets the chunked variable reference to hash_root(new_data)
    • applies other transactional changes

Looks pretty atomic and transactional to me 🤔

The only drawback would be orphaned data left in the DB if the sequence (steps 1+2) is aborted in the middle. The same could happen with concurrent transactions. But that could be neglected since it would be a rare occurrence, imo.
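A rough sketch of that two-step scheme, with hypothetical accessor/batch types and the chunking itself elided:

import java.util.List;
import org.apache.tuweni.bytes.Bytes;
import org.apache.tuweni.bytes.Bytes32;

// Chunks are content-addressed by hash_root, so they can be written ahead of time;
// the atomic batch only swaps the variable reference and cleans up the old chunks.
final class ContentAddressedChunksSketch {
  interface RawKvStore {
    void put(Bytes key, Bytes value);
    Batch startBatch();
  }

  interface Batch {
    Batch delete(Bytes key);
    Batch put(Bytes key, Bytes value);
    void commit();
  }

  static void update(
      final RawKvStore db,
      final Bytes variableKey,
      final Bytes32 oldRoot,
      final int oldChunkCount,
      final Bytes32 newRoot,
      final List<Bytes> newChunks) {
    // Step 1: write the new chunks outside the batch, keyed by hash_root(new_data) + chunk_idx.
    for (int i = 0; i < newChunks.size(); i++) {
      db.put(Bytes.concatenate(newRoot, Bytes.of((byte) i)), newChunks.get(i));
    }
    // Step 2: one atomic batch deletes the old chunks, re-points the variable reference,
    // and carries whatever other changes belong to the same transaction.
    final Batch batch = db.startBatch();
    for (int i = 0; i < oldChunkCount; i++) {
      batch.delete(Bytes.concatenate(oldRoot, Bytes.of((byte) i)));
    }
    batch.put(variableKey, newRoot);
    // ... other transactional changes would be added to the same batch here
    batch.commit();
  }
}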

@tbenr
Contributor Author

tbenr commented Jan 28, 2025

Yeah, I like that model, and it could be extended to values in columns too.
I'd like to dig into RocksDB. If it isn't possible to stream data into it either, this could be a nice overall improvement.

@tbenr
Contributor Author

tbenr commented Jan 28, 2025

We could even think about writing those chunks directly as files on the FS and storing the full path to them as values (instead of the actual data). This would turn the DB into just an index.

We could easily abstract this as a KvStoreSerializer.

This would even let us offload blobs to the FS, pretty transparently.
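As a toy illustration of that direction (purely hypothetical; not modelled on the existing KvStoreSerializer interface, and with error handling kept minimal):

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Stores the payload as a file on disk and keeps only its path in the DB value,
// effectively turning the DB entry into an index.
final class FileOffloadingSerializerSketch {
  private final Path baseDir;

  FileOffloadingSerializerSketch(final Path baseDir) {
    this.baseDir = baseDir;
  }

  byte[] serialize(final String name, final byte[] value) {
    try {
      final Path file = Files.write(baseDir.resolve(name), value);
      return file.toString().getBytes(StandardCharsets.UTF_8); // the DB stores only the path
    } catch (final IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  byte[] deserialize(final byte[] storedPath) {
    try {
      return Files.readAllBytes(Path.of(new String(storedPath, StandardCharsets.UTF_8)));
    } catch (final IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}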
