forky: Fixed Chunk Data Size Store #2017

janos · 2019-12-05T19:03:31Z

This PR changes the on-disk persistence structure for chunk data. It is based on a previous experiment, https://github.com/janos/forky, where multiple approaches were tried out. The most performant one was chosen and added to swarm as the storage/fcds package.

This PR also includes the changes required for the new storage to be used and, optionally, for existing data to be migrated manually.

FCDS is integrated into storage/localstore without breaking changes to the package API. That involves:

removing usage of retrievalDataIndex, except to provide export functionality for older db schema
replacing retrievalAccessIndex with metaIndex that contains storage timestamp, access timestamp and bin id
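
To picture what a single metaIndex value might carry, here is a minimal sketch. The type name indexMeta, its field names and the 24-byte big-endian layout are assumptions made for illustration only; the PR text states just that the value contains a storage timestamp, an access timestamp and a bin id.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// indexMeta is a hypothetical value stored under a chunk address.
// Field names and layout are assumptions, not the localstore encoding.
type indexMeta struct {
	StoreTimestamp  int64  // when the chunk was stored
	AccessTimestamp int64  // when the chunk was last accessed
	BinID           uint64 // bin id of the chunk
}

// indexMetaSize is the fixed width of the encoded value in this sketch.
const indexMetaSize = 24

// encode packs the three fields into a fixed-width byte slice.
func (m indexMeta) encode() []byte {
	b := make([]byte, indexMetaSize)
	binary.BigEndian.PutUint64(b[0:8], uint64(m.StoreTimestamp))
	binary.BigEndian.PutUint64(b[8:16], uint64(m.AccessTimestamp))
	binary.BigEndian.PutUint64(b[16:24], m.BinID)
	return b
}

// decodeIndexMeta is the inverse of encode.
func decodeIndexMeta(b []byte) (indexMeta, error) {
	if len(b) != indexMetaSize {
		return indexMeta{}, errors.New("invalid meta length")
	}
	return indexMeta{
		StoreTimestamp:  int64(binary.BigEndian.Uint64(b[0:8])),
		AccessTimestamp: int64(binary.BigEndian.Uint64(b[8:16])),
		BinID:           binary.BigEndian.Uint64(b[16:24]),
	}, nil
}

func main() {
	m := indexMeta{StoreTimestamp: 100, AccessTimestamp: 200, BinID: 42}
	got, err := decodeIndexMeta(m.encode())
	fmt.Println(got == m, err) // true <nil>
}
```

Keeping all three fields in one value means a single index lookup answers what previously needed separate indexes, which is the apparent motivation for the replacement.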

At the roundtable discussion, it was decided to have an optional manual data migration. To achieve this,
instructions with steps are printed when the new swarm version is started with the older data format. LocalStore
migrations are adjusted to provide this functionality. Some issues in the getMigrations function were discovered;
they are fixed in this PR and covered by additional tests. Schema name variables are now unexported,
and names from the legacy ldbstore are removed from migrations as they are not needed there.

In order for migration to be complete, import and export localstore functions needed to handle pinning
information. This functionality is added.

Measurements

Measurements were performed locally on a 4-core MacBook Pro (mid 2014).
Every run used a clean swarm data directory and uploaded files with random data.

Local test1 - 1GB size - 5x speedup

time ./swarm.master up test1
c4e105675ab10acc907ac4e966aa0359eb0accc5fe5bd03dc15d8161e5fa1dda
./swarm.master up test1  0.98s user 1.85s system 1% cpu 3:52.39 total

time ./swarm.fcds up test1
c4e105675ab10acc907ac4e966aa0359eb0accc5fe5bd03dc15d8161e5fa1dda
./swarm.forky up test1  1.01s user 1.94s system 6% cpu 46.629 total

Local test4 - 4GB size - 6.5x speedup

time ./swarm.master up test4
b2c1bae070933e5c46d1f839340e2ea33c77469e1c8210691cbe0ed79b211506
./swarm.master up test4  3.91s user 7.37s system 0% cpu 26:27.17 total

time ./swarm.fcds up test4
b2c1bae070933e5c46d1f839340e2ea33c77469e1c8210691cbe0ed79b211506
./swarm.forky up test4  4.30s user 8.50s system 5% cpu 4:06.79 total
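
The speedup figures in the headings can be checked from the wall-clock totals reported by time. The helper below is only an arithmetic sketch; the function name speedup is hypothetical and the inputs are the totals copied from the output above, converted to seconds.

```go
package main

import "fmt"

// speedup returns how many times faster the second run is,
// given both wall-clock durations in seconds.
func speedup(beforeSec, afterSec float64) float64 {
	return beforeSec / afterSec
}

func main() {
	// test1: 3:52.39 on master versus 46.629 with fcds.
	fmt.Printf("test1: %.1fx\n", speedup(3*60+52.39, 46.629)) // test1: 5.0x
	// test4: 26:27.17 on master versus 4:06.79 with fcds.
	fmt.Printf("test4: %.1fx\n", speedup(26*60+27.17, 4*60+6.79)) // test4: 6.4x
}
```

The second ratio comes out at roughly 6.4, which the heading rounds to 6.5x.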

Smoke tests on cluster

Smoke tests were run multiple times for validation and to measure performance. However, the performance
gain depends on the number of CPUs that the ec2 node has and on how many swarm nodes are running on the same ec2 node.

These are results from running one swarm node on a 2-core c5.large: https://snapshot.raintank.io/dashboard/snapshot/dD0JQruCpHpOjvKqwR6YN3ay3GmONLyr.
If there are two swarm processes on the same node, upload speed is about half. It is noticeable that
garbage collection influences performance, and this is the area where further adjustments can be made.

janos · 2019-12-10T13:15:55Z

cmd/swarm/db.go

@@ -168,10 +167,6 @@ func dbImport(ctx *cli.Context) {
 }

 func openLDBStore(path string, basekey []byte) (*localstore.DB, error) {
- if _, err := os.Stat(filepath.Join(path, "CURRENT")); err != nil {


This check is removed to allow dbImport to create the localstore directory if it does not exist. This simplifies the migration steps by removing the need to start a swarm node between export and import, as was required in the v0.5.0 release.

zelig · 2019-12-11T11:02:13Z

storage/fcds/fcds.go

+ shards map[uint8]*os.File // relations with shard id and a shard file
+ shardsMu map[uint8]*sync.Mutex // mutex for every shard file
+ meta MetaStore // stores chunk offsets
+ free map[uint8]struct{} // which shards have free offsets


i would also just have a shardCount length bool array here
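
A minimal sketch of what this suggestion could look like follows. The shardCount value of 32, the freeShards type and its method names are assumptions for illustration; the diff above only shows that the PR tracks shards with free offsets in a map[uint8]struct{}.

```go
package main

import (
	"fmt"
	"sync"
)

// shardCount is an assumed value chosen for this sketch.
const shardCount = 32

// freeShards records which shards have free offsets using a
// fixed-length bool array instead of a map.
type freeShards struct {
	mu   sync.RWMutex
	free [shardCount]bool
}

// mark sets whether the shard has free offsets.
// Shard ids outside the array are ignored.
func (f *freeShards) mark(shard uint8, hasFree bool) {
	if int(shard) >= shardCount {
		return
	}
	f.mu.Lock()
	f.free[shard] = hasFree
	f.mu.Unlock()
}

// has reports whether the shard is known to have free offsets.
func (f *freeShards) has(shard uint8) bool {
	if int(shard) >= shardCount {
		return false
	}
	f.mu.RLock()
	ok := f.free[shard]
	f.mu.RUnlock()
	return ok
}

func main() {
	var f freeShards
	f.mark(3, true)
	fmt.Println(f.has(3), f.has(4)) // true false
}
```

An array avoids map allocation and hashing, at the cost of fixing the number of shards at compile time.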

zelig · 2019-12-11T18:02:00Z

storage/fcds/offsetcache.go

+
+// offsetCache is a simple cache of offset integers
+// by shard files.
+type offsetCache struct {


i am a bit lost here. what do we need this for?

This is the caching of free offsets in Store. It reduces the number of calls to MetaStore.FreeOffset in Store.Put to avoid disk i/o.
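
As a rough picture of such a cache, the sketch below keeps a set of free offsets per shard in memory. Only the type name offsetCache comes from the diff; the method names, the -1 sentinel for "nothing cached" and the locking are assumptions of this sketch.

```go
package main

import (
	"fmt"
	"sync"
)

// offsetCache keeps free offsets per shard in memory so that
// a store can reuse them without asking the meta store.
type offsetCache struct {
	mu sync.RWMutex
	m  map[uint8]map[int64]struct{}
}

func newOffsetCache() *offsetCache {
	return &offsetCache{m: make(map[uint8]map[int64]struct{})}
}

// get returns any cached free offset for the shard, or -1 if none is cached.
func (c *offsetCache) get(shard uint8) int64 {
	c.mu.RLock()
	defer c.mu.RUnlock()
	for o := range c.m[shard] {
		return o
	}
	return -1
}

// set records that the offset in the shard is free.
func (c *offsetCache) set(shard uint8, offset int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.m[shard] == nil {
		c.m[shard] = make(map[int64]struct{})
	}
	c.m[shard][offset] = struct{}{}
}

// remove forgets the offset once it has been reused.
func (c *offsetCache) remove(shard uint8, offset int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.m[shard], offset)
}

func main() {
	c := newOffsetCache()
	fmt.Println(c.get(1)) // -1
	c.set(1, 4096)
	fmt.Println(c.get(1)) // 4096
	c.remove(1, 4096)
	fmt.Println(c.get(1)) // -1
}
```

In this arrangement a delete would call set, and a put would call get followed by remove, falling back to the meta store only when get returns -1.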

zelig · 2019-12-13T07:57:44Z

storage/fcds/fcds.go

+func (s *Store) getOffset(shard uint8) (offset int64, reclaimed bool, err error) {
+ if !s.shardHasFreeOffsets(shard) {
+ // shard does not have free offset
+ return -1, false, err


err -> nil preferred

Thanks, yes of course, my mistake.
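
The point of the comment is that returning the named result err on a path where it is always nil hides the intent. A sketch of the corrected shape is below; the store type, its freeOffset field (standing in for MetaStore.FreeOffset) and the handling after the first check are assumptions, since the diff shows only the first lines of the function.

```go
package main

import "fmt"

// store is a reduced stand-in for the fcds Store in this sketch.
type store struct {
	free       map[uint8]struct{}
	freeOffset func(shard uint8) (int64, error) // stands in for MetaStore.FreeOffset
}

func (s *store) shardHasFreeOffsets(shard uint8) bool {
	_, ok := s.free[shard]
	return ok
}

// getOffset returns a reclaimable offset in the shard.
// If the offset is less than 0, no free offsets are available.
func (s *store) getOffset(shard uint8) (offset int64, reclaimed bool, err error) {
	if !s.shardHasFreeOffsets(shard) {
		// Explicit nil: nothing has failed on this path.
		return -1, false, nil
	}
	offset, err = s.freeOffset(shard)
	if err != nil {
		return 0, false, err
	}
	if offset < 0 {
		// The shard ran out of free offsets; remember that.
		delete(s.free, shard)
		return -1, false, nil
	}
	return offset, true, nil
}

func main() {
	s := &store{
		free:       map[uint8]struct{}{1: {}},
		freeOffset: func(uint8) (int64, error) { return 8192, nil },
	}
	fmt.Println(s.getOffset(0)) // -1 false <nil>
	fmt.Println(s.getOffset(1)) // 8192 true <nil>
}
```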

janos · 2019-12-13T14:11:33Z

I am confirming that smoke tests pass with the current state of this PR https://snapshot.raintank.io/dashboard/snapshot/BDyvSKICIk5a0AbAjXxBeJfgvEvvpiqP. These results are from a deployment of 3 swarm pods per c5.large ec2 node. The first two failures occurred because the smoke test job started before all swarm nodes in the cluster had started.

pradovic

LGTM, very nice 👏 I left a few minor comments, mostly cosmetic ones.

pradovic · 2019-12-18T10:03:45Z

storage/fcds/fcds.go

+// If offset is less then 0, no free offsets are available.
+func (s *Store) getOffset(shard uint8) (offset int64, reclaimed bool, err error) {
+ if !s.shardHasFreeOffsets(shard) {
+ // shard does not have free offset


I would maybe reduce comments in this function as the function comment itself explains the behavior and the code itself with private function names are 100% self explanatory. So I think having only the code makes it even more readable, imo.

pradovic · 2019-12-18T10:15:43Z

storage/fcds/fcds.go

+ }()
+ select {
+ case <-done:
+ case <-time.After(15 * time.Second):


Maybe a debug log here?
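
One possible shape for this is sketched below. The helper name waitOrTimeout, the demoTimeout constant and the use of the standard log package are assumptions; swarm has its own logging package, and the diff shows only the select with a 15 second time.After.

```go
package main

import (
	"log"
	"time"
)

const (
	closeTimeout = 15 * time.Second      // value taken from the diff above
	demoTimeout  = 50 * time.Millisecond // short value used only by this demo
)

// waitOrTimeout runs wait in a goroutine and returns once it finishes
// or the timeout elapses. It logs and reports true on timeout.
func waitOrTimeout(wait func(), timeout time.Duration) (timedOut bool) {
	done := make(chan struct{})
	go func() {
		wait()
		close(done)
	}()
	select {
	case <-done:
		return false
	case <-time.After(timeout):
		log.Printf("timed out after %s waiting for pending operations", timeout)
		return true
	}
}

func main() {
	// Finishes immediately, so nothing is logged.
	waitOrTimeout(func() {}, closeTimeout)

	// Blocks until released, so the timeout branch logs.
	block := make(chan struct{})
	waitOrTimeout(func() { <-block }, demoTimeout)
	close(block)
}
```

Without the log line, a shutdown that gives up after the timeout is indistinguishable from a clean one.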

pradovic · 2019-12-18T10:24:38Z

storage/fcds/fcds.go

+ default:
+ }
+ s.wg.Add(1)
+ return s.wg.Done, nil


Why not have a private Store.done() function instead of returning done function?

The main reason is to make it more obvious that done must be called in order to finish the protection, and that it is related only to the protect function. But for this small codebase, I think that it is not so significant. Is it ok to call the new method unprotect instead of done?

Both make sense to me, I agree with your point as well, so go for what feels better for you. If you decide to make it a private function unprotect sounds good to me.

The usage is nicer with a private method, with one less line. Thanks.
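
The agreed variant could look roughly like the sketch below. The store type, errClosed and the get method are hypothetical, and the mutex around the quit check is an addition of this sketch so that Add can never race with Wait; the diff shows only the select on a quit channel followed by wg.Add(1).

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errClosed = errors.New("store closed")

// store is a reduced stand-in used to show protect and unprotect.
type store struct {
	mu   sync.Mutex
	wg   sync.WaitGroup
	quit chan struct{}
}

func newStore() *store {
	return &store{quit: make(chan struct{})}
}

// protect registers a running operation unless the store is closed.
func (s *store) protect() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	select {
	case <-s.quit:
		return errClosed
	default:
	}
	s.wg.Add(1)
	return nil
}

// unprotect marks the operation started by protect as finished.
func (s *store) unprotect() {
	s.wg.Done()
}

// get shows the call pattern: one line to protect, one deferred unprotect.
func (s *store) get() error {
	if err := s.protect(); err != nil {
		return err
	}
	defer s.unprotect()
	return nil
}

// close stops new operations and waits for running ones.
func (s *store) close() {
	s.mu.Lock()
	close(s.quit)
	s.mu.Unlock()
	s.wg.Wait()
}

func main() {
	s := newStore()
	fmt.Println(s.get()) // <nil>
	s.close()
	fmt.Println(s.get()) // store closed
}
```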

pradovic · 2019-12-18T10:27:40Z

storage/fcds/mem/mem.go

+func (s *MetaStore) Get(addr chunk.Address) (m *fcds.Meta, err error) {
+ s.mu.RLock()
+ m = s.meta[string(addr)]
+ s.mu.RUnlock()


defer? Not needed functionally, just a subjective cosmetic suggestion :)

Defer comes with a cost, and I wanted the best possible performance for measurements. In general, I would agree, but here defer contributes a not insignificant relative cost.

Makes sense 👍
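
For readers who want to compare the two styles themselves, here is a small sketch. The meta and metaStore types are simplified stand-ins and timeIt is a hypothetical helper. Note that Go 1.14 and later reduced the overhead of defer substantially, so numbers on a current toolchain may differ from what was observed when this PR was written.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// meta is a simplified stand-in for fcds.Meta.
type meta struct {
	Offset int64
	Size   uint16
}

type metaStore struct {
	mu   sync.RWMutex
	meta map[string]*meta
}

// getExplicit unlocks by hand, as in the diff above.
func (s *metaStore) getExplicit(addr []byte) *meta {
	s.mu.RLock()
	m := s.meta[string(addr)]
	s.mu.RUnlock()
	return m
}

// getDeferred is the variant suggested in the review.
func (s *metaStore) getDeferred(addr []byte) *meta {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.meta[string(addr)]
}

// timeIt runs f n times and returns the elapsed wall-clock time.
func timeIt(n int, f func()) time.Duration {
	start := time.Now()
	for i := 0; i < n; i++ {
		f()
	}
	return time.Since(start)
}

func main() {
	addr := []byte{1, 2, 3}
	s := &metaStore{meta: map[string]*meta{string(addr): &meta{Offset: 4096, Size: 100}}}
	const n = 1000000
	fmt.Println("explicit:", timeIt(n, func() { s.getExplicit(addr) }))
	fmt.Println("deferred:", timeIt(n, func() { s.getDeferred(addr) }))
}
```

A loop like this is only a rough indication; a testing.B benchmark would give more stable numbers.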

pradovic · 2019-12-18T10:28:10Z

storage/fcds/mem/mem.go

+// Reclaimed flag denotes that the chunk is at the place of
+// already deleted chunk, not appended to the end of the file.
+func (s *MetaStore) Set(addr chunk.Address, shard uint8, reclaimed bool, m *fcds.Meta) (err error) {
+ s.mu.Lock()


defer unlock?

Defer comes with a cost, and I wanted to have as good as possible performance for measurements.

pradovic · 2019-12-18T10:31:02Z

storage/fcds/offsetcache.go

+
+// offsetCache is a simple cache of offset integers
+// by shard files.
+type offsetCache struct {


Would expiry be helpful to guard against memory "leak"? Is there a situation where the cache items are not removed for a long time?

Good point. In the context of offsetCache, this is certainly a possibility, and would be a good improvement. In the wider context, removal depends on MetaStore.FreeOffset function, which is injected and cannot be trusted for any implementation. I will add basic TTL.

I have added a basic TTL and cleanup, but I am not sure that it is actually needed while adding a bit more complexity and locking.

There is a case where offsetCache may hold offsets for longer periods when there is no activity (uploads, retrievals or syncing). In that case expiration reduces the optimisation that offsetCache is here for, and I would not add it.

Maybe @zelig would like to share opinion on this subject also?

I see, makes sense. I do not think that a small TTL is needed, maybe a 24h one or something similar, just to prevent the cache from growing forever, but I am not sure it is needed at all either. Do we expect to have this same cache for the whole time the app is running? Sorry if it's a stupid question 🙈 If yes, and if the app is supposed to be running for long periods of time (like multiple days), then it might be good to clean it up from time to time.

I have reverted the TTL feature after a brief discussion. We have concluded that if there were a cache leak, then a much bigger problem would have caused it, so the additional complexity is most likely not needed.

This reverts commit 26f6626.

santicomp2014

This is a very good improvement.
I want to test it in a test cluster that we setup.
LGTM

janos · 2019-12-19T14:28:47Z

Thanks, @santicomp2014. I have updated this branch with the current master and pushed docker image to janos/swarm:fcds, for testing.

acud

great work on this PR @janos!

a few comments from my side:

different chunk sizes - will we need different shards for different chunk sizes? how much effort are we talking in migrating this implementation to support different chunk size? i think that adding the infrastructure to this right now would save us some trouble down the road
shard count may change for various reasons and i think we should not tie getShard function output to this number
max chunk size is not enforced and gracefully truncates data
if we want to accommodate for a hypothetical future feature that reconstructs localstore from the actual data on the shard files (something which is a low hanging fruit with this kind of a design, and allows us to reconstruct the db from an inconsistent state) but also accommodate variable chunk size - we cannot since the shards do not encode chunk size data. this is just a sidemark, i'm sure you are well aware of this
traditionally i am not so happy about manual migrations but i'm gonna put a sock in it

in any case i would prefer to have this in the next release just to give us some leeway to play around with this some more

<3

acud · 2020-01-08T07:52:23Z

storage/fcds/fcds.go

+// Interface specifies methods required for FCDS implementation.
+// It can be used where alternative implementations are needed to
+// switch at runtime.
+type Interface interface {


do we really want to call an interface Interface? maybe Storer?

I took the inspiration from sort.Interface, but Storer is a good name. Will change, thanks.

acud · 2020-01-08T09:40:56Z

storage/localstore/export.go

+ case <-ctx.Done():
+ }
+ }
+ continue


this means we continue although there were errors, right?

Actually, this behaviour should be fixed in the whole goroutine where the error is not nil. If there is an error, goroutine should return. I will fix that.
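
The fix described here, returning from the goroutine as soon as an error occurs instead of continuing, can be sketched generically as below. The produce and exportAll functions, loadDemo and errBroken are hypothetical names; this is not the localstore export code, only the control-flow pattern.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

var errBroken = errors.New("broken chunk")

// produce loads each key and sends the data on the returned channel.
// On the first error it reports the error and returns instead of continuing.
func produce(ctx context.Context, keys []string, load func(string) ([]byte, error)) (<-chan []byte, <-chan error) {
	dataC := make(chan []byte)
	errC := make(chan error, 1)
	go func() {
		defer close(dataC)
		for _, k := range keys {
			data, err := load(k)
			if err != nil {
				select {
				case errC <- err:
				case <-ctx.Done():
				}
				return // terminate the goroutine on error
			}
			select {
			case dataC <- data:
			case <-ctx.Done():
				return
			}
		}
	}()
	return dataC, errC
}

// exportAll consumes everything produce sends and returns the
// number of items received together with the first error, if any.
func exportAll(keys []string, load func(string) ([]byte, error)) (n int, err error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	dataC, errC := produce(ctx, keys, load)
	for range dataC {
		n++
	}
	select {
	case err = <-errC:
	default:
	}
	return n, err
}

// loadDemo fails for the key "bad" and succeeds otherwise.
func loadDemo(key string) ([]byte, error) {
	if key == "bad" {
		return nil, errBroken
	}
	return []byte(key), nil
}

func main() {
	n, err := exportAll([]string{"a", "bad", "c"}, loadDemo)
	fmt.Println(n, err) // 1 broken chunk
}
```

With continue in place of return, the third key would still be exported after the failure, which is the behaviour the reviewer questioned.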

janos

Thank you @acud for the review. It is very helpful to see where the implementation is weak.

different chunk sizes - will we need different shards for different chunk sizes? how much effort are we talking in migrating this implementation to support different chunk size? i think that adding the infrastructure to this right now would save us some trouble down the road

I am not sure that there is a decision to have variable chunk sizes. This highly influences this and probably other components that deal with the chunks.

shard count may change for various reasons and i think we should not tie getShard function output to this number

I am open for suggestions.
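
To illustrate the concern, suppose the shard were derived from the last byte of the chunk address modulo the shard count. That mapping is an assumption made for this sketch and is not taken from the PR. Under it, changing the shard count sends the same address to a different shard file, so existing data would no longer be found where it was written.

```go
package main

import "fmt"

// getShard is a hypothetical mapping from address to shard:
// the last address byte modulo the number of shards.
func getShard(addr []byte, shardCount uint8) uint8 {
	if len(addr) == 0 || shardCount == 0 {
		return 0
	}
	return addr[len(addr)-1] % shardCount
}

func main() {
	addr := []byte{0xc4, 0xe1, 20}
	fmt.Println(getShard(addr, 32)) // 20
	fmt.Println(getShard(addr, 16)) // 4
}
```

One way around this, again only as an idea, would be to record the shard for each chunk in the meta store at write time and to use the computed value only for new writes.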

max chunk size is not enforced and gracefully truncates data

This is a great find. Thanks.
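
A minimal sketch of enforcing the limit before writing follows. The constant value 4096 and the names maxChunkDataSize, errChunkTooLarge, encodeSlot and slotOffset are assumptions for illustration; the fixed-size slot layout follows from the name of the store, but the actual code may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// maxChunkDataSize is an assumed slot size for this sketch.
const maxChunkDataSize = 4096

var errChunkTooLarge = errors.New("chunk data exceeds maximal size")

// encodeSlot returns a fixed-size slot holding data, or an error
// if data does not fit. Rejecting avoids silent truncation.
func encodeSlot(data []byte) ([]byte, error) {
	if len(data) > maxChunkDataSize {
		return nil, errChunkTooLarge
	}
	slot := make([]byte, maxChunkDataSize)
	copy(slot, data)
	return slot, nil
}

// slotOffset is the byte offset of the slot with the given index
// in a shard file made of fixed-size slots.
func slotOffset(index int64) int64 {
	return index * maxChunkDataSize
}

func main() {
	_, err := encodeSlot(make([]byte, maxChunkDataSize+1))
	fmt.Println(err) // chunk data exceeds maximal size
	slot, _ := encodeSlot([]byte("hello"))
	fmt.Println(len(slot), slotOffset(3)) // 4096 12288
}
```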

if we want to accommodate for a hypothetical future feature that reconstructs localstore from the actual data on the shard files (something which is a low hanging fruit with this kind of a design, and allows us to reconstruct the db from an inconsistent state) but also accommodate variable chunk size - we cannot since the shards do not encode chunk size data. this is just a sidemark, i'm sure you are well aware of this

Yes, and the name of the store reflects the assumption that there is a fixed maximal chunk data size, constant in time.

traditionally i am not so happy about manual migrations but i'm gonna put a sock in it

This is just as we did a year ago when we introduced a new localstore (I think that you implemented the migration), but without an external link to the migration steps.

in any case i would prefer to have this in the next release just to give us some leeway to play around with this some more

I agree.

acud

LGTM @janos. Thanks for addressing all of my comments

acud · 2020-01-20T08:05:44Z

also please do not merge until the next release is out

janos · 2020-01-20T10:20:31Z

@acud could you reject this PR until the next release is released, to block merging that way? :)

jmozah · 2020-03-08T18:40:53Z

storage/fcds/fcds.go

+ }
+
+ for _, sh := range s.shards {
+ if err := sh.f.Close(); err != nil {


Keeping the fd open for a long duration without mmap is a disaster in the making. I am not sure if go file flush() does a os level fsync(). If it is, then we should flush() at regular intervals to save the contents against crash.

Nice observation. Go os.File.Sync() flushes the content to the disk. I am not sure which flush() you are referring to.

Flushing at (regular) intervals is what the operating system is already doing. I am not sure how this would help against a crash unless we fsync on every write, which is quite costly. I have already tested fsync on every write and it makes fcds much slower, even compared to go-leveldb, as go-leveldb does not fsync at all.

Mmap brings its own complexity, especially on different operating systems.

Thanks for the reply.

Go os.File.Sync() flushes the content to the disk. I am not sure which flush() you are referring to.

Apologies if i confused you. There are two things...

flush(), which flushes the application write buffers to OS.

fsync(), the OS level sync which absolutely makes sure that buffers have gone to disk.

The first one is done by golang itself as pointed out by you. The second one is the one i am concerned about.

Flushing at (regular) intervals is what the operating system is already doing

It is usually done by OS disk drivers whenever they feel it is okay. fsync() is expensive, as you pointed out. To avoid fsyncing on every commit, DBs usually implement a WAL which is fsynced at very small regular intervals in the background, so that even if some data is lost, the loss covers only a very short duration. If you fsync in the foreground (query path) every time, that will be very expensive.

even compared to go-leveldb, as go-leveldb does not fsync at all.

leveldb takes another strategy by creating a larger mmap file than required and writing to it directly using memcopy; the mmap driver then takes care of writing the dirty pages to the disk.

All i am saying is, one way or other, if we want to survive a crash and not end up with corrupted files on bootup, we have to implement this. Otherwise I am sure we can expect some gibberish files when you switch off power abruptly.

Mmap brings its own complexity, especially on different operating systems.

Yes. We should not reinvent the wheel.
As a remedy... we have two options

Use already existing DB which takes care of this ( we can talk about badger if you like)

In Forky, do a File.Sync() every X seconds (where we can tolerate X seconds of data loss) in a background thread for all files.

I found this interesting read on this topic
https://www.joeshaw.org/dont-defer-close-on-writable-files/

Thanks for explanations.

Use already existing DB which takes care of this ( we can talk about badger if you like)

I have already tried badger and it is slower than go-leveldb. Maybe you can find a way to do it more efficiently. And still, it does not scale with more cpu cores.

In Forky, do a File.Sync() every X seconds (where we can tolerate X seconds of data loss) in a background thread for all files.

I am not sure that this would actually help us, as the os is already doing that at some frequency. Also, as you pointed out, the solution would be to implement a WAL, which I already did in some basic form, with, of course, some performance penalties.

As far as I can see go-leveldb is not using mmap.

I see your point as very valid, but I think that your suggestions should be tested, even the ones that I already did, to revalidate. We can talk about possibilities, but it would be good to actually test resilience for the weak points that you described.

As this is a very important part of the swarm I am more in favour not to merge this PR until we are clear that it is reliable. Currently, I see that it creates more problems than it brings benefits.

I made a few changes in the badger configuration and ran the TestLevelDBForky and TestBadgerSuite test cases. These are the results.

The iteration in Badger is somewhat slow... so i commented out the Iteration and did these test cases.. with an assumption that write/read/delete is more important and not iteration.

There are more write improvements i can make in badger.. like batching the Writes and so on.. but for now.. i will leave it like this...

Thanks, but I cannot say anything from the screenshots except to compare timings and conclude that TestBadgerSuite writes are slower. It would be very helpful to actually see the changes and what is TestBadgerSuite doing.

Look at the last 2 commits in my fork https://github.com/jmozah/forky/commits/master
The only one that matters is SyncWrites = false, which makes Forky and badger an apples-to-apples comparison in my opinion. Other configs are more experimental.

janos · 2020-03-10T20:35:44Z

Based on recent discussions and decision to go for a more reliable and in general better approach with using Badger for chunk data storage, I am closing this PR. Thank you all for investing time to improve, test and review this PR, and I am sorry for a long attempt that was so late identified as not the best approach to improve storage performance. At least the high level design, localstore integration and migration part could be reused.

jmozah · 2020-03-11T13:24:45Z

Please don't delete the branch. This should be used to test performance with the badger.

janos · 2020-03-11T13:55:41Z

@jmozah Of course. I deliberately did not delete the branch.

tfalencar · 2020-03-22T14:15:08Z

Based on recent discussions and decision to go for a more reliable and in general better approach with using Badger for chunk data storage, I am closing this PR. Thank you all for investing time to improve, test and review this PR, and I am sorry for a long attempt that was so late identified as not the best approach to improve storage performance. At least the high level design, localstore integration and migration part could be reused.

Hi, could you please just summarize why the approach was abandoned? Not everyone is able to join in these discussions, so it would be nice to have it documented here.

storage/localstore: integrate fcds

a2d5edc

janos added the in progress label Dec 5, 2019

janos self-assigned this Dec 5, 2019

janos added 6 commits December 6, 2019 12:57

storage/{fcds,localstore}: add comments and minor adjustments

2506b88

storage/fcds: add doc.go

c8f3622

cmd/swarm, storage/localstore: support breaking migrations

ec14da3

cmd/swarm, storage/localstore: improve migrations, export and import

e352b88

storage/localstore: export pins

c8c16e1

Merge branch 'master' into fcds

66aa7fd

janos commented Dec 10, 2019

janos requested review from jmozah, zelig, pradovic and acud December 10, 2019 15:04

janos added ready for review and removed in progress labels Dec 10, 2019

zelig suggested changes Dec 11, 2019

storage/{fcds,localstore}: address Viktor's comments

7aed3b1

zelig approved these changes Dec 13, 2019

storage/fcds: correctly return explicit nil in getOffset

26bcb48

storage/fcds: add WithCache optional argument to New constructor

1e94680

janos requested a review from nolash December 16, 2019 15:19

pradovic approved these changes Dec 18, 2019

janos added 3 commits December 18, 2019 13:02

storage/fcds: address most of Petar's comments

db658c7

storage/fcds: add offsetCache ttl

26f6626

Revert "storage/fcds: add offsetCache ttl"

5be3c25

This reverts commit 26f6626.

santicomp2014 approved these changes Dec 19, 2019

Merge branch 'master' into fcds

c222011

acud approved these changes Jan 8, 2020

janos commented Jan 14, 2020

janos added 6 commits January 14, 2020 17:32

storage/fcds: rename fcds.Interface to fcds.Storer

661a7f5

storage/fcds: improve some commenting

f51c6d8

storage/localstore: improve comment in the Import method

91fc21f

storage/localstore: improve migrateDiwali migration message

60e3938

storage/fcds: ensure that chunk data is no longer the the max value

c542f32

storage/localstore: terminate import goroutine in case of errors

0fc5e3a

janos requested a review from acud January 17, 2020 12:01

acud approved these changes Jan 20, 2020

janos added 5 commits March 5, 2020 15:36

storage/localstore: do not put existing chunks

842f7d8

storage/fcds/test: correctly handle storage path

d9341b8

strage/fcds: check if chunk exists before it is put

0f13d3b

storage/fcds: add and use MetaStore.Has

4b6f726

storage/fcds: optimize locking

39d328a

jmozah reviewed Mar 8, 2020

janos closed this Mar 10, 2020

acud reopened this Mar 24, 2020

acud changed the title ~~Fixed Chunk Data Size Store~~ forky: Fixed Chunk Data Size Store Mar 24, 2020
