Enhance the test to reproduce data corruption issues #568
Comments
@fuweid are you interested in working on this? The goal is to try to reproduce the data corruption issues. My high level thought is:
We can discuss the details at the next meeting. Note: the goal is to reproduce the data corruption issues rather than to verify correctness. Reference: https://github.com/dsrhaslab/lazyfs cc @tjungblu
Sure! Let me take this one.
Hi @ahrtr, I tried using a dm-flakey device to drop writes at the block layer and cause data corruption.
$ go test -c -o /tmp/test ./datacorruption/boltdb && sudo /tmp/test -test.v
=== RUN TestDropWritesDuringBench
main_test.go:56: Init empty bbolt database with 128 MiB
main_test.go:60: Ensure the empty boltdb data persisted in the flakey device
main_test.go:63: Start to run bbolt-bench
main_test.go:85: Drop all the write IOs after 3 seconds
main_test.go:89: Let bbolt-bench run with DropWrites mode for 3 seconds
main_test.go:92: Start to allow all the write IOs for 2 seconds
main_test.go:96: Kill the bbolt process and simulate power failure
main_test.go:101: Invoke bbolt check to verify data
main_test.go:103:
Error Trace: /home/fuwei/workspace/go-dmflakey/contrib/datacorruption/boltdb/main_test.go:103
Error: Received unexpected error:
exit status 2
Test: TestDropWritesDuringBench
Messages: bbolt check output: panic: invalid freelist page: 0, page type is unknown<00>
goroutine 1 [running]:
go.etcd.io/bbolt.(*freelist).read(0x0?, 0x0?)
/home/fuwei/workspace/bbolt/freelist.go:270 +0x199
go.etcd.io/bbolt.(*DB).loadFreelist.func1()
/home/fuwei/workspace/bbolt/db.go:400 +0xc5
sync.(*Once).doSlow(0xc0001301c0?, 0x584020?)
/usr/local/go/src/sync/once.go:74 +0xc2
sync.(*Once).Do(...)
/usr/local/go/src/sync/once.go:65
go.etcd.io/bbolt.(*DB).loadFreelist(0xc000130000?)
/home/fuwei/workspace/bbolt/db.go:393 +0x47
go.etcd.io/bbolt.Open({0x7ffc60ce9530, 0x38}, 0x670060?, 0xc00005fc18)
/home/fuwei/workspace/bbolt/db.go:275 +0x44f
main.(*checkCommand).Run(0xc000137e58, {0xc0000161a0, 0x1, 0x1})
/home/fuwei/workspace/bbolt/cmd/bbolt/main.go:212 +0x1e5
main.(*Main).Run(0xc00005ff40, {0xc000016190?, 0xc0000061a0?, 0x200000003?})
/home/fuwei/workspace/bbolt/cmd/bbolt/main.go:124 +0x4d4
main.main()
/home/fuwei/workspace/bbolt/cmd/bbolt/main.go:62 +0xae
--- FAIL: TestDropWritesDuringBench (8.29s)
FAIL
I think that if LazyFS can drop writes silently, it will be perfect for this simulation.
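For readers unfamiliar with dm-flakey, here is a minimal sketch of how a drop_writes table line is laid out, following the kernel's documented dm-flakey format (`<dev> <offset> <up interval> <down interval> [<num_features> <feature args>]`). The helper name `flakeyTable`, the `/dev/loop0` path, and the 0/180 second intervals are illustrative assumptions, not necessarily what the go-dmflakey test uses.

```go
package main

import (
	"fmt"
	"strings"
)

// flakeyTable builds a device-mapper table line for the dm-flakey target:
//
//	<start> <length> flakey <dev> <offset> <up> <down> [<num_features> <args...>]
//
// Lengths are in 512-byte sectors and intervals are in seconds.
// Each element of featureArgs is one argument word (e.g. "drop_writes").
// NOTE: hypothetical helper for illustration; not part of go-dmflakey.
func flakeyTable(dev string, sectors uint64, upSec, downSec int, featureArgs ...string) string {
	line := fmt.Sprintf("0 %d flakey %s 0 %d %d", sectors, dev, upSec, downSec)
	if len(featureArgs) > 0 {
		line += fmt.Sprintf(" %d %s", len(featureArgs), strings.Join(featureArgs, " "))
	}
	return line
}

func main() {
	// 128 MiB expressed in 512-byte sectors, matching the database size in the log.
	const sectors = (128 << 20) / 512
	// Assumed device path and intervals: never "up", "down" for 180 seconds.
	fmt.Println(flakeyTable("/dev/loop0", sectors, 0, 180, "drop_writes"))
}
```

A line like this would typically be loaded with `dmsetup`; while the device is in its down interval with drop_writes set, writes are silently discarded and reads still succeed, which is what makes the loss invisible to the process until it reopens the file.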
@fuweid thanks for the test case.
We can discuss this further at Monday's meeting.
Confirmed that the
I added logging in #616 and also updated your test case in fuweid/go-dmflakey#1. So this appears to be an issue with the test case (specifically, with the flaky filesystem) rather than with bbolt.
Hi @ahrtr, the original case is meant to simulate data loss at the device level. The DropWrites failpoint drops all the data submitted by
I was trying to use the following script in a pod:
run-bench-test &
sleep random-seconds
echo b > /proc/sysrq-trigger
I ran the pod as a DaemonSet in a Kubernetes cluster, but it could not reproduce the data corruption. I think we can add a failpoint between
We can sync on the details in tomorrow's meeting.
@fuweid can you add the test case TestDropWritesDuringBench (possibly under a different name) to bbolt?
@ahrtr Hi, I am still working on this and will file a pull request when it's ready.
See etcd-io/etcd#16596 (comment)