Enhance the test to reproduce data corruption issues #568

ahrtr · 2023-09-15T08:34:39Z

See etcd-io/etcd#16596 (comment)

ahrtr · 2023-11-06T15:11:48Z

@fuweid are you interested in working on this? The goal is to try to reproduce the data corruption issues.

My high-level idea is:

1. Implement a simple example application that concurrently reads and writes a bbolt db file (see concurrent_test.go for reference).
2. Intentionally inject failpoints (e.g. forcibly kill the process, lazyFS split_write, etc.) to crash the application.
3. Check whether the db file is corrupted.

We can discuss the details at the next meeting.

Note: the goal is to reproduce the data corruption issues, not to verify correctness.

Reference: https://github.com/dsrhaslab/lazyfs

cc @tjungblu

fuweid · 2023-11-06T15:43:30Z

Sure! Let me take this one.

fuweid · 2023-11-18T10:19:23Z

Hi @ahrtr, I tried using a dm-flakey device with drop_writes at the block layer to cause data corruption.
The test case reproduces the data corruption reliably.

$ go test -c -o /tmp/test ./datacorruption/boltdb && sudo /tmp/test -test.v
=== RUN   TestDropWritesDuringBench
    main_test.go:56: Init empty bbolt database with 128 MiB
    main_test.go:60: Ensure the empty boltdb data persisted in the flakey device
    main_test.go:63: Start to run bbolt-bench
    main_test.go:85: Drop all the write IOs after 3 seconds
    main_test.go:89: Let bbolt-bench run with DropWrites mode for 3 seconds
    main_test.go:92: Start to allow all the write IOs for 2 seconds
    main_test.go:96: Kill the bbolt process and simulate power failure
    main_test.go:101: Invoke bbolt check to verify data
    main_test.go:103:
                Error Trace:    /home/fuwei/workspace/go-dmflakey/contrib/datacorruption/boltdb/main_test.go:103
                Error:          Received unexpected error:
                                exit status 2
                Test:           TestDropWritesDuringBench
                Messages:       bbolt check output: panic: invalid freelist page: 0, page type is unknown<00>

                                goroutine 1 [running]:
                                go.etcd.io/bbolt.(*freelist).read(0x0?, 0x0?)
                                        /home/fuwei/workspace/bbolt/freelist.go:270 +0x199
                                go.etcd.io/bbolt.(*DB).loadFreelist.func1()
                                        /home/fuwei/workspace/bbolt/db.go:400 +0xc5
                                sync.(*Once).doSlow(0xc0001301c0?, 0x584020?)
                                        /usr/local/go/src/sync/once.go:74 +0xc2
                                sync.(*Once).Do(...)
                                        /usr/local/go/src/sync/once.go:65
                                go.etcd.io/bbolt.(*DB).loadFreelist(0xc000130000?)
                                        /home/fuwei/workspace/bbolt/db.go:393 +0x47
                                go.etcd.io/bbolt.Open({0x7ffc60ce9530, 0x38}, 0x670060?, 0xc00005fc18)
                                        /home/fuwei/workspace/bbolt/db.go:275 +0x44f
                                main.(*checkCommand).Run(0xc000137e58, {0xc0000161a0, 0x1, 0x1})
                                        /home/fuwei/workspace/bbolt/cmd/bbolt/main.go:212 +0x1e5
                                main.(*Main).Run(0xc00005ff40, {0xc000016190?, 0xc0000061a0?, 0x200000003?})
                                        /home/fuwei/workspace/bbolt/cmd/bbolt/main.go:124 +0x4d4
                                main.main()
                                        /home/fuwei/workspace/bbolt/cmd/bbolt/main.go:62 +0xae
--- FAIL: TestDropWritesDuringBench (8.29s)
FAIL

I think if lazyFS could drop writes silently, it would be perfect for this simulation.
We can discuss this at the next meeting.
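
For readers unfamiliar with dm-flakey, here is a rough sketch of how a device-mapper table line with the drop_writes feature can be assembled. The field order follows the kernel's dm-flakey documentation (`<start> <sectors> flakey <dev> <offset> <up interval> <down interval> [<num_features> <feature args>]`). The function name, the `/dev/loop0` device and the interval values are illustrative assumptions, not what go-dmflakey actually does.

```go
package main

import (
	"fmt"
	"strings"
)

// flakeyTable builds a device-mapper table line for the "flakey" target.
// Each element of features must be a single argument word, because the
// table carries the number of feature words before the words themselves.
// Hypothetical helper for illustration only.
func flakeyTable(dev string, sectors uint64, upSec, downSec uint, features ...string) string {
	table := fmt.Sprintf("0 %d flakey %s 0 %d %d", sectors, dev, upSec, downSec)
	if len(features) > 0 {
		table += fmt.Sprintf(" %d %s", len(features), strings.Join(features, " "))
	}
	return table
}

func main() {
	// 128 MiB expressed in 512-byte sectors; device and intervals are illustrative.
	const sectors = 128 * 1024 * 1024 / 512
	fmt.Println(flakeyTable("/dev/loop0", sectors, 0, 3, "drop_writes"))
}
```

With drop_writes, writes issued during the down interval are silently discarded while the caller still sees success, which is why neither the write nor the sync reports an error.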

ahrtr · 2023-11-18T14:59:17Z

@fuweid thx for the test case.

I can reproduce the same error using your test case.
After applying a simple patch (the first commit), the issue no longer reproduces. I think the key point is that the semantics of the syscall.Fdatasync system call are not respected. For more details, see fuweid/go-dmflakey#1 (Update test case TestDropWritesDuringBench to stop bbolt before resuming the failpoint).

We can discuss further at Monday's meeting.

ahrtr · 2023-11-19T13:19:06Z

Confirmed that neither file.WriteAt nor fdatasync ever fails when running your test case.

bbolt/tx.go, line 457 in 33db274:

if _, err := tx.db.ops.writeAt(buf, offset); err != nil {

bbolt/tx.go, line 511 in 33db274:

if _, err := tx.db.ops.writeAt(buf, int64(p.Id())*int64(tx.db.pageSize)); err != nil {

bbolt/bolt_linux.go, line 9 in 33db274:

return syscall.Fdatasync(int(db.file.Fd()))

I added logging in #616 and also updated your test case in fuweid/go-dmflakey#1. So the problem should be in the test case (specifically, in the flaky filesystem) rather than in bbolt.

# go test -v
=== RUN   TestDropWritesDuringBench
    main_test.go:57: Init empty bbolt database with 128 MiB
    main_test.go:61: Ensure the empty boltdb data persisted in the flakey device
    main_test.go:64: Start to run bbolt-bench
    main_test.go:92: Drop all the write IOs after 3 seconds
    main_test.go:96: Let bbolt-bench run with DropWrites mode for 3 seconds
    main_test.go:99: Start to allow all the write IOs for 2 seconds
    main_test.go:103: Kill the bbolt process and simulate power failure
    main_test.go:86: ####### bbolt output: 
         work: /tmp/TestDropWritesDuringBench3215672155/001/root/boltdb
        starting write benchmark.
        Completed 910 requests, 909/s 
        Completed 1805 requests, 894/s 
        Completed 2705 requests, 899/s 
        Completed 79925 requests, 77209/s 
        Completed 154430 requests, 74501/s 
        Completed 226100 requests, 71659/s 
        Completed 227845 requests, 1744/s 
        Completed 228710 requests, 864/s 
    main_test.go:115: Invoke bbolt check to verify data: "/tmp/boltdb"
    main_test.go:117: 
        	Error Trace:	/root/go/src/github.com/fuweid/go-dmflakey/contrib/datacorruption/boltdb/main_test.go:117
        	Error:      	Received unexpected error:
        	            	exit status 2
        	Test:       	TestDropWritesDuringBench
        	Messages:   	bbolt check output: panic: invalid freelist page: 0, page type is unknown<00>
        	            	
        	            	goroutine 1 [running]:
        	            	go.etcd.io/bbolt.(*freelist).read(0x0?, 0x0?)
        	            		/root/go/src/github.com/ahrtr/bbolt/freelist.go:270 +0x179
        	            	go.etcd.io/bbolt.(*DB).loadFreelist.func1()
        	            		/root/go/src/github.com/ahrtr/bbolt/db.go:400 +0xb8
        	            	sync.(*Once).doSlow(0xc0001341c0?, 0x581580?)
        	            		/root/software/go/src/sync/once.go:74 +0xbf
        	            	sync.(*Once).Do(...)
        	            		/root/software/go/src/sync/once.go:65
        	            	go.etcd.io/bbolt.(*DB).loadFreelist(0xc000134000?)
        	            		/root/go/src/github.com/ahrtr/bbolt/db.go:393 +0x45
        	            	go.etcd.io/bbolt.Open({0x7fff4ee2c62a, 0xb}, 0x677960?, 0xc00005fbd0)
        	            		/root/go/src/github.com/ahrtr/bbolt/db.go:275 +0x425
        	            	main.(*checkCommand).Run(0xc00013be18, {0xc000016140, 0x1, 0x1})
        	            		/root/go/src/github.com/ahrtr/bbolt/cmd/bbolt/main.go:212 +0x1df
        	            	main.(*Main).Run(0xc00005ff00, {0xc000016130?, 0xc000006340?, 0x200000003?})
        	            		/root/go/src/github.com/ahrtr/bbolt/cmd/bbolt/main.go:124 +0x469
        	            	main.main()
        	            		/root/go/src/github.com/ahrtr/bbolt/cmd/bbolt/main.go:62 +0xa6
--- FAIL: TestDropWritesDuringBench (8.35s)
FAIL
exit status 1
FAIL	github.com/fuweid/go-dmflakey/contrib/datacorruption/boltdb	8.350s

fuweid · 2023-11-19T14:30:02Z

Hi @ahrtr, the original case is meant to simulate data loss at the device level. The DropWrites failpoint drops all data submitted by fdatasync or fsync. It is like a power failure that causes committed data to be lost.

I was trying to use the following script in a pod:

run-bench-test &
sleep random-seconds
echo b > /proc/sysrq-trigger

I ran the pod as a DaemonSet in a Kubernetes cluster, but it could not reproduce the data corruption.
So I injected a failpoint at the device level to drop the writes, which seems to change one byte in the data.

I think we can add a failpoint between writeAt and fdatasync so that we can observe the recovery after a power failure.

We can sync on the details in tomorrow's meeting.

ahrtr · 2023-11-22T10:29:10Z

@fuweid can you add the test case TestDropWritesDuringBench (possibly under a different name) to bbolt?

fuweid · 2023-11-22T11:20:50Z

@ahrtr Hi, I am still working on this. I will file a pull request when it's ready.

ahrtr added the area/testing label Sep 15, 2023

ahrtr added the priority/important label Oct 18, 2023

ahrtr changed the title ~~Introduce lazyFS into bbolt test~~ Enhance the test to reproduce data corruption issues Nov 6, 2023

fuweid self-assigned this Nov 6, 2023

ahrtr mentioned this issue Nov 18, 2023

Update test case TestDropWritesDuringBench to stop bbolt before resuming the failpoint fuweid/go-dmflakey#1

Closed

ahrtr mentioned this issue Nov 19, 2023

Add log when file.WriteAt or syscall.Fdatasync fails #616

Closed

fuweid mentioned this issue Nov 25, 2023

tests/robustness: init with powerfailure case #622

Merged

github-actions bot added the stale label Apr 16, 2024

ahrtr removed the stale label May 10, 2024

github-actions bot added the stale label Aug 9, 2024

serathius mentioned this issue Oct 4, 2024

after restart, bbolt db failed to get all reachable pages #778

Open
