Enhance the test to reproduce data corruption issues #568

Open
ahrtr opened this issue Sep 15, 2023 · 8 comments

Comments

@ahrtr
Member

ahrtr commented Sep 15, 2023

See etcd-io/etcd#16596 (comment)

@ahrtr
Member Author

ahrtr commented Nov 6, 2023

@fuweid are you interested in working on this? The goal is to try to reproduce the data corruption issues.

My high-level thought is:

  • Implement a simple example application that concurrently reads and writes a bbolt db file (see concurrent_test.go for reference);
  • Intentionally inject failpoints (e.g. forcible kill, lazyFS split_write, etc.) to crash the application;
  • Check whether the db file is corrupted.

We can discuss the details at the next meeting.

Note: the goal is to reproduce the data corruption issues rather than to verify correctness.

Reference: https://github.com/dsrhaslab/lazyfs
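
The crash-injection step listed above could look roughly like the sketch below. It only covers the "forcibly kill" case, and it uses the standard library alone: `runAndKill` and `demoDelay` are hypothetical names, and `sleep 30` stands in for the example application that would read and write the bbolt db file. A real test would then run a checker against the db file.

```go
package main

import (
	"fmt"
	"os/exec"
	"time"
)

// demoDelay is an arbitrary run time before the crash is injected.
const demoDelay = 100 * time.Millisecond

// runAndKill starts the workload, lets it run for d, then kills it so that it
// gets no chance to flush or clean up (Kill sends SIGKILL on Unix).
// It returns true if the process was still running when it was killed.
func runAndKill(d time.Duration, name string, args ...string) (bool, error) {
	cmd := exec.Command(name, args...)
	if err := cmd.Start(); err != nil {
		return false, fmt.Errorf("start workload: %w", err)
	}

	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()

	select {
	case err := <-done:
		// The workload exited on its own before the crash was injected.
		return false, err
	case <-time.After(d):
		if err := cmd.Process.Kill(); err != nil {
			return false, fmt.Errorf("kill workload: %w", err)
		}
		<-done // reap the child; its error only reports the kill signal
		return true, nil
	}
}

func main() {
	// "sleep 30" is a placeholder for the (hypothetical) bbolt workload binary.
	killed, err := runAndKill(demoDelay, "sleep", "30")
	fmt.Println("killed:", killed, "err:", err)
	// Next step in a real test: check whether the db file is corrupted.
}
```

Killing the process only loses data that was still in the application's memory. Data already handed to the kernel survives, which is why the plan also lists filesystem-level fault injection.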

cc @tjungblu

ahrtr changed the title from "Introduce lazyFS into bbolt test" to "Enhance the test to reproduce data corruption issues" on Nov 6, 2023
@fuweid
Member

fuweid commented Nov 6, 2023

Sure! Let me take this one.

fuweid self-assigned this on Nov 6, 2023
@fuweid
Member

fuweid commented Nov 18, 2023

Hi @ahrtr, I tried using a dm-flakey device with drop_writes at the block layer to cause data corruption.
The test case reproduces the data corruption reliably.

$ go test -c -o /tmp/test ./datacorruption/boltdb && sudo /tmp/test -test.v
=== RUN   TestDropWritesDuringBench
    main_test.go:56: Init empty bbolt database with 128 MiB
    main_test.go:60: Ensure the empty boltdb data persisted in the flakey device
    main_test.go:63: Start to run bbolt-bench
    main_test.go:85: Drop all the write IOs after 3 seconds
    main_test.go:89: Let bbolt-bench run with DropWrites mode for 3 seconds
    main_test.go:92: Start to allow all the write IOs for 2 seconds
    main_test.go:96: Kill the bbolt process and simulate power failure
    main_test.go:101: Invoke bbolt check to verify data
    main_test.go:103:
                Error Trace:    /home/fuwei/workspace/go-dmflakey/contrib/datacorruption/boltdb/main_test.go:103
                Error:          Received unexpected error:
                                exit status 2
                Test:           TestDropWritesDuringBench
                Messages:       bbolt check output: panic: invalid freelist page: 0, page type is unknown<00>

                                goroutine 1 [running]:
                                go.etcd.io/bbolt.(*freelist).read(0x0?, 0x0?)
                                        /home/fuwei/workspace/bbolt/freelist.go:270 +0x199
                                go.etcd.io/bbolt.(*DB).loadFreelist.func1()
                                        /home/fuwei/workspace/bbolt/db.go:400 +0xc5
                                sync.(*Once).doSlow(0xc0001301c0?, 0x584020?)
                                        /usr/local/go/src/sync/once.go:74 +0xc2
                                sync.(*Once).Do(...)
                                        /usr/local/go/src/sync/once.go:65
                                go.etcd.io/bbolt.(*DB).loadFreelist(0xc000130000?)
                                        /home/fuwei/workspace/bbolt/db.go:393 +0x47
                                go.etcd.io/bbolt.Open({0x7ffc60ce9530, 0x38}, 0x670060?, 0xc00005fc18)
                                        /home/fuwei/workspace/bbolt/db.go:275 +0x44f
                                main.(*checkCommand).Run(0xc000137e58, {0xc0000161a0, 0x1, 0x1})
                                        /home/fuwei/workspace/bbolt/cmd/bbolt/main.go:212 +0x1e5
                                main.(*Main).Run(0xc00005ff40, {0xc000016190?, 0xc0000061a0?, 0x200000003?})
                                        /home/fuwei/workspace/bbolt/cmd/bbolt/main.go:124 +0x4d4
                                main.main()
                                        /home/fuwei/workspace/bbolt/cmd/bbolt/main.go:62 +0xae
--- FAIL: TestDropWritesDuringBench (8.29s)
FAIL

I think that if lazyFS can drop writes silently, it will be perfect for the simulation.
We can discuss this at the next meeting.
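
For readers unfamiliar with dm-flakey, the sketch below builds the device-mapper table line that enables drop_writes, following the table format in the kernel's dm-flakey documentation. `flakeyTable` is a hypothetical helper and not the go-dmflakey API. The device path, sector count, and intervals are illustrative values, not the ones used in the test above.

```go
package main

import (
	"fmt"
	"strings"
)

// flakeyTable builds a dm-flakey table line in the form
//
//	<start> <len> flakey <dev> <offset> <up interval> <down interval> [<num_features> <feature args>]
//
// The device is reliable for upSec seconds, then the features apply for
// downSec seconds. Each entry in features must be a single argument word,
// because num_features counts words.
func flakeyTable(dev string, sectors uint64, upSec, downSec uint, features ...string) string {
	table := fmt.Sprintf("0 %d flakey %s 0 %d %d", sectors, dev, upSec, downSec)
	if len(features) > 0 {
		table += fmt.Sprintf(" %d %s", len(features), strings.Join(features, " "))
	}
	return table
}

func main() {
	// 262144 sectors of 512 bytes is 128 MiB (an assumed device size).
	fmt.Println(flakeyTable("/dev/loop0", 262144, 3, 3, "drop_writes"))
	// prints "0 262144 flakey /dev/loop0 0 3 3 1 drop_writes"
}
```

A line like this can be passed to `dmsetup create <name> --table "<line>"`. With drop_writes active, write I/O is silently ignored while reads are served normally, so the application sees no error.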

@ahrtr
Member Author

ahrtr commented Nov 18, 2023

@fuweid thanks for the test case.

We can discuss it further at Monday's meeting.

@ahrtr
Member Author

ahrtr commented Nov 19, 2023

Confirmed that neither file.WriteAt nor fdatasync ever fails when running your test case.

I added logging in #616 and also updated your test case in fuweid/go-dmflakey#1. So it should be an issue with the test case (specifically, with the flaky filesystem) rather than with bbolt.

# go test -v
=== RUN   TestDropWritesDuringBench
    main_test.go:57: Init empty bbolt database with 128 MiB
    main_test.go:61: Ensure the empty boltdb data persisted in the flakey device
    main_test.go:64: Start to run bbolt-bench
    main_test.go:92: Drop all the write IOs after 3 seconds
    main_test.go:96: Let bbolt-bench run with DropWrites mode for 3 seconds
    main_test.go:99: Start to allow all the write IOs for 2 seconds
    main_test.go:103: Kill the bbolt process and simulate power failure
    main_test.go:86: ####### bbolt output: 
         work: /tmp/TestDropWritesDuringBench3215672155/001/root/boltdb
        starting write benchmark.
        Completed 910 requests, 909/s 
        Completed 1805 requests, 894/s 
        Completed 2705 requests, 899/s 
        Completed 79925 requests, 77209/s 
        Completed 154430 requests, 74501/s 
        Completed 226100 requests, 71659/s 
        Completed 227845 requests, 1744/s 
        Completed 228710 requests, 864/s 
    main_test.go:115: Invoke bbolt check to verify data: "/tmp/boltdb"
    main_test.go:117: 
        	Error Trace:	/root/go/src/github.com/fuweid/go-dmflakey/contrib/datacorruption/boltdb/main_test.go:117
        	Error:      	Received unexpected error:
        	            	exit status 2
        	Test:       	TestDropWritesDuringBench
        	Messages:   	bbolt check output: panic: invalid freelist page: 0, page type is unknown<00>
        	            	
        	            	goroutine 1 [running]:
        	            	go.etcd.io/bbolt.(*freelist).read(0x0?, 0x0?)
        	            		/root/go/src/github.com/ahrtr/bbolt/freelist.go:270 +0x179
        	            	go.etcd.io/bbolt.(*DB).loadFreelist.func1()
        	            		/root/go/src/github.com/ahrtr/bbolt/db.go:400 +0xb8
        	            	sync.(*Once).doSlow(0xc0001341c0?, 0x581580?)
        	            		/root/software/go/src/sync/once.go:74 +0xbf
        	            	sync.(*Once).Do(...)
        	            		/root/software/go/src/sync/once.go:65
        	            	go.etcd.io/bbolt.(*DB).loadFreelist(0xc000134000?)
        	            		/root/go/src/github.com/ahrtr/bbolt/db.go:393 +0x45
        	            	go.etcd.io/bbolt.Open({0x7fff4ee2c62a, 0xb}, 0x677960?, 0xc00005fbd0)
        	            		/root/go/src/github.com/ahrtr/bbolt/db.go:275 +0x425
        	            	main.(*checkCommand).Run(0xc00013be18, {0xc000016140, 0x1, 0x1})
        	            		/root/go/src/github.com/ahrtr/bbolt/cmd/bbolt/main.go:212 +0x1df
        	            	main.(*Main).Run(0xc00005ff00, {0xc000016130?, 0xc000006340?, 0x200000003?})
        	            		/root/go/src/github.com/ahrtr/bbolt/cmd/bbolt/main.go:124 +0x469
        	            	main.main()
        	            		/root/go/src/github.com/ahrtr/bbolt/cmd/bbolt/main.go:62 +0xa6
--- FAIL: TestDropWritesDuringBench (8.35s)
FAIL
exit status 1
FAIL	github.com/fuweid/go-dmflakey/contrib/datacorruption/boltdb	8.350s

@fuweid
Member

fuweid commented Nov 19, 2023

Hi @ahrtr, the original case is meant to simulate data loss at the device level. The DropWrites failpoint drops all the data submitted by fdatasync or fsync. It behaves like a power failure and causes committed data to be lost.

I tried running the following script in a pod:

run-bench-test &
sleep random-seconds
echo b > /proc/sysrq-trigger

I ran the pod as a DaemonSet in a Kubernetes cluster, but it couldn't reproduce the data corruption.
So I injected a failpoint at the device level to drop the writes, which seems to change one byte in the data.

I think we can add a failpoint between writeAt and fdatasync so that we can observe the recovery after a power failure.

We can sync on the details in tomorrow's meeting.
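
The failpoint idea above could be sketched as follows. This is not bbolt's code: `commit`, `errCrash`, and `simulate` are hypothetical names, and `os.File.Sync` stands in for fdatasync to keep the example portable. In a real test the hook would kill the process or cut the device instead of returning an error.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// errCrash marks a commit that was aborted by the failpoint.
var errCrash = errors.New("failpoint: crash between write and sync")

// commit writes buf at off, runs the optional failpoint, then syncs the file.
// If the failpoint fires, the data has been written but never synced.
func commit(f *os.File, buf []byte, off int64, beforeSync func() error) error {
	if _, err := f.WriteAt(buf, off); err != nil {
		return fmt.Errorf("write at %d: %w", off, err)
	}
	if beforeSync != nil {
		if err := beforeSync(); err != nil {
			return err
		}
	}
	return f.Sync()
}

// simulate runs one commit against a temporary file and reports whether the
// failpoint aborted it before the sync.
func simulate(crash bool) (bool, error) {
	f, err := os.CreateTemp("", "failpoint-*.db")
	if err != nil {
		return false, err
	}
	defer os.Remove(f.Name())
	defer f.Close()

	var failpoint func() error
	if crash {
		failpoint = func() error { return errCrash }
	}
	err = commit(f, []byte("page"), 0, failpoint)
	if errors.Is(err, errCrash) {
		return true, nil
	}
	return false, err
}

func main() {
	aborted, err := simulate(true)
	fmt.Println("aborted before sync:", aborted, "err:", err)
}
```

Passing the hook as a parameter keeps the normal path unchanged when it is nil, so the same commit function can serve both regular and fault-injection runs.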

@ahrtr
Member Author

ahrtr commented Nov 22, 2023

@fuweid can you add the test case TestDropWritesDuringBench (possibly under a different name) to bbolt?

@fuweid
Member

fuweid commented Nov 22, 2023

@ahrtr Hi, I am still working on this. I will file a pull request when it's ready.
