Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Database is deadlocking #298

Closed
cbrake opened this issue Oct 27, 2020 · 13 comments · Fixed by #304
Closed

Database is deadlocking #298

cbrake opened this issue Oct 27, 2020 · 13 comments · Fixed by #304
Labels
bug Something isn't working

Comments

@cbrake
Copy link

cbrake commented Oct 27, 2020

Hi, I've been experimenting with Genji v0.8.0 in one of my apps:

https://github.com/simpleiot/simpleiot/blob/feature-genji2/db/genji/genji.go

After I click around in the frontend a bit, API calls start timing out.

I instrumented the db calls, and learned that if one of the calls starts before the previous one finishes, the API calls start timing out soon after that.

Am I doing anything obviously wrong? I'm using boltdb backend and thought boltdb was thread safe. Do I need to wrap all db operations in transactions?

@asdine
Copy link
Collaborator

asdine commented Oct 28, 2020

@cbrake This behavior shouldn't happen, especially if you are only using read statements. I'll try to reproduce it.

@cbrake
Copy link
Author

cbrake commented Oct 28, 2020

thanks @asdine

The following program will deadlock after a few seconds of running:

https://github.com/simpleiot/simpleiot/blob/feature-genji2/cmd/genji-test6/main.go

Without the sleep statements, it deadlocks right away.

t2Count:  138
t1Count:  138
fatal error: all goroutines are asleep - deadlock!

goroutine 1 [semacquire]:
sync.runtime_SemacquireMutex(0xc0000cc06c, 0xc0000a3e00, 0x1)
        /usr/lib/go/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc0000cc068)
        /usr/lib/go/src/sync/mutex.go:138 +0x105
sync.(*Mutex).Lock(...)
        /usr/lib/go/src/sync/mutex.go:81
github.com/genjidb/genji/database.(*Transaction).Commit(0xc0002068c0, 0x0, 0x0)
        /home/cbrake/go/pkg/mod/github.com/genjidb/[email protected]/database/transaction.go:61 +0x14c
github.com/genjidb/genji/sql/query.(*Result).Close(0xc00020e320, 0x5b5dc1, 0x10)
        /home/cbrake/go/pkg/mod/github.com/genjidb/[email protected]/sql/query/query.go:145 +0x4d
main.main()
        /scratch/simpleiot/go/cmd/genji-test6/main.go:72 +0x3e5

goroutine 68 [semacquire]:
sync.runtime_SemacquireMutex(0xc0000e01b4, 0x525600, 0x1)
        /usr/lib/go/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc0000e01b0)
        /usr/lib/go/src/sync/mutex.go:138 +0x105
sync.(*Mutex).Lock(...)
        /usr/lib/go/src/sync/mutex.go:81
go.etcd.io/bbolt.(*DB).beginRWTx(0xc0000e0000, 0x0, 0x0, 0x0)
        /home/cbrake/go/pkg/mod/go.etcd.io/[email protected]/db.go:593 +0x1dd
go.etcd.io/bbolt.(*DB).Begin(0xc0000e0000, 0x300000001, 0xc00015bd28, 0x40c950, 0xc00019c050)
        /home/cbrake/go/pkg/mod/go.etcd.io/[email protected]/db.go:542 +0x38
github.com/genjidb/genji/engine/boltengine.(*Engine).Begin(0xc0000c8030, 0xc00015be01, 0x40, 0x48, 0xc00019c050, 0x4093ff)
        /home/cbrake/go/pkg/mod/github.com/genjidb/[email protected]/engine/boltengine/engine.go:34 +0x3b
github.com/genjidb/genji/database.(*Database).BeginTx(0xc0000cc040, 0xc00015bdee, 0x0, 0x0, 0x0)
        /home/cbrake/go/pkg/mod/github.com/genjidb/[email protected]/database/database.go:120 +0xd7
github.com/genjidb/genji/database.(*Database).Begin(...)
        /home/cbrake/go/pkg/mod/github.com/genjidb/[email protected]/database/database.go:99
github.com/genjidb/genji/sql/query.Query.Run(0xc0004a99d0, 0x1, 0x1, 0x0, 0x0, 0xc0000cc040, 0x70aa70, 0x0, 0x0, 0xc00015bf20, ...)
        /home/cbrake/go/pkg/mod/github.com/genjidb/[email protected]/sql/query/query.go:51 +0x52a
github.com/genjidb/genji.(*DB).Query(0xc0000c80a8, 0x5b5db1, 0x10, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
        /home/cbrake/go/pkg/mod/github.com/genjidb/[email protected]/db.go:78 +0x125
main.main.func1(0xc0000c80a8)
        /scratch/simpleiot/go/cmd/genji-test6/main.go:55 +0x11c
created by main.main
        /scratch/simpleiot/go/cmd/genji-test6/main.go:52 +0x3cc
exit status 2

@cbrake
Copy link
Author

cbrake commented Oct 28, 2020

The memory engine deadlocks as well. However, I can't get badger engine to deadlock, even after removing the sleeps.

t1Count:  175
t2Count:  160
t1Count:  176
t2Count:  161
t1Count:  177
fatal error: all goroutines are asleep - deadlock!

goroutine 1 [semacquire]:
sync.runtime_SemacquireMutex(0xc00010037c, 0x0, 0x1)
        /usr/lib/go/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc000100378)
        /usr/lib/go/src/sync/mutex.go:138 +0x105
sync.(*Mutex).Lock(...)
        /usr/lib/go/src/sync/mutex.go:81
sync.(*RWMutex).Lock(0xc000100378)
        /usr/lib/go/src/sync/rwmutex.go:98 +0x90
github.com/genjidb/genji/engine/memoryengine.(*Engine).Begin(0xc000100360, 0xc000075e01, 0x40, 0x48, 0xc0002a3680, 0x4093ff)
        /home/cbrake/go/pkg/mod/github.com/genjidb/[email protected]/engine/memoryengine/engine.go:37 +0x132
github.com/genjidb/genji/database.(*Database).BeginTx(0xc000118040, 0xc000075d66, 0x0, 0x0, 0x0)
        /home/cbrake/go/pkg/mod/github.com/genjidb/[email protected]/database/database.go:120 +0xd7
github.com/genjidb/genji/database.(*Database).Begin(...)
        /home/cbrake/go/pkg/mod/github.com/genjidb/[email protected]/database/database.go:99
github.com/genjidb/genji/sql/query.Query.Run(0xc00029aa30, 0x1, 0x1, 0x0, 0x0, 0xc000118040, 0x70aa70, 0x0, 0x0, 0xc000075e98, ...)
        /home/cbrake/go/pkg/mod/github.com/genjidb/[email protected]/sql/query/query.go:51 +0x52a
github.com/genjidb/genji.(*DB).Query(0xc00000e050, 0x5b5dbb, 0x10, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
        /home/cbrake/go/pkg/mod/github.com/genjidb/[email protected]/db.go:78 +0x125
main.main()
        /scratch/simpleiot/go/cmd/genji-test6/main.go:68 +0x4d0

goroutine 2076 [semacquire]:
sync.runtime_SemacquireMutex(0xc00011806c, 0xc00009ff00, 0x1)
        /usr/lib/go/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc000118068)
        /usr/lib/go/src/sync/mutex.go:138 +0x105
sync.(*Mutex).Lock(...)
        /usr/lib/go/src/sync/mutex.go:81
github.com/genjidb/genji/database.(*Transaction).Commit(0xc000354080, 0x0, 0x0)
        /home/cbrake/go/pkg/mod/github.com/genjidb/[email protected]/database/transaction.go:61 +0x14c
github.com/genjidb/genji/sql/query.(*Result).Close(0xc00030cf00, 0x5b5dab, 0x10)
        /home/cbrake/go/pkg/mod/github.com/genjidb/[email protected]/sql/query/query.go:145 +0x4d
main.main.func1(0xc00000e050)
        /scratch/simpleiot/go/cmd/genji-test6/main.go:59 +0x45
created by main.main
        /scratch/simpleiot/go/cmd/genji-test6/main.go:52 +0x3cc
exit status 2

@tie
Copy link
Contributor

tie commented Oct 28, 2020

@cbrake thanks! I can reproduce the issue on single-core xhyve VM with GOMAXPROCS ≥ 2 with your main.go.

Log
t1Count:  4728
t2Count:  4712
t1Count:  4729
t2Count:  4713
t1Count:  4730
t2Count:  4714
t1Count:  4731
fatal error: all goroutines are asleep - deadlock!

goroutine 1 [semacquire]:
sync.runtime_SemacquireMutex(0xc0000640ac, 0x0, 0x1)
	runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc0000640a8)
	sync/mutex.go:138 +0xfc
sync.(*Mutex).Lock(...)
	sync/mutex.go:81
github.com/genjidb/genji/database.(*Transaction).Commit(0xc00044cd00, 0x0, 0x0)
	github.com/genjidb/[email protected]/database/transaction.go:61 +0x14b
github.com/genjidb/genji/sql/query.(*Result).Close(0xc00041f7c0, 0x5be006, 0x10)
	github.com/genjidb/[email protected]/sql/query/query.go:145 +0x4d
main.main()
	play.golang.org@/main.go:74 +0x486

goroutine 18 [semacquire]:
sync.runtime_SemacquireMutex(0xc00008c1b4, 0x0, 0x1)
	runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc00008c1b0)
	sync/mutex.go:138 +0xfc
sync.(*Mutex).Lock(...)
	sync/mutex.go:81
go.etcd.io/bbolt.(*DB).beginRWTx(0xc00008c000, 0x0, 0x0, 0x0)
	go.etcd.io/[email protected]/db.go:593 +0x1e7
go.etcd.io/bbolt.(*DB).Begin(0xc00008c000, 0x203001, 0x7f9c5b1c1328, 0x42aeea, 0x0)
	go.etcd.io/[email protected]/db.go:542 +0x38
github.com/genjidb/genji/engine/boltengine.(*Engine).Begin(0xc00000e048, 0xc000080001, 0x50, 0x48, 0x5b0120, 0xc00004dd00)
	github.com/genjidb/[email protected]/engine/boltengine/engine.go:34 +0x3b
github.com/genjidb/genji/database.(*Database).BeginTx(0xc000064080, 0xc00004ddee, 0x0, 0x0, 0x0)
	github.com/genjidb/[email protected]/database/database.go:120 +0xe8
github.com/genjidb/genji/database.(*Database).Begin(...)
	github.com/genjidb/[email protected]/database/database.go:99
github.com/genjidb/genji/sql/query.Query.Run(0xc000010520, 0x1, 0x1, 0x0, 0x0, 0xc000064080, 0x726bf8, 0x0, 0x0, 0x5c6988, ...)
	github.com/genjidb/[email protected]/sql/query/query.go:51 +0x51c
github.com/genjidb/genji.(*DB).Query(0xc00000e0b8, 0x5bdff6, 0x10, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
	github.com/genjidb/[email protected]/db.go:78 +0x121
main.main.func1(0xc00000e0b8)
	play.golang.org@/main.go:57 +0x11b
created by main.main
	play.golang.org@/main.go:54 +0x46e
exit status 2

And when setting GOMAXPROCS=1 and removing time.Sleep (this seems to be consistent reproducer).

Log
t2Count:  1
t2Count:  2
t2Count:  3
t2Count:  4
fatal error: all goroutines are asleep - deadlock!

goroutine 1 [semacquire]:
sync.runtime_SemacquireMutex(0xc0000841b4, 0x0, 0x1)
	runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc0000841b0)
	sync/mutex.go:138 +0xfc
sync.(*Mutex).Lock(...)
	sync/mutex.go:81
go.etcd.io/bbolt.(*DB).beginRWTx(0xc000084000, 0x0, 0x0, 0x0)
	go.etcd.io/[email protected]/db.go:593 +0x1e7
go.etcd.io/bbolt.(*DB).Begin(0xc000084000, 0xc000052001, 0x0, 0x42aeea, 0x0)
	go.etcd.io/[email protected]/db.go:542 +0x38
github.com/genjidb/genji/engine/boltengine.(*Engine).Begin(0xc00000e050, 0x43d201, 0xc00005c0ac, 0x43e500, 0x1, 0xc00004bbf0)
	github.com/genjidb/[email protected]/engine/boltengine/engine.go:34 +0x3b
github.com/genjidb/genji/database.(*Database).BeginTx(0xc00005c080, 0xc00004bcfe, 0x0, 0x0, 0x0)
	github.com/genjidb/[email protected]/database/database.go:120 +0xe8
github.com/genjidb/genji/database.(*Database).Begin(...)
	github.com/genjidb/[email protected]/database/database.go:99
github.com/genjidb/genji/sql/query.Query.Run(0xc0002725a0, 0x1, 0x1, 0x0, 0x0, 0xc00005c080, 0x726bf8, 0x0, 0x0, 0x10, ...)
	github.com/genjidb/[email protected]/sql/query/query.go:51 +0x51c
github.com/genjidb/genji.(*DB).Query(0xc00000e0c0, 0x5be006, 0x10, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
	github.com/genjidb/[email protected]/db.go:78 +0x121
main.main()
	play.golang.org@/main.go:68 +0x565

goroutine 8 [semacquire]:
sync.runtime_SemacquireMutex(0xc00005c0ac, 0x1, 0x1)
	runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc00005c0a8)
	sync/mutex.go:138 +0xfc
sync.(*Mutex).Lock(...)
	sync/mutex.go:81
github.com/genjidb/genji/database.(*Transaction).Commit(0xc000411240, 0x0, 0x0)
	github.com/genjidb/[email protected]/database/transaction.go:61 +0x14b
github.com/genjidb/genji/sql/query.(*Result).Close(0xc000078af0, 0x5bdff6, 0x10)
	github.com/genjidb/[email protected]/sql/query/query.go:145 +0x4d
main.main.func1(0xc00000e0c0)
	play.golang.org@/main.go:60 +0x44
created by main.main
	play.golang.org@/main.go:53 +0x46e
exit status 2

@tie
Copy link
Contributor

tie commented Oct 28, 2020

Seems to be caused by attachedTxMu on *database.Database.

@tie
Copy link
Contributor

tie commented Oct 29, 2020

See also https://play.golang.org/p/Fnfs2aUNey4 for a test case we may want to include in tests.

@asdine
Copy link
Collaborator

asdine commented Oct 29, 2020

Thanks for this! It's very helpful!
I will work on a fix tomorrow and release a patch for the v0.8

@asdine
Copy link
Collaborator

asdine commented Oct 30, 2020

@cbrake @tie I have created the v0.8.1 release, it should be working now.
Thank you!

@asdine asdine added the bug Something isn't working label Oct 30, 2020
@tie
Copy link
Contributor

tie commented Oct 30, 2020

@asdine actually, we don’t need a separate branch just for the v0.8.1 tag. This comment suggested a release-v0.8 branch for all v0.8.x “patch” releases so that we can create backport PRs (similar to the release branches Go project uses, see MinorReleases).

If you don’t have any objections, I’ll get this process documented and automated as part of the #269.

@asdine
Copy link
Collaborator

asdine commented Oct 30, 2020

@tie Indeed you're right I'll fix that, thanks!

If you don’t have any objections, I’ll get this process documented and automated as part of the #269.

Of course, very good idea

@cbrake
Copy link
Author

cbrake commented Oct 31, 2020

v0.8.1 is working great with boltdb now -- thanks for the fix!

@tie
Copy link
Contributor

tie commented Oct 31, 2020

@asdine just noticed that we forgot to add cmd/genji/v0.8.1 tag.

@cbrake
Copy link
Author

cbrake commented Nov 2, 2020

turns out I need the below fix as well, so just running with latest master for now -- guess I can do more testing for you that way :-)

#216

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants