-
Notifications
You must be signed in to change notification settings - Fork 482
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Master lock getting out of sync in case the LUA script gets interrupted #553
Comments
Using a Redlock is probably a better idea though, as you stated. |
I really doubt that Redis can ensure that the script is not terminated in the middle in case something bad happens to the Redis server. And I guess something like this happen on our production Redis (we use ElasticCache from AWS). I was thinking about this in the recent days and I think an easy fix could be to use: def acquire!
Resque.redis.set(key, value, nx: true, px: timeout)
end instead of using a script. This should atomically set the key and the timeout and return @carsonreinke, what do you think? |
Strange that even happened. Seriously, no TTL? The return value from |
Yes. No TTL. The scheduled jobs were stuck for over a month before we noticed that something is not quite right. And while debugging the live system I found that the TTL of the master lock is This Redis version thing sucks :( It should be easy to reproduce the bug by putting a sleep in the Lua script and killing the server while the script is running. I am not sure that is possible for the Lua supported by Redis, though... (probably putting a long loop in there could simulate a delay too, so that you have time to kill the Redis process). Let me know if you need a PR for this. I can try to prepare one. |
That sucks, sorry to hear that happened. Looks like A PR would be great. |
@carsonreinke Did you end up patching this in your installation? Could you submit a PR? |
We had this production problem yesterday where we discovered that the scheduled tasks are not executed even though the scheduler is running. After investigation it turned out that the master lock key in redis is set to some value, but has no TTL set, essentially leading us to this function: https://github.com/resque/resque-scheduler/blob/master/lib/resque/scheduler/lock/resilient.rb#L54
The above inconsistency caused no master node to be elected (although we don't use multiple schedulers) and all the scheduled jobs got blocked.
I really believe the way this lock is set with 2 separate operations SETNX and EXPIRE is not atomic, even though it is executed in a LUA script. These 2 operations need to be atomic and this can be achieved using the
SET NX PX
. Even a better solution will be to use a lock implementation which is reviewed by the community, for example using the Redis guidelines for distributed locks: http://redis.io/topics/distlockThe text was updated successfully, but these errors were encountered: