Describe the bug
For job schedules that are duration-driven and whose job execution is relatively fast, the lock approach does not guarantee that only one instance executes the job. The reason is that the schedule will likely start at a different time on each instance (e.g. during pod rollouts in Kubernetes), so the triggering moment is not synchronized across instances: each instance's ticker is shifted relative to the others. What ends up happening is that most of the instances successfully acquire and release the lock when their own execution time arrives, so the job runs close to once per instance instead of once per schedule.
The same problem applies to the other scheduler types in the presence of clock skew: one instance is likely to acquire and release the lock before another instance even attempts to acquire it.
I don't see a way to fix this, but I think we should either document this limitation or remove the distributed-lock feature entirely.
To Reproduce
I can try to create a reproducer, but I think the explanation above is sufficient; this is more of a functional limitation than a technical bug. A minimal simulation of the timing is sketched below.
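The following is a minimal, self-contained sketch that simulates the race with an in-process lock instead of a real distributed backend; the `fakeLock` type and the 500 ms startup offset are illustrative assumptions, not gocron API. Because the two tickers never fire at the same moment and the job is fast, each instance finds the lock free at its own tick and the job runs twice per period:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// fakeLock stands in for a distributed lock backend (e.g. Redis):
// TryLock succeeds only if no other holder currently has the key.
type fakeLock struct {
	mu   sync.Mutex
	held bool
}

func (l *fakeLock) TryLock() bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.held {
		return false
	}
	l.held = true
	return true
}

func (l *fakeLock) Unlock() {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.held = false
}

func main() {
	lock := &fakeLock{}
	var wg sync.WaitGroup

	// Two "instances" run the same 1s duration job, but start 500ms
	// apart (as happens during a rolling deployment). Their tickers
	// are shifted, so each one finds the lock free at its own tick.
	for i, offset := range []time.Duration{0, 500 * time.Millisecond} {
		wg.Add(1)
		go func(id int, offset time.Duration) {
			defer wg.Done()
			time.Sleep(offset) // shifted scheduler start
			ticker := time.NewTicker(time.Second)
			defer ticker.Stop()
			for j := 0; j < 3; j++ {
				<-ticker.C
				if lock.TryLock() {
					fmt.Printf("instance %d executed the job at %s\n",
						id, time.Now().Format("15:04:05.000"))
					// Fast job: the lock is released well before the
					// other instance's tick, so it never blocks anyone.
					lock.Unlock()
				}
			}
		}(i, offset)
	}
	wg.Wait()
}
```

Running this prints an execution line from both instances every second, which is the duplicate-execution behavior described above.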
Version
v2.16.1
Expected behavior
Could we either document the distributed lock's shortcomings or remove the functionality?
Additional context
I have opted to use leader election after identifying the issue above; a sketch of that setup follows.
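For reference, here is a minimal sketch of the leader-election alternative, assuming gocron v2's `Elector` interface (`IsLeader(ctx) error`) and the `WithDistributedElector` scheduler option; `staticElector` is a hypothetical stand-in for a real elector backed by e.g. Kubernetes leases or etcd. Because only the current leader's ticks result in execution, shifted tickers across instances no longer cause duplicates:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"

	"github.com/go-co-op/gocron/v2"
)

// staticElector is a hypothetical stand-in for a real elector.
// gocron only asks whether this instance is currently the leader.
type staticElector struct{ leader bool }

func (e *staticElector) IsLeader(ctx context.Context) error {
	if e.leader {
		return nil
	}
	return errors.New("not the leader")
}

func main() {
	s, err := gocron.NewScheduler(
		// Jobs execute only on the instance whose elector reports
		// leadership, regardless of when each instance's ticker fires.
		gocron.WithDistributedElector(&staticElector{leader: true}),
	)
	if err != nil {
		panic(err)
	}

	_, err = s.NewJob(
		gocron.DurationJob(time.Second),
		gocron.NewTask(func() { fmt.Println("job ran on the leader") }),
	)
	if err != nil {
		panic(err)
	}

	s.Start()
	time.Sleep(3 * time.Second)
	_ = s.Shutdown()
}
```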