Describe the bug
For job schedules that are duration-driven and whose job execution is relatively fast, the lock approach does not guarantee that only one instance executes the job. The reason is that the schedule will likely start at a different time on each instance (e.g. during pod rollouts in Kubernetes), so the triggering moment is not synchronized across instances: each instance's ticker is shifted relative to the others. What ends up happening is that most of the instances successfully acquire and release the lock when their own execution time arrives, so the job runs close to once per instance instead of once per schedule.
The same problem applies to the other scheduler types in the presence of clock skew: one instance is likely to acquire and release the lock before another instance even attempts to acquire it.
I don't see a way to fix this, but I think we should either document this limitation or remove the distributed-lock feature entirely.
To Reproduce
I can try to create a reproducer, but I think the explanation above is sufficient; this is more of a functional limitation than a technical bug. A minimal simulation of the timing is sketched below.
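The following is a minimal, self-contained sketch that simulates the race with an in-process lock instead of a real distributed backend; the `fakeLock` type and the 500 ms startup offset are illustrative assumptions, not gocron API. Because the two tickers never fire at the same moment and the job is fast, each instance finds the lock free at its own tick and the job runs twice per period:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// fakeLock stands in for a distributed lock backend (e.g. Redis):
// TryLock succeeds only if no other holder currently has the key.
type fakeLock struct {
	mu   sync.Mutex
	held bool
}

func (l *fakeLock) TryLock() bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.held {
		return false
	}
	l.held = true
	return true
}

func (l *fakeLock) Unlock() {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.held = false
}

func main() {
	lock := &fakeLock{}
	var wg sync.WaitGroup

	// Two "instances" run the same 1s duration job, but start 500ms
	// apart (as happens during a rolling deployment). Their tickers
	// are shifted, so each one finds the lock free at its own tick.
	for i, offset := range []time.Duration{0, 500 * time.Millisecond} {
		wg.Add(1)
		go func(id int, offset time.Duration) {
			defer wg.Done()
			time.Sleep(offset) // shifted scheduler start
			ticker := time.NewTicker(time.Second)
			defer ticker.Stop()
			for j := 0; j < 3; j++ {
				<-ticker.C
				if lock.TryLock() {
					fmt.Printf("instance %d executed the job at %s\n",
						id, time.Now().Format("15:04:05.000"))
					// Fast job: the lock is released well before the
					// other instance's tick, so it never blocks anyone.
					lock.Unlock()
				}
			}
		}(i, offset)
	}
	wg.Wait()
}
```

Running this prints an execution line from both instances every second, which is the duplicate-execution behavior described above.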
Version
v2.16.1
Expected behavior
Could we either document the distributed lock's shortcomings or remove the functionality?
Additional context
I have opted to use leader election after identifying the issue above; a sketch of that setup follows.
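For reference, here is a minimal sketch of the leader-election alternative, assuming gocron v2's `Elector` interface (`IsLeader(ctx) error`) and the `WithDistributedElector` scheduler option; `staticElector` is a hypothetical stand-in for a real elector backed by e.g. Kubernetes leases or etcd. Because only the current leader's ticks result in execution, shifted tickers across instances no longer cause duplicates:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"

	"github.com/go-co-op/gocron/v2"
)

// staticElector is a hypothetical stand-in for a real elector.
// gocron only asks whether this instance is currently the leader.
type staticElector struct{ leader bool }

func (e *staticElector) IsLeader(ctx context.Context) error {
	if e.leader {
		return nil
	}
	return errors.New("not the leader")
}

func main() {
	s, err := gocron.NewScheduler(
		// Jobs execute only on the instance whose elector reports
		// leadership, regardless of when each instance's ticker fires.
		gocron.WithDistributedElector(&staticElector{leader: true}),
	)
	if err != nil {
		panic(err)
	}

	_, err = s.NewJob(
		gocron.DurationJob(time.Second),
		gocron.NewTask(func() { fmt.Println("job ran on the leader") }),
	)
	if err != nil {
		panic(err)
	}

	s.Start()
	time.Sleep(3 * time.Second)
	_ = s.Shutdown()
}
```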