Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Runtime upsert and delete events race conditions #894

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

yhabteab
Copy link
Member

Apparently the runtime state updates don't seem to work as they should. Specifically, I spent almost 2 days debugging why Icinga DB always keeps some leftovers in the dependency_edge_state and redundancy_group_state table while testing #889. After some time I got clueless where the problem might lie and wanted to check if the actual host_state and service_state tables actually get cleaned up correctly when the respective host or/and service object is removed and surprisingly they don't! Icinga 2 does not even send runtime delete events for these two tables and any leftovers will only get cleaned up after the next config dump triggered by either a Icinga DB or Icinga 2 process restart. However, the thing is that unlike these two tables, Icinga 2 does send runtime delete events for the dependency states but the integration tests from #889 were still able to identify some leftovers in the dependency_edge_state and redundancy_group_state. Fortunately, @julianbrost gave me a hint and suggested to temporarily disable the parallel state updates triggered from here:

logger.Info("Starting state runtime updates sync")
return rt.Sync(synctx, v1.StateFactories, runtimeStateUpdateStreams, true)

Tada 🎉, that finally ended my seemingly never-ending nightmare! It seems that when the allowParallel parameter is set to true, the RuntimeUpdates#Sync() method does not honor the exact order of delete and upsert events from Redis, resulting in some rare and difficult to manually trigger race conditions where a subsequent runtime delete event from Icinga 2 might actually surpass the previously sent upsert event. Since fixing this issue would require a major refactoring of the Icinga DB code, I'm not even going to try to fix it now, instead I'm just creating this PR to have a reference in the future so it doesn't get lost.

@cla-bot cla-bot bot added the cla/signed label Feb 28, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant