Runtime upsert and delete events race conditions
#894
+86
−41
The runtime state updates don't seem to work as they should. Specifically, I spent almost 2 days debugging why Icinga DB always keeps some leftovers in the `dependency_edge_state` and `redundancy_group_state` tables while testing #889. After some time I ran out of ideas about where the problem might lie and wanted to check whether the actual `host_state` and `service_state` tables get cleaned up correctly when the respective host and/or service object is removed, and surprisingly they don't! Icinga 2 does not even send runtime delete events for these two tables, and any leftovers only get cleaned up by the next config dump, which is triggered by a restart of either the Icinga DB or the Icinga 2 process.

Unlike for these two tables, however, Icinga 2 does send runtime delete events for the dependency states, yet the integration tests from #889 were still able to identify some leftovers in `dependency_edge_state` and `redundancy_group_state`. Fortunately, @julianbrost gave me a hint and suggested temporarily disabling the parallel state updates triggered from here: icingadb/cmd/icingadb/main.go, lines 301 to 303 in a77498d.

Tada 🎉, that finally ended my seemingly never-ending nightmare! It seems that when the `allowParallel` parameter is set to true, the `RuntimeUpdates#Sync()` method does not honor the exact order of `delete` and `upsert` events from Redis. This results in rare race conditions that are difficult to trigger manually, where a subsequent runtime `delete` event from Icinga 2 might actually overtake the previously sent `upsert` event. Since fixing this issue would require a major refactoring of the Icinga DB code, I'm not going to try to fix it now. Instead, I'm creating this PR as a reference for the future so that the issue doesn't get lost.
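
To make the failure mode above concrete, here is a minimal, deterministic sketch of it. It is not the Icinga DB code: `Event`, `applyInOrder` and `applySplit` are hypothetical names, and the sketch simply assumes that upserts and deletes end up in two independent pipelines whose relative completion order is not guaranteed. The `deletesFirst` flag stands in for the scheduler picking the "unlucky" interleaving.

```go
package main

import "fmt"

// Event is a hypothetical, simplified runtime update as read from a stream.
type Event struct {
	Op string // "upsert" or "delete"
	ID string // ID of the affected row, e.g. in dependency_edge_state
}

// apply executes a single event against an in-memory stand-in for a table.
func apply(table map[string]bool, e Event) {
	switch e.Op {
	case "upsert":
		table[e.ID] = true
	case "delete":
		delete(table, e.ID)
	}
}

// applyInOrder applies all events strictly in the order they were received.
func applyInOrder(events []Event) map[string]bool {
	table := make(map[string]bool)
	for _, e := range events {
		apply(table, e)
	}
	return table
}

// applySplit models two independent pipelines (assumption): one handles only
// upserts, the other only deletes. deletesFirst selects which pipeline
// happens to finish first, which real concurrency would leave to chance.
func applySplit(events []Event, deletesFirst bool) map[string]bool {
	var upserts, deletes []Event
	for _, e := range events {
		if e.Op == "delete" {
			deletes = append(deletes, e)
		} else {
			upserts = append(upserts, e)
		}
	}

	first, second := upserts, deletes
	if deletesFirst {
		first, second = deletes, upserts
	}

	table := make(map[string]bool)
	for _, e := range first {
		apply(table, e)
	}
	for _, e := range second {
		apply(table, e)
	}
	return table
}

func main() {
	// The object is created and then removed again shortly afterwards.
	events := []Event{
		{Op: "upsert", ID: "edge-1"},
		{Op: "delete", ID: "edge-1"},
	}

	fmt.Println("in order:     ", len(applyInOrder(events)), "leftover rows")
	fmt.Println("deletes first:", len(applySplit(events, true)), "leftover rows")
}
```

When the delete overtakes the upsert, it is a no-op on a row that does not exist yet, and the later upsert then leaves a row behind that nothing will remove until the next config dump. Applying events for the same ID in their original order avoids this, at the cost of less parallelism.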