Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More robust sync from git #72

Closed
mscherer opened this issue Sep 22, 2022 · 12 comments
Closed

More robust sync from git #72

mscherer opened this issue Sep 22, 2022 · 12 comments

Comments

@mscherer
Copy link
Contributor

TLDR:

The current system of using a git checkout with either clone or and syncing is IMHO not robust enough, and should be replaced by a sync function that create a temporary directory, do a git clone, and then use that clone for the sync. This would prevent the issue that I dubbed "the mysterious case of the knative 2022-TOC disparition".

So earlier this week, @jberkus asked us to investigate why the elekto instance we setup on Openshift had lost 1 election after the upgrade to 0.6. Upon further debugging yesterday, we found this was due to the ballot table being empty, and after ruling out data corruptions and the like, we stopped investigating. But we found this happened before the update, which puzzled us a lot.

I restarted the investigation this evening and found I think the cause.

So ballots are not removed from the db, unless a election is removed. Upon looking the database backups we had, I noticed that indeed, the election of 2022 wasn't here, and so my hypothesis at that time was that the election removal triggered the ballots cleanup. However, it was unclear how the election could be removed in the first place. Upon checking the code, I found that the sync function do remove elections if they are not in the yaml file for the election, and that sync is triggered by the webhook.

But the yaml file should never be absent, because the webhook populate them the git repo if it doesn't exist, correct ? Yes, but... the sync function can also be run from the CLI using --sync, and this is done when starting the container

I wasn't aware that the meta directory had to persist (because that's not how I would have done). So I placed it on /tmp, which is likely wiped on restart, or cleaned somehow, for example, when we have a upgrade as it happen on a regular basis.

So the 1st obvious fix is to tell "make sure META_PATH is persistent". But I think a better fix would be to not need META_PATH in the first place, and have the sync() function do a git clone (maybe a shallow one) each time it is called, in a temporary directory and operate from here.

Besides avoiding the problem I faced, this also would be more robust in case of container crash, for example, in the middle of a git clone that leave some lock file and requires a human to fix. It would also simplify the code a bit, by removing 1 option.

@mscherer
Copy link
Contributor Author

And doing a clone also avoid issue if someone do a dirty rebase or a push -f, as it can happen from time to time.

@mscherer
Copy link
Contributor Author

Ok so I was wrong on the container --sync, it also do clone, and indeed, /tmp/ is wiped on restart, so that part is good.

However, the git clone/pull is done without verifying the return code and python do not trigger a exception, so a failure here would not stop the sync. So my point on robustness still apply I think.

@jberkus
Copy link
Member

jberkus commented Sep 26, 2022

So the issue here is failing to check for an exception correctly? Will look into it.

Also, are you saying that you think that maybe there was a container misfire, where /tmp/ was not writeable, and sync deleted the elections for that reason?

@mscherer
Copy link
Contributor Author

Well, I suspect something made elekto remove the election, and the only path I can imagine is related to the sync.

However, I elaborated a false theory as I didn't see elekto clone the git repo at start, and not just with the webhook, so just ignore the 1st comment (I should maybe edit it).

I can't find why only 1 election got removed. If my theory on sync silently failing was right, all elections would have been lost, not just one.

So either the theory is false, or my theory is right and there is a unknown reason that could explain why the 3 others elections (the empty ones) were still here.

The sync function use 2 try/except, so a failure to delete a table with session.delete wouldn't stop the sync, and the code would continue. Maybe it is worth testing what happen if elekto is restart when meta is not here, and if delete behave as we expect (cause sql alchemy wouldn't delete the object, just mark it as deleted, and I wonder if that could have unexpected side effects )

@mscherer
Copy link
Contributor Author

As for the misfire, a more likely explanation would be that the container restarted when github was down, or DNS issue, or something like this.

@mscherer
Copy link
Contributor Author

So I just verified (not on prod), but failure to checkout git repository (her, triggered with a mistyped URL) result in all elections to be removed.

So that do not explain why only 2022-TOC disappeared.

@mscherer
Copy link
Contributor Author

Ok so seems I am confused. The election (from table election) itself was not removed. But the ballots were.

@mscherer
Copy link
Contributor Author

So my hypothesis is likely right, I just badly explained.

In my DB dump from the upgrade day, I have the 3 elections in the elections table, but no ballots. What I think happened is the following:

Every 2 weeks, our cluster is upgraded, which move pods around, by restarting them. I suspect that during one of those upgrade, the elekto pod ran a console --sync on a fresh /tmp/. There is no reason for it to fail, except a issue on github and/or network. However, if a issue on github/network happen, the actual code will just ignore the result and continue, assuming that the existing checkout is fine.

However, on our setup, the checkout is on /tmp, which is wiped by the reboot. So the code will continue, see there is no meta files, and erase all elections, which in turn remove all ballots.

Then 2 weeks later, a new upgrade is triggered. The pod got restarted again, and this time, the git clone work, so elekto add back the elections, but the ballots are gone.

And this is exactly what I see in my dump. The elections are here in the table, the ballots are not, and the ballot_id sequence is not 0. So the DB wasn't removed, just the ballots, and the elections are here as they were added later on next sync (and where all created at the same time).

Ergo, I think the sync operation should account for git errors, and stop the sync without preventing elekto from starting.

@jberkus
Copy link
Member

jberkus commented Sep 28, 2022

I don't think only 2022-TOC disappeared. I think all three elections disappeared; it's that only 2022-TOC was not backed up.

@mscherer
Copy link
Contributor Author

mscherer commented Sep 29, 2022

Yu, that's what I meant when I said erase all elections. I looked in github outage history to see if there was a big one that would have blocked git clone from working, but found nothing egregious on that front (except one). However, given that's only code path doing that, I am fairly confident that's what happened.

I will submit a new PR, as #75 is not going to fit.

@jberkus
Copy link
Member

jberkus commented Sep 29, 2022

Addressing part of this with #76

@jberkus
Copy link
Member

jberkus commented May 10, 2023

The other parts of this issue have been fixed. Closing this and tracking the remaining enhancement via #74

@jberkus jberkus closed this as completed May 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants