More robust sync from git #72

mscherer · 2022-09-22T20:37:04Z

TLDR:

The current system of using a git checkout with either clone or and syncing is IMHO not robust enough, and should be replaced by a sync function that create a temporary directory, do a git clone, and then use that clone for the sync. This would prevent the issue that I dubbed "the mysterious case of the knative 2022-TOC disparition".

So earlier this week, @jberkus asked us to investigate why the elekto instance we setup on Openshift had lost 1 election after the upgrade to 0.6. Upon further debugging yesterday, we found this was due to the ballot table being empty, and after ruling out data corruptions and the like, we stopped investigating. But we found this happened before the update, which puzzled us a lot.

I restarted the investigation this evening and found I think the cause.

So ballots are not removed from the db, unless a election is removed. Upon looking the database backups we had, I noticed that indeed, the election of 2022 wasn't here, and so my hypothesis at that time was that the election removal triggered the ballots cleanup. However, it was unclear how the election could be removed in the first place. Upon checking the code, I found that the sync function do remove elections if they are not in the yaml file for the election, and that sync is triggered by the webhook.

But the yaml file should never be absent, because the webhook populate them the git repo if it doesn't exist, correct ? Yes, but... the sync function can also be run from the CLI using --sync, and this is done when starting the container

I wasn't aware that the meta directory had to persist (because that's not how I would have done). So I placed it on /tmp, which is likely wiped on restart, or cleaned somehow, for example, when we have a upgrade as it happen on a regular basis.

So the 1st obvious fix is to tell "make sure META_PATH is persistent". But I think a better fix would be to not need META_PATH in the first place, and have the sync() function do a git clone (maybe a shallow one) each time it is called, in a temporary directory and operate from here.

Besides avoiding the problem I faced, this also would be more robust in case of container crash, for example, in the middle of a git clone that leave some lock file and requires a human to fix. It would also simplify the code a bit, by removing 1 option.

mscherer · 2022-09-22T20:37:50Z

And doing a clone also avoid issue if someone do a dirty rebase or a push -f, as it can happen from time to time.

mscherer · 2022-09-23T15:51:13Z

Ok so I was wrong on the container --sync, it also do clone, and indeed, /tmp/ is wiped on restart, so that part is good.

However, the git clone/pull is done without verifying the return code and python do not trigger a exception, so a failure here would not stop the sync. So my point on robustness still apply I think.

jberkus · 2022-09-26T20:45:15Z

So the issue here is failing to check for an exception correctly? Will look into it.

Also, are you saying that you think that maybe there was a container misfire, where /tmp/ was not writeable, and sync deleted the elections for that reason?

mscherer · 2022-09-26T22:07:53Z

Well, I suspect something made elekto remove the election, and the only path I can imagine is related to the sync.

However, I elaborated a false theory as I didn't see elekto clone the git repo at start, and not just with the webhook, so just ignore the 1st comment (I should maybe edit it).

I can't find why only 1 election got removed. If my theory on sync silently failing was right, all elections would have been lost, not just one.

So either the theory is false, or my theory is right and there is a unknown reason that could explain why the 3 others elections (the empty ones) were still here.

The sync function use 2 try/except, so a failure to delete a table with session.delete wouldn't stop the sync, and the code would continue. Maybe it is worth testing what happen if elekto is restart when meta is not here, and if delete behave as we expect (cause sql alchemy wouldn't delete the object, just mark it as deleted, and I wonder if that could have unexpected side effects )

mscherer · 2022-09-26T22:08:47Z

As for the misfire, a more likely explanation would be that the container restarted when github was down, or DNS issue, or something like this.

mscherer · 2022-09-27T13:06:57Z

So I just verified (not on prod), but failure to checkout git repository (her, triggered with a mistyped URL) result in all elections to be removed.

So that do not explain why only 2022-TOC disappeared.

mscherer · 2022-09-27T13:08:04Z

Ok so seems I am confused. The election (from table election) itself was not removed. But the ballots were.

mscherer · 2022-09-27T14:57:15Z

So my hypothesis is likely right, I just badly explained.

In my DB dump from the upgrade day, I have the 3 elections in the elections table, but no ballots. What I think happened is the following:

Every 2 weeks, our cluster is upgraded, which move pods around, by restarting them. I suspect that during one of those upgrade, the elekto pod ran a console --sync on a fresh /tmp/. There is no reason for it to fail, except a issue on github and/or network. However, if a issue on github/network happen, the actual code will just ignore the result and continue, assuming that the existing checkout is fine.

However, on our setup, the checkout is on /tmp, which is wiped by the reboot. So the code will continue, see there is no meta files, and erase all elections, which in turn remove all ballots.

Then 2 weeks later, a new upgrade is triggered. The pod got restarted again, and this time, the git clone work, so elekto add back the elections, but the ballots are gone.

And this is exactly what I see in my dump. The elections are here in the table, the ballots are not, and the ballot_id sequence is not 0. So the DB wasn't removed, just the ballots, and the elections are here as they were added later on next sync (and where all created at the same time).

Ergo, I think the sync operation should account for git errors, and stop the sync without preventing elekto from starting.

See elekto-io#72 and elekto-io#73

jberkus · 2022-09-28T20:36:45Z

I don't think only 2022-TOC disappeared. I think all three elections disappeared; it's that only 2022-TOC was not backed up.

mscherer · 2022-09-29T08:46:48Z

Yu, that's what I meant when I said erase all elections. I looked in github outage history to see if there was a big one that would have blocked git clone from working, but found nothing egregious on that front (except one). However, given that's only code path doing that, I am fairly confident that's what happened.

I will submit a new PR, as #75 is not going to fit.

jberkus · 2022-09-29T21:54:29Z

Addressing part of this with #76

jberkus · 2023-05-10T18:34:32Z

The other parts of this issue have been fixed. Closing this and tracking the remaining enhancement via #74

jberkus mentioned this issue Sep 27, 2022

Replace os.system with subprocess #73

Closed

mscherer added a commit to mscherer/elekto that referenced this issue Sep 28, 2022

Simplify sync by cloning in a temporary directory

87f7fdc

See elekto-io#72 and elekto-io#73

mscherer mentioned this issue Sep 28, 2022

Simplify sync by cloning in a temporary directory #75

Closed

mscherer added a commit to mscherer/elekto that referenced this issue Sep 28, 2022

Simplify sync by cloning in a temporary directory

7fbf740

See elekto-io#72 and elekto-io#73

mscherer added a commit to mscherer/elekto that referenced this issue Sep 28, 2022

Simplify sync by cloning in a temporary directory

179f99a

See elekto-io#72 and elekto-io#73

jberkus closed this as completed May 10, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

More robust sync from git #72

More robust sync from git #72

mscherer commented Sep 22, 2022

mscherer commented Sep 22, 2022

mscherer commented Sep 23, 2022

jberkus commented Sep 26, 2022

mscherer commented Sep 26, 2022

mscherer commented Sep 26, 2022

mscherer commented Sep 27, 2022

mscherer commented Sep 27, 2022

mscherer commented Sep 27, 2022

jberkus commented Sep 28, 2022

mscherer commented Sep 29, 2022 •

edited

Loading

jberkus commented Sep 29, 2022

jberkus commented May 10, 2023

More robust sync from git #72

More robust sync from git #72

Comments

mscherer commented Sep 22, 2022

mscherer commented Sep 22, 2022

mscherer commented Sep 23, 2022

jberkus commented Sep 26, 2022

mscherer commented Sep 26, 2022

mscherer commented Sep 26, 2022

mscherer commented Sep 27, 2022

mscherer commented Sep 27, 2022

mscherer commented Sep 27, 2022

jberkus commented Sep 28, 2022

mscherer commented Sep 29, 2022 • edited Loading

jberkus commented Sep 29, 2022

jberkus commented May 10, 2023

mscherer commented Sep 29, 2022 •

edited

Loading