-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Don't exit tablet server on reloading invalid ACL #17485
base: main
Are you sure you want to change the base?
Conversation
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'll need to amend your commit with the proper sign off to pass the DCO check as well.
go/vt/tableacl/tableacl.go
Outdated
@@ -105,6 +105,7 @@ func (tacl *tableACL) init(configFile string, aclCB func()) error { | |||
if configFile == "" { | |||
return nil | |||
} | |||
log.Infof("Reading ACL file %v", configFile) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should log it here. What's the reason you need it here? Was it not clear that this was happening? I would imagine it was clear from the lines below on errors what the cause of the problem was?
Below here, there's also Info
logging on errors which is wrong, those should be raised at least to warning if not error. It's confusing those are set at the info level.
If there's a good reason to add the logging, maybe instead we should log inside the SIGHUP handler to make it specific for the scenario? I don't think this line is something useful to print on each startup then. We should be careful with not adding too much logging noise that loses signal then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should log it here. What's the reason you need it here? Was it not clear that this was happening? I would imagine it was clear from the lines below on errors what the cause of the problem was?
The issue is that when there is no error, sending a SIGHUP doesn't cause any log output to be generated. So, you don't know if it even did anything.
I considered adding the line to the SIGHUP handler. We can do that too I suppose. I'll change it, along with the sign-off.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue is that when there is no error, sending a SIGHUP doesn't cause any log output to be generated. So, you don't know if it even did anything.
Let's log it then after it succeeds, so that we only have the line if there was a successful reload due to a SIGHUP
. We can then make it more something like "Successfully reloaded ACL".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had just force-pushed with the log line in the SIGHUP handler, also to make it clear it responds to SIGHUP. It's a convention this is logged by software, so I think it's a good idea.
But about your proposal: I generally prefer to log intentions rather than result. Otherwise you may get log entries like 'error: too many open files' and you have no idea what it was trying to do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you have no idea what it was trying to do.
I'm not sure what you mean here? This is something that logging can do properly? Looking at the code more I think it would benefit from some additional refactor and remove the exit at a distance kinda behavior and use error returns and let the callers decide.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what you mean here?
I think we're miscommunicating a bit. You said:
What's the reason you need it here? Was it not clear that this was happening? I would imagine it was clear from the lines below on errors what the cause of the problem was?
Indeed, on error. I meant that as a sysop, I would like to see a program respond to something. If I send it a SIGHUP and no log lines are printed, I can't tell whether it did something or not.
Looking at the code more I think it would benefit from some additional refactor and remove the exit at a distance kinda behavior and use error returns and let the callers decide.
OK, I can do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wiebeytec https://github.com/wiebeytec/vitess/compare/main...dbussink:vitess:acl-refactor-suggestion?expand=1 is kinda what I was thinking about here. Dunno what you think of that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. In the branch I just force-pushed I still logged in the callees (deep), but functionality it's similar.
You can remove my code in favor of yours then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think logging at the higher level would be preferable here.
62e220f
to
2c30527
Compare
What's the next step here? I think we can go with your code. |
We can update your PR here with that code? Feel free to use it! We'll also need to add some tests for it as well then to validate the new behavior. |
This fixes potentially bringing down a tablet with an innocuous SIGHUP. It also logs the fact it's read the ACL file, to fix not getting any feedback on SIGHUP. vitessio#17139 Signed-off-by: Wiebe Cazemier <[email protected]>
Alright, I updated the PR and added a test case that tests loading an invalid ACL does not remove the existing one. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #17485 +/- ##
==========================================
+ Coverage 67.67% 67.71% +0.04%
==========================================
Files 1584 1584
Lines 254466 254513 +47
==========================================
+ Hits 172215 172351 +136
+ Misses 82251 82162 -89 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense to me. I'm leaning towards not backporting it for two reasons as I'm not confident in how this is used across the Vitess user base:
- What does this mean if the operator isn't watching the logs? Couldn't this potentially mean that the vttablet is running along as if everything is fine while having more permissive permissions than intended/expected? We can say that this is the operators concern/fault, but it's still a change.
- Some people could be relying on the current behavior (I've made this mistake in the past)
It's also for this reason that I think we should add a small blurb about this here https://github.com/vitessio/vitess/blob/main/changelog/22.0/22.0.0/changelog.md
What do others think?
People who are relying on that behavior would have to have a weird setup, because they need to have a running tablet server that they expect to exit after SIGHUP when the file contains errors. One example use is deployment on the replicas as a test, but then they'd be putting their replicas at risk; especially because it calls the OS Having said that, I do see your point that it's a change. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with Matt, we should drop a line in the summary.md
file so that no one is caught off-guard with the changes. Otherwise rest, LGTM. (merging from main
might be required because of an extra introduced test there)
This fixes potentially bringing down a tablet with an innocuous SIGHUP.
It also logs the fact it's reading the ACL file, to fix not getting any feedback on SIGHUP.
#17139
Description
Currently, deploying new ACL files is not only very involved because I have to test them against a running Vitess instance, but it's also very stressful to do
killall -SIGHUP vttablet
knowing that it can bring down the entire system.Because of the risk of downtime caused by simple ACL definition mistakes, I think this is a candidate for back-porting.
Related Issue(s)
#17139
Checklist
Deployment Notes