Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't exit tablet server on reloading invalid ACL #17485

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

wiebeytec
Copy link

This fixes potentially bringing down a tablet with an innocuous SIGHUP.

It also logs the fact it's reading the ACL file, to fix not getting any feedback on SIGHUP.

#17139

Description

Currently, deploying new ACL files is not only very involved because I have to test them against a running Vitess instance, but it's also very stressful to do killall -SIGHUP vttablet knowing that it can bring down the entire system.

Because of the risk of downtime caused by simple ACL definition mistakes, I think this is a candidate for back-porting.

Related Issue(s)

#17139

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

Copy link
Contributor

vitess-bot bot commented Jan 8, 2025

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@vitess-bot vitess-bot bot added NeedsBackportReason If backport labels have been applied to a PR, a justification is required NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsWebsiteDocsUpdate What it says labels Jan 8, 2025
@github-actions github-actions bot added this to the v22.0.0 milestone Jan 8, 2025
Copy link
Contributor

@dbussink dbussink left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll need to amend your commit with the proper sign off to pass the DCO check as well.

@@ -105,6 +105,7 @@ func (tacl *tableACL) init(configFile string, aclCB func()) error {
if configFile == "" {
return nil
}
log.Infof("Reading ACL file %v", configFile)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should log it here. What's the reason you need it here? Was it not clear that this was happening? I would imagine it was clear from the lines below on errors what the cause of the problem was?

Below here, there's also Info logging on errors which is wrong, those should be raised at least to warning if not error. It's confusing those are set at the info level.

If there's a good reason to add the logging, maybe instead we should log inside the SIGHUP handler to make it specific for the scenario? I don't think this line is something useful to print on each startup then. We should be careful with not adding too much logging noise that loses signal then.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should log it here. What's the reason you need it here? Was it not clear that this was happening? I would imagine it was clear from the lines below on errors what the cause of the problem was?

The issue is that when there is no error, sending a SIGHUP doesn't cause any log output to be generated. So, you don't know if it even did anything.

I considered adding the line to the SIGHUP handler. We can do that too I suppose. I'll change it, along with the sign-off.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is that when there is no error, sending a SIGHUP doesn't cause any log output to be generated. So, you don't know if it even did anything.

Let's log it then after it succeeds, so that we only have the line if there was a successful reload due to a SIGHUP. We can then make it more something like "Successfully reloaded ACL".

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had just force-pushed with the log line in the SIGHUP handler, also to make it clear it responds to SIGHUP. It's a convention this is logged by software, so I think it's a good idea.

But about your proposal: I generally prefer to log intentions rather than result. Otherwise you may get log entries like 'error: too many open files' and you have no idea what it was trying to do.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you have no idea what it was trying to do.

I'm not sure what you mean here? This is something that logging can do properly? Looking at the code more I think it would benefit from some additional refactor and remove the exit at a distance kinda behavior and use error returns and let the callers decide.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what you mean here?

I think we're miscommunicating a bit. You said:

What's the reason you need it here? Was it not clear that this was happening? I would imagine it was clear from the lines below on errors what the cause of the problem was?

Indeed, on error. I meant that as a sysop, I would like to see a program respond to something. If I send it a SIGHUP and no log lines are printed, I can't tell whether it did something or not.

Looking at the code more I think it would benefit from some additional refactor and remove the exit at a distance kinda behavior and use error returns and let the callers decide.

OK, I can do that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wiebeytec https://github.com/wiebeytec/vitess/compare/main...dbussink:vitess:acl-refactor-suggestion?expand=1 is kinda what I was thinking about here. Dunno what you think of that?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. In the branch I just force-pushed I still logged in the callees (deep), but functionality it's similar.

You can remove my code in favor of yours then.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think logging at the higher level would be preferable here.

@wiebeytec wiebeytec force-pushed the main branch 2 times, most recently from 62e220f to 2c30527 Compare January 9, 2025 11:31
@wiebeytec
Copy link
Author

What's the next step here? I think we can go with your code.

@dbussink
Copy link
Contributor

dbussink commented Jan 13, 2025

What's the next step here? I think we can go with your code.

We can update your PR here with that code? Feel free to use it! We'll also need to add some tests for it as well then to validate the new behavior.

This fixes potentially bringing down a tablet with an innocuous SIGHUP.

It also logs the fact it's read the ACL file, to fix not getting any
feedback on SIGHUP.

vitessio#17139

Signed-off-by: Wiebe Cazemier <[email protected]>
@wiebeytec
Copy link
Author

Alright, I updated the PR and added a test case that tests loading an invalid ACL does not remove the existing one.

@deepthi deepthi added Component: ACL Type: Enhancement Logical improvement (somewhere between a bug and feature) and removed NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says NeedsIssue A linked issue is missing for this Pull Request NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels Jan 15, 2025
Copy link

codecov bot commented Jan 15, 2025

Codecov Report

Attention: Patch coverage is 84.21053% with 3 lines in your changes missing coverage. Please review.

Project coverage is 67.71%. Comparing base (6ac8998) to head (9155ce8).
Report is 31 commits behind head on main.

Files with missing lines Patch % Lines
go/cmd/vttablet/cli/cli.go 50.00% 2 Missing ⚠️
go/vt/tableacl/tableacl.go 50.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #17485      +/-   ##
==========================================
+ Coverage   67.67%   67.71%   +0.04%     
==========================================
  Files        1584     1584              
  Lines      254466   254513      +47     
==========================================
+ Hits       172215   172351     +136     
+ Misses      82251    82162      -89     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@mattlord mattlord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me. I'm leaning towards not backporting it for two reasons as I'm not confident in how this is used across the Vitess user base:

  1. What does this mean if the operator isn't watching the logs? Couldn't this potentially mean that the vttablet is running along as if everything is fine while having more permissive permissions than intended/expected? We can say that this is the operators concern/fault, but it's still a change.
  2. Some people could be relying on the current behavior (I've made this mistake in the past)

It's also for this reason that I think we should add a small blurb about this here https://github.com/vitessio/vitess/blob/main/changelog/22.0/22.0.0/changelog.md

What do others think?

@rohit-nayak-ps rohit-nayak-ps removed the Type: Enhancement Logical improvement (somewhere between a bug and feature) label Jan 16, 2025
@wiebeytec
Copy link
Author

People who are relying on that behavior would have to have a weird setup, because they need to have a running tablet server that they expect to exit after SIGHUP when the file contains errors. One example use is deployment on the replicas as a test, but then they'd be putting their replicas at risk; especially because it calls the OS exit system call, so it's an unclean shutdown (probably?).

Having said that, I do see your point that it's a change.

Copy link
Member

@GuptaManan100 GuptaManan100 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Matt, we should drop a line in the summary.md file so that no one is caught off-guard with the changes. Otherwise rest, LGTM. (merging from main might be required because of an extra introduced test there)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants