Idempotency for istio-iptables apply flow #50328
base: master
Conversation
Skipping CI for Draft Pull Request.
tools/istio-iptables/pkg/cmd/root.go (Outdated)
@@ -136,6 +136,8 @@ func bindCmdlineFlags(cfg *config.Config, cmd *cobra.Command) {
		&cfg.NetworkNamespace)

	flag.BindEnv(fs, constants.CNIMode, "", "Whether to run as CNI plugin.", &cfg.CNIMode)

	flag.BindEnv(fs, constants.PreemptiveCleanup, "", "Whether to run a preemptive cleanup to prevent duplicated rules and chains", &cfg.PreemptiveCleanup)
Asking, because I haven't formed an opinion yet - does this need to be optional? Why would you not want to always just do this as a pre-step?
I was not sure if it should be performed all the time, so I made it opt-in for now in order not to impact the VM use case (a use case I don't know much about).
I'm investigating if there could be an impact there by doing a preemptive cleanup which deletes the ISTIO_* chains and every rule in them.
Anyway, I'm still doing some manual tests and I need to add proper coverage for this cleanup.
I would prefer not to make this a flag, and simply do it every time, which should be fine if it is idempotent.
If we did have to make it an option, the option should be an opt-out, not an opt-in.
The option is now enabled by default and cannot be disabled. I don't think the cleanup will affect existing use cases, and it is in any case executed in "quiet mode" with errors being ignored.
Later on, probably not in this commit, I plan to add a --cleanup-only option which could come in handy for VM workloads.
I would prefer not to make this a flag, and simply do it every time, which should be fine if it is idempotent.
If we did have to make it an option, the option should be an opt-out, not an opt-in.
I think after the last fix the logic should be fine now.
Looking good, TY!
doesn't delete+apply cause downtime if we are to actually apply this somewhere with rules already? Thinking of VMs, etc?
Or in the pod case as well actually. This seems dangerous
Well, in the case he's trying to fix, traffic disruption already is unavoidable and it shouldn't be unsafe:
any mutation of iptables rules should be construed as causing downtime, I don't think there's a way to do any iptables CRUD with no-downtime guarantees. Whether we fail during cleanup or fail during create, a failed iptables state means we don't know what the heck is happening to traffic, basically, and is always dangerous. If we fail in either of those spots before user workloads are started, we can at least fail-closed.
In this case ("Because init containers can be restarted, retried, or re-executed, init container code should be idempotent. In particular, code that writes to files on EmptyDirs should be prepared for the possibility that an output file already exists.") traffic is 100% fine. Kubernetes/the container runtime is being silly and re-running the init container for no reason, but the application containers are perfectly fine. TBH I don't know how it is acceptable for k8s to do this, but 🤷
Maybe we should just stop running
Oooh, I assumed that init containers would not be re-executed while the application is running, but I was wrong; it does indeed look like it can happen... Yes, the iptables operations are not atomic, so the delete will lead to a brief gap in iptables coverage. I think we can find a middle ground to reduce the impact on existing rules in case of reruns. Keep in mind that checking chain/rule existence is not race-condition safe, so running multiple istio-iptables commands in parallel can lead to the usual shenanigans. In case of a race condition:
I assume that is fine since we don't expect parallel executions of istio-iptables. Regarding the current cleanup logic of this PR, it could go behind a --reverse flag which, while not used for providing idempotency, would still provide a simple way to reverse changes performed via istio-iptables. What do you think :) ?
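For what it's worth, the check-then-add pattern being discussed can be sketched as follows (over an in-memory rule list with hypothetical names; the real flow shells out to iptables -C and iptables -A, which is exactly why the check and the add are not atomic across processes):

```go
package main

import "fmt"

// appendIfAbsent models "iptables -C || iptables -A": check whether the rule
// already exists in the chain, and append it only if it does not. Because the
// check and the append are two separate operations, two concurrent writers
// can both pass the check and both append, duplicating the rule - the race
// the comment above is describing.
func appendIfAbsent(chain []string, rule string) []string {
	for _, r := range chain { // the "iptables -C" step
		if r == rule {
			return chain // already present: idempotent no-op
		}
	}
	return append(chain, rule) // the "iptables -A" step
}

func main() {
	var chain []string
	chain = appendIfAbsent(chain, "-p tcp -j ISTIO_REDIRECT")
	chain = appendIfAbsent(chain, "-p tcp -j ISTIO_REDIRECT") // rerun: no duplicate
	fmt.Println(len(chain)) // 1
}
```

Within a single process this is idempotent on rerun; across parallel invocations it is not, which is why the comment above accepts the race as long as parallel executions are not expected.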
One thing to keep in mind - the iptables library we have is used in 3 spots, 1 of which is run as a CLI binary in a sidecar context. The other 2 spots use it as a straight library, and may or may not use it in a sidecar context, as the README mentions: https://github.com/istio/istio/tree/master/tools/istio-iptables/pkg#readme. Whatever we do here has to work equally well for all 3 spots - it cannot be just for init containers or just for invocations of the library's CLI wrapper.
This is still not an upgrade-safe reversal. If you do it per-rule, we need 2 modes:
otherwise skip if the rule is already there and is exactly what the current code expects. Mode 1 should be strictly safe in all contexts that need it, assuming all the rules are checked for correctness before they are removed.
Yep, we should be fine there. If we skip existing rules, that also means we cannot use the
We can have idempotency, no traffic disruptions, or safety - pick two, and kube has already picked the first one for us :D
Another thing we could do is say that there are no traffic disruption guarantees for init containers, because as per kube, init containers can be restarted. And that if people are concerned about that, they should use Istio CNI and the CNI plugin instead of the init container. I think that's very reasonable.
👎 It is 100% unacceptable for us to take a working, running sidecar and suddenly stop capturing traffic. That is a critical CVE. Especially when the motivation for it is mostly cosmetic (the init container restarting issue just results in it crashlooping, which has no effect on the app). Also, I don't really see why we cannot pick all 3. You can update iptables rules without "delete+add". Also, we probably don't even need to update them at all - just see they are already what we want and bail out without doing anything?
Yes, and that is also not what I'm suggesting. I'm suggesting traffic disruption, not traffic escape. That's why I'm making a distinction between
That's true here, but does it change the fact that we will still need a cleanup mechanism, either way, that works in all contexts, of which init containers are one? We should at least document that we ignore the K8S spec for our init containers, then, and tell people to ignore the crashlooping Istio containers.
Yes, if they are all correct, we can do nothing. The complications enter if they are not all correct (because they got mangled, or are outdated, or otherwise not in a known-good state) - what do we do? And if we go rule by rule to update, we are again excluding the possibility of using
Nice, the new check sounds like it could be a lot more reliable. One thing I worry about is the init container case in Kubernetes. It's important to consider that in this case there is nothing wrong with their pod today; all traffic works fine. I worry we will have a false positive, rewrite the rules, and cause a small outage for these users, all of this for a change that wasn't actually fixing real traffic issues. Is there some way we can only do this outside of k8s or something? Also, kube-proxy has some updating of iptables; they may have some good patterns for this.
few comments, looking better, ty!
func (cfg *IptablesConfigurator) VerifyIptablesState(iptVer, ipt6Ver *dep.IptablesVersion) (bool, bool) {
	// These variables track the status of iptables installation
	deltaExists := false // Flag to indicate if a difference is found between expected and current state
	reconcile := false   // Flag to indicate if a reconciliation via cleanup is needed
Why does this function set/return reconcile? I don't think this function needs to return two bools anyway; it can just report deltaFound (true|false) to the caller, which can decide what to do based on the config flag.
Ah yes, you are right. I unintentionally used deltaFound as the old leftoverFound.
I will change it to convey the proper intention and simplify the function's return value.
On second thought, I believe the function should return both residueExists and deltaExists.
Returning only deltaExists wouldn't allow us to distinguish between the first execution and reruns that might require cleanup. This would lead to guardrails+cleanup for the first run, which is not desirable.
"reconcile", on the other hand, I agree does not belong in this function.
I feel extremely uncomfortable with this. It seems like we are trying to solve a problem that has been solved in numerous projects in the space already, but doing it in a way they all have decided is broken and moved away from?
I don't think I can approve this unless we either move to a proven-safe (via prior art in another project) way to idempotently update iptables rules, or have a very compelling reason why our needs are different than other projects.
	return output
}

func (rb *IptablesRuleBuilder) buildGuardrails() []*Rule {
I don't get this guardrails concept. isn't the point of iptables-save and restore to allow doing transactions? Why do we need to do delete+add when we can just reconcile the state in one go under a lock?
Do other implementations do this?
The guardrails I believe are in place because iptables-save output is not relied upon, and we have to go rule-by-rule with iptables -C as per #50328 (comment).
So, we have to DIY transactions, basically.
In theory it would be possible to use iptables-restore to do cleanup+apply in one operation, although it cannot be considered a single atomic transaction in every circumstance.
iptables-restore does provide a certain level of atomicity, but the atomicity only applies at a "table level": an iptables-restore run is the sum of multiple atomic per-table transactions, so it cannot be considered a single atomic transaction in its entirety. See xtables-restore L96: https://git.netfilter.org/iptables/tree/iptables/xtables-restore.c#n96
Given that sometimes we change more than just the nat table, I did not feel safe that using iptables-restore would provide a no-traffic-escape guarantee that would have allowed me to get rid of the safety guardrails.
On a side note, using iptables-restore has some extra challenges, as it requires the cleanup to be much more surgical in the removal: we must be sure that what we remove exists at that time, otherwise the whole table transaction would fail. This is not a blocker though; if needed I could make it work.
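For reference, a restore payload touching two tables looks roughly like this (illustrative rules, not the exact ones istio-iptables writes); each *table ... COMMIT block is committed as its own transaction, so a failure in the mangle block would leave the nat block already applied:

```
*nat
:ISTIO_REDIRECT - [0:0]
-A ISTIO_REDIRECT -p tcp -j REDIRECT --to-ports 15001
COMMIT
*mangle
:ISTIO_DIVERT - [0:0]
-A ISTIO_DIVERT -j MARK --set-mark 1337
COMMIT
```

This per-table commit boundary is why spanning nat and mangle in one iptables-restore invocation does not, by itself, give a no-traffic-escape guarantee.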
// Execute iptables-restore
if err := cfg.executeIptablesRestoreCommand(iptVer, true); err != nil {
	return err

func (cfg *IptablesConfigurator) getStateFromSave(data string) map[string]map[string][]string {
I don't think this is safe.
There is a rich history in Kubernetes of "parsing iptables-save is broken". kubernetes/kubernetes#112477 and many links from there are good history.
Calico is in a similar boat: https://github.com/projectcalico/calico/blob/23ae58b62765b14aa2c5952b2fc6c40155731a79/felix/iptables/table.go#L141
I agree we can't rely on that; I believe that is why @leosarra moved to iptables -C instead in #50328 (comment), which should address the potential wobbliness, but I will let them reply.
Yes, what the Calico comment says is correct.
Parsing the rules provided by iptables-save is unsafe because the rules, as represented in the kernel, differ from how they were originally applied (although both representations are obviously equivalent).
In some scenarios the changes can be easily accounted for (e.g. ordering of flags), while in other cases they are really hard to take care of (e.g. flags being removed/added/changed, or values being represented differently, decimal->hexadecimal). For this reason I rely on iptables-save only for the following things: obtaining *{table} and :{chain} entries, and counting the number of rules.
I use that information to decide whether a delta exists, using the following logic:
- every istio chain in the expected state must also exist in the current state
- every istio chain must have the same number of elements in both the current and expected state
- every rule in the expected state (be it in istio or non-istio chains) must exist in the current state; the verification is carried out by using "iptables -C" on the rule produced by our iptables builder. No comparison of the parsed rules is performed.
If all 3 conditions are satisfied then no delta is present and no new tables/chains have to be applied.
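A minimal sketch of that limited use of iptables-save output (hypothetical names, not the PR's actual getStateFromSave) could look like this; note that rule bodies are only counted, never parsed:

```go
package main

import (
	"fmt"
	"strings"
)

// parseSaveState extracts only the pieces of iptables-save output that the
// comment above says are safe to rely on: table markers ("*nat"), chain
// declarations (":ISTIO_OUTPUT - [0:0]"), and per-chain rule counts.
// Rule bodies are deliberately NOT parsed.
func parseSaveState(data string) map[string]map[string]int {
	state := map[string]map[string]int{} // table -> chain -> rule count
	table := ""
	for _, line := range strings.Split(data, "\n") {
		line = strings.TrimSpace(line)
		switch {
		case strings.HasPrefix(line, "*"): // "*nat" starts a table block
			table = line[1:]
			state[table] = map[string]int{}
		case strings.HasPrefix(line, ":") && table != "": // chain declaration
			chain := strings.Fields(line[1:])[0]
			state[table][chain] = 0
		case strings.HasPrefix(line, "-A") && table != "": // count rules only
			chain := strings.Fields(line)[1]
			state[table][chain]++
		}
	}
	return state
}

func main() {
	save := `*nat
:PREROUTING ACCEPT [0:0]
:ISTIO_OUTPUT - [0:0]
-A ISTIO_OUTPUT -o lo -j RETURN
-A ISTIO_OUTPUT -j ISTIO_REDIRECT
COMMIT`
	fmt.Println(parseSaveState(save)["nat"]["ISTIO_OUTPUT"]) // 2
}
```

The actual rule-level equivalence would then be established separately, rule by rule, via "iptables -C", as the three conditions above describe.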
I've taken a look at how kube-proxy, Linkerd, and Kuma fare with respect to the reconciliation of iptables and iptables-save usage.
- Kube-proxy: kube-proxy utilizes iptables-save and its nft equivalent to find chain names in a table; some of the chains are then flushed and deleted. Chains are categorized as either active or stale, with only stale chains being deemed safe to remove (https://github.com/kubernetes/kubernetes/blob/master/pkg/proxy/iptables/proxier.go#L1412-L1428). Kube-proxy does not perform any rule-level checks or parse the rules. During the proxy's startup, all pre-existing rules are considered stale and are cleaned up before new ones are created (https://github.com/kubernetes/kubernetes/blob/master/cmd/kube-proxy/app/server.go#L381).
- Linkerd: In the init container, Linkerd uses iptables-save to dump rules and relies on a regex check to determine if any rule that resembles "-A standard_chain ... -j linkerd_chain" exists. If so, jump rules are not recreated. On the other hand, existing Linkerd-made chains are flushed and recreated in a single transaction; the operation is atomic as all the rules are contained in the nat table. (Make iptables rule application idempotent upon init pod restart, linkerd/linkerd2-proxy-init#164)
- Kuma: By looking at the code it seems that the init container is not idempotent.
Overall, our use of the iptables-save output appears to be similar to that of kube-proxy, with the exception that we also count the number of rules in a chain.
Linkerd also behaves similarly to this PR, although there is no check for compatible iptables before performing a cleanup. I don't think there is a need for any guardrails in Linkerd, as all the rules reside in one table and can be updated atomically with one iptables-restore transaction, which keeps the logic quite simple.
I use that information to decide whether a delta exists, using the following logic:
every istio chain in the expected state must also exist in the current state; every istio chain must have the same number of elements in both the current and expected state; every rule in the expected state (be it in istio or non-istio chains) must exist in the current state, with the verification carried out by using "iptables -C" on the rule produced by our iptables builder. No comparison of the parsed rules is performed.
If all 3 conditions are satisfied then no delta is present and no new tables/chains have to be applied.
Can you put this blurb in the istio-iptables README? This logic specifically is important to articulate.
Can you put this blurb in the istio-iptables README? This logic specifically is important to articulate.

Done
istio/tools/istio-iptables/pkg/capture/run.go
Lines 832 to 841 in fb83abe

// VerifyIptablesState function verifies the current iptables state against the expected state.
// The current state is considered equal to the expected state if the following three conditions are met:
// - Every ISTIO_* chain in the expected state must also exist in the current state.
// - Every ISTIO_* chain must have the same number of elements in both the current and expected state.
// - Every rule in the expected state (whether it is in an ISTIO or non-ISTIO chain) must also exist in the current state.
// The verification is performed by using "iptables -C" on the rule produced by our iptables builder. No comparison of the parsed rules is done.
//
// Note: The order of the rules is not checked and is not used to determine the equivalence of the two states.
// The function returns two boolean values, the first one indicates whether residues exist,
// and the second one indicates whether differences were found between the current and expected state.
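Assuming the two returned booleans are consumed as described in the thread (first run vs. idempotent rerun vs. leftover state needing cleanup), the caller-side decision could be sketched like this (hypothetical names, not the PR's code):

```go
package main

import "fmt"

// decideAction maps the (residueExists, deltaExists) pair returned by the
// verification step to the action the apply flow would take:
//   - no residue            -> first run, apply rules normally
//   - residue but no delta  -> state already matches, do nothing (idempotent rerun)
//   - residue and a delta   -> leftover/mangled state, clean up then reapply
func decideAction(residueExists, deltaExists bool) string {
	switch {
	case !residueExists:
		return "apply"
	case !deltaExists:
		return "skip"
	default:
		return "cleanup-and-apply"
	}
}

func main() {
	fmt.Println(decideAction(false, true)) // apply
	fmt.Println(decideAction(true, false)) // skip
	fmt.Println(decideAction(true, true))  // cleanup-and-apply
}
```

This is why returning only deltaExists would not suffice: a first run and a mangled rerun both show a delta, but only the latter should trigger guardrails and cleanup.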
This looks broadly good to me and makes sense. I don't care as much about init container idempotency personally (besides broadly agreeing we have to deal with the hand k8s deals us), but I do generally want robust/tested iptables repair logic - which we don't have ATM, and which this adds.
I feel extremely uncomfortable with this. It seems like we are trying to solve a problem that has been solved in numerous projects in the space already, but doing it in a way they all have decided is broken and moved away from?
I don't think I can approve this unless we either move to a proven-safe (via prior art in another project) way to idempotently update iptables rules, or have a very compelling reason why our needs are different than other projects.
I think @leosarra addressed this in #50328 (comment) - we are doing it how other projects do it, sometimes a little more carefully.
Short of dropping istio-iptables as an internal dependency and using a non-istio impl (which ATM sadly doesn't exist; if it did, I would be happy to do that - I don't enjoy reinventing the wheel, but istio-iptables existed long before @leosarra showed up with this PR), we likely aren't going to better this. This is the space we play in.
I'm good on this and the testing around it, and will defer to @howardjohn if he has residual concerns.
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Please provide a description of this PR:
Fixes #30393 and #42792.
Current behavior on master makes a re-execution of the istio-init container likely to fail due to the possibility of existing ISTIO_* chains (and/or rules).
If, for whatever reason, the init containers get re-executed, we should make sure that they do not fail because the settings are already in effect. This is also one of the recommendations of the k8s docs about init containers:
With these changes, pilot-agent ip-tables will remove rules that would end up being duplicated, plus existing istio chains.