WIP fix e2e-kind-cloud-provider-loadbalancer
#124729
base: master
Conversation
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the `triage/accepted` label.
Force-pushed from 1efb209 to 8f7ac69
Allow either drop or reject; we previously made the same change for TCP load balancers.
Force-pushed from 8f7ac69 to 3fbcd32
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: danwinship.
Force-pushed from 3fbcd32 to 0a76031
The existing test had two problems:

- It only made connections from within the cluster, so for VIP-type LBs, the connections would always be short-circuited, and so this only tested kube-proxy's LBSR implementation, not the cloud's.
- For non-VIP-type LBs, it would only work if pod-to-LB connections were not masqueraded, which is not the case for most network plugins.

Fix this by (a) testing connectivity from the test binary, so as to test filtering external IPs and ensure we're testing the cloud's behavior; and (b) using both pod and node IPs when testing the in-cluster case.

Also some general cleanup of the test case.
(And in particular, remove `[Feature:LoadBalancer]` from it.)
It previously assumed that pod-to-other-node-nodeIP would be unmasqueraded, but this is not the case for most network plugins. Use a HostNetwork exec pod to avoid problems. Also, fix up a bit for dual-stack.
The LoadBalancer test "should handle updates to ExternalTrafficPolicy field" made multiple unsafe assumptions; in particular, that nodeports never get recycled, and that Cluster traffic policy services are *required* (rather than merely *allowed*) to masquerade traffic.

There is really no good way to prove that a given load balancer is implementing Cluster behavior rather than Local behavior without doing [Disruptive] things to the cluster. However, we can't test that HealthCheckNodePorts are working correctly without creating a LoadBalancer, so rework the test to be more about that.
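For reference, a minimal sketch (not code from this PR; `nodeIP` and `healthCheckNodePort` are placeholders) of the kind of check such a test can make: for an eTP:Local service, kube-proxy's healthcheck server answers 200 on a node that has a local endpoint and 503 on a node that doesn't.

```go
// Sketch: probe kube-proxy's healthcheck server for an eTP:Local service.
// A node hosting a local endpoint should answer 200; a node without one, 503.
url := fmt.Sprintf("http://%s/healthz", net.JoinHostPort(nodeIP, strconv.Itoa(healthCheckNodePort)))
resp, err := http.Get(url)
if err != nil {
	framework.Failf("failed to reach HealthCheckNodePort: %v", err)
}
defer resp.Body.Close()
if resp.StatusCode != http.StatusOK {
	framework.Failf("expected 200 from a node with a local endpoint, got %d", resp.StatusCode)
}
```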
Force-pushed from 0a76031 to 0640383
// FIXME: this test is supposed to ensure that "a LoadBalancer UDP service
// doesn't blackhole the traffic to the node when the pod backend is
// destroyed and the traffic has to fall back to another pod", but it's
// possible that a connection to backend 2 will succeed at this point
// before backend 1 is deleted, in which case no further successful
// connections are required to pass.
this cannot happen; the test waits until pod1 is deleted and also validates that the endpoints are missing, to avoid this problem
e2epod.NewPodClient(f).DeleteSync(ctx, podBackend1, metav1.DeleteOptions{}, e2epod.DefaultPodDeletionTimeout)
It waits until pod 1 is deleted to start checking if pod 2 has been reached, but it doesn't ensure that pod 2 was actually reached after pod 1 was deleted.
main thread                        DialUDP goroutine
-----------                        -----------------
"creating a backend pod"
"checking ... backend 1"
   ...                             DialUDP()
   ...                             hostnames.Insert("hostname1")
hostnames.Has("hostname1")
"creating a second backend pod"
                                   DialUDP()
                                   hostnames.Insert("hostname2")
DeleteSync
    (kube-proxy breaks here due to bugs)
                                   DialUDP()
                                   Logf("Failed to connect")
"checking ... backend 2"
hostnames.Has("hostname2")

Zarro boogs found!
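One hedged way to close that window (a sketch only; `hostnames`, `lbAddress`, and `dialAndRecordHostname` are illustrative stand-ins rather than the test's actual identifiers) is to reset the recorded hostnames once the delete has completed, and then require a post-deletion connection to reach backend 2:

```go
e2epod.NewPodClient(f).DeleteSync(ctx, podBackend1, metav1.DeleteOptions{}, e2epod.DefaultPodDeletionTimeout)

// Forget anything observed before the delete finished, so a hit on backend 2
// recorded during the race window can't satisfy the check by itself.
hostnames = sets.New[string]()

// Require at least one successful post-deletion connection to backend 2.
err := wait.PollUntilContextTimeout(ctx, 2*time.Second, 2*time.Minute, true,
	func(ctx context.Context) (bool, error) {
		if h, dialErr := dialAndRecordHostname(lbAddress); dialErr == nil {
			hostnames.Insert(h)
		}
		return hostnames.Has("hostname2"), nil
	})
framework.ExpectNoError(err, "expected a post-deletion connection to reach backend 2")
```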
we always use the same source IP and source port (see L754); you can't get to backend2 unless the UDP load balancer does per-packet round robin, but that is not common and I would say this is a safe assumption
laddr, err := net.ResolveUDPAddr("udp", ":54321")
if err != nil {
framework.Failf("Failed to resolve local address: %v", err)
}
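For context, a minimal sketch of how that fixed local address pins the UDP 5-tuple (`lbIP` and `lbPort` here are placeholders, not the test's variables): every probe dialed from the same `laddr` reuses the same source IP:port, so a connection-tracking load balancer keeps mapping it to the same backend.

```go
// Sketch: dialing from a fixed laddr means the LB sees one unchanging
// (srcIP, srcPort, dstIP, dstPort) tuple for every probe.
raddr := &net.UDPAddr{IP: net.ParseIP(lbIP), Port: lbPort}
conn, err := net.DialUDP("udp", laddr, raddr)
if err != nil {
	framework.Failf("Failed to dial %v from %v: %v", raddr, laddr, err)
}
defer conn.Close()
```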
@@ -2706,6 +2692,40 @@ var _ = common.SIGDescribe("Services", func() {
		}
	})

	ginkgo.It("should support externalTrafficPolicy=Local for type=NodePort", func(ctx context.Context) {
this is failing; we have this test here, #123622, where we removed one of the use cases
There are 4 failing and 3 flakes in the current state: https://testgrid.k8s.io/sig-network-kind#sig-network-kind,%20loadbalancer
The test for NodePort and externalTrafficPolicy: Local is, I think, not correct. Session affinity is implemented now and looks correct; you can see in the testgrid that those are not failing. The problem seems to be with externalTrafficPolicy.
I was having trouble fixing the eTP:Local / NodePort test because there is no obvious set of assumptions about source IP preservation that works for both GCE and kind. The test currently uses pod-network pods and expects that pod-to-other-node-IP will preserve source IP, which is not something we require. I tried rewriting it to use host-network pods, assuming that host-network-pod-to-other-node-IP would use the node IP as the source IP, which works on kind, but fails on GCE because the node-to-node connection ends up using the …

Anyway, that was the point when I pinged you about disabling the kind loadbalancer job, and I haven't gotten back to it since. Might need to do something like finding all of the IPs on the node / host-network pod, and allowing any of them. (Need to actually check the interfaces, not look at …)
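A rough sketch of that idea (hypothetical helper, not code from this PR): gather every address configured on the host's interfaces and accept the observed source IP if it matches any of them, rather than assuming one particular node IP.

```go
// nodeIPs returns every global-unicast IP configured on the local interfaces,
// so the test can accept any of them as a valid preserved source IP.
func nodeIPs() (sets.Set[string], error) {
	ips := sets.New[string]()
	ifaces, err := net.Interfaces()
	if err != nil {
		return nil, err
	}
	for _, iface := range ifaces {
		addrs, err := iface.Addrs()
		if err != nil {
			return nil, err
		}
		for _, addr := range addrs {
			if ipnet, ok := addr.(*net.IPNet); ok && ipnet.IP.IsGlobalUnicast() {
				ips.Insert(ipnet.IP.String())
			}
		}
	}
	return ips, nil
}
```

(In practice this would have to run on the node or in the host-network pod, e.g. via an exec, rather than in the e2e.test binary itself.)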
we have discussed this in the past and concluded that masquerading there was common, so it was accepted. We are close.
/test pull-kubernetes-e2e-kind-cloud-provider-loadbalancer
/test pull-kubernetes-e2e-kind-cloud-provider-loadbalancer
// Make sure dropPod is running. There are certain chances that the pod might be terminated due to unexpected reasons.
dropPod := e2epod.CreateExecPodOrFail(ctx, cs, namespace, "execpod-drop",
	func(pod *v1.Pod) {
		pod.Spec.NodeName = nodes.Items[1].Name
avoid hardcoding the node name so as not to shortcut the scheduler; this had bad consequences on the e2e tests, ending in flakiness. You can see that most of the tests use the following pattern:
nodeSelection := e2epod.NodeSelection{Name: nodes.Items[0].Name}
e2epod.SetNodeSelection(&serverPod1.Spec, nodeSelection)
e2epod.NewPodClient(f).CreateSync(ctx, serverPod1)
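Applied to the `dropPod` above, that pattern would look roughly like this (same snippet, just using the helper inside the tweak function instead of setting `pod.Spec.NodeName` directly):

```go
dropPod := e2epod.CreateExecPodOrFail(ctx, cs, namespace, "execpod-drop",
	func(pod *v1.Pod) {
		// Constrain scheduling to the chosen node via the standard helper
		// rather than bypassing the scheduler with pod.Spec.NodeName.
		e2epod.SetNodeSelection(&pod.Spec, e2epod.NodeSelection{Name: nodes.Items[1].Name})
	})
```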
// https://issues.k8s.io/123714
ingress := &svc.Status.LoadBalancer.Ingress[0]
if ingress.IP == "" || (ingress.IPMode != nil && *ingress.IPMode == v1.LoadBalancerIPModeProxy) {
	e2eskipper.Skipf("LoadBalancer uses 'Proxy' IPMode")
this is working fine; I can see that these tests are skipped now: https://testgrid.k8s.io/sig-network-kind#pr-sig-network-kind,%20loadbalancer
		framework.Failf("Source IP was NOT preserved")
	}
} else {
	gomega.Expect(err).To(gomega.HaveOccurred(), "should not have been able to reach via %s:%d", ips[0], nodePort)
this fails here now
I0515 23:34:23.117414 71876 loadbalancer.go:1291] ClientIP:Port detected by target pod using NodePort is 172.18.0.3:32882, the IP of test container is 172.18.0.3
STEP: ensuring that the service is not reachable via 172.18.0.3:32118 - k8s.io/kubernetes/test/e2e/network/loadbalancer.go:1288 @ 05/15/24 23:34:23.117
I0515 23:34:23.117509 71876 builder.go:121] Running '/home/prow/go/src/k8s.io/kubernetes/_output/bin/kubectl --server=https://127.0.0.1:36985 --kubeconfig=/root/.kube/kind-test-config --namespace=esipp-7280 exec host-test-container-pod -- /bin/sh -x -c curl -g -q -s --connect-timeout 30 http://172.18.0.3:32118/clientip'
I0515 23:34:23.460353 71876 builder.go:146] stderr: "+ curl -g -q -s --connect-timeout 30 http://172.18.0.3:32118/clientip\n"
I0515 23:34:23.460388 71876 builder.go:147] stdout: "172.18.0.3:50884"
I0515 23:34:23.460400 71876 loadbalancer.go:1291] ClientIP:Port detected by target pod using NodePort is 172.18.0.3:50884, the IP of test container is 172.18.0.3
[FAILED] should not have been able to reach via 172.18.0.3:32118
The UDP one fails because envoy crashes; seems related to envoyproxy/envoy#14866
Probably fixed 3 weeks ago by envoyproxy/envoy#33824
opened envoyproxy/envoy#34195
workaround for UDP merged: kubernetes-sigs/cloud-provider-kind#68
/test pull-kubernetes-e2e-kind-cloud-provider-loadbalancer
/test pull-kubernetes-e2e-kind-cloud-provider-loadbalancer 👀 https://testgrid.k8s.io/sig-network-kind#pr-sig-network-kind,%20loadbalancer
@danwinship: The following tests failed.
ugh, the test panics now?
What type of PR is this?
/kind bug
/kind cleanup
/kind failing-test
What this PR does / why we need it:
Moved out of #124660. Once we get the loadbalancer tests running again, we will get them working with cloud-provider-kind.
Notes:
- LoadBalancers should be able to change the type and ports of a [ TCP / UDP ] service
  - `SkipUnlessProviderIs("gce", "gke", "aws")`
- LoadBalancers should only allow access from service loadbalancer source ranges
  - … `LoadBalancerSourceRanges: ${accept_pod_ip}`, checks that accept→LB works and drop→LB does not, changes LBSR to `${drop_pod_ip}`, checks that things are reversed, unsets LBSR, checks that both pods can connect. (A sketch of the `LoadBalancerSourceRanges` field appears after this list.)
  - `SkipUnlessProviderIs("gce", "gke", "aws", "azure")`
  - … `e2e.test` binary itself would test the LB rather than kube-proxy in both cases.
- LoadBalancers should have session affinity work for LoadBalancer service with ESIPP [ on / off ]
  - … (`e2e.test`) and checks that it gets the same backend each time.
  - `SkipIfProviderIs("aws")`
- LoadBalancers should be able to switch session affinity for LoadBalancer service with ESIPP [ on / off ]
  - … (`e2e.test`) and checks that it doesn't always get the same backend, enables affinity, checks again that it now does always get the same backend.
  - `SkipIfProviderIs("aws")`
- LoadBalancers should handle load balancer cleanup finalizer for service
- LoadBalancers should be able to create LoadBalancer Service without NodePort and change it
  - … `AllocateLoadBalancerNodePorts`, checks that a NodePort was allocated, checks reachability by LB.
  - `SkipUnlessProviderIs("gce", "gke", "aws")`
  - … `e2e-kind-cloud-provider-loadbalancer`, so doesn't cause a test failure.
- LoadBalancers should be able to preserve UDP traffic when server pod cycles for a LoadBalancer service [on different nodes / on the same node]
  - `SkipUnlessProviderIs("gce", "gke", "azure")`
- LoadBalancers should not have connectivity disruption during rolling update with externalTrafficPolicy=[ Cluster / Local ]
- LoadBalancers ESIPP should work for type=LoadBalancer
  - … (`e2e.test`), checks if source IP was preserved.
  - `SkipUnlessProviderIs("gce", "gke")`
- LoadBalancers ESIPP should work for type=NodePort
  - `SkipUnlessProviderIs("gce", "gke")`
  - … `hostNetwork` pod instead.
- LoadBalancers ESIPP should only target nodes with endpoints
  - `SkipUnlessProviderIs("gce", "gke")`
- LoadBalancers ESIPP should work from pods
  - `SkipUnlessProviderIs("gce", "gke")`
- LoadBalancers ESIPP should handle updates to ExternalTrafficPolicy field
  - `SkipUnlessProviderIs("gce", "gke")`
Does this PR introduce a user-facing change?