[ovn-controller] Change startup mechanism of ovs pods #423

Draft: averdagu wants to merge 1 commit into main from the ovs-restart branch
Conversation

@averdagu (Contributor) commented Mar 26, 2025

This commit modifies the startup scripts of the ovn-controller-ovs daemonset.

This is done to allow changing the RollingUpdate strategy so that no pod may be unavailable during an update: instead of first deleting the old pod and then creating the new one, the new pod is created while the old one is still running. The way the startup scripts currently work, this is not possible.
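
For illustration, the strategy described corresponds to a DaemonSet RollingUpdate with no unavailable pods and one surge pod. A minimal sketch of what that setting looks like when applied by hand (the "openstack" namespace is an assumption, not taken from this PR):

```bash
# Sketch only: make updates create the replacement pod before the old
# one is removed, instead of deleting the old pod first.
# The "openstack" namespace is assumed, not from this PR.
kubectl -n openstack patch daemonset ovn-controller-ovs --type merge -p '
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
'
```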

The goal is to reduce the downtime observed during an update when the environment uses centralized floating IPs.

With this commit, the ovn-controller-ovs pods share the PID namespace with the host in order to allow signaling between the old and new pods.
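
Since the host PID namespace is shared, the new pod can see and signal processes started by the old pod by name. A minimal sketch of such signaling; the signal choice is an assumption, not necessarily what the PR implements:

```bash
# Sketch: with hostPID enabled, the old pod's ovs-vswitchd is visible
# to the new pod's containers and can be signaled directly.
old_pid=$(pgrep -ox ovs-vswitchd)   # oldest matching process on the node
if [ -n "$old_pid" ]; then
    kill -TERM "$old_pid"           # ask the old daemon to shut down
fi
```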

Another change is the addition of a STATE that the containers (ovsdb-server, ovs-vswitchd and ovsdb-server-init) handle internally.

The different states are (a sketch of how this state could be stored follows the list):

  • NULL (no file): the first time the DS is created on the oc worker.
  • INIT: the first time ovsdb-server-init is executed on the oc worker.
  • OVSDB_SERVER: once the ovsdb-server pod has run the startup script.
  • RUNNING: once ovsdb-server is up and ovs-vswitchd has run the startup script.
  • UPDATE: once a new pod is created and ovsdb-server-init has run.
  • RESTART_VSWITCHD: after ovsdb-server-init has finished and the new ovsdb-server pod has stopped the old ovs-vswitchd process.
  • RESTART_DBSERVER: after the old ovs-vswitchd has been restarted, the old ovsdb-server is also stopped.
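
A minimal sketch of how such a state could be kept, assuming it lives in a file on a volume shared by the three containers (the path and helper names are illustrative, not from this PR):

```bash
# Sketch: persist the state in a file shared by ovsdb-server,
# ovs-vswitchd and ovsdb-server-init.
STATE_FILE=${STATE_FILE:-/var/lib/openvswitch/ovs-startup.state}

get_state() {
    # NULL is represented by the file not existing yet.
    if [ -f "$STATE_FILE" ]; then
        cat "$STATE_FILE"
    else
        echo "NULL"
    fi
}

set_state() {
    echo "$1" > "$STATE_FILE"
}
```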

The normal flow of states is the following:

NULL -> INIT -> OVSDB_SERVER -> RUNNING
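
As a sketch of how this ordering could be enforced, the ovs-vswitchd startup script would block until ovsdb-server has advanced the state, reusing the illustrative get_state/set_state helpers above:

```bash
# Sketch: wait for ovsdb-server before starting ovs-vswitchd,
# then mark the node as RUNNING.
while [ "$(get_state)" != "OVSDB_SERVER" ]; do
    sleep 1
done
set_state RUNNING
# ... then start ovs-vswitchd itself
```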

Scale down: if the oc worker is deleted, the DS and all the pods and mount points are deleted as well; if the node comes up again, it should start from NULL.

Update: RUNNING -> (Change on CR) -> UPDATE -> RESTART_VSWITCHD ->
RESTART_DBSERVER -> OVSDB_SERVER -> RUNNING
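
A sketch of how ovsdb-server-init might pick between these two flows based on the state left behind by a previous pod, again using the illustrative helpers:

```bash
# Sketch: decide between first start and update on entry.
case "$(get_state)" in
    NULL)
        set_state INIT      # first start on this worker
        ;;
    RUNNING)
        set_state UPDATE    # an old pod is serving traffic: update flow
        ;;
    *)
        echo "unexpected state '$(get_state)'" >&2
        exit 1
        ;;
esac
```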

Related: OSPRH-11636
Jira: OSPRH-10821
Depends-on: lib-common#611


Merge Failed.

This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset.
Warning:
Error merging github.com/openstack-k8s-operators/ovn-operator for 423,6713e1371b06f42e53b3d588d33c7662d13a1a0c

openshift-ci bot commented Mar 26, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: averdagu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


Merge Failed.

This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset.
Warning:
Error merging github.com/openstack-k8s-operators/ovn-operator for 423,3cfcabe999bc5378d959d357c052079452d58bfc


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/9d1e6fb949f14b0f902e9c4913239d6e

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 24m 47s
ovn-operator-tempest-multinode FAILURE in 1h 03m 29s


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/0a45c0f113754dc8af4e86b200206bc7

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 26m 52s
ovn-operator-tempest-multinode FAILURE in 1h 05m 04s


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/32fe97f422124ae882519d53895dc2c8

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 21m 05s
ovn-operator-tempest-multinode FAILURE in 1h 01m 49s

@averdagu force-pushed the ovs-restart branch 2 times, most recently from b94efec to f304f7c on April 2, 2025 at 12:17
TLSOptions="--certificate=/etc/pki/tls/certs/ovndb.crt --private-key=/etc/pki/tls/private/ovndb.key --ca-cert=/etc/pki/tls/certs/ovndbca.crt"
DBOptions="--db ssl:ovsdbserver-nb.openstack.svc.cluster.local:6641"
@averdagu (Contributor, Author) commented on the diff:

Only SSL is supported; a mechanism needs to be added to also support non-TLS connections.
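
One possible shape for that mechanism, sketched against the lines above (using the mounted certificate as the toggle is an assumption):

```bash
# Sketch: only pass TLS options when the certificate is mounted,
# otherwise fall back to a plain TCP connection.
if [ -f /etc/pki/tls/certs/ovndb.crt ]; then
    TLSOptions="--certificate=/etc/pki/tls/certs/ovndb.crt --private-key=/etc/pki/tls/private/ovndb.key --ca-cert=/etc/pki/tls/certs/ovndbca.crt"
    DBOptions="--db ssl:ovsdbserver-nb.openstack.svc.cluster.local:6641"
else
    TLSOptions=""
    DBOptions="--db tcp:ovsdbserver-nb.openstack.svc.cluster.local:6641"
fi
```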


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/83bd56facf314f1f9e13f3c122caadd9

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 22m 12s
ovn-operator-tempest-multinode FAILURE in 1h 02m 39s

@averdagu (Contributor, Author) commented Apr 4, 2025

/recheck


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/3ff505fab9994c8ab22909f47cb474a4

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 20m 16s
ovn-operator-tempest-multinode FAILURE in 59m 38s


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/107eac5bca244667b5651800bb406375

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 15m 47s
ovn-operator-tempest-multinode FAILURE in 58m 32s

@averdagu force-pushed the ovs-restart branch 2 times, most recently from 1a1a179 to 30fec14 on April 9, 2025 at 08:19

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/9c9fa9ecc3e44aebb4d808d86e48cdf0

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 20m 03s
ovn-operator-tempest-multinode FAILURE in 59m 55s


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/fada0b90762848388e290079e5af88e4

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 21m 43s
ovn-operator-tempest-multinode FAILURE in 1h 01m 01s

openshift-ci bot commented Apr 9, 2025

@averdagu: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/ovn-operator-build-deploy-kuttl · Commit: 2cda39c · Required: true · Rerun command: /test ovn-operator-build-deploy-kuttl

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
