Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

neonvm: create a Certificate object for each VM #1213

Merged
merged 5 commits into from
Feb 17, 2025

Conversation

conradludgate
Copy link
Contributor

@conradludgate conradludgate commented Jan 21, 2025

https://github.com/neondatabase/cloud/issues/22997

Initially, this PR set out to create a cert-manager.io/Certificate object for each VM. However, a design decision for cert-manager has the associated secrets produced as orphans. This caused secrets to accumulate without cleanup.

Instead, I have augmented the flow to emulate Certificate.

  1. Generate a private key, and store it in a temporary Secret.
  2. Generate a CertificateRequest based on the associated public key.
  3. Wait for the CertificateRequest to be approved
  4. Create a tls Secret based on the CertificateRequest and private key.
  5. Delete the CertificateRequest and the temporary secret.
  6. Wait 23 hours
  7. Return to step 1.

This only takes place if the VM spec lists a TlsCertificateIssuer value

@conradludgate conradludgate force-pushed the conrad/compute-certificate-provisioning branch from 7f8372b to e9121eb Compare January 21, 2025 15:17
Copy link

github-actions bot commented Jan 21, 2025

No changes to the coverage.

HTML Report

Click to open

@conradludgate conradludgate force-pushed the conrad/compute-certificate-provisioning branch 2 times, most recently from 4592eb0 to 13ed571 Compare January 23, 2025 12:21
@edude03
Copy link
Contributor

edude03 commented Feb 6, 2025

If a vm runs longer than 23 hours does the cert reload? #1222 I guess not yet!

@conradludgate
Copy link
Contributor Author

If a vm runs longer than 23 hours does the cert reload

Yeah - when paired with the other PR, yes it does reload as expected. Still needs work to make postgres to recognise the change, but yes it should work

@Omrigan Omrigan assigned stradig and unassigned stradig Feb 11, 2025
@conradludgate conradludgate changed the base branch from main to conrad/vm-fsnotify February 12, 2025 09:57
@sharnoff sharnoff requested review from Omrigan and sharnoff February 12, 2025 10:17
Copy link
Member

@sharnoff sharnoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial review -- lots of comments, but overall looks ok enough.

I didn't look at doReconcileCertificateSecret very deeply, figured I'd wait until after refactoring

Base automatically changed from conrad/vm-fsnotify to main February 13, 2025 09:20
conradludgate added a commit that referenced this pull request Feb 13, 2025
For #1213 I need the renewed certificate secret to be updated within the
vm.

It's ok if there's some lag on this synchronisation as certificates are
renewed with generous amount of time left until expiry.

Design:
neonvm-runner will poll the files every 30s and compare a checksum against the files in the VM, via neonvm-daemon.
If the checksums do not match, they are sent to neonvm-daemon where neonvm-daemon utilises atomic-writer to update the files.

Useful links:
https://ahmet.im/blog/kubernetes-secret-volumes-delay/
https://ahmet.im/blog/kubernetes-inotify/

Some discussion here:
https://neondb.slack.com/archives/C03TN5G758R/p1737723555448339
https://neondb.slack.com/archives/C03TN5G758R/p1738669103059979
@conradludgate conradludgate force-pushed the conrad/compute-certificate-provisioning branch from 901832a to 8b3d229 Compare February 13, 2025 09:41
@conradludgate conradludgate marked this pull request as ready for review February 13, 2025 12:56
@mikhail-sakhnov
Copy link
Contributor

However, a design decision for cert-manager has the associated secrets produced as orphans. This caused secrets to accumulate without cleanup.

Sorry, may be it was previously discussed, but are there more context on that statement? It's a bit unclear which design decision is in question

@conradludgate
Copy link
Contributor Author

conradludgate commented Feb 13, 2025

Sorry, may be it was previously discussed, but are there more context on that statement? It's a bit unclear which design decision is in question

A design decision by cert-manager.io

https://cert-manager.io/docs/usage/certificate/#cleaning-up-secrets-when-certificates-are-deleted

By default, cert-manager does not delete the Secret resource containing the signed certificate when the corresponding Certificate resource is deleted. This means that deleting a Certificate won't take down any services that are currently relying on that certificate, but the certificate will no longer be renewed. The Secret needs to be manually deleted if it is no longer needed.

If you would prefer the Secret to be deleted automatically when the Certificate is deleted, you need to configure your installation to pass the --enable-certificate-owner-ref flag to the controller.

Enabling this flag cluster wide would fix things for neonvm, but the outcome is not ideal for our other services that rely on cert-manager.

Copy link
Member

@sharnoff sharnoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good. I think it'll probably need one more pass, and then be good to go.

cc @mikhail-sakhnov or @Omrigan, might be worthwhile to double-check my comments here

@sharnoff sharnoff assigned conradludgate and unassigned sharnoff Feb 14, 2025
@mikhail-sakhnov
Copy link
Contributor

Enabling this flag cluster wide would fix things for neonvm, but the outcome is not ideal for our other services that rely on cert-manager.

I think one of the options as well is to have a controller for secrets to clean them up, if they orpaned, instead of re-wrtiting cert manager logic.

@conradludgate conradludgate force-pushed the conrad/compute-certificate-provisioning branch 2 times, most recently from e11656c to f12b0b0 Compare February 14, 2025 13:51
@conradludgate conradludgate removed their assignment Feb 14, 2025
Copy link
Member

@sharnoff sharnoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple small things left. Should be good to go once addressed, I think

@sharnoff sharnoff assigned conradludgate and unassigned sharnoff Feb 14, 2025
@conradludgate conradludgate changed the title feat: create a Certificate object for each VM neonvm: create a Certificate object for each VM Feb 17, 2025
Copy link
Member

@sharnoff sharnoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm - nice!

@conradludgate conradludgate force-pushed the conrad/compute-certificate-provisioning branch from 3db620a to f28713a Compare February 17, 2025 17:33
@conradludgate conradludgate enabled auto-merge (rebase) February 17, 2025 17:33
@conradludgate conradludgate enabled auto-merge (squash) February 17, 2025 17:33
@conradludgate conradludgate merged commit bee9186 into main Feb 17, 2025
32 of 33 checks passed
@conradludgate conradludgate deleted the conrad/compute-certificate-provisioning branch February 17, 2025 18:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants