peering depends on 'Credentials retriever' task but on shutdown that is stopped sooner #1033

Open
asteven opened this issue Jun 27, 2023 · 1 comment · May be fixed by #1110
Labels
bug Something isn't working

Comments

@asteven
Contributor

asteven commented Jun 27, 2023

Long story short

When a ConnectionInfo has an expiration set, which is required for working with expiring tokens, the operator does not properly leave the peering on shutdown and never exits.

It hangs because updating or leaving the peering needs valid credentials, but the vault ends up in a need_reauth state with no 'Credentials retriever' task left to populate it with new credentials.
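
The hang can be illustrated with a toy model in plain asyncio. This is only a sketch under stated assumptions: `ToyVault`, `retriever`, `leave_peering`, and `shutdown` are hypothetical names invented for illustration, not kopf's actual classes or functions. It shows that once the task that refills the vault is cancelled, anything that waits for fresh credentials waits forever (here cut short by a timeout).

```python
import asyncio
from typing import Optional


class ToyVault:
    """Hypothetical stand-in for a credentials vault; NOT kopf's real Vault."""

    def __init__(self) -> None:
        self._credentials: Optional[str] = None
        self._ready = asyncio.Event()

    def needs_reauth(self) -> bool:
        return not self._ready.is_set()

    def populate(self, credentials: str) -> None:
        self._credentials = credentials
        self._ready.set()

    def invalidate(self) -> None:
        # Models an expired token: consumers must wait for new credentials.
        self._credentials = None
        self._ready.clear()

    async def get(self) -> Optional[str]:
        await self._ready.wait()
        return self._credentials


async def retriever(vault: ToyVault) -> None:
    # Toy "credentials retriever": refills the vault whenever it is empty.
    while True:
        if vault.needs_reauth():
            vault.populate("token")
        await asyncio.sleep(0.01)


async def leave_peering(vault: ToyVault) -> str:
    # Toy peering cleanup: it cannot proceed without valid credentials.
    credentials = await vault.get()
    return f"left peering with {credentials}"


async def shutdown(retriever_first: bool) -> str:
    vault = ToyVault()
    vault.populate("token")
    task = asyncio.create_task(retriever(vault))
    await asyncio.sleep(0.02)
    if retriever_first:
        task.cancel()  # the order seen in the logs: the retriever goes first
    vault.invalidate()  # the token expires during shutdown
    try:
        return await asyncio.wait_for(leave_peering(vault), timeout=0.2)
    except asyncio.TimeoutError:
        return "hung"
    finally:
        task.cancel()
        await asyncio.gather(task, return_exceptions=True)
```

Calling `asyncio.run(shutdown(retriever_first=True))` mimics the order seen in the logs, while `retriever_first=False` lets the toy peering cleanup finish.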

Kopf version

main

Kubernetes version

any

Python version

3.10.10

Code

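import datetime
from typing import Any, Optional, Sequence

import kopf
import kubernetes_asyncio.client
import kubernetes_asyncio.config


@kopf.on.login()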
async def authenticate(
        *,
        logger: kopf.Logger,
        **_: Any,
) -> Optional[kopf.ConnectionInfo]:

    try:
        kubernetes_asyncio.config.load_incluster_config()  # cluster env vars
        logger.debug("Async client is configured in cluster with service account.")
    except kubernetes_asyncio.config.ConfigException:
        try:
            await kubernetes_asyncio.config.load_kube_config()  # developer's config files
            logger.debug("Async client is configured via kubeconfig file.")
        except kubernetes_asyncio.config.ConfigException:
            raise kopf.LoginError("Cannot authenticate the async client library "
                                  "neither in-cluster, nor via kubeconfig.")

    # We do not even try to understand how it works and why. Just load it, and extract the results.
    # For kubernetes client >= 12.0.0 use the new 'get_default_copy' method
    if callable(getattr(kubernetes_asyncio.client.Configuration, 'get_default_copy', None)):
        config = kubernetes_asyncio.client.Configuration.get_default_copy()
    else:
        config = kubernetes_asyncio.client.Configuration()

    # For auth-providers, this method is monkey-patched with the auth-provider's one.
    # We need the actual auth-provider's token, so we call it instead of accessing api_key.
    # Other keys (token, tokenFile) also end up being retrieved via this method.
    header: Optional[str] = config.get_api_key_with_prefix('BearerToken')
    parts: Sequence[str] = header.split(' ', 1) if header else []
    scheme, token = ((None, None) if len(parts) == 0 else
                     (None, parts[0]) if len(parts) == 1 else
                     (parts[0], parts[1]))  # RFC-7235, Appendix C.

    # A 10-second expiration is used here; the commented-out lines are alternative values.
    # expiration = datetime.datetime.utcnow() + datetime.timedelta(minutes=1)
    expiration = datetime.datetime.utcnow() + datetime.timedelta(seconds=10)
    # expiration = None
    return kopf.ConnectionInfo(
        server=config.host,
        ca_path=config.ssl_ca_cert,  # can be a temporary file
        insecure=not config.verify_ssl,
        username=config.username or None,  # an empty string when not defined
        password=config.password or None,  # an empty string when not defined
        scheme=scheme,
        token=token,
        certificate_path=config.cert_file,  # can be a temporary file
        private_key_path=config.key_file,  # can be a temporary file
        priority=1,
        expiration=expiration
    )

Logs

^C[2023-06-27 21:58:49,358] kopf._core.reactor.r [INFO    ] Signal SIGINT is received. Operator is stopping.
[2023-06-27 21:58:49,358] kopf._core.reactor.r [DEBUG   ] Admission mutating configuration manager is cancelled.
[2023-06-27 21:58:49,359] kopf._core.reactor.r [DEBUG   ] Admission insights chain is cancelled.
[2023-06-27 21:58:49,359] kopf._core.reactor.r [DEBUG   ] Namespace observer is cancelled.
[2023-06-27 21:58:49,359] kopf._core.reactor.r [DEBUG   ] Credentials retriever is cancelled.
[2023-06-27 21:58:49,359] kopf._core.reactor.r [DEBUG   ] Admission webhook server is cancelled.
[2023-06-27 21:58:49,359] kopf._core.reactor.r [DEBUG   ] Admission validating configuration manager is cancelled.
[2023-06-27 21:58:49,360] kopf._core.reactor.r [DEBUG   ] Poster of events is cancelled.
[2023-06-27 21:58:49,361] kopf._cogs.clients.w [DEBUG   ] Stopping the watch-stream for customresourcedefinitions.v1.apiextensions.k8s.io cluster-wide.
[2023-06-27 21:58:49,361] kopf._cogs.clients.w [DEBUG   ] Stopping the watch-stream for clusterkopfpeerings.v1.kopf.dev cluster-wide.
[2023-06-27 21:58:49,363] kopf._cogs.clients.w [DEBUG   ] Stopping the watch-stream for netcenterips.v1alpha1.netcenter.hpc.ethz.ch cluster-wide.
[2023-06-27 21:58:49,363] kopf._cogs.clients.w [DEBUG   ] Stopping the watch-stream for services.v1 cluster-wide.
[2023-06-27 21:58:49,363] kopf._cogs.clients.w [DEBUG   ] Stopping the watch-stream for ingresses.v1.networking.k8s.io cluster-wide.
[2023-06-27 21:58:49,363] kopf._core.reactor.r [DEBUG   ] Daemon killer is cancelled.
[2023-06-27 21:58:49,363] kopf._core.reactor.r [DEBUG   ] Resource observer is cancelled.
[2023-06-27 21:58:59,370] kopf._core.reactor.o [DEBUG   ] Streaming tasks are not stopped: finishing normally; tasks left: {<Task pending name='peering keep-alive for default@None' coro=<guard() running at ./kopf/kopf/_cogs/aiokits/aiotasks.py:108> wait_for=<Future pending cb=[shield.<locals>._outer_done_callback() at /usr/lib/python3.10/asyncio/tasks.py:864, Task.task_wakeup()]>>}
[2023-06-27 21:59:09,379] kopf._core.reactor.o [DEBUG   ] Streaming tasks are not stopped: finishing normally; tasks left: {<Task pending name='peering keep-alive for default@None' coro=<guard() running at ./kopf/kopf/_cogs/aiokits/aiotasks.py:108> wait_for=<Future pending cb=[shield.<locals>._outer_done_callback() at /usr/lib/python3.10/asyncio/tasks.py:864, Task.task_wakeup()]>>}
[2023-06-27 21:59:19,386] kopf._core.reactor.o [DEBUG   ] Streaming tasks are not stopped: finishing normally; tasks left: {<Task pending name='peering keep-alive for default@None' coro=<guard() running at ./kopf/kopf/_cogs/aiokits/aiotasks.py:108> wait_for=<Future pending cb=[shield.<locals>._outer_done_callback() at /usr/lib/python3.10/asyncio/tasks.py:864, Task.task_wakeup()]>>}
... (the same message repeats every 10 seconds for a long time, until finally)
./run: line 8: 2849365 Killed                  kopf run --all-namespaces $@ ./handlers.py

Additional information

This only happens with peering enabled and when using a ConnectionInfo with expiration set.

asteven added the bug label Jun 27, 2023
@asteven
Contributor Author

asteven commented Jul 22, 2023

Fixed by considering dependencies when shutting down tasks.

main...asteven:kopf:cleaner_shutdown
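
The idea of a dependency-aware shutdown can be sketched roughly as follows. This is not the code from the linked branch; it is a minimal illustration in plain asyncio, and `stop_in_dependency_order`, the `depends_on` mapping, and the task names are all hypothetical. It stops each task before the tasks it depends on, so that a dependent task can still use its dependencies while it cleans up.

```python
import asyncio
from graphlib import TopologicalSorter
from typing import Dict, Iterable, List


async def stop_in_dependency_order(
        tasks: Dict[str, "asyncio.Task[None]"],
        depends_on: Dict[str, Iterable[str]],
) -> List[str]:
    """Cancel tasks one by one, dependents first, dependencies last."""
    graph = {name: set(depends_on.get(name, ())) for name in tasks}
    # static_order() yields dependencies before dependents, so walk it backwards.
    order = list(TopologicalSorter(graph).static_order())
    stopped: List[str] = []
    for name in reversed(order):
        task = tasks.get(name)
        if task is None:
            continue  # a dependency that was named but is not a running task
        task.cancel()
        # Wait until the task has fully finished (including its cleanup)
        # before touching anything it depends on.
        await asyncio.gather(task, return_exceptions=True)
        stopped.append(name)
    return stopped


async def demo() -> List[str]:
    async def worker() -> None:
        await asyncio.Event().wait()  # runs until cancelled

    names = ("credentials retriever", "peering keep-alive")
    tasks = {name: asyncio.create_task(worker()) for name in names}
    await asyncio.sleep(0)  # let the workers start
    return await stop_in_dependency_order(
        tasks, {"peering keep-alive": ["credentials retriever"]})
```

In this sketch, `asyncio.run(demo())` stops the toy "peering keep-alive" task first and the toy "credentials retriever" task last, which is the reverse of the order seen in the logs above.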

asteven linked a pull request on Mar 18, 2024 that will close this issue