OIDC token refresh never recovers from network disruption, permanent UNAUTHORIZED state requires restart #990

@dmuensterer

Description

Summary

A brief network disruption (subsecond TCP RST) permanently kills OIDC-authenticated ziti-edge-tunnel clients. The OIDC token refresh loop retries on a broken HTTP handle that never reconnects, exhausts a 30-second retry window, and the client enters permanent UNAUTHORIZED state. Only a manual process restart recovers it.

This currently blocks our migration from legacy authentication to OIDC: in practice it means tunnelers permanently drop roughly once a day, requiring manual intervention or external watchdog restarts.

Environment

  • ziti-edge-tunnel v1.10.9
  • Single-controller deployment (non-HA)
  • OIDC authentication

What happens

After any event that kills an established TCP connection to the controller (network flap, LB idle timeout, firewall state flush, etc.), the timeline is:

  1. Both the controller API and OIDC HTTP handles discover the dead connection on their next use
  2. OIDC refresh retries every 5s on the same broken tlsuv_http_t handle, every attempt fails in ~30ms with -103/-9
  3. After 30 seconds the access token expires (the 30-second margin is hardcoded, see library/oidc.c:567-580)
  4. ztx_prepare() triggers ziti_set_unauthenticated() → all edge router channels torn down → all tunneled traffic severed
  5. Recovery attempts (ziti_force_api_session_refresh, ziti_re_auth) go through the same broken handles → same instant failures
  6. Client is permanently dead

With a 30-minute token, any single disruption has a ~1.7% chance of overlapping the final 30-second refresh window. Over a day's worth of tokens (48 cycles), the probability of at least one hit is ~55%. In practice this means almost daily permanent failures in our environment with occasional network imperfections.

Log evidence

# Connection dies, controller API fails instantly (30ms, not a fresh TCP attempt):
(3716046)[     1946.837] ctrl_resp_cb() request[/edge/client/v1/current-api-session] failed: -103(software caused connection abort)
(3716046)[     1946.866] ctrl_resp_cb() request[/edge/client/v1/current-api-session] failed: -9(bad file descriptor)

# OIDC refresh hits the same broken handle, retries every 5s, never recovers:
(3716046)[     2364.021] refresh_cb() OIDC token refresh failed (trying again): -103/software caused connection abort
(3716046)[     2369.033] refresh_cb() OIDC token refresh failed (trying again): -103/software caused connection abort
(3716046)[     2374.040] refresh_cb() OIDC token refresh failed (trying again): -9/bad file descriptor
# ... 144 consecutive failures over 11+ minutes ...

# Meanwhile, edge router channels on the SAME host reconnect fine, host is reachable:
channel.c:973 on_tls_connect() ch[1] connected sending Hello with token[...]

# Eventually, token expires → death spiral:
hello_reply_cb() ch[0] not authenticated via session token has expired
ztx_work_cb() ztx[1] controller is not reachable

I'm fairly confident the -9 (EBADF, bad file descriptor) errors indicate that the socket fd was closed but the handle keeps trying to use it. New TCP connections are never attempted despite the host being reachable.

Reproduction

I've tried reproducing this with temporary iptables rules that block the OIDC refresh path:

| Test | OIDC behavior | Controller API | Recovery? |
|---|---|---|---|
| `iptables -A OUTPUT -j DROP` (silent packet loss) | Each retry hangs 130 s (kernel SYN timeout, no connect timeout set) | Fails in 15 s (has connect timeout) | No: one attempt exhausts the 30 s window |
| `iptables -A OUTPUT -j REJECT --reject-with tcp-reset` | Retries at 5 s interval (ECONNREFUSED) | Fails immediately | Yes: fast failure + fresh connections |
| Kill established connection (production scenario) | Retries at 5 s interval (-103/-9 in ~30 ms), handle broken | Same broken handle | No: handle never reconnects |

The REJECT test recovers because ECONNREFUSED on a new TCP socket works normally. The production failure and DROP test do not, for different reasons: DROP because there's no connect timeout, killed-connection because the tlsuv_http_t handle enters a permanently broken state.

Causes I've identified in source (v1.10.9)

1. OIDC tlsuv_http_t handle is never reset after connection failure

refresh_cb() (oidc.c:613–616) schedules a 5-second retry on the same clt->http handle. No code path in oidc.c ever resets the connection on error.

The fix already exists in tlsuv: tlsuv_http_cancel_all() tears down the transport and sets state to Disconnected while keeping the handle alive and reusable. It is already used in the SDK — but only for the controller client during graceful shutdown (ziti_ctrl.c:700–705, called from ziti.c:531). It is never called on the OIDC handle, and never called in any error recovery path.

The controller API handle has the same problem: ctrl_resp_cb() (ziti_ctrl.c:178–191) attempts HA failover via ctrl_next_ep(), which returns NULL for single-controller deployments (ziti_ctrl.c:567–569), and then does nothing.

2. OIDC HTTP client has no connect timeout (kernel default ~130s)

oidc_client_init() (oidc.c:83–112) never calls tlsuv_http_connect_timeout(). The controller API client sets 15s (ziti_ctrl.c:650). The OIDC client inherits the kernel default (~130s on Linux), so a single attempt against a DROPping path consumes the entire retry window.

3. Token refresh margin is only 30 seconds before expiry

oidc_client_set_tokens() (oidc.c:577) schedules refresh at expires_in - 30. For a 1800s token, this leaves 30 seconds (six retries at the 5-second interval) before the token expires and ziti_set_unauthenticated() (ziti.c:222) tears down all channels.

Suggested fixes (in priority order)

1. Call tlsuv_http_cancel_all() on the OIDC handle after repeated failures — In refresh_cb(), after N consecutive connection errors (e.g., 3), call tlsuv_http_cancel_all(&clt->http) to force the transport to Disconnected. The next retry will open a fresh TCP+TLS connection. This is a ~10-line change using an API that's already part of the SDK's dependency and already called elsewhere (ziti_ctrl_cancel). It's the only fix that breaks the permanent-failure loop.

2. Add connect timeout to OIDC HTTP client — Call tlsuv_http_connect_timeout(&clt->http, 15000) in oidc_client_init(), matching the controller client. Prevents 130s hangs in packet-loss scenarios.

3. Increase token refresh margin — Replace the fixed 30-second margin with a proportion of token lifetime (e.g., expires_in / 2). For a 1800s token this gives ~15 minutes of retry runway instead of 30 seconds.

4. Add controller client retry for single-controller setups — When ctrl_next_ep() returns NULL, schedule a reconnect attempt with backoff instead of silently giving up.
