OIDC token refresh never recovers from network disruption, permanent UNAUTHORIZED state requires restart #990

@dmuensterer

Description

Summary

A brief network disruption (subsecond TCP RST) permanently kills OIDC-authenticated ziti-edge-tunnel clients. The OIDC token refresh loop retries on a broken HTTP handle that never reconnects, exhausts a 30-second retry window, and the client enters permanent UNAUTHORIZED state. Only a manual process restart recovers it.

This currently blocks our migration from legacy authentication to OIDC: in practice it means tunnelers permanently drop roughly once a day, requiring manual intervention or external watchdog restarts.

Environment

  • ziti-edge-tunnel v1.10.9
  • Single-controller deployment (non-HA)
  • OIDC authentication

What happens

After any event that kills an established TCP connection to the controller (network flap, LB idle timeout, firewall state flush, etc.), the timeline is:

  1. Both the controller API and OIDC HTTP handles discover the dead connection on their next use
  2. OIDC refresh retries every 5s on the same broken tlsuv_http_t handle, every attempt fails in ~30ms with -103/-9
  3. After 30 seconds the access token expires (the 30-second margin is hardcoded, see library/oidc.c:567-580)
  4. ztx_prepare() triggers ziti_set_unauthenticated() → all edge router channels torn down → all tunneled traffic severed
  5. Recovery attempts (ziti_force_api_session_refresh, ziti_re_auth) go through the same broken handles → same instant failures
  6. Client is permanently dead

With a 30-minute token, any single disruption has a ~1.7% chance of overlapping the final 30-second refresh window. Over a day's worth of tokens (48 cycles), the probability of at least one hit is ~55%. In practice this means almost daily permanent failures in our environment with occasional network imperfections.

Log evidence

# Connection dies, controller API fails instantly (30ms, not a fresh TCP attempt):
(3716046)[     1946.837] ctrl_resp_cb() request[/edge/client/v1/current-api-session] failed: -103(software caused connection abort)
(3716046)[     1946.866] ctrl_resp_cb() request[/edge/client/v1/current-api-session] failed: -9(bad file descriptor)

# OIDC refresh hits the same broken handle, retries every 5s, never recovers:
(3716046)[     2364.021] refresh_cb() OIDC token refresh failed (trying again): -103/software caused connection abort
(3716046)[     2369.033] refresh_cb() OIDC token refresh failed (trying again): -103/software caused connection abort
(3716046)[     2374.040] refresh_cb() OIDC token refresh failed (trying again): -9/bad file descriptor
# ... 144 consecutive failures over 11+ minutes ...

# Meanwhile, edge router channels on the SAME host reconnect fine, host is reachable:
channel.c:973 on_tls_connect() ch[1] connected sending Hello with token[...]

# Eventually, token expires → death spiral:
hello_reply_cb() ch[0] not authenticated via session token has expired
ztx_work_cb() ztx[1] controller is not reachable

I'm fairly confident the -9 (EBADF, bad file descriptor) errors indicate that the socket fd was closed but the handle keeps trying to use it. New TCP connections are never attempted despite the host being reachable.

Reproduction

I've tried reproducing this with temporary iptables rules that block the OIDC refresh path:

| Test | OIDC behavior | Controller API | Recovery? |
|---|---|---|---|
| `iptables -A OUTPUT -j DROP` (silent packet loss) | Each retry hangs 130 s (kernel SYN timeout, no connect timeout set) | Fails in 15 s (has connect timeout) | No: one attempt exhausts the 30 s window |
| `iptables -A OUTPUT -j REJECT --reject-with tcp-reset` | Retries at 5 s interval (ECONNREFUSED) | Fails immediately | Yes: fast failure + fresh connections |
| Kill established connection (production scenario) | Retries at 5 s interval (-103/-9 in ~30 ms), handle broken | Same broken handle | No: handle never reconnects |

The REJECT test recovers because ECONNREFUSED on a new TCP socket works normally. The production failure and DROP test do not, for different reasons: DROP because there's no connect timeout, killed-connection because the tlsuv_http_t handle enters a permanently broken state.

Causes I've identified in source (v1.10.9)

1. OIDC tlsuv_http_t handle is never reset after connection failure

refresh_cb() (oidc.c:613–616) schedules a 5-second retry on the same clt->http handle. No code path in oidc.c ever resets the connection on error.

The fix already exists in tlsuv: tlsuv_http_cancel_all() tears down the transport and sets state to Disconnected while keeping the handle alive and reusable. It is already used in the SDK — but only for the controller client during graceful shutdown (ziti_ctrl.c:700–705, called from ziti.c:531). It is never called on the OIDC handle, and never called in any error recovery path.

The controller API handle has the same problem: ctrl_resp_cb() (ziti_ctrl.c:178–191) attempts HA failover via ctrl_next_ep(), which returns NULL for single-controller deployments (ziti_ctrl.c:567–569), and then does nothing.

2. OIDC HTTP client has no connect timeout (kernel default ~130s)

oidc_client_init() (oidc.c:83–112) never calls tlsuv_http_connect_timeout(). The controller API client sets 15s (ziti_ctrl.c:650). The OIDC client inherits the kernel default (~130s on Linux), so a single attempt against a DROPping path consumes the entire retry window.

3. Token refresh margin is only 30 seconds before expiry

oidc_client_set_tokens() (oidc.c:577) schedules refresh at expires_in - 30. For a 1800s token, this leaves 30 seconds (six retries at the 5-second interval) before the token expires and ziti_set_unauthenticated() (ziti.c:222) tears down all channels.

Suggested fixes (in priority order)

1. Call tlsuv_http_cancel_all() on the OIDC handle after repeated failures — In refresh_cb(), after N consecutive connection errors (e.g., 3), call tlsuv_http_cancel_all(&clt->http) to force the transport to Disconnected. The next retry will open a fresh TCP+TLS connection. This is a ~10-line change using an API that's already part of the SDK's dependency and already called elsewhere (ziti_ctrl_cancel). It's the only fix that breaks the permanent-failure loop.

2. Add connect timeout to OIDC HTTP client — Call tlsuv_http_connect_timeout(&clt->http, 15000) in oidc_client_init(), matching the controller client. Prevents 130s hangs in packet-loss scenarios.

3. Increase token refresh margin — Replace the fixed 30-second margin with a proportion of token lifetime (e.g., expires_in / 2). For a 1800s token this gives ~15 minutes of retry runway instead of 30 seconds.

4. Add controller client retry for single-controller setups — When ctrl_next_ep() returns NULL, schedule a reconnect attempt with backoff instead of silently giving up.
