Description
What happened:
While restoring a Stolon cluster from an existing PostgreSQL database, the new cluster had disk I/O performance issues. After recovery completed, Stolon tried to shut down the PostgreSQL instance.
The shutdown took longer than the pg_ctl wait timeout (PG default: 60s), so the keeper returned an error. Stolon interpreted this as an initialization failure and restarted the restore process.
What you expected to happen:
Stolon should wait longer for PostgreSQL to shut down gracefully, instead of aborting and restarting the restore process.
How to reproduce it (as minimally and precisely as possible):
Use a PostgreSQL data directory on a disk with I/O performance issues.
Start the Stolon restore process.
Observe the logs when recovery completes and the shutdown begins.
Shutdown exceeds default timeout → keeper returns error → Stolon restarts restore.
The logs:
2025-10-15T10:18:52.811Z INFO cmd/keeper.go:1276 recovery completed
2025-10-15T10:18:52.999Z INFO postgresql/postgresql.go:384 stopping database
failed pg_ctl: server does not shut down
2025-10-15T10:19:53.275Z ERROR cmd/keeper.go:1297 failed to stop pg instance{"error": "error: exit status 1"}
2025-10-15T10:19:58.283Z ERROR cmd/keeper.go:1116 db failed to initialize or resync
2025-10-15T10:20:25.810Z INFO cmd/keeper.go:1147 current db UID different than cluster data db UID {"db": "", "cdDB": "8bf041fc"}
Anything else we need to know?:
This issue can occur in production environments where disk I/O is slow or checkpoints are large.
Increasing the PGCTLTIMEOUT environment variable lets the shutdown succeed, and the restore process then completes normally.
Suggestion: increase the default timeout or make it configurable at the cluster level.
Environment:
- Stolon version: 0.17.0
- Stolon running environment (if useful to understand the bug): Kubernetes