Clusters became broken after upgrading to 1.14.0 #2852

baznikin · 2025-01-23T17:59:54Z

First of all, sorry for long logs and unstructured message. To write clean issue you have to have at least some understanding of what happens, but I have no idea yet. I read release notes on 1.12, 1.13 and 1.14 and descide I can upgrade stright to 1.14.0. But...

After upgrading postgres-operator 1.11.0 to 1.14.0 my clusters won't startup:

$ kubectl get postgresqls.acid.zalan.do -A
NAMESPACE            NAME                  TEAM               VERSION   PODS   VOLUME   CPU-REQUEST   MEMORY-REQUEST   AGE    STATUS
brandadmin-staging   brandadmin-pg         develop            16        1      100Gi    1             500Mi            429d   SyncFailed
ga                   games-aggregator-pg   games-aggregator   16        2      125Gi    1000m         512Mi            157d   SyncFailed
payments             payments-pg           develop            16        1      20Gi     1             500Mi            457d   Running
sprint-reports       asana-automate-db     sprint             16        1      25Gi     1             500Mi            358d   Running
staging              develop-postgresql    develop            17        2      250Gi    1             2Gi              435d   UpdateFailed

3 clusters successfully started with updated spilo image (payments-pg, asana-automate-db and develop-postgresql) and 2 - not (brandadmin-pg and games-aggregator-pg). Before I noticed not clusters are updated, I initialized upgrade 16 -> 17 on cluster develop-postgresql and it stuck with same symptoms (at first I thought it is this reason, but now I don't thinks so, see below):

2025-01-23 15:59:28,706 - bootstrapping - INFO - Configuring log
Traceback (most recent call last):
  File "/scripts/configure_spilo.py", line 1197, in <module>
    main()
  File "/scripts/configure_spilo.py", line 1159, in main
    write_log_environment(placeholders)
  File "/scripts/configure_spilo.py", line 794, in write_log_environment
    tags = json.loads(os.getenv('LOG_S3_TAGS'))
  File "/usr/lib/python3.10/json/__init__.py", line 339, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not NoneType

and no more logs.

Some clusters managed to start there is same error:

$ kubectl -n sprint-reports logs asana-automate-db-0
2025-01-23 15:38:54,983 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
2025-01-23 15:38:55,040 - bootstrapping - INFO - No meta-data available for this provider
2025-01-23 15:38:55,043 - bootstrapping - INFO - Looks like you are running unsupported
2025-01-23 15:38:55,191 - bootstrapping - INFO - Configuring certificate
2025-01-23 15:38:55,192 - bootstrapping - INFO - Generating ssl self-signed certificate
2025-01-23 15:38:55,775 - bootstrapping - INFO - Configuring pgqd
2025-01-23 15:38:55,776 - bootstrapping - INFO - Configuring wal-e
2025-01-23 15:38:55,777 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_PREFIX
2025-01-23 15:38:55,777 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_S3_PREFIX
2025-01-23 15:38:55,778 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ACCESS_KEY_ID
2025-01-23 15:38:55,779 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_SECRET_ACCESS_KEY
2025-01-23 15:38:55,779 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_ENDPOINT
2025-01-23 15:38:55,779 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ENDPOINT
2025-01-23 15:38:55,779 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_DISABLE_S3_SSE
2025-01-23 15:38:55,780 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DISABLE_S3_SSE
2025-01-23 15:38:55,780 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_S3_FORCE_PATH_STYLE
2025-01-23 15:38:55,781 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DOWNLOAD_CONCURRENCY
2025-01-23 15:38:55,781 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_UPLOAD_CONCURRENCY
2025-01-23 15:38:55,781 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/USE_WALG_RESTORE
2025-01-23 15:38:55,781 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_LOG_DESTINATION
2025-01-23 15:38:55,782 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/PGPORT
2025-01-23 15:38:55,782 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/BACKUP_NUM_TO_RETAIN
2025-01-23 15:38:55,793 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/TMPDIR
2025-01-23 15:38:55,793 - bootstrapping - INFO - Configuring crontab
2025-01-23 15:38:55,794 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
2025-01-23 15:38:55,808 - bootstrapping - INFO - Configuring standby-cluster
2025-01-23 15:38:55,808 - bootstrapping - INFO - Configuring bootstrap
2025-01-23 15:38:55,808 - bootstrapping - INFO - Configuring pgbouncer
2025-01-23 15:38:55,808 - bootstrapping - INFO - No PGBOUNCER_CONFIGURATION was specified, skipping
2025-01-23 15:38:55,808 - bootstrapping - INFO - Configuring patroni
2025-01-23 15:38:55,826 - bootstrapping - INFO - Writing to file /run/postgres.yml
2025-01-23 15:38:55,827 - bootstrapping - INFO - Configuring log
Traceback (most recent call last):
  File "/scripts/configure_spilo.py", line 1197, in <module>
    main()
  File "/scripts/configure_spilo.py", line 1159, in main
    write_log_environment(placeholders)
  File "/scripts/configure_spilo.py", line 794, in write_log_environment
    tags = json.loads(os.getenv('LOG_S3_TAGS'))
  File "/usr/lib/python3.10/json/__init__.py", line 339, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not NoneType
2025-01-23 15:38:57,916 WARNING: Kubernetes RBAC doesn't allow GET access to the 'kubernetes' endpoint in the 'default' namespace. Disabling 'bypass_api_service'.
2025-01-23 15:38:57,974 INFO: No PostgreSQL configuration items changed, nothing to reload.
2025-01-23 15:38:57,995 WARNING: Postgresql is not running.
2025-01-23 15:38:57,995 INFO: Lock owner: ; I am asana-automate-db-0
2025-01-23 15:38:58,000 INFO: pg_controldata:

After I delete this pod it stuck too!

Processes inside of failed clusters:

root@develop-postgresql-0:/home/postgres# ps ax
    PID TTY      STAT   TIME COMMAND
      1 ?        Ss     0:00 /usr/bin/dumb-init -c --rewrite 1:0 -- /bin/sh /launch.sh
      7 ?        S      0:00 /bin/sh /launch.sh
     20 ?        S      0:00 /usr/bin/runsvdir -P /etc/service
     21 ?        Ss     0:00 runsv pgqd
     22 ?        S      0:00 /bin/bash /scripts/patroni_wait.sh --role primary -- /usr/bin/pgqd /home/postgres/pgq_ticker.ini
     83 ?        S      0:00 sleep 60
     84 pts/0    Ss     0:00 bash
     97 pts/0    R+     0:00 ps ax

After one more deletion it is managed to start.

I notice one thing in the logs - sometimes container starts with WAL-E variables, sometimes - not. Operator shows its status as OK, but it's not:

$ kubectl -n brandadmin-staging logs brandadmin-pg-0
Defaulted container "postgres" out of: postgres, exporter
2025-01-23 15:38:43,529 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
2025-01-23 15:38:43,587 - bootstrapping - INFO - No meta-data available for this provider
2025-01-23 15:38:43,588 - bootstrapping - INFO - Looks like you are running unsupported
2025-01-23 15:38:43,726 - bootstrapping - INFO - Configuring wal-e
2025-01-23 15:38:43,727 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_PREFIX
2025-01-23 15:38:43,728 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_S3_PREFIX
2025-01-23 15:38:43,728 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ACCESS_KEY_ID
2025-01-23 15:38:43,729 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_SECRET_ACCESS_KEY
2025-01-23 15:38:43,729 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_ENDPOINT
2025-01-23 15:38:43,730 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ENDPOINT
2025-01-23 15:38:43,730 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_DISABLE_S3_SSE
2025-01-23 15:38:43,730 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DISABLE_S3_SSE
2025-01-23 15:38:43,731 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_S3_FORCE_PATH_STYLE
2025-01-23 15:38:43,731 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DOWNLOAD_CONCURRENCY
2025-01-23 15:38:43,732 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_UPLOAD_CONCURRENCY
2025-01-23 15:38:43,732 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/USE_WALG_RESTORE
2025-01-23 15:38:43,732 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_LOG_DESTINATION
2025-01-23 15:38:43,733 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/PGPORT
2025-01-23 15:38:43,733 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/BACKUP_NUM_TO_RETAIN
2025-01-23 15:38:43,736 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/TMPDIR
2025-01-23 15:38:43,736 - bootstrapping - INFO - Configuring certificate
2025-01-23 15:38:43,736 - bootstrapping - INFO - Generating ssl self-signed certificate
2025-01-23 15:38:43,910 - bootstrapping - INFO - Configuring crontab
2025-01-23 15:38:43,910 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
2025-01-23 15:38:43,931 - bootstrapping - INFO - Configuring log
Traceback (most recent call last):
  File "/scripts/configure_spilo.py", line 1197, in <module>
    main()
  File "/scripts/configure_spilo.py", line 1159, in main
    write_log_environment(placeholders)
  File "/scripts/configure_spilo.py", line 794, in write_log_environment
    tags = json.loads(os.getenv('LOG_S3_TAGS'))
  File "/usr/lib/python3.10/json/__init__.py", line 339, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not NoneType

$ kubectl -n brandadmin-staging get postgresqls.acid.zalan.do -A
NAMESPACE            NAME                  TEAM               VERSION   PODS   VOLUME   CPU-REQUEST   MEMORY-REQUEST   AGE    STATUS
brandadmin-staging   brandadmin-pg         develop            16        1      100Gi    1             500Mi            429d   SyncFailed
ga                   games-aggregator-pg   games-aggregator   16        2      125Gi    1000m         512Mi            157d   SyncFailed
payments             payments-pg           develop            16        1      20Gi     1             500Mi            457d   Running
sprint-reports       asana-automate-db     sprint             16        1      25Gi     1             500Mi            358d   Running
staging              develop-postgresql    develop            17        2      250Gi    1             2Gi              435d   UpdateFailed

$ kubectl -n brandadmin-staging delete pod brandadmin-pg-0
pod "brandadmin-pg-0" deleted

$ kubectl -n brandadmin-staging get pod
NAME                                                         READY   STATUS             RESTARTS         AGE
brand-admin-backend-api-7b7856c75-d2ktr                      1/1     Running            0                22h
brand-admin-backend-api-7b7856c75-vczsg                      1/1     Running            0                22h
brand-admin-backend-async-tasks-69c5876799-nm4nh             1/1     Running            0                22h
brandadmin-pg-0                                              1/2     Running            0                82s

$ kubectl -n brandadmin-staging logs brandadmin-pg-0
Defaulted container "postgres" out of: postgres, exporter
2025-01-23 15:59:27,840 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
2025-01-23 15:59:27,896 - bootstrapping - INFO - No meta-data available for this provider
2025-01-23 15:59:27,897 - bootstrapping - INFO - Looks like you are running unsupported
2025-01-23 15:59:28,051 - bootstrapping - INFO - Configuring crontab
2025-01-23 15:59:28,053 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
2025-01-23 15:59:28,070 - bootstrapping - INFO - Configuring certificate
2025-01-23 15:59:28,070 - bootstrapping - INFO - Generating ssl self-signed certificate
2025-01-23 15:59:28,706 - bootstrapping - INFO - Configuring log
Traceback (most recent call last):
  File "/scripts/configure_spilo.py", line 1197, in <module>
    main()
  File "/scripts/configure_spilo.py", line 1159, in main
    write_log_environment(placeholders)
  File "/scripts/configure_spilo.py", line 794, in write_log_environment
    tags = json.loads(os.getenv('LOG_S3_TAGS'))
  File "/usr/lib/python3.10/json/__init__.py", line 339, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not NoneType

$ kubectl -n brandadmin-staging get pod brandadmin-pg-0
NAME              READY   STATUS    RESTARTS   AGE
brandadmin-pg-0   1/2     Running   0          81m

$ kubectl -n brandadmin-staging get postgresqls.acid.zalan.do brandadmin-pg
NAME            TEAM      VERSION   PODS   VOLUME   CPU-REQUEST   MEMORY-REQUEST   AGE    STATUS
brandadmin-pg   develop   16        1      100Gi    1             500Mi            429d   Running

While I wrote this issue passed like an hour or so, in despair I restarted this failed pod one more time and it STARTED (container postgres became Ready), but still not working:

kubectl -n brandadmin-staging delete pod brandadmin-pg-0
pod "brandadmin-pg-0" deleted

$ kubectl  -n brandadmin-staging describe pod brandadmin-pg-0
Name:             brandadmin-pg-0
Namespace:        brandadmin-staging
Priority:         0
Service Account:  postgres-pod
Node:             pri-staging-wx2ci/10.106.0.35
Start Time:       Thu, 23 Jan 2025 18:26:41 +0100
Labels:           application=spilo
                  apps.kubernetes.io/pod-index=0
                  cluster-name=brandadmin-pg
                  controller-revision-hash=brandadmin-pg-5f65fc8dbd
                  spilo-role=master
                  statefulset.kubernetes.io/pod-name=brandadmin-pg-0
                  team=develop
Annotations:      prometheus.io/path: /metrics
                  prometheus.io/port: 9187
                  prometheus.io/scrape: true
                  status:
                    {"conn_url":"postgres://10.244.2.104:5432/postgres","api_url":"http://10.244.2.104:8008/patroni","state":"running","role":"primary","versi...
Status:           Running
IP:               10.244.2.104
IPs:
  IP:           10.244.2.104
Controlled By:  StatefulSet/brandadmin-pg
Containers:
  postgres:
    Container ID:   containerd://d67d695d8bce177e07b0ec3c23efbe59cc5349cb81e95abea6ba6e913fe7d836
    Image:          ghcr.io/zalando/spilo-17:4.0-p2
    Image ID:       ghcr.io/zalando/spilo-17@sha256:23861da069941ff5345e6a97455e60a63fc2f16c97857da8f85560370726cbe7
    Ports:          8008/TCP, 5432/TCP, 8080/TCP
    Host Ports:     0/TCP, 0/TCP, 0/TCP
    State:          Running
      Started:      Thu, 23 Jan 2025 18:26:46 +0100
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     10
      memory:  6Gi
    Requests:
      cpu:      1
      memory:   500Mi
    Readiness:  http-get http://:8008/readiness delay=6s timeout=5s period=10s #success=1 #failure=3
    Environment:
      SCOPE:                        brandadmin-pg
      PGROOT:                       /home/postgres/pgdata/pgroot
      POD_IP:                        (v1:status.podIP)
      POD_NAMESPACE:                brandadmin-staging (v1:metadata.namespace)
      PGUSER_SUPERUSER:             postgres
      KUBERNETES_SCOPE_LABEL:       cluster-name
      KUBERNETES_ROLE_LABEL:        spilo-role
      PGPASSWORD_SUPERUSER:         <set to the key 'password' in secret 'postgres.brandadmin-pg.credentials.postgresql.acid.zalan.do'>  Optional: false
      PGUSER_STANDBY:               standby
      PGPASSWORD_STANDBY:           <set to the key 'password' in secret 'standby.brandadmin-pg.credentials.postgresql.acid.zalan.do'>  Optional: false
      PAM_OAUTH2:                   https://info.example.com/oauth2/tokeninfo?access_token= uid realm=/employees
      HUMAN_ROLE:                   zalandos
      PGVERSION:                    16
      KUBERNETES_LABELS:            {"application":"spilo"}
      SPILO_CONFIGURATION:          {"postgresql":{"parameters":{"shared_buffers":"1536MB"}},"bootstrap":{"initdb":[{"auth-host":"md5"},{"auth-local":"trust"}],"dcs":{"postgresql":{"parameters":{"checkpoint_completion_target":"0.9","default_statistics_target":"100","effective_cache_size":"4608MB","effective_io_concurrency":"200","hot_standby_feedback":"on","huge_pages":"off","jit":"false","maintenance_work_mem":"384MB","max_connections":"100","max_standby_archive_delay":"900s","max_standby_streaming_delay":"900s","max_wal_size":"4GB","min_wal_size":"1GB","random_page_cost":"1.1","wal_buffers":"16MB","work_mem":"7864kB"}},"failsafe_mode":true}}}
      DCS_ENABLE_KUBERNETES_API:    true
      ALLOW_NOSSL:                  true
      AWS_ACCESS_KEY_ID:            xxxx
      AWS_ENDPOINT:                 https://fra1.digitaloceanspaces.com
      AWS_SECRET_ACCESS_KEY:        xxxx
      CLONE_AWS_ACCESS_KEY_ID:      xxx
      CLONE_AWS_ENDPOINT:           https://fra1.digitaloceanspaces.com
      CLONE_AWS_SECRET_ACCESS_KEY:  xxxx
      LOG_S3_ENDPOINT:              https://fra1.digitaloceanspaces.com
      WAL_S3_BUCKET:                xxx-staging-db-wal
      WAL_BUCKET_SCOPE_SUFFIX:      /79c4fff8-6efb-477a-83bc-a43d34e8160a
      WAL_BUCKET_SCOPE_PREFIX:      
      LOG_S3_BUCKET:                xxx-staging-db-backups-all
      LOG_BUCKET_SCOPE_SUFFIX:      /79c4fff8-6efb-477a-83bc-a43d34e8160a
      LOG_BUCKET_SCOPE_PREFIX:      
    Mounts:
      /dev/shm from dshm (rw)
      /home/postgres/pgdata from pgdata (rw)
      /var/run/postgresql from postgresql-run (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9mghg (ro)
  exporter:
    Container ID:   containerd://48c54ad6591eaf9e60aa92b3235cb4878900fb46e94aacfeedcb70465d005619
    Image:          quay.io/prometheuscommunity/postgres-exporter:latest
    Image ID:       quay.io/prometheuscommunity/postgres-exporter@sha256:6999a7657e2f2fb0ca6ebf417213eebf6dc7d21b30708c622f6fcb11183a2bb0
    Port:           9187/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Thu, 23 Jan 2025 18:26:47 +0100
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     500m
      memory:  256Mi
    Requests:
      cpu:     100m
      memory:  200Mi
    Environment:
      POD_NAME:                             brandadmin-pg-0 (v1:metadata.name)
      POD_NAMESPACE:                        brandadmin-staging (v1:metadata.namespace)
      POSTGRES_USER:                        postgres
      POSTGRES_PASSWORD:                    <set to the key 'password' in secret 'postgres.brandadmin-pg.credentials.postgresql.acid.zalan.do'>  Optional: false
      DATA_SOURCE_URI:                      127.0.0.1:5432
      DATA_SOURCE_USER:                     $(POSTGRES_USER)
      DATA_SOURCE_PASS:                     $(POSTGRES_PASSWORD)
      PG_EXPORTER_AUTO_DISCOVER_DATABASES:  true
    Mounts:
      /home/postgres/pgdata from pgdata (rw)
      /var/run/postgresql from postgresql-run (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9mghg (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       True 
  ContainersReady             True 
  PodScheduled                True 
Volumes:
  pgdata:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  pgdata-brandadmin-pg-0
    ReadOnly:   false
  dshm:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  postgresql-run:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  kube-api-access-9mghg:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                             workloadKind=postgres:NoSchedule
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  22s   default-scheduler  Successfully assigned brandadmin-staging/brandadmin-pg-0 to pri-staging-wx2ci
  Normal  Pulled     18s   kubelet            Container image "ghcr.io/zalando/spilo-17:4.0-p2" already present on machine
  Normal  Created    18s   kubelet            Created container postgres
  Normal  Started    18s   kubelet            Started container postgres
  Normal  Pulling    18s   kubelet            Pulling image "quay.io/prometheuscommunity/postgres-exporter:latest"
  Normal  Pulled     17s   kubelet            Successfully pulled image "quay.io/prometheuscommunity/postgres-exporter:latest" in 455ms (455ms including waiting). Image size: 11070758 bytes.
  Normal  Created    17s   kubelet            Created container exporter
  Normal  Started    17s   kubelet            Started container exporter

$ kubectl -n brandadmin-staging logs brandadmin-pg-0
Defaulted container "postgres" out of: postgres, exporter
2025-01-23 17:26:47,349 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
2025-01-23 17:26:47,407 - bootstrapping - INFO - No meta-data available for this provider
2025-01-23 17:26:47,408 - bootstrapping - INFO - Looks like you are running unsupported
2025-01-23 17:26:47,460 - bootstrapping - INFO - Configuring bootstrap
2025-01-23 17:26:47,462 - bootstrapping - INFO - Configuring standby-cluster
2025-01-23 17:26:47,462 - bootstrapping - INFO - Configuring certificate
2025-01-23 17:26:47,463 - bootstrapping - INFO - Generating ssl self-signed certificate
2025-01-23 17:26:47,768 - bootstrapping - INFO - Configuring patroni
2025-01-23 17:26:47,792 - bootstrapping - INFO - Writing to file /run/postgres.yml
2025-01-23 17:26:47,793 - bootstrapping - INFO - Configuring wal-e
2025-01-23 17:26:47,794 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_PREFIX
2025-01-23 17:26:47,794 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_S3_PREFIX
2025-01-23 17:26:47,795 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ACCESS_KEY_ID
2025-01-23 17:26:47,795 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_SECRET_ACCESS_KEY
2025-01-23 17:26:47,795 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_ENDPOINT
2025-01-23 17:26:47,796 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ENDPOINT
2025-01-23 17:26:47,796 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_DISABLE_S3_SSE
2025-01-23 17:26:47,796 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DISABLE_S3_SSE
2025-01-23 17:26:47,796 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_S3_FORCE_PATH_STYLE
2025-01-23 17:26:47,797 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DOWNLOAD_CONCURRENCY
2025-01-23 17:26:47,797 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_UPLOAD_CONCURRENCY
2025-01-23 17:26:47,797 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/USE_WALG_RESTORE
2025-01-23 17:26:47,797 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_LOG_DESTINATION
2025-01-23 17:26:47,798 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/PGPORT
2025-01-23 17:26:47,798 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/BACKUP_NUM_TO_RETAIN
2025-01-23 17:26:47,801 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/TMPDIR
2025-01-23 17:26:47,802 - bootstrapping - INFO - Configuring crontab
2025-01-23 17:26:47,803 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
2025-01-23 17:26:47,816 - bootstrapping - INFO - Configuring pam-oauth2
2025-01-23 17:26:47,817 - bootstrapping - INFO - Writing to file /etc/pam.d/postgresql
2025-01-23 17:26:47,817 - bootstrapping - INFO - Configuring pgbouncer
2025-01-23 17:26:47,817 - bootstrapping - INFO - No PGBOUNCER_CONFIGURATION was specified, skipping
2025-01-23 17:26:47,818 - bootstrapping - INFO - Configuring log
Traceback (most recent call last):
  File "/scripts/configure_spilo.py", line 1197, in <module>
    main()
  File "/scripts/configure_spilo.py", line 1159, in main
    write_log_environment(placeholders)
  File "/scripts/configure_spilo.py", line 794, in write_log_environment
    tags = json.loads(os.getenv('LOG_S3_TAGS'))
  File "/usr/lib/python3.10/json/__init__.py", line 339, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not NoneType
2025-01-23 17:26:49,683 WARNING: Kubernetes RBAC doesn't allow GET access to the 'kubernetes' endpoint in the 'default' namespace. Disabling 'bypass_api_service'.
2025-01-23 17:26:49,754 INFO: No PostgreSQL configuration items changed, nothing to reload.
2025-01-23 17:26:49,774 WARNING: Postgresql is not running.
2025-01-23 17:26:49,775 INFO: Lock owner: ; I am brandadmin-pg-0
2025-01-23 17:26:49,781 INFO: pg_controldata:
  pg_control version number: 1300
  Catalog version number: 202307071
  Database system identifier: 7369539194529993100
  Database cluster state: shut down
  pg_control last modified: Thu Jan 23 17:32:16 2025
  Latest checkpoint location: 5A/82000028
  Latest checkpoint's REDO location: 5A/82000028
  Latest checkpoint's REDO WAL file: 0000001B0000005A00000082
  Latest checkpoint's TimeLineID: 27
  Latest checkpoint's PrevTimeLineID: 27
  Latest checkpoint's full_page_writes: on
  Latest checkpoint's NextXID: 0:929334
  Latest checkpoint's NextOID: 873526
  Latest checkpoint's NextMultiXactId: 19
  Latest checkpoint's NextMultiOffset: 37
  Latest checkpoint's oldestXID: 717
  Latest checkpoint's oldestXID's DB: 5
  Latest checkpoint's oldestActiveXID: 0
  Latest checkpoint's oldestMultiXid: 1
  Latest checkpoint's oldestMulti's DB: 5
  Latest checkpoint's oldestCommitTsXid: 0
  Latest checkpoint's newestCommitTsXid: 0
  Time of latest checkpoint: Thu Jan 23 17:32:16 2025
  Fake LSN counter for unlogged rels: 0/3E8
  Minimum recovery ending location: 0/0
  Min recovery ending loc's timeline: 0
  Backup start location: 0/0
  Backup end location: 0/0
  End-of-backup record required: no
  wal_level setting: replica
  wal_log_hints setting: on
  max_connections setting: 100
  max_worker_processes setting: 8
  max_wal_senders setting: 10
  max_prepared_xacts setting: 0
  max_locks_per_xact setting: 64
  track_commit_timestamp setting: off
  Maximum data alignment: 8
  Database block size: 8192
  Blocks per segment of large relation: 131072
  WAL block size: 8192
  Bytes per WAL segment: 16777216
  Maximum length of identifiers: 64
  Maximum columns in an index: 32
  Maximum size of a TOAST chunk: 1996
  Size of a large-object chunk: 2048
  Date/time type storage: 64-bit integers
  Float8 argument passing: by value
  Data page checksum version: 0
  Mock authentication nonce: 389f9007f77b578836bfcee51eabb488b11042d00c48d0d84e718a826ce23d29

2025-01-23 17:32:36,148 INFO: Lock owner: ; I am brandadmin-pg-0
2025-01-23 17:32:36,326 INFO: starting as a secondary
2025-01-23 17:32:36 UTC [51]: [1-1] 67927d34.33 0     LOG:  Auto detecting pg_stat_kcache.linux_hz parameter...
2025-01-23 17:32:36 UTC [51]: [2-1] 67927d34.33 0     LOG:  pg_stat_kcache.linux_hz is set to 125000
2025-01-23 17:32:36 UTC [51]: [3-1] 67927d34.33 0     FATAL:  could not load server certificate file "/run/certs/server.crt": No such file or directory
2025-01-23 17:32:36 UTC [51]: [4-1] 67927d34.33 0     LOG:  database system is shut down
2025-01-23 17:32:36,971 INFO: postmaster pid=51
/var/run/postgresql:5432 - no response
2025-01-23 17:32:46,146 WARNING: Postgresql is not running.
2025-01-23 17:32:46,146 INFO: Lock owner: ; I am brandadmin-pg-0
2025-01-23 17:32:46,149 INFO: pg_controldata:
  pg_control version number: 1300
  Catalog version number: 202307071
  Database system identifier: 7369539194529993100
  Database cluster state: shut down
  pg_control last modified: Thu Jan 23 17:32:16 2025
  Latest checkpoint location: 5A/82000028
  Latest checkpoint's REDO location: 5A/82000028
  Latest checkpoint's REDO WAL file: 0000001B0000005A00000082
  Latest checkpoint's TimeLineID: 27
  Latest checkpoint's PrevTimeLineID: 27
  Latest checkpoint's full_page_writes: on
  Latest checkpoint's NextXID: 0:929334
  Latest checkpoint's NextOID: 873526
  Latest checkpoint's NextMultiXactId: 19
  Latest checkpoint's NextMultiOffset: 37
  Latest checkpoint's oldestXID: 717
  Latest checkpoint's oldestXID's DB: 5
  Latest checkpoint's oldestActiveXID: 0
  Latest checkpoint's oldestMultiXid: 1
  Latest checkpoint's oldestMulti's DB: 5
  Latest checkpoint's oldestCommitTsXid: 0
  Latest checkpoint's newestCommitTsXid: 0
  Time of latest checkpoint: Thu Jan 23 17:32:16 2025
  Fake LSN counter for unlogged rels: 0/3E8
  Minimum recovery ending location: 0/0
  Min recovery ending loc's timeline: 0
  Backup start location: 0/0
  Backup end location: 0/0
  End-of-backup record required: no
  wal_level setting: replica
  wal_log_hints setting: on
  max_connections setting: 100
  max_worker_processes setting: 8
  max_wal_senders setting: 10
  max_prepared_xacts setting: 0
  max_locks_per_xact setting: 64
  track_commit_timestamp setting: off
  Maximum data alignment: 8
  Database block size: 8192
  Blocks per segment of large relation: 131072
  WAL block size: 8192
  Bytes per WAL segment: 16777216
  Maximum length of identifiers: 64
  Maximum columns in an index: 32
  Maximum size of a TOAST chunk: 1996
  Size of a large-object chunk: 2048
  Date/time type storage: 64-bit integers
  Float8 argument passing: by value
  Data page checksum version: 0
  Mock authentication nonce: 389f9007f77b578836bfcee51eabb488b11042d00c48d0d84e718a826ce23d29

2025-01-23 17:32:46,162 INFO: Lock owner: ; I am brandadmin-pg-0
2025-01-23 17:32:46,190 INFO: starting as a secondary
2025-01-23 17:32:46 UTC [62]: [1-1] 67927d3e.3e 0     LOG:  Auto detecting pg_stat_kcache.linux_hz parameter...
2025-01-23 17:32:46 UTC [62]: [2-1] 67927d3e.3e 0     LOG:  pg_stat_kcache.linux_hz is set to 1000000
2025-01-23 17:32:46 UTC [62]: [3-1] 67927d3e.3e 0     FATAL:  could not load server certificate file "/run/certs/server.crt": No such file or directory
2025-01-23 17:32:46 UTC [62]: [4-1] 67927d3e.3e 0     LOG:  database system is shut down
2025-01-23 17:32:46,821 INFO: postmaster pid=62
/var/run/postgresql:5432 - no response
2025-01-23 17:32:56,143 WARNING: Postgresql is not running.
2025-01-23 17:32:56,144 INFO: Lock owner: ; I am brandadmin-pg-0
2025-01-23 17:32:56,146 INFO: pg_controldata:
  pg_control version number: 1300
  Catalog version number: 202307071

All my clusters consisting of two nodes can't start replica node: Probably problem is with WAL variables...

$ kubectl -n staging exec -it develop-postgresql-0 -- patronictl topology
Defaulted container "postgres" out of: postgres, exporter
+ Cluster: develop-postgresql (7369262358642845868) --------+----+-----------+
| Member                 | Host         | Role    | State   | TL | Lag in MB |
+------------------------+--------------+---------+---------+----+-----------+
| develop-postgresql-0   | 10.244.0.253 | Leader  | running | 39 |           |
| + develop-postgresql-1 |              | Replica |         |    |   unknown |
+------------------------+--------------+---------+---------+----+-----------+
$ kubectl -n staging logs develop-postgresql-0 | head -20
Defaulted container "postgres" out of: postgres, exporter
2025-01-23 16:20:51,723 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
2025-01-23 16:20:51,766 - bootstrapping - INFO - No meta-data available for this provider
2025-01-23 16:20:51,767 - bootstrapping - INFO - Looks like you are running unsupported
2025-01-23 16:20:51,823 - bootstrapping - INFO - Configuring patroni
2025-01-23 16:20:51,846 - bootstrapping - INFO - Writing to file /run/postgres.yml
2025-01-23 16:20:51,847 - bootstrapping - INFO - Configuring pgbouncer
2025-01-23 16:20:51,847 - bootstrapping - INFO - No PGBOUNCER_CONFIGURATION was specified, skipping
2025-01-23 16:20:51,848 - bootstrapping - INFO - Configuring standby-cluster
2025-01-23 16:20:51,848 - bootstrapping - INFO - Configuring pam-oauth2
2025-01-23 16:20:51,848 - bootstrapping - INFO - Writing to file /etc/pam.d/postgresql
2025-01-23 16:20:51,848 - bootstrapping - INFO - Configuring crontab
2025-01-23 16:20:51,848 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
2025-01-23 16:20:51,868 - bootstrapping - INFO - Configuring certificate
2025-01-23 16:20:51,868 - bootstrapping - INFO - Generating ssl self-signed certificate
2025-01-23 16:20:53,422 - bootstrapping - INFO - Configuring bootstrap
2025-01-23 16:20:53,423 - bootstrapping - INFO - Configuring log
Traceback (most recent call last):
  File "/scripts/configure_spilo.py", line 1197, in <module>
    main()
  File "/scripts/configure_spilo.py", line 1159, in main

$ kubectl -n staging exec -it develop-postgresql-0 -- tail /home/postgres/pgdata/pgroot/pg_log/postgresql-4.log
Defaulted container "postgres" out of: postgres, exporter
chpst: fatal: unable to switch to directory: /run/etc/wal-e.d/env: file does not exist
chpst: fatal: unable to switch to directory: /run/etc/wal-e.d/env: file does not exist
chpst: fatal: unable to switch to directory: /run/etc/wal-e.d/env: file does not exist
chpst: fatal: unable to switch to directory: /run/etc/wal-e.d/env: file does not exist
chpst: fatal: unable to switch to directory: /run/etc/wal-e.d/env: file does not exist
chpst: fatal: unable to switch to directory: /run/etc/wal-e.d/env: file does not exist
chpst: fatal: unable to switch to directory: /run/etc/wal-e.d/env: file does not exist
chpst: fatal: unable to switch to directory: /run/etc/wal-e.d/env: file does not exist
chpst: fatal: unable to switch to directory: /run/etc/wal-e.d/env: file does not exist
chpst: fatal: unable to switch to directory: /run/etc/wal-e.d/env: file does not exist

$ kubectl -n staging logs develop-postgresql-1 | head -20
Defaulted container "postgres" out of: postgres, exporter
2025-01-23 16:38:15,383 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
2025-01-23 16:38:15,424 - bootstrapping - INFO - No meta-data available for this provider
2025-01-23 16:38:15,424 - bootstrapping - INFO - Looks like you are running unsupported
2025-01-23 16:38:15,473 - bootstrapping - INFO - Configuring bootstrap
2025-01-23 16:38:15,473 - bootstrapping - INFO - Configuring crontab
2025-01-23 16:38:15,473 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
2025-01-23 16:38:15,482 - bootstrapping - INFO - Configuring standby-cluster
2025-01-23 16:38:15,482 - bootstrapping - INFO - Configuring log
Traceback (most recent call last):
  File "/scripts/configure_spilo.py", line 1197, in <module>
    main()
  File "/scripts/configure_spilo.py", line 1159, in main
    write_log_environment(placeholders)
  File "/scripts/configure_spilo.py", line 794, in write_log_environment
    tags = json.loads(os.getenv('LOG_S3_TAGS'))
  File "/usr/lib/python3.10/json/__init__.py", line 339, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not NoneType

$ kubectl -n staging exec -it develop-postgresql-1 -- tail /home/postgres/pgdata/pgroot/pg_log/postgresql-4.log
Defaulted container "postgres" out of: postgres, exporter
        STRUCTURED: time=2025-01-23T16:30:08.235670-00 pid=8215 action=push-wal key=s3://xxx-staging-db-wal/spilo/develop-postgresql/939ea78b-0caf-458f-a088-989352a97300/wal/16/wal_005/00000026000006700000009C.lzo prefix=spilo/develop-postgresql/939ea78b-0caf-458f-a088-989352a97300/wal/16/ rate=18353.3 seg=00000026000006700000009C state=complete
2025-01-23 16:30:12 UTC [8234]: [5-1] 67926e94.202a 0     LOG:  ending log output to stderr
2025-01-23 16:30:12 UTC [8234]: [6-1] 67926e94.202a 0     HINT:  Future log output will go to log destination "csvlog".
ERROR: 2025/01/23 16:30:12.698764 Archive '00000027.history' does not exist.
ERROR: 2025/01/23 16:30:13.204088 Archive '00000026000006700000009D' does not exist.
ERROR: 2025/01/23 16:30:13.573033 Archive '00000027.history' does not exist.
ERROR: 2025/01/23 16:30:13.845528 Archive '00000028.history' does not exist.
ERROR: 2025/01/23 16:30:14.117082 Archive '00000027.history' does not exist.
ERROR: 2025/01/23 16:30:14.478060 Archive '00000027000006700000009D' does not exist.
ERROR: 2025/01/23 16:30:14.807988 Archive '00000026000006700000009D' does not exist.

$ kubectl -n staging describe pod develop-postgresql-0
Name:             develop-postgresql-0
Namespace:        staging
Priority:         0
Service Account:  postgres-pod
Node:             pri-staging-wx2cv/10.106.0.46
Start Time:       Thu, 23 Jan 2025 17:20:44 +0100
Labels:           application=spilo
                  apps.kubernetes.io/pod-index=0
                  cluster-name=develop-postgresql
                  controller-revision-hash=develop-postgresql-5f869975bf
                  spilo-role=master
                  statefulset.kubernetes.io/pod-name=develop-postgresql-0
                  team=develop
Annotations:      prometheus.io/path: /metrics
                  prometheus.io/port: 9187
                  prometheus.io/scrape: true
                  status:
                    {"conn_url":"postgres://10.244.0.253:5432/postgres","api_url":"http://10.244.0.253:8008/patroni","state":"running","role":"primary","versi...
Status:           Running
IP:               10.244.0.253
IPs:
  IP:           10.244.0.253
Controlled By:  StatefulSet/develop-postgresql
Containers:
  postgres:
    Container ID:   containerd://5004728ea5d71484a313b6124f2534a839da5ef0527427cec1942f135aa33e93
    Image:          ghcr.io/zalando/spilo-17:4.0-p2
    Image ID:       ghcr.io/zalando/spilo-17@sha256:23861da069941ff5345e6a97455e60a63fc2f16c97857da8f85560370726cbe7
    Ports:          8008/TCP, 5432/TCP, 8080/TCP
    Host Ports:     0/TCP, 0/TCP, 0/TCP
    State:          Running
      Started:      Thu, 23 Jan 2025 17:20:50 +0100
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     10
      memory:  13500Mi
    Requests:
      cpu:      1
      memory:   2Gi
    Readiness:  http-get http://:8008/readiness delay=6s timeout=5s period=10s #success=1 #failure=3
    Environment:
      SCOPE:                        develop-postgresql
      PGROOT:                       /home/postgres/pgdata/pgroot
      POD_IP:                        (v1:status.podIP)
      POD_NAMESPACE:                staging (v1:metadata.namespace)
      PGUSER_SUPERUSER:             postgres
      KUBERNETES_SCOPE_LABEL:       cluster-name
      KUBERNETES_ROLE_LABEL:        spilo-role
      PGPASSWORD_SUPERUSER:         <set to the key 'password' in secret 'postgres.develop-postgresql.credentials.postgresql.acid.zalan.do'>  Optional: false
      PGUSER_STANDBY:               standby
      PGPASSWORD_STANDBY:           <set to the key 'password' in secret 'standby.develop-postgresql.credentials.postgresql.acid.zalan.do'>  Optional: false
      PAM_OAUTH2:                   https://info.example.com/oauth2/tokeninfo?access_token= uid realm=/employees
      HUMAN_ROLE:                   zalandos
      PGVERSION:                    17
      KUBERNETES_LABELS:            {"application":"spilo"}
      SPILO_CONFIGURATION:          {"postgresql":{"parameters":{"shared_buffers":"3GB","shared_preload_libraries":"bg_mon,pg_stat_statements,pgextwlist,pg_auth_mon,set_user,pg_cron,pg_stat_kcache,decoderbufs"}},"bootstrap":{"initdb":[{"auth-host":"md5"},{"auth-local":"trust"}],"dcs":{"postgresql":{"parameters":{"checkpoint_completion_target":"0.9","default_statistics_target":"100","effective_cache_size":"9GB","effective_io_concurrency":"200","hot_standby_feedback":"on","huge_pages":"off","jit":"false","maintenance_work_mem":"768MB","max_connections":"200","max_parallel_maintenance_workers":"4","max_parallel_workers":"8","max_parallel_workers_per_gather":"4","max_standby_archive_delay":"900s","max_standby_streaming_delay":"900s","max_wal_size":"4GB","max_worker_processes":"8","min_wal_size":"1GB","random_page_cost":"1.1","wal_buffers":"16MB","wal_level":"logical","work_mem":"4MB"}},"failsafe_mode":true}}}
      DCS_ENABLE_KUBERNETES_API:    true
      ALLOW_NOSSL:                  true
      AWS_ACCESS_KEY_ID:            xxx
      AWS_ENDPOINT:                 https://fra1.digitaloceanspaces.com
      AWS_SECRET_ACCESS_KEY:        xxx
      CLONE_AWS_ACCESS_KEY_ID:      xxx
      CLONE_AWS_ENDPOINT:           https://fra1.digitaloceanspaces.com
      CLONE_AWS_SECRET_ACCESS_KEY:  xxx
      LOG_S3_ENDPOINT:              https://fra1.digitaloceanspaces.com
      WAL_S3_BUCKET:                xxx-staging-db-wal
      WAL_BUCKET_SCOPE_SUFFIX:      /939ea78b-0caf-458f-a088-989352a97300
      WAL_BUCKET_SCOPE_PREFIX:      
      LOG_S3_BUCKET:                xxx-staging-db-backups-all
      LOG_BUCKET_SCOPE_SUFFIX:      /939ea78b-0caf-458f-a088-989352a97300
      LOG_BUCKET_SCOPE_PREFIX:      
    Mounts:

It's complete mess!

Operator installed with Helm and terraform. Configured with ConfigMap:

resource "kubectl_manifest" "postgres-pod-config" {
  yaml_body = <<-EOF
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: postgres-pod-config
      namespace: ${var.namespace}
    data:
      ALLOW_NOSSL: "true"
      # WAL archiving and physical basebackups for PITR
      AWS_ENDPOINT: ${local.s3_endpoint}
      AWS_SECRET_ACCESS_KEY: ${local.s3_secret_key}
      AWS_ACCESS_KEY_ID: ${local.s3_access_id}
      # default values for cloning a cluster (same as above)
      CLONE_AWS_ENDPOINT: ${local.clone_s3_endpoint}
      CLONE_AWS_SECRET_ACCESS_KEY: ${local.clone_s3_secret_key}
      CLONE_AWS_ACCESS_KEY_ID: ${local.clone_s3_access_id}
      # send pg_logs to s3 (work in progress)
      LOG_S3_ENDPOINT: ${local.s3_endpoint}
    EOF
}

resource "helm_release" "postgres-operator" {
  name       = "postgres-operator"
  namespace  = var.namespace
  chart      = "postgres-operator"
  repository = "https://opensource.zalando.com/postgres-operator/charts/postgres-operator"
  version    = "1.14.0"

  depends_on = [kubectl_manifest.postgres-pod-config]

  dynamic "set" {
    for_each = var.wal_backup ? ["yes"] : []
    content {
      name  = "configAwsOrGcp.wal_s3_bucket"
      value = local.bucket_name_wal
    }
  }

  dynamic "set" {
    for_each = var.log_backup ? ["yes"] : []
    content {
      name  = "configAwsOrGcp.log_s3_bucket"
      value = "${var.name}-db-backups-all" # bucket with logical backups; 15 days ttl
    }
  }

  set {
    name  = "configLogicalBackup.logical_backup_s3_access_key_id"
    value = local.s3_access_id
  }

  set {
    name  = "configLogicalBackup.logical_backup_s3_bucket"
    value = local.bucket_name_backups
  }

  set {
    name  = "configLogicalBackup.logical_backup_s3_region"
    value = var.bucket_region
  }

  set {
    name  = "configLogicalBackup.logical_backup_s3_endpoint"
    value = local.s3_endpoint
  }

  set {
    name  = "configKubernetes.pod_environment_configmap"
    value = "${var.namespace}/postgres-pod-config"
  }
  set {
    name  = "configLogicalBackup.logical_backup_s3_secret_access_key"
    value = local.s3_secret_key
  }

  values = [<<-YAML
    configConnectionPooler:
      connection_pooler_image: "registry.xxx.com/devops/postgres-zalando-pgbouncer:master-32"

    configLogicalBackup:
      logical_backup_docker_image: "registry.xxx.com/devops/postgres-logical-backup:0.6"
      logical_backup_schedule: "32 8 * * *"
      logical_backup_s3_retention_time: "2 week"

    configKubernetes:
      enable_pod_antiaffinity: true
      # it doesn't influence pulling of images from public repos (like operator image) if there is no such secret
      # but will help to fetch postgres-logical-backup image
      pod_service_account_definition: |
        apiVersion: v1
        kind: ServiceAccount
        metadata:
          name: postgres-pod
        imagePullSecrets:
          - name: gitlab-registry-token
      # became disabled by default since 1.9.0 https://github.com/zalando/postgres-operator/releases/tag/v1.9.0
      # Quote: "We recommend enable_readiness_probe: true with pod_management_policy: parallel"
      enable_readiness_probe: true
      # Quote: "We recommend enable_readiness_probe: true with pod_management_policy: parallel"
      pod_management_policy: "parallel"
      enable_sidecars: true
      share_pgsocket_with_sidecars: true
      custom_pod_annotations:
        prometheus.io/scrape: "true"
        prometheus.io/path: "/metrics"
        prometheus.io/port: "9187"

    configPatroni:
      # https://patroni.readthedocs.io/en/master/dcs_failsafe_mode.html
      enable_patroni_failsafe_mode: true

    configGeneral:
      sidecars:
        - name: exporter
          image: quay.io/prometheuscommunity/postgres-exporter:latest
          ports:
            - name: exporter
              containerPort: 9187
              protocol: TCP
          resources:
            limits:
              cpu: 500m
              memory: 256Mi
            requests:
              cpu: 100m
              memory: 200Mi
          env:
            - name: DATA_SOURCE_URI
              value: "127.0.0.1:5432"
            - name: DATA_SOURCE_USER
              value: "$(POSTGRES_USER)"
            - name: DATA_SOURCE_PASS
              value: "$(POSTGRES_PASSWORD)"
            - name: PG_EXPORTER_AUTO_DISCOVER_DATABASES
              value: "true"
  YAML
  ]
}

The text was updated successfully, but these errors were encountered:

baznikin · 2025-01-23T18:09:11Z

Downgrading to 1.11.0 resolve my issues

baznikin · 2025-01-24T17:38:36Z

Update.
I tried sequential upgrade 1.11.0 -> 1.12.2 -> 1.13.0 -> 1.14.0 on same k8s cluster. Up to 1.13.0 - no issues.
With 1.14.1:

first master replicas started successfully:
a. clusters, consisting of 1 replica rebooted with ghcr.io/zalando/spilo-17:4.0-p2 started up successfully
b. clusters, consisting of 2 replicas, 1st replica booted up OK, second stuck with error

2025-01-24 16:59:12,840 INFO: Lock owner: develop-postgresql-1; I am develop-postgresql-0
2025-01-24 16:59:12,853 INFO: Local timeline=43 lsn=679/8E0616D0
2025-01-24 16:59:12,888 INFO: primary_timeline=43
2025-01-24 16:59:12,888 INFO: Lock owner: develop-postgresql-1; I am develop-postgresql-0
2025-01-24 16:59:12,889 INFO: starting as a secondary
2025-01-24 16:59:13 UTC [166]: [1-1] 6793c6e1.a6 0     LOG:  Auto detecting pg_stat_kcache.linux_hz parameter...
2025-01-24 16:59:13 UTC [166]: [2-1] 6793c6e1.a6 0     LOG:  pg_stat_kcache.linux_hz is set to 500000
2025-01-24 16:59:13 UTC [166]: [3-1] 6793c6e1.a6 0     FATAL:  could not load server certificate file "/run/certs/server.crt": No such file or directory
2025-01-24 16:59:13 UTC [166]: [4-1] 6793c6e1.a6 0     LOG:  database system is shut down
2025-01-24 16:59:13,616 INFO: postmaster pid=166

+ Cluster: develop-postgresql (7369262358642845868) -------------+----+-----------+
| Member                 | Host         | Role    | State        | TL | Lag in MB |
+------------------------+--------------+---------+--------------+----+-----------+
| develop-postgresql-1   | 10.244.0.119 | Leader  | running      | 43 |           |
| + develop-postgresql-0 | 10.244.0.207 | Replica | start failed |    |   unknown |
+------------------------+--------------+---------+--------------+----+-----------+

After killing failed pod it booted up OK:

$ kubectl -n staging delete pod develop-postgresql-0
pod "develop-postgresql-0" deleted
$ kubectl --context fxg-betsofa-staging -n staging exec -it develop-postgresql-1 -- patronictl topology
Defaulted container "postgres" out of: postgres, exporter
+ Cluster: develop-postgresql (7369262358642845868) ----------+----+-----------+
| Member                 | Host         | Role    | State     | TL | Lag in MB |
+------------------------+--------------+---------+-----------+----+-----------+
| develop-postgresql-1   | 10.244.0.119 | Leader  | running   | 43 |           |
| + develop-postgresql-0 | 10.244.0.199 | Replica | streaming | 43 |         0 |
+------------------------+--------------+---------+-----------+----+-----------+

logs:

  Date/time type storage: 64-bit integers
  Float8 argument passing: by value
  Data page checksum version: 0
  Mock authentication nonce: 51d6c13aa188b64e24e9e0f8a7bd6c8ce18ec69719a92728f12fe51d80b6b22b

2025-01-24 17:00:21,562 INFO: Lock owner: develop-postgresql-1; I am develop-postgresql-0
2025-01-24 17:00:21,575 INFO: Local timeline=43 lsn=679/8E0616D0
2025-01-24 17:00:21,613 INFO: primary_timeline=43
2025-01-24 17:00:21,614 INFO: Lock owner: develop-postgresql-1; I am develop-postgresql-0
2025-01-24 17:00:21,791 INFO: starting as a secondary
2025-01-24 17:00:22 UTC [58]: [1-1] 6793c726.3a 0     LOG:  Auto detecting pg_stat_kcache.linux_hz parameter...
2025-01-24 17:00:22 UTC [58]: [2-1] 6793c726.3a 0     LOG:  pg_stat_kcache.linux_hz is set to 500000
2025-01-24 17:00:22,598 INFO: postmaster pid=58
/var/run/postgresql:5432 - no response
2025-01-24 17:00:22 UTC [58]: [3-1] 6793c726.3a 0     LOG:  redirecting log output to logging collector process
2025-01-24 17:00:22 UTC [58]: [4-1] 6793c726.3a 0     HINT:  Future log output will appear in directory "../pg_log".
2025-01-24 17:00:22,826 INFO: Lock owner: develop-postgresql-1; I am develop-postgresql-0
2025-01-24 17:00:22,872 INFO: restarting after failure in progress
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - accepting connections
2025-01-24 17:00:25,747 INFO: Lock owner: develop-postgresql-1; I am develop-postgresql-0
2025-01-24 17:00:25,748 INFO: establishing a new patroni heartbeat connection to postgres
2025-01-24 17:00:25,918 INFO: Dropped unknown replication slot 'debezium'
2025-01-24 17:00:25,918 WARNING: Dropping physical replication slot develop_postgresql_1 because of its xmin value 83467746
2025-01-24 17:00:26,020 INFO: no action. I am (develop-postgresql-0), a secondary, and following a leader (develop-postgresql-1)
2025-01-24 17:00:32,875 INFO: no action. I am (develop-postgresql-0), a secondary, and following a leader (develop-postgresql-1)

if restart pod of cluster, running 1 replica (i.e. successfully started right after operator upgrade), it failed to start:

kubectl --context fxg-betsofa-staging -n brandadmin-staging exec brandadmin-pg-0 -- patronictl topology
Defaulted container "postgres" out of: postgres, exporter
+ Cluster: brandadmin-pg (7369539194529993100) ----+----+-----------+
| Member          | Host        | Role   | State   | TL | Lag in MB |
+-----------------+-------------+--------+---------+----+-----------+
| brandadmin-pg-0 | 10.244.2.68 | Leader | running | 31 |           |
+-----------------+-------------+--------+---------+----+-----------+
$ kubectl -n brandadmin-staging delete pod brandadmin-pg-0
pod "brandadmin-pg-0" deleted

2025-01-24 17:04:21,354 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
2025-01-24 17:04:21,403 - bootstrapping - INFO - No meta-data available for this provider
2025-01-24 17:04:21,404 - bootstrapping - INFO - Looks like you are running unsupported
2025-01-24 17:04:21,441 - bootstrapping - INFO - Configuring patroni
2025-01-24 17:04:21,456 - bootstrapping - INFO - Writing to file /run/postgres.yml
2025-01-24 17:04:21,458 - bootstrapping - INFO - Configuring pam-oauth2
2025-01-24 17:04:21,459 - bootstrapping - INFO - Writing to file /etc/pam.d/postgresql
2025-01-24 17:04:21,459 - bootstrapping - INFO - Configuring bootstrap
2025-01-24 17:04:21,459 - bootstrapping - INFO - Configuring crontab
2025-01-24 17:04:21,459 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
2025-01-24 17:04:21,468 - bootstrapping - INFO - Configuring pgqd
2025-01-24 17:04:21,468 - bootstrapping - INFO - Configuring standby-cluster
2025-01-24 17:04:21,468 - bootstrapping - INFO - Configuring log
Traceback (most recent call last):
  File "/scripts/configure_spilo.py", line 1197, in <module>
    main()
  File "/scripts/configure_spilo.py", line 1159, in main
    write_log_environment(placeholders)
  File "/scripts/configure_spilo.py", line 794, in write_log_environment
    tags = json.loads(os.getenv('LOG_S3_TAGS'))
  File "/usr/lib/python3.10/json/__init__.py", line 339, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not NoneType
2025-01-24 17:04:23,238 WARNING: Kubernetes RBAC doesn't allow GET access to the 'kubernetes' endpoint in the 'default' namespace. Disabling 'bypass_api_service'.
2025-01-24 17:04:23,272 INFO: No PostgreSQL configuration items changed, nothing to reload.
2025-01-24 17:04:23,284 WARNING: Postgresql is not running.
2025-01-24 17:04:23,285 INFO: Lock owner: ; I am brandadmin-pg-0
2025-01-24 17:04:23,287 INFO: pg_controldata:
  pg_control version number: 1300
  Catalog version number: 202307071
  Database system identifier: 7369539194529993100
  Database cluster state: shut down
  pg_control last modified: Fri Jan 24 17:04:07 2025
  Latest checkpoint location: 5A/B9000028
  Latest checkpoint's REDO location: 5A/B9000028
  Latest checkpoint's REDO WAL file: 0000001F0000005A000000B9
  Latest checkpoint's TimeLineID: 31
  Latest checkpoint's PrevTimeLineID: 31
  Latest checkpoint's full_page_writes: on
  Latest checkpoint's NextXID: 0:931095
  Latest checkpoint's NextOID: 873929
  Latest checkpoint's NextMultiXactId: 19
  Latest checkpoint's NextMultiOffset: 37
  Latest checkpoint's oldestXID: 717
  Latest checkpoint's oldestXID's DB: 5
  Latest checkpoint's oldestActiveXID: 0
  Latest checkpoint's oldestMultiXid: 1
  Latest checkpoint's oldestMulti's DB: 5
  Latest checkpoint's oldestCommitTsXid: 0
  Latest checkpoint's newestCommitTsXid: 0
  Time of latest checkpoint: Fri Jan 24 17:04:07 2025
  Fake LSN counter for unlogged rels: 0/3E8
  Minimum recovery ending location: 0/0
  Min recovery ending loc's timeline: 0
  Backup start location: 0/0
  Backup end location: 0/0
  End-of-backup record required: no
  wal_level setting: replica
  wal_log_hints setting: on
  max_connections setting: 100
  max_worker_processes setting: 8
  max_wal_senders setting: 10
  max_prepared_xacts setting: 0
  max_locks_per_xact setting: 64
  track_commit_timestamp setting: off
  Maximum data alignment: 8
  Database block size: 8192
  Blocks per segment of large relation: 131072
  WAL block size: 8192
  Bytes per WAL segment: 16777216
  Maximum length of identifiers: 64
  Maximum columns in an index: 32
  Maximum size of a TOAST chunk: 1996
  Size of a large-object chunk: 2048
  Date/time type storage: 64-bit integers
  Float8 argument passing: by value
  Data page checksum version: 0
  Mock authentication nonce: 389f9007f77b578836bfcee51eabb488b11042d00c48d0d84e718a826ce23d29

2025-01-24 17:04:23,299 INFO: Lock owner: ; I am brandadmin-pg-0
2025-01-24 17:04:23,464 INFO: starting as a secondary
2025-01-24 17:04:23 UTC [56]: [1-1] 6793c817.38 0     LOG:  Auto detecting pg_stat_kcache.linux_hz parameter...
2025-01-24 17:04:23 UTC [56]: [2-1] 6793c817.38 0     LOG:  pg_stat_kcache.linux_hz is set to 1000000
2025-01-24 17:04:23 UTC [56]: [3-1] 6793c817.38 0     FATAL:  could not load server certificate file "/run/certs/server.crt": No such file or directory
2025-01-24 17:04:23 UTC [56]: [4-1] 6793c817.38 0     LOG:  database system is shut down
2025-01-24 17:04:24,002 INFO: postmaster pid=56
/var/run/postgresql:5432 - no response
2025-01-24 17:04:33,296 WARNING: Postgresql is not running.
2025-01-24 17:04:33,296 INFO: Lock owner: ; I am brandadmin-pg-0
2025-01-24 17:04:33,300 INFO: pg_controldata:
  pg_control version number: 1300
  Catalog version number: 202307071
  Database system identifier: 7369539194529993100
  Database cluster state: shut down

After one more restart it hangs early:

Defaulted container "postgres" out of: postgres, exporter
2025-01-24 17:18:07,151 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
2025-01-24 17:18:07,213 - bootstrapping - INFO - No meta-data available for this provider
2025-01-24 17:18:07,214 - bootstrapping - INFO - Looks like you are running unsupported
2025-01-24 17:18:07,291 - bootstrapping - INFO - Configuring pam-oauth2
2025-01-24 17:18:07,292 - bootstrapping - INFO - Writing to file /etc/pam.d/postgresql
2025-01-24 17:18:07,292 - bootstrapping - INFO - Configuring standby-cluster
2025-01-24 17:18:07,294 - bootstrapping - INFO - Configuring crontab
2025-01-24 17:18:07,295 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
2025-01-24 17:18:07,311 - bootstrapping - INFO - Configuring bootstrap
2025-01-24 17:18:07,311 - bootstrapping - INFO - Configuring log
Traceback (most recent call last):
  File "/scripts/configure_spilo.py", line 1197, in <module>
    main()
  File "/scripts/configure_spilo.py", line 1159, in main
    write_log_environment(placeholders)
  File "/scripts/configure_spilo.py", line 794, in write_log_environment
    tags = json.loads(os.getenv('LOG_S3_TAGS'))
  File "/usr/lib/python3.10/json/__init__.py", line 339, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not NoneType

if delete master replica in working cluster, consisting of 2 nodes, ex-master failed to start with same symptoms.

So, eventually, after series of pod restarts, all cluster will be dead. Reverted to 1.13.0
On version 1.13.0 I can't manage to break clusters.

baznikin · 2025-01-29T15:57:10Z

@FxKu may I have your attention? something weird happens here

baznikin changed the title ~~Issues with upgrading 1.11.0 -> 1.14.0~~ Cluster broken with upgrading 1.11.0 -> 1.14.0 Jan 23, 2025

baznikin changed the title ~~Cluster broken with upgrading 1.11.0 -> 1.14.0~~ Clusters became broken after upgrading to 1.14.0 Jan 28, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clusters became broken after upgrading to 1.14.0 #2852

Clusters became broken after upgrading to 1.14.0 #2852

baznikin commented Jan 23, 2025 •

edited

Loading

baznikin commented Jan 23, 2025

baznikin commented Jan 24, 2025

baznikin commented Jan 29, 2025

Clusters became broken after upgrading to 1.14.0 #2852

Clusters became broken after upgrading to 1.14.0 #2852

Comments

baznikin commented Jan 23, 2025 • edited Loading

baznikin commented Jan 23, 2025

baznikin commented Jan 24, 2025

baznikin commented Jan 29, 2025

baznikin commented Jan 23, 2025 •

edited

Loading