arctic-prod checklist #2023

artntek · 2024-11-20T22:58:51Z

Quick Reference: Metacat K8s Installation Steps

(For more in-depth explanation and details of configuration steps, see the Metacat Helm
README)

1. Copy Data and Set Ownership & Permissions

first rsync the data over to cephfs (OK to leave postgres & tomcat running)

# can also prepend with time, and use --stats --human-readable, and/or --dry-run
# metacat
sudo rsync -aHAX /var/metacat/data/      /mnt/ceph/repos/$NAME/metacat/data/
sudo rsync -aHAX /var/metacat/dataone/   /mnt/ceph/repos/$NAME/metacat/dataone/
sudo rsync -aHAX /var/metacat/documents/ /mnt/ceph/repos/$NAME/metacat/documents/
sudo rsync -aHAX /var/metacat/logs/      /mnt/ceph/repos/$NAME/metacat/logs/

# postgres
sudo rsync -aHAX /var/lib/postgresql/    /mnt/ceph/repos/$NAME/postgresql/

After rsyncs are complete, change ownership ON CEPHFS as follows:

## tomcat (59997:59997) in metacat directory
sudo chown -R 59997:59997 /mnt/ceph/repos/$NAME/metacat

## postgres (59996:59996) in postgresql data directory
sudo chown -R 59996:59996 /mnt/ceph/repos/$NAME/postgresql

...then ensure all metacat data and documents files have g+rw permissions, otherwise,
hashstore converter can't create hard links (find quicker for huge corpuses):

sudo find /mnt/ceph/repos/$NAME/metacat/data/      -type f ! -perm -g=rw -exec chmod g+rw {} +
sudo find /mnt/ceph/repos/$NAME/metacat/documents/ -type f ! -perm -g=rw -exec chmod g+rw {} +

2. Create Secrets

Make a copy of the metacat/helm/admin/secrets.yaml file and rename to
${RELEASE_NAME}-metacat-secrets.yaml

edit to add release name:

metadata:
  name: ${RELEASE_NAME}-metacat-secrets

edit to add the correct passwords for this release (some may be found in legacy
metacat.properties; e.g. postgres, DOI, etc.)

Deploy it to the cluster:

kubectl apply -f ${RELEASE_NAME}-metacat-secrets.yaml

Save a GPG-ENCRYPTED copy in the NCEAS Security repo.
Delete your local unencrypted copy.

3. Create Persistent Volumes

(Assumes cephfs volume credentials already installed - see prod
example).

Create a PV for the metacat data directory - see arctic prod dev PV
example
Create a PV for the PostgreSQL data directory - see arctic prod PV
example
Create a PVC for PostgreSQL; see arctic prod pvc
example

4. Values: Create a new values override file

e.g. see the
values-prod-cluster-arctic.yaml
example.

TLS ("SSL") setup (ingress.tls.hosts - leave blank to use default, or change if aliases
needed - see Tips, below)
Set up Node cert and replication etc. as needed - see
README.
- Don't forget to install the ca root and run theconfigure-nginx-mutual-auth.sh script, and
  also enable incoming client cert forwarding:
```
metacat:
  dataone.certificate.fromHttpHeader.enabled: true
```

MetacatUI:
- ALWAYS set global.metacatUiThemeName
- If using a bundled theme (arctic, knb etc.):
  - for PROD, no further action required

5. First Install

IMPORTANT IF MOVING DATA FROM AN EXISTING LEGACY DEPLOYMENT!

DO NOT REGISTER THIS NODE WITH THE PRODUCTION CN UNTIL YOU'RE READY TO GO LIVE, or bad things
will happen...

6. FINAL SWITCH-OVER FROM LEGACY TO K8S

BEFORE STARTING: To reduce downtime during switch-over, flag any required values override
updates as @todos. E.g. If you've been using a temporary node name, hostname, and TLS setup,
flag these as TODO, for updates during switchover, with the new values in handy comments:
- metacat.server.name
- global.metacatExternalBaseUrl
- global.d1ClientCnUrl
- Any others that will need changing, e.g. dataone.nodeSynchronize, dataone.nodeReplicate
  etc.
- NOTE: If you need to accommodate hostname aliases, you'll need to update the ingress.tls
  section to reflect the new hostname(s) - see Tips,
  below.

= = = = = = = = = = = = = IN K8S CLUSTER: = = = = = = = = = = = = =

Make a backup of the checksums table so hashstore won't try to reconvert completed files:

# inside metacat pod, run the backup script:
kubectl exec ${RELEASE_NAME}-0 -- bash -c \
  "$TC_HOME/webapps/metacat/WEB-INF/scripts/sql/backup-restore-checksums-table/backup-checksums-table.sh"

helm delete the running installation. (keep all secrets, PVCs etc!)

= = = = = = = = = = = = = ON LEGACY HOST = = = = = = = = = = = = =

ENSURE NOBODY IS IN THE MIDDLE OF A BIG UPLOAD! (Can schedule off-hours, but how to monitor?)

Stop postgres and tomcat

# ssh to legacy host, then...
sudo systemctl stop postgresql
sudo systemctl stop tomcat9

"top-up" rsync from legacy to ceph:

# NOTES:
# 1. Don't use -aHAX (like orig. rsync); use -rltDHX to not overwrite ownership or permissions
# 2. Don't use --delete option for /var/metacat/ rsync
#
sudo rsync -rltDHX --stats --human-readable \
          /var/metacat/data/         /mnt/ceph/repos/$NAME/metacat/data/

sudo rsync -rltDHX --stats --human-readable \
          /var/metacat/dataone/      /mnt/ceph/repos/$NAME/metacat/dataone/

sudo rsync -rltDHX --stats --human-readable \
          /var/metacat/documents/    /mnt/ceph/repos/$NAME/metacat/documents/

sudo rsync -rltDHX --stats --human-readable \
          /var/metacat/logs/         /mnt/ceph/repos/$NAME/metacat/logs/

# Use -aHAX for postgres, since we don't change any perms, and can do a chmod -R quickly
sudo rsync -aHAX --stats --human-readable --delete \
          /var/lib/postgresql/       /mnt/ceph/repos/$NAME/postgresql/

While rsync is in progress, edit /var/metacat/config/metacat-site.properties to add:
```
application.readOnlyMode=true
```
When rsync done (approx 5 mins?), start postgres, and then start tomcat.
```
sudo systemctl start postgresql

sudo systemctl start tomcat9
```
Check it's in RO mode! https://test.arcticdata.io/metacat/d1/mn - look for
<property key="read_only_mode">

= = = = = = = = = = = = = IN K8S CLUSTER: = = = = = = = = = = = = =

fix ownership and permissions of newly-copied files:

## tomcat (59997:59997) in metacat directory
sudo chown -R 59997:59997 /mnt/ceph/repos/$NAME/metacat

sudo find /mnt/ceph/repos/$NAME/metacat/data/      -type f ! -perm -g=rw -exec chmod g+rw {} +
sudo find /mnt/ceph/repos/$NAME/metacat/documents/ -type f ! -perm -g=rw -exec chmod g+rw {} +

## postgres (59996:59996) in postgresql data directory
sudo chown -R 59996:59996 /mnt/ceph/repos/$NAME/postgresql

If (legacy DB version) < (k8s db version), disable Probes until Database Upgrade is finished
Set storage.hashstore.disableConversion: true, so the hashstore converter won't run.
helm-install

Repeat correction of checksum algorithm names for any new records:

kubectl exec ${RELEASE_NAME}-postgresql-0 -- bash -c "psql -U metacat << EOF
  SELECT DISTINCT checksum_algorithm FROM systemmetadata WHERE checksum_algorithm NOT IN
        ('MD2','MD5','SHA-1','SHA-256','SHA-384','SHA-512','SHA-512/224','SHA-512/256');
EOF"

# then manually update each to the correct syntax; e.g:
kubectl exec ${RELEASE_NAME}-postgresql-0 -- bash -c "psql -U metacat << EOF
  UPDATE systemmetadata SET checksum_algorithm='SHA-1' WHERE checksum_algorithm='SHA1';
EOF"
# ...etc

Restore the checksums table from the backup, so hashstore won't try to reconvert
completed files:

# inside metacat pod, run the restore script:
kubectl exec ${RELEASE_NAME}-0 -- bash -c \
  "$TC_HOME/webapps/metacat/WEB-INF/scripts/sql/backup-restore-checksums-table/restore-checksums-table.sh"

Delete the storage.hashstore.disableConversion: setting, so the hashstore converter will
run, and do a helm upgrade

See Tips, below for
how to detect when hashstore conversion finishes

When hashstore conversion has finished successfully...

Check values overrides and update any @todos to match live settings. See BEFORE STARTING,
above).
If applicable, re-enable dataone.nodeSynchronize and/or dataone.nodeReplicate
Point the deployment at the PRODUCTION CN (https://cn.dataone.org/cn, which is the
default) by deleting this entry:
```
## TODO: DELETE ME WHEN READY TO GO LIVE!
global:
  d1ClientCnUrl: https://cn-sandbox.dataone.org/cn
```
In order to push dataone.* member node properties (dataone.nodeId, dataone.subject,
dataone.nodeSynchronize, dataone.nodeReplicate) to the CN, set:

  metacat:
    dataone.autoRegisterMemberNode: ## today's date in format: 2024-07-29

Do a final helm upgrade
Make sure metacatui picked up the changes - may need to do some pod-kicking

When everything is up and running...

Switch Route53 DNS to point to k8s ingress instead of legacy:
```
kubectl get ingress -o yaml | egrep "(\- ip:)|(\- host:)"
```
Take down the legacy instance

Index only the newer datasets:

# on your local machine:
cd <metacat>/src/scripts/bash/k8s
./index-delta.sh <start-time>
# where <start-time> is the time an hour or more before the previous rsync,
#     in the format: yyyy-mm-dd HH:MM:SS (with a space; e.g. 2024-11-01 14:01:00)

Tips:

To change the database user's password for your existing database

Note postgres user:

kubectl exec ${RELEASE_NAME}-postgresql-0 -- bash -c "psql -U postgres metacat << EOF
  ALTER USER metacat WITH PASSWORD 'new-password-here';
EOF"

See how many "old" datasets exist in DB, before the upgrade:

kubectl exec metacatarctic-postgresql-0 -- bash -c "psql -U metacat << EOF
  select count(*) as docs from xml_documents where docid not like 'autogen%';
  select count(*) as revs from xml_revisions where docid not like 'autogen%';
EOF"

Monitor Database Upgrade Completion

check in version_history table:

kubectl exec ${RELEASE_NAME}-postgresql-0 -- bash -c "psql -U metacat << EOF
  select version from version_history where status='1';
EOF"

Monitor Hashstore Conversion Progress and Completion

To monitor progress: check the number of rows in the checksums table: total # rows should
be: 5 * (total objects), (approx; not accounting for conversion errors), where total object
count can be found from https://$HOST/metacat/d1/mn/v2/object
```
# get number of entries in `checksums` table -- should be approx 5*(total objects)
kubectl exec ${RELEASE_NAME}-postgresql-0 -- bash -c "psql -U metacat << EOF
  select count(*) from checksums;
EOF"
```

To detect when hashstore conversion finishes:

# EITHER CHECK STATUS FROM DATABASE...
kubectl exec ${RELEASE_NAME}-postgresql-0 -- bash -c "psql -U metacat << EOF
  select storage_upgrade_status from version_history where status='1';
EOF"

# ...OR CHECK LOGS
# If log4j root level is INFO
egrep "\[INFO\]: The conversion took [0-9]+ minutes.*HashStoreUpgrader:upgrade"

# If log4j root level is WARN, can also grep for this, if errors:
egrep "\[WARN\]: The conversion is complete"

Fix Hashstore Error - PID Doesn't Exist in `identifier` Table:

Pid <autogen pid> is missing system metadata. Since the pid starts with autogen and looks like to be
created by DataONE api, it should have the systemmetadata. Please look at the systemmetadata and
identifier table to figure out the real pid.

Steps to resolve:

Given the docid, get all revisions:

select * from identifier where docid='<docid>';

Look for pid beginning 'autogen', and note its revision number

pid should be the 'obsoletedBy' from the previous revision's system metadata:

select obsoletedBy from systemmetadata where guid='<previous revision pid>';

Check by look at 'obsoletes' from the following revision, if one exists:

select obsoletes from systemmetadata where guid='<following revision pid>';

Ensure systemmetadata table has an entry for autogen pid

select checksum from systemmetadata where guid='<autogen pid>';

...and the checksum matches that of the original file, found in:

/var/metacat/(data or documents)/<'autogen' docid>.<revision number>

= = = If these do not match, STOP HERE AND INVESTIGATE FURTHER = = =

Update the autogen-pid entry with the new pid:

update systemmetadata set guid='<pid from steps 3 & 4>' where guid='<autogen pid>';

Replace the 'autogen' pid with the real pid in the 'identifier' table:

update identifier set guid='<pid from steps 3 & 4>' where guid='<autogen pid>';

Set the hashstore conversion status back to pending:
```
update version_history set storage_upgrade_status='pending' where status='1';
```
...and restart the metacat pod to re-run the hashstore conversion and generate the correct
sysmeta file in hashstore

Monitor Indexing Progress via RabbitMQ Dashboard:

Enable port forwarding:

kubectl port-forward service/${RELEASE_NAME}-rabbitmq-headless 15672:15672

then browse http://localhost:15672. Username metacat-rmq-guest and
RabbitMQ password from metacat Secrets, or from:

secret_name=$(kubectl get secrets | egrep ".*\-metacat-secrets" | awk '{print $1}')
rmq_pwd=$(kubectl get secret "$secret_name" \
        -o jsonpath="{.data.rabbitmq-password}" | base64 -d)
echo "rmq_pwd: $rmq_pwd"

If a PV can't be unmounted

e.g. if the PV name is cephfs-metacatarctic-metacat-varmetacat:

kubectl patch pv cephfs-metacatarctic-metacat-varmetacat -p '{"metadata":{"finalizers":null}}'

If the metacat pod keeps restarting

Look for this in the logs:

rm: cannot remove '/var/metacat/config/metacat-site.properties': Permission denied

Ensure the config directory on the PV (for example: /mnt/ceph/repos/$NAME/metacat/config) allows
group write (chmod 660) after the rsync has been completed or repeated.

Where to Find Existing Hostname Aliases

Look at the legacy installation in the /etc/apache2/sites-enabled/ directory; e.g.:

root@arctica:/etc/apache2/sites-enabled# ls
  aoncadis.org.conf      arcticdata.io.conf      beta.arcticdata.io.conf
  # ...etc

the ServerName and ServerAlias directives are in these .conf files, e.g.:
```
  <IfModule mod_ssl.c>
  <VirtualHost *:443>
          DocumentRoot /var/www/arcticdata.io/htdocs
          ServerName arcticdata.io
          ServerAlias www.arcticdata.io permafrost.arcticdata.io
```
NOTE: it may not be necessary to incorporate all these aliases in the k8s environment. For
prod ADC, for example, we left apache running with these aliases in place, and transferred only
the arcticdata.io domain. see Issue #1954)

The text was updated successfully, but these errors were encountered:

artntek · 2024-11-20T23:00:17Z

see #1984

artntek added this to Metacat Releases Nov 19, 2024

artntek converted this from a draft issue Nov 20, 2024

artntek added the k8s Kubernetes/Helm Related label Nov 20, 2024

artntek moved this to In Progress in Metacat Releases Nov 20, 2024

artntek moved this from In Progress to Blocked in Metacat Releases Dec 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

arctic-prod checklist #2023

arctic-prod checklist #2023

artntek commented Nov 20, 2024 •

edited

Loading

artntek commented Nov 20, 2024

arctic-prod checklist #2023

arctic-prod checklist #2023

Comments

artntek commented Nov 20, 2024 • edited Loading

Quick Reference: Metacat K8s Installation Steps

1. Copy Data and Set Ownership & Permissions

2. Create Secrets

3. Create Persistent Volumes

4. Values: Create a new values override file

5. First Install

IMPORTANT IF MOVING DATA FROM AN EXISTING LEGACY DEPLOYMENT!

6. FINAL SWITCH-OVER FROM LEGACY TO K8S

Tips:

To change the database user's password for your existing database

See how many "old" datasets exist in DB, before the upgrade:

Monitor Database Upgrade Completion

Monitor Hashstore Conversion Progress and Completion

Fix Hashstore Error - PID Doesn't Exist in identifier Table:

Monitor Indexing Progress via RabbitMQ Dashboard:

If a PV can't be unmounted

If the metacat pod keeps restarting

Where to Find Existing Hostname Aliases

artntek commented Nov 20, 2024

artntek commented Nov 20, 2024 •

edited

Loading

Fix Hashstore Error - PID Doesn't Exist in `identifier` Table: