ESA -> K8s Checklist #2063

artntek · 2025-02-05T21:56:46Z

Checklist: Metacat K8s Installation Steps

from Quick Reference: Metacat K8s Installation Steps

NOTE: The ESA legacy node was already in Read-Only mode (special case).

1. Copy Data and Set Ownership & Permissions

First, rsync the data from the 2.19 instance over to cephfs (it is OK to leave postgres and tomcat running):

# can also prepend with time, and use --stats --human-readable, and/or --dry-run
#
# metacat:
sudo rsync -aHAX /var/metacat/data/      /mnt/ceph/repos/REPO-NAME/metacat/data/
sudo rsync -aHAX /var/metacat/dataone/   /mnt/ceph/repos/REPO-NAME/metacat/dataone/
sudo rsync -aHAX /var/metacat/documents/ /mnt/ceph/repos/REPO-NAME/metacat/documents/
sudo rsync -aHAX /var/metacat/logs/      /mnt/ceph/repos/REPO-NAME/metacat/logs/

# postgres:
sudo rsync -aHAX /var/lib/postgresql/    /mnt/ceph/repos/REPO-NAME/postgresql/
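
Before the real copy, it can help to preview each transfer. The sketch below only prints dry-run variants of the commands above, using the `--dry-run`, `--stats` and `--human-readable` options mentioned in the comment. The function name `print_rsync_dry_runs` and its argument are illustrative assumptions, not part of Metacat.

```shell
# Hypothetical helper: print (do not run) dry-run versions of the rsync
# commands for one repo. $1 is whatever replaces REPO-NAME on cephfs.
print_rsync_dry_runs() {
  repo="$1"
  for dir in data dataone documents logs; do
    printf 'sudo rsync -aHAX --dry-run --stats --human-readable /var/metacat/%s/ /mnt/ceph/repos/%s/metacat/%s/\n' \
      "$dir" "$repo" "$dir"
  done
  printf 'sudo rsync -aHAX --dry-run --stats --human-readable /var/lib/postgresql/ /mnt/ceph/repos/%s/postgresql/\n' \
    "$repo"
}

# Example: print_rsync_dry_runs REPO-NAME
```

Review the printed commands, run them, and drop `--dry-run` once the reported statistics look plausible.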

After rsyncs are complete, change ownership ON CEPHFS as follows:

## postgres (59996:59996) in postgresql data directory
sudo chown -R 59996:59996 /mnt/ceph/repos/REPO-NAME/postgresql

## tomcat (59997:59997) in metacat directory (run from /mnt/ceph/repos/REPO-NAME/metacat)
sudo chown -R 59997:59997 data dataone documents logs

...then ensure all metacat data and documents files have g+rw permissions, otherwise, hashstore converter can't create hard links:
```
sudo chmod -R g+rw data documents dataone
```
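
To confirm the permissions change took effect, one option is to list anything that still lacks group read or write. This is a minimal sketch; the function name `list_missing_group_rw` is an assumption made for illustration.

```shell
# List anything under the given paths that is missing group read or write.
# `-perm -060` matches entries with BOTH group bits set; `!` inverts that.
list_missing_group_rw() {
  find "$@" ! -perm -060 -print
}

# Example, run from the metacat directory on cephfs:
#   list_missing_group_rw data documents dataone | head
```

Empty output means every file and directory under those paths has g+rw.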

2. Create Secrets

Make a copy of the metacat/helm/admin/secrets.yaml file and rename it to ${RELEASE_NAME}-metacat-secrets.yaml.
Edit it to replace ${RELEASE_NAME} with the correct release name:
```
metadata:
  name: ${RELEASE_NAME}-metacat-secrets
```
Edit it to add the correct passwords for this release (some may be found in the legacy metacat.properties; e.g. postgres, DOI, etc.)
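
The copy-and-rename step can be scripted. The sketch below is one possible approach, not the project's own tooling: `render_secrets_file` is a hypothetical helper, and it assumes the release name contains only letters, digits and hyphens. Passwords still need to be edited in by hand afterwards.

```shell
# Hypothetical helper: copy the secrets template, replacing the literal text
# ${RELEASE_NAME} with the real release name.
# $1 = release name, $2 = path to a copy of metacat/helm/admin/secrets.yaml
render_secrets_file() {
  release="$1"
  template="$2"
  out="${release}-metacat-secrets.yaml"
  # [$] matches a literal dollar sign, so the placeholder is never expanded
  sed 's/[$]{RELEASE_NAME}/'"$release"'/g' "$template" > "$out"
  chmod 600 "$out"   # keep the unencrypted copy private until it is deleted
  echo "$out"
}

# Example: render_secrets_file myrelease secrets.yaml
```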

Deploy it to the cluster:

kubectl apply -f ${RELEASE_NAME}-metacat-secrets.yaml

Save a GPG-ENCRYPTED copy to secure storage.
Delete your local unencrypted copy.

3. Create Persistent Volumes

Assumes cephfs volume credentials are already installed as a k8s Secret - see "Creating Volume Credentials Secret for the PVs" under Tips, below, and the DataONEorg/k8s-cluster example.

Get the current volume sizes from the legacy installation, to help with sizing the PVs (see "Get sizing information for PVs" under Tips, below)
Create a PV for the metacat data directory (example)
Create a PV for the PostgreSQL data directory (example)
Create a PVC for PostgreSQL (example)
Only if using a custom theme: create a PV for the MetacatUI theme directory (example)

4. Values: Create a new values override file

e.g. see the values-dev-cluster-example.yaml file.

TLS ("SSL") setup (ingress.tls.hosts - leave blank to use default, or change if aliases needed - see hostname aliases tip, below)
Set up Node cert and replication etc. as needed - see README.
- Don't forget to install the ca chain, and also enable incoming client cert forwarding:
```
metacat:
  dataone.certificate.fromHttpHeader.enabled: true
```

MetacatUI:

ALWAYS set global.metacatUiThemeName

If using a theme from metacatui-themes, this must be made available on a ceph/PV/PVC mount; e.g:

  customTheme:
    enabled: true
    claimName: metacatsfwmd-metacatui-customtheme
    subPath: metacatui-themes/src/cerp/js/themes/cerp

5. First Install

== IMPORTANT! == IF MOVING DATA FROM AN EXISTING DEPLOYMENT THAT IS ALSO A DATAONE MEMBER NODE, DO NOT REGISTER THIS NODE WITH THE PRODUCTION CN UNTIL YOU'RE READY TO GO LIVE, or bad things will happen...

6. FINAL SWITCH-OVER FROM LEGACY TO K8S

Ensure Nick/Matt are available to change the Route 53 DNS!

BEFORE STARTING: To reduce downtime during switch-over, flag any required values override
updates as @todos. E.g. If you've been using a temporary node name, hostname, and TLS setup,
flag these as TODO, for updates during switchover, with the new values in handy comments:

metacat.server.name

global.metacatExternalBaseUrl

global.d1ClientCnUrl

Any others that will need changing, e.g. dataone.nodeSynchronize, dataone.nodeReplicate
etc.

NOTE: If you need to accommodate hostname aliases, you'll need to update the ingress.tls
section to reflect the new hostname(s) - see Tips, below.
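
During switch-over, a quick way to find everything that was flagged is to search the values override file. This is a small sketch; the function name `list_todos` and the file name in the example are assumptions.

```shell
# List every line in a values override file that is still flagged as a TODO
# (case-insensitive, so "@todo" and "TODO" both match), with line numbers.
list_todos() {
  grep -n -i 'todo' "$1"
}

# Example (hypothetical file name):
#   list_todos values-overrides.yaml
```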

= = = = = = = = = = = = = IN K8S CLUSTER: = = = = = = = = = = = = =

Installations should already be running here from step 5
No need to make a backup of the checksums table for hashstore: legacy was in Read-Only mode, so there should be no delta. Therefore no need to rsync again, or to reindex.
Check values overrides and update any @todos to match live settings. See BEFORE STARTING, above.
If applicable, re-enable dataone.nodeSynchronize and/or dataone.nodeReplicate
Point the deployment at the PRODUCTION CN (https://cn.dataone.org/cn, which is the default) by deleting this entry:
```
## TODO: DELETE ME WHEN READY TO GO LIVE!
global:
  d1ClientCnUrl: https://cn-sandbox.dataone.org/cn
```
In order to push dataone.* member node properties (dataone.nodeId, dataone.subject, dataone.nodeSynchronize, dataone.nodeReplicate) to the CN, set:
```
metacat:
  ## Set to today's date (UTC timezone), in the YYYY-MM-DD format; example:
  dataone.autoRegisterMemberNode: 2024-11-29
```
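
Since the value must be today's date in UTC rather than local time, it may be safest to generate it. A minimal sketch (the function name `today_utc` is illustrative):

```shell
# Print today's date in UTC, formatted YYYY-MM-DD
today_utc() {
  date -u +%Y-%m-%d
}

# Example: use the output of `today_utc` as the value of
# dataone.autoRegisterMemberNode
```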
Do a final helm upgrade
Make sure metacatui picked up the changes - may need to do some pod-kicking

When everything is up and running...

Switch DNS to point to k8s ingress instead of legacy. To get current IP address and hostname:
```
kubectl get ingress -o yaml | egrep "(\- ip:)|(\- host:)"
```
Take down the legacy instance

Tips:

To change the database user's password for your existing database

Note that this connects as the postgres user:

kubectl exec ${RELEASE_NAME}-postgresql-0 -- bash -c "psql -U postgres metacat << EOF
  ALTER USER metacat WITH PASSWORD 'new-password-here';
EOF"

See how many "old" datasets exist in DB, before the upgrade:

kubectl exec ${RELEASE_NAME}-postgresql-0 -- bash -c "psql -U metacat << EOF
  select count(*) as docs from xml_documents where docid not like 'autogen%';
  select count(*) as revs from xml_revisions where docid not like 'autogen%';
EOF"

Monitor Database Upgrade Completion

check in version_history table:

kubectl exec ${RELEASE_NAME}-postgresql-0 -- bash -c "psql -U metacat << EOF
  select version from version_history where status='1';
EOF"

Monitor Hashstore Conversion Progress and Completion

To monitor progress, check the number of rows in the checksums table. The total should be approximately 5 * (total objects), not accounting for conversion errors. The total object count can be found from https://HOSTNAME/CONTEXT/d1/mn/v2/object
```
# get number of entries in `checksums` table -- should be approx 5*(total objects)
kubectl exec ${RELEASE_NAME}-postgresql-0 -- bash -c "psql -U metacat << EOF
  select count(*) from checksums;
EOF"
```
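
The 5 * (total objects) estimate can be turned into a rough percentage. The sketch below is illustrative only: both function names are hypothetical, and `object_total` assumes the listObjects response carries a `total="N"` attribute on its root element, which should be checked against an actual response.

```shell
# Extract the total object count from a listObjects response read on stdin.
# ASSUMPTION: the response's root element carries a total="N" attribute.
object_total() {
  sed -n 's/.*total="\([0-9][0-9]*\)".*/\1/p' | head -n 1
}

# Rough conversion progress in percent: rows / (5 * objects), integer math.
# $1 = row count from the checksums table, $2 = total object count
conversion_pct() {
  rows="$1"
  objects="$2"
  [ "$objects" -gt 0 ] || return 1   # avoid dividing by zero
  echo $(( 100 * rows / (5 * objects) ))
}

# Example (HOSTNAME and CONTEXT as in the URL above; count=0 is assumed to
# return only the totals, without listing any objects):
#   objects=$(curl -s "https://HOSTNAME/CONTEXT/d1/mn/v2/object?count=0" | object_total)
#   conversion_pct "$rows" "$objects"
```

Because conversion errors are not accounted for, the figure may settle slightly below 100.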

To detect when hashstore conversion finishes:

# EITHER CHECK STATUS FROM DATABASE...
kubectl exec ${RELEASE_NAME}-postgresql-0 -- bash -c "psql -U metacat << EOF
  select storage_upgrade_status from version_history where status='1';
EOF"

# ...OR CHECK LOGS
# If log4j root level is INFO
egrep "\[INFO\]: The conversion took [0-9]+ minutes.*HashStoreUpgrader:upgrade"

# If log4j root level is WARN, can also grep for this, if errors:
egrep "\[WARN\]: The conversion is complete"
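
The two patterns above read from standard input, so they need log output piped into them. A minimal sketch combining both into one filter follows; the function name and the `<metacat-pod>` placeholder are assumptions.

```shell
# Read log lines on stdin and print any that signal the end of the hashstore
# conversion. The two patterns are the INFO and WARN messages quoted above.
conversion_finished_lines() {
  grep -E '\[INFO\]: The conversion took [0-9]+ minutes.*HashStoreUpgrader:upgrade|\[WARN\]: The conversion is complete'
}

# Example (replace <metacat-pod> with the actual metacat pod name):
#   kubectl logs <metacat-pod> | conversion_finished_lines
```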

Fix Hashstore Error - PID Doesn't Exist in `identifier` Table:

# If you see this in the metacat logs:
Pid <autogen pid> is missing system metadata. Since the pid starts with autogen and looks like to be created by DataONE api, it should have the systemmetadata. Please look at the systemmetadata and identifier table to figure out the real pid.

Steps to resolve:

1. Given the docid, get all revisions:

select * from identifier where docid='<docid>';

2. Look for the pid beginning with 'autogen', and note its revision number.

3. The real pid should be the obsoleted_by value from the previous revision's system metadata:

select obsoleted_by from systemmetadata where guid='<previous revision pid>';

4. Check by looking at obsoletes from the following revision, if one exists:

select obsoletes from systemmetadata where guid='<following revision pid>';

5. Check whether the systemmetadata table has an entry for the autogen pid:

select checksum from systemmetadata where guid='<autogen pid>';

...and whether the checksum matches that of the original file, found in:

/var/metacat/(data or documents)/<'autogen' docid>.<revision number>

= = = If these exist and do not match, STOP HERE AND INVESTIGATE FURTHER! = = =

6. If an autogen-pid entry was found, update it with the new pid:

update systemmetadata set guid='<pid from steps 3 & 4>' where guid='<autogen pid>';

7. Replace the 'autogen' pid with the real pid in the 'identifier' table:

update identifier set guid='<pid from steps 3 & 4>' where guid='<autogen pid>';

8. Set the hashstore conversion status back to pending:
```
update version_history set storage_upgrade_status='pending' where status='1';
```
...and restart the metacat pod to re-run the hashstore conversion and generate the correct
sysmeta file in hashstore.
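
For the checksum comparison in the steps above, the file's digest has to be computed with the same algorithm as the stored value. The sketch below assumes the algorithm is already known and that the coreutils `md5sum`, `sha1sum` and `sha256sum` tools are available; `checksum_matches` is a hypothetical helper.

```shell
# Compare a file's checksum with the value stored in systemmetadata.
# $1 = file, $2 = expected hex checksum, $3 = algorithm (MD5, SHA-1, SHA-256)
# ASSUMPTION: the algorithm used for the stored checksum is already known.
checksum_matches() {
  case "$3" in
    MD5)     actual=$(md5sum "$1") ;;
    SHA-1)   actual=$(sha1sum "$1") ;;
    SHA-256) actual=$(sha256sum "$1") ;;
    *) echo "unsupported algorithm: $3" >&2; return 2 ;;
  esac
  actual=${actual%% *}                           # keep only the hex digest
  expected=$(printf '%s' "$2" | tr 'A-F' 'a-f')  # compare case-insensitively
  [ "$actual" = "$expected" ]
}

# Example:
#   checksum_matches "/var/metacat/data/<'autogen' docid>.<revision number>" \
#       "<checksum from systemmetadata>" MD5 && echo match || echo MISMATCH
```

A mismatch here is the case where the checklist says to stop and investigate.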

Monitor Indexing Progress:

Using the RabbitMQ Dashboard:

Enable port forwarding:

kubectl port-forward service/${RELEASE_NAME}-rabbitmq-headless 15672:15672

then browse http://localhost:15672. Username metacat-rmq-guest and
RabbitMQ password from metacat Secrets, or from:
```
secret_name=$(kubectl get secrets | egrep ".*\-metacat-secrets" | awk '{print $1}')
rmq_pwd=$(kubectl get secret "$secret_name" \
        -o jsonpath="{.data.rabbitmq-password}" | base64 -d)
echo "rmq_pwd: $rmq_pwd"
```
NOTE: queue activity is not a reliable indicator of indexing progress, since the index
workers continue to process tasks even after the queue has been emptied. The best way to
determine when indexing is complete is to monitor the logs, as follows...

Determining when indexing is complete

Ensure the indexer log level has been set to INFO

grep the logs for the last occurrence of Completed the index task from the index queue:

kubectl logs --max-log-requests 100 -f --tail=100 -l app.kubernetes.io/name=d1index \
     | grep "Completed the index task"

You must be sure indexing has finished before trying to find the last occurrence. Note that some
indexing tasks can take more than an hour.
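
Once indexing is believed to have finished, the last completion message can be picked out without following the logs. This is a sketch under the assumption that sorting on `kubectl logs --timestamps` output is an acceptable way to merge lines from several index workers; the function name is illustrative.

```shell
# Read indexer log lines on stdin and print only the last completion message.
last_index_completion() {
  grep 'Completed the index task' | tail -n 1
}

# Example, without -f so the command exits. --tail=-1 asks for all lines, and
# --timestamps prefixes each line with its time so that `sort` puts lines from
# different pods in chronological order:
#   kubectl logs --timestamps --tail=-1 -l app.kubernetes.io/name=d1index \
#       | sort | last_index_completion
```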

Creating Volume Credentials Secret for the PVs

VERY IMPORTANT when creating volume credentials secret:

For the userID, omit the “client.” from the beginning of the username before base64 encoding
it; e.g.: if your username is client.k8s-dev-metacatknb-subvol-user, use only
k8s-dev-metacatknb-subvol-user

Use echo -n when encoding; i.e:

echo -n myUserID    |  base64
echo -n mypassword  |  base64
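
Both rules can be folded into one small function. The sketch below is illustrative (`encode_ceph_user` is a hypothetical name); it also shows why the trailing newline matters, since a plain `echo` changes the encoded value.

```shell
# Base64-encode a ceph user ID for the Secret, dropping any leading "client."
# printf '%s' writes no trailing newline (the same effect as echo -n).
encode_ceph_user() {
  printf '%s' "${1#client.}" | base64
}

# Example:
#   encode_ceph_user client.k8s-dev-metacatknb-subvol-user
#
# Why the newline matters:
#   printf '%s' foo | base64   # Zm9v
#   echo foo | base64          # Zm9vCg==  (includes the newline: wrong value)
```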

Get sizing information for PVs

$ du -sh /var/metacat /var/lib/postgresql/14
5.6T /var/metacat
255.4G /var/lib/postgresql/14

If a PV can't be unmounted

e.g. if the PV name is cephfs-releasename-metacat-varmetacat:

kubectl patch pv cephfs-releasename-metacat-varmetacat -p '{"metadata":{"finalizers":null}}'

If a PV Mount is Doing Strange Things... (e.g. you're unable to change the `rootPath`)

Kubernetes sometimes has trouble changing a PV mount, even if you delete and re-create it.
If you create a PV and then decide you need to change the rootPath, the old version may still be
'cached' on any nodes where it has previously been accessed by a pod. This can lead to confusing
behavior that is inconsistent across nodes.
To work around this, first delete the PV (after deleting any PVCs that reference it), and then
create it with a different name.

If the metacat pod keeps restarting

Look for this in the logs:

rm: cannot remove '/var/metacat/config/metacat-site.properties': Permission denied

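Ensure the config directory on the PV (for example: /mnt/ceph/repos/REPO-NAME/metacat/config) allows
group write (chmod g+w, keeping the directory's execute bit; a literal mode of 660 on the directory
itself would make it untraversable) after the rsync has been completed or repeated.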

Where to Find Existing Hostname Aliases

Look at the legacy installation in the /etc/apache2/sites-enabled/ directory; e.g.:

# ls /etc/apache2/sites-enabled/
  aoncadis.org.conf      arcticdata.io.conf      beta.arcticdata.io.conf
  # ...etc

the ServerName and ServerAlias directives are in these .conf files, e.g.:
```
  <IfModule mod_ssl.c>
  <VirtualHost *:443>
          DocumentRoot /var/www/arcticdata.io/htdocs
          ServerName arcticdata.io
          ServerAlias www.arcticdata.io permafrost.arcticdata.io
```
NOTE: It may not be necessary to incorporate all these aliases in the k8s environment. For
prod ADC, for example, we left apache running with these aliases in place, and transferred only
the arcticdata.io domain. See Issue #1954.
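
To collect the directives from all the .conf files at once, something like the following could be run on the legacy host. It is a minimal sketch; `list_host_aliases` is a hypothetical helper name.

```shell
# Print every ServerName / ServerAlias directive found in the .conf files of
# an Apache sites-enabled directory (default: /etc/apache2/sites-enabled).
list_host_aliases() {
  dir="${1:-/etc/apache2/sites-enabled}"
  grep -hE '^[[:space:]]*Server(Name|Alias)[[:space:]]' "$dir"/*.conf \
    | sed 's/^[[:space:]]*//' \
    | sort -u
}

# Example: list_host_aliases
```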

artntek mentioned this issue Feb 5, 2025

Move ESA to K8s #2062
