(For more in-depth explanation and details of configuration steps, see the Metacat Helm
README)
1. Copy Data and Set Ownership & Permissions
First, rsync the data over to cephfs (OK to leave postgres & tomcat running):
# can also prepend with time, and use --stats --human-readable, and/or --dry-run
# metacat
sudo rsync -aHAX /var/metacat/data/ /mnt/ceph/repos/$NAME/metacat/data/
sudo rsync -aHAX /var/metacat/dataone/ /mnt/ceph/repos/$NAME/metacat/dataone/
sudo rsync -aHAX /var/metacat/documents/ /mnt/ceph/repos/$NAME/metacat/documents/
sudo rsync -aHAX /var/metacat/logs/ /mnt/ceph/repos/$NAME/metacat/logs/
# postgres
sudo rsync -aHAX /var/lib/postgresql/ /mnt/ceph/repos/$NAME/postgresql/
After rsyncs are complete, change ownership ON CEPHFS as follows:
## tomcat (59997:59997) in metacat directory
sudo chown -R 59997:59997 /mnt/ceph/repos/$NAME/metacat
## postgres (59996:59996) in postgresql data directory
sudo chown -R 59996:59996 /mnt/ceph/repos/$NAME/postgresql
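...then ensure all metacat data and documents files have g+rw permissions; otherwise, the
hashstore converter can't create hard links (find is quicker for huge corpuses):
sudo find /mnt/ceph/repos/$NAME/metacat/data/ -type f ! -perm -g=rw -exec chmod g+rw {} +
sudo find /mnt/ceph/repos/$NAME/metacat/documents/ -type f ! -perm -g=rw -exec chmod g+rw {} +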
The Node ID (in metacat.dataone.nodeId and metacat.dataone.subject) MUST MATCH the
legacy deployment! (Don't use a temp ID; this will be persisted into
hashstore during conversion!)
The metacat.dataone.autoRegisterMemberNode: flag MUST NOT match today's date!
Existing node already syncing to D1? Set dataone.nodeSynchronize: false until after final
switch-over!
Existing node already accepting D1 replicas? Set dataone.nodeReplicate: false until after final
switch-over!
If legacy DB version was < 3.0.0, Disable livenessProbe & readinessProbe until Database
Upgrade is finished.
NOTE: The upgrade only writes OLD datasets (ones whose docid does not start with autogen) from DB to disk.
These should all be finished after the first upgrade, so provided subsequent /var/metacat/ rsyncs are
only additive (don't --delete destination files not on source), subsequent DB upgrades
after incremental rsyncs will be very fast. See Tips, below, for how to check the number
of "old" datasets in the DB before the upgrade.
Set storage.hashstore.disableConversion: true, so the hashstore converter won't run yet.
"top-up" rsync from legacy to ceph:
From the PostgreSQL documentation:
Another option is to use rsync to perform a file system backup. This is done by first running
rsync while the database server is running, then shutting down the database server long enough
to do an rsync --checksum. (--checksum is necessary because rsync only has file
modification-time granularity of one second.) The second rsync will be quicker than the first,
because it has relatively little data to transfer, and the end result will be consistent because
the server was down. This method allows a file system backup to be performed with minimal
downtime.
fix ownership and permissions of newly-copied files:
## tomcat (59997:59997) in metacat directory
sudo chown -R 59997:59997 /mnt/ceph/repos/$NAME/metacat
sudo find /mnt/ceph/repos/$NAME/metacat/data/ -type f ! -perm -g=rw -exec chmod g+rw {} +
sudo find /mnt/ceph/repos/$NAME/metacat/documents/ -type f ! -perm -g=rw -exec chmod g+rw {} +
## postgres (59996:59996) in postgresql data directory
sudo chown -R 59996:59996 /mnt/ceph/repos/$NAME/postgresql
helm install, debug any startup and configuration issues, and allow database upgrade to
finish.
See Tips, below, for how to detect when database conversion finishes
In the metacat database, verify that all the systemmetadata.checksum_algorithm entries are
on the list of supported algorithms
(NOTE: syntax matters! E.g. sha-1 is OK, but sha1 isn't):
kubectl exec ${RELEASE_NAME}-postgresql-0 -- bash -c "psql -U metacat << EOF
SELECT DISTINCT checksum_algorithm FROM systemmetadata WHERE checksum_algorithm NOT IN ('MD2','MD5','SHA-1','SHA-256','SHA-384','SHA-512','SHA-512/224','SHA-512/256');
EOF"
# then manually update each to the correct syntax; e.g:
kubectl exec ${RELEASE_NAME}-postgresql-0 -- bash -c "psql -U metacat << EOF
UPDATE systemmetadata SET checksum_algorithm='SHA-1' WHERE checksum_algorithm='SHA1';
EOF"
# ...etc
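The manual corrections above can be scripted when there are many distinct bad values. The following is a minimal sketch, assuming the only fixable variants are letter case and a missing hyphen after SHA (e.g. sha1, sha256), and that the canonical spellings are the upper-case names in the query above. The function names normalize_checksum_algorithm and checksum_fix_sql are hypothetical, and the generated SQL should be reviewed before it is sent to psql.

```shell
#!/bin/sh
# Hypothetical helper: map a checksum algorithm name as found in the DB
# to the spelling used in the supported-algorithm list from the query above.
# Prints the canonical name, or returns 1 if the name is not recognized.
normalize_checksum_algorithm() {
    # upper-case the input so 'sha-1' and 'SHA-1' are treated alike
    upper=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
    case "$upper" in
        MD2|MD5|SHA-1|SHA-256|SHA-384|SHA-512|SHA-512/224|SHA-512/256)
            printf '%s\n' "$upper" ;;
        SHA1|SHA256|SHA384|SHA512)
            # ASSUMPTION: these are just missing the hyphen after "SHA"
            printf 'SHA-%s\n' "${upper#SHA}" ;;
        *)
            return 1 ;;
    esac
}

# Hypothetical helper: print one UPDATE statement for one bad value.
checksum_fix_sql() {
    fixed=$(normalize_checksum_algorithm "$1") || {
        echo "-- unrecognized algorithm '$1': fix manually"
        return 1
    }
    printf "UPDATE systemmetadata SET checksum_algorithm='%s' WHERE checksum_algorithm='%s';\n" \
        "$fixed" "$1"
}
```

For example, `checksum_fix_sql SHA1` prints the same UPDATE statement shown above, while an unrecognized name such as `crc32` produces a SQL comment and a non-zero exit status instead of a guess.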
Delete the storage.hashstore.disableConversion: setting, so the hashstore converter will
run, and do a helm upgrade.
See Tips, below, for how to detect when hashstore conversion finishes
When database upgrade and hashstore conversion have both finished, re-enable probes
Re-index all datasets (Did with 25 indexers for test.adc on dev. More on prod?)
kubectl get secret ${RELEASE_NAME}-d1-client-cert -o jsonpath="{.data.d1client\.crt}" | \
    base64 -d > DELETEME_NODE_CERT.pem
curl -X PUT --cert ./DELETEME_NODE_CERT.pem "https://MYHOST/MYCONTEXT/d1/mn/v2/index?all=true"
# look for <scheduled>true</scheduled> in response
# don't forget to delete the cert file:
rm DELETEME_NODE_CERT.pem
See Tips, below, for monitoring indexing progress via RabbitMQ dashboard.
6. FINAL SWITCH-OVER FROM LEGACY TO K8S
BEFORE STARTING: To reduce downtime during switch-over, flag any required values override
updates as @todos. E.g. If you've been using a temporary node name, hostname, and TLS setup,
flag these as TODO, for updates during switchover, with the new values in handy comments:
metacat.server.name
global.metacatExternalBaseUrl
global.d1ClientCnUrl
Any others that will need changing, e.g. dataone.nodeSynchronize, dataone.nodeReplicate
etc.
NOTE: If you need to accommodate hostname aliases, you'll need to update the ingress.tls
section to reflect the new hostname(s) - see Tips,
below.
fix ownership and permissions of newly-copied files:
## tomcat (59997:59997) in metacat directory
sudo chown -R 59997:59997 /mnt/ceph/repos/$NAME/metacat
sudo find /mnt/ceph/repos/$NAME/metacat/data/ -type f ! -perm -g=rw -exec chmod g+rw {} +
sudo find /mnt/ceph/repos/$NAME/metacat/documents/ -type f ! -perm -g=rw -exec chmod g+rw {} +
## postgres (59996:59996) in postgresql data directory
sudo chown -R 59996:59996 /mnt/ceph/repos/$NAME/postgresql
If (legacy DB version) < (k8s db version), disable Probes until Database Upgrade is finished
Set storage.hashstore.disableConversion: true, so the hashstore converter won't run.
helm install
Repeat correction of checksum algorithm names for any new records:
kubectl exec ${RELEASE_NAME}-postgresql-0 -- bash -c "psql -U metacat << EOF
SELECT DISTINCT checksum_algorithm FROM systemmetadata WHERE checksum_algorithm NOT IN ('MD2','MD5','SHA-1','SHA-256','SHA-384','SHA-512','SHA-512/224','SHA-512/256');
EOF"
# then manually update each to the correct syntax; e.g:
kubectl exec ${RELEASE_NAME}-postgresql-0 -- bash -c "psql -U metacat << EOF
UPDATE systemmetadata SET checksum_algorithm='SHA-1' WHERE checksum_algorithm='SHA1';
EOF"
# ...etc
Restore the checksums table from the backup, so hashstore won't try to reconvert
completed files:
# inside metacat pod, run the restore script:
kubectl exec ${RELEASE_NAME}-0 -- bash -c \
"$TC_HOME/webapps/metacat/WEB-INF/scripts/sql/backup-restore-checksums-table/restore-checksums-table.sh"
Delete the storage.hashstore.disableConversion: setting, so the hashstore converter will
run, and do a helm upgrade
See Tips, below, for how to detect when hashstore conversion finishes
When hashstore conversion has finished successfully...
If applicable, re-enable dataone.nodeSynchronize and/or dataone.nodeReplicate
Point the deployment at the PRODUCTION CN (https://cn.dataone.org/cn, which is the
default) by deleting this entry:
## TODO: DELETE ME WHEN READY TO GO LIVE!
global:
  d1ClientCnUrl: https://cn-sandbox.dataone.org/cn
In order to push dataone.* member node properties (dataone.nodeId, dataone.subject, dataone.nodeSynchronize, dataone.nodeReplicate) to the CN, set:
metacat:
  dataone.autoRegisterMemberNode: ## today's date in format: 2024-07-29
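The date value can be generated instead of typed by hand. This is a minimal sketch, assuming a POSIX date command on the machine where you edit the values override file. The function name auto_register_snippet is hypothetical, and the date is taken from the local time zone, which may differ from the time zone of the cluster.

```shell
#!/bin/sh
# Hypothetical helper: print the values-override entry shown above,
# filled in with today's date in yyyy-mm-dd format (local time zone).
auto_register_snippet() {
    printf 'metacat:\n  dataone.autoRegisterMemberNode: %s\n' "$(date +%Y-%m-%d)"
}

# Print the snippet so it can be pasted into the values override file.
auto_register_snippet
```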
Do a final helm upgrade
Make sure metacatui picked up the changes - may need to do some pod-kicking
When everything is up and running...
Switch Route53 DNS to point to k8s ingress instead of legacy:
kubectl get ingress -o yaml | egrep "(\- ip:)|(\- host:)"
Take down the legacy instance
Index only the newer datasets:
# on your local machine:
cd <metacat>/src/scripts/bash/k8s
./index-delta.sh <start-time>
# where <start-time> is the time an hour or more before the previous rsync,
# in the format: yyyy-mm-dd HH:MM:SS (with a space; e.g. 2024-11-01 14:01:00)
Tips:
To change the database user's password for your existing database
Note postgres user:
kubectl exec ${RELEASE_NAME}-postgresql-0 -- bash -c "psql -U postgres metacat << EOF
ALTER USER metacat WITH PASSWORD 'new-password-here';
EOF"
See how many "old" datasets exist in DB, before the upgrade:
kubectl exec metacatarctic-postgresql-0 -- bash -c "psql -U metacat << EOF
select count(*) as docs from xml_documents where docid not like 'autogen%';
select count(*) as revs from xml_revisions where docid not like 'autogen%';
EOF"
Monitor Database Upgrade Completion
check in version_history table:
kubectl exec ${RELEASE_NAME}-postgresql-0 -- bash -c "psql -U metacat << EOF
select version from version_history where status='1';
EOF"
Monitor Hashstore Conversion Progress and Completion
To monitor progress: check the number of rows in the checksums table. The total number of rows
should be approximately 5 * (total objects), not accounting for conversion errors, where the total
object count can be found from https://$HOST/metacat/d1/mn/v2/object
# get number of entries in `checksums` table -- should be approx 5*(total objects)
kubectl exec ${RELEASE_NAME}-postgresql-0 -- bash -c "psql -U metacat << EOF
select count(*) from checksums;
EOF"
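The 5-rows-per-object rule of thumb can be turned into a rough progress percentage. This is a minimal sketch: the function name conversion_percent is hypothetical, both counts are supplied by hand (the row count from the query above, the object count from the node's object listing), and the result is only as accurate as the approximation itself.

```shell
#!/bin/sh
# Hypothetical helper: estimate hashstore conversion progress as a percentage.
#   $1 = current row count of the checksums table
#   $2 = total object count reported by the node
# Uses the approximation of 5 checksum rows per object.
conversion_percent() {
    rows=$1
    objects=$2
    if [ "$objects" -le 0 ]; then
        echo "total object count must be positive" >&2
        return 1
    fi
    # integer arithmetic; the result is rounded down
    echo $(( rows * 100 / (objects * 5) ))
}
```

For example, `conversion_percent 500 200` prints 50, because 200 objects are expected to yield about 1000 rows. Conversion errors can leave the final figure slightly below 100.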
To detect when hashstore conversion finishes:
# EITHER CHECK STATUS FROM DATABASE...
kubectl exec ${RELEASE_NAME}-postgresql-0 -- bash -c "psql -U metacat << EOF
select storage_upgrade_status from version_history where status='1';
EOF"
# ...OR CHECK LOGS
# If log4j root level is INFO
egrep "\[INFO\]: The conversion took [0-9]+ minutes.*HashStoreUpgrader:upgrade"
# If log4j root level is WARN, can also grep for this, if errors:
egrep "\[WARN\]: The conversion is complete"
Fix Hashstore Error - PID Doesn't Exist in identifier Table:
Pid <autogen pid> is missing system metadata. Since the pid starts with autogen and looks like to be
created by DataONE api, it should have the systemmetadata. Please look at the systemmetadata and
identifier table to figure out the real pid.
Steps to resolve:
1. Given the docid, get all revisions:
select * from identifier where docid='<docid>';
2. Look for the pid beginning 'autogen', and note its revision number
3. The pid should be the 'obsoletedBy' from the previous revision's system metadata:
select obsoletedBy from systemmetadata where guid='<previous revision pid>';
4. Check by looking at 'obsoletes' from the following revision, if one exists:
select obsoletes from systemmetadata where guid='<following revision pid>';
5. Ensure the systemmetadata table has an entry for the autogen pid
select checksum from systemmetadata where guid='<autogen pid>';
...and the checksum matches that of the original file, found in:
/var/metacat/(data or documents)/<'autogen' docid>.<revision number>
= = = If these do not match, STOP HERE AND INVESTIGATE FURTHER = = =
6. Update the autogen-pid entry with the new pid:
update systemmetadata set guid='<pid from steps 3 & 4>' where guid='<autogen pid>';
7. Replace the 'autogen' pid with the real pid in the 'identifier' table:
update identifier set guid='<pid from steps 3 & 4>' where guid='<autogen pid>';
8. Set the hashstore conversion status back to pending:
update version_history set storage_upgrade_status='pending' where status='1';
...and restart the metacat pod to re-run the hashstore conversion and generate the correct
sysmeta file in hashstore
If the metacat pod keeps restarting
Ensure the config directory on the PV (for example: /mnt/ceph/repos/$NAME/metacat/config) allows group write (chmod 660) after the rsync has been completed or repeated.
Where to Find Existing Hostname Aliases
Look at the legacy installation in the /etc/apache2/sites-enabled/ directory; e.g.:
root@arctica:/etc/apache2/sites-enabled# ls
aoncadis.org.conf arcticdata.io.conf beta.arcticdata.io.conf
# ...etc
the ServerName and ServerAlias directives are in these .conf files, e.g.:
NOTE: it may not be necessary to incorporate all these aliases in the k8s environment. For
prod ADC, for example, we left apache running with these aliases in place, and transferred only
the arcticdata.io domain (see Issue #1954).
Quick Reference: Metacat K8s Installation Steps
(For more in-depth explanation and details of configuration steps, see the Metacat Helm
README)
1. Copy Data and Set Ownership & Permissions
first rsync the data over to cephfs (OK to leave postgres & tomcat running)
After rsyncs are complete, change ownership ON CEPHFS as follows:
...then ensure all metacat data and documents files have g+rw permissions, otherwise,
hashstore converter can't create hard links (find quicker for huge corpuses):
2. Create Secrets
Make a copy of the metacat/helm/admin/secrets.yaml file and rename to ${RELEASE_NAME}-metacat-secrets.yaml
edit to add release name:
edit to add the correct passwords for this release (some may be found in legacy
metacat.properties; e.g. postgres, DOI, etc.)
Deploy it to the cluster:
kubectl apply -f ${RELEASE_NAME}-metacat-secrets.yaml
Save a GPG-ENCRYPTED copy in the NCEAS Security repo.
Delete your local unencrypted copy.
3. Create Persistent Volumes
(Assumes cephfs volume credentials already installed - see prod example).
4. Values: Create a new values override file
e.g. see the values-prod-cluster-arctic.yaml example.
ingress.tls.hosts - leave blank to use default, or change if aliases needed - see Tips, below)
README.
configure-nginx-mutual-auth.sh script, and also enable incoming client cert forwarding:
global.metacatUiThemeName
5. First Install
IMPORTANT IF MOVING DATA FROM AN EXISTING LEGACY DEPLOYMENT!
DO NOT REGISTER THIS NODE WITH THE PRODUCTION CN UNTIL YOU'RE READY TO GO LIVE, or bad things
will happen...
Point the deployment at the SANDBOX CN
The Node ID (in metacat.dataone.nodeId and metacat.dataone.subject) MUST MATCH the
legacy deployment! (Don't use a temp ID; this will be persisted into
hashstore during conversion!)
The metacat.dataone.autoRegisterMemberNode: flag MUST NOT match today's date!
Existing node already syncing to D1? Set dataone.nodeSynchronize: false until after final
switch-over!
Existing node already accepting D1 replicas? Set dataone.nodeReplicate: false until after final
switch-over!
If legacy DB version was < 3.0.0, Disable livenessProbe & readinessProbe until Database
Upgrade is finished.
set storage.hashstore.disableConversion: true, so the hashstore converter won't run yet
"top-up" rsync from legacy to ceph:
fix ownership and permissions of newly-copied files:
helm install, debug any startup and configuration issues, and allow database upgrade to
finish.
In the metacat database, verify that all the systemmetadata.checksum_algorithm entries are
on the list of supported algorithms
(NOTE: syntax matters! E.g. sha-1 is OK, but sha1 isn't):
Delete the storage.hashstore.disableConversion: setting, so the hashstore converter will
run, and do a helm upgrade.
When database upgrade and hashstore conversion have both finished, re-enable probes
Re-index all datasets (Did with 25 indexers for test.adc on dev. More on prod?)
6. FINAL SWITCH-OVER FROM LEGACY TO K8S
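BEFORE STARTING: To reduce downtime during switch-over, flag any required values override
updates as @todos. E.g. If you've been using a temporary node name, hostname, and TLS setup,
flag these as TODO, for updates during switchover, with the new values in handy comments:
metacat.server.name
global.metacatExternalBaseUrl
global.d1ClientCnUrl
Any others that will need changing, e.g. dataone.nodeSynchronize, dataone.nodeReplicate
etc.
NOTE: If you need to accommodate hostname aliases, you'll need to update the ingress.tls
section to reflect the new hostname(s) - see Tips,
below.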
= = = = = = = = = = = = = IN K8S CLUSTER: = = = = = = = = = = = = =
Make a backup of the checksums table so hashstore won't try to reconvert completed files:
helm delete the running installation. (keep all secrets, PVCs etc!)
= = = = = = = = = = = = = ON LEGACY HOST = = = = = = = = = = = = =
ENSURE NOBODY IS IN THE MIDDLE OF A BIG UPLOAD! (Can schedule off-hours, but how to monitor?)
Stop postgres and tomcat
# ssh to legacy host, then...
sudo systemctl stop postgresql
sudo systemctl stop tomcat9
"top-up" rsync from legacy to ceph:
While rsync is in progress, edit /var/metacat/config/metacat-site.properties to add:
application.readOnlyMode=true
When rsync done (approx 5 mins?), start postgres, and then start tomcat.
Check it's in RO mode! https://test.arcticdata.io/metacat/d1/mn - look for
<property key="read_only_mode">
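The read-only check above can also be done from the command line. This is a minimal sketch, assuming the node document renders the property on a single line as `<property key="read_only_mode">true</property>`; only the opening tag is shown in this guide, so the value and closing tag are assumptions. The function name is_read_only is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read a node document on stdin and succeed only if it
# contains a read_only_mode property whose value is "true".
# ASSUMPTION: the property appears on one line, exactly as
#   <property key="read_only_mode">true</property>
is_read_only() {
    grep -q '<property key="read_only_mode">true</property>'
}
```

A possible use is `curl -s https://test.arcticdata.io/metacat/d1/mn | is_read_only && echo "read-only mode is on"`. If the property is formatted differently, fall back to inspecting the response by eye.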
= = = = = = = = = = = = = IN K8S CLUSTER: = = = = = = = = = = = = =
fix ownership and permissions of newly-copied files:
If (legacy DB version) < (k8s db version), disable Probes until Database Upgrade is finished
Set storage.hashstore.disableConversion: true, so the hashstore converter won't run.
helm install
Repeat correction of checksum algorithm names for any new records:
Restore the checksums table from the backup, so hashstore won't try to reconvert
completed files:
Delete the storage.hashstore.disableConversion: setting, so the hashstore converter will
run, and do a helm upgrade
When hashstore conversion has finished successfully...
Check values overrides and update any @todos to match live settings (see BEFORE STARTING,
above).
If applicable, re-enable dataone.nodeSynchronize and/or dataone.nodeReplicate
Point the deployment at the PRODUCTION CN (https://cn.dataone.org/cn, which is the
default) by deleting this entry:
In order to push dataone.* member node properties (dataone.nodeId, dataone.subject, dataone.nodeSynchronize, dataone.nodeReplicate) to the CN, set:
Do a final helm upgrade
Make sure metacatui picked up the changes - may need to do some pod-kicking
When everything is up and running...
Switch Route53 DNS to point to k8s ingress instead of legacy:
Take down the legacy instance
Index only the newer datasets:
Tips:
Monitor Indexing Progress via RabbitMQ Dashboard:
Enable port forwarding:
kubectl port-forward service/${RELEASE_NAME}-rabbitmq-headless 15672:15672
then browse http://localhost:15672. Username metacat-rmq-guest and RabbitMQ password from metacat Secrets, or from:
If a PV can't be unmounted
e.g. if the PV name is cephfs-metacatarctic-metacat-varmetacat:
kubectl patch pv cephfs-metacatarctic-metacat-varmetacat -p '{"metadata":{"finalizers":null}}'
If the metacat pod keeps restarting
Ensure the config directory on the PV (for example: /mnt/ceph/repos/$NAME/metacat/config) allows
group write (chmod 660) after the rsync has been completed or repeated.
Where to Find Existing Hostname Aliases
Look at the legacy installation in the /etc/apache2/sites-enabled/ directory; e.g.:
root@arctica:/etc/apache2/sites-enabled# ls
aoncadis.org.conf arcticdata.io.conf beta.arcticdata.io.conf
# ...etc
the ServerName and ServerAlias directives are in these .conf files, e.g.:
NOTE: it may not be necessary to incorporate all these aliases in the k8s environment. For
prod ADC, for example, we left apache running with these aliases in place, and transferred only
the arcticdata.io domain (see Issue #1954).