Move arcticdata.io (Production) to Kubernetes #1954

Open · 25 tasks done
artntek opened this issue Aug 14, 2024 · 8 comments
Assignee: artntek
Labels: Epic, k8s (Kubernetes/Helm Related)

artntek commented Aug 14, 2024

Similar to #1932.

Checklist:

  • Work with @nickatnceas to copy production data for testing:
    • Time how long it takes to...
      • Copy the production postgres data (arcticdata.io:/var/lib/postgresql) to the PROD ceph volume at /mnt/ceph/repos/arctic/postgresql (treat it like a hot backup).

        • NOTE: we do not need the /var/lib/postgresql/10 directory
      • copy the following subset of production data from arcticdata.io:/var/metacat to the PROD ceph volume at /mnt/ceph/repos/arctic/metacat:

        # /var/metacat/...
        16K	    ./certs
        63T	    ./data
        8.0K        ./dataone
        3.9G        ./documents
        0           ./inline-data
        500K        ./logs
        • Actual Times taken for /var/metacat/data:
          • initial rsync
            root@arctica:/var/metacat# time rsync -aHAX --delete /var/metacat/data/ /mnt/pdg/repos/arctic/metacat/data/
            
            real    14286m43.628s
            user    1131m15.740s
            sys     3907m38.871s
            ## -> 9.92 days
          • subsequent repeat rsync
            brooke@arctica:~$ time sudo rsync -rltDHX  /var/metacat/data/ /mnt/pdg/repos/arctic/metacat/data/
            [sudo] password for brooke:
            
            real	4m19.047s
            user	0m15.747s
            sys	0m34.912s

Follow the Quick Reference: Metacat K8s Installation Steps. Supplementary TODOs below...

Persistent Volumes

  • Set up a PV to point to PROD cephfs .../repos/arctic/metacat for metacat
  • Set up a PV to point to PROD cephfs .../repos/arctic/postgres for postgres
  • Create a PVC for Postgresql; see prod_cluster/metacatarctic/pvc--metacatarctic-postgres.yaml
  • "csi-cephfs-sc-ephemeral" storageClass missing. Ask @nickatnceas to add, like he did for dev cluster:
  storageclass.storage.k8s.io "csi-cephfs-sc-ephemeral" not found
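
For reference, a minimal sketch of the postgres PVC, bound to a pre-created static PV (the namespace, names, storageClass, and size below are placeholders; the real manifest is prod_cluster/metacatarctic/pvc--metacatarctic-postgres.yaml):

    kubectl apply -n metacatarctic -f - <<'EOF'
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: metacatarctic-postgres            # placeholder name
    spec:
      accessModes:
        - ReadWriteMany
      storageClassName: csi-cephfs-sc         # placeholder; must match the static PV
      volumeName: metacatarctic-postgres-pv   # placeholder; the PV created for .../repos/arctic/postgres
      resources:
        requests:
          storage: 1Ti                        # placeholder size
    EOF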

MetacatUI setup

  • Copy config (tokens) from adc server

Metacat Config

  • Add values.yaml overrides for non-default 2.19 settings (diff arcticdata.io $TOMCAT_HOME/webapps/metacat/WEB-INF/metacat.properties with default metacat.properties from 2.19 release)
  • Add values.yaml overrides for newly-introduced 3.0 settings (diff default metacat.properties from 3.0.0 release with default metacat.properties from 2.19 release)
  • Compare with test.arcticdata.io values overrides as a sanity check
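
A possible way to generate those diffs (file names and version tags below are illustrative, not the exact paths used):

    # non-default 2.19 settings currently in use on the legacy host:
    diff metacat-2.19-default.properties arcticdata.io-metacat.properties

    # settings newly introduced between 2.19 and 3.0.0:
    diff metacat-2.19-default.properties metacat-3.0.0-default.properties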

First Deployment

  • Complete steps in "First Install - IMPORTANT IF MOVING DATA FROM AN EXISTING LEGACY DEPLOYMENT" BEFORE first startup!

  • solr pods not starting. root cause from logs:

    $ kc logs pod/metacatarctic-solr-1
      /scripts/setup.sh: line 8: /opt/bitnami/scripts/solr/entrypoint.sh: Permission denied

    SOLVED - we were overriding the extraVolumes values, and the override didn't include the permissions line

  • https://arctic-prod.test.dataone.org/catalog/ (trailing slash) works, but https://arctic-prod.test.dataone.org/catalog gives a 404 (nginx)

  • ensure all data and documents files are group writeable (otherwise, hashstore upgrader can't create hard links):

    sudo find /mnt/ceph/repos/arctic/metacat/data/ -type f ! -perm -g=w -exec chmod g+w {} +
  • chown -R 59997:59997 the ceph dir corresponding to /var/metacat, and update values.yaml to use this uid:gid

    brooke@datateam:/mnt/ceph/repos/arctic$ time sudo chown -R 59997:59997 metacat
    
    real	4m7.026s
    user	0m0.004s
    sys	0m0.027s
  • Hostname aliases and rewrite rules

    • Figure out how to do these with ingress; see all-sites-enabled.conf. Lots of complexity - e.g. http://aoncadis.org is aliased to arcticdata.io, and the site conf has RewriteMaps, each with >3700 entries.
    • EXPLANATION: aoncadis.org was the predecessor to the ADC site. These rewrite rules map existing old dataset URLs to their new locations on ADC, so the rewrites need to be maintained somewhere.
    • Leave all the redirects/other sites on the current Apache host, and move only arcticdata.io.

ATTENTION: Still To Do Before Final Deployment

  • Time hashstore conversion

  • Time reindex-all

  • MetacatUI + WordPress setup. How do we host it and link to k8s metacat?

    • ACTION: use a wordpress image/bitnami chart, deployed separately from the metacat helm chart
  • ACTION: Ask @nickatnceas for help with letsencrypt certs - do we need to remove arcticdata.io from wildcard cert on arctica? NOTE: we still need subdomain certs there (ie status.adc, beta.adc).

  • Skip 3.0.0 and deploy 3.1.0, but only after it's been running on less-trafficked hosts for a while. See proposed release plan in Issue Metacat 3.1.0 Release Plan #1984.

Testing - see Matt's comment below


mbjones commented Oct 15, 2024

For the Testing section, here's a quick rundown:

Get the R package dataone installed

  • Download and install R (required) and RStudio (nice but optional)
  • Checkout rdataone and (probably) switch to the develop branch, depending on what you need to test
  • open rdataone/dataone.RProj in RStudio
  • run install.packages(c('remotes', 'devtools'))
  • run devtools::load_all() to load the current dataone library code for testing
  • run remotes::install_deps() to install all of the package dependencies

Run the tests against standard nodes

  • Log in to https://dev.nceas.ucsb.edu and copy your token for R from the web UI; paste the token options command into the R console and run it
  • run devtools::test() to run the original tests against standard nodes

To run tests against a different node

artntek commented Nov 12, 2024

hashstore conversion notes

First conversion (with errors) took almost exactly 48 hours
Total 1116383 objects => approx 6.5 objects/second, or approx 0.155 seconds/object

artntek commented Nov 19, 2024

11/19/24: Second conversion (comprising only the failed objects from last time) took 42 minutes
(Douglas Adams would approve)

see #1964 (comment) for error analysis

artntek commented Nov 19, 2024

11/19/24: Did another rsync and clean hashstore conversion

brooke@arctica:~$ time sudo rsync -aHAX --delete /var/lib/postgresql/ /mnt/ceph/repos/$NAME/postgresql/
real	60m57.679s
user	1m5.106s
sys	4m33.743s

time sudo rsync -rltDHX --stats --human-readable /var/metacat/data/ /mnt/ceph/repos/$NAME/metacat/data/
real	29m29.133s
user	1m56.979s
sys	6m45.844s

time sudo rsync -rltDHX --stats --human-readable /var/metacat/dataone/ /mnt/ceph/repos/$NAME/metacat/dataone/
real	0m10.742s
user	0m0.037s
sys	0m0.018s

time sudo rsync -rltDHX --stats --human-readable /var/metacat/documents/ /mnt/ceph/repos/$NAME/metacat/documents/
real	0m16.327s
user	0m0.490s
sys	0m1.014s

time sudo rsync -rltDHX --stats --human-readable /var/metacat/logs/ /mnt/ceph/repos/$NAME/metacat/logs/
real	0m0.101s
user	0m0.025s
sys	0m0.016s

hashstore conversion started: Wed Nov 20 22:55:15 UTC 2024
hashstore conversion finished: Sat Nov 23 00:29:52 UTC 2024
Total time: 49 hours 34 mins

Total 1116383 objects =>

  • approx 6.256 objects/second
  • approx 0.16 seconds/object

artntek commented Nov 25, 2024

Unexplained log entries

[WARN]: XMLService.populateRegisteredSchemaList - Schema file:
/usr/local/tomcat/webapps/metacat/schema/RegistryService/RegistryEntryType.xsd
is registered in the database but does not exist on the file system. So
we don't add it to the registered schema list.
[edu.ucsb.nceas.metacat.service.XMLSchemaService:populateRegisteredSchem

[WARN]: XMLService.populateRegisteredSchemaList - Schema file:
/usr/local/tomcat/webapps/metacat/schema/fgdc-std-001/fgdc-std-001-1998.xsd
is registered  in the database but does not exist on the file system. So
we don't add it to the registered schema list.
[edu.ucsb.nceas.metacat.service.XMLSchemaService:populateRegisteredSchemaList:254]
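
A possible starting point for tracking these down, assuming the registered schema list lives in Metacat's xml_catalog table (a sketch only; verify the table/column names and connection details against the live database):

    psql -U metacat metacat <<'SQL'
    -- which registered entries point at the two missing schema files?
    SELECT public_id, system_id
      FROM xml_catalog
     WHERE system_id LIKE '%RegistryEntryType.xsd%'
        OR system_id LIKE '%fgdc-std-001-1998.xsd%';
    SQL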

artntek commented Nov 25, 2024

Hashstore Conversion Errors to Follow up on:

nonMatchingChecksum_2024-11-20_22-55-18.txt

  • 3330 entries where the checksum has been corrected.
  • 1315 of these also say:
    Note: the original checksum would have been correct if the <algo-name>
    algorithm was used. Was the wrong algorithm recorded?
    
  • TODO: Ask data team to follow up on these?
  • TODO: ask @taojing2002 to update the CN (after k8s goes live)

Can't find the object [..] in the Metacat legacy store.

Previously investigated: arctic-data.7741.1 (intentionally deleted - see this comment), and arctic-data.9767.1 (was fixed on ADC filesystem; should be picked up by next rsync - see this comment)

missing system metadata

autogen.2016032114203022800.1 Pid autogen.2016032114203022800.1 is
missing system metadata. Since the pid starts with autogen and looks
like to be created by DataONE api, it should have the
systemmetadata. Please look at the systemmetadata and identifier
table to figure out the real pid.

see these steps to fix. Applies to these 6 pids:

  • autogen.2016032114203022800.1 - bryce

  • autogen.2016032114243034901.1 - bryce

  • autogen.2023101407535418637.1 - http://orcid.org/0000-0003-1848-0703

  • autogen.2024081613411782371.1 - http://orcid.org/0000-0003-3370-9473

  • autogen.2024092603115057978.1 - http://orcid.org/0000-0002-1165-6852

  • These are unreachable, since there is no mapping between docid and pid

  • looked in access_log - no clues there (no delete entries)

  • ask Matt if we should create a pid and systemmetadata entry for each

  • autogen.2024080509310557266.58 - add to identifier table

    select * from identifier where docid like 'autogen.2024080509310557266%' order by rev;
    -- returns:
                         guid                      |            docid            | rev
    -----------------------------------------------+-----------------------------+-----
    urn:uuid:25d0abfc-7e93-407a-93a8-e752c11d2da9 | autogen.2024080509310557266 |   1
    -- [...]
    urn:uuid:d8fcc771-3b90-4170-b8b0-eefe24d19128 | autogen.2024080509310557266 |  57
    
    select obsoleted_by from systemmetadata where guid='urn:uuid:d8fcc771-3b90-4170-b8b0-eefe24d19128';
                     obsoleted_by
    -----------------------------------------------
     urn:uuid:2de418e3-d9bb-4b7f-82af-ef5885da6b9b
    
    select date_modified,obsoletes,obsoleted_by from systemmetadata where guid='urn:uuid:2de418e3-d9bb-4b7f-82af-ef5885da6b9b';
          date_modified      |                   obsoletes                   |                 obsoleted_by                  
    -------------------------+-----------------------------------------------+-----------------------------------------------
     2024-08-21 13:31:28.639 | urn:uuid:d8fcc771-3b90-4170-b8b0-eefe24d19128 | urn:uuid:5386c6d6-c7f0-4540-806b-ce5d8d3f8fde

Docid not found in the identifier table: urn:uuid:2de418e3-d9bb-4b7f-82af-ef5885da6b9b

  • TODO: needs manual intervention. Related to above
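
Based on the query results above, a sketch of the manual fix (confirm the guid-to-rev mapping before running anything; the psql connection details are placeholders):

    psql -U metacat metacat <<'SQL'
    -- urn:uuid:2de418e3-... obsoletes rev 57 of this docid, so register it as rev 58:
    INSERT INTO identifier (guid, docid, rev)
    VALUES ('urn:uuid:2de418e3-d9bb-4b7f-82af-ef5885da6b9b', 'autogen.2024080509310557266', 58);
    SQL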

artntek commented Nov 25, 2024

Initial index-all

[25/Nov/2024:21:47:20 UTC] PUT /metacat/d1/mn/v2/index?all=true
[25/Nov/2024:13:47:20 PST]

Message rate ~7.0/s until ~[27/Nov/2024:10:01:00 PST], then dropped to zero.

[29/Nov/2024:20:09:58 UTC] -- last-seen log error

=> 94h 22m total?
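
Sanity check on that elapsed time, using the two timestamps above (GNU date):

    $ echo $(( ($(date -ud '2024-11-29 20:09:58' +%s) - $(date -ud '2024-11-25 21:47:20' +%s)) / 60 ))
    5662
    ## -> 5662 minutes = 94h 22m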

artntek commented Dec 2, 2024

indexer log errors to Investigate

$ cat * | grep -c "\[ERROR\]"
15448

15448 total errors across all 50 index workers.

4557 errors contain "Cannot index the task for identifier", the generic top-level indexer error, which can have more than one root cause.

Of these...

  • 3032 include since unable to update solr, non 200 response code. Example:

    dataone-indexer 20241129-21:58:08: [ERROR]: Cannot index the task for identifier
      resource_map_urn:uuid:97e84aaf-f789-41b2-874b-59fc2cfa1445 since unable to update
      solr, non 200 response code.<?xml version="1.0" encoding="UTF-8"?>
    
    • most of these xml errors contain <str name="msg">version conflict for ... e.g.:
        <str name="msg">version conflict for urn:uuid:e83834f1-64d1-4290-b816-2d4b43ed8173 expected=1817081395828228096 actual=1817083016563916800</str>
        <int name="code">409</int>
      • Jing: Caused by concurrent modification by 2 different threads.
      • try again: grep for this error, pull out a list of all the affected pids, then reindex them, maybe with a 1-second pause between each (see the sketch after this list).
        • If any still fail, can try modifying index.solr.versionConflict.waiting.time and index.solr.versionConflict.max.attempts
  • 1496 include since Solr index doesn't have the information about the id. Example:

    dataone-indexer 20241129-20:40:02: [ERROR]: Cannot index the task for identifier
    resource_map_urn:uuid:3c2a0e81-0c07-4312-bfff-4849a8342019 since Solr index doesn't have
    the information about the id urn:uuid:a22cab39-2cdb-400c-bf37-063e80c8da90 which is a
    component in the resource map resource_map_urn:uuid:3c2a0e81-0c07-4312-bfff-4849a8342019.
    Metacat-Index can't process the resource map prior to its components.
    [org.dataone.cn.indexer.IndexWorker:indexObject:464]
    
    • First, make a list and reindex the pids denoted by Solr index doesn't have the information about the id (eg urn:uuid:a22cab39-2cdb-400c-bf37-063e80c8da90 above); then
    • Make a list and try reindexing the resourcemaps again (eg resource_map_urn:uuid:3c2a0e81-0c07-4312-bfff-4849a8342019 above)
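
A sketch covering both retry steps above, assuming the indexer logs are in the current directory with one error message per line, that the index endpoint shown in the reindex-all comment also accepts a single-pid form (index?pid=...), and that $TOKEN holds a suitable admin/indexer token:

    # 1. pids from the "version conflict" errors:
    grep -h "version conflict for" * \
      | sed -n 's/.*version conflict for \([^ ]*\) expected.*/\1/p' \
      | sort -u > retry-pids.txt

    # 2. component pids from the "Solr index doesn't have the information about the id" errors
    #    (reindex these before retrying the corresponding resource maps):
    grep -h "information about the id" * \
      | sed -n 's/.*information about the id \([^ ]*\) which.*/\1/p' \
      | sort -u >> retry-pids.txt

    # resubmit each pid, pausing 1 second between requests:
    while read -r pid; do
      curl -s -X PUT -G --data-urlencode "pid=$pid" \
        -H "Authorization: Bearer $TOKEN" \
        "https://arctic-prod.test.dataone.org/metacat/d1/mn/v2/index"
      sleep 1
    done < retry-pids.txt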

The following are all due to existing bad data/metadata, and are currently not indexed on adc, so we're no worse off. They can be fixed in future if/when there's time.

  • 12 include pid contains empty white spaces, tabs or newlines, e.g.
    dataone-indexer 20241128-06:34:03: [ERROR]: Cannot index the task for identifier
    resource_map_urn:uuid:57038038-d6a5-4f98-9aab-4d0e8d52e1a0 since
    java.lang.IllegalArgumentException: Calling Method: retrieveMetadata()'s argument: pid
    contains empty white spaces, tabs or newlines [org.dataone.cn.indexer.IndexWorker:indexObject:464]
    
  • 7 include No Identifer statement was found for the resourceMap resource, eg:
    dataone-indexer 20241129-04:58:40: [ERROR]: Cannot index the task for identifier
    resource_map_urn:uuid:610441b5-9de8-49b5-99c7-77bab230355c since
    org.dspace.foresite.OREParserException: org.dspace.foresite.OREException: No Identifer
    statement was found for the resourceMap resource
    ('https://cn.dataone.org/cn/v2/resolve/resource_map_urn%3Auuid%3A610441b5-9de8-49b5-99c7-
    77bab230355c') [org.dataone.cn.indexer.IndexWorker:indexObject:464]
    
  • 5 include java.lang.NullPointerException [...] because "documentedByIdentifier" is null, e.g.
    dataone-indexer 20241129-09:07:44: [ERROR]: Cannot index the task for identifier
    urn:uuid:9a3f029a-364e-4754-a294-ec731af684a3 since java.lang.NullPointerException: Cannot
    invoke "org.dataone.service.types.v1.Identifier.getValue()" because
    "documentedByIdentifier" is null [org.dataone.cn.indexer.IndexWorker:indexObject:464]
    
  • 2 NullPointerException [...] because the return value of "java.util.Map$Entry.getKey()" is null, eg:
    dataone-indexer 20241129-07:16:15: [ERROR]: Cannot index the task for identifier
    resource_map_urn:uuid:531f8671-332c-4cc3-a359-f31b380a656c since
    java.lang.NullPointerException: Cannot invoke
    "org.dataone.service.types.v1.Identifier.getValue()" because the return value of
    "java.util.Map$Entry.getKey()" is null [org.dataone.cn.indexer.IndexWorker:indexObject:464]
    
  • 1 Base URI is null, but there are relative URIs to resolve:
    dataone-indexer 20241128-18:27:28: [ERROR]: Cannot index the task for identifier
    resource_map_urn:uuid:7423a962-6ccd-4501-a7ac-5cdb772147f2 since
    org.dspace.foresite.OREParserException: org.apache.jena.riot.RiotException: [line: 34, col: 72]
    {E211} Base URI is null, but there are relative URIs to resolve.: <"DataONE Java Client
    Library"> [org.dataone.cn.indexer.IndexWorker:indexObject:464]
    
  • 1 READ not allowed [...] for subject[s]: CN=urn:node:ARCTICTEMP,DC=dataone,DC=org;public;
    • this shows a read failed to find an object in hashstore and so fell back to using the indexer token (which had the wrong subject anyway). Have now deleted the old token.
    dataone-indexer 20241127-15:58:59: [ERROR]: Cannot index the task for identifier
    urn:uuid:2de418e3-d9bb-4b7f-82af-ef5885da6b9b since READ not allowed on urn:uuid:2de418e3-
    d9bb-4b7f-82af-ef5885da6b9b for subject[s]: CN=urn:node:ARCTICTEMP,DC=dataone,DC=org;
    public; authenticatedUser;  [org.dataone.cn.indexer.IndexWorker:indexObject:464]
    

10,891 start with [ERROR] but do NOT contain "Cannot index the task for identifier"

Of these...

  • 7,349 total are multi-line, xml-formatted errors beginning: [ERROR]: <?xml version="1.0" encoding="UTF-8"?>
    • 9 of the xml errors contain: <str name="msg">For input string:, followed by a java.lang.NumberFormatException. e.g:
      <lst name="error">
        <str name="msg">For input string: "[1817018958676492288, 1817017741626834944]"</str>
        <str name="trace">java.lang.NumberFormatException: For input string: "[1817018958676492288, 1817017741626834944]"
      	at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
      	at java.base/java.lang.Long.parseLong(Long.java:692)
      	at java.base/java.lang.Long.parseLong(Long.java:817)
      	at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:321)
      Bad data/metadata - ignore for now
    • The rest contain <str name="msg">version conflict for ... e.g.:
        <str name="msg">version conflict for urn:uuid:fbc0d429-3765-4870-a75a-99fb709398b0 expected=1817087001693782016 actual=1817091980995330048</str>
        <int name="code">409</int>
      Reindex - see similar errors above

The following are all due to existing bad data/metadata, and are currently not indexed on adc, so we're no worse off. They can be fixed in future if/when there's time.

  • 3,200 total: "Couldn't extract 'start' or 'end' date for pid" (actually 640 distinct occurrences, but each is always grouped with 4 other errors, so they contribute 640 * 5 = 3,200 total [ERROR] messages). Example:

    dataone-indexer 20241127-03:55:10: [ERROR]: [org.dataone.cn.indexer.parser.utility.TemporalPeriodParsingUtility:parseDateTime:200]
    dataone-indexer 20241127-03:55:10: [ERROR]: Date string could not be parsed: null [org.dataone.cn.indexer.parser.utility.TemporalPeriodParsingUtility:formatDate:178]
    dataone-indexer 20241127-03:55:10: [ERROR]:  [org.dataone.cn.indexer.parser.utility.TemporalPeriodParsingUtility:parseDateTime:200]
    dataone-indexer 20241127-03:55:10: [ERROR]: Date string could not be parsed: null [org.dataone.cn.indexer.parser.utility.TemporalPeriodParsingUtility:formatDate:178]
    dataone-indexer 20241127-03:55:10: [ERROR]: Couldn't extract 'start' or 'end' date for pid dcx_df52098f-4ed9-46ad-9fd4-2967c61747a0_0.Temporal pattern of type period needs to contain at least one of these. Value was:  [org.dataone.cn.indexer.parser.TemporalPeriodSolrField:getFields:79]
    
  • 333 total: "OntologyModelService.expandConcepts(.*) encountered an exception while querying." BUT: there appear to be many of these per dataset, so it's hard to tell how many datasets are actually impacted. Could group by timestamp? (See the sketch at the end of this list.)

    dataone-indexer 20241127-04:56:57: [ERROR]: OntologyModelService.expandConcepts(longitude coordinate) encountered an exception while querying. [org.dataone.cn.indexer.annotation.OntologyModelService:expandConcepts:141]
    dataone-indexer 20241127-04:56:57: [ERROR]: OntologyModelService.expandConcepts(contains measurements of type) encountered an exception while querying. [org.dataone.cn.indexer.annotation.OntologyModelService:expandConcepts:141]
    dataone-indexer 20241127-04:56:57: [ERROR]: OntologyModelService.expandConcepts(month of year) encountered an exception while querying. [org.dataone.cn.indexer.annotation.OntologyModelService:expandConcepts:141]
    dataone-indexer 20241127-04:56:57: [ERROR]: OntologyModelService.expandConcepts(contains measurements of type) encountered an exception while querying. [org.dataone.cn.indexer.annotation.OntologyModelService:expandConcepts:141]
    dataone-indexer 20241127-04:56:57: [ERROR]: OntologyModelService.expandConcepts(year of measurement) encountered an exception while querying. [org.dataone.cn.indexer.annotation.OntologyModelService:expandConcepts:141]
    dataone-indexer 20241127-04:56:57: [ERROR]: OntologyModelService.expandConcepts(contains measurements of type) encountered an exception while querying. [org.dataone.cn.indexer.annotation.OntologyModelService:expandConcepts:141]
    dataone-indexer 20241127-04:56:57: [ERROR]: OntologyModelService.expandConcepts(dominant vegetation) encountered an exception while querying. [org.dataone.cn.indexer.annotation.OntologyModelService:expandConcepts:141]
    dataone-indexer 20241127-04:56:57: [ERROR]: OntologyModelService.expandConcepts(contains measurements of type) encountered an exception while querying. [org.dataone.cn.indexer.annotation.OntologyModelService:expandConcepts:141]
    dataone-indexer 20241127-04:56:57: [ERROR]: OntologyModelService.expandConcepts(Aboveground Biomas) encountered an exception while querying. [org.dataone.cn.indexer.annotation.OntologyModelService:expandConcepts:141]
    dataone-indexer 20241127-04:56:57: [ERROR]: OntologyModelService.expandConcepts(contains measurements of type) encountered an exception while querying. [org.dataone.cn.indexer.annotation.O
    
  • 8 total:

    dataone-indexer 20241128-18:27:28: [ERROR]: Unable to parse ORE document: [org.dataone.cn.indexer.resourcemap.ForesiteResourceMap:_init:147]
    
  • 1 total:

    dataone-indexer 20241127-01:13:14: [ERROR]: Problem parsing annotation: Cannot invoke 
    "Object.toString()" because the return value of "net.minidev.json.JSONObject.get(Object)" 
    is null [org.dataone.cn.indexer.annotation.AnnotatorSubprocessor:parseAnnotation:295]
    
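As mentioned above, a rough way to estimate how many datasets the OntologyModelService errors actually affect is to group them by log timestamp (an approximation: it assumes each dataset's annotations are expanded within the same second):

    grep -h "OntologyModelService.expandConcepts" * \
      | awk '{print $2}' \
      | sort -u | wc -l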
