DNA data protocol

Rukaya edited this page Mar 9, 2023 · 1 revision

NTNU

A record of decisions taken when publishing NTNU DNA datasets on the NTNU IPT:

  • Museum specimens (occurrence core, basisOfRecord=PreservedSpecimen) will be published in the same dataset with material samples (occurrence core, basisOfRecord=MaterialSample)
  • When publishing material samples, materialSampleID will be used as occurrenceID. This should be different from the occurrenceID used for the corresponding specimen
  • A new organismID will be created for each individual organism, and used across both datasets (i.e. a material sample from bird x, which is also a preserved specimen, will have the same organismID)
  • Resource relationship will be used to indicate the relation between material sample and the related voucher specimen. This will create an explicit link between the two, which is missing when using organismID alone. The resourceRelationshipID will be a simple UUID and will not follow the example given in the GGBN guidelines. While resourceID must be present in the core, it seems that relatedResourceID does not need to be, and can be in an external dataset (see also https://github.com/tdwg/dwc/issues/194)
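
As a rough illustration of the decisions above, the sketch below builds one specimen row, one material sample row and the resource relationship row that links them. The Darwin Core terms occurrenceID, materialSampleID, organismID, basisOfRecord, resourceRelationshipID, resourceID and relatedResourceID come from the list above, and relationshipOfResource is another Darwin Core term. The function name, the identifier values and the "sampled from" wording are assumptions made for the example and are not taken from the NTNU IPT.

```python
import uuid


def build_rows(specimen_occurrence_id, material_sample_id):
    """Return (specimen, sample, relationship) rows for one organism.

    Hypothetical helper: the IDs passed in are placeholders.
    """
    # One organismID per individual organism, shared by both rows.
    organism_id = str(uuid.uuid4())

    specimen = {
        "occurrenceID": specimen_occurrence_id,
        "basisOfRecord": "PreservedSpecimen",
        "organismID": organism_id,
    }
    sample = {
        # materialSampleID doubles as the occurrenceID of the sample row.
        "occurrenceID": material_sample_id,
        "materialSampleID": material_sample_id,
        "basisOfRecord": "MaterialSample",
        "organismID": organism_id,
    }
    relationship = {
        # A plain UUID, rather than the pattern shown in the GGBN guidelines.
        "resourceRelationshipID": str(uuid.uuid4()),
        # resourceID must match a row in the core.
        "resourceID": material_sample_id,
        # relatedResourceID may point at a record in an external dataset.
        "relatedResourceID": specimen_occurrence_id,
        # Assumed wording for the relationship.
        "relationshipOfResource": "sampled from",
    }
    return specimen, sample, relationship


specimen, sample, relationship = build_rows("specimen-0001", "sample-0001")
```

The two occurrence rows carry different occurrenceIDs but the same organismID, while the relationship row states explicitly which sample belongs to which voucher.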

Corema

Overview of the datasets

The collections management system Corema (https://www.coremadb.com/) is used to curate the DNA bank datasets at UiO's Natural History Museum.

These are the following datasets:

How the datasets are organised

  • Records in these datasets (apart from the Mammal and Bird datasets) may duplicate records in the non-DNA datasets. This is because the DNA collections partly consist of DNA samples taken from specimens in the main collections, and those specimens may be published separately on GBIF. In addition, some DNA samples are taken from living organisms in the wild, which are then released and never become museum specimens.

  • Multiple tissue samples are often taken from a specimen:

    • A great tit captured in a mist net at Tøyen on 19.06.2020 will be registered as one accession (=record) in our collection database; this record will hold info about species, locality, date, ring number, etc.

    • Let’s say I took a blood sample, a sperm sample and a feather sample from the bird before I released it; each of these samples will then be registered as “sub-records” on the main record created above (what we call “items”)

    • This one bird will then appear as three points on the map in Artskart/GBIF

  • Tissue samples are curated as separate collection objects from the specimen they are taken from (kept in different locations etc), and so are assigned individual occurrenceIDs - i.e. identifiers for each individual item (tissue sample, DNA extract, etc) belonging to a record/accession.

    • This means that each sample has a row in the occurrence table. Think of occurrenceID more as identifying a piece of evidence for a species occurrence than the actual species occurrence itself.
  • Occurrence records from the "same individual" (i.e. the same occurrence of an organism in space and time) have the same organismID in the simpledwc file and are also linked together through the resource relationship file.

  • The resource relationship file may sometimes appear to contain discrepancies. However, these can usually be explained, as in the following example:

    • AccNo NHMO-DFH-782 has a total of 7 items registered in Corema: 6 different tissue samples and 1 DNA extract
    • The DNA extract, NHMO-DFH-782/7-D, has been extracted from NHMO-DFH-782/6-T, and hence this relationship appears in the resourceRelationship; however, as there is no link between any of the other items, apart from them all belonging to the same accession/individual, no further entries appear in the resourceRelationship
    • Further, item 6 (.../6-T) was consumed during the extraction and therefore has status = Empty. As the data exported to GBIF (and others) only contains information on existing items, this item is not included in the dataset, which explains its apparent non-existence
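
The NHMO-DFH-782 case can be sketched in a few lines to show where the apparent discrepancy comes from. This is a minimal model under stated assumptions, not Corema's actual export logic: the `Item` class and `export_accession` function are hypothetical, the "Available" status is an invented placeholder (only "Empty" appears above), the organismID is a freshly generated UUID, and the identifiers for items 1 to 5 are assumed to follow the same pattern as NHMO-DFH-782/6-T.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    item_id: str
    status: str  # "Empty" is from the text; any other value is assumed
    derived_from: Optional[str] = None  # item this one was extracted from


def export_accession(items):
    """Build occurrence and relationship rows for one accession."""
    # All items of one accession belong to the same individual organism.
    organism_id = str(uuid.uuid4())

    # Only existing items are exported; consumed ("Empty") items are dropped.
    existing = [item for item in items if item.status != "Empty"]

    # One occurrence row per exported item.
    occurrences = [
        {"occurrenceID": item.item_id, "organismID": organism_id}
        for item in existing
    ]
    # One relationship row per exported item that was derived from another.
    relationships = [
        {"resourceID": item.item_id, "relatedResourceID": item.derived_from}
        for item in existing
        if item.derived_from is not None
    ]
    # Relationships whose target is missing from the exported rows.
    exported_ids = {row["occurrenceID"] for row in occurrences}
    unresolved = [
        rel for rel in relationships
        if rel["relatedResourceID"] not in exported_ids
    ]
    return occurrences, relationships, unresolved


accession = "NHMO-DFH-782"
items = [Item(f"{accession}/{n}-T", "Available") for n in range(1, 6)]
items.append(Item(f"{accession}/6-T", "Empty"))  # consumed during extraction
items.append(Item(f"{accession}/7-D", "Available",
                  derived_from=f"{accession}/6-T"))

occurrences, relationships, unresolved = export_accession(items)
```

With these inputs the export holds six occurrence rows sharing one organismID and a single relationship row, whose relatedResourceID (NHMO-DFH-782/6-T) has no matching occurrence row because that item was consumed.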