Table of Contents
- Overview
- Export Discovery Metadata into File
- Publish Discovery Metadata from File
- DOIs in Gen3
- dbGaP FHIR Metadata in Gen3 Discovery
- Publish Discovery Metadata Objects from File
The Gen3 Discovery Page allows the visualization of metadata. A collection of SDK and CLI functionality assists with managing that metadata in Gen3.
gen3 discovery --help will provide the most up-to-date information about CLI functionality.
Like other CLI functions, the CLI code mostly just wraps an SDK function call.
You can therefore use the CLI or write your own Python script that calls the SDK functions directly. Writing your own script generally provides the most flexibility, at the cost of some convenience.
Gen3's SDK can be used to export discovery metadata from a given Gen3 environment into a file by using the output_expanded_discovery_metadata() function. By default, this function queries for metadata with guid_type=discovery_metadata and exports the metadata into a TSV file. Users can also specify a different guid_type value for this operation, and/or choose to export the metadata into a JSON file. When using the TSV format, certain metadata fields are flattened or "jsonified" so that each metadata record fits into one row.
Example of usage:
from gen3.tools.metadata.discovery import (
    output_expanded_discovery_metadata,
)
from gen3.utils import get_or_create_event_loop_for_thread
from gen3.auth import Gen3Auth

if __name__ == "__main__":
    auth = Gen3Auth()
    loop = get_or_create_event_loop_for_thread()
    loop.run_until_complete(
        output_expanded_discovery_metadata(
            auth, endpoint="GEN3_ENV_HOSTNAME", output_format="json"
        )
    )
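To illustrate the TSV flattening described above, here is a minimal standard-library sketch in which nested values are JSON-encoded so that each record fits on one row. It is not the SDK's implementation: flatten_record, records_to_tsv, and the sample record are hypothetical, and the exact set of fields the SDK flattens may differ.

```python
import csv
import io
import json


def flatten_record(record):
    """JSON-encode nested values (dicts/lists) so the record fits in one TSV row."""
    return {
        key: json.dumps(value) if isinstance(value, (dict, list)) else value
        for key, value in record.items()
    }


def records_to_tsv(records):
    """Write flattened records to a TSV string with a header row."""
    flattened = [flatten_record(r) for r in records]
    fieldnames = sorted({key for row in flattened for key in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(flattened)
    return buffer.getvalue()


# Hypothetical record, for illustration only
example = {
    "guid": "Example-Study-01",
    "tags": [{"name": "demo", "category": "example"}],
}
tsv_text = records_to_tsv([example])
```

Reading such a file back means applying json.loads to the columns that were encoded, which is why a dump produced this way round-trips more reliably than a hand-edited one.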
Gen3's SDK can also be used to publish discovery metadata to a target Gen3 environment from a file by using the publish_discovery_metadata() function. Ideally, the metadata file should originate from a metadata dump obtained with the output_expanded_discovery_metadata() function.
Example of usage:
from gen3.tools.metadata.discovery import (
    publish_discovery_metadata,
)
from gen3.utils import get_or_create_event_loop_for_thread
from gen3.auth import Gen3Auth

# Placeholder: replace with the hostname of the target Gen3 environment
HOSTNAME = "GEN3_ENV_HOSTNAME"

if __name__ == "__main__":
    auth = Gen3Auth()
    loop = get_or_create_event_loop_for_thread()
    loop.run_until_complete(
        publish_discovery_metadata(
            auth, "./metadata.tsv", endpoint=HOSTNAME, guid_field="_hdp_uid"
        )
    )
Gen3's SDK supports minting DOIs from DataCite, storing DOI metadata in a Gen3 instance, and visualizing the DOI metadata in our Discovery Page to serve as a DOI "Landing Page".
What is a DOI? A digital object identifier (DOI) is a persistent identifier or handle used to identify objects uniquely, standardized by the International Organization for Standardization (ISO). DOIs are in wide use mainly to identify academic, professional, and government information, such as journal articles, research reports, data sets, and official publications. However, they have also been used to identify other types of information resources, such as commercial videos.
The general overview for how Gen3 supports DOIs is as follows:
- Gen3 SDK/CLI used to gather Metadata from External Public Metadata Sources
- Gen3 SDK/CLI used to do any conversions to DOI Metadata
- Gen3 SDK/CLI communicates with DataCite API to mint DOI
- NOTE: the gathering of metadata, conversion to DOI fields, and final minting may or may not be part of regular data ingestion. It’s possible that this is done ad hoc, as needed
- Gen3 SDK/CLI persists metadata in Gen3
- Persisted metadata in Gen3 exposed via Discovery Page
- Discovery Page is used as the required DOI Landing Page
What is DataCite? In order to create a DOI, one must use a DOI registration service. In the US there are two: CrossRef and DataCite. We are focusing on DataCite, because that is what we were provided access to.
Prerequisites:
- Environment variable DATACITE_USERNAME set as a valid DataCite username for interacting with their API
- Environment variable DATACITE_PASSWORD set as a valid DataCite password for interacting with their API
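Because the examples in this section read these variables with os.environ.get(), an unset variable silently becomes None rather than raising an error. A small pre-flight check like the sketch below can catch that early. The helper name missing_datacite_credentials is hypothetical and not part of the SDK.

```python
import os


def missing_datacite_credentials(environ=None):
    """Return the names of required DataCite variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    required = ("DATACITE_USERNAME", "DATACITE_PASSWORD")
    return [name for name in required if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_datacite_credentials()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
```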
This shows a full example of:
- Setting up the necessary classes for interacting with Gen3 & DataCite
- Getting the DOI metadata (ideally from some external source like a file or another API, but here we've hard-coded it)
- Creating/Minting the DOI in DataCite
- Persisting the DOI metadata into a Gen3 Discovery record in the metadata service
import os
from requests.auth import HTTPBasicAuth

from cdislogging import get_logger

from gen3.doi import (
    DataCite,
    DigitalObjectIdentifier,
    DigitalObjectIdentifierCreator,
    DigitalObjectIdentifierTitle,
)
from gen3.auth import Gen3Auth

logging = get_logger("__name__", log_level="info")

# This prefix should be provided by DataCite
PREFIX = "10.12345"
PUBLISHER = "Example"
COMMONS_DISCOVERY_PAGE = "https://example.com/discovery"

DOI_DISCLAIMER = ""
DOI_ACCESS_INFORMATION = "You can find information about how to access this resource in the link below."
DOI_ACCESS_INFORMATION_LINK = "https://example.com/more/info"
DOI_CONTACT = "https://example.com/contact/"


def test_manual_single_doi(publish_dois=False):
    # Setup
    gen3_auth = Gen3Auth()
    datacite = DataCite(
        use_prod=False,
        auth_provider=HTTPBasicAuth(
            os.environ.get("DATACITE_USERNAME"),
            os.environ.get("DATACITE_PASSWORD"),
        ),
    )

    gen3_metadata_guid = "Example-Study-01"

    # Get DOI metadata (ideally from some external source)
    identifier = "10.82483/BDC-268Z-O151"
    creators = [
        DigitalObjectIdentifierCreator(
            name="Bar, Foo",
            name_type=DigitalObjectIdentifierCreator.NAME_TYPE_PERSON,
        ).as_dict()
    ]
    titles = [DigitalObjectIdentifierTitle("Some Example Study in Gen3").as_dict()]
    publisher = "Example Gen3 Sponsor"
    publication_year = 2023
    doi_type_general = "Dataset"
    version = 1

    doi_metadata = {
        "identifier": identifier,
        "creators": creators,
        "titles": titles,
        "publisher": publisher,
        "publication_year": publication_year,
        "doi_type_general": doi_type_general,
        "version": version,
    }

    # Create/Mint the DOI in DataCite
    # The default url generated is "root_url" + identifier
    # If your Discovery metadata records don't use the DOI as the GUID,
    # you may need to supply the URL yourself like below
    url = COMMONS_DISCOVERY_PAGE.rstrip("/") + f"/{gen3_metadata_guid}"

    doi = DigitalObjectIdentifier(url=url, use_prod=False, **doi_metadata)

    if publish_dois:
        logging.info(f"Publishing DOI `{identifier}`...")
        doi.event = "publish"

    # works for only new DOIs
    # You can use this for updates: `datacite.update_doi(doi)`
    response = datacite.create_doi(doi)
    doi = DigitalObjectIdentifier.from_datacite_create_doi_response(response, use_prod=False)

    # Persist necessary DOI Metadata in Gen3 Discovery to support the landing page
    metadata = datacite.persist_doi_metadata_in_gen3(
        guid=gen3_metadata_guid,
        doi=doi,
        auth=gen3_auth,
        additional_metadata={
            "disclaimer": DOI_DISCLAIMER,
            "access_information": DOI_ACCESS_INFORMATION,
            "access_information_link": DOI_ACCESS_INFORMATION_LINK,
            "contact": DOI_CONTACT,
        },
        prefix="doi_",
    )
    logging.debug(f"Gen3 Metadata for GUID `{gen3_metadata_guid}`: {metadata}")


def main():
    test_manual_single_doi()


if __name__ == "__main__":
    main()
This is the portion of the Gen3 Data Portal configuration that pertains to the Discovery Page. The code provided shows an example of how to configure the visualization of the DOI metadata.
In order to be compliant with the requirements for Landing Pages, the URL you provide during minting needs to automatically display all this information. So if you have other tabs of non-DOI information, they cannot be the first focused tab upon resolving the DOI URL.
"discoveryConfig": {
// ...
"features": {
// ...
"search": {
"searchBar": {
"enabled": true,
"searchableTextFields": [
"doi_titles",
"doi_version_information",
"doi_citation",
"doi_creators",
"doi_publisher",
"doi_identifier",
"doi_alternateIdentifiers",
"doi_contributors",
"doi_descriptions",
"doi_publication_year",
"doi_resolvable_link",
"doi_fundingReferences",
"doi_relatedIdentifiers"
]
},
// ...
"detailView": {
// ...
"tabs": [
{
"tabName": "DOI",
"groups": [
{
"header": "Dataset Information",
"fields": [
{
"type": "block",
"label": "",
"sourceField": "doi_disclaimer",
"default": ""
},
{
"type": "text",
"label": "Title:",
"sourceField": "doi_titles",
"default": "Not specified"
},
{
"type": "link",
"label": "DOI:",
"sourceField": "doi_resolvable_link",
"default": "None"
},
{
"type": "text",
"label": "Data available:",
"sourceField": "doi_is_available",
"default": "None"
},
{
"type": "text",
"label": "Creators:",
"sourceField": "doi_creators",
"default": "Not specified"
},
{
"type": "text",
"label": "Citation:",
"sourceField": "doi_citation",
"default": "Not specified"
},
{
"type": "link",
"label": "Contact:",
"sourceField": "doi_contact",
"default": "Not specified"
}
]
},
{
"header": "How to Access the Data",
"fields": [
{
"type": "block",
"label": "How to access the data:",
"sourceField": "doi_access_information",
"default": "Not specified"
},
{
"type": "link",
"label": "Data and access information:",
"sourceField": "doi_access_information_link",
"default": "Not specified"
}
]
},
{
"header": "Additional Information",
"fields": [
{
"type": "text",
"label": "Publisher:",
"sourceField": "doi_publisher",
"default": "Not specified"
},
{
"type": "text",
"label": "Funded by:",
"sourceField": "doi_fundingReferences",
"default": "Not specified"
},
{
"type": "text",
"label": "Publication Year:",
"sourceField": "doi_publication_year",
"default": "Not specified"
},
{
"type": "text",
"label": "Resource Type:",
"sourceField": "doi_resource_type",
"default": "Not specified"
},
{
"type": "text",
"label": "Version:",
"sourceField": "doi_version_information",
"default": "Not specified"
},
{
"type": "text",
"label": "Contributors:",
"sourceField": "doi_contributors",
"default": "Not specified"
},
{
"type": "text",
"label": "Related Identifiers:",
"sourceField": "doi_relatedIdentifiers",
"default": "Not specified"
}
]
},
{
"header": "Description",
"fields": [
{
"type": "block",
"label": "Description:",
"sourceField": "doi_descriptions",
"default": "Not specified"
}
]
}
]
},
// ...
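One way to sanity-check a configuration like the one above is to compare the sourceField values it references with the keys actually present in a Discovery record. The sketch below uses only the Python standard library. The helpers collect_source_fields and missing_fields, the abbreviated tabs list, and the sample record are all hypothetical, and the sketch assumes that the doi_ prefix is simply prepended to the additional metadata keys.

```python
def collect_source_fields(tabs):
    """Return every sourceField referenced by the detail view tabs, in order."""
    return [
        field["sourceField"]
        for tab in tabs
        for group in tab.get("groups", [])
        for field in group.get("fields", [])
    ]


def missing_fields(tabs, record):
    """List referenced sourceFields that the record does not contain."""
    return [name for name in collect_source_fields(tabs) if name not in record]


# Abbreviated, hypothetical version of the "tabs" configuration
tabs = [
    {
        "tabName": "DOI",
        "groups": [
            {
                "header": "Dataset Information",
                "fields": [
                    {"type": "text", "label": "Title:", "sourceField": "doi_titles"},
                    {"type": "link", "label": "Contact:", "sourceField": "doi_contact"},
                ],
            }
        ],
    }
]

# Hypothetical record: additional metadata keys with the "doi_" prefix applied
additional_metadata = {"contact": "https://example.com/contact/"}
record = {"doi_" + key: value for key, value in additional_metadata.items()}

# Fields referenced by the config but absent from the record
print(missing_fields(tabs, record))
```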
The SDK can also automate this process: pulling the current datasets from Discovery, getting their identifiers, scraping various APIs for DOI-related metadata, and then going through the DOI creation loop to mint each DOI in DataCite and persist the metadata back in Gen3.
See below for a full example using the dbGaP DbgapMetadataInterface.
More interfaces may exist in the future for doing this by querying non-dbGaP sources.
import os
from requests.auth import HTTPBasicAuth

from cdislogging import get_logger

from gen3.auth import Gen3Auth
from gen3.discovery_dois import mint_dois_for_discovery_datasets, DbgapMetadataInterface
from gen3.utils import get_random_alphanumeric

logging = get_logger("__name__", log_level="info")

PREFIX = "10.12345"
PUBLISHER = "Example"
COMMONS_DISCOVERY_PAGE = "https://example.com/discovery"

DOI_DISCLAIMER = ""
DOI_ACCESS_INFORMATION = "You can find information about how to access this resource in the link below."
DOI_ACCESS_INFORMATION_LINK = "https://example.com/more/info"
DOI_CONTACT = "https://example.com/contact/"


def mint_discovery_dois():
    auth = Gen3Auth()

    # this alternate ID is some globally unique ID other than the GUID that
    # will be needed to get DOI metadata (like the phsid for dbGaP)
    metadata_field_for_alternate_id = "dbgap_accession"

    # you can choose to exclude certain Discovery Metadata datasets based on
    # their GUID or alternate ID (this means they won't get additional DOI metadata
    # or have DOIs minted, they'll be skipped)
    exclude_datasets = ["MetadataGUID_to_exclude", "AlternateID_to_exclude", "..."]

    # When this is True, you CANNOT REVERT THIS ACTION. A published DOI
    # cannot be deleted. It is recommended to test with "Draft" state DOIs first
    # (which is the default when publish_dois is not True).
    publish_dois = False

    mint_dois_for_discovery_datasets(
        gen3_auth=auth,
        datacite_auth=HTTPBasicAuth(
            os.environ.get("DATACITE_USERNAME"),
            os.environ.get("DATACITE_PASSWORD"),
        ),
        metadata_field_for_alternate_id=metadata_field_for_alternate_id,
        get_doi_identifier_function=get_doi_identifier,
        metadata_interface=DbgapMetadataInterface,
        doi_publisher=PUBLISHER,
        commons_discovery_page=COMMONS_DISCOVERY_PAGE,
        doi_disclaimer=DOI_DISCLAIMER,
        doi_access_information=DOI_ACCESS_INFORMATION,
        doi_access_information_link=DOI_ACCESS_INFORMATION_LINK,
        doi_contact=DOI_CONTACT,
        publish_dois=publish_dois,
        datacite_use_prod=False,
        # reuse the list defined above instead of repeating the literal
        exclude_datasets=exclude_datasets,
    )


def get_doi_identifier():
    return (
        PREFIX + "/EXAMPLE-" + get_random_alphanumeric(4) + "-" + get_random_alphanumeric(4)
    )


def main():
    mint_discovery_dois()


if __name__ == "__main__":
    main()
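The get_doi_identifier function above produces identifiers of the form PREFIX/EXAMPLE-XXXX-XXXX. The following standard-library sketch shows the same shape without depending on the SDK. The helper random_alphanumeric is a hypothetical stand-in for get_random_alphanumeric, and the uppercase-plus-digits alphabet is an assumption about what that function returns.

```python
import secrets
import string

# Assumed alphabet: uppercase letters and digits
ALPHABET = string.ascii_uppercase + string.digits


def random_alphanumeric(length):
    """Return a random string of the given length drawn from ALPHABET."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def build_doi_identifier(prefix, label="EXAMPLE"):
    """Build an identifier shaped like '<prefix>/<label>-XXXX-XXXX'."""
    return f"{prefix}/{label}-{random_alphanumeric(4)}-{random_alphanumeric(4)}"


if __name__ == "__main__":
    print(build_doi_identifier("10.12345"))
```

With only eight random characters, collisions are unlikely but possible, so checking a candidate identifier against already-minted DOIs before publishing is a reasonable precaution.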
For the CLI, see gen3 discovery combine --help.
This section describes how to use the SDK functions directly. If you use the CLI, it will automatically read the current Discovery metadata and then combine it with the file you provide (after applying a prefix to all the columns, if you specify one).
Note: This supports CSV and TSV formats for the metadata file.
Let's assume:
- You don't have the current Discovery metadata in a file locally
- You want to merge new metadata (parsed from dbGaP's FHIR server) with the existing Discovery metadata
- You want to prefix all the new columns with DBGAP_FHIR_
Here's how you would do that without using the CLI:
from gen3.auth import Gen3Auth
from gen3.tools.metadata.discovery import (
    output_expanded_discovery_metadata,
    combine_discovery_metadata,
)
from gen3.external.nih.dbgap_fhir import dbgapFHIR
from gen3.utils import get_or_create_event_loop_for_thread


def main():
    """
    Read current Discovery metadata, then combine with dbgapFHIR metadata.
    """
    # Get current Discovery metadata
    loop = get_or_create_event_loop_for_thread()
    auth = Gen3Auth(refresh_file="credentials.json")
    current_discovery_metadata_file = loop.run_until_complete(
        output_expanded_discovery_metadata(auth, endpoint=auth.endpoint)
    )

    # Get dbGaP FHIR Metadata
    studies = [
        "phs000007.v31",
        "phs000166.v2",
        "phs000179.v6",
    ]
    dbgapfhir = dbgapFHIR()
    simplified_data = dbgapfhir.get_metadata_for_ids(ids=studies)
    dbgapFHIR.write_data_to_file(simplified_data, "fhir_metadata_file.tsv")

    # Combine new FHIR Metadata with existing Discovery Metadata
    metadata_filename = "fhir_metadata_file.tsv"
    discovery_column_to_map_on = "guid"
    metadata_column_to_map = "Id"
    output_filename = "combined_discovery_metadata.tsv"
    metadata_prefix = "DBGAP_FHIR_"

    output_file = combine_discovery_metadata(
        current_discovery_metadata_file,
        metadata_filename,
        discovery_column_to_map_on,
        metadata_column_to_map,
        output_filename,
        metadata_prefix=metadata_prefix,
    )

    # You now have a file with the combined information that you can publish
    # NOTE: Combining does NOT publish automatically into Gen3. You should
    # QA the output (make sure the result is correct), and then publish.


if __name__ == "__main__":
    main()
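Conceptually, the combine step joins rows from the new metadata file onto the existing Discovery metadata and prefixes the new columns. The following standard-library sketch illustrates that idea on in-memory rows. It is not the SDK's implementation: combine_rows is a hypothetical helper, and details such as how unmatched rows or the mapped column itself are handled are assumptions.

```python
def combine_rows(discovery_rows, new_rows, discovery_key, new_key, prefix=""):
    """Left-join new_rows onto discovery_rows, prefixing every new column."""
    new_by_key = {row[new_key]: row for row in new_rows}
    combined = []
    for row in discovery_rows:
        merged = dict(row)
        # Rows without a match are kept unchanged (an assumption of this sketch)
        match = new_by_key.get(row[discovery_key], {})
        for column, value in match.items():
            merged[prefix + column] = value
        combined.append(merged)
    return combined


# Hypothetical rows, for illustration only
discovery_rows = [
    {"guid": "phs000007.v31", "name": "Study A"},
    {"guid": "phs999999.v1", "name": "Study B"},
]
fhir_rows = [{"Id": "phs000007.v31", "Title": "Example Title"}]

combined = combine_rows(
    discovery_rows, fhir_rows, "guid", "Id", prefix="DBGAP_FHIR_"
)
```

The prefix keeps the new columns from colliding with existing Discovery fields of the same name, which also makes it easy to see where each column came from when reviewing the combined file.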
Gen3's SDK can be used to ingest data objects related to datasets in a Gen3 environment from a file by using the publish_discovery_object_metadata() function. To obtain a file of existing metadata objects, use the output_discovery_objects() function. By default, new objects published from a file are appended to a dataset in the Gen3 environment. If object GUIDs from the file already exist for a dataset in the Gen3 environment, those objects are updated. If the overwrite option is True, all current metadata objects related to a dataset are replaced instead. You can also use this functionality from the CLI. See gen3 discovery objects --help.
Example of usage:
"""
Example script showing reading Discovery Objects Metadata and then
publishing it back, just to demonstrate the functions.

Before running this, ensure your ~/.gen3/credentials.json contains
an API key for a Gen3 instance to interact with and/or adjust the
Gen3Auth logic to provide auth in another way
"""
from cdislogging import get_logger

from gen3.tools.metadata.discovery_objects import (
    publish_discovery_object_metadata,
    output_discovery_objects,
)
from gen3.utils import get_or_create_event_loop_for_thread
from gen3.auth import Gen3Auth

logging = get_logger("__name__")

if __name__ == "__main__":
    auth = Gen3Auth()
    loop = get_or_create_event_loop_for_thread()

    logging.info(f"Reading discovery objects metadata from: {auth.endpoint}...")
    output_filename = loop.run_until_complete(
        output_discovery_objects(
            auth,
            output_format="tsv",
        )
    )
    logging.info(f"Output discovery objects metadata: {output_filename}")

    # Here you can modify the file by hand or in code and then publish to update
    # Alternatively, you can skip the read above and just provide a file with
    # the object metadata you want to publish
    logging.info(
        f"publishing discovery object metadata to: {auth.endpoint} from file: {output_filename}"
    )
    loop.run_until_complete(
        publish_discovery_object_metadata(
            auth,
            output_filename,
            overwrite=False,
        )
    )
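The append, update, and overwrite behavior described in this section can be pictured with a small standard-library sketch. It is not the SDK's implementation: publish_objects is a hypothetical helper operating on in-memory dicts, and it assumes that an update replaces the whole object with the same GUID rather than merging individual fields.

```python
def publish_objects(existing, incoming, overwrite=False):
    """Return a dataset's objects after publishing `incoming`.

    Objects are dicts with a "guid" key. Without overwrite, incoming objects
    replace existing ones with the same guid and are appended otherwise.
    With overwrite, the incoming objects replace the existing list entirely.
    """
    if overwrite:
        return [dict(obj) for obj in incoming]
    by_guid = {obj["guid"]: dict(obj) for obj in existing}
    for obj in incoming:
        by_guid[obj["guid"]] = dict(obj)
    return list(by_guid.values())


# Hypothetical objects, for illustration only
existing = [
    {"guid": "obj-1", "display_name": "old"},
    {"guid": "obj-2", "display_name": "keep"},
]
incoming = [
    {"guid": "obj-1", "display_name": "new"},
    {"guid": "obj-3", "display_name": "added"},
]

merged = publish_objects(existing, incoming)
replaced = publish_objects(existing, incoming, overwrite=True)
```

In this sketch, merged keeps obj-2, updates obj-1, and appends obj-3, while replaced contains only the two incoming objects.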