Reorganized markdown files to match new template
AlanSimmons committed Mar 22, 2023
1 parent 305f87c commit e250008
Showing 7 changed files with 166 additions and 120 deletions.
8 changes: 5 additions & 3 deletions docs/user-guide/glossary.md → docs/glossary/index.md
@@ -1,7 +1,8 @@
---
layout: default
---
# UBKG Concepts
# Unified Biomedical Knowledge Graph Glossary

This appendix is intended as a working glossary, and not as a formal or exhaustive terminology.
When the discussion of a term in the glossary refers to another term in the glossary, the other term will be in **_bold italic_**.

@@ -55,7 +56,8 @@ A **_cross-reference_** between a **_concept_** in one **_ontology_** and a conc
## Ingest files
A set of files that describe the entities and relationships of an ontology that is to be integrated into the UBKG.

# Inverse relationship
## Inverse relationship

A **_relationship_** in an **_ontology_** has a direction: it starts with one node and “goes toward” another–e.g.,

(_5'-AMP-activated protein kinase subunit gamma-1_)→**isa**→(_protein_)
@@ -122,7 +124,7 @@ Relationships in RO can be reviewed in a number of ways, including:
## SAB
The UBKG adopts the UMLS practice of identifying source ontologies with a **Source Abbreviation** (SAB). Examples of UMLS SABs include SNOMED_CT and UBERON. UBKG uses published acronyms for ontologies when possible–e.g., PATO.

## Term: preferred term, synonym
## Term (preferred, synonym)
A usually short text identifier for a **_code_** in an **_ontology_**. For example, a term for code 64033007 in SNOMED_CT is “kidney”.

A term can be a _preferred term_ or a _synonym_.
80 changes: 56 additions & 24 deletions docs/index.md
@@ -2,45 +2,64 @@
layout: default
---

# UBKG
The **Unified Biomedical Knowledge Graph (UBKG)** is a [knowledge graph](https://en.wikipedia.org/wiki/Knowledge_graph) database that represents a set of interrelated concepts from biomedical ontologies and vocabularies. The UBKG combines information from the National Library of Medicine's [Unified Medical Language System](https://www.nlm.nih.gov/research/umls/index.html) (UMLS) with [_assertions_](https://www.w3.org/TR/owl2-syntax/#Assertions) from “non-UMLS” ontologies or vocabularies, including:
# Unified Biomedical Knowledge Graph (UBKG)
---

The **Unified Biomedical Knowledge Graph (UBKG)** is a [knowledge graph](https://en.wikipedia.org/wiki/Knowledge_graph) infrastructure that represents a set of interrelated concepts from biomedical ontologies and vocabularies.

The UBKG combines information from the National Library of Medicine's [Unified Medical Language System](https://www.nlm.nih.gov/research/umls/index.html) (UMLS) with [_assertions_](https://www.w3.org/TR/owl2-syntax/#Assertions) from “non-UMLS” ontologies or vocabularies, including:
- Ontologies published in references such as the [NCBO Bioportal](https://bioportal.bioontology.org/) and the [OBO Foundry](https://obofoundry.org/).
- Custom ontologies derived from data sources such as [UNIPROTKB](https://www.uniprot.org/).
- Other custom ontologies, such as those for the [HuBMAP](https://hubmapconsortium.org/) platform.

An important goal of the UBKG is to establish connections between ontologies. For example, if information on the relationships between _proteins_ and _genes_ described in one ontology can be connected to information on the relationships between _genes_ and _diseases_ described in another ontology, it may be possible to identify previously unknown relationships between _proteins_ and _diseases_.
An important goal of the UBKG is to establish connections _between_ ontologies. For example, if information on the relationships between _proteins_ and _genes_ described in one ontology can be connected to information on the relationships between _genes_ and _diseases_ described in another ontology, it may be possible to identify previously unknown relationships between _proteins_ and _diseases_.
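
The protein, gene, and disease example can be sketched as a join on a shared gene identifier. This is a minimal illustration, not UBKG code: the identifiers, predicate names, and the `infer_protein_disease` helper are all hypothetical.

```python
# Hypothetical assertions from two separate ontologies (illustrative data only).
protein_gene = [("PROTEIN_A", "gene_product_of", "GENE_X")]
gene_disease = [("GENE_X", "associated_with", "DISEASE_Y")]


def infer_protein_disease(protein_gene, gene_disease):
    """Join two sets of (subject, predicate, object) triples on the gene."""
    diseases_by_gene = {}
    for gene, _, disease in gene_disease:
        diseases_by_gene.setdefault(gene, set()).add(disease)

    candidates = set()
    for protein, _, gene in protein_gene:
        for disease in diseases_by_gene.get(gene, ()):
            candidates.add((protein, disease))
    return sorted(candidates)


candidates = infer_protein_disease(protein_gene, gene_disease)
```

In a graph database the same join is a two-hop path traversal; the candidate pairs are hypotheses to investigate, not established relationships.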

## UBKG Components
## Components of the UBKG
The primary components of the UBKG are:

- a graph database, deployed in [neo4j](https://neo4j.com/)
- a **source framework** of scripts that obtain information from the UMLS to generate a set of **UMLS CSVs**
- a **generation framework** of scripts that append to the UMLS CSVs sets of assertions to create a set of **ontology CSVs**
- an **ontology knowledge graph database** instance, deployed in [neo4j](https://neo4j.com/), that includes scripts to import the ontology CSVs
- a [REST API](https://restfulapi.net/) that provides access to the information in the graph database

## UBKG Data Sources
The assertion data in the UBKG database is created from a load of a set of CSV files, using [neo4j-admin import](https://neo4j.com/docs/operations-manual/current/tutorial/neo4j-admin-import/).
The set of CSV import files is the product of two frameworks:
- a _source framework_ that extracts data obtained from a release of the UMLS
- a _generation framework_ that appends to the UMLS data assertions from other data sources
Source code for the components is stored in repositories in the [x-atlas-consortia](https://github.com/x-atlas-consortia) GitHub organization.

repository | content
--|--
ubkg-docs|documentation
ubkg-etl|source and generation frameworks
ubkg-neo4j|neo4j instance
ubkg-api|API server


### Source framework
The [**source framework**] is a combination of manual and automated processes that obtain the base set of nodes (entities) and edges (relationships) of the UBKG graph.
## UBKG Data Sources
The UBKG database is populated by loading a set of ontology CSV files, using [neo4j-admin import](https://neo4j.com/docs/operations-manual/current/tutorial/neo4j-admin-import/).
The ontology CSVs are the product of two frameworks, the source framework and the generation framework, which are described in the sections below.

The source framework is also known as the **UMLS-Graph**.
## Source framework
The **source framework** is a combination of manual and automated processes that obtain the base set of nodes (entities) and edges (relationships) that comprise the UMLS CSVs.
The UMLS CSVs can be loaded into neo4j to form a **UMLS-Graph**, a knowledge graph representation of the UMLS.

- Information on the concepts in the ontologies and vocabularies that are integrated into the UMLS Metathesaurus can be downloaded using the [MetamorphoSys](https://www.ncbi.nlm.nih.gov/books/NBK9683/#:~:text=MetamorphoSys%20is%20the%20UMLS%20installation,to%20create%20customized%20Metathesaurus%20subsets.) application. MetamorphoSys can be configured to download subsets of the entire UMLS.
- Information on the entities and relationships in the ontologies and vocabularies that are integrated into the UMLS Metathesaurus can be downloaded using the [MetamorphoSys](https://www.ncbi.nlm.nih.gov/books/NBK9683/#:~:text=MetamorphoSys%20is%20the%20UMLS%20installation,to%20create%20customized%20Metathesaurus%20subsets.) application. MetamorphoSys can be configured to download subsets of the entire UMLS.
- Additional semantic information related to the UMLS can be downloaded manually from the [Semantic Network](https://lhncbc.nlm.nih.gov/semanticnetwork/).

The result of the Metathesaurus and Semantic Network downloads is a set of files in [Rich Release Format](https://www.ncbi.nlm.nih.gov/books/NBK9685) (RRF). The RRF files contain information on source vocabularies or ontologies, codes, terms, and relationships both with other codes in the same vocabularies and with UMLS concepts.

The RRF files are loaded into a data mart. A Python script then executes SQL scripts that perform Extraction, Transformation, and Loading of the RRF data into a set of twelve temporary tables. These tables are exported to CSV format in files that become the **UMLS CSVs**.
The RRF files can be loaded into tables in a data mart. (The University of Pittsburgh manages its UMLS content in its **Neptune** data mart.)

A Python script then executes SQL scripts that perform Extraction, Transformation, and Loading (ETL) of the RRF data into a set of twelve temporary tables. These tables are exported to CSV format in files that become the **UMLS CSVs**.
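
As a rough illustration of the RRF-to-CSV step, the sketch below converts pipe-delimited RRF lines (each line ends with a trailing pipe) into CSV text. It is not the framework's actual ETL, which runs SQL against a data mart: the sample rows, the four column names, and the `rrf_rows` and `to_csv` helpers are assumptions made for illustration.

```python
import csv
import io


def rrf_rows(lines):
    """Yield the fields of each pipe-delimited RRF line.

    RRF lines end with a trailing pipe, so the final empty field is dropped.
    """
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if fields and fields[-1] == "":
            fields = fields[:-1]
        yield fields


def to_csv(rows, header):
    """Serialize a header and rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample lines; real RRF files have many more columns.
sample = [
    "C0000001|SAB_A|CODE1|first term|\n",
    "C0000002|SAB_B|CODE2|second term|\n",
]
csv_text = to_csv(rrf_rows(sample), ["cui", "sab", "code", "term"])
```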

The following diagram illustrates the source framework workflow.

![Source_framework](https://user-images.githubusercontent.com/10928372/202307155-5bfd7a77-e858-4e5c-89a1-a42d964b871d.jpg)

### Generation framework
The UMLS CSVs can be loaded into neo4j to build a graph version of the UMLS, including concepts and relationships from over 150 vocabularies and ontologies that are integrated into the UMLS, such as SNOMED CT, ICD10, and NCI.
## Generation framework

The UBKG extends the UMLS graph by integrating additional assertions from sources outside the UMLS, including a number of standard biomedical ontologies that are published in sources such as
NCBO BioPortal or the OBO Foundry.

The UBKG extends the UMLS graph by integrating additional assertions from sources outside of the UMLS, including a number of standard biomedical ontologies that are published in NCBO BioPortal, including:
The following table lists many of the sources of additional assertions.
This list may change based on the requirements of applications of the UBKG.

Ontology or Source | Description
--- | ---
@@ -71,15 +90,28 @@ The scripts in the generation framework:
- extract information on assertions (also known as _triples_, or _subject-predicate-object_ relationships) found in ontologies or derived from other sources
- iteratively add assertion information to the base set of UMLS CSVs to create a set of **ontology CSVs**.

Once a set of ontology CSVs is ready, they can be imported into a neo4j database to form a new instance of the UBKG.
Once a set of ontology CSVs is ready, it can be imported into a neo4j database to form a new instance of the UBKG.

The generation framework can work with:
- data from ontologies published in [Web Ontology Language](https://www.w3.org/OWL/) (OWL) files that conform to the [principles](https://obofoundry.org/principles/fp-000-summary.html) of the OBO Foundry
- data from private or custom ontologies that are in the SimpleKnowledge format. (SimpleKnowledge is a lightweight, spreadsheet-based ontology editor developed by Pitt UBMI.)
- assertion data that conforms to the _UBKG Edge/Node format_.

### PheKnowLator and OWLNETS
The generation framework obtains assertion data from OWL files with scripts that are based on the [Phenotype Knowledge Translator](https://github.com/callahantiff/PheKnowLator) (PheKnowLator) application. PheKnowLator converts information from an OWL file into the [OWL-NETS](https://github.com/callahantiff/PheKnowLator/wiki/OWL-NETS-2.0) (OWL NEtwork Transformation for Statistical learning) format.

- assertion data that conforms to the _UBKG Edge/Node format_, as described in the [UBKG User Guide](https://ubkg.docs.xconsortia.org/user-guide/#ingest-files-format-and-content).
- other reference data sources, by means of custom scripts

## PheKnowLator and OWLNETS
When the assertion data source is an OWL file, the generation framework uses the [Phenotype Knowledge Translator](https://github.com/callahantiff/PheKnowLator) (PheKnowLator) package.
PheKnowLator converts information from an OWL file into the [OWL-NETS](https://github.com/callahantiff/PheKnowLator/wiki/OWL-NETS-2.0) (OWL NEtwork Transformation for Statistical learning) format.

## Solution Architecture
The generation framework is a parameterized ETL script that:
- extracts assertion information from a data source
- transforms assertion information into the format of the UMLS CSVs
- appends assertions to the UMLS CSVs to create the ontology CSVs
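
The extract, transform, and append steps can be sketched as below. This is a simplified illustration under stated assumptions, not the generation framework itself: the three-column edge layout (`start`, `type`, `end`), the input record shape, and all function names are hypothetical, and the actual UMLS CSV formats are not shown here.

```python
import csv
import io

FIELDS = ["start", "type", "end"]  # hypothetical edge CSV layout


def extract(source_rows):
    """Extract (subject, predicate, object) triples from source records."""
    return [(r["subject"], r["predicate"], r["object"]) for r in source_rows]


def transform(triples):
    """Map triples onto the hypothetical edge CSV layout."""
    return [{"start": s, "type": p, "end": o} for s, p, o in triples]


def append(existing_csv, new_rows, fieldnames):
    """Append rows to existing CSV text, skipping rows already present."""
    rows = list(csv.DictReader(io.StringIO(existing_csv)))
    seen = {tuple(row[f] for f in fieldnames) for row in rows}
    for row in new_rows:
        key = tuple(row[f] for f in fieldnames)
        if key not in seen:
            rows.append(row)
            seen.add(key)

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


base_csv = "start,type,end\nCODE_A,isa,CODE_B\n"
source = [
    {"subject": "CODE_A", "predicate": "isa", "object": "CODE_B"},  # already present
    {"subject": "CODE_C", "predicate": "part_of", "object": "CODE_B"},
]
ontology_csv = append(base_csv, transform(extract(source)), FIELDS)
```

Skipping duplicates keeps the sketch idempotent when the same source is processed twice; whether the real scripts deduplicate this way is not stated in the documentation.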

The following diagram illustrates the basic workflow, showing four cases:
1. The OWLNETS script that uses PheKnowLator to work with OWL files
2. A custom script that obtains data from UniProtKB
3. The SKOWLNETS script that works with SimpleKnowledge data sources
4. Files in the UBKG edges/nodes format

![generation_framework](https://user-images.githubusercontent.com/10928372/202308840-1abc0684-684d-476a-8ed5-1a1b4118ffc6.jpg)
12 changes: 8 additions & 4 deletions docs/lang/en.json
@@ -8,13 +8,17 @@
"href": "/",
"class": "h2"
},
{
"name": "Generation framework",
"href": "/#generation-framework"
},
{
"name": "User Guide",
"href": "/user-guide"
},
{
"name": "Glossary",
"href": "/glossary"
},
{
"name": "PubChem Format",
"href": "/pubchem"
}
]
}
51 changes: 51 additions & 0 deletions docs/pubchem/index.md
@@ -0,0 +1,51 @@
---
layout: default
---
# PubChem Ingest File format

# edges.tsv
## Fields

| Field | Corresponding element in UBKG | Accepted formats | Examples |
|-----------|-------------------------------|-------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------|
| subject   | **Code** node                 | PUBCHEM _PubChem CID_                                                                                       | [PUBCHEM 9549299](https://pubchem.ncbi.nlm.nih.gov/compound/9549299) |
| predicate | relationships | For hierarchical relationships, the IRI http://www.w3.org/2000/01/rdf-schema#subClassOf OR the string “isa” | http://www.w3.org/2000/01/rdf-schema#subClassOf |
| | | For non-hierarchical relationships, an IRI for a relationship property in RO | http://purl.obolibrary.org/obo/RO_0002292 |
| | | Custom string | drinks milkshake of |
| object | **Code** node | same as for subject | |

## Relationships (predicates)
Defining relationships is the principal informatics task of assertion. Selecting appropriate concepts for the _node_dbxrefs_ field of **nodes.tsv** will associate cross-referenced assertions.

## An example for PUBCHEM 9549299:

An EGFR inhibitor inhibits the expression of EGFR (UNIPROTKB ID P00533), so a possible assertion is

| subject         | predicate                                 | object           |
|-----------------|-------------------------------------------|------------------|
| PUBCHEM 9549299 | http://purl.obolibrary.org/obo/RO_0002449 | UNIPROTKB P00533 |
| 1 | 2 | 3 |

RO_0002449 = _directly inhibits_
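
A row like the one above could be written to **edges.tsv** with Python's `csv` module. This is a minimal sketch: the header row, the subject check, and the `edges_tsv` helper are assumptions rather than part of the documented format.

```python
import csv
import io
import re

EDGE_FIELDS = ["subject", "predicate", "object"]
PUBCHEM_ID = re.compile(r"^PUBCHEM \d+$")  # "PUBCHEM" followed by a PubChem CID


def edges_tsv(edges):
    """Serialize (subject, predicate, object) triples as tab-delimited text."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(EDGE_FIELDS)  # assumption: the file starts with a header row
    for subject, predicate, obj in edges:
        if not PUBCHEM_ID.match(subject):
            raise ValueError(f"subject is not in 'PUBCHEM <CID>' form: {subject!r}")
        writer.writerow([subject, predicate, obj])
    return out.getvalue()


edges_text = edges_tsv([
    ("PUBCHEM 9549299", "http://purl.obolibrary.org/obo/RO_0002449", "UNIPROTKB P00533"),
])
```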

Because UNIPROTKB is already integrated into the UBKG, any relationship with P00533 would also get the link to HGNC 3236:

![image](https://user-images.githubusercontent.com/10928372/203175673-0372303c-ac5c-4122-bb6f-74a4dc31903a.png)

# nodes.tsv

## Fields

| Field | Corresponding element in UBKG | Accepted formats | Examples |
|------------------------------|---------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------|
| node_id                      | **Code** node                                     | PUBCHEM _PubChem CID_                                                                                                         | [PUBCHEM 9549299](https://pubchem.ncbi.nlm.nih.gov/compound/9549299)                      |
| node_label | **Term** node, _Preferred Term_ (PT) relationship | Text string for the **Compound Name** | EGFR Inhibitor |
| node_definition (_optional_) | **Definition** node, _DEF_ relationship | Text string - IUPAC Name? | N-[3-[[6-[3-(trifluoromethyl)anilino]pyrimidin-4-yl]amino]phenyl]cyclopropanecarboxamide |
| node_synonyms (_optional_) | **Term** node; _Synonym_ (SYN) relationship | **Pipe-delimited** list of synonyms | See example (pipes are also used to format table cells) |
| node_dbxrefs (_optional_) | Cross-references | Pipe-delimited list of references to cross-referenced concepts. Each cross-reference should be in format SAB:code or UMLS:CUI | UMLS:C5574906 |

## Example of synonyms for EGFR inhibitor
N-[3-[[6-[3-(trifluoromethyl)anilino]pyrimidin-4-yl]amino]phenyl]cyclopropanecarboxamide|1S/C21H18F3N5O/c22-21(23,24)14-3-1-4-15(9-14)27-18-11-19(26-12-25-18)28-16-5-2-6-17(10-16)29-20(30)13-7-8-13/h1-6,9-13H,7-8H2,(H,29,30)(H2,25,26,27,28)|YOHYSYJDKVYCJI-UHFFFAOYSA-N|C1CC1C(=O)NC2=CC=CC(=C2)NC3=NC=NC(=C3)NC4=CC=CC(=C4)C(F)(F)F

i.e.,
IUPAC Name|InChI|InChIKey|Canonical SMILES
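
A **nodes.tsv** row with pipe-delimited synonyms and cross-references could be assembled as follows. This is a sketch under assumptions: the header row, the rejection of values that contain a pipe, and the `node_row` and `nodes_tsv` helpers are illustrative and not part of the documented format.

```python
import csv
import io

NODE_FIELDS = ["node_id", "node_label", "node_definition", "node_synonyms", "node_dbxrefs"]


def node_row(node_id, label, definition="", synonyms=(), dbxrefs=()):
    """Build one nodes.tsv row; list-valued fields are joined with pipes."""
    for value in list(synonyms) + list(dbxrefs):
        if "|" in value:
            raise ValueError(f"value contains the list delimiter: {value!r}")
    return {
        "node_id": node_id,
        "node_label": label,
        "node_definition": definition,
        "node_synonyms": "|".join(synonyms),
        "node_dbxrefs": "|".join(dbxrefs),
    }


def nodes_tsv(rows):
    """Serialize node rows as tab-delimited text with a header row."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=NODE_FIELDS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


row = node_row(
    "PUBCHEM 9549299",
    "EGFR Inhibitor",
    synonyms=[
        "YOHYSYJDKVYCJI-UHFFFAOYSA-N",
        "C1CC1C(=O)NC2=CC=CC(=C2)NC3=NC=NC(=C3)NC4=CC=CC(=C4)C(F)(F)F",
    ],
    dbxrefs=["UMLS:C5574906"],
)
nodes_text = nodes_tsv([row])
```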