Reorganized markdown files to match new template

x-atlas-consortia · Mar 22, 2023 · e250008
1 parent 305f87c
commit e250008
Showing 7 changed files with 166 additions and 120 deletions.
diff --git a/docs/user-guide/glossary.md → docs/glossary/index.md b/docs/user-guide/glossary.md → docs/glossary/index.md
@@ -1,7 +1,8 @@
 ---
 layout: default
 ---
-# UBKG Concepts
+# Unified Biomedical Knowledge Graph Glossary
+
 This appendix is intended as a working glossary, and not as a formal or exhaustive terminology.
 When the discussion of a term in the glossary refers to another term in the glossary, the other term will be in **_bold italic_**.
 
@@ -55,7 +56,8 @@ A **_cross-reference_** between a **_concept_** in one **_ontology_** and a conc
 ## Ingest files
 A set of files that describe the entities and relationships of an ontology that is to be integrated into the UBKG.
 
-# Inverse relationship
+##  Inverse relationship
+
 A **_relationship_** in an **_ontology_** has a direction: it starts with one node and “goes toward” another–e.g., 
 
 (_5'-AMP-activated protein kinase subunit gamma-1_)→**isa**→(_protein_)
@@ -122,7 +124,7 @@ Relationships in RO can be reviewed in a number of ways, including:
 ## SAB
 The UBKG adopts the UMLS practice of identifying source ontologies with a **Source Abbreviation** (SAB). Examples of UMLS SABs include SNOMED_CT and UBERON. UBKG uses published acronyms for ontologies when possible–e.g., PATO.
 
-## Term: preferred term, synonym
+## Term (preferred, synonym)
 A usually short text identifier for a **_code_** in an **_ontology_**. For example, a term for code 64033007 in SNOMED_CT is “kidney”.
 
 A term can be a _preferred term_ or a _synonym_.

diff --git a/docs/index.md b/docs/index.md
@@ -2,45 +2,64 @@
 layout: default
 ---
 
-# UBKG
-The **Unified Biomedical Knowledge Graph (UBKG)** is a [knowledge graph](https://en.wikipedia.org/wiki/Knowledge_graph) database that represents a set of interrelated concepts from biomedical ontologies and vocabularies. The UBKG combines information from the National Library of Medicine's [Unified Medical Language System](https://www.nlm.nih.gov/research/umls/index.html) (UMLS) with [_assertions_](https://www.w3.org/TR/owl2-syntax/#Assertions) from “non-UMLS” ontologies or vocabularies, including:
+# Unified Biomedical Knowledge Graph (UBKG)
+---
+
+The **Unified Biomedical Knowledge Graph (UBKG)** is a  [knowledge graph](https://en.wikipedia.org/wiki/Knowledge_graph) infrastructure that represents a set of interrelated concepts from biomedical ontologies and vocabularies. 
+
+The UBKG combines information from the National Library of Medicine's [Unified Medical Language System](https://www.nlm.nih.gov/research/umls/index.html) (UMLS) with [_assertions_](https://www.w3.org/TR/owl2-syntax/#Assertions) from “non-UMLS” ontologies or vocabularies, including:
 - Ontologies published in references such as the [NCBO Bioportal](https://bioportal.bioontology.org/) and the [OBO Foundry](https://obofoundry.org/).
 - Custom ontologies derived from data sources such as [UNIPROTKB](https://www.uniprot.org/).
 - Other custom ontologies, such as those for the [HuBMAP](https://hubmapconsortium.org/) platform.
 
-An important goal of the UBKG is to establish connections between ontologies. For example,if information on the relationships between _proteins_ and _genes_ described in one ontology can be connected to information on the relationships between _genes_ and _diseases_ described in another ontology, it may be possible to identify previously unknown relationships between _proteins_ and _diseases_.
+An important goal of the UBKG is to establish connections _between_ ontologies. For example, if information on the relationships between _proteins_ and _genes_ described in one ontology can be connected to information on the relationships between _genes_ and _diseases_ described in another ontology, it may be possible to identify previously unknown relationships between _proteins_ and _diseases_.
 
-## UBKG Components
+## Components of the UBKG
 The primary components of the UBKG are:
 
-- a graph database, deployed in [neo4j](https://neo4j.com/)
+- a **source framework** of scripts that obtain information from the UMLS to generate a set of **UMLS CSVs**
+- a **generation framework** of scripts that append to the UMLS CSVs sets of assertions to create a set of **ontology CSVs**
+- an **ontology knowledge graph database** instance, deployed in [neo4j](https://neo4j.com/), that includes scripts to import the ontology CSVs
 - a [REST API](https://restfulapi.net/) that provides access to the information in the graph database
 
-## UBKG Data Sources
-The assertion data in the UBKG database is created from a load of a set of CSV files, using [neo4j-admin import](https://neo4j.com/docs/operations-manual/current/tutorial/neo4j-admin-import/). 
-The set of CSV import files is the product of two frameworks:
-- a _source framework_ that extracts data obtained from a release of the UMLS
-- a _generation framework_ that appends to the UMLS data assertions from other data sources
+Source code for the components is stored in repositories in the [x-atlas-consortia](https://github.com/x-atlas-consortia) GitHub organization.
+
+repository | content
+--|--
+ubkg-docs|documentation
+ubkg-etl|source and generation frameworks
+ubkg-neo4j|neo4j instance
+ubkg-api|API server
 
 
-### Source framework
-The [**source framework**] is a combination of manual and automated processes that obtain the base set of nodes (entities) and edges (relationships) of the UBKG graph.
+## UBKG Data Sources
+The UBKG database is populated by loading a set of ontology CSV files, using [neo4j-admin import](https://neo4j.com/docs/operations-manual/current/tutorial/neo4j-admin-import/). 
+The ontology CSVs are the product of two frameworks:
 
-The source framework is also known as the **UMLS-Graph**.
+## Source framework
+The **source framework** is a combination of manual and automated processes that obtain the base set of nodes (entities) and edges (relationships) that comprise the UMLS CSVs.
+The UMLS CSVs can be loaded into neo4j to form a **UMLS-Graph**, a knowledge graph representation of the UMLS.
 
-- Information on the concepts in the ontologies and vocabularies that are integrated into the UMLS Metathesaurus can be downloaded using the [MetamorphoSys](https://www.ncbi.nlm.nih.gov/books/NBK9683/#:~:text=MetamorphoSys%20is%20the%20UMLS%20installation,to%20create%20customized%20Metathesaurus%20subsets.) application. MetamorphoSys can be configured to download subsets of the entire UMLS.
+- Information on the entities and relationships in the ontologies and vocabularies that are integrated into the UMLS Metathesaurus can be downloaded using the [MetamorphoSys](https://www.ncbi.nlm.nih.gov/books/NBK9683/#:~:text=MetamorphoSys%20is%20the%20UMLS%20installation,to%20create%20customized%20Metathesaurus%20subsets.) application. MetamorphoSys can be configured to download subsets of the entire UMLS.
 - Additional semantic information related to the UMLS can be downloaded manually from the [Semantic Network](https://lhncbc.nlm.nih.gov/semanticnetwork/). 
 
 The result of the Metathesaurus and Semantic Network downloads is a set of files in [Rich Release Format](https://www.ncbi.nlm.nih.gov/books/NBK9685) (RRF). The RRF files contain information on source vocabularies or ontologies, codes, terms, and relationships both with other codes in the same vocabularies and with UMLS concepts.
 
-The RRF files are loaded into a data mart. A python script then executes SQL scripts that perform Extraction, Transformation, and Loading of the RRF data into a set of twelve temporary tables. These tables are exported to CSV format in files that become the **UMLS CSVs**.
+The RRF files can be loaded into tables in a data mart. (The University of Pittsburgh manages its UMLS content in its **Neptune** data mart.)
+
+A Python script then executes SQL scripts that perform Extraction, Transformation, and Loading (ETL) of the RRF data into a set of twelve temporary tables. These tables are exported to CSV format in files that become the **UMLS CSVs**.
+
+The following diagram illustrates the source framework workflow.
 
 ![Source_framework](https://user-images.githubusercontent.com/10928372/202307155-5bfd7a77-e858-4e5c-89a1-a42d964b871d.jpg)
 
-### Generation framework
-The UMLS CSVs can be loaded into neo4j to build a graph version of the UMLS, including concepts and relationships from over 150 vocabularies and ontologies that are integrated into the UMLS, such as SNOMED CT, ICD10, NCI, etc.. 
+## Generation framework
+
+The UBKG extends the UMLS graph by integrating additional assertions from sources outside the UMLS, including a number of standard biomedical ontologies that are published in sources such as
+NCBO BioPortal or the OBO Foundry. 
 
-The UBKG extends the UMLS graph by integrating additional assertions from sources outside of the UMLS, including a number of standard biomedical ontologies that are published in NCBO BioPortal, including:
+The following table lists many of the sources of additional assertions. 
+This list may change based on the requirements of applications of the UBKG.
 
 Ontology or Source | Description
 --- | ---
@@ -71,15 +90,28 @@ The scripts in the generation framework:
 - extract information on assertions (also known as _triples_, or _subject-predicate-object_ relationships) found in ontologies or derived from other sources
 - iteratively add assertion information to the base set of UMLS CSVs to create a set of **ontology CSVs**.
 
-Once a set of ontology CSVs is ready, they can be imported into a neo4j database to form a new instance of the UBKG.
+Once a set of ontology CSVs is ready, it can be imported into a neo4j database to form a new instance of the UBKG.
 
 The generation framework can work with:
 - data from ontologies published in [Web Ontology Language](https://www.w3.org/OWL/) (OWL) files that conform to the [principles](https://obofoundry.org/principles/fp-000-summary.html) of the OBO Foundry
 - data from private or custom ontologies that are in the SimpleKnowledge format. (SimpleKnowledge is a lightweight ontology editor based on spreadsheets developed by Pitt UBMI.)
-- assertion data that conforms to the _UBKG Edge/Node format_.
-
-### PheKnowLator and OWLNETS
-The generation framework obtains assertion data from OWL files with scripts that are based on the [Phenotype Knowledge Translator](https://github.com/callahantiff/PheKnowLator) (PheKnowLator) application. PheKnowLator converts information from an OWL file into the [OWL-NETS](https://github.com/callahantiff/PheKnowLator/wiki/OWL-NETS-2.0) (OWL NEtwork Transformation for Statistical learning) format.
-
+- assertion data that conforms to the _UBKG Edge/Node format_, as described in the [UBKG User Guide](https://ubkg.docs.xconsortia.org/user-guide/#ingest-files-format-and-content).
+- other reference data sources, by means of custom scripts
+
+## PheKnowLator and OWLNETS
+When the assertion data source is an OWL file, the generation framework uses the [Phenotype Knowledge Translator](https://github.com/callahantiff/PheKnowLator) (PheKnowLator) package. 
+PheKnowLator converts information from an OWL file into the [OWL-NETS](https://github.com/callahantiff/PheKnowLator/wiki/OWL-NETS-2.0) (OWL NEtwork Transformation for Statistical learning) format.
+
+## Solution Architecture
+The generation framework is a parameterized ETL script that:
+- extracts assertion information from a data source
+- transforms assertion information into the format of the UMLS CSVs
+- appends assertions to the UMLS CSVs to create the ontology CSVs
+
+The following diagram illustrates the basic workflow, showing four cases:
+1. The OWLNETS script that uses PheKnowLator to work with OWL files
+2. A custom script that obtains data from UniProtKB
+3. The SKOWLNETS script that works with SimpleKnowledge data sources
+4. Files in the UBKG edges/nodes format
 
 ![generation_framework](https://user-images.githubusercontent.com/10928372/202308840-1abc0684-684d-476a-8ed5-1a1b4118ffc6.jpg)
diff --git a/docs/lang/en.json b/docs/lang/en.json
@@ -8,13 +8,17 @@
       "href": "/",
       "class": "h2"
     },
-    {
-      "name": "Generation framework",
-      "href": "/#generation-framework"
-    },
     {
       "name": "User Guide",
       "href": "/user-guide"
+    },
+    {
+      "name": "Glossary",
+      "href": "/glossary"
+    },
+    {
+      "name": "PubChem Format",
+      "href": "/pubchem"
     }
   ]
 }
diff --git a/docs/pubchem/index.md b/docs/pubchem/index.md
@@ -0,0 +1,51 @@
+---
+layout: default
+---
+# PubChem Ingest File format
+
+# edges.tsv
+## Fields
+
+| Field     | Corresponding element in UBKG | Accepted formats                                                                                            | Examples                                                            |
+|-----------|-------------------------------|-------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------|
+| subject   | **Code** node                 | PUBCHEM _PubChem CID_                                                                                       | [PUBCHEM 9549299](https://pubchem.ncbi.nlm.nih.gov/compound/9549299) |
+| predicate | relationships                 | For hierarchical relationships, the IRI http://www.w3.org/2000/01/rdf-schema#subClassOf OR the string “isa” | http://www.w3.org/2000/01/rdf-schema#subClassOf                     |
+|           |                               | For non-hierarchical relationships, an IRI for a relationship property in RO	                               | http://purl.obolibrary.org/obo/RO_0002292                           |
+|           |                               | Custom string                                                                                               | drinks milkshake of                                                 |
+| object    | **Code** node                 | same as for subject                                                                                         |                                                                     |
+
+## Relationships (predicates)
+ The definition of relationships is the principal informatics task of assertion. An appropriate selection of concepts in the _node_dbxrefs_ field of **nodes.tsv** will associate cross-referenced assertions.
+
+ ## An example for PUBCHEM 9549299:
+
+ An EGFR inhibitor inhibits the expression of EGFR (UNIPROTKB ID P00533), so a possible assertion is
+
+| subject         | predicate                                 | object           |
+|-----------------|-------------------------------------------|------------------|
+| PUBCHEM 9549299 | http://purl.obolibrary.org/obo/RO_0002449 | UNIPROTKB P00533 |
+| 1               | 2                                         | 3                |
+
+RO_0002449 = _directly inhibits_
+
+ Because UNIPROTKB is already integrated into the UBKG, any relationship with P00533 would also get the link to HGNC 3236:
+
+ ![image](https://user-images.githubusercontent.com/10928372/203175673-0372303c-ac5c-4122-bb6f-74a4dc31903a.png)
+
+# nodes.tsv
+
+## Fields
+
+| Field                        | Corresponding element in UBKG                     | Accepted formats                                                                                                              | Examples                                                                                 |
+|------------------------------|---------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------|
+| node_id                      | **Code** node                                     | PUBCHEM _PubChem CID_                                                                                                         | [PUBCHEM 9549299](https://pubchem.ncbi.nlm.nih.gov/compound/9549299)                     |
+| node_label                   | **Term** node, _Preferred Term_ (PT) relationship | Text string for the **Compound Name**                                                                                         | EGFR Inhibitor                                                                           |
+| node_definition (_optional_) | **Definition** node, _DEF_ relationship           | Text string - IUPAC Name?                                                                                                     | N-[3-[[6-[3-(trifluoromethyl)anilino]pyrimidin-4-yl]amino]phenyl]cyclopropanecarboxamide |
+| node_synonyms (_optional_)   | **Term** node; _Synonym_ (SYN) relationship       | **Pipe-delimited** list of synonyms                                                                                           | See example (pipes are also used to format table cells)                                  |
+| node_dbxrefs (_optional_)    | Cross-references                                  | Pipe-delimited list of references to cross-referenced concepts. Each cross-reference should be in format SAB:code or UMLS:CUI | UMLS:C5574906                                                                            |
+
+## Example of synonyms for EGFR inhibitor
+N-[3-[[6-[3-(trifluoromethyl)anilino]pyrimidin-4-yl]amino]phenyl]cyclopropanecarboxamide|1S/C21H18F3N5O/c22-21(23,24)14-3-1-4-15(9-14)27-18-11-19(26-12-25-18)28-16-5-2-6-17(10-16)29-20(30)13-7-8-13/h1-6,9-13H,7-8H2,(H,29,30)(H2,25,26,27,28)|YOHYSYJDKVYCJI-UHFFFAOYSA-N|C1CC1C(=O)NC2=CC=CC(=C2)NC3=NC=NC(=C3)NC4=CC=CC(=C4)C(F)(F)F
+
+i.e.,
+2.1.1 IUPAC Name|2.1.2 InChI|2.1.3 InChIKey|2.1.4 Canonical SMILES
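
Note on the nodes.tsv format added in this commit (not part of the diff): the pipe-delimited `node_synonyms` and `node_dbxrefs` fields can be read as sketched below. This is a minimal illustration, not the ingest code used by the generation framework. The function names `split_pipes` and `parse_nodes`, the in-memory sample row, and the presence of a header row are assumptions made for the example; the field names and sample values come from the tables above.

```python
import csv
import io

# Hypothetical one-row nodes.tsv built from the example values in the diff.
# A header row is assumed; node_definition is left empty because it is optional.
NODES_TSV = (
    "node_id\tnode_label\tnode_definition\tnode_synonyms\tnode_dbxrefs\n"
    "PUBCHEM 9549299\tEGFR Inhibitor\t\t"
    "YOHYSYJDKVYCJI-UHFFFAOYSA-N|"
    "C1CC1C(=O)NC2=CC=CC(=C2)NC3=NC=NC(=C3)NC4=CC=CC(=C4)C(F)(F)F\t"
    "UMLS:C5574906\n"
)


def split_pipes(value):
    """Split a pipe-delimited field, ignoring empty items and missing values."""
    return [item for item in (value or "").split("|") if item]


def parse_nodes(text):
    """Parse nodes.tsv text into a list of dictionaries, one per node."""
    # QUOTE_NONE keeps chemical strings intact even if they contain quote characters.
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    nodes = []
    for row in reader:
        # node_id has the form "SAB code", e.g. "PUBCHEM 9549299".
        sab, _, code = row["node_id"].partition(" ")
        # Each cross-reference has the form "SAB:code" or "UMLS:CUI".
        dbxrefs = []
        for ref in split_pipes(row.get("node_dbxrefs")):
            ref_sab, _, ref_code = ref.partition(":")
            dbxrefs.append((ref_sab, ref_code))
        nodes.append(
            {
                "sab": sab,
                "code": code,
                "label": row["node_label"],
                "definition": row.get("node_definition") or "",
                "synonyms": split_pipes(row.get("node_synonyms")),
                "dbxrefs": dbxrefs,
            }
        )
    return nodes


if __name__ == "__main__":
    for node in parse_nodes(NODES_TSV):
        print(node["sab"], node["code"], node["synonyms"], node["dbxrefs"])
```

Splitting on the first space in `node_id` and the first colon in each cross-reference keeps codes that themselves contain those characters intact, which may or may not match how the actual ingest scripts treat such values.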