This repository contains a BioCypher adapter for Open Targets data version 25.09. The project is currently under active development.
- Overview
- Features
- Node and Edge Types
- Prerequisites
- Installation
- Data Preparation
- Usage
- Open Targets Data Schema
- Custom Node/Edge Definitions
- Code Generation
- Future Plans
- Contributing
- License
## Overview

BioCypher's modular design enables the use of different adapters to consume various data sources and produce knowledge graphs. This adapter serves as a "secondary adapter" for Open Targets data, meaning it adapts a pre-harmonised composite of atomic resources produced by the Open Targets pipeline. The adapter includes predefined sets of node types (entities) and edge types (relationships), or in the language of this adapter, presets of node and edge definitions. A script is provided to run BioCypher with the adapter, creating a knowledge graph with all predefined nodes and edges. On a consumer laptop, building the full graph typically takes 1-2 hours.
## Features

- Converts Open Targets data (version 25.09) into a BioCypher-compatible format
- Includes predefined sets of node types and edge types (node and edge definition presets)
- Uses declarative syntax to minimize the code needed to construct a graph schema
- Powered by DuckDB for fast and memory-efficient processing
- Implements true streaming from the datasets to BioCypher with minimal intermediate memory usage (see the sketch below)
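Conceptually, the streaming data flow looks like the following sketch (not the adapter's actual code; the path pattern and batch size are illustrative). DuckDB scans the Parquet files lazily and rows are yielded downstream one batch at a time:

```python
import duckdb

def stream_rows(dataset_path: str):
    """Lazily scan all Parquet files under a dataset and yield rows one by one."""
    con = duckdb.connect()
    result = con.execute(
        f"SELECT * FROM read_parquet('{dataset_path}/**/*.parquet')"
    )
    # Only one small batch is materialised in memory at a time.
    while batch := result.fetchmany(1024):
        yield from batch
```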
## Node and Edge Types

Nodes:

- Target
- Disease
- Gene Ontology (Category)
- Molecule
- Mouse Model
- Mouse Phenotype
- Mouse Target

Edges:

- Target -> Disease
- Target -> Gene Ontology
- Molecule -> Associated Target
- Molecule -> Associated Disease
## Prerequisites

- Poetry for dependency management
## Installation

- Clone the repository:

  ```bash
  git clone https://github.com/biocypher/open-targets.git
  cd open-targets
  ```

- Install dependencies using Poetry:

  ```bash
  poetry install
  ```

- Activate the virtual environment:

  ```bash
  poetry shell
  ```

- The adapter can now be imported:

  ```python
  from open_targets.adapter import acquisition_context
  ```
## Data Preparation

The required datasets for the node/edge definition presets must be downloaded into a directory of your choice. The resulting directory should have the following structure:

```text
directory-of-your-choice/
├── targets/
│   └── **
│       └── *.parquet
├── diseases/
│   └── **
│       └── *.parquet
...
```
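You can optionally verify the layout with a quick check (a hypothetical snippet, not part of the adapter; adjust `data_dir` to your directory):

```python
from pathlib import Path

data_dir = Path("directory-of-your-choice")
for dataset in ("targets", "diseases"):
    # Count the Parquet files found under each dataset directory.
    n_files = len(list((data_dir / dataset).glob("**/*.parquet")))
    print(f"{dataset}: {n_files} parquet files")
```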
Use the `download.sh` script to download the datasets. First, check the version of the Open Targets data you want to download. Then change into the directory where you want the data downloaded and execute:

```bash
{PATH_TO_REPO}/scripts/download.sh
```

## Usage

### Docker

To run the adapter using Docker, follow these steps:
- Ensure you have Docker installed and running.

- Follow the Data Preparation steps to download the required datasets and place them in the `{PATH_TO_REPO}/data/ot_files` directory. You do not need to install Poetry as described in the Installation steps.

- Create a folder called `dump` in the root directory of this repo, where a dump of the database will be stored. This makes it easier to copy the database to other machines. Make sure to grant everybody read/write permissions to the `dump` directory:

  ```bash
  mkdir -p dump
  chmod 777 dump
  ```

  NOTE: When moving the dump file, make sure not to change file permissions and ownership. The owner should be `7474:7474` and the permissions should be `-rw-r--r--`.

- Run the following command to start the BioCypher Open Targets adapter:

  ```bash
  docker-compose up -d
  ```

- The adapter will start building the knowledge graph using the predefined node and edge definitions. Once everything is ready, you can access the Neo4j graph at `http://localhost:7474`. The database itself is available at `localhost:7687`.
### Quick Start

- Follow the Installation steps.

- Follow the Data Preparation steps and place the downloaded Parquet files in the `{PATH_TO_REPO}/data/ot_files` directory.

- Run the script:

  ```bash
  python ./scripts/open_targets_biocypher_run.py
  ```

  The script runs BioCypher and generates a knowledge graph using all of our node/edge definition presets.

- To load your run into a database and inspect the graph, update the `import` container in the `docker-compose.yaml` file: mount the absolute path to the `biocypher-out` directory where the output files are stored, and execute the import script.

  ```yaml
  import:
    image: neo4j:4.4-enterprise
    container_name: import
    environment:
      NEO4J_AUTH: none
      NEO4J_ACCEPT_LICENSE_AGREEMENT: "yes"
      FILL_DB_ON_STARTUP: "yes"
    volumes:
      - biocypher_neo4j_volume:/data
      - ./scripts:/scripts
      - ./biocypher-out:/absolute/path/to/repo/BioCypher-OT/biocypher-out # add this line
    command:
      - /bin/bash
      # - /scripts/import.sh # remove this line
      - /absolute/path/to/repo/BioCypher-OT/biocypher-out/[RUN]/neo4j-admin-import-call.sh # add this line
  ```
### Using your own definitions

This is essentially the Quick Start, but with your own set of node/edge definitions taken from our presets:
```python
from biocypher import BioCypher

# AcquisitionContext is provided by the adapter package; adjust the import
# path to your installed version if it differs.
from open_targets.adapter import AcquisitionContext
from open_targets.definition import (
    ...
)

bc = BioCypher(biocypher_config_path=...)
node_definitions = ...  # imported node definitions
edge_definitions = ...  # imported edge definitions
context = AcquisitionContext(
    node_definitions=node_definitions,
    edge_definitions=edge_definitions,
    datasets_location=...,  # directory containing the downloaded datasets
)
for node_definition in node_definitions:
    bc.write_nodes(context.get_acquisition_generator(node_definition))
for edge_definition in edge_definitions:
    bc.write_edges(context.get_acquisition_generator(edge_definition))
```

In brief, first construct a context by providing a set of node/edge definitions. Then, for each definition, obtain a generator that streams data from the dataset to BioCypher. The data querying and transformation logic is defined in the node/edge definitions themselves.
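To spot-check a definition before writing the full graph, you can pull the first item from its generator directly (a quick sketch, assuming `node_definitions` is a sequence and `context` is constructed as above):

```python
# Peek at the first node produced by one definition.
generator = context.get_acquisition_generator(node_definitions[0])
print(next(iter(generator)))
```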
More details about customization are provided below.
## Open Targets Data Schema

The full schema of the Open Targets data is represented as Python classes included in this adapter. This design provides type checking for dataset and field references in code, minimizing human error. All dataset and field classes can be found in `open_targets/data/schema.py`.
All dataset and field classes are prefixed with `Dataset` and `Field`, respectively. Field names follow their structural location within their dataset. For example, `FieldTargetsHallmarksAttributes` represents the `attributes` field in the `targets` dataset, nested under the `hallmarks` field.
The schema can be used for data discovery and is utilized in node/edge definitions.
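For example, the generated classes can be inspected interactively for data discovery (a small sketch; the `.name` attribute is the one used in the definition example below, and the printed values follow the field names mentioned there):

```python
from open_targets.data.schema import FieldTargetsApprovedSymbol, FieldTargetsId

# Each field class records the field's name within its dataset.
print(FieldTargetsId.name)              # "id"
print(FieldTargetsApprovedSymbol.name)  # "approvedSymbol"
```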
## Custom Node/Edge Definitions

A node/edge definition describes how nodes/edges are acquired from a dataset. Each node/edge has essential attributes that make it a valid graph component, and a definition specifies how these values are acquired or computed. Each attribute has an expression that describes the chain of actions used to acquire the value from the dataset. An expression can be as simple as a field access or as complex as a chain of transformations. Here is a simple example:
```python
# Dataset and field classes come from open_targets.data.schema; the
# definition and expression classes are provided by the adapter package.
definition = ExpressionNodeAcquisitionDefinition(
    scan_operation=RowScanOperation(dataset=DatasetTargets),
    primary_id=FieldExpression(FieldTargetsId),
    label=LiteralExpression("ensembl"),
    properties=[
        (LiteralExpression(FieldTargetsApprovedSymbol.name), FieldExpression(FieldTargetsApprovedSymbol)),
    ],
)
```

In plain language, this definition scans through the `targets` dataset and generates a node for each row. The node's ID is taken from the `id` field, its label is set to the literal value `ensembl`, and its properties contain a single entry whose key is the name of the referenced field, `approvedSymbol`, and whose value comes from that field.
Expressions can be chained together:

```python
expression = NormaliseCurieExpression(ToStringExpression(FieldExpression(FieldEvidenceDiseaseId)))
```

This is equivalent to:

```python
value = normalise_curie(str(data[FieldEvidenceDiseaseId]))
```

In fact, this is almost exactly the function that is built and run during acquisition.
An edge definition is similar, but includes two additional attributes, `source` and `target`, which link two nodes together.
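By analogy with the node example above, an edge definition might look like the following sketch. The class name `ExpressionEdgeAcquisitionDefinition` and the `DatasetEvidence`, `FieldEvidenceId`, and `FieldEvidenceTargetId` classes are assumptions made for illustration; only `FieldEvidenceDiseaseId` appears elsewhere in this README.

```python
# A sketch only: the names marked below are assumed, not confirmed.
edge_definition = ExpressionEdgeAcquisitionDefinition(  # assumed class name
    scan_operation=RowScanOperation(dataset=DatasetEvidence),  # assumed dataset class
    primary_id=FieldExpression(FieldEvidenceId),  # assumed field class
    source=FieldExpression(FieldEvidenceTargetId),  # assumed field class
    target=NormaliseCurieExpression(ToStringExpression(FieldExpression(FieldEvidenceDiseaseId))),
    label=LiteralExpression("target_disease_association"),  # hypothetical label
    properties=[],
)
```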
For minor customization, you can derive a definition from one of our presets as follows:

```python
from dataclasses import replace

from open_targets.data.schema import FieldTargetsApprovedSymbol
from open_targets.definition import node_target

node_definition = replace(node_target, primary_id=FieldTargetsApprovedSymbol)
```

## Code Generation

This repository uses code generation (powered by Jinja) to generate source files such as the Python classes representing the Open Targets data schema. The code generation scripts are located under `code_generation`. `*.jinja` files are templates for the generated scripts, and each template has its corresponding generated script in the same directory.
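To illustrate the pattern (a minimal, invented example, not the repository's actual template), a Jinja template can render the kind of field classes described above:

```python
from jinja2 import Template

# Render a schema class like those in open_targets/data/schema.py from a
# template; the template string and inputs here are for demonstration only.
template = Template(
    'class Field{{ dataset }}{{ field | capitalize }}:\n'
    '    name = "{{ field }}"\n'
)
print(template.render(dataset="Targets", field="id"))
# class FieldTargetsId:
#     name = "id"
```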
## Future Plans

- Implement cloud streaming to eliminate the need for local dataset storage
- Develop a codeless mode for defining node/edge definitions in JSON/YAML files
- Support Open Targets metadata migration to Croissant ML
- Extend beyond Open Targets data to support various tabular data formats
- Create a comprehensive set of scientifically meaningful node/edge definitions and knowledge graph schemas
## Contributing

Contributions are welcome! Please feel free to submit a pull request, or create an issue if you discover any problems.
## License

This project is licensed under the MIT License - see the LICENSE file for details.