BioCypher Open Targets Data (25.09) Adapter


This repository contains a BioCypher adapter for Open Targets data version 25.09. The project is currently under active development.

Table of Contents

  • Overview
  • Features
  • Node and Edge Types
  • Prerequisites
  • Installation
  • Data Preparation
  • Usage
  • Not So Quick Start
  • Open Targets Data Schema
  • Custom Node/Edge Definitions
  • Code Generation
  • Future Plans
  • Contributing
  • License

Overview

BioCypher's modular design enables the use of different adapters to consume various data sources and produce knowledge graphs. This adapter serves as a "secondary adapter" for Open Targets data: it adapts a composite of atomic resources that has already been harmonised by the Open Targets pipeline. The adapter includes predefined sets of node types (entities) and edge types (relationships), or in the language of this adapter, presets of node and edge definitions. A script is provided to run BioCypher with the adapter, creating a knowledge graph with all predefined nodes and edges. On a consumer laptop, building the full graph typically takes 1-2 hours.

Features

  • Converts Open Targets data (version 25.09) into a BioCypher-compatible format
  • Includes predefined sets of node types and edge types (node and edge definition presets)
  • Uses declarative syntax to minimize the code needed for graph schema construction
  • Powered by DuckDB for fast, memory-efficient processing
  • Implements true streaming from the datasets to BioCypher with minimal intermediate memory usage (see the sketch below)
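
The streaming approach can be pictured with a short DuckDB sketch. This is an illustration of the idea only, not the adapter's internal code; the parquet path follows the Data Preparation section and the field names come from the targets dataset.

import duckdb

# Scan the parquet shards lazily and pull rows in fixed-size batches,
# so memory usage stays flat regardless of dataset size.
con = duckdb.connect()
result = con.execute(
    "SELECT id, approvedSymbol FROM read_parquet('data/ot_files/targets/**/*.parquet')"
)
while batch := result.fetchmany(10_000):
    for row in batch:
        ...  # transform the row into a BioCypher node tuple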

Node and Edge Types

Nodes

  • Target
  • Disease
  • Gene Ontology (Category)
  • Molecule
  • Mouse Model
  • Mouse Phenotype
  • Mouse Target

Edges

  • Target -> Disease
  • Target -> Gene Ontology
  • Molecule -> Associated Target
  • Molecule -> Associated Disease

Prerequisites

  • Poetry for dependency management

Installation

  1. Clone the repository:

    git clone https://github.com/biocypher/open-targets.git
    cd open-targets
  2. Install dependencies using Poetry:

    poetry install
  3. Activate the virtual environment:

    poetry shell
  4. The adapter can now be imported:

    from open_targets.adapter import acquisition_context

Data Preparation

The node/edge definition presets each require one or more Open Targets datasets in Parquet format. After downloading, the resulting directory should have the following structure:

directory-of-your-choice/
├── targets/
│   └── **
│       └── *.parquet
├── diseases/
│   └── **
│       └── *.parquet
...

Use the download.sh script to download the datasets. First check which version of the Open Targets data you want to download. Then change into the directory where the data should be stored and execute the following command:

{PATH_TO_REPO}/scripts/download.sh
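
After downloading, you can sanity-check the layout from Python. This is a minimal sketch; data/ot_files is the location used by the run steps below, and targets/diseases are the datasets shown in the structure above.

from pathlib import Path

# Count the parquet shards per dataset to confirm the download completed.
data_dir = Path("data/ot_files")
for dataset in ("targets", "diseases"):
    shards = list((data_dir / dataset).rglob("*.parquet"))
    print(f"{dataset}: {len(shards)} parquet files")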

Usage

Docker - fully automated

To run the adapter using Docker, follow these steps:

  1. Ensure you have Docker installed and running.

  2. Follow the Data Preparation steps to download the required datasets and place them in the {PATH_TO_REPO}/data/ot_files directory. You do not need to install Poetry as described in the Installation steps.

  3. Create a folder called dump in the root directory of this repo, where a dump of the database will be stored. This makes it easier to copy the database to other machines. Make sure to grant everybody read/write permissions to this directory:

    mkdir -p dump
    chmod 777 dump

    NOTE: When moving the dump file, make sure not to change file permissions and ownership. Owner should be 7474:7474 and permissions should be -rw-r--r--.

  4. Run the following command to start the BioCypher Open Targets adapter:

    docker-compose up -d
  5. The adapter starts building the knowledge graph using the predefined node and edge definitions. Once everything is ready, you can browse the graph at http://localhost:7474; the database itself listens on localhost:7687 (Bolt). A quick sanity check is sketched below.
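
To confirm the import worked, you can count the nodes over Bolt. This is a minimal sketch using the official neo4j Python driver (not one of this project's dependencies); auth=None matches the NEO4J_AUTH: none setting in the compose file.

from neo4j import GraphDatabase

# Authentication is disabled in docker-compose.yaml (NEO4J_AUTH: none).
driver = GraphDatabase.driver("bolt://localhost:7687", auth=None)
with driver.session() as session:
    count = session.run("MATCH (n) RETURN count(n) AS n").single()["n"]
    print(f"{count} nodes in the graph")
driver.close()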

Local & Docker - semi-automated

  1. Follow the Installation steps

  2. Follow the Data Preparation steps and place the downloaded Parquet files in the {PATH_TO_REPO}/data/ot_files directory

  3. Run the script:

    python ./scripts/open_targets_biocypher_run.py

    The script runs BioCypher and generates a knowledge graph using all our node/edge definition presets.

  4. To load a run into a database and view your graph, update the import container in the docker-compose.yaml file: mount the biocypher-out directory where the output files are stored (at its absolute path) and execute the generated import script instead of the default one.

    import:
        image: neo4j:4.4-enterprise
        container_name: import
        environment:
            NEO4J_AUTH: none
            NEO4J_ACCEPT_LICENSE_AGREEMENT: "yes"
            FILL_DB_ON_STARTUP: "yes"
        volumes:
            - biocypher_neo4j_volume:/data
            - ./scripts:/scripts
            - ./biocypher-out:/absolute/path/to/repo/BioCypher-OT/biocypher-out # add this line
        command:
            - /bin/bash
            # - /scripts/import.sh # remove this line
            - /absolute/path/to/repo/BioCypher-OT/biocypher-out/[RUN]/neo4j-admin-import-call.sh # add this line

Not So Quick Start

The same flow as the automated runs above, but with your own set of node/edge definitions taken from our presets:

from biocypher import BioCypher

# The AcquisitionContext class ships with the adapter package shown in the
# Installation section; its exact import path may differ.
from open_targets.adapter import AcquisitionContext
from open_targets.definition import (
    ...
)

bc = BioCypher(biocypher_config_path=...)

node_definitions = ...  # imported node definitions
edge_definitions = ...  # imported edge definitions

context = AcquisitionContext(
    node_definitions=node_definitions,
    edge_definitions=edge_definitions,
    datasets_location=..., # directory containing the downloaded datasets
)

for node_definition in node_definitions:
    bc.write_nodes(context.get_acquisition_generator(node_definition))
for edge_definition in edge_definitions:
    bc.write_edges(context.get_acquisition_generator(edge_definition))

In brief, first construct a context by providing a set of node/edge definitions. Then, for each definition, you can obtain a generator that streams data from a dataset to BioCypher. The data querying and transformation logic is defined in the node/edge definitions.

More details about customization are provided below.

Open Targets Data Schema

The full schema of Open Targets data is represented as Python classes included in this adapter. This design provides type checking for dataset and field references in code to minimize human error. All dataset and field classes can be found in open_targets/data/schema.py.

All dataset and field classes are prefixed with Dataset and Field, respectively. Field names follow their structural location in their datasets. For example, FieldTargetsHallmarksAttributes represents the attributes field in the targets dataset, under the hallmarks field.

The schema can be used for data discovery and is utilized in node/edge definitions.
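
For instance, a field class can be inspected directly. A small sketch; the .name attribute is the same one used in the definition example below, and the printed value follows the naming convention explained above.

from open_targets.data.schema import FieldTargetsApprovedSymbol

# Field classes expose the raw field name as it appears in the dataset.
print(FieldTargetsApprovedSymbol.name)  # "approvedSymbol"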

Custom Node/Edge Definitions

A node/edge definition describes how nodes/edges are acquired from a dataset. Each node/edge has essential attributes that make it a valid graph component, and a definition specifies how these values are acquired or computed. Each attribute has an expression that describes the chain of actions to acquire the value from the dataset. An expression can be as simple as a field access or a complex chain of transformations. Here's a simple example:

from open_targets.data.schema import DatasetTargets, FieldTargetsApprovedSymbol, FieldTargetsId
# The definition/expression classes (ExpressionNodeAcquisitionDefinition,
# RowScanOperation, FieldExpression, LiteralExpression) ship with the adapter;
# their import paths are not shown in this README.

definition = ExpressionNodeAcquisitionDefinition(
    scan_operation=RowScanOperation(dataset=DatasetTargets),
    primary_id=FieldExpression(FieldTargetsId),
    label=LiteralExpression("ensembl"),
    properties=[
        (LiteralExpression(FieldTargetsApprovedSymbol.name), FieldExpression(FieldTargetsApprovedSymbol)),
    ],
)

In plain language, this definition scans through the targets dataset and generates a node for each row. The node's ID is assigned from the id field, its label is set to the literal value ensembl, and its properties include a single property where the key is the name of the referenced field approvedSymbol and the value comes from that field.

Expressions can be chained together:

expression = NormaliseCurieExpression(ToStringExpression(FieldExpression(FieldEvidenceDiseaseId)))

This is equivalent to:

value = normalise_curie(str(data[FieldEvidenceDiseaseId]))

In fact, this is almost exactly how the acquisition function is built and executed at run time.

An edge definition is similar but includes two additional attributes, source and target, to link two nodes together.
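
Here is a hypothetical sketch mirroring the node example above. The class name ExpressionEdgeAcquisitionDefinition and the evidence dataset/field names, apart from FieldEvidenceDiseaseId, are assumptions based on the naming conventions described earlier.

edge_definition = ExpressionEdgeAcquisitionDefinition(  # assumed edge counterpart of the node class
    scan_operation=RowScanOperation(dataset=DatasetEvidence),  # assumed dataset class name
    primary_id=FieldExpression(FieldEvidenceId),  # assumed field class name
    source=FieldExpression(FieldEvidenceTargetId),  # assumed field class name
    target=NormaliseCurieExpression(ToStringExpression(FieldExpression(FieldEvidenceDiseaseId))),
    label=LiteralExpression("target_disease_association"),
    properties=[],
)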

For minor customization, you can derive from one of our presets as follows:

from dataclasses import replace

from open_targets.data.schema import FieldTargetsApprovedSymbol
from open_targets.definition import node_target

node_definition = replace(node_target, primary_id=FieldTargetsApprovedSymbol)

Code Generation

This repository uses code generation (powered by Jinja) to produce files such as the Python classes representing the Open Targets data schema. The code generation scripts are located under code_generation. *.jinja files are the templates, and each template has its corresponding generated script in the same directory.
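
The pattern can be pictured with a toy renderer. This is an illustration only, not the repository's actual generator, and it assumes templates that render without variables.

from pathlib import Path

from jinja2 import Template

# Render every *.jinja template into a sibling file without the suffix,
# e.g. schema.py.jinja -> schema.py.
for template_path in Path("code_generation").rglob("*.jinja"):
    output_path = template_path.with_suffix("")
    output_path.write_text(Template(template_path.read_text()).render())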

Future Plans

  • Implement cloud streaming to eliminate the need for local dataset storage
  • Develop a codeless mode for defining node/edge definitions in JSON/YAML files
  • Support Open Targets metadata migration to Croissant ML
  • Extend beyond Open Targets data to support various tabular data formats
  • Create a comprehensive set of scientifically meaningful node/edge definitions and knowledge graph schemas

Contributing

Contributions are welcome! Please feel free to submit a Pull Request or create an Issue if you discover any problems.

License

This project is licensed under the MIT License - see the LICENSE file for details.
