BioCypher Open Targets Data (25.09) Adapter


This repository contains a BioCypher adapter for Open Targets data version 25.09. The project is currently under active development.

Table of Contents

  • Overview
  • Features
  • Node and Edge Types
  • Prerequisites
  • Installation
  • Data Preparation
  • Usage
  • Not So Quick Start
  • Open Targets Data Schema
  • Custom Node/Edge Definitions
  • Code Generation
  • Future Plans
  • Contributing
  • License

Overview

BioCypher's modular design enables the use of different adapters to consume various data sources and produce knowledge graphs. This adapter serves as a "secondary adapter" for Open Targets data: it adapts a composite of atomic resources that has already been harmonised by the Open Targets pipeline. The adapter includes predefined sets of node types (entities) and edge types (relationships), or in the language of this adapter, presets of node and edge definitions. A script is provided to run BioCypher with the adapter, creating a knowledge graph with all predefined nodes and edges. On a consumer laptop, building the full graph typically takes 1-2 hours.

Features

  • Converts Open Targets data (version 25.09) into a BioCypher-compatible format
  • Includes predefined sets of node types and edge types (node and edge definition presets)
  • Uses declarative syntax to minimize the code needed for graph schema construction
  • Powered by DuckDB for fast, memory-efficient processing
  • Implements true streaming from the datasets to BioCypher with minimal intermediate memory usage (see the sketch below)
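
The streaming approach can be pictured with a short DuckDB sketch. This is an illustration of the idea only, not the adapter's internal code; the parquet path follows the Data Preparation section and the field names come from the targets dataset.

import duckdb

# Scan the parquet shards lazily and pull rows in fixed-size batches,
# so memory usage stays flat regardless of dataset size.
con = duckdb.connect()
result = con.execute(
    "SELECT id, approvedSymbol FROM read_parquet('data/ot_files/targets/**/*.parquet')"
)
while batch := result.fetchmany(10_000):
    for row in batch:
        ...  # transform the row into a BioCypher node tuple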

Node and Edge Types

Nodes

  • Target
  • Disease
  • Gene Ontology (Category)
  • Molecule
  • Mouse Model
  • Mouse Phenotype
  • Mouse Target

Edges

  • Target -> Disease
  • Target -> Gene Ontology
  • Molecule -> Associated Target
  • Molecule -> Associated Disease

Prerequisites

  • Poetry for dependency management

Installation

  1. Clone the repository:

    git clone https://github.com/biocypher/open-targets.git
    cd open-targets
  2. Install dependencies using Poetry:

    poetry install
  3. Activate the virtual environment:

    poetry shell
  4. The adapter can now be imported:

    from open_targets.adapter import acquisition_context

Data Preparation

The node/edge definition presets each require one or more Open Targets datasets in Parquet format. After downloading, the resulting directory should have the following structure:

directory-of-your-choice/
├── targets/
│   └── **
│       └── *.parquet
├── diseases/
│   └── **
│       └── *.parquet
...

Use the download.sh script to download the datasets. First check which version of the Open Targets data you want to download. Then change into the directory where the data should be stored and execute the following command:

{PATH_TO_REPO}/scripts/download.sh
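
After downloading, you can sanity-check the layout from Python. This is a minimal sketch; data/ot_files is the location used by the run steps below, and targets/diseases are the datasets shown in the structure above.

from pathlib import Path

# Count the parquet shards per dataset to confirm the download completed.
data_dir = Path("data/ot_files")
for dataset in ("targets", "diseases"):
    shards = list((data_dir / dataset).rglob("*.parquet"))
    print(f"{dataset}: {len(shards)} parquet files")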

Usage

Docker - fully automated

To run the adapter using Docker, follow these steps:

  1. Ensure you have Docker installed and running.

  2. Follow the Data Preparation steps to download the required datasets and place them in the {PATH_TO_REPO}/data/ot_files directory. You do not need to install Poetry as described in the Installation steps.

  3. Create a folder called dump in the root directory of this repo, where a dump of the database will be stored. This makes it easier to copy the database to other machines. Make sure to grant everybody read/write permissions to this directory:

    mkdir -p dump
    chmod 777 dump

    NOTE: When moving the dump file, make sure not to change file permissions and ownership. Owner should be 7474:7474 and permissions should be -rw-r--r--.

  4. Run the following command to start the BioCypher Open Targets adapter:

    docker-compose up -d
  5. The adapter starts building the knowledge graph using the predefined node and edge definitions. Once everything is ready, you can browse the graph at http://localhost:7474; the database itself listens on localhost:7687 (Bolt). A quick sanity check is sketched below.
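
To confirm the import worked, you can count the nodes over Bolt. This is a minimal sketch using the official neo4j Python driver (not one of this project's dependencies); auth=None matches the NEO4J_AUTH: none setting in the compose file.

from neo4j import GraphDatabase

# Authentication is disabled in docker-compose.yaml (NEO4J_AUTH: none).
driver = GraphDatabase.driver("bolt://localhost:7687", auth=None)
with driver.session() as session:
    count = session.run("MATCH (n) RETURN count(n) AS n").single()["n"]
    print(f"{count} nodes in the graph")
driver.close()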

Local & Docker - semi-automated

  1. Follow the Installation steps

  2. Follow the Data Preparation steps and place the downloaded Parquet files in the {PATH_TO_REPO}/data/ot_files directory

  3. Run the script:

    python ./scripts/open_targets_biocypher_run.py

    The script runs BioCypher and generates a knowledge graph using all our node/edge definition presets.

  4. To load a run into a database and view your graph, update the import container in the docker-compose.yaml file: mount the biocypher-out directory where the output files are stored (at its absolute path) and execute the generated import script instead of the default one.

    import:
        image: neo4j:4.4-enterprise
        container_name: import
        environment:
            NEO4J_AUTH: none
            NEO4J_ACCEPT_LICENSE_AGREEMENT: "yes"
            FILL_DB_ON_STARTUP: "yes"
        volumes:
            - biocypher_neo4j_volume:/data
            - ./scripts:/scripts
            - ./biocypher-out:/absolute/path/to/repo/BioCypher-OT/biocypher-out # add this line
        command:
            - /bin/bash
            # - /scripts/import.sh # remove this line
            - /absolute/path/to/repo/BioCypher-OT/biocypher-out/[RUN]/neo4j-admin-import-call.sh # add this line

Not So Quick Start

The same flow as the automated runs above, but with your own set of node/edge definitions taken from our presets:

from biocypher import BioCypher

# The AcquisitionContext class ships with the adapter package shown in the
# Installation section; its exact import path may differ.
from open_targets.adapter import AcquisitionContext
from open_targets.definition import (
    ...
)

bc = BioCypher(biocypher_config_path=...)

node_definitions = ...  # imported node definitions
edge_definitions = ...  # imported edge definitions

context = AcquisitionContext(
    node_definitions=node_definitions,
    edge_definitions=edge_definitions,
    datasets_location=..., # directory containing the downloaded datasets
)

for node_definition in node_definitions:
    bc.write_nodes(context.get_acquisition_generator(node_definition))
for edge_definition in edge_definitions:
    bc.write_edges(context.get_acquisition_generator(edge_definition))

In brief, first construct a context by providing a set of node/edge definitions. Then, for each definition, you can obtain a generator that streams data from a dataset to BioCypher. The data querying and transformation logic is defined in the node/edge definitions.

More details about customization are provided below.

Open Targets Data Schema

The full schema of Open Targets data is represented as Python classes included in this adapter. This design provides type checking for dataset and field references in code to minimize human error. All dataset and field classes can be found in open_targets/data/schema.py.

All dataset and field classes are prefixed with Dataset and Field, respectively. Field names follow their structural location in their datasets. For example, FieldTargetsHallmarksAttributes represents the attributes field in the targets dataset, under the hallmarks field.

The schema can be used for data discovery and is utilized in node/edge definitions.
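
For instance, a field class can be inspected directly. A small sketch; the .name attribute is the same one used in the definition example below, and the printed value follows the naming convention explained above.

from open_targets.data.schema import FieldTargetsApprovedSymbol

# Field classes expose the raw field name as it appears in the dataset.
print(FieldTargetsApprovedSymbol.name)  # "approvedSymbol"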

Custom Node/Edge Definitions

A node/edge definition describes how nodes/edges are acquired from a dataset. Each node/edge has essential attributes that make it a valid graph component, and a definition specifies how these values are acquired or computed. Each attribute has an expression that describes the chain of actions to acquire the value from the dataset. An expression can be as simple as a field access or a complex chain of transformations. Here's a simple example:

from open_targets.data.schema import DatasetTargets, FieldTargetsApprovedSymbol, FieldTargetsId
# The definition/expression classes (ExpressionNodeAcquisitionDefinition,
# RowScanOperation, FieldExpression, LiteralExpression) ship with the adapter;
# their import paths are not shown in this README.

definition = ExpressionNodeAcquisitionDefinition(
    scan_operation=RowScanOperation(dataset=DatasetTargets),
    primary_id=FieldExpression(FieldTargetsId),
    label=LiteralExpression("ensembl"),
    properties=[
        (LiteralExpression(FieldTargetsApprovedSymbol.name), FieldExpression(FieldTargetsApprovedSymbol)),
    ],
)

In plain language, this definition scans through the targets dataset and generates a node for each row. The node's ID is assigned from the id field, its label is set to the literal value ensembl, and its properties include a single property where the key is the name of the referenced field approvedSymbol and the value comes from that field.

Expressions can be chained together:

expression = NormaliseCurieExpression(ToStringExpression(FieldExpression(FieldEvidenceDiseaseId)))

This is equivalent to:

value = normalise_curie(str(data[FieldEvidenceDiseaseId]))

In fact, this is almost exactly how the acquisition function is built and executed at run time.

An edge definition is similar but includes two additional attributes, source and target, to link two nodes together.
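
Here is a hypothetical sketch mirroring the node example above. The class name ExpressionEdgeAcquisitionDefinition and the evidence dataset/field names, apart from FieldEvidenceDiseaseId, are assumptions based on the naming conventions described earlier.

edge_definition = ExpressionEdgeAcquisitionDefinition(  # assumed edge counterpart of the node class
    scan_operation=RowScanOperation(dataset=DatasetEvidence),  # assumed dataset class name
    primary_id=FieldExpression(FieldEvidenceId),  # assumed field class name
    source=FieldExpression(FieldEvidenceTargetId),  # assumed field class name
    target=NormaliseCurieExpression(ToStringExpression(FieldExpression(FieldEvidenceDiseaseId))),
    label=LiteralExpression("target_disease_association"),
    properties=[],
)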

For minor customization, you can derive from one of our presets as follows:

from dataclasses import replace

from open_targets.data.schema import FieldTargetsApprovedSymbol
from open_targets.definition import node_target

node_definition = replace(node_target, primary_id=FieldTargetsApprovedSymbol)

Code Generation

This repository uses code generation (powered by Jinja) to produce files such as the Python classes representing the Open Targets data schema. The code generation scripts are located under code_generation. *.jinja files are the templates, and each template has its corresponding generated script in the same directory.
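
The pattern can be pictured with a toy renderer. This is an illustration only, not the repository's actual generator, and it assumes templates that render without variables.

from pathlib import Path

from jinja2 import Template

# Render every *.jinja template into a sibling file without the suffix,
# e.g. schema.py.jinja -> schema.py.
for template_path in Path("code_generation").rglob("*.jinja"):
    output_path = template_path.with_suffix("")
    output_path.write_text(Template(template_path.read_text()).render())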

Future Plans

  • Implement cloud streaming to eliminate the need for local dataset storage
  • Develop a codeless mode for defining node/edge definitions in JSON/YAML files
  • Support Open Targets metadata migration to Croissant ML
  • Extend beyond Open Targets data to support various tabular data formats
  • Create a comprehensive set of scientifically meaningful node/edge definitions and knowledge graph schemas

Contributing

Contributions are welcome! Please feel free to submit a Pull Request or create an Issue if you discover any problems.

License

This project is licensed under the MIT License - see the LICENSE file for details.
