RamanLab/SKG_Drug_repurposing

 
 

Link Prediction Framework

A comprehensive framework for graph-based link prediction with multiple algorithms and evaluation methodologies. This repository implements state-of-the-art link prediction techniques and provides automated testing across multiple real-world network datasets.

Project Objective

The primary objective of this framework is to evaluate and compare the performance of various link prediction algorithms across diverse network datasets. By systematically removing edges and attempting to predict them, we can measure each algorithm's effectiveness at recovering missing connections in different types of networks.

This framework enables researchers and data scientists to:

  • Test multiple link prediction algorithms on various graph datasets
  • Compare algorithm performance using standardized metrics
  • Visualize results through recovery plots and confusion matrices
  • Process and manage large network datasets efficiently

Implementation Details

Supported Datasets

The framework supports evaluation on eight diverse real-world network datasets, sourced from publicly available repositories. Below are the datasets and their download URLs:

| Dataset Name | Description | Download URL |
| --- | --- | --- |
| Bitcoin-Alpha | Trust network from the Bitcoin Alpha platform | Download |
| Bio-Proteins | Protein interaction network | Download |
| HepTh-Collab | Collaboration network of HEP-TH researchers | Download |
| Transport | Air transportation network (OpenFlights) | Download |
| Enron-Email | Email communication network from Enron | Download |
| EU-Email | Email network from a European research institution | Download |
| Bitcoin-OTC | Trust network from the Bitcoin OTC platform | Download |
| Reddit-Hyperlinks | Network of subreddit connections via hyperlinks | Download |

Directory Structure

.
├── config
│   └── hyperparams.py           # Configuration parameters
├── data                         # Dataset storage directory
│   ├── bio-CE-CX.edges
│   ├── cit-HepTh.txt
│   ├── email-Enron.txt
│   └── ...                      # Other dataset files
├── main.py                      # Main execution script
├── results                      # Results directory
│   ├── csvs                     # Edge recovery CSVs for all datasets (contains one CSV per dataset)
│   │   └── ...                  # Dataset-specific edge recovery files
│   ├── final_results.json       # Compiled evaluation results
│   ├── plots                    # Visualization outputs
│   │   ├── {dataset_name}       # Dataset-specific directory (repeated for each dataset)
│   │   │   └── heatmaps         # Confusion matrices by algorithm
│   │   │       ├── act          # Average Commute Time results
│   │   │       ├── hits         # HITS algorithm results
│   │   │       ├── katz         # Katz index results
│   │   │       ├── lhn          # Leicht-Holme-Newman results
│   │   │       └── ra           # Resource Allocation results
│   │   └── {dataset_name}_recovery_plot.png  # Recovery plot for each dataset
│   ├── progress.json            # Progress tracking metadata
│   └── progress_log.json        # Detailed execution logs
└── src                          # Source code
    ├── algorithms               # Algorithm implementations
    │   ├── __init__.py
    │   ├── act.py               # Average Commute Time
    │   ├── enhanced_predictor.py # Main predictor class
    │   ├── hits.py              # HITS algorithm
    │   ├── imports.py           # Common imports
    │   ├── katz.py              # Katz index
    │   └── lhn.py               # Leicht-Holme-Newman
    ├── data                     # Data management
    │   ├── __init__.py
    │   └── graph_manager.py     # Graph loading and processing
    └── evaluation               # Evaluation framework
        ├── __init__.py
        └── evaluator.py         # Evaluation logic

Code Flow

The pipeline is modular and follows these steps:

  1. Data Loading (graph_manager.py)

    • Downloads and preprocesses datasets if not present.
    • Loads graphs in a standard format for downstream tasks.
  2. Link Prediction (enhanced_predictor.py)

    • Initializes with a graph.
    • Implements all prediction algorithms.
    • Creates training graphs by removing edges.
    • Predicts missing links using various algorithms.
  3. Evaluation (evaluator.py)

    • Implements edge recovery evaluation methodology.
    • Calculates metrics like true/false positive rates.
    • Generates confusion matrices and performance plots.
    • Saves results in CSV format and as visualizations.
  4. Execution Management (main.py)

    • Parses command line arguments.
    • Orchestrates the evaluation process.
    • Handles logging and progress tracking.
    • Compiles final results.
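The command-line entry point described in step 4 might be sketched as follows. This is a hedged illustration of how main.py could parse the flags documented under "Usage Instructions", not the repository's actual code:

```python
import argparse

def parse_args(argv=None):
    # CLI skeleton matching the flags shown in "Usage Instructions":
    # --datasets selects a subset of datasets, --output_dir redirects results.
    parser = argparse.ArgumentParser(description="Run link-prediction evaluations")
    parser.add_argument("--datasets", nargs="+", default=None,
                        help="Dataset names to evaluate; omit to run all datasets")
    parser.add_argument("--output_dir", default="results",
                        help="Directory for CSVs, plots, and JSON results")
    return parser.parse_args(argv)

args = parse_args(["--datasets", "Bitcoin-Alpha", "Bio-Proteins"])
print(args.datasets, args.output_dir)
```

With `args` in hand, the orchestrator would load each requested graph, run the predictors, and hand results to the evaluator.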

Algorithm Explanations

| Algorithm | Formula | Intuition |
| --- | --- | --- |
| LHN | $\frac{\text{common-neighbours}(u,v)}{\text{degree}(u) \cdot \text{degree}(v)}$ | Higher when nodes share many neighbours while having low degree themselves. |
| RA | $\sum_{z \in \text{common-neighbours}(u,v)} \frac{1}{\text{degree}(z)}$ | Sharing rare (low-degree) neighbours makes nodes more likely to connect. |
| HITS | $\text{hub-score}(u) \cdot \text{authority-score}(v)$ | Combines node roles as hubs (linking to others) and authorities (linked to). |
| Katz | $\sum_{\ell \ge 1} \beta^{\ell} A^{\ell} = (I - \beta A)^{-1} - I$ | Sums over all paths between nodes, weighting shorter paths more heavily. |
| ACT | $\frac{1}{L^+[u,u] + L^+[v,v] - 2L^+[u,v]}$ | Inverse of the expected random-walk commute time; a lower commute time (higher score) means the nodes are closer. |
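The neighbourhood-based scores in the table above are straightforward to compute by hand. The snippet below is an illustrative sketch on a toy graph using networkx, not the repository's own implementations:

```python
import networkx as nx

# Toy graph: a 4-cycle plus the diagonal 0-2
G = nx.Graph([(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)])

def lhn_score(G, u, v):
    # LHN: number of common neighbours / (deg(u) * deg(v))
    cn = len(list(nx.common_neighbors(G, u, v)))
    return cn / (G.degree(u) * G.degree(v))

def ra_score(G, u, v):
    # RA: sum of 1/deg(z) over common neighbours z
    return sum(1 / G.degree(z) for z in nx.common_neighbors(G, u, v))

# Nodes 1 and 3 share neighbours {0, 2}, each of degree 3
print(lhn_score(G, 1, 3))  # 2 / (2 * 2) = 0.5
print(ra_score(G, 1, 3))   # 1/3 + 1/3 ≈ 0.667
```

networkx also ships reference implementations of several of these scores (e.g. `nx.resource_allocation_index`) that can serve as a cross-check.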

Evaluation Framework

The framework evaluates link prediction algorithms using a systematic edge recovery methodology:

  1. Edge Removal:

    • A percentage of edges (configurable via EVALUATION_CONFIG["edge_removal_percentages"]) is randomly removed from the original graph.
    • Default removal percentages: [10, 20, 30, 40, 50].
  2. Training Graph Creation:

    • The remaining edges form the training graph, which is used for link prediction.
  3. Prediction:

    • Each algorithm ranks the non-edges of the training graph (which include the removed edges) by their predicted likelihood of forming a link.
  4. Evaluation:

    • The top-K predicted edges are compared against the removed edges to calculate recovery metrics.
    • Metrics include True Positive Rate (TPR), False Positive Rate (FPR), and Accuracy.
    • The process is repeated for num_runs (default: 20) to ensure statistical robustness.
  5. Visualization:

    • Results are visualized using plots (e.g., recovery curves, confusion matrices) with parameters defined in VIZ_CONFIG.
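A single trial of steps 1-4 can be sketched as follows. This is a minimal illustration of the methodology using networkx and resource allocation as the example scorer; the repository's evaluator adds multiple runs, multiple algorithms, and full metric bookkeeping:

```python
import random
import networkx as nx

def edge_recovery_trial(G, removal_frac=0.2, top_k=None, seed=0):
    # 1. Edge removal: hide a random fraction of the edges
    rng = random.Random(seed)
    edges = list(G.edges())
    removed = rng.sample(edges, int(removal_frac * len(edges)))

    # 2. Training graph: whatever remains after removal
    train = G.copy()
    train.remove_edges_from(removed)

    # 3. Prediction: score every non-edge of the training graph
    scored = sorted(nx.resource_allocation_index(train),
                    key=lambda t: t[2], reverse=True)

    # 4. Evaluation: fraction of removed edges recovered in the top-K
    k = top_k or len(removed)
    predicted = {frozenset((u, v)) for u, v, _ in scored[:k]}
    hidden = {frozenset(e) for e in removed}
    return len(predicted & hidden) / len(hidden)

G = nx.karate_club_graph()
print(edge_recovery_trial(G, removal_frac=0.2))
```

Averaging this trial over repeated random seeds corresponds to the `num_runs` repetitions described above.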

Confusion Matrix Interpretation

Confusion matrices are generated for each algorithm and edge removal percentage:

| | Predicted Positive | Predicted Negative |
| --- | --- | --- |
| Actual Positive | True Positive Rate (TPR) | False Negative Rate (FNR) |
| Actual Negative | False Positive Rate (FPR) | True Negative Rate (TNR) |
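The four entries are row-normalised counts. A small illustrative helper (not the repository's evaluator) makes the arithmetic explicit:

```python
def confusion_rates(tp, fp, fn, tn):
    # Row-normalise raw counts into the four rates:
    # TPR and FNR share the "actual positive" row (the removed edges),
    # FPR and TNR share the "actual negative" row (true non-edges).
    return {
        "TPR": tp / (tp + fn),  # removed edges recovered
        "FNR": fn / (tp + fn),  # removed edges missed
        "FPR": fp / (fp + tn),  # non-edges wrongly predicted
        "TNR": tn / (fp + tn),  # non-edges correctly rejected
    }

# Example: 10 removed edges, 8 recovered; 90 true non-edges, 2 false alarms
print(confusion_rates(tp=8, fp=2, fn=2, tn=88))  # TPR = 0.8, FPR ≈ 0.022
```

Each row therefore sums to 1, which is why the heatmaps can be read as per-class probabilities.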

Usage Instructions

Installation

Clone the repository:

git clone https://github.com/AswinBalamurugan/DDP_2025.git
cd DDP_2025

Install dependencies:

pip install -r requirements.txt

Running Evaluations

Run evaluation on all datasets:

python main.py

Run evaluation on specific datasets:

python main.py --datasets Bitcoin-Alpha Bio-Proteins

Specify custom output directory:

python main.py --output_dir custom_results

Results

  • CSV files: Detailed metrics for each algorithm, dataset, and edge removal percentage.
  • Recovery plots: Show how well each algorithm recovers missing edges at different removal percentages.
  • Confusion matrices: Visualise prediction accuracy for each algorithm.

Extending the Framework

Adding a new algorithm:

  • Add its implementation under src/algorithms/.
  • Register it in EnhancedLinkPredictor.
  • Update evaluation scripts if needed.
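As a hypothetical example of the first step, a common-neighbours scorer could be added as a new module (say, src/algorithms/cn.py) and then registered in EnhancedLinkPredictor alongside act, hits, katz, lhn, and ra. The module path and registration pattern here are assumptions about the codebase, not its exact API:

```python
import networkx as nx

def common_neighbours_score(G, u, v):
    # New algorithm: rank a candidate edge (u, v) by the number of
    # neighbours u and v already share in the training graph.
    return len(list(nx.common_neighbors(G, u, v)))

# Score every non-edge of a toy 4-cycle
G = nx.Graph([(0, 1), (1, 2), (2, 3), (3, 0)])
scores = {(u, v): common_neighbours_score(G, u, v) for u, v in nx.non_edges(G)}
print(scores)  # both diagonals score 2
```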

Adding a new dataset:

  • Add its URL and name in config/hyperparams.py.
  • Implement a loader in graph_manager.py.
