Link Prediction Framework

A comprehensive framework for graph-based link prediction with multiple algorithms and evaluation methodologies. This repository implements state-of-the-art link prediction techniques and provides automated testing across multiple real-world network datasets.

Project Objective

The primary objective of this framework is to evaluate and compare the performance of various link prediction algorithms across diverse network datasets. By systematically removing edges and attempting to predict them, we can measure each algorithm's effectiveness at recovering missing connections in different types of networks.

This framework enables researchers and data scientists to:

Test multiple link prediction algorithms on various graph datasets
Compare algorithm performance using standardized metrics
Visualize results through recovery plots and confusion matrices
Process and manage large network datasets efficiently

Implementation Details

Supported Datasets

The framework supports evaluation on eight diverse real-world network datasets, sourced from publicly available repositories. Below are the datasets and their download URLs:

Dataset Name	Description	Download URL
Bitcoin-Alpha	Trust network from the Bitcoin Alpha platform	Download
Bio-Proteins	Protein interaction network	Download
HepTh-Collab	Collaboration network of HEP-TH researchers	Download
Transport	Air transportation network (OpenFlights)	Download
Enron-Email	Email communication network from Enron	Download
EU-Email	Email network from a European research institution	Download
Bitcoin-OTC	Trust network from Bitcoin OTC platform	Download
Reddit-Hyperlinks	Network of subreddit connections via hyperlinks	Download

Directory Structure

.
├── config
│   └── hyperparams.py           # Configuration parameters
├── data                         # Dataset storage directory
│   ├── bio-CE-CX.edges
│   ├── cit-HepTh.txt
│   ├── email-Enron.txt
│   └── ...                      # Other dataset files
├── main.py                      # Main execution script
├── results                      # Results directory
│   ├── csvs                     # Edge recovery CSVs for all datasets (contains one CSV per dataset)
│   │   └── ...                  # Dataset-specific edge recovery files
│   ├── final_results.json       # Compiled evaluation results
│   ├── plots                    # Visualization outputs
│   │   ├── {dataset_name}       # Dataset-specific directory (repeated for each dataset)
│   │   │   └── heatmaps         # Confusion matrices by algorithm
│   │   │       ├── act          # Average Commute Time results
│   │   │       ├── hits         # HITS algorithm results
│   │   │       ├── katz         # Katz index results
│   │   │       ├── lhn          # Leicht-Holme-Newman results
│   │   │       └── ra           # Resource Allocation results
│   │   └── {dataset_name}_recovery_plot.png  # Recovery plot for each dataset
│   ├── progress.json            # Progress tracking metadata
│   └── progress_log.json        # Detailed execution logs
└── src                          # Source code
    ├── algorithms               # Algorithm implementations
    │   ├── __init__.py
    │   ├── act.py               # Average Commute Time
    │   ├── enhanced_predictor.py # Main predictor class
    │   ├── hits.py              # HITS algorithm
    │   ├── imports.py           # Common imports
    │   ├── katz.py              # Katz index
    │   └── lhn.py               # Leicht-Holme-Newman
    ├── data                     # Data management
    │   ├── __init__.py
    │   └── graph_manager.py     # Graph loading and processing
    └── evaluation               # Evaluation framework
        ├── __init__.py
        └── evaluator.py         # Evaluation logic

Code Flow

The pipeline is modular and follows these steps:

Data Loading (graph_manager.py)
- Downloads and preprocesses datasets if not present.
- Loads graphs in a standard format for downstream tasks.
Link Prediction (enhanced_predictor.py)
- Initializes with a graph.
- Implements all prediction algorithms.
- Creates training graphs by removing edges.
- Predicts missing links using various algorithms.
Evaluation (evaluator.py)
- Implements edge recovery evaluation methodology.
- Calculates metrics like true/false positive rates.
- Generates confusion matrices and performance plots.
- Saves results in CSV format and as visualizations.
Execution Management (main.py)
- Parses command line arguments.
- Orchestrates the evaluation process.
- Handles logging and progress tracking.
- Compiles final results.

Algorithm Explanations

Algorithm	Formula	Intuition
LHN	$\frac{common-neighbours(u,v)}{degree(u) \cdot degree(v)}$	Higher if nodes share many neighbours and have low degree.
RA	$\sum_{z \in \text{common-neighbours}(u,v)} \frac{1}{\text{degree}(z)}$	Nodes share rare neighbours, making them more likely to connect.
HITS	$\text{hub-score}(u) \cdot \text{authority-score}(v)$	Combines node roles as hubs (linking to others) and authorities (linked to).
Katz	$(I - \beta A)^{-1} - I$	Sums all paths between nodes, with shorter paths weighted more heavily.
ACT	$\frac{1}{L^+[u,u] + L^+[v,v] - 2L^+[u,v]}$	Measures expected random walk commute time; lower values indicate nodes are closer.

Evaluation Framework

The framework evaluates link prediction algorithms using a systematic edge recovery methodology:

Edge Removal:
- A percentage of edges (configurable via EVALUATION_CONFIG["edge_removal_percentages"]) is randomly removed from the original graph.
- Default removal percentages: [10, 20, 30, 40, 50].
Training Graph Creation:
- The remaining edges form the training graph, which is used for link prediction.
Prediction:
- Each algorithm ranks non-existing edges (including the removed ones) based on their likelihood of forming a link.
Evaluation:
- The top-K predicted edges are compared against the removed edges to calculate recovery metrics.
- Metrics include True Positive Rate (TPR), False Positive Rate (FPR), and Accuracy.
- The process is repeated for num_runs (default: 20) to ensure statistical robustness.
Visualization:
- Results are visualized using plots (e.g., recovery curves, confusion matrices) with parameters defined in VIZ_CONFIG.

Confusion Matrix Interpretation

Confusion matrices are generated for each algorithm and edge removal percentage:

	Predicted Positive	Predicted Negative
Actual Positive	True Positive Rate	False Negative Rate
Actual Negative	False Positive Rate	True Negative Rate

Usage Instructions

Installation

Clone the repository:

git clone https://github.com/AswinBalamurugan/DDP_2025.git
cd DDP_2025

Install dependencies:

pip install -r requirements.txt

Running Evaluations

Run evaluation on all datasets:

python main.py

Run evaluation on specific datasets:

python main.py --datasets Bitcoin-Alpha Bio-Proteins

Specify custom output directory:

python main.py --output_dir custom_results

Results

CSV files: Detailed metrics for each algorithm, dataset, and edge removal percentage.
Recovery plots: Show how well each algorithm recovers missing edges at different removal percentages.
Confusion matrices: Visualise prediction accuracy for each algorithm.

Extending the Framework

Adding a new algorithm:

Add its implementation under src/algorithms/.
Register it in EnhancedLinkPredictor.
Update evaluation scripts if needed.

Adding a new dataset:

Add its URL and name in config/hyperparams.py.
Implement a loader in graph_manager.py.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Link Prediction Framework

Project Objective

Implementation Details

Supported Datasets

Directory Structure

Code Flow

Algorithm Explanations

Evaluation Framework

Confusion Matrix Interpretation

Usage Instructions

Installation

Running Evaluations

Results

Extending the Framework

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
config		config
results		results
src		src
.DS_Store		.DS_Store
README.md		README.md
main.py		main.py
requirements.txt		requirements.txt

RamanLab/SKG_Drug_repurposing

Folders and files

Latest commit

History

Repository files navigation

Link Prediction Framework

Project Objective

Implementation Details

Supported Datasets

Directory Structure

Code Flow

Algorithm Explanations

Evaluation Framework

Confusion Matrix Interpretation

Usage Instructions

Installation

Running Evaluations

Results

Extending the Framework

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages