This is the official implementation of the following paper:
IceBerg: Debiased Self-Training for Class-Imbalanced Node Classification (WWW'25) [Paper]
Zhixun Li, Dingshuo Chen, Tong Zhao, Daixin Wang, Hongrui Liu, Zhiqiang Zhang, Jun Zhou*, Jeffrey Xu Yu*
In this work, we propose IceBerg, a debiased self-training framework that addresses the class-imbalance and few-shot challenges for GNNs simultaneously. We find that leveraging unlabeled nodes can significantly enhance the performance of GNNs in class-imbalanced and few-shot scenarios, and that even small, surgical modifications can lead to substantial performance improvements.
- Plug-and-play: Substantially improves the performance of existing baselines as a plug-and-play module.
- Simplicity: Requires only a few additional lines of code.
- Versatility: Achieves state-of-the-art performance in both class-imbalanced and few-shot node classification tasks.
- Lightweight: Matches or exceeds the efficiency of BASE balancing methods.
The code requires the following dependencies:
python>=3.9
torch==2.4.0
torch-geometric==2.6.1
ogb==1.3.6
scikit-learn==1.5.2
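For reference, a minimal setup assuming a pip-based environment (the exact installation commands are not part of the repository's instructions, and torch-geometric may require platform-specific wheels):

```bash
pip install torch==2.4.0 torch-geometric==2.6.1 ogb==1.3.6 scikit-learn==1.5.2
```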
If you want to use our proposed Double Balancing, you only need to add the following lines of code:
```python
# Double Balancing
if not self.args.no_pseudo and epoch >= self.args.warmup:
    # Estimate the class distribution of the pseudo-labeled nodes
    self.class_num_list_u = torch.tensor(
        [(self.pred_label[self.pseudo_mask] == i).sum().item() for i in range(self.num_cls)]
    )
    # Unsupervised loss, balanced by the estimated pseudo class distribution
    loss += criterion_u(
        output[self.pseudo_mask],
        self.pred_label[self.pseudo_mask],
        self.class_num_list_u,
    ) * self.args.lamda
```
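Here, `criterion_u` is the class-balanced loss applied to the pseudo-labeled nodes, `self.pred_label` holds the model's pseudo labels, and `self.pseudo_mask` selects the nodes they are assigned to. As a minimal sketch of what such a criterion can look like, the following balanced-softmax-style loss offsets the logits by the log of the estimated pseudo class priors; the function name and signature follow the snippet above, but this particular implementation is an illustrative assumption, not necessarily the one used in the repository:

```python
import torch
import torch.nn.functional as F

def criterion_u(logits, labels, class_num_list, tau=1.0):
    """Balanced-softmax-style unsupervised loss (illustrative sketch).

    logits:         [N, C] model outputs on pseudo-labeled nodes
    labels:         [N]    pseudo labels
    class_num_list: [C]    estimated pseudo class counts
    """
    # Turn the counts into class priors; clamp guards against empty classes.
    prior = class_num_list.float().to(logits.device)
    prior = prior / prior.sum().clamp(min=1.0)
    # Shift the logits by the log-priors so that majority pseudo classes
    # contribute smaller gradients than minority ones.
    adjusted = logits + tau * torch.log(prior + 1e-12)
    return F.cross_entropy(adjusted, labels)
```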
If you want to try reproducing the baseline methods, simply run:
```bash
bash run_baseline.sh
```
If you want to try reproducing the performance of IceBerg, simply run:
```bash
bash run_iceberg.sh
```
We have incorporated several baseline methods and benchmark datasets. Statistics of the benchmark datasets are as follows:
| Dataset | Type | #nodes | #edges | #features | #classes |
|---|---|---|---|---|---|
| Cora | Homophily | 2,708 | 10,556 | 1,433 | 7 |
| CiteSeer | Homophily | 3,327 | 9,104 | 3,703 | 6 |
| PubMed | Homophily | 19,717 | 88,648 | 500 | 3 |
| CS | Homophily | 18,333 | 163,788 | 6,805 | 15 |
| Physics | Homophily | 34,493 | 495,924 | 8,415 | 5 |
| ogbn-arxiv | Homophily | 169,343 | 1,116,243 | 128 | 40 |
| CoraFull | Homophily | 19,793 | 126,842 | 8,710 | 70 |
| Penn94 | Heterophily | 41,554 | 1,362,229 | 5 | 2 |
| Roman-Empire | Heterophily | 22,662 | 32,927 | 300 | 18 |
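Cora, CiteSeer, and PubMed ship with torch-geometric, so one quick way to inspect these benchmarks is sketched below; this is a generic torch-geometric loading example, and the repository's own loaders may apply different splits or imbalance ratios:

```python
from torch_geometric.datasets import Planetoid

# Download/load Cora and check that the statistics match the table above.
dataset = Planetoid(root='data/Planetoid', name='Cora')
data = dataset[0]
print(data.num_nodes, data.num_edges, dataset.num_features, dataset.num_classes)
# -> 2708 10556 1433 7
```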
Our proposed DB and IceBerg achieve significant improvements when combined with several BASE balancing methods.
Owing to IceBerg's outstanding ability to leverage unsupervised signals, it also achieves state-of-the-art results in few-shot node classification scenarios.
We acknowledge these excellent works for providing open-source code: GraphENS, GraphSHA, TAM, BAT, D2PT.
Please consider citing our work if you find it helpful:
```bibtex
@article{li2025iceberg,
  title={IceBerg: Debiased Self-Training for Class-Imbalanced Node Classification},
  author={Li, Zhixun and Chen, Dingshuo and Zhao, Tong and Wang, Daixin and Liu, Hongrui and Zhang, Zhiqiang and Zhou, Jun and Yu, Jeffrey Xu},
  journal={arXiv preprint arXiv:2502.06280},
  year={2025}
}
```