This repository contains the code to reproduce the experiments in the paper "Training Neural Networks on Data Sources with Unknown Reliability".
If you would like to apply this method to your own dataset, please use the python package `loss_adapted_plasticity`, which can be installed with:

```bash
pip install loss_adapted_plasticity
```
When data is generated by multiple sources, conventional training methods update models assuming equal reliability for each source, and do not consider their individual data quality during training. However, in many applications, sources have varied levels of reliability that can negatively affect the performance of a neural network. A key issue is that the quality of data for individual sources is often not known during training. Focusing on supervised learning, we aim to train neural networks on each data source for a number of steps proportional to the source's estimated relative reliability, using a dynamic weighting. This way, we allow training on all sources during the warm-up, and reduce learning on less reliable sources during the final training stages, when models have been shown to overfit to noise. We show, through diverse experiments, that this can significantly improve model performance when trained on mixtures of reliable and unreliable data sources, and maintain performance when models are trained on reliable sources only.
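As a rough intuition, one can picture a weighting of the following shape (a toy sketch only, our own illustration; the actual LAP rule is provided by the package described below): sources whose recent mean loss sits far above the across-source average are down-weighted in the later stages of training.

```python
# Toy illustration of dynamic, loss-based source weighting (our own
# sketch; this is NOT the exact LAP update rule from the paper).
def source_weights(loss_history, strength=1.0):
    """loss_history maps source id -> list of recent per-batch losses."""
    means = {s: sum(h) / len(h) for s, h in loss_history.items()}
    avg = sum(means.values()) / len(means)
    std = (sum((m - avg) ** 2 for m in means.values()) / len(means)) ** 0.5
    weights = {}
    for s, m in means.items():
        # standard deviations above the across-source average loss
        z = max(0.0, (m - avg) / (std + 1e-8))
        weights[s] = 1.0 / (1.0 + strength * z)
    return weights

# e.g. source 2 has a suspiciously high loss and gets down-weighted:
# source_weights({0: [0.5, 0.4], 1: [0.6, 0.5], 2: [2.0, 2.1]})
```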
The notebook (`.ipynb`) files contain the code to load the results and generate the figures and tables presented in the paper, and the python (`.py`) files contain the scripts to produce the results, which are saved in the `outputs` folder.
To use `SourceLossWeighting`, you should use the python package `loss_adapted_plasticity`. It only requires a small change to the training loop:
```python
import torch.nn as nn
from loss_adapted_plasticity import LossAdaptedPlasticity

# define the loss weighting with the desired parameters
loss_weighting = LossAdaptedPlasticity(
    history_length=LAP_HISTORY_LENGTH,
    warmup_iters=WARMUP_ITERS,
    depression_strength=DEPRESSION_STRENGTH,
    leniency=LENIENCY,
)

# ensure that your loss function
# returns a loss for each sample in the batch
criterion = nn.LossFunctionOfSomeKind(reduction="none")

for epoch in range(epochs):
    for data, target, sources in train_loader:
        optimizer.zero_grad()
        output = model(data)
        losses = criterion(output, target)
        # compute the weighted per-sample loss, then reduce it
        loss = loss_weighting(losses, sources).mean()
        loss.backward()
        optimizer.step()
```
The `loss_weighting` object will keep track of the loss history for each source, and will compute the weighted loss for each batch. It can be used with any loss function that returns a loss for each sample in the batch, and will return a weighted loss for each sample, which can then be reduced (usually by the mean or sum) to get the batch loss used in back-propagation.
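Note that the training loop above assumes `train_loader` yields a source id alongside each sample. If your dataset does not already do this, a minimal wrapper could look like the following (a sketch; `SourceLabelledDataset` and `source_ids` are our own illustrative names, not part of the package):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class SourceLabelledDataset(Dataset):
    """Wraps a (data, target) dataset so that each item also carries a
    per-sample source id, as the training loop above expects."""

    def __init__(self, base_dataset, source_ids):
        assert len(base_dataset) == len(source_ids)
        self.base_dataset = base_dataset
        self.source_ids = source_ids

    def __len__(self):
        return len(self.base_dataset)

    def __getitem__(self, idx):
        data, target = self.base_dataset[idx]
        return data, target, self.source_ids[idx]

# e.g. assign each sample to one of 10 hypothetical sources:
# source_ids = torch.randint(0, 10, (len(base_dataset),))
# train_loader = DataLoader(
#     SourceLabelledDataset(base_dataset, source_ids),
#     batch_size=64,
#     shuffle=True,
# )
```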
The repository is structured as follows:
```
source_loss_weighting
├── baselines                   # code for the baselines
│   ├── arfl
│   ├── cdr
│   ├── coteaching
│   ├── idpa
│   ├── nls
│   └── rrl
├── data                        # data gets downloaded to here
├── experiment_code             # the code for the test_synthetic experiments
│   ├── data_utils              # data loading and processing
│   └── utils                   # general utility code
├── graph_code                  # code to generate figures in the notebooks
├── loss_adapted_plasticity     # code used in all experiments to scale the gradients or losses
└── outputs                     # where the results are saved and will be loaded from
    ├── california_housing      # outputs of 13_california_housing_regression.py
    ├── cifar_10n_different_noise_results               # outputs of 4_test_cifar10n_different_noise_presnet.py
    ├── cifar_10n_different_noise_results_low_capacity  # outputs of 8_test_cifar10n_different_noise_low_capacity.py
    ├── cifar_different_noise_results                   # outputs of 7_test_synthetic_different_noise_presnet.py
    ├── difficult_data          # outputs of 10_difficult_data.py
    ├── ecg_results             # outputs of 3_test_ecg_different_noise.py
    ├── goemotions              # outputs of 6_test_goemotions.py
    ├── graphs                  # graphs from the notebooks
    ├── imagenet64_random_label_and_noise               # outputs of 12_imagenet64_random_label_and_noise.py
    ├── imdb_random_label       # outputs of 5_test_nlp_random_label.py
    ├── presnet_results         # outputs of 2_test_presnet.py and baselines
    │   └── baseline
    │       └── rrl
    ├── synthetic_results       # outputs of 1_test_synthetic.py and baselines
    │   └── baseline
    │       ├── arfl
    │       │   ├── cifar10
    │       │   ├── cifar100
    │       │   └── fmnist
    │       ├── co-teaching
    │       └── idpa
    ├── synthetic_results_batch_multiple_sources_varied_sizes  # outputs of 9_test_synthetic_batches_multiple_source_varied_sizes.py
    ├── tiny_imagenet_random_label                      # outputs of 11_tiny_imagenet_random_label.py
    └── toy_example             # outputs of 0_test_hparams.ipynb
```
This code was tested with python 3.11.5 and pip 24.0, and the required python packages are located in the `requirements.txt` file.

To install the correct python and pip versions (`catalyst` requires pip 24.0), run:

```bash
conda create -n lap python==3.11.5 pip==24.0
conda activate lap
```

To install the packages, run:

```bash
pip install -r requirements.txt
```
This will install the following packages and versions, which were the ones this code was tested with:

```
numpy==1.26.2
pandas==2.1.3
pyarrow==14.0.2
tqdm==4.66.1
requests==2.31.0
matplotlib==3.8.2
seaborn==0.13.0
pyyaml==6.0.1
scikit-learn==1.3.2
torch==2.1.1
torchvision==0.16.1
torchtext==0.16.0
tensorboard==2.15.1
catalyst==22.4
datasets==2.19.0
wfdb==4.1.2
faiss-cpu==1.7.4
```
All figures will be saved in `outputs/graphs`, whilst all tables will be displayed within the corresponding notebook. All results are contained within this repository, so if you want to load them without re-running experiments, see the relevant notebook for the figure or table you are interested in.
The data for the experiments is downloaded automatically when running the scripts, and is saved in the `data` folder, which is created in the root directory of the repository. However, Imagenet and Tiny-Imagenet require a license agreement to be accepted first. Two scripts in the `data` folder download the data for these datasets: `download_imagenet.txt` downloads the Imagenet data, and `download_tiny_imagenet.txt` downloads the Tiny-Imagenet data. These scripts should be run from the `data` folder, and download the data into the structure that the Pytorch datasets in the respective experiment scripts expect.
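For example (an assumption on our part: the README does not specify the exact invocation, and the scripts carry a `.txt` extension, so check their contents first):

```bash
cd data
bash download_tiny_imagenet.txt
```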
Figure 1 is produced by running the code in `0_lap_demonstration.ipynb`, which uses a synthetically produced graph to demonstrate the intuitions behind how LAP works.
The results that produce Figure 2 can be run using the `0_test_hparams.ipynb` notebook. This will save the results in `outputs/toy_example/`. These results can then be loaded, and the figure generated, by running the code in `0_test_hparams.ipynb`.
In `0_test_compute.ipynb`, we show how the compute scales with the number of sources and the batch size. This is useful for understanding the time complexity of using LAP during training.
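For a quick back-of-the-envelope check outside the notebook, one could time the weighting call directly (our own sketch, not the notebook's code; the hyperparameter values are illustrative only):

```python
import time

import torch
from loss_adapted_plasticity import LossAdaptedPlasticity

# time the weighting call as the number of sources grows
for n_sources in (10, 100, 1000):
    loss_weighting = LossAdaptedPlasticity(
        history_length=25,
        warmup_iters=100,
        depression_strength=1.0,
        leniency=1.0,
    )
    losses = torch.rand(128)
    sources = torch.randint(0, n_sources, (128,))
    start = time.perf_counter()
    for _ in range(100):
        loss_weighting(losses, sources)
    print(f"{n_sources} sources: {(time.perf_counter() - start) / 100:.6f}s per call")
```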
The results in Table 1 are generated by running the `1_test_synthetic.py` script. Given the dataset name, noise type, and model, this will run the experiments using LAP training and the standard training methods. The results will be saved in `outputs/synthetic_results/` as a JSON file. These results can then be loaded, and the table generated, by running the code in `1_test_synthetic.ipynb`.

An example command to run `1_test_synthetic.py`:

```bash
python 1_test_synthetic.py --seed 42 --runs 1 2 3 4 5 --device cuda
```
Within the script, the following short-hand is used for the noise types (illustrative sketches of two of them follow the list):

- `no_c`: No Corruption
- `c_cs`: Chunk Shuffle
- `c_rl`: Random Label
- `c_lbs`: Batch Label Shuffle
- `c_lbf`: Batch Label Flip
- `c_ns`: Added Noise
- `c_no`: Replace With Noise
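The following are our own illustrative sketches of two of these corruptions; the repository's `experiment_code` may differ in detail:

```python
import torch

def random_label(targets: torch.Tensor, num_classes: int, frac: float) -> torch.Tensor:
    """`c_rl`: replace a fraction of the labels with uniformly random ones."""
    corrupt = torch.rand(len(targets)) < frac
    targets = targets.clone()
    targets[corrupt] = torch.randint(0, num_classes, (int(corrupt.sum()),))
    return targets

def added_noise(inputs: torch.Tensor, std: float = 1.0) -> torch.Tensor:
    """`c_ns`: add Gaussian noise to the inputs."""
    return inputs + std * torch.randn_like(inputs)
```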
The baseline results are saved in `outputs/synthetic_results/baseline/` and are produced by running the `synthetic_baseline_experiments.py` script found in either `baseline/coteaching` for Co-teaching, `baseline/idpa` for IDPA, `baseline/cdr` for CDR, or `baseline/nls` for Label Smoothing.
The results in Table 2 are produced by running `2_test_presnet.py`. This will run the experiments using LAP training only.

The code in `2_test_presnet.ipynb` can be used to produce the table in the paper, which will also load the baseline results located in `outputs/presnet_results/baseline/`.

An example command to run `2_test_presnet.py`:

```bash
python 2_test_presnet.py --seed 2 --runs 1 2 3 4 5 --device cuda
```

The baseline results are saved in `outputs/presnet_results/baseline/` and are produced by running the `synthetic_baseline_experiments.py` script found in `baseline/rrl`.
The results shown in Figure 3a can be produced by running the `3_test_ecg_different_noise.py` script. This will run the experiments using LAP training and the standard training methods, applied to a 1D ResNet on the ECG dataset tested (PTB-XL). The results will be saved in `outputs/ecg_results/`.

The figure itself can be produced by running the code in `3_test_ecg_different_noise.ipynb`, which will load and process the results before plotting and saving the figure.

An example command to run `3_test_ecg_different_noise.py`:

```bash
python 3_test_ecg_different_noise.py --seed 2 --runs 1 2 3 4 5 --device cuda
```
The results shown in Figure 3b are produced by running the script `4_test_cifar10n_different_noise_presnet.py`, which will run the RRL baseline with and without LAP training on CIFAR-10N with varied noise levels. The results of this will be saved in `outputs/cifar_10n_different_noise_results/`.

These results will be loaded and processed to produce Figure 3b by running the code provided in `4_test_cifar10n_different_noise_presnet.ipynb`.

An example command to run `4_test_cifar10n_different_noise_presnet.py`:

```bash
python 4_test_cifar10n_different_noise_presnet.py --seed 2 --runs 1 2 3 4 5 --device cuda
```
The results in Table 3 are generated by running the script `5_test_nlp_random_label.py`, which runs the natural language task on the IMDB dataset (downloaded automatically). These results will be saved in `outputs/imdb_random_label/`.

To produce the table, please run the code in `5_test_nlp_random_label.ipynb`, which will load and process the results before generating the table.

An example command to run `5_test_nlp_random_label.py`:

```bash
python 5_test_nlp_random_label.py --seed 2 --runs 1 2 3 4 5 --device cuda
```

The baseline results are also saved in `outputs/imdb_random_label/` and are produced by running the `IMDB_baseline_experiments.py` script found in either `baseline/coteaching` for Co-teaching, `baseline/idpa` for IDPA, `baseline/cdr` for CDR, or `baseline/nls` for Label Smoothing.
The results in Table 4 are generated by running the script `6_test_goemotions.py`, which runs the natural language task on the GoEmotions dataset with imbalanced sources. The data will be automatically downloaded. These results will be saved in `outputs/goemotions/`.

To produce the table, please run the code in `6_test_goemotions.ipynb`, which will load and process the results before generating the table.

An example command to run `6_test_goemotions.py`:

```bash
python 6_test_goemotions.py --seed 2 --runs 1 2 3 4 5 --device cuda
```

The baseline results are also saved in `outputs/goemotions/` and are produced by running the `goemotions_baseline_experiments.py` script found in either `baseline/coteaching` for Co-teaching, `baseline/idpa` for IDPA, `baseline/cdr` for CDR, or `baseline/nls` for Label Smoothing.
This figure is produced by running the code in the notebook `0_test_hparams.ipynb`.
This table is generated by running the code in the notebook `1_test_synthetic.ipynb`, which calculates the percentage difference of the values in Table 1.
The results in Table 6 are generated using the script `7_test_synthetic_different_noise_presnet.py`. This will run the experiments using LAP training and the standard training methods, applied to RRL (a baseline from our paper) on CIFAR-10. The results will be saved in `outputs/cifar_different_noise_results/`.

These results can then be loaded, and the table generated, by running the code in `7_test_synthetic_different_noise.ipynb`.

An example command to run `7_test_synthetic_different_noise_presnet.py`:

```bash
python 7_test_synthetic_different_noise_presnet.py --corruption_level 0.5 --seed 2 --runs 1 2 3 4 5 --device cuda
```
The results in Figure 5 are generated by running the script `8_test_cifar10n_different_noise_low_capacity.py`, which will run the same experiment as in Figure 3b, except with the lower capacity model used to produce the results in Table 1. These results will be saved in `outputs/cifar_10n_different_noise_results_low_capacity/`.

To produce the figure, please run the code in `8_test_cifar10n_different_noise_low_capacity.ipynb`, which will load and process the results before plotting and saving the figure.

An example command to run `8_test_cifar10n_different_noise_low_capacity.py`:

```bash
python 8_test_cifar10n_different_noise_low_capacity.py --seed 2 --runs 1 2 3 4 5 --device cuda
```
The results in Table 7 are generated using the script `9_test_synthetic_batches_multiple_source_varied_sizes.py`. This will run the CIFAR-10 experiments using LAP training with much larger numbers of sources. The results will be saved in `outputs/synthetic_results_batch_multiple_sources_varied_sizes`.

These results can then be loaded, and the table generated, by running the code in `9_test_synthetic_batches_multiple_source_varied_sizes.ipynb`. The baseline results can be generated by running the `synthetic_baseline_experiments_low_capacity_cnn.py` script found in `baseline/idpa`, `baseline/coteaching`, `baseline/cdr`, and `baseline/nls`.

An example command to run `9_test_synthetic_batches_multiple_source_varied_sizes.py`:

```bash
python 9_test_synthetic_batches_multiple_source_varied_sizes.py --seed 2 --runs 1 2 3 4 5 --device cuda
```
The results in Table 8 and Table 9 are generated using the script `10_difficult_data.py`. This will run the experiments with a mixture of MNIST and CIFAR-10 data, using LAP and standard training, to understand the robustness of LAP. The results will be saved in `outputs/difficult_data`.

The tables can then be generated by running the code in `10_difficult_data.ipynb`.

An example command to run `10_difficult_data.py`:

```bash
python 10_difficult_data.py
```
The results in Table 10 are generated using the script `11_tiny_imagenet_random_label.py`. This will run the Tiny-Imagenet experiments using LAP training for the original data and random labelling. The results will be saved in `outputs/tiny_imagenet_random_label`.

These results can then be loaded, and the table generated, by running the code in `11_tiny_imagenet_random_label.ipynb`.

An example command to run `11_tiny_imagenet_random_label.py`:

```bash
python 11_tiny_imagenet_random_label.py --seed 2 --runs 1 2 3 4 5 --device cuda --scheduler
```
The results in Table 11, Table 12, and Figure 7 are generated using the script `12_imagenet64_random_label_and_noise.py`. This will run the Imagenet experiments using LAP training for the original data and random labelling. The results will be saved in `outputs/imagenet64_random_label_and_noise`.

These results can then be loaded, and the table and figure generated, by running the code in `12_imagenet64_random_label_and_noise.ipynb`.

An example command to run `12_imagenet64_random_label_and_noise.py`:

```bash
python 12_imagenet64_random_label_and_noise.py --seed 2 --runs 1 2 3 4 5 --device cuda --scheduler
```
The results in Table 13 are generated using the script `13_california_housing_regression.py`. This will run the regression experiments on the California Housing dataset using LAP training for the original data and random labelling. The results will be saved in `outputs/california_housing`.

These results can then be loaded, and the table generated, by running the code in `13_california_housing_regression.ipynb`.

An example command to run `13_california_housing_regression.py`:

```bash
python 13_california_housing_regression.py --seed 2 --runs 1 2 3 4 5 --device cuda
```
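Since LAP only needs a per-sample loss, the same training-loop pattern shown earlier carries over to regression; a minimal sketch (our own, not necessarily this script's exact setup):

```python
import torch.nn as nn

# per-sample squared error, reduced by the mean after weighting
criterion = nn.MSELoss(reduction="none")
```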