Training Neural Networks on Data Sources with Unknown Reliability

Introduction

This repository contains the code to reproduce the experiments in the paper "Training Neural Networks on Data Sources with Unknown Reliability".

If you would like to implement this method on your own dataset, please use the python package loss_adapted_plasticity, which can be installed with:

pip install loss_adapted_plasticity

Abstract

When data is generated by multiple sources, conventional training methods update models assuming equal reliability for each source and do not consider their individual data quality during training. However, in many applications, sources have varied levels of reliability that can negatively affect the performance of a neural network. A key issue is that the quality of the data from individual sources is often not known during training. Focusing on supervised learning, we aim to train neural networks on each data source for a number of steps proportional to the source's estimated relative reliability, using a dynamic weighting. This way, we allow training on all sources during the warm-up, and reduce learning on less reliable sources during the final training stages, when models have been shown to overfit to noise. We show, through diverse experiments, that this can significantly improve model performance when models are trained on mixtures of reliable and unreliable data sources, and maintain performance when models are trained on reliable sources only.

Repository information

The notebook (.ipynb) files contain the code to load the results and generate the figures and tables presented in the paper, and the python (.py) files contain the scripts that produce the results, which are saved in the outputs folder.

To use SourceLossWeighting in your own code, use the python package loss_adapted_plasticity. It only requires a small change to the training loop:

# import path assumed from the package name #
from loss_adapted_plasticity import LossAdaptedPlasticity

import torch.nn as nn

# define the loss weighting with the desired parameters #
loss_weighting = LossAdaptedPlasticity(
    history_length=LAP_HISTORY_LENGTH,
    warmup_iters=WARMUP_ITERS,
    depression_strength=DEPRESSION_STRENGTH,
    leniency=LENIENCY,
)

# ensure that your loss function #
# returns a loss for each sample in the batch #
# (nn.LossFunctionOfSomeKind is a placeholder, e.g. nn.CrossEntropyLoss) #
criterion = nn.LossFunctionOfSomeKind(reduction="none")

for epoch in range(epochs):
    for data, target, sources in train_loader:
        optimizer.zero_grad()
        output = model(data)
        losses = criterion(output, target)

        # compute the weighted loss #
        loss = loss_weighting(losses, sources).mean()

        loss.backward()
        optimizer.step()
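
The loop above assumes that train_loader yields a source identifier for every sample. A minimal sketch of one way to attach source IDs to an existing (data, target) dataset (the wrapper class and names here are illustrative, not part of the package):

import torch
from torch.utils.data import Dataset, DataLoader

class SourceLabelledDataset(Dataset):
    # wraps a (data, target) dataset and attaches a source ID per sample

    def __init__(self, base_dataset, source_ids):
        assert len(base_dataset) == len(source_ids)
        self.base_dataset = base_dataset
        self.source_ids = torch.as_tensor(source_ids)

    def __len__(self):
        return len(self.base_dataset)

    def __getitem__(self, idx):
        data, target = self.base_dataset[idx]
        return data, target, self.source_ids[idx]

# train_loader now yields (data, target, sources) batches:
# train_loader = DataLoader(SourceLabelledDataset(base_dataset, source_ids), batch_size=64, shuffle=True)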

The loss_weighting object keeps track of the loss history for each source and computes the weighted loss for each batch in $O(S \times S_b + B)$ time (where $S$ is the number of unique sources, $S_b$ is the number of unique sources in a batch, and $B$ is the batch size) by computing the weighted mean and standard deviation online. It can be used with any loss function that returns a loss for each sample in the batch, and it returns a weighted loss for each sample, which can then be reduced (usually by the mean or sum) to get the batch loss used in back-propagation.
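
For intuition only, a toy sketch of this weighting idea (not the package's implementation; the function name, parameters, and values here are illustrative) might maintain an exponentially weighted mean loss per source and depress the losses of sources whose running mean sits well above the population of sources:

import torch

def lap_like_weighting(losses, sources, stats, alpha=0.05, leniency=1.0):
    # losses: per-sample losses; sources: per-sample source ids
    # stats: dict mapping source id -> running mean loss (EMA)
    for s in sources.unique().tolist():
        batch_mean = losses[sources == s].mean().item()
        stats[s] = stats.get(s, batch_mean)
        stats[s] += alpha * (batch_mean - stats[s])

    # compare each source's running mean against all sources seen so far
    source_means = torch.tensor(list(stats.values()))
    mu = source_means.mean()
    sigma = source_means.std(unbiased=False).clamp(min=1e-8)

    # trust in (0, 1): sources whose mean loss lies more than `leniency`
    # standard deviations above the population mean are depressed
    z = (torch.tensor([stats[int(s)] for s in sources]) - mu) / sigma
    trust = torch.sigmoid(leniency - z)
    return losses * trust

In the loop above, loss = lap_like_weighting(criterion(output, target), sources, stats).mean() would play the role of the weighting step; the real package additionally handles warm-up, history length, and depression strength, as shown earlier.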

Repository structure

The repository is structured as follows:

source_loss_weighting
==============================
   # code for the baselines
   ├─ baselines
      ├─ arfl
      ├─ cdr
      ├─ coteaching
      ├─ idpa
      ├─ nls
      ├─ rrl
==============================
   # data gets downloaded to here
   ├─ data
==============================
   # the code for the *test_synthetic* experiments
   ├─ experiment_code
      # data loading and processing
      ├─ data_utils 
      # general utility code
      ├─ utils 
==============================
   # code to generate figures in the notebooks
   ├─ graph_code 
==============================
   # code used in all experiments to scale the gradients or losses
   ├─ loss_adapted_plasticity
==============================
   # where the results are saved and will be loaded from
   ├─ outputs
      # outputs of 13_california_housing_regression.py
      ├─ california_housing
      # outputs of 4_test_cifar10n_different_noise_presnet.py
      ├─ cifar_10n_different_noise_results
      # outputs of 8_test_cifar10n_different_noise_low_capacity.py
      ├─ cifar_10n_different_noise_results_low_capacity
      # outputs of 7_test_synthetic_different_noise_presnet.py
      ├─ cifar_different_noise_results
      # outputs of 10_difficult_data.py
      ├─ difficult_data
      # outputs of 3_test_ecg_different_noise.py
      ├─ ecg_results
      # outputs of 6_test_goemotions.py
      ├─ goemotions
      # graphs from the notebooks
      ├─ graphs 
      # outputs of 12_imagenet64_random_label_and_noise.py
      ├─ imagenet64_random_label_and_noise
      # outputs of 5_test_nlp_random_label.py
      ├─ imdb_random_label
      # outputs of 2_test_presnet.py and baselines
      ├─ presnet_results
         ├─ baseline
            ├─ rrl
      # outputs of 1_test_synthetic.py and baselines
      ├─ synthetic_results
         ├─ baseline
            ├─ arfl
               ├─ cifar10
               ├─ cifar100
               ├─ fmnist
         ├─ co-teaching
         ├─ idpa
      # outputs of 9_test_synthetic_batches_multiple_source_varied_sizes.py
      ├─ synthetic_results_batch_multiple_sources_varied_sizes
      # outputs of 11_tiny_imagenet_random_label.py
      ├─ tiny_imagenet_random_label
      # outputs of 0_test_hparams.ipynb
      ├─ toy_example
==============================

Requirements

This code was tested with python 3.11.5 and pip 24.0, and the required python packages are listed in the requirements.txt file.

To install the correct python and pip versions (catalyst requires pip 24.0), run:

conda create -n lap python==3.11.5 pip==24.0
conda activate lap

To install the packages, run:

pip install -r requirements.txt

This will install the following packages and versions, which were the ones this code was tested with:

numpy==1.26.2
pandas==2.1.3 
pyarrow==14.0.2
tqdm==4.66.1
requests==2.31.0
matplotlib==3.8.2
seaborn==0.13.0
pyyaml==6.0.1
scikit-learn==1.3.2
torch==2.1.1
torchvision==0.16.1
torchtext==0.16.0
tensorboard==2.15.1
catalyst==22.4
datasets==2.19.0
wfdb==4.1.2
faiss-cpu==1.7.4

Running the experiments in the paper

All figures will be saved in outputs/graphs, whilst all tables will be displayed within the corresponding notebook.

All results are contained within this repository, and so if you want to load them without re-running experiments, see the relevant notebook for the figure or table you are interested in.

Data

The data for the experiments is downloaded automatically when running the scripts and is saved in the data folder, which is created in the root directory of the repository. The exceptions are Imagenet and Tiny-Imagenet, which require a license agreement to be accepted before downloading. Two scripts in the data folder download these datasets: download_imagenet.txt downloads the Imagenet data, and download_tiny_imagenet.txt downloads the Tiny-Imagenet data. These scripts should be run from the data folder; they download the data into the structure that the Pytorch datasets in the respective experiment scripts expect.

Figure 3

Figure 3 is produced by running the code in 0_lap_demonstration.ipynb, which uses a synthetic example to demonstrate the intuition behind how LAP works.

Figure 4

The results that produce Figure 4 can be generated using the 0_test_hparams.ipynb notebook, which will save them in outputs/toy_example/. These results can then be loaded, and the figure generated, by running the code in the same notebook.

Compute example

In 0_test_compute.ipynb, we show how the compute scales with the number of sources and the batch size. This is useful for understanding the time complexity of using LAP during training.
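
A rough sketch of such a measurement (assuming the LossAdaptedPlasticity API shown above; the parameter values are illustrative only):

import time
import torch
from loss_adapted_plasticity import LossAdaptedPlasticity  # import path assumed

batch_size = 128
for n_sources in (10, 100, 1000):
    loss_weighting = LossAdaptedPlasticity(
        history_length=25,  # illustrative values
        warmup_iters=10,
        depression_strength=1.0,
        leniency=0.8,
    )
    losses = torch.rand(batch_size)
    sources = torch.randint(0, n_sources, (batch_size,))
    for _ in range(20):  # populate the loss history past warm-up
        loss_weighting(losses, sources)
    start = time.perf_counter()
    for _ in range(100):
        loss_weighting(losses, sources)
    print(n_sources, (time.perf_counter() - start) / 100)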

Table 1

The results in Table 1 are generated by running the 1_test_synthetic.py script. Given the dataset name, noise type, and model, this will run the experiments using LAP training and the standard training methods. The results will be saved in outputs/synthetic_results/ as a json file. These results can then be loaded, and the table generated, by running the code in 1_test_synthetic.ipynb.

python 1_test_synthetic.py --seed 42 --runs 1 2 3 4 5 --device cuda

Within the script, the following short-hand is used for the noise types:

  • no_c: No Corruption
  • c_cs: Chunk Shuffle
  • c_rl: Random Label
  • c_lbs: Batch Label Shuffle
  • c_lbf: Batch Label Flip
  • c_ns: Added Noise
  • c_no: Replace With Noise
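
If you need this mapping programmatically, for example when processing the saved results, an equivalent Python dict (illustrative; the scripts define their own constants):

NOISE_TYPES = {
    "no_c": "No Corruption",
    "c_cs": "Chunk Shuffle",
    "c_rl": "Random Label",
    "c_lbs": "Batch Label Shuffle",
    "c_lbf": "Batch Label Flip",
    "c_ns": "Added Noise",
    "c_no": "Replace With Noise",
}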

The baseline results are saved in outputs/synthetic_results/baseline/ and are produced by running the synthetic_baseline_experiments.py script found in either baseline/coteaching for Co-teaching, baseline/idpa for IDPA, baseline/cdr for CDR, or baseline/nls for Label Smoothing.

Table 2

The results in Table 2 are produced by running 2_test_presnet.py. This will run the experiments using LAP training only.

The code in 2_test_presnet.ipynb can be used to produce the table in the paper, which will also load the baseline results located in outputs/presnet_results/baseline/.

An example command to run 2_test_presnet.py:

python 2_test_presnet.py --seed 2 --runs 1 2 3 4 5 --device cuda

The baseline results are saved in outputs/presnet_results/baseline/ and are produced by running the synthetic_baseline_experiments.py script found in baseline/rrl.

Figure 5a

The results shown in Figure 5a can be produced by running the 3_test_ecg_different_noise.py script. This will run the experiments using LAP training and the standard training methods applied to a 1D ResNet on the ECG dataset tested (PTB-XL). The results will be saved in outputs/ecg_results/.

The figure itself can be produced by running the code in 3_test_ecg_different_noise.ipynb, which will load and process the results before plotting and saving the figure.

An example command to run 3_test_ecg_different_noise.py:

python 3_test_ecg_different_noise.py --seed 2 --runs 1 2 3 4 5 --device cuda

Figure 5b

The results shown in Figure 5b are produced by running the script 4_test_cifar10n_different_noise_presnet.py, which will run the RRL baseline with and without LAP training on CIFAR-10N with varied noise levels. The results of this will be saved in outputs/cifar_10n_different_noise_results/.

These results will be loaded and processed to produce Figure 5b by running the code provided in 4_test_cifar10n_different_noise_presnet.ipynb.

An example command to run 4_test_cifar10n_different_noise_presnet.py:

python 4_test_cifar10n_different_noise_presnet.py --seed 2 --runs 1 2 3 4 5 --device cuda

Table 3

The results in Table 3 are generated by running the script 5_test_nlp_random_label.py, which will run the natural language task on the IMDB dataset (downloaded automatically). These results will be saved in outputs/imdb_random_label/.

To produce the table, please run the code in 5_test_nlp_random_label.ipynb, which will load and process the results before generating the table.

An example command to run 5_test_nlp_random_label.py:

python 5_test_nlp_random_label.py --seed 2 --runs 1 2 3 4 5 --device cuda

The baseline results are also saved in outputs/imdb_random_label/ and are produced by running the IMDB_baseline_experiments.py script found in either baseline/coteaching for Co-teaching, baseline/idpa for IDPA, baseline/cdr for CDR, or baseline/nls for Label Smoothing.

Table 5

The results in Table 5 are generated by running the script 6_test_goemotions.py, which will run the natural language task on the GoEmotions dataset with imbalanced sources. The data will be automatically downloaded. These results will be saved in outputs/goemotions/.

To produce the table, please run the code in 6_test_goemotions.ipynb, which will load and process the results before generating the table.

An example command to run 6_test_goemotions.py:

python 6_test_goemotions.py --seed 2 --runs 1 2 3 4 5 --device cuda

The baseline results are also saved in outputs/goemotions/ and are produced by running the goemotions_baseline_experiments.py script found in either baseline/coteaching for Co-teaching, baseline/idpa for IDPA, baseline/cdr for CDR, or baseline/nls for Label Smoothing.

Figure 6

This figure is produced by running the code in the notebook 0_test_hparams.ipynb.

Table 6

This table is generated by running the code in the notebook 1_test_synthetic.ipynb, which calculates the percentage difference of the values in Table 1.

Table 7

The results in Table 7 are generated using the script 7_test_synthetic_different_noise_presnet.py. This will run the experiments using LAP training and the standard training methods applied to RRL (a baseline from our paper) on CIFAR-10. The results will be saved in outputs/cifar_different_noise_results/.

These results can then be loaded, and the table generated, by running the code in 7_test_synthetic_different_noise.ipynb.

An example command to run 7_test_synthetic_different_noise_presnet.py:

python 7_test_synthetic_different_noise_presnet.py --corruption_level 0.5 --seed 2 --runs 1 2 3 4 5 --device cuda

Figure 7

The results in Figure 7 are generated by running the script 8_test_cifar10n_different_noise_low_capacity.py, which will run the same experiment as in Figure 5b, except with the lower capacity model used to produce the results in Table 1. These results will be saved in outputs/cifar_10n_different_noise_results_low_capacity/.

To produce the figure, please run the code in 8_test_cifar10n_different_noise_low_capacity.ipynb, which will load and process the results before plotting and saving the figure.

An example command to run 8_test_cifar10n_different_noise_low_capacity.py:

python 8_test_cifar10n_different_noise_low_capacity.py --seed 2 --runs 1 2 3 4 5 --device cuda

Table 8

The results in Table 8 are generated using the script 9_test_synthetic_batches_multiple_source_varied_sizes.py. This will run the CIFAR-10 experiments using LAP training with much larger numbers of sources. The results will be saved in outputs/synthetic_results_batch_multiple_sources_varied_sizes.

These results can then be loaded, and the table generated, by running the code in 9_test_synthetic_batches_multiple_source_varied_sizes.ipynb. The baseline results can be generated by running the synthetic_baseline_experiments_low_capacity_cnn.py script found in baseline/idpa, baseline/coteaching, baseline/cdr, and baseline/nls.

An example command to run 9_test_synthetic_batches_multiple_source_varied_sizes.py:

python 9_test_synthetic_batches_multiple_source_varied_sizes.py --seed 2 --runs 1 2 3 4 5 --device cuda

Table 9 and 10

The results in Table 9 and Table 10 are generated using the script 10_difficult_data.py. This will run the experiments with a mixture of MNIST and CIFAR-10 data using LAP and standard training to assess the robustness of LAP. The results will be saved in outputs/difficult_data.

The tables can then be generated by running the code in 10_difficult_data.ipynb.

An example command to run 10_difficult_data.py:

python 10_difficult_data.py

Table 11

The results in Table 11 are generated using the script 11_tiny_imagenet_random_label.py. This will run the Tiny-Imagenet experiments using LAP training for the original data and random labelling. The results will be saved in outputs/tiny_imagenet_random_label.

These results can then be loaded, and the table generated, by running the code in 11_tiny_imagenet_random_label.ipynb.

An example command to run 11_tiny_imagenet_random_label.py:

python 11_tiny_imagenet_random_label.py --seed 2 --runs 1 2 3 4 5 --device cuda --scheduler

Table 4, Table 12, and Figure 9

The results in Table 4, Table 12, and Figure 9 are generated using the script 12_imagenet64_random_label_and_noise.py. This will run the Imagenet experiments using LAP training for the original data and random labelling. The results will be saved in outputs/imagenet64_random_label_and_noise.

These results can then be loaded, and the table and figure generated, by running the code in 12_imagenet64_random_label_and_noise.ipynb.

An example command to run 12_imagenet64_random_label_and_noise.py:

python 12_imagenet64_random_label_and_noise.py --seed 2 --runs 1 2 3 4 5 --device cuda --scheduler

Table 13

The results in Table 13 are generated using the script 13_california_housing_regression.py. This will run the regression experiments on the California Housing dataset using LAP training for the original data and random labelling. The results will be saved in outputs/california_housing.

These results can then be loaded, and the table generated, by running the code in 13_california_housing_regression.ipynb.

An example command to run 13_california_housing_regression.py:

python 13_california_housing_regression.py --seed 2 --runs 1 2 3 4 5 --device cuda
