A Benchmark for Failure Detection under Distribution Shifts in Image Classification

IML-DKFZ/fd-shifts

Official Benchmark Implementation

📜 A Call to Reflect on Evaluation Practices for Failure Detection in Image Classification

The original paper establishing the FD-Shifts benchmark was presented as an Oral at ICLR 2023 (top 5%).

→ project page → paper link

📜 Understanding Silent Failures in Medical Image Classification

Our follow-up study on Failure Detection in Medical Image Classification was presented at MICCAI 2023.

→ project page → paper link → interactive tool SF-Visuals

📜 Overcoming Common Flaws in the Evaluation of Selective Classification Systems

Our paper on a revised evaluation protocol for Selective Classification Systems was accepted as a Spotlight paper at NeurIPS 2024.

→ project page → paper link → AUGRC implementation


Reliable application of machine learning-based decision systems in the wild is one of the major challenges currently investigated by the field. A large portion of established approaches aims to detect erroneous predictions by means of assigning confidence scores. This confidence may be obtained by either quantifying the model's predictive uncertainty, learning explicit scoring functions, or assessing whether the input is in line with the training distribution. Curiously, while these approaches all claim to address the same eventual goal of detecting failures of a classifier upon real-life application, they currently constitute largely separated research fields with individual evaluation protocols, which either exclude a substantial part of relevant methods or ignore large parts of relevant failure sources. In this work, we systematically reveal current pitfalls caused by these inconsistencies and derive requirements for a holistic and realistic evaluation of failure detection. To demonstrate the relevance of this unified perspective, we present a large-scale empirical study that, for the first time, enables benchmarking confidence scoring functions w.r.t. all relevant methods and failure sources. The revelation of a simple softmax response baseline as the overall best performing method underlines the drastic shortcomings of current evaluation in the abundance of publicized research on confidence scoring.

Holistic perspective on failure detection. Detecting failures should be seen in the context of the overarching goal of preventing silent failures of a classifier, which includes two tasks: preventing failures in the first place, as measured by the "robustness" of a classifier (Task 1), and detecting the non-prevented failures by means of CSFs (Task 2, focus of this work). For failure prevention across distribution shifts, a consistent task formulation exists (featuring accuracy as the primary evaluation metric), and various benchmarks have been released covering a large variety of realistic shifts (e.g. image corruption shifts, sub-class shifts, or domain shifts). In contrast, progress in the subsequent task of detecting the non-prevented failures by means of CSFs is currently obstructed by three pitfalls: 1) A diverse and inconsistent set of evaluation protocols for CSFs exists (MisD, SC, PUQ, OoD-D), impeding comprehensive competition. 2) Only a fraction of the spectrum of realistic distribution shifts, and thus of potential failure sources, is covered, diminishing the practical relevance of evaluation. 3) The task formulation in OoD-D fundamentally deviates from the stated purpose of detecting classification failures. Overall, the holistic perspective on failure detection reveals an obvious need for a unified and comprehensive evaluation protocol, in analogy to current robustness benchmarks, to make classifiers fit for safety-critical applications. Abbreviations: CSF: Confidence Scoring Function, OoD-D: Out-of-Distribution Detection, MisD: Misclassification Detection, PUQ: Predictive Uncertainty Quantification, SC: Selective Classification.

Citing This Work

If you use FD-Shifts, please cite our paper:

@inproceedings{
    jaeger2023a,
    title={A Call to Reflect on Evaluation Practices for Failure Detection in Image Classification},
    author={Paul F Jaeger and Carsten Tim L{\"u}th and Lukas Klein and Till J. Bungert},
    booktitle={International Conference on Learning Representations},
    year={2023},
    url={https://openreview.net/forum?id=YnkGMIh0gvX}
}

Installation

FD-Shifts requires Python version 3.10 or later. It is recommended to install FD-Shifts in its own environment (venv, conda environment, ...).

  1. Install an appropriate version of PyTorch. Check that CUDA is available and that the CUDA toolkit version is compatible with your hardware. The minimum required PyTorch version is currently 1.11.0. Testing and development were done with the PyTorch build for CUDA 11.3.

  2. Install FD-Shifts. This pulls in all dependencies, including some version of PyTorch, so it is strongly recommended that you install a compatible version of PyTorch beforehand. This also makes the fd-shifts CLI available to you.

    pip install git+https://github.com/iml-dkfz/fd-shifts.git
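Before installing, it can help to confirm that the PyTorch build already in the environment satisfies the requirements from step 1. The following is a minimal sketch, not part of FD-Shifts: the helper names `parse_version`, `meets_minimum`, and `report` are made up for illustration, and the version parsing is deliberately simple.

```python
import re

MINIMUM_TORCH = (1, 11, 0)  # minimum PyTorch version stated in this README


def parse_version(version: str) -> tuple:
    """Turn a version string such as '1.11.0+cu113' into (1, 11, 0)."""
    release = version.split("+", 1)[0]  # drop a local build tag like '+cu113'
    numbers = []
    for part in release.split(".")[:3]:
        match = re.match(r"\d+", part)  # keep leading digits, e.g. '0a0' -> 0
        numbers.append(int(match.group()) if match else 0)
    while len(numbers) < 3:  # pad short versions such as '2.1'
        numbers.append(0)
    return tuple(numbers)


def meets_minimum(version: str, minimum: tuple = MINIMUM_TORCH) -> bool:
    """Check a version string against the minimum required version."""
    return parse_version(version) >= minimum


def report() -> str:
    """Describe the installed PyTorch build, if there is one."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    status = "ok" if meets_minimum(torch.__version__) else "too old"
    cuda = torch.cuda.is_available()
    return f"torch {torch.__version__} ({status}), CUDA available: {cuda}"


print(report())
```

If the report shows an outdated version or no CUDA support, install a suitable PyTorch build first and only then run the pip command above.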

How to Integrate Your Own Use Case

To learn about extending FD-Shifts with your own models, datasets, and confidence scoring functions, check out the tutorial on extending FD-Shifts, which is also available as a Colab notebook.

Reproducing our results

While the following section on working with FD-Shifts describes the general usage, instructions for reproducing the results of specific publications are documented on the respective project pages.

Working with FD-Shifts

To use fd-shifts, you need to set the following environment variables:

export EXPERIMENT_ROOT_DIR=/absolute/path/to/your/experiments
export DATASET_ROOT_DIR=/absolute/path/to/datasets

Alternatively, you may write them to a file and source that before running fd-shifts, e.g.

mv example.env .env

Then edit .env to your needs and run

source .env
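A quick sanity check of the environment can catch a forgotten `source .env` before a long run starts. The sketch below is not part of FD-Shifts. The function name `env_problems` is hypothetical, and requiring absolute paths is an assumption based on the example values shown above.

```python
import os
from typing import Mapping

# The two variables this README asks you to set.
REQUIRED_VARS = ("EXPERIMENT_ROOT_DIR", "DATASET_ROOT_DIR")


def env_problems(env: Mapping[str, str]) -> list:
    """Return a human-readable problem for each missing or relative path."""
    problems = []
    for name in REQUIRED_VARS:
        value = env.get(name, "")
        if not value:
            problems.append(f"{name} is not set")
        elif not os.path.isabs(value):
            # Assumption: paths should be absolute, as in the examples above.
            problems.append(f"{name} is not an absolute path: {value}")
    return problems


# Check the current process environment and list anything that looks wrong.
for problem in env_problems(os.environ):
    print(problem)
```

The script prints nothing when both variables are set to absolute paths.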

To get an overview of available subcommands, run fd-shifts --help.

Data Folder Requirements

For the predefined experiments we expect the data to be in the following folder structure relative to the folder you set for $DATASET_ROOT_DIR.

<$DATASET_ROOT_DIR>
├── breeds
│   └── ILSVRC ⇒ ../imagenet/ILSVRC
├── imagenet
│   └── ILSVRC
├── cifar10
├── cifar100
├── corrupt_cifar10
├── corrupt_cifar100
├── svhn
├── tinyimagenet
├── tinyimagenet_resize
├── wilds_animals
│   └── iwildcam_v2.0
└── wilds_camelyon
    └── camelyon17_v1.0

For information on where to download these datasets and how to prepare them, please check out the dataset documentation.
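The folder layout above can be verified with a few lines of Python before starting any experiment. This is a minimal sketch rather than an FD-Shifts utility: `EXPECTED_DIRS` simply transcribes the tree shown above, and `missing_dataset_dirs` is a hypothetical helper name.

```python
from pathlib import Path

# Relative folders taken from the directory tree in this section.
EXPECTED_DIRS = (
    "breeds/ILSVRC",  # expected to be a symlink to ../imagenet/ILSVRC
    "imagenet/ILSVRC",
    "cifar10",
    "cifar100",
    "corrupt_cifar10",
    "corrupt_cifar100",
    "svhn",
    "tinyimagenet",
    "tinyimagenet_resize",
    "wilds_animals/iwildcam_v2.0",
    "wilds_camelyon/camelyon17_v1.0",
)


def missing_dataset_dirs(root) -> list:
    """List the expected folders that do not exist below root.

    Path.is_dir() follows symlinks, so a valid breeds/ILSVRC link counts
    as present and a dangling link counts as missing.
    """
    root = Path(root)
    return [rel for rel in EXPECTED_DIRS if not (root / rel).is_dir()]
```

Calling `missing_dataset_dirs(os.environ["DATASET_ROOT_DIR"])` returns an empty list when the full layout is in place. Only the datasets needed for the experiments you intend to run have to be present, so a non-empty result is not necessarily an error.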

Training

To get a list of all fully qualified names for all experiments in the paper, use

fd-shifts list-experiments

To run training for a specific experiment:

fd-shifts train --experiment=svhn_paper_sweep/devries_bbsvhn_small_conv_do1_run1_rew2.2

Alternatively, run training from a custom configuration file:

fd-shifts train --config=path/to/config/file

Check out fd-shifts train --help for more training options.

The launch subcommand allows for running multiple experiments, e.g. filtered by dataset:

fd-shifts launch --mode=train --dataset=cifar10

Check out fd-shifts launch --help for more filtering options. You can add custom experiment filters via the register_filter decorator. See experiments/launcher.py for an example.

Model Weights

All pretrained model weights used for "A Call to Reflect on Evaluation Practices for Failure Detection in Image Classification" are available on Zenodo.

Inference

To run inference for one of the experiments:

fd-shifts test --experiment=svhn_paper_sweep/devries_bbsvhn_small_conv_do1_run1_rew2.2

Analogously, with the launch subcommand:

fd-shifts launch --mode=test --dataset=cifar10

Analysis

To run analysis for one of the experiments:

fd-shifts analysis --experiment=svhn_paper_sweep/devries_bbsvhn_small_conv_do1_run1_rew2.2
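Training, inference, and analysis for a single experiment can be chained from a small script. The sketch below only assembles the three commands documented in this README and is not part of FD-Shifts. The names `build_commands` and `run_pipeline` are hypothetical, and nothing is executed unless `dry_run=False` is passed.

```python
import subprocess

# Subcommands in the order used in this README: train, then test, then analysis.
STAGES = ("train", "test", "analysis")


def build_commands(experiment: str) -> list:
    """Build one fd-shifts command line per stage for the given experiment."""
    return [["fd-shifts", stage, f"--experiment={experiment}"] for stage in STAGES]


def run_pipeline(experiment: str, dry_run: bool = True) -> None:
    """Print each command and, unless dry_run is set, execute it."""
    for command in build_commands(experiment):
        print(" ".join(command))
        if not dry_run:
            # check=True stops the pipeline as soon as one stage fails.
            subprocess.run(command, check=True)


# Dry run: only prints the three commands.
run_pipeline("svhn_paper_sweep/devries_bbsvhn_small_conv_do1_run1_rew2.2")
```

Stopping at the first failing stage avoids running analysis on incomplete inference outputs.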

To run analysis on an existing set of inference outputs, the outputs have to be in the format described below.

The format is given for a classifier with d outputs, N samples in total (over all tested datasets), and M dropout samples per input.

raw_logits.npz
Nx(d+2)

  0, 1, ...                 d-1,   d,      d+1
┌───────────────────────────────┬───────┬─────────────┐
|           logits_1            | label | dataset_idx |
├───────────────────────────────┼───────┼─────────────┤
|           logits_2            | label | dataset_idx |
├───────────────────────────────┼───────┼─────────────┤
|           logits_3            | label | dataset_idx |
└───────────────────────────────┴───────┴─────────────┘
.
.
.
┌───────────────────────────────┬───────┬─────────────┐
|           logits_N            | label | dataset_idx |
└───────────────────────────────┴───────┴─────────────┘
external_confids.npz
Nx1
raw_logits_dist.npz
NxdxM

  0, 1, ...                  d-1
┌───────────────────────────────┐
|   logits_1 (Dropout Sample 1) |
|   logits_1 (Dropout Sample 2) |
|               .               |
|               .               |
|               .               |
|   logits_1 (Dropout Sample M) |
├───────────────────────────────┤
|   logits_2 (Dropout Sample 1) |
|   logits_2 (Dropout Sample 2) |
|               .               |
|               .               |
|               .               |
|   logits_2 (Dropout Sample M) |
├───────────────────────────────┤
|   logits_3 (Dropout Sample 1) |
|   logits_3 (Dropout Sample 2) |
|               .               |
|               .               |
|               .               |
|   logits_3 (Dropout Sample M) |
└───────────────────────────────┘
                .
                .
                .
┌───────────────────────────────┐
|   logits_N (Dropout Sample 1) |
|   logits_N (Dropout Sample 2) |
|               .               |
|               .               |
|               .               |
|   logits_N (Dropout Sample M) |
└───────────────────────────────┘
external_confids_dist.npz
NxM
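The array shapes above can be produced with NumPy as sketched here. This is an illustration under stated assumptions, not the FD-Shifts export code: the helper names are hypothetical, the data is random, and the axis order of the dropout array follows the stated shape NxdxM.

```python
import numpy as np


def pack_raw_logits(logits, labels, dataset_idx):
    """Build the N x (d+2) array: d logit columns, then label, then dataset_idx."""
    logits = np.asarray(logits, dtype=np.float64)
    n_samples, n_classes = logits.shape
    packed = np.empty((n_samples, n_classes + 2), dtype=np.float64)
    packed[:, :n_classes] = logits
    packed[:, n_classes] = labels
    packed[:, n_classes + 1] = dataset_idx
    return packed


def stack_dropout_samples(samples):
    """Stack M arrays of shape N x d into a single N x d x M array."""
    return np.stack([np.asarray(s, dtype=np.float64) for s in samples], axis=-1)


# Random stand-in data: N samples, d classes, M dropout passes.
rng = np.random.default_rng(0)
N, d, M = 5, 3, 4
logits = rng.normal(size=(N, d))
labels = rng.integers(0, d, size=N)
dataset_idx = np.zeros(N)  # all samples belong to the first dataset

raw_logits = pack_raw_logits(logits, labels, dataset_idx)
raw_logits_dist = stack_dropout_samples([rng.normal(size=(N, d)) for _ in range(M)])
external_confids = rng.random(size=(N, 1))       # N x 1
external_confids_dist = rng.random(size=(N, M))  # N x M

print(raw_logits.shape, raw_logits_dist.shape)  # (5, 5) (5, 3, 4)
```

Each array can then be written with `np.savez_compressed("raw_logits.npz", raw_logits)` and so on. NumPy stores a positional array under the key `arr_0`; whether the analysis code reads that key is an assumption that should be checked against the FD-Shifts source.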

To load inference output from different locations than $EXPERIMENT_ROOT_DIR, you can specify one or multiple directories in the FD_SHIFTS_STORE_PATH environment variable (multiple paths are separated by :):

export FD_SHIFTS_STORE_PATH=/absolute/path/to/fd-shifts/inference/output
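When several directories are listed, it can be useful to see which one actually contains a given output. The sketch below splits a `:`-separated value and returns the first match. Both helper names are hypothetical, and the first-match lookup order is an assumption for illustration, not documented FD-Shifts behavior.

```python
from pathlib import Path


def parse_store_paths(value: str) -> list:
    """Split a ':'-separated FD_SHIFTS_STORE_PATH value into Path objects."""
    return [Path(entry) for entry in value.split(":") if entry]


def first_existing(relative: str, roots: list):
    """Return the first root/relative path that exists, or None."""
    for root in roots:
        candidate = root / relative
        if candidate.exists():
            return candidate
    return None


# Example with two hypothetical store directories.
print(parse_store_paths("/data/store_a:/data/store_b"))
```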

You may also use the ExperimentData class to load your data in another way. In that case, you also have to provide a suitable config in which all test datasets and query parameters are set. Check out the config files in fd_shifts/configs, including the dataclasses. Importantly, dataset_idx has to match the list of datasets you provide and depends on whether val_tuning is set: if val_tuning is set, the validation set takes dataset_idx=0.
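One way to read the dataset_idx rule is sketched below. It assumes that test datasets are numbered in the order they are listed, starting at 0 without val_tuning and at 1 with it. The function name `dataset_index_map` and the `"validation"` label are hypothetical, so check the result against your config before relying on it.

```python
def dataset_index_map(test_datasets, val_tuning: bool) -> dict:
    """Map dataset names to dataset_idx values under the stated assumption."""
    # With val_tuning, the validation set occupies index 0 and the
    # test datasets follow in the order given.
    names = (["validation"] if val_tuning else []) + list(test_datasets)
    return {name: idx for idx, name in enumerate(names)}


print(dataset_index_map(["cifar10", "svhn"], val_tuning=True))
# {'validation': 0, 'cifar10': 1, 'svhn': 2}
```

The values in the dataset_idx column of raw_logits.npz would then have to use the same numbering.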
