DiceScore uses average='micro' by default, while other methods use average='macro' #3031

Closed
ZachParent opened this issue Mar 27, 2025 · 4 comments · Fixed by #3041 · May be fixed by #3042
Assignees
Labels
bug / fix (Something isn't working) · help wanted (Extra attention is needed) · v1.7.x

Comments

@ZachParent

🐛 Bug

I noticed that the DiceScore metric in segmentation uses an averaging strategy of 'micro' by default:

average: Optional[Literal["micro", "macro", "weighted", "none"]] = "micro",

which differs from the typical multiclass averaging default, as seen in the MulticlassStatScores class:

average: Optional[Literal["micro", "macro", "weighted", "none"]] = "macro",

Combined with the default of include_background=True, this makes the DiceScore quite optimistic (>80% Dice with a pretrained segmentation model after one epoch of fine-tuning), because segmentation models tend to be biased towards predicting background.
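To make the difference concrete, here is a minimal sketch (my own illustration, not the torchmetrics implementation) of per-class versus pooled Dice on a tiny background-dominated example:

import torch

def dice_per_class(pred, target, num_classes):
    # One Dice score per class: 2 * |pred ∩ target| / (|pred| + |target|)
    scores = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        denom = p.sum() + t.sum()
        scores.append(2.0 * (p & t).sum() / denom if denom > 0 else torch.tensor(1.0))
    return torch.stack(scores)

# 8 pixels: the 6 background pixels are correct, both foreground pixels are wrong
pred = torch.tensor([0, 0, 0, 0, 0, 0, 1, 2])
target = torch.tensor([0, 0, 0, 0, 0, 0, 2, 1])

per_class = dice_per_class(pred, target, num_classes=3)
print(per_class)         # tensor([1., 0., 0.])
print(per_class.mean())  # "macro" averages the per-class scores: ~0.33
# "micro" pools intersections and supports over classes before dividing,
# so the 6 correct background pixels dominate: 2 * 6 / (8 + 8) = 0.75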

To Reproduce

Steps to reproduce the behavior...

Code sample

This demo shows several DiceScore initializations, applied to output and target tensors that are randomly initialized but biased towards background (class 0).

import torch
import torchmetrics.segmentation
import pandas as pd

num_classes = 20
batch_size = 16
dice_score_default = torchmetrics.segmentation.DiceScore(
    input_format="index",
    num_classes=num_classes,
)
dice_score_no_bg = torchmetrics.segmentation.DiceScore(
    input_format="index",
    num_classes=num_classes,
    include_background=False,
)
dice_score_macro = torchmetrics.segmentation.DiceScore(
    input_format="index",
    num_classes=num_classes,
    average="macro",
)
dice_score_realistic = torchmetrics.segmentation.DiceScore(
    input_format="index",
    num_classes=num_classes,
    include_background=False,
    average="macro",
)

# Create an example output tensor where roughly 75% of the pixels are background
example_output = torch.randint(0, num_classes, (batch_size, 10, 10))
output_background_mask = torch.rand(example_output.shape) < 0.75
example_output[output_background_mask] = 0

# Create an example target tensor where roughly 75% of the pixels are background
example_target = torch.randint(0, num_classes, (batch_size, 10, 10))
target_background_mask = torch.rand(example_target.shape) < 0.75
example_target[target_background_mask] = 0

dice_score_default.update(example_output, example_target)
dice_score_no_bg.update(example_output, example_target)
dice_score_macro.update(example_output, example_target)
dice_score_realistic.update(example_output, example_target)

scores = {
    "include_background": ["True", "False"],
    "average='micro'": [
        dice_score_default.compute().item(),
        dice_score_no_bg.compute().item(),
    ],
    "average='macro'": [
        dice_score_macro.compute().item(),
        dice_score_realistic.compute().item(),
    ],
}
scores_df = pd.DataFrame(scores)
print(scores_df)
#   include_background  average='micro'  average='macro'
# 0               True         0.575000         0.042818
# 1              False         0.007878         0.005482

Expected behavior

The default initialization of DiceScore should be a sensible choice that gives realistic results. With the current defaults, entirely random outputs and targets that both have a typical distribution of roughly 75% background yield a DiceScore above 50%. This is not representative; the expected DiceScore should be below 1%, since these are essentially random guesses.
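
A quick back-of-envelope estimate (a rough calculation of my own, not from torchmetrics) shows where the ~0.575 micro score with background comes from:

# Each pixel is forced to background with p=0.75 and is otherwise uniform over
# 20 classes, so P(pixel == 0) ≈ 0.7625 and P(pixel == c) ≈ 0.0125 for c != 0.
# For independent output and target, the pooled ("micro") Dice is roughly
# 2 * sum_c P(c)^2 / (2 * sum_c P(c)) = sum_c P(c)^2, dominated by background.
p_bg = 0.75 + 0.25 / 20
p_fg = 0.25 / 20
print(p_bg**2 + 19 * p_fg**2)  # ≈ 0.584, close to the 0.575 observed above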

Environment

  • Python & PyTorch Version (e.g., 1.0):
    • Python 3.12.9
    • PyTorch 2.6.0
    • torchmetrics 1.7.0
  • Any other relevant information such as OS (e.g., Linux):
    • Mac

Additional context

These defaults may have been chosen for a reason I'm not familiar with, but it seems to me that torchmetrics should use a consistent averaging default across metrics, and that segmentation metrics should ignore the background by default.

I understand one reason not to make this change: updating defaults may lead to unexpected behaviour changes for users who have not explicitly specified the averaging strategy.

I would be happy to make this change and add/update any relevant tests, if the community agrees.

@ZachParent added the bug / fix and help wanted labels on Mar 27, 2025

Hi! Thanks for your contribution! Great first issue!

@Isalia20
Contributor

Isalia20 commented Apr 4, 2025

Seems like an easy fix. @SkafteNicki @Borda I can take it, if there are no objections.

@SkafteNicki
Member

@Isalia20 I am okay with you taking a stab at this, but the fix is not just changing the default arguments. Because we promise some backwards compatibility, we need to first raise a deprecation warning (that the default will change for one or more arguments) for one release, and then we can make the change in the release afterwards.
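
A minimal sketch of that deprecation path (the sentinel and wording here are illustrative only, not the code from the linked PRs):

import warnings

_UNSET = "unset"  # hypothetical sentinel marking "user did not pass average"

class DiceScore:  # illustrative stand-in, not the real torchmetrics class
    def __init__(self, num_classes, include_background=True, average=_UNSET):
        if average == _UNSET:
            warnings.warn(
                "The default `average` of DiceScore will change from 'micro' to 'macro' "
                "in a future release; pass `average` explicitly to silence this warning.",
                DeprecationWarning,
            )
            average = "micro"  # keep the old behaviour for one more release
        self.num_classes = num_classes
        self.include_background = include_background
        self.average = average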

@Isalia20
Contributor

Isalia20 commented Apr 4, 2025

Sure, I'll add a warning for now and keep this ticket open, and we can change the default in the next version.
