Matched label returns NaN in metric calculation #215

Open
Jieran-S opened this issue Feb 19, 2024 · 2 comments
Labels
bug (Something isn't working), help wanted (Extra attention is needed), metric

Comments

@Jieran-S
Member

if not args.matched_labels:
    contingency_table = pd.crosstab(domains, groundtruth)
    row_ind, col_ind = linear_sum_assignment(contingency_table, maximize=True)
    domains = domains.map(dict(zip(row_ind, col_ind)))

Some metrics (MCC, Jaccard) require matched labels. If the labels are not pre-matched, the script runs a matching algorithm (above). But when the number of domain labels exceeds the number of ground-truth labels, the resulting domains object contains many NaN values, which leads to downstream errors.

Could someone working on the metrics look into this and propose a fix?
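A minimal sketch that reproduces the behaviour described above, on made-up toy data (the values, and the `.to_numpy()` call, are illustrative assumptions rather than part of the script). With three domain labels and two ground-truth labels, the assignment can only pair two of the domains, so the third has no entry in the mapping and becomes NaN:

```python
import pandas as pd
from scipy.optimize import linear_sum_assignment

# Toy data (hypothetical): 3 predicted domains, 2 ground-truth labels.
domains = pd.Series([0, 0, 1, 1, 2, 2])
groundtruth = pd.Series([0, 0, 1, 1, 1, 0])

contingency_table = pd.crosstab(domains, groundtruth)  # shape (3, 2)
row_ind, col_ind = linear_sum_assignment(
    contingency_table.to_numpy(), maximize=True
)

# A rectangular assignment yields only min(n_rows, n_cols) pairs,
# so one of the three domains is missing from the mapping.
mapping = dict(zip(row_ind, col_ind))
matched = domains.map(mapping)

print(mapping)
print(matched.isna().sum())  # cells of the unmatched domain are NaN
```

Note that `row_ind` and `col_ind` are positional indices into the table, so the mapping only lines up with the actual labels when those labels happen to be 0, 1, 2, ...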

@Jieran-S added the bug, help wanted, and metric labels Feb 19, 2024
@Jieran-S changed the title from "Matched label returns Nan in metric calculation" to "Matched label returns NaN in metric calculation" Feb 19, 2024
@shdam
Contributor

shdam commented Feb 20, 2024

Could this be fixed by passing dropna=False to pd.crosstab? https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html

Also, isn't the intention to prevent over-clustering with a resolution optimization or similar, so that the number of domain labels never exceeds the number of ground-truth labels?

@Jieran-S
Member Author

Yeah, agreed. The issue also arises from models that don't convert resolution to n_cluster. But in case we want to investigate the robustness of the clustering methods in the future, it would be good to have this option, imo.
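One possible way to keep that option open, sketched under assumptions: `match_labels` is a hypothetical helper, not something in the repository, and the toy data assumes string ground-truth labels so that the placeholder names cannot collide with them. The idea is to map through the table's actual index and column labels and to give every unmatched domain its own placeholder label, so those cells count as mismatches instead of turning into NaN:

```python
import pandas as pd
from scipy.optimize import linear_sum_assignment


def match_labels(domains: pd.Series, groundtruth: pd.Series) -> pd.Series:
    """Hypothetical helper: relabel domains to ground-truth labels without NaN."""
    table = pd.crosstab(domains, groundtruth)
    row_ind, col_ind = linear_sum_assignment(table.to_numpy(), maximize=True)

    # Translate positional indices back into the actual label values.
    mapping = dict(zip(table.index[row_ind], table.columns[col_ind]))

    # Domains left without a partner get a placeholder label
    # (assumes ground-truth labels never start with "unmatched_").
    for label in table.index:
        mapping.setdefault(label, f"unmatched_{label}")

    return domains.map(mapping)


# Toy data (hypothetical): 3 domains, 2 ground-truth classes.
domains = pd.Series([0, 0, 1, 1, 2, 2])
groundtruth = pd.Series(["A", "A", "B", "B", "B", "A"])

matched = match_labels(domains, groundtruth)
print(matched.tolist())
```

Whether an unmatched domain should be scored as a full mismatch, or merged into its best-overlapping ground-truth label instead, is a design choice that changes what MCC and Jaccard end up measuring, so it would need agreement from the metric maintainers.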
