
Fix categorical distribution legends #437

Open · billylanchantin wants to merge 9 commits into main

Conversation

@billylanchantin commented Mar 8, 2025

👋🏻 Hi!

This PR addresses two issues that I found make the plots for categorical distributions hard to work with:

  1. Reference and analysis period categories are sometimes inconsistent.
  2. Legends for all subplots are located near the top.

1. Reference and analysis period categories are sometimes inconsistent

Currently, the reference and analysis data are each categorized independently via separate calls to calculate_value_counts. This can lead to inconsistent categorizations because a class's count ranking can differ between the two periods. For example, the most frequent class in the reference period may not be the most frequent class in the analysis period, so the bottom of each stack (purple by default) ends up representing different classes in different periods.

I fixed this by concatenating the two dataframes, making a single call to calculate_value_counts, and then splitting the dataframe back up. There are other ways to accomplish this and I'm open to suggestions!
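
A minimal sketch of the idea (the period tag and the assumption that it survives calculate_value_counts are illustrative, not the PR's literal diff; the keyword arguments match the ones visible in the review below):

import pandas as pd

# Tag each row with its period so we can split again after categorizing.
reference_data = reference_data.assign(period='reference')
analysis_data = analysis_data.assign(period='analysis')
data = pd.concat([reference_data, analysis_data]).reset_index(drop=True)

# A single call yields one shared category set for both periods.
value_counts = calculate_value_counts(
    data,
    max_number_of_categories=5,
    missing_category_label='Missing',
)

# Split back into the original periods, now with identical categories.
reference_value_counts = value_counts[value_counts['period'] == 'reference']
analysis_value_counts = value_counts[value_counts['period'] == 'analysis']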

2. Legends for all subplots are located near the top.

This appears to be a long-standing issue with plotly. I applied a workaround found here:

https://community.plotly.com/t/plotly-subplots-with-individual-legends/1754/25

Note: I had to adapt it to work with more than one column of subplots.
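
A rough sketch of the shape of that workaround, assuming plotly >= 5.15's multiple-legend support (the bar_traces data and loop structure here are hypothetical, not the PR's literal code): each subplot's traces get their own legend object, and each legend is pinned to its subplot's domain.

import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Hypothetical data: one group of stacked-bar traces per subplot.
bar_traces = [
    [go.Bar(x=['a', 'b'], y=[3, 1], name='cat1'), go.Bar(x=['a', 'b'], y=[1, 3], name='cat2')],
    [go.Bar(x=['a', 'b'], y=[2, 2], name='cat1'), go.Bar(x=['a', 'b'], y=[4, 1], name='cat2')],
]
n_cols = 2
fig = make_subplots(rows=1, cols=n_cols)
fig.update_layout(barmode='stack')

for i, traces in enumerate(bar_traces):
    row, col = i // n_cols + 1, i % n_cols + 1
    legend_name = 'legend' if i == 0 else f'legend{i + 1}'
    for trace in traces:
        trace.legend = legend_name  # tie this trace to its subplot's legend
        fig.add_trace(trace, row=row, col=col)

    # Pin that legend just above its subplot's domain.
    subplot = fig.get_subplot(row, col)
    fig.update_layout({
        legend_name: dict(
            x=subplot.xaxis.domain[0],
            y=subplot.yaxis.domain[1],
            xanchor='left',
            yanchor='bottom',
        )
    })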

Results

Run the following in a console to reproduce these results:

poetry run categorical-plot-example

Before: [screenshot of the plots before the fix]

After: [screenshot of the plots after the fix]

Notice how:

  • the hover value for the bottom purple chunk changes from 30% (incorrect) to 20% (correct)
  • the legend is now co-located with the subplot

Discussion

  • The categorical-plot-example poetry script is just there to highlight the problem. I pulled it straight from here: https://nannyml.readthedocs.io/en/stable/tutorials/detecting_data_drift/univariate_drift_detection.html#just-the-code. It should be removed before merging.
  • I'm happy to break apart the 2 separate issues into separate PRs if preferred.
  • How should I test this? (If at all.)
  • The TODO in nannyml/plots/blueprints/distributions.py highlights the weakness of my approach: the chunker now needs to be expanded to handle both datasets simultaneously. Again, open to thoughts here.
  • There appears to be a parallel path to building this kind of plot via nannyml/distribution/categorical/result.py. Will this approach need to be mirrored in that file?

We cannot categorize the reference and analysis periods separately. Doing so may result in inconsistent categories. This approach combines them first, then re-splits after categorization to ensure equivalent category sets.

@nnansters (Contributor) left a comment:


First of all: thank you very much for your contribution! This is definitely not the easiest part of the codebase to dive into, kudos on figuring this one out!

I think this is the "cleanest" solution in our current way of working. I can't think of a solution that doesn't involve looking at the dataset in its entirety.

Just made a remark on the DefaultChunker being overridden: I'm not sure if that has an impact on the actual chunker variable outside the scope of this function. It shouldn't, but to be safe it might require taking a copy of the existing chunker instance and using that copy for the rest of the function.

max_number_of_categories=5,
missing_category_label='Missing',
)
data = pd.concat([reference_data, analysis_data]).reset_index(drop=True)

@nnansters (Contributor):

This has bitten us a couple of times before as it is very memory-intensive, but since this is happening on a single column the impact should be OK.
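
(Illustratively, since only one column is needed here, the concat's footprint could be bounded by selecting that column first; column_name is a hypothetical variable, not necessarily what this code path uses:)

# Illustrative: concatenate only the relevant column to bound memory use.
data = pd.concat(
    [reference_data[[column_name]], analysis_data[[column_name]]],
).reset_index(drop=True)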

analysis_chunk_indices = analysis_chunk_indices + (max(reference_chunk_indices) + 1)
# TODO: split proportionally.
if isinstance(chunker, DefaultChunker):
    chunker = CountBasedChunker(2 * DefaultChunker.DEFAULT_CHUNK_COUNT)

@nnansters (Contributor):

This overwrite remains scoped to this function specifically? We need to be sure it doesn't "leak" outside (even though it is right at the end of the call chain).

@billylanchantin (Author):

Ah good call out. I'll make sure it's scoped to the function.
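
(For what it's worth, rebinding a parameter name in Python only affects the local scope, so the overwrite above shouldn't leak to the caller; a minimal, standalone illustration with strings in place of chunkers:)

def plot(chunker):
    # Rebinding the local name only changes what 'chunker' refers to
    # inside this function; the caller's variable is untouched.
    chunker = "replacement"
    return chunker

original = "original"
plot(original)
assert original == "original"  # the caller's binding is unchanged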

@nnansters (Contributor) commented:

> The categorical-plot-example poetry script is just there to highlight the problem. I pulled it straight from here: https://nannyml.readthedocs.io/en/stable/tutorials/detecting_data_drift/univariate_drift_detection.html#just-the-code. It should be removed before merging.

Perfect, thanks for providing an easy way to (p)review the solution.

> I'm happy to break apart the 2 separate issues into separate PRs if preferred.

Nah, all good. No point in creating extra work.

> How should I test this? (If at all.)

To be honest, there hasn't been a lot of testing on the plotting part of the library, apart from ensuring "it runs". What you included is proof enough for me. I would just ensure that the chunker overwrite (mentioned in my review) works as intended.

> The TODO in nannyml/plots/blueprints/distributions.py highlights the weakness of my approach: the chunker now needs to be expanded to handle both datasets simultaneously. Again, open to thoughts here.

This will do fine for now. I don't like how the chunker has been spilling into all kinds of implementations under the covers. We've been working on a total makeover for this part, so I wouldn't bother putting a lot of effort in this.

> There appears to be a parallel path to building this kind of plot via nannyml/distribution/categorical/result.py. Will this approach need to be mirrored in that file?

Well spotted. Indeed, we "pulled out" the logic of distributions from the univariate drift calculation and turned it into a calculator of its own, with a Result class of its own. I was lazy and duplicated some of the plotting glue code, so yes, it will require duplicating that behavior there. Sorry about that, kind of fugly.

@billylanchantin (Author) commented Mar 20, 2025

Thanks for all the feedback! I appreciate it.

I'm currently trying to finish my TODO before I copy the approach over to the other location. I need to create a chunker that can do the following (in pseudocode):

# Before
reference_chunks = chunker.split(reference_df)
analysis_chunks = chunker.split(analysis_df)
chunks_before = reference_chunks + analysis_chunks

# After
combined_df = pd.concat([reference_df, analysis_df])
new_chunker = ...  # however I implement this
chunks_after = new_chunker.split(combined_df)

# Goal
assert chunks_before == chunks_after

This is turning out to be tricky in the general case because there are 4 chunker variants to cover, and we only sometimes have the chunk timestamps (I believe). I'm now wondering if the better approach is the following:

def calculate_value_counts(
    data,
    # ...
    categories: Optional[list[str]] = None,
):
    if categories is None:
        categories = determine_categories(data)
    # ...

def determine_categories(...):
    # What used to be in calculate_value_counts
    ...

def _plot_stacked_bar(...):
    categories = determine_categories(pd.concat([reference_df, analysis_df]), ...)
    reference_value_counts = calculate_value_counts(reference_df, categories=categories, ...)
    analysis_value_counts = calculate_value_counts(analysis_df, categories=categories, ...)
    # ...

Determining the categories independently of calculate_value_counts, over the combined data, and passing them into each call sidesteps the chunker issue entirely. The problem is that it's more computationally intensive, though not egregiously so.
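
A hypothetical sketch of what determine_categories could look like (the ranking and 'Other' handling here are assumptions, not NannyML's actual internals):

import pandas as pd

# Hypothetical: rank categories over the combined data so both periods
# share the same top-N set; overflow goes into a catch-all bucket.
def determine_categories(column: pd.Series, max_number_of_categories: int = 5) -> list:
    counts = column.value_counts()  # sorted most- to least-frequent
    categories = counts.index[:max_number_of_categories].tolist()
    if len(counts) > max_number_of_categories:
        categories.append('Other')
    return categories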

Thoughts?

Now we're categorizing ahead of calculating the value counts and passing those categories in as an optional value. Should be equivalent, but without the need to create a new chunker.

@billylanchantin (Author) commented:

@nnansters I went ahead and tried my alternate approach. I think it's actually better? LMK what you think. Happy to revert if preferred.
