
Fix categorical distribution legends #437

Open · billylanchantin wants to merge 9 commits into main

Conversation

@billylanchantin commented Mar 8, 2025

👋🏻 Hi!

This PR addresses two issues that I found make the plots for categorical distributions hard to work with:

  1. Reference and analysis period categories are sometimes inconsistent.
  2. Legends for all subplots are located near the top.

1. Reference and analysis period categories are sometimes inconsistent

Currently, the reference and analysis data are each categorized independently via separate calls to calculate_value_counts. This can lead to inconsistent categorizations because a class's count ranking can differ between the two periods. For example, the most frequent class in the reference period may not be the most frequent class in the analysis period, so the bottom of each stack (purple by default) ends up representing different classes in different periods.

I fixed this by concatenating the two dataframes, making a single call to calculate_value_counts, and then splitting the dataframe back up. There are other ways to accomplish this and I'm open to suggestions!
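
A minimal sketch of the idea (the period tag and the assumption that it survives calculate_value_counts are illustrative, not the PR's literal diff; the keyword arguments match the ones visible in the review below):

import pandas as pd

# Tag each row with its period so we can split again after categorizing.
reference_data = reference_data.assign(period='reference')
analysis_data = analysis_data.assign(period='analysis')
data = pd.concat([reference_data, analysis_data]).reset_index(drop=True)

# A single call yields one shared category set for both periods.
value_counts = calculate_value_counts(
    data,
    max_number_of_categories=5,
    missing_category_label='Missing',
)

# Split back into the original periods, now with identical categories.
reference_value_counts = value_counts[value_counts['period'] == 'reference']
analysis_value_counts = value_counts[value_counts['period'] == 'analysis']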

2. Legends for all subplots are located near the top.

This appears to be a long-standing issue with plotly. I applied a workaround found here:

https://community.plotly.com/t/plotly-subplots-with-individual-legends/1754/25

Note: I had to adapt it to work with more than one column of subplots.
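
A rough sketch of the shape of that workaround, assuming plotly >= 5.15's multiple-legend support (the bar_traces data and loop structure here are hypothetical, not the PR's literal code): each subplot's traces get their own legend object, and each legend is pinned to its subplot's domain.

import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Hypothetical data: one group of stacked-bar traces per subplot.
bar_traces = [
    [go.Bar(x=['a', 'b'], y=[3, 1], name='cat1'), go.Bar(x=['a', 'b'], y=[1, 3], name='cat2')],
    [go.Bar(x=['a', 'b'], y=[2, 2], name='cat1'), go.Bar(x=['a', 'b'], y=[4, 1], name='cat2')],
]
n_cols = 2
fig = make_subplots(rows=1, cols=n_cols)
fig.update_layout(barmode='stack')

for i, traces in enumerate(bar_traces):
    row, col = i // n_cols + 1, i % n_cols + 1
    legend_name = 'legend' if i == 0 else f'legend{i + 1}'
    for trace in traces:
        trace.legend = legend_name  # tie this trace to its subplot's legend
        fig.add_trace(trace, row=row, col=col)

    # Pin that legend just above its subplot's domain.
    subplot = fig.get_subplot(row, col)
    fig.update_layout({
        legend_name: dict(
            x=subplot.xaxis.domain[0],
            y=subplot.yaxis.domain[1],
            xanchor='left',
            yanchor='bottom',
        )
    })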

Results

Run the following in a console to reproduce these results:

poetry run categorical-plot-example

Before: [screenshot of the plots before the fix]

After: [screenshot of the plots after the fix]

Notice how:

  • the hover value for the bottom purple chunk changes from 30% (incorrect) to 20% (correct)
  • the legend is now co-located with the subplot

Discussion

  • The categorical-plot-example poetry script is just there to highlight the problem. I pulled it straight from here: https://nannyml.readthedocs.io/en/stable/tutorials/detecting_data_drift/univariate_drift_detection.html#just-the-code. It should be removed before merging.
  • I'm happy to break apart the 2 separate issues into separate PRs if preferred.
  • How should I test this? (If at all.)
  • The TODO in nannyml/plots/blueprints/distributions.py highlights the weakness of my approach: the chunker now needs to be expanded to handle both datasets simultaneously. Again, open to thoughts here.
  • There appears to be a parallel path to building this kind of plot via nannyml/distribution/categorical/result.py. Will this approach need to be mirrored in that file?

We cannot categorize the reference and analysis periods separately. Doing so may result in inconsistent categories. This approach combines them first, then re-splits after categorization to ensure equivalent category sets.

@nnansters (Contributor) left a comment:


First of all: thank you very much for your contribution! This is definitely not the easiest part of the codebase to dive into, kudos on figuring this one out!

I think this is the "cleanest" solution in our current way of working. I can't think of a solution that doesn't involve looking at the dataset in its entirety.

Just made a remark on the DefaultChunker being overridden: I'm not sure if that has an impact on the actual chunker variable outside the scope of this function. It shouldn't, but to be safe it might require taking a copy of the existing chunker instance and using that copy for the rest of the function.

max_number_of_categories=5,
missing_category_label='Missing',
)
data = pd.concat([reference_data, analysis_data]).reset_index(drop=True)

@nnansters (Contributor):

This has bitten us a couple of times before as it is very memory-intensive, but since this is happening on a single column the impact should be OK.
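
(Illustratively, since only one column is needed here, the concat's footprint could be bounded by selecting that column first; column_name is a hypothetical variable, not necessarily what this code path uses:)

# Illustrative: concatenate only the relevant column to bound memory use.
data = pd.concat(
    [reference_data[[column_name]], analysis_data[[column_name]]],
).reset_index(drop=True)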

analysis_chunk_indices = analysis_chunk_indices + (max(reference_chunk_indices) + 1)
# TODO: split proportionally.
if isinstance(chunker, DefaultChunker):
    chunker = CountBasedChunker(2 * DefaultChunker.DEFAULT_CHUNK_COUNT)

@nnansters (Contributor):

This overwrite remains scoped to this function specifically? We need to be sure it doesn't "leak" outside (even though it is right at the end of the call chain).

@billylanchantin (Author):

Ah good call out. I'll make sure it's scoped to the function.
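
(For what it's worth, rebinding a parameter name in Python only affects the local scope, so the overwrite above shouldn't leak to the caller; a minimal, standalone illustration with strings in place of chunkers:)

def plot(chunker):
    # Rebinding the local name only changes what 'chunker' refers to
    # inside this function; the caller's variable is untouched.
    chunker = "replacement"
    return chunker

original = "original"
plot(original)
assert original == "original"  # the caller's binding is unchanged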

@nnansters (Contributor) commented:

> The categorical-plot-example poetry script is just there to highlight the problem. I pulled it straight from here: https://nannyml.readthedocs.io/en/stable/tutorials/detecting_data_drift/univariate_drift_detection.html#just-the-code. It should be removed before merging.

Perfect, thanks for providing an easy way to (p)review the solution.

> I'm happy to break apart the 2 separate issues into separate PRs if preferred.

Nah, all good. No point in creating extra work.

> How should I test this? (If at all.)

To be honest, there hasn't been a lot of testing on the plotting part of the library, apart from ensuring "it runs". What you included is proof enough for me. I would just ensure that the chunker overwrite (mentioned in my review) works as intended.

> The TODO in nannyml/plots/blueprints/distributions.py highlights the weakness of my approach: the chunker now needs to be expanded to handle both datasets simultaneously. Again, open to thoughts here.

This will do fine for now. I don't like how the chunker has been spilling into all kinds of implementations under the covers. We've been working on a total makeover for this part, so I wouldn't bother putting a lot of effort in this.

> There appears to be a parallel path to building this kind of plot via nannyml/distribution/categorical/result.py. Will this approach need to be mirrored in that file?

Well spotted. Indeed, we "pulled out" the logic of distributions from the univariate drift calculation and turned it into a calculator of its own, with a Result class of its own. I was lazy and duplicated some of the plotting glue code, so yes, it will require duplicating that behavior there. Sorry about that, kind of fugly.

@billylanchantin (Author) commented Mar 20, 2025

Thanks for all the feedback! I appreciate it.

I'm currently trying to finish my TODO before I copy the approach over to the other location. I need to create a chunker that can do the following (in pseudocode):

# Before
reference_chunks = chunker.split(reference_df)
analysis_chunks = chunker.split(analysis_df)
chunks_before = reference_chunks + analysis_chunks

# After
combined_df = pd.concat([reference_df, analysis_df])
new_chunker = ...  # however I implement this
chunks_after = new_chunker.split(combined_df)

# Goal
assert chunks_before == chunks_after

This is turning out to be tricky in the general case because there are 4 chunker variants to cover, and we only sometimes have the chunk timestamps (I believe). I'm now wondering if the better approach is the following:

def calculate_value_counts(
    data,
    # ...
    categories: Optional[list[str]] = None,
):
    if categories is None:
        categories = determine_categories(data)
    # ...

def determine_categories(...):
    # What used to be in calculate_value_counts
    ...

def _plot_stacked_bar(...):
    categories = determine_categories(pd.concat([reference_df, analysis_df]), ...)
    reference_value_counts = calculate_value_counts(reference_df, categories=categories, ...)
    analysis_value_counts = calculate_value_counts(analysis_df, categories=categories, ...)
    # ...

Determining the categories independently of calculate_value_counts, over the combined data, and passing them into each call sidesteps the chunker issue entirely. The problem is that it's more computationally intensive, though not egregiously so.
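
A hypothetical sketch of what determine_categories could look like (the ranking and 'Other' handling here are assumptions, not NannyML's actual internals):

import pandas as pd

# Hypothetical: rank categories over the combined data so both periods
# share the same top-N set; overflow goes into a catch-all bucket.
def determine_categories(column: pd.Series, max_number_of_categories: int = 5) -> list:
    counts = column.value_counts()  # sorted most- to least-frequent
    categories = counts.index[:max_number_of_categories].tolist()
    if len(counts) > max_number_of_categories:
        categories.append('Other')
    return categories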

Thoughts?

Now we're categorizing ahead of calculating the value counts and passing those categories in as an optional value. Should be equivalent, but without the need to create a new chunker.

@billylanchantin (Author) commented:

@nnansters I went ahead and tried my alternate approach. I think it's actually better? LMK what you think. Happy to revert if preferred.
