Fix categorical distribution legends #437
base: main
Conversation
We cannot categorize the reference and analysis periods separately, as doing so may result in inconsistent categories. This approach combines them first, then re-splits after categorization to ensure equivalent category sets.
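A toy illustration of the combine-then-split idea — `categorize_top_k` here is a hypothetical stand-in for the real categorization logic, not NannyML's actual helper:

```python
import pandas as pd

def categorize_top_k(series: pd.Series, k: int = 5) -> pd.Series:
    # Hypothetical stand-in: keep the k most frequent values, lump the rest.
    top = series.value_counts().nlargest(k).index
    return series.where(series.isin(top), "Other")

reference = pd.Series(["a", "a", "a", "b", "c"])
analysis = pd.Series(["b", "b", "b", "c", "d"])

# Combine first so both periods share a single category set...
combined = pd.concat([reference, analysis], ignore_index=True)
categorized = categorize_top_k(combined, k=2)

# ...then split back along the original boundary.
ref_cat = categorized.iloc[: len(reference)]
ana_cat = categorized.iloc[len(reference):]
```

Categorizing each period on its own would instead pick a different top-k per period, which is exactly the inconsistency being fixed.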
First of all: thank you very much for your contribution! This is definitely not the easiest part of the codebase to dive into, kudos on figuring this one out!
I think this is the "cleanest" solution in our current way of working. I can't think of a solution that doesn't involve looking at the dataset in its entirety.
Just made a remark on the DefaultChunker being overridden: I'm not sure whether that affects the actual chunker variable outside the scope of this function. It shouldn't, but if it does, it might require taking a copy of the existing chunker instance and using that copy for the rest of the function.
```python
    max_number_of_categories=5,
    missing_category_label='Missing',
)
data = pd.concat([reference_data, analysis_data]).reset_index(drop=True)
```
This has bitten us a couple of times before as it is very memory-intensive, but since this is happening on a single column the impact should be OK.
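As a sanity check on the memory point: concatenating only the one categorical column carries far less data than concatenating the full frames. A sketch with made-up column names:

```python
import pandas as pd

# Hypothetical frames: one categorical column plus several other columns.
reference_data = pd.DataFrame({"category": ["a", "b"] * 50,
                               "feature_1": range(100),
                               "feature_2": range(100)})
analysis_data = pd.DataFrame({"category": ["b", "c"] * 50,
                              "feature_1": range(100),
                              "feature_2": range(100)})

# Concatenating only the column of interest keeps the overhead proportional
# to a single column rather than the whole frame.
combined_col = pd.concat(
    [reference_data["category"], analysis_data["category"]],
    ignore_index=True,
)
```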
```python
analysis_chunk_indices = analysis_chunk_indices + (max(reference_chunk_indices) + 1)
# TODO: split proportionally.
if isinstance(chunker, DefaultChunker):
    chunker = CountBasedChunker(2 * DefaultChunker.DEFAULT_CHUNK_COUNT)
```
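Presumably the `2 *` factor keeps the chunk *size* constant: the combined frame is roughly twice as long, so it needs twice as many chunks. A toy count-based splitter illustrating this (my own simplification, not the real `CountBasedChunker`):

```python
# Toy count-based chunker: splits a list into `count` near-equal chunks.
def count_based_split(items, count):
    size, rem = divmod(len(items), count)
    chunks, start = [], 0
    for i in range(count):
        end = start + size + (1 if i < rem else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

reference = list(range(100))
analysis = list(range(100))

# Splitting each period into 10 chunks yields chunks of 10 rows...
per_period = count_based_split(reference, 10)

# ...so the combined data needs 2x the chunk count for the same chunk size.
combined = count_based_split(reference + analysis, 2 * 10)
```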
Does this overwrite remain scoped to this function specifically? We need to be sure it doesn't "leak" outside (even though it sits right at the end of the call chain).
Ah good call out. I'll make sure it's scoped to the function.
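For what it's worth, rebinding a parameter name inside a function cannot affect the caller's variable in Python; only mutating the shared object could "leak". A toy check, using stand-in classes rather than the real ones:

```python
class DefaultChunker:
    """Stand-in for the real DefaultChunker."""
    pass

class CountBasedChunker:
    """Stand-in for the real CountBasedChunker."""
    def __init__(self, chunk_count: int):
        self.chunk_count = chunk_count

def _plot(chunker):
    # Rebinding the local name only; the caller's reference is untouched.
    if isinstance(chunker, DefaultChunker):
        chunker = CountBasedChunker(20)
    return chunker

outer = DefaultChunker()
inner = _plot(outer)
assert isinstance(outer, DefaultChunker)    # caller's variable unaffected
assert isinstance(inner, CountBasedChunker) # local rebinding took effect
```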
Perfect, thanks for providing an easy way to (p)review the solution.
Nah, all good. No point in creating extra work.
To be honest, there hasn't been a lot of testing on the plotting part of the library, apart from ensuring "it runs". What you included is proof enough for me. I would just ensure that the chunker overwrite (mentioned in my review) works as intended.
This will do fine for now. I don't like how the chunker has been spilling into all kinds of implementations under the covers. We've been working on a total makeover for this part, so I wouldn't bother putting a lot of effort in this.
Well spotted. Indeed, we "pulled out" the logic of distributions from the univariate drift calculation and turned it into a calculator of its own, with a ...
Thanks for all the feedback! I appreciate it. I'm currently trying to finish my TODO before I copy the approach over to the other location. I need to create a chunker that can do the following (in pseudocode):

```python
# Before
reference_chunks = chunker.split(reference_df)
analysis_chunks = chunker.split(analysis_df)
chunks_before = reference_chunks + analysis_chunks

# After
combined_df = pd.concat([reference_df, analysis_df])
new_chunker = ...  # However I implement this
chunks_after = new_chunker.split(combined_df)

# Goal
chunks_before == chunks_after
```

This is turning out to be tricky in the general case because there are four chunker variants to cover, plus we only sometimes have the chunk timestamps (I believe). I'm now wondering if the better approach is the following:

```python
def calculate_value_counts(
    data,
    # ...
    categories: Optional[list[str]] = None,
):
    if categories is None:
        categories = determine_categories(data)
    # ...

def determine_categories(...):
    # What used to be in calculate_value_counts
    ...

def _plot_stacked_bar(...):
    categories = determine_categories(pd.concat([reference_df, analysis_df]), ...)
    reference_value_counts = calculate_value_counts(reference_df, categories=categories, ...)
    analysis_value_counts = calculate_value_counts(analysis_df, categories=categories, ...)
    # ...
```

Determining the categories that ... Thoughts?
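A minimal runnable version of that second sketch — the helper bodies below are my own simplifications (top-k by frequency, with out-of-set values lumped into "Other"), not the real NannyML logic:

```python
from typing import List, Optional

import pandas as pd

def determine_categories(data: pd.Series, max_categories: int = 5) -> List[str]:
    # Hypothetical stand-in: the k most frequent values.
    return data.value_counts().nlargest(max_categories).index.tolist()

def calculate_value_counts(
    data: pd.Series, categories: Optional[List[str]] = None
) -> pd.Series:
    # With no categories passed, fall back to deriving them locally.
    if categories is None:
        categories = determine_categories(data)
    lumped = data.where(data.isin(categories), "Other")
    return lumped.value_counts()

reference_df = pd.Series(["a", "a", "b", "c"])
analysis_df = pd.Series(["c", "c", "d", "d"])

# Categorize once over the combined data, then reuse for both periods.
shared = determine_categories(pd.concat([reference_df, analysis_df]), max_categories=3)
ref_counts = calculate_value_counts(reference_df, categories=shared)
ana_counts = calculate_value_counts(analysis_df, categories=shared)
```

Both result indexes are now drawn from the same shared category set, without touching any chunker.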
Now we're categorizing ahead of calculating the value counts and passing those categories in as an optional value. Should be equivalent, but without the need to create a new chunker.
@nnansters I went ahead and tried my alternate approach. I think it's actually better? LMK what you think. Happy to revert if preferred.
Sorry for the churn!
👋🏻 Hi!
This PR addresses two issues that I found make the plots for categorical distributions hard to work with:
1. Reference and analysis period categories are sometimes inconsistent
Currently, the reference and analysis data are each categorized independently via separate calls to `calculate_value_counts`. This can lead to inconsistent categorizations because different classes may have different count rankings across the two periods. E.g. the most frequent class in the reference period may not be the most frequent class in the analysis period, meaning the bottom of each stack (purple by default) will be used for different classes in different periods. I fixed this by concatenating the two dataframes, making a single call to `calculate_value_counts`, then splitting the dataframe back up. There are other ways to accomplish this and I'm open to suggestions!
2. Legends for all subplots are located near the top.
This appears to be a long-standing issue with plotly. I applied a workaround found here:
https://community.plotly.com/t/plotly-subplots-with-individual-legends/1754/25
Note: I had to adapt it to work with more than one column of subplots.
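Issue 1 can be reproduced in miniature: with per-period categorization, the top-k category sets (and their ranking, hence the stack order and colors) differ between periods, while categorizing the combined data yields one shared set. The data below is made up purely for illustration:

```python
import pandas as pd

def top_categories(series: pd.Series, k: int) -> list:
    # Rank classes by frequency and keep the top k.
    return series.value_counts().nlargest(k).index.tolist()

reference = pd.Series(["cat"] * 5 + ["dog"] * 3 + ["bird"] * 1)
analysis = pd.Series(["dog"] * 5 + ["bird"] * 3 + ["cat"] * 1)

# Categorizing each period independently ranks the classes differently...
assert top_categories(reference, 2) == ["cat", "dog"]
assert top_categories(analysis, 2) == ["dog", "bird"]

# ...while categorizing the combined data yields a single shared ranking.
combined = pd.concat([reference, analysis], ignore_index=True)
```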
Results
Run the following in console to reproduce these results:
Before: (screenshot)
After: (screenshot)
Notice how the affected value changes from `30%` (incorrect) to `20%` (correct).
Discussion
The `categorical-plot-example` poetry script is just there to highlight the problem. I pulled it straight from here: https://nannyml.readthedocs.io/en/stable/tutorials/detecting_data_drift/univariate_drift_detection.html#just-the-code. It should be removed before merging.
TODO
The TODO in `nannyml/plots/blueprints/distributions.py` highlights the weakness of my approach: the `chunker` now needs to be expanded to handle both datasets simultaneously. Again, open to thoughts here.
There is similar logic in `nannyml/distribution/categorical/result.py`. Will this approach need to be mirrored in that file?