Indiscriminate conversion of string fields to categorical is problematic #799

Open
jpn-- opened this issue Feb 14, 2024 · 0 comments
Labels
Bug Something isn't working/bug f

Comments

@jpn--
Member

jpn-- commented Feb 14, 2024

Describe the bug
Most, but not all, fields initially encoded as strings are actually categorical. When they are, conversion to an explicit categorical type is efficient. However, when a field is not categorical (e.g. escort tour participants), or is only loosely categorical with potentially many categories (vehicle type / age / fuel), the conversion to an explicit categorical type is not efficient.

In particular, converting non-categorical data to categorical ruins sharrow performance by triggering excessive recompilation, because every distinct categorical encoding is treated as a unique data type. For example, if a "categorical" escort tour participants column appears in a chooser table, recompilation will happen essentially every time the model runs.
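
To illustrate the mechanism, here is a minimal sketch using plain pandas. It does not use sharrow; `encoding_signature` and the `compiled` dictionary are hypothetical stand-ins for a compile cache keyed on data type, and the sample values are made up. The point is only that two batches of free-form strings yield two different categorical dtypes, while the same data left as strings shares one dtype.

```python
import pandas as pd


def encoding_signature(col: pd.Series):
    """Hypothetical cache key: the dtype, plus the categories if categorical."""
    if isinstance(col.dtype, pd.CategoricalDtype):
        return ("category", tuple(col.cat.categories))
    return (str(col.dtype),)


compiled = {}  # toy stand-in for a compile cache


def run(col: pd.Series) -> None:
    """Record a new 'compilation' whenever an unseen signature shows up."""
    key = encoding_signature(col)
    if key not in compiled:
        compiled[key] = len(compiled) + 1


# Two batches of made-up, non-categorical string values.
batch_1 = pd.Series(["1_2", "3", "1_3"])
batch_2 = pd.Series(["2_4", "1", "2"])

# Converted to categorical: each batch has its own set of categories.
for batch in (batch_1, batch_2):
    run(batch.astype("category"))
n_categorical = len(compiled)

# Left as strings: both batches share a single dtype.
compiled.clear()
for batch in (batch_1, batch_2):
    run(batch)
n_plain = len(compiled)

print(n_categorical, n_plain)
```

In this toy setup the categorical path records one cache entry per batch, and the string path records one entry in total.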

A fix will require leaving these fields as they are rather than converting them to categorical data types.
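
One possible shape for such a fix, as a rough sketch only: convert string columns selectively, skipping any that are explicitly excluded or that have too many distinct values. The function name, the `exclude` and `max_categories` parameters, the threshold of 50, and the column names are all assumptions for illustration, not the project's actual API or settings.

```python
import pandas as pd


def convert_strings_selectively(df, exclude=(), max_categories=50):
    """Cast string columns to categorical unless excluded or too varied.

    Hypothetical helper: `exclude` lists columns to leave untouched, and
    `max_categories` caps the number of distinct values for conversion.
    """
    out = df.copy()
    for name in out.columns:
        col = out[name]
        if isinstance(col.dtype, pd.CategoricalDtype):
            continue  # already categorical
        is_stringy = (
            pd.api.types.is_object_dtype(col)
            or pd.api.types.is_string_dtype(col)
        )
        if not is_stringy:
            continue  # numeric and other non-string columns are left alone
        if name in exclude or col.nunique(dropna=True) > max_categories:
            continue  # not truly categorical, keep as strings
        out[name] = col.astype("category")
    return out


# Made-up example table; column names are illustrative only.
df = pd.DataFrame(
    {
        "tour_mode": ["WALK", "DRIVE", "WALK", "BIKE"],
        "escort_participants": ["1_2", "3", "1_3", "2_4"],
        "household_size": [2, 1, 3, 4],
    }
)
result = convert_strings_selectively(df, exclude=("escort_participants",))
print(result.dtypes)
```

An explicit exclusion list is the more predictable of the two guards, since a cardinality threshold can change its decision from one run to the next depending on the data.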

This is quite possibly the problem in #756.
