Reproduce label_encoders.pkl #102
The label encoder can easily be generated from a dataset (e.g. …). Originally, I was creating the LabelEncoder every time a model was trained. The catch is that if you use a pipeline composed of two pretrained models, you have no guarantee (as far as I know) that the LabelEncoders will use the same index for the same label. The simplest solution seemed to be to generate the LabelEncoder once and for all. It would be great if the LabelEncoder could be generated automatically, but now that I think about it, there are some issues we need to consider:
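The index-mismatch concern can be illustrated with a minimal stand-in for scikit-learn's LabelEncoder (which maps the sorted unique labels seen during `fit` to consecutive indices); `fit_label_encoder` here is a hypothetical helper, not the repo's actual code:

```python
# Minimal stand-in for scikit-learn's LabelEncoder: indices are assigned
# to the sorted unique labels seen during fitting.
def fit_label_encoder(labels):
    return {label: index for index, label in enumerate(sorted(set(labels)))}

# Two models trained independently may see different label sets:
enc_a = fit_label_encoder(["dog", "cat", "bird"])  # {'bird': 0, 'cat': 1, 'dog': 2}
enc_b = fit_label_encoder(["dog", "cat"])          # {'cat': 0, 'dog': 1}

# The same label maps to different indices in the two encoders, so
# chaining the two pretrained models would silently mix up classes.
print(enc_a["dog"], enc_b["dog"])  # 2 1
```

Generating one shared encoder file up front sidesteps this, since every model then uses the same label-to-index mapping.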
Regarding 1., I think the best option is to generate it once and for all. If scikit-learn is updated, I would leave it to the user to decide what to do (keep the old file or regenerate it, with the possibility that this breaks previously trained models).
We had a discussion on this during our last meeting. I believe the nicest solution would be to leave the warning, as there isn't an obvious way to solve this; it is best to let the user decide what to do based on the specific circumstances. To make updating the file more straightforward, though, I suggest modifying the code to generate the file automatically when it doesn't exist. If someone wants to update the file, all they will need to do is delete the old one, and a new one will be re-created the first time a model is trained. As we now support more than one dataset, I also suggest we make these files dataset-dependent (stored under …).

Something that would also be nice to have, but would require more effort, is to make loading textual captions optional. For most experiments, we actually need neither the captions nor the label encoder. I am not sure how simple that would be to implement, though. @egpbos any thoughts on this?
Sounds good to me!
The first thing I would think of is to "lazy" initialize these things, i.e. initialize them on their first use. Maybe we can do this most elegantly by turning them into properties. For the label encoder that would look something like:

```python
class DataThing:
    ...
    @property
    def le(self):  # is it still called `le`?
        try:
            return self._le
        except AttributeError:
            # _le was not yet initialized, let's do that first
            self._le = create_label_encoder_somehow()
            return self._le
    ...
```

Then, when the code accesses `le` for the first time, the encoder gets created on the fly.
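As a side note, on Python 3.8+ the same lazy-initialization pattern can be written with `functools.cached_property`, which avoids the `try`/`except` dance; `create_label_encoder_somehow` is again just a stand-in for the real construction step:

```python
from functools import cached_property


def create_label_encoder_somehow():
    # Stand-in for the real (unspecified) encoder construction.
    return {"cat": 0, "dog": 1}


class DataThing:
    @cached_property
    def le(self):
        # Computed on first access, then cached on the instance;
        # subsequent accesses return the stored value directly.
        return create_label_encoder_somehow()
```

Instances that never touch `le` never pay the construction cost, which also addresses the "loading captions optional" point for experiments that don't need the encoder.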
I think the code for reproducing the label encoders may be missing from the repository. Perhaps we discussed this before, but I forgot why this is.
I ran into the pickled LabelEncoder version-mismatch warning again because the pinned scikit-learn version (0.22.1) wouldn't install on the newer Python version in a new environment I had created. Keeping the scikit-learn dependency pinned is a maintenance burden anyway; it's typically more convenient to just use the newest available version. However, if we don't pin, we keep getting that annoying warning.
How much time does it cost to generate a label_encoders.pkl file? And what data is necessary for it?
If it is not too costly, then an alternative to shipping it with the package would be to just generate it on first usage. That would also take care of the warning automatically, except when people update scikit-learn.
To make things totally airtight, we could have the encoders be regenerated then as well: configure the warning to be raised as an exception, catch it, and regenerate.
Overengineering? Probably :D Not too much effort, though. At least, if the label_encoders generation process is not too complicated.