Reproduce label_encoders.pkl #102

Open
egpbos opened this issue May 20, 2021 · 3 comments
@egpbos (Collaborator) commented May 20, 2021

I think the code for reproducing the label encoders may be missing from the repository. Perhaps we discussed this before, but I forgot why this is.

I ran into this pickled LabelEncoder version mismatch warning again because the pinned scikit-learn version (0.22.1) wouldn't install on a newer Python version in a new environment I had created. Keeping the scikit-learn dependency pinned is a maintenance burden anyway; just using the newest available version is typically more convenient. However, if we don't pin, we keep getting that annoying warning.

How long does it take to generate a label_encoders.pkl file? And what data is needed for it?

If it is not too costly, then an alternative to shipping it with the package would be to just generate it on first usage. That would also take care of the warning automatically, except when people update scikit-learn.

To make things totally airtight, we could have the encoders regenerated then as well: set the warning to be raised as an exception, catch it, and regenerate.
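
Something like this, roughly (a sketch; `regenerate` is a hypothetical callable that rebuilds and re-pickles the encoders):

import pickle
import warnings

def load_label_encoders(path, regenerate):
    # scikit-learn emits a UserWarning when unpickling an estimator that
    # was pickled under a different version; escalate it to an exception
    # so we can catch it and rebuild instead.
    with warnings.catch_warnings():
        warnings.simplefilter('error', UserWarning)
        try:
            with open(path, 'rb') as f:
                return pickle.load(f)
        except UserWarning:
            return regenerate(path)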

Overengineering? Probably :D Not too much effort, though, at least if the label_encoders generation process is not too complicated.

@bhigy (Contributor) commented May 20, 2021

The label encoder can easily be generated from a dataset (e.g. dataset.Flickr8KData) using the method init_vocabulary(). It will then be available from the global variable tokenizer. ⚠️ The vocabulary should be initialized on the training set.
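
In code, that would be roughly (a sketch; the constructor arguments are illustrative and may not match the actual signature):

import pickle
import dataset

# Initialize the vocabulary on the *training* split, as noted above.
# (Constructor arguments are illustrative; check the actual signature.)
train = dataset.Flickr8KData(split='train')
train.init_vocabulary()

# The fitted encoder is then available from the module-level `tokenizer`
# variable; pickle it so it can be shipped and reused.
with open('label_encoders.pkl', 'wb') as f:
    pickle.dump(dataset.tokenizer, f)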

Originally, I was creating the LabelEncoder every time a model was trained. The catch is that if you use a pipeline composed of two pretrained models, you have no guarantee (as far as I know) that the LabelEncoders will use the same index for the same label. The simplest solution seemed to be generating the LabelEncoder once and for all.
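
A small illustration of the mismatch (LabelEncoder indexes the sorted unique labels it was fit on):

from sklearn.preprocessing import LabelEncoder

# Two encoders fit on different vocabularies map the same label to
# different indices, so two pretrained models can disagree.
enc_a = LabelEncoder().fit(['a', 'b', 'cat'])
enc_b = LabelEncoder().fit(['b', 'cat'])
print(enc_a.transform(['cat']))  # [2]
print(enc_b.transform(['cat']))  # [1]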

It would be great if the LabelEncoder could be generated automatically, but now that I think about it, there might be some issues we need to consider:

  1. You can run into trouble if the LabelEncoder is regenerated and you use it with some old checkpoints.
  2. I took into account the fact that we might have different languages, but not that we have different datasets (which could have different lists of labels).

Regarding 1., I think the best option is to generate it once and for all. If scikit-learn is updated, I would leave it to the user to decide what to do (keep the old file or regenerate it, with the possibility that this breaks previously trained models).
For 2., it might be better to store the files in a language- and dataset-specific way (e.g. under data/datasets/<dataset_name>/label_encoder_<language>.pkl).

@bhigy (Contributor) commented Jun 4, 2021

We had a discussion about this during our last meeting. I believe the nicest solution would be to leave the warning as is, since there isn't an obvious way to solve this. I think it is best to let the user decide what to do based on the specific circumstances.

To make updating the file more straightforward though, I suggest modifying the code to generate the file automatically when it doesn't exist. If someone wants to update the file, all they need to do is delete the old one and a new one will be re-created the first time a model is trained. As we now support more than one dataset, I also suggest we make these files dataset-dependent (stored under <root>/data/datasets/<dataset_name>). This way, we can also handle dataset specificities (e.g. flickr8k having different languages or spokencoco different split schemes).
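
Something along these lines (a sketch; `build_label_encoder` is a hypothetical builder that fits the encoder on the dataset's training split):

import os
import pickle

def get_label_encoder(root, dataset_name, language='en'):
    # Load the dataset- and language-specific encoder if the file exists;
    # otherwise build it and cache it for next time.
    path = os.path.join(root, 'data', 'datasets', dataset_name,
                        'label_encoder_{}.pkl'.format(language))
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)
    encoder = build_label_encoder(dataset_name, language)  # hypothetical
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, 'wb') as f:
        pickle.dump(encoder, f)
    return encoder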

Something that would also be nice to have, but would require more effort, is making the loading of textual captions optional. For most experiments, we actually need neither the captions nor the label encoder. I am not sure how simple that would be to implement though.

@egpbos any thoughts on this?

@egpbos (Collaborator, Author) commented Jun 7, 2021

> To make updating the file more straightforward though, I suggest modifying the code to generate the file automatically when it doesn't exist. If someone wants to update the file, all they need to do is delete the old one and a new one will be re-created the first time a model is trained. As we now support more than one dataset, I also suggest we make these files dataset-dependent (stored under <root>/data/datasets/<dataset_name>). This way, we can also handle dataset specificities (e.g. flickr8k having different languages or spokencoco different split schemes).

Sounds good to me!

> Something that would also be nice to have, but would require more effort, is making the loading of textual captions optional. For most experiments, we actually need neither the captions nor the label encoder. I am not sure how simple that would be to implement though.

The first thing I would think of is to lazily initialize these things, i.e. initialize them on their first use. Maybe we can do this most elegantly by turning them into properties. For the label encoder, that would look something like:

class DataThing:
    ...
    @property
    def le(self):  # is it still called `le`?
        try:
            return self._le
        except AttributeError:
            # _le was not yet initialized, let's do that first
            self._le = create_label_encoder_somehow()
            return self._le
    ...

Then, whenever the class accesses self.le, it just gets the internal _le once that has been initialized the first time. The advantage of this is that when the label encoder is not used in a class, it is simply never initialized, and when it is used, it is initialized automatically. To the user it makes no difference: they just access data_thing.le (or not). The other advantage is that the calling code doesn't have to change.
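
As an aside, on Python 3.8+ functools.cached_property expresses the same pattern more compactly (keeping the placeholder name from above):

from functools import cached_property

class DataThing:
    @cached_property
    def le(self):
        # Computed on first access, then cached on the instance.
        return create_label_encoder_somehow()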
