Pickle dumping & loading causes model learn_one to forget all history #861

mshussein · 2022-02-28T04:37:57Z


Hello,

I'll just preface this by saying I'm not a data scientist, just a seasoned web developer who's trying to expand horizons.

I've made a rudimentary bank transaction classifier that infers the type of transaction from the description and amount, e.g. shopping, loans, dining out, etc.

I've defined my pipeline as follows:

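import river
from river.compose import Pipeline
from river.feature_extraction import BagOfWords

model = Pipeline(
    ('vectorizer', BagOfWords(on='Description', lowercase=True)),
    ('model', river.multiclass.OneVsOneClassifier(river.linear_model.PAClassifier())),
)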

I then train it from a CSV and all works quite well!

Something like this:

for index, row in transactionsSample.iterrows():
    # learn_one updates the model in place, so no reassignment is needed
    model.learn_one({'Description': row.iloc[1], 'Amount': row.iloc[2]}, row.iloc[0])

I need to dump this to file and load it on a container on AWS to serve requests for both training and prediction.

Now, whenever I save and reload it using pickle and then call model.learn_one() once, ALL other (previously correct) predictions return the same label I passed to that single learn_one call.

Save:

with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)

Load:

with open('model.pkl', 'rb') as f:
    model = pickle.load(f)

From the documentation (https://riverml.xyz/latest/api/compose/Pipeline/), I see that if you're doing learn_one on a pipeline, it fits the model to this new value, quote: "Fit to a single instance."

Other learn_one functions (e.g. https://riverml.xyz/latest/api/multiclass/OneVsOneClassifier/) say: "Update the model with a set of features x and a label y."

My gut feel is that it's not saving/loading correctly, therefore losing all historical learning, making the next learn_one the only training it has ever received.
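One way to test that hunch in isolation is to compare predictions before and after a pickle round trip, with no learn_one call in between. The sketch below is only illustrative: survives_pickle, toy_model and toy_samples are hypothetical names, and a plain dict stands in for the real pipeline. For a river pipeline, the predict argument would presumably be something like lambda m, x: m.predict_one(x).

```python
import pickle


def survives_pickle(model, samples, predict):
    """Return True if predictions are identical before and after a pickle round trip."""
    before = [predict(model, x) for x in samples]
    # In-memory equivalent of pickle.dump to a file followed by pickle.load
    restored = pickle.loads(pickle.dumps(model))
    after = [predict(restored, x) for x in samples]
    return before == after


# Stand-in "model": a plain dict mapping a description to a label (hypothetical data)
toy_model = {"coffee shop": "dining out", "car loan": "loans"}
toy_samples = ["coffee shop", "car loan", "unknown"]

ok = survives_pickle(toy_model, toy_samples, predict=lambda m, x: m.get(x))
print(ok)
```

If a check like this passes for the real model, the save and load steps are probably not where the history is being lost.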

Questions:

Am I saving & loading the model correctly?
Am I training it correctly? Both in the initial CSV training and the subsequent training post-load? (both use the pipeline to learn_one so I can't understand why it's any different)
What am I doing wrong?

I appreciate the help. I'm totally lost.

mshussein · 2022-02-28T05:23:48Z

Author

Okay,

Further to the above, I find the issue happens even when I don't save and reload!

If I train the model from the CSV, run some predictions, and then call learn_one once immediately afterwards, all of the previous predictions, when re-run, return the label from that latest learn_one call instead.

8 replies

mshussein Mar 3, 2022
Author

Thanks @MaxHalford,

No problem at all, thanks for your time so far. Much appreciated.

As requested, I changed it to:

model = Pipeline(
    ('vectorizer', TFIDF(on='Description',lowercase=True)),
    ('model', river.multiclass.OneVsOneClassifier(river.linear_model.LogisticRegression())),
)

(I got an error trying to use BagOfWords, "Expected dict, got Counter", so I switched to TFIDF just to perform your test. It's not as accurate as BagOfWords for the data I have, but I just wanted to use LogisticRegression for the classifier)

Either way, still happens as described: As soon as I learn_one outside of the for loop, all results revert to the last trained answer.

mshussein Mar 3, 2022
Author

Sorry mate,

I'm such a muppet.

I figured out the extra learn_one was a new transaction category. I did make up 2 transactions, but they both coincidentally happened to be categories previously unseen in my (small) sample of about 8000. (Sorry!)

I can confirm when we do a previously known category, we're all good.

That being said, I suppose if we do need to add a new category, we may need to do a lot more training of the new and old categories, lest the newly added category becomes the default answer.

I don't know if that helps at all, or if there's still an issue. I would imagine however that learn_one on a brand new category shouldn't do what it's doing?

Your time and efforts are very much appreciated.

Kind regards.
Muhammed.

MaxHalford Mar 3, 2022
Maintainer

It's ok, mistakes happen! I'm glad you figured it out.

> That being said, I suppose if we do need to add a new category, we may need to do a lot more training of the new and old categories, lest the newly added category becomes the default answer.

Depends. Retraining can be good. But normally you're learning from a stream of data, so if a new category is predominant in a stream, then there's no need to be good at classifying the old categories.

mshussein Mar 3, 2022
Author

I guess my use case is a little different than streaming data.

It would be to allow our staff to correct any mistakes in transaction categorization if/when we encounter them.

As long as we don't have to add a brand new category and we are only correcting, I think we'll be okay.

Thanks again for taking the time and I do apologize if I have wasted any!

mshussein Mar 3, 2022
Author

Also, even further to this in case this helps anyone else:

Training on a brand new category causes all outputs to become that single category.
Training on an example whose category was previously seen reverts all outputs to what they were before (the correct ones).
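For the staff-correction use case described above, one possible safeguard is to reject corrections whose label was never part of the initial training, so a brand new category cannot slip in by accident. This is a minimal sketch under assumptions, not anything river provides: KnownLabelGuard and EchoModel are hypothetical names, and EchoModel is only a stand-in for the real pipeline (any object with learn_one and predict_one would do).

```python
class KnownLabelGuard:
    """Wrap an online model and refuse updates whose label is not in a known set."""

    def __init__(self, model, known_labels):
        self.model = model
        self.known_labels = set(known_labels)

    def learn_one(self, x, y):
        # Reject categories that were never seen during the initial training
        if y not in self.known_labels:
            raise ValueError(f"unseen category: {y!r}")
        self.model.learn_one(x, y)

    def predict_one(self, x):
        return self.model.predict_one(x)


class EchoModel:
    """Hypothetical stand-in model: remembers the last label seen per description."""

    def __init__(self):
        self.memory = {}

    def learn_one(self, x, y):
        self.memory[x["Description"]] = y

    def predict_one(self, x):
        return self.memory.get(x["Description"])


guard = KnownLabelGuard(EchoModel(), known_labels={"shopping", "dining out"})
guard.learn_one({"Description": "coffee shop"}, "dining out")  # known label: accepted

try:
    guard.learn_one({"Description": "gym"}, "fitness")  # unseen label: rejected
    rejected = False
except ValueError:
    rejected = True

print(rejected)
```

With the real model, known_labels could be collected from the label column of the training CSV. Whether rejecting is better than queuing the new category for a larger retraining run depends on the workflow.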
