Warnings regarding scikit-learn version: Version mismatch might lead to invalid results #22

Open
bockthom opened this issue Jun 12, 2023 · 10 comments

@bockthom
Contributor

Up to now, I have used BoDeGHa with scikit-learn version 0.22, as stated in requirements.txt:

scikit-learn == 0.22

However, a fresh installation of BoDeGHa uses scikit-learn version 1.0.1, since this is the version given in setup.py:

'scikit-learn == 1.0.1',

But using 1.0.1 leads to warnings when running BoDeGHa, as the pretrained model was trained with 0.22:

bodegha/lib/python3.10/site-packages/sklearn/base.py:324: UserWarning: 
Trying to unpickle estimator DecisionTreeClassifier from version 0.22 when using version 1.0.1. 
This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/modules/model_persistence.html#security-maintainability-limitations
warnings.warn(
bodegha/lib/python3.10/site-packages/sklearn/base.py:324: UserWarning: 
Trying to unpickle estimator RandomForestClassifier from version 0.22 when using version 1.0.1. 
This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/modules/model_persistence.html#security-maintainability-limitations
warnings.warn(
bodegha/lib/python3.10/site-packages/sklearn/base.py:438: UserWarning: 
X has feature names, but RandomForestClassifier was fitted without feature names
warnings.warn(

So there is a mismatch between the scikit-learn versions in your repository, and this needs to be fixed: using a pretrained model that was not trained with the installed scikit-learn version could lead to wrong results.
To fix this, either set the scikit-learn version in setup.py back to 0.22, or provide a new pretrained model for 1.0.1 in the repository.
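A mismatch like the one described above could be caught automatically. The following is a minimal sketch, assuming both files pin scikit-learn with `==`; the helper names and the `install_requires` wrapper around the setup.py line are illustrative and not taken from the repository.

```python
import re

# Matches an exact pin such as "scikit-learn == 0.22" or "'scikit-learn == 1.0.1',".
PIN_RE = re.compile(r"scikit-learn\s*==\s*([0-9][0-9A-Za-z.]*)")


def pinned_versions(text):
    """Return every scikit-learn version pinned with '==' in the given text."""
    return PIN_RE.findall(text)


def pins_agree(requirements_text, setup_text):
    """True when both texts pin exactly one and the same scikit-learn version."""
    req = set(pinned_versions(requirements_text))
    setup = set(pinned_versions(setup_text))
    return len(req) == 1 and req == setup


if __name__ == "__main__":
    # Pins quoted in this issue; a real check would read the two files instead.
    requirements_txt = "scikit-learn == 0.22\n"
    setup_py = "install_requires=[\n    'scikit-learn == 1.0.1',\n],\n"  # wrapper is assumed
    print(pinned_versions(requirements_txt), pinned_versions(setup_py))
    print("pins agree:", pins_agree(requirements_txt, setup_py))
```

Such a check could run in CI so that the two files cannot drift apart again.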

I tried to set the version of scikit-learn in setup.py back to 0.22, but without success: scikit-learn 0.22 is no longer compatible with the current version of numpy (AttributeError: module 'numpy' has no attribute 'float'. `np.float` was a deprecated alias for the builtin 'float'). Downgrading numpy to version 1.19.5 (the version before the deprecation of np.float) was not possible, as numpy 1.19.5 does not work with Python 3.10. Installing numpy 1.21.2 (which is compatible with Python 3.10) results in another error (ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject). I also tried other numpy versions between 1.19.5 and 1.21.2, again without success.
So, in the end, I did not manage to install scikit-learn 0.22, the version your pretrained model was trained with, on Python 3.10.

Could you please update the pretrained model in this repository to work with scikit-learn 1.0.1? Or could you show that using your 0.22-pretrained model with 1.0.1 still gives correct results, and prevent the corresponding warnings somehow?
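One way to gain some empirical confidence (not a proof) would be to compare predictions across versions: record the model's predictions on a fixed input set in an environment where scikit-learn 0.22 can still be installed, then compare them with the predictions obtained under 1.0.1. The sketch below covers only the recording and comparison steps; the function names and the JSON file format are assumptions made for illustration.

```python
import json


def save_predictions(predictions, path):
    """Store reference predictions (run this in the old environment)."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump([str(p) for p in predictions], fh)


def mismatches(predictions, path):
    """Return (index, expected, actual) for every prediction that differs."""
    with open(path, encoding="utf-8") as fh:
        expected = json.load(fh)
    actual = [str(p) for p in predictions]
    if len(expected) != len(actual):
        raise ValueError("prediction counts differ")
    return [(i, e, a) for i, (e, a) in enumerate(zip(expected, actual)) if e != a]
```

An empty list from `mismatches` only shows agreement on the inputs that were tested, so the input set should be as representative as possible.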

Thanks in advance! This would help a lot, and the tool would be more trustworthy if these risk warnings disappeared 😉

@AlexandreDecan
Collaborator

@mehdigolzadeh can you take care of this? Is there an easy way to convert the model to the more recent version of sklearn?

@bockthom
Contributor Author

Is there any news regarding this issue?

@AlexandreDecan
Collaborator

I sent an email to the maintainer. I think he changed his email address, which would explain why he's not even aware of this issue :-) Let's wait a few days for him to react.

As a side note, a researcher in my team is currently working on a new approach to detect bots in repositories hosted on GitHub, based on the various activities they perform. The main difference compared to Bodegha is that the new model/tool will rely on a limited number of queries to the GitHub API, so it will be much faster at detecting bots in practice. So far, we have no insight into the accuracy of this approach, but we are confident it will be at least comparable to Bodegha's accuracy. That said, do not expect the tool to be released before October/November :-)

@AlexandreDecan
Collaborator

@mehdigolzadeh Any update?

@bockthom
Contributor Author

Still no reaction from the maintainer?
@AlexandreDecan Have you been able to successfully contact him via email?

(And thanks for your side note. Nevertheless, I would like to stay with BoDeGHa, at least for a while, as it is already part of my toolchain, and changing tools always involves additional effort...)

@AlexandreDecan
Collaborator

He reacted by mail saying he would give some feedback "soon"... :-) I've just sent another email.

@mehdigolzadeh
Owner

I apologize for the delayed response; I've been swamped with numerous tasks. Unfortunately, I couldn't find the time to run and train a new model, but I did come up with a quick temporary fix. The warning is still present, but I've ignored it because the model is functioning without any problems. I plan to train the model using the new version of scikit-learn as soon as I have some free time.
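How the temporary fix ignores the warning is not shown in this thread. One narrow way to do it, sketched here with only the standard `warnings` module, is to suppress just the unpickle message around the loading call, so that other warnings stay visible. The function name and the `loader` callable are hypothetical.

```python
import warnings

# Start of the message quoted earlier in this issue; \s* tolerates a leading newline.
UNPICKLE_MSG = r"\s*Trying to unpickle estimator"


def call_ignoring_unpickle_warning(loader):
    """Run loader() while hiding only the version-mismatch UserWarning."""
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", message=UNPICKLE_MSG, category=UserWarning)
        return loader()
```

This only hides the symptom: the model is still one trained with 0.22 and loaded under 1.0.1, so the underlying risk named in the warning remains.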

@AlexandreDecan
Collaborator

If the model is still working with the new version of sklearn, would it be possible to load it in the new version and to export it with the new model format?
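The load-and-re-export idea can be sketched as follows, assuming the model file is a plain pickle (the warning text says "unpickle", but if the file was written with joblib, `joblib.load` and `joblib.dump` would take the place of the `pickle` calls). The function name and paths are hypothetical.

```python
import pickle


def reexport(src_path, dst_path):
    """Unpickle an object and pickle it again with the installed library versions."""
    # Only unpickle files from a trusted source: unpickling can execute arbitrary code.
    with open(src_path, "rb") as fh:
        model = pickle.load(fh)
    with open(dst_path, "wb") as fh:
        pickle.dump(model, fh)
    return model
```

A scikit-learn estimator records the installed library version when it is pickled, so the re-exported file should no longer trigger the version warning. Re-exporting does not retrain anything, though, so it cannot show on its own that the predictions match those produced under 0.22.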

@mehdigolzadeh
Owner

I did this. Now, the model is exported using the new version of scikit-learn. However, I couldn't resolve the warning because the parameter needs to be passed during training.
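The remaining warning ("X has feature names, but RandomForestClassifier was fitted without feature names") appears when the input to prediction carries column names, as a pandas DataFrame does, while the model was fitted on an unnamed array. A possible workaround that needs no retraining, not confirmed by the maintainers, is to pass the plain array instead. The helper below is a hypothetical sketch and is only safe if the columns are in the same order as during training.

```python
def without_feature_names(X):
    """Return the plain array behind a DataFrame-like object, else X unchanged."""
    to_numpy = getattr(X, "to_numpy", None)
    return to_numpy() if callable(to_numpy) else X


# Hypothetical usage, where `model` and `features` come from the calling code:
# predictions = model.predict(without_feature_names(features))
```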
