Is TfIdfVectorizer safe to use? #6078
Comments
mlprodict is archived because its main functionality, a Python runtime for ONNX called OnnxInference, was moved into the onnx package as ReferenceEvaluator. I did not write an example about the trick I put in place to avoid ambiguities, but it is maintained in unit tests: https://github.com/onnx/sklearn-onnx/blob/main/tests/test_sklearn_text.py#L11. This issue is not mentioned in the onnx documentation because the ambiguity comes from scikit-learn and not from onnx. This case usually doesn't happen unless the n-grams contain spaces. In that case, the converter gets confused.
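To illustrate the ambiguity mentioned above, here is a minimal sketch in plain Python. It assumes that word n-grams are stored as their tokens joined by a single space, which is how scikit-learn builds vocabulary keys for word n-grams. The tokens and the helper names `join_ngram` and `candidate_tokenizations` are hypothetical and are not part of sklearn-onnx or its tests.

```python
from itertools import product


def join_ngram(tokens):
    """Join the tokens of one n-gram with a single space."""
    return " ".join(tokens)


def candidate_tokenizations(ngram):
    """List every way a space-joined key could be split back into tokens."""
    words = ngram.split(" ")
    results = []
    # Each gap between words is either a token boundary (True) or part of a token (False).
    for mask in product([False, True], repeat=len(words) - 1):
        tokens = [words[0]]
        for word, is_boundary in zip(words[1:], mask):
            if is_boundary:
                tokens.append(word)
            else:
                tokens[-1] += " " + word
        results.append(tuple(tokens))
    return results


# Hypothetical tokens, as produced by a tokenizer that allows spaces inside a token.
bigram = join_ngram(["new york", "city"])
bigram_alt = join_ngram(["new", "york city"])
trigram = join_ngram(["new", "york", "city"])

# All three n-grams collapse to the same vocabulary key.
print(bigram == bigram_alt == trigram)

# A converter that only sees the key cannot tell which split was intended.
for tokens in candidate_tokenizations(bigram):
    print(tokens)
```

Once a token may contain a space, the key alone no longer identifies the tokens it was built from, so a converter has to guess or be given the split explicitly.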
Thank you for the quick reply! Using the test cases as a model, I wrote a sample program, but I had no success when using any of the min_df, max_df or max_features parameters. It looks like
I've sort of fixed my problem by setting stop_words to an empty set and moving on. However, I'm still seeing some discrepancies, and I tracked them down to what looks like ONNX not skipping single-character tokens. To reproduce:
sklearn gives the same vector for both strings as expected:
And the model from onnx only matches those results on the first string:
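For context on why scikit-learn returns the same vector in a case like this: its default `token_pattern` is `(?u)\b\w\w+\b`, which only matches tokens of two or more word characters. The sketch below applies that pattern with the standard `re` module. The two input strings are hypothetical stand-ins, not the strings from the original report, and the snippet does not show what the converted ONNX model does.

```python
import re

# Default token_pattern of scikit-learn's text vectorizers:
# a token is two or more word characters between word boundaries.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and keep only tokens matched by the default pattern."""
    return TOKEN_PATTERN.findall(text.lower())


# Hypothetical inputs: the first contains single-character tokens, the second does not.
with_single_chars = "a quick b fox"
without_single_chars = "quick fox"

print(tokenize(with_single_chars))
print(tokenize(without_single_chars))

# Both strings yield the same tokens, so scikit-learn builds the same vector for them.
print(tokenize(with_single_chars) == tokenize(without_single_chars))
```

If the converted model keeps the single-character tokens, the two strings no longer produce identical vectors, which would be consistent with the mismatch described above.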
Hi, I haven't understood the details here. ONNX is a specification. The issue could either be in the conversion side or it could be a bug in the onnxruntime implementation. If it is the latter, I would recommend filing an issue in the onnxruntime repo. Do you know? Xavier might have a better idea (he is off this week).
Nope, I did not know. I don't have a full picture of all the related projects, their ownership, and the relations between them. Thanks for the link though. I thought I had found a related issue reported there, but I don't think that's it. I was able to resolve my issue by explicitly passing It should be something like
Ask a Question
Question
Is TfIdfVectorizer safe to use?
I thought I was going crazy after models I had trained with a TfIdfVectorizer in the pipeline started having (sometimes large) discrepancies in predictions after converting to ONNX.
After some googling, I found this post here: http://www.xavierdupre.fr/app/onnxcustom/helpsphinx/gyexamples/plot_transformer_discrepancy.html
It appears that the issue the author raised with sklearn is still open:
scikit-learn/scikit-learn#13733
I would use one of the two work-arounds in the first article but they both depend on the mlprodict package which appears archived so I am hesitant to depend on it.
So I am asking, is there a recommended workaround? Also, why is there no mention of this issue in the official docs?