Is TfIdfVectorizer safe to use? #6078

geofforigin8 · 2024-04-11T15:04:42Z

Ask a Question

Question

Is TfIdfVectorizer safe to use?

I thought I was going crazy after models I had trained with a TfIdfVectorizer in the pipeline started having (sometimes large) discrepancies in predictions after converting to ONNX.

After some googling, I found this post here: http://www.xavierdupre.fr/app/onnxcustom/helpsphinx/gyexamples/plot_transformer_discrepancy.html

It appears that the issue the author raised with sklearn is still open:
scikit-learn/scikit-learn#13733

I would use one of the two work-arounds in the first article but they both depend on the mlprodict package which appears archived so I am hesitant to depend on it.

So I am asking, is there a recommended workaround? Also, why is there no mention of this issue in the official docs?

justinchuby · 2024-04-11T17:28:13Z

cc @xadupre @gramalingam

xadupre · 2024-04-11T17:48:43Z

mlprodict is archived because its main functionality, a python runtime for onnx called OnnxInference, was moved into onnx package as ReferenceEvaluator. I did not write an example about the trick I put in place to avoid ambiguities but it is maintained in unit tests: https://github.com/onnx/sklearn-onnx/blob/main/tests/test_sklearn_text.py#L11. This issue is not mentioned in onnx documentation because the ambiguity comes from scikit-learn and not onnx. This case usually don't happen unless the n-grams contain spaces. In that case, the converter gets confused.

geofforigin8 · 2024-04-11T22:26:34Z

Thank you for the quick reply! Using the test cases as a model, I wrote a sample program and I was not having success when using any of the min_df, max_df or max_features parameters.

It looks like convert_sklearn_text_vectorizer is expecting stop_words_ to be a set of string but when min_df et al are set, it is being set to a set of tuples, so as a result running to_onnx on it will fail with an error like this:
TypeError: One stop word is not a string ('is', 'this') in stop_words={('is', 'this'), ...

geofforigin8 · 2024-04-11T23:39:13Z

I've sort of fixed my problem by setting stop_words to an empty set and moving on. However, I'm still seeing some discrepancies and tracked it down to what looks like single-char tokens not being skipped by ONNX?

to reproduce:

import numpy as np 
import skl2onnx
from skl2onnx.sklapi import TraceableTfidfVectorizer, TraceableCountVectorizer
from skl2onnx import update_registered_converter
from skl2onnx.shape_calculators.text_vectorizer import (
    calculate_sklearn_text_vectorizer_output_shapes,
)
from skl2onnx.operator_converters.text_vectoriser import convert_sklearn_text_vectorizer
from skl2onnx.operator_converters.tfidf_vectoriser import convert_sklearn_tfidf_vectoriser
from skl2onnx.common.data_types import StringTensorType
from numpy.testing import assert_almost_equal
from onnxruntime import InferenceSession


update_registered_converter(
    TraceableCountVectorizer,
    "Skl2onnxTraceableCountVectorizer",
    calculate_sklearn_text_vectorizer_output_shapes,
    convert_sklearn_text_vectorizer,
    options={
        "tokenexp": None,
        "separators": None,
        "nan": [True, False],
        "keep_empty_string": [True, False],
        "locale": None,
    },
)

update_registered_converter(
    TraceableTfidfVectorizer,
    "Skl2onnxTraceableTfidfVectorizer",
    calculate_sklearn_text_vectorizer_output_shapes,
    convert_sklearn_tfidf_vectoriser,
    options={
        "tokenexp": None,
        "separators": None,
        "nan": [True, False],
        "keep_empty_string": [True, False],
        "locale": None,
    },
)

corpus = np.array([
        "This is the first document.",
        "This document is the second document.",
        "And this is the third one.",
        "Is this the first document?",
        "first Z document", # uncomment this and the last assertion fails because ONNX still sees the "Z" token when it shouldn't?
        "",
    ]
)

FEATURES = 12

vect = TraceableTfidfVectorizer(ngram_range=(1,2), max_features=FEATURES)
vect.fit(corpus)

print('vocabulary:', vect.vocabulary_)
print('stop_words:', vect.stop_words_)

A = vect.transform(corpus).todense()

vect.stop_words_ = set() # should be safe right?

B = vect.transform(corpus).todense()
assert_almost_equal(A,B)  # make sure vectorizer still does what we expect

model_onnx = skl2onnx.to_onnx(
    vect,
    "TfidfVectorizer",
    initial_types=[("input", StringTensorType([None, 1]))],
    target_opset=12,
)

sess = InferenceSession(model_onnx.SerializeToString(), providers=['CPUExecutionProvider'])

inputs = {"input": corpus.reshape((-1, 1))}

out = sess.run(None, inputs)[0]
assert out.shape == (len(corpus), FEATURES)
assert_almost_equal(out.ravel(), A.A1)

sklearn gives the same vector for both strings as expected:
vect.transform(["first document", "first Z document"]).todense()

matrix([[0.        , 0.51822427, 0.60474937, 0.60474937, 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        ],
        [0.        , 0.51822427, 0.60474937, 0.60474937, 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        ]])

And the model from onnx is is only matching the results on the first string:
sess.run(None, {'input': [["first document"], ["first Z document"]]})[0]

array([[0.        , 0.5182243 , 0.6047494 , 0.6047494 , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        ],
       [0.        , 0.65069556, 0.7593387 , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        ]], dtype=float32)

gramalingam · 2024-04-19T00:40:04Z

Hi, I haven't understood the details here. ONNX is a specification. The issue could either be in the conversion side or it could be a bug in the onnxruntime implementation. If it is the latter, I would recommend filing an issue in the onnxruntime repo. Do you know?

Xavier might have a better idea (he is off this week).

geofforigin8 · 2024-04-19T22:15:49Z

Nope, I did not know. I don't have a full image of all the related projects, ownership and relations between them. Thanks for the link though, I thought I found a related issue reported there but, I don't think that's it.

I was able to resolve my issue by explicitly passing "tokenexp": r"\b\w\w+\b" in the options. I don't think this substitution is correct:
https://github.com/onnx/sklearn-onnx/blob/d2029c1a9752f62a63fc5c4447b4d9fe75e8fe39/skl2onnx/operator_converters/text_vectoriser.py#L238

It should be something like [a-zA-Z0-9_]{2,} or \b\w\w+\b. Is the reason for this replacement because the onnx runtime chokes on the (?u) that sklearn has in it's default for token_pattern?

geofforigin8 added the question Questions about ONNX label Apr 11, 2024

justinchuby added the operator Issues related to ONNX operators label Apr 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Is TfIdfVectorizer safe to use? #6078

Is TfIdfVectorizer safe to use? #6078

geofforigin8 commented Apr 11, 2024

justinchuby commented Apr 11, 2024

xadupre commented Apr 11, 2024

geofforigin8 commented Apr 11, 2024 •

edited

Loading

geofforigin8 commented Apr 11, 2024 •

edited

Loading

gramalingam commented Apr 19, 2024

geofforigin8 commented Apr 19, 2024 •

edited

Loading

Is TfIdfVectorizer safe to use? #6078

Is TfIdfVectorizer safe to use? #6078

Comments

geofforigin8 commented Apr 11, 2024

Ask a Question

Question

justinchuby commented Apr 11, 2024

xadupre commented Apr 11, 2024

geofforigin8 commented Apr 11, 2024 • edited Loading

geofforigin8 commented Apr 11, 2024 • edited Loading

gramalingam commented Apr 19, 2024

geofforigin8 commented Apr 19, 2024 • edited Loading

geofforigin8 commented Apr 11, 2024 •

edited

Loading

geofforigin8 commented Apr 11, 2024 •

edited

Loading

geofforigin8 commented Apr 19, 2024 •

edited

Loading