spark-nlp explainability #9328
-
Hi, this is actually a very big topic, and until now it has been handled mostly (if not entirely) by data scientists through different evaluation methods. I would personally keep a test dataset that matters to me and represents the real-world data I am going to use my model(s) on, and then evaluate each pipeline separately over that test dataset to understand the value/importance of pre-processing, lowercasing, lemmatizing, and anything else in the pipeline (accuracy, false positives, F1, etc.). Spark NLP, like many other NLP libraries, doesn't have such a feature, or at least not out of the box. But it is an interesting subject, so I converted your issue to a discussion to avoid it being closed.
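Below is a minimal sketch of that evaluation loop, assuming PySpark, a labeled train/test split (`train_df`/`test_df` with `text` and `label` columns), and a `candidate_pipelines` dict mapping variant names (e.g. with/without lowercasing or lemmatization) to Spark ML `Pipeline` objects; these names are illustrative assumptions, not Spark NLP APIs.

```python
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

def evaluate_pipeline(pipeline, train_df, test_df):
    """Fit one candidate pipeline and score it on the held-out test set."""
    model = pipeline.fit(train_df)
    predictions = model.transform(test_df)
    scores = {}
    for metric in ("accuracy", "f1"):
        evaluator = MulticlassClassificationEvaluator(
            labelCol="label", predictionCol="prediction", metricName=metric
        )
        scores[metric] = evaluator.evaluate(predictions)
    return scores

# Compare each preprocessing variant on the same real-world test data.
for name, pipeline in candidate_pipelines.items():
    print(name, evaluate_pipeline(pipeline, train_df, test_df))
```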
-
I am using spark-nlp to preprocess data, which I then pass into a logistic regression model. Because I am building a number of similar models with different subsets of data, I created a pipeline for each model.
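For context, here is a sketch of that kind of pipeline, assuming PySpark and the usual Spark NLP → Finisher → Spark ML bridge (column names and stage choices are illustrative assumptions, not the poster's actual setup):

```python
from sparknlp.base import DocumentAssembler, Finisher
from sparknlp.annotator import Tokenizer, Normalizer
from pyspark.ml import Pipeline
from pyspark.ml.feature import CountVectorizer, IDF
from pyspark.ml.classification import LogisticRegression

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
normalizer = (Normalizer().setInputCols(["token"])
              .setOutputCol("normalized").setLowercase(True))
# Finisher converts Spark NLP annotations back into plain string arrays
# so downstream Spark ML stages can consume them.
finisher = Finisher().setInputCols(["normalized"]).setOutputCols(["tokens"])

tf = CountVectorizer(inputCol="tokens", outputCol="tf")
idf = IDF(inputCol="tf", outputCol="features")
lr = LogisticRegression(labelCol="label", featuresCol="features")

pipeline = Pipeline(stages=[document, tokenizer, normalizer, finisher, tf, idf, lr])
model = pipeline.fit(train_df)  # train_df: assumed DataFrame with text/label columns
```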
I am looking into model explainability: how to see which text is most affecting the model. I looked into LIME and SHAP but could not find anything that works with spark-nlp. Does spark-nlp have some sort of explainability, and if so, can you point me to docs?
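Spark NLP doesn't ship a LIME/SHAP-style explainer. For a linear model, one simple workaround (a sketch, not a Spark NLP feature) is to pair the fitted `CountVectorizer` vocabulary with the logistic-regression coefficients to see which tokens push predictions hardest; the stage indices below assume the pipeline sketched above. For LIME/SHAP proper, you would have to collect the finished features locally and wrap a non-Spark model.

```python
import pandas as pd

cv_model = model.stages[4]    # fitted CountVectorizerModel (position is an assumption)
lr_model = model.stages[-1]   # fitted LogisticRegressionModel

# Binary case: one coefficient per vocabulary term.
# For multiclass, use lr_model.coefficientMatrix instead.
weights = pd.Series(lr_model.coefficients.toArray(), index=cv_model.vocabulary)
print(weights.sort_values(ascending=False).head(20))  # tokens pushing toward class 1
print(weights.sort_values().head(20))                 # tokens pushing toward class 0
```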
I am also looking into incremental training, but I can't seem to find anything that works with spark-nlp. If this is possible with spark-nlp and you can point me to docs, that would be really helpful.
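Spark ML's `LogisticRegression` has no partial-fit API, and Spark NLP doesn't add one. One hedged workaround: freeze the fitted feature stages (swapping `CountVectorizer` for `HashingTF` so the feature space stays fixed across batches) and incrementally train scikit-learn's `SGDClassifier` locally. `feature_model` and `new_batches` below are hypothetical names, not library APIs.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss="log_loss")  # logistic regression trained via SGD
classes = np.array([0, 1])            # all labels must be declared up front

for batch_df in new_batches:          # hypothetical iterator of new DataFrames
    pdf = (feature_model.transform(batch_df)   # frozen featurization stages only
           .select("features", "label")
           .toPandas())
    X = np.stack(pdf["features"].apply(lambda v: v.toArray()).to_list())
    y = pdf["label"].to_numpy()
    clf.partial_fit(X, y, classes=classes)     # incremental update per batch
```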
Thanks in advance.