self-explaining_llms

Implementation of the paper "Evaluating the Reliability of Self-Explanations in Large Language Models"

Abstract: This paper investigates the reliability of explanations generated by large language models (LLMs) when prompted to explain their previous output. We evaluate two kinds of such self-explanations - extractive and counterfactual - using three state-of-the-art LLMs (2B to 8B parameters) on two different classification tasks (objective and subjective). Our findings reveal that, while these self-explanations can correlate with human judgement, they do not fully and accurately follow the model's decision process, indicating a gap between perceived and actual model reasoning. We show that this gap can be bridged, as prompting LLMs for counterfactual explanations can produce faithful, informative, and easy-to-verify results. These counterfactuals offer a promising alternative to traditional explainability methods (e.g., SHAP, LIME), provided that prompts are tailored to specific tasks and checked for validity.

Usage

After cloning or downloading this repository, first run the Linux shell script ./setup.sh (a command sketch follows the list below). It will initialize the workspace by performing the following steps:

  1. Install the required Python modules by running pip install -r "./requirements.txt".
  2. Download the Python code needed to compute BARTScore (Yuan et al., 2021) to "./resources/bart_score.py".
  3. Download and preprocess the Food Incidents Dataset (Randl et al., 2024) to "./data/food incidents - hazard/".
  4. Download and preprocess the "Movies" task (Zaidan and Eisner, 2008) of the ERASER benchmark (DeYoung et al., 2020) to "./data/movies/".
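
The commands below sketch the full initialization. The clone URL is inferred from the repository name k-randl/self-explaining_llms; adjust it if you downloaded the code another way:

  # Clone the repository and enter its root directory
  git clone https://github.com/k-randl/self-explaining_llms.git
  cd self-explaining_llms

  # Initialize the workspace: installs the requirements, fetches bart_score.py,
  # and downloads and preprocesses both datasets into ./data/
  bash ./setup.sh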

Once preprocessing has finished, the experiments can be rerun with the shell script ./run.sh, which executes each of the following Python files in turn (a sketch of this script is shown after the list):

  • python ./gemma-2b-hazard.py
  • python ./gemma-2b-movies.py
  • python ./gemma-7b-hazard.py
  • python ./gemma-7b-movies.py
  • python ./llama-8b-hazard.py
  • python ./llama-8b-movies.py
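
For reference, here is a minimal sketch of what ./run.sh is assumed to do, namely invoking each experiment script in sequence; the actual script in the repository may differ in details such as logging or environment configuration:

  #!/bin/bash
  # Assumed behavior of ./run.sh: run every experiment script in turn
  # and stop at the first failure.
  set -e
  for task in gemma-2b-hazard gemma-2b-movies gemma-7b-hazard \
              gemma-7b-movies llama-8b-hazard llama-8b-movies; do
      python "./${task}.py"
  done

Individual experiments can also be launched directly, e.g. python ./gemma-2b-hazard.py.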

The experiments were originally run with Python 3.10.12 on eight NVIDIA RTX A5500 graphics cards with 24 GB of memory each. Once the experiments have finished, the Jupyter notebooks evaluate-hazard.ipynb and evaluate-movies.ipynb can be used to analyze the results.
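
Assuming Jupyter is available in the same environment (it is not listed explicitly in this README, so install it if needed), the evaluation notebooks can be opened as usual:

  # Install the Jupyter Notebook frontend if it is not already available
  pip install notebook

  # Open the evaluation notebooks interactively
  jupyter notebook evaluate-hazard.ipynb
  jupyter notebook evaluate-movies.ipynb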

Sources

Yuan, W., Neubig, G., & Liu, P. (2021). BARTScore: Evaluating Generated Text as Text Generation. arXiv.

Randl, K., Karvounis, M., Marinos, G., Pavlopoulos, J., Lindgren, T., & Henriksson, A. (2024). Food Recall Incidents [Data set]. Zenodo.

DeYoung, J., Jain, S., Rajani, N. F., Lehman, E., Xiong, C., Socher, R., & Wallace, B. C. (2020). ERASER: A Benchmark to Evaluate Rationalized NLP Models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 4443–4458). Association for Computational Linguistics.

Zaidan, O., & Eisner, J. (2008). Modeling Annotators: A Generative Approach to Learning from Annotator Rationales. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing (pp. 31–40). Association for Computational Linguistics.
