Implementation of the paper "Evaluating the Reliability of Self-Explanations in Large Language Models"
Abstract: This paper investigates the reliability of explanations generated by large language models (LLMs) when prompted to explain their previous output. We evaluate two kinds of such self-explanations, extractive and counterfactual, using three state-of-the-art LLMs (2B to 8B parameters) on two different classification tasks (objective and subjective). Our findings reveal that, while these self-explanations can correlate with human judgement, they do not fully and accurately follow the model's decision process, indicating a gap between perceived and actual model reasoning. We show that this gap can be bridged: prompting LLMs for counterfactual explanations can produce faithful, informative, and easy-to-verify results. These counterfactuals offer a promising alternative to traditional explainability methods (e.g., SHAP, LIME), provided that prompts are tailored to specific tasks and checked for validity.
After cloning or downloading this repository, first run the Linux shell script ./setup.sh. It will initialize the workspace by performing the following steps:
- It will install the required Python modules by running pip install -r "./requirements.txt".
- It will download the Python code needed to compute the BARTScore by Yuan et al. (2021) to "./resources/bart_score.py".
- It will download and preprocess the Food Incidents Dataset by Randl et al. (2024) to "./data/food incidents - hazard/".
- It will download and preprocess the "Movies" task (Zaidan and Eisner, 2008) of the ERASER benchmark by DeYoung et al. (2020) to "./data/movies/".
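After setup.sh finishes, the workspace should contain the artifacts listed above. The following sketch checks for them before the experiments are started; the paths are taken from this README, while the helper name missing_artifacts is purely illustrative and not part of the repository.

```python
from pathlib import Path

# Artifacts that setup.sh is expected to create, as listed above.
# This checker is an illustrative sketch, not part of the repository.
EXPECTED_PATHS = [
    "resources/bart_score.py",
    "data/food incidents - hazard",
    "data/movies",
]

def missing_artifacts(root="."):
    """Return the expected setup artifacts that are absent under root."""
    base = Path(root)
    return [p for p in EXPECTED_PATHS if not (base / p).exists()]

if __name__ == "__main__":
    missing = missing_artifacts()
    if missing:
        print("setup incomplete, missing:", missing)
    else:
        print("workspace ready")
```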
When preprocessing is finished, the experiments can be rerun using the shell script ./run.sh, which will run each of the following Python files in turn:
python ./gemma-2b-hazard.py
python ./gemma-2b-movies.py
python ./gemma-7b-hazard.py
python ./gemma-7b-movies.py
python ./llama-8b-hazard.py
python ./llama-8b-movies.py
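If running the shell script is not convenient, the same sequence can be driven from Python as in the sketch below. The script list mirrors the commands above; the helper run_all is an illustrative assumption, not code shipped with the repository.

```python
import subprocess
import sys

# Experiment scripts executed by ./run.sh, in the order listed above.
SCRIPTS = [
    "./gemma-2b-hazard.py",
    "./gemma-2b-movies.py",
    "./gemma-7b-hazard.py",
    "./gemma-7b-movies.py",
    "./llama-8b-hazard.py",
    "./llama-8b-movies.py",
]

def run_all(scripts):
    """Run each script in turn with the current interpreter.

    Returns the first script that exits with a non-zero status,
    or None if every script succeeds.
    """
    for script in scripts:
        result = subprocess.run([sys.executable, script])
        if result.returncode != 0:
            return script
    return None
```

Calling run_all(SCRIPTS) then returns None on success, or the name of the first failing experiment so it can be restarted individually.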
Originally, the experiments were performed using Python 3.10.12 on 8 NVIDIA RTX A5500 graphics cards with 24GB of memory each.
Finally, the Jupyter notebooks evaluate-hazard.ipynb and evaluate-movies.ipynb can be used to analyze the results.