*(Figure: Task Example)*

# Missci: Reconstructing the Fallacies in Misrepresented Science (ACL 2024)


Abstract: Health-related misinformation on social networks can lead to poor decision-making and real-world dangers. Such misinformation often misrepresents scientific publications and cites them as "proof" to gain perceived credibility. To effectively counter such claims automatically, a system must explain how the claim was falsely derived from the cited publication. Current methods for automated fact-checking or fallacy detection neglect to assess the (mis)used evidence in relation to misinformation claims, which is required to detect the mismatch between them. To address this gap, we introduce Missci, a novel argumentation theoretical model for fallacious reasoning together with a new dataset for real-world misinformation detection that misrepresents biomedical publications. Unlike previous fallacy detection datasets, Missci (i) focuses on implicit fallacies between the relevant content of the cited publication and the inaccurate claim, and (ii) requires models to verbalize the fallacious reasoning in addition to classifying it. We present Missci as a dataset to test the critical reasoning abilities of large language models (LLMs), which are required to reconstruct real-world fallacious arguments, in a zero-shot setting. We evaluate two representative LLMs and the impact of different levels of detail about the fallacy classes provided to the LLM via prompts. Our experiments and human evaluation show promising results for GPT 4, while also demonstrating the difficulty of this task.

Contact person: Max Glockner

UKP Lab | TU Darmstadt

This repository contains Missci, a novel dataset of reconstructed fallacious arguments that misrepresent scientific publications. We provide all code necessary to reproduce and evaluate our results and to use LLMs for reconstructing the fallacious arguments. Don't hesitate to send us an e-mail or report an issue if you have further questions.

## Setup

Follow these instructions to recreate the Python environment used for all our experiments. All experiments ran on A100 GPUs.

We use Python 3.10. To create a Python environment with all necessary dependencies, run:

```bash
python -m venv missci
source missci/bin/activate
pip install -r requirements.txt
```

For Llama 2 / GPT 4 prompting, edit the `llm-config.json` file:

```json
{
  "gpt-4": {
    "AZURE_OPENAI_ENDPOINT": "<endpoint string>",
    "OPENAI_API_KEY": "<api key>"
  },
  "llama2": {
    "directory": "<llama2 directory>"
  }
}
```
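For reference, here is a minimal sketch of how such a config could be read in your own scripts; the key names mirror the JSON above, but the `load_llm_config` helper is hypothetical and not part of the repository:

```python
import json
import os


def load_llm_config(path: str = "llm-config.json") -> dict:
    """Read llm-config.json and expose the GPT 4 credentials as environment
    variables, as typically expected by Azure OpenAI clients."""
    with open(path, "r", encoding="utf-8") as f:
        config = json.load(f)
    os.environ["AZURE_OPENAI_ENDPOINT"] = config["gpt-4"]["AZURE_OPENAI_ENDPOINT"]
    os.environ["OPENAI_API_KEY"] = config["gpt-4"]["OPENAI_API_KEY"]
    return config


# Example: locate the local Llama 2 checkpoints.
llama2_dir = load_llm_config()["llama2"]["directory"]
```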

## Structure

## How to use

### Argument Reconstruction (Baselines)

Run `run-argument-reconstruction.py` to re-create the results for argument reconstruction with LLMs or with the random baseline.

To run the baselines, run:

```bash
python run-argument-reconstruction.py eval-random claim
python run-argument-reconstruction.py eval-random p0
```

Each baseline randomly selects a fallacy class and predicts the "claim" or "p0" as the fallacious premise. If not specified otherwise, each baseline is run five times with the seeds [1, 2, 3, 4, 5]. The predictions and evaluations will be stored in the `generate-classify` directory.
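For intuition, a minimal sketch of what the random baseline computes per instance (the fallacy inventory and field names are placeholders here, not the repository's code):

```python
import random

# Placeholder inventory; the actual fallacy classes are defined in the Missci paper.
FALLACY_CLASSES = ["Fallacy A", "Fallacy B", "Fallacy C"]


def random_baseline(instance: dict, premise_mode: str, seed: int) -> dict:
    """Pick a random fallacy class and reuse the claim (or the accurate
    premise p0) as the predicted fallacious premise."""
    rng = random.Random(seed)
    premise = instance["claim"] if premise_mode == "claim" else instance["p0"]
    return {"fallacy": rng.choice(FALLACY_CLASSES), "fallacious_premise": premise}


# The baseline is repeated once per seed (default seeds: 1-5).
predictions = [
    random_baseline({"claim": "...", "p0": "..."}, "claim", seed)
    for seed in [1, 2, 3, 4, 5]
]
```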

### Argument Reconstruction (LLM)

Prompts for argument reconstruction via LLMs are located in the `gen_cls` directory. To prompt Llama 2 or GPT 4 to reconstruct fallacious arguments, run the `run-argument-reconstruction.py` script:

```bash
python run-argument-reconstruction.py llama <prompt-template> <model-size> [<seed>] [--dev]
python run-argument-reconstruction.py gpt4 <prompt-template> [--dev] [--overwrite]
```

To parse and evaluate the LLM output, use:

```bash
python run-argument-reconstruction.py parse-llm-output <file> <k> [--dev]
```

Arguments:

| Name | Description | Example |
| --- | --- | --- |
| `<prompt-template>` | Path to the prompt template (relative to the `prompt_templates` directory) | `gen_cls/p4-connect-D.txt` |
| `<model-size>` | Model size for Llama 2 | One of `70b`, `13b`, `7b` |
| `<seed>` | Optional random seed (default: 1) | `42` |
| `<file>` | Name (not path) of the file containing the raw LLM outputs for evaluation | `missci_gen_cls--p4-connect-D_70b__test.jsonl` |
| `<k>` | For evaluation, consider the top k results | `1` |
| `--dev` | If set, only instances of the validation set are used (otherwise test instances) | `--dev` |
| `--overwrite` | If set, existing GPT 4 predictions are not re-used but re-generated | `--overwrite` |

The LLM output will be stored in the `generate-classify-raw` directory. The evaluation results and predictions will be stored in the `generate-classify` directory.

Example:

To run the LLMs using the Definition prompt template, run:

```bash
python run-argument-reconstruction.py llama gen_cls/p4-connect-D.txt 70b
python run-argument-reconstruction.py gpt4 gen_cls/p4-connect-D.txt
```

To evaluate the Llama 2 output, run:

```bash
python run-argument-reconstruction.py parse-llm-output missci_gen_cls--p4-connect-D_70b__test.jsonl 1
```
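For intuition on the `<k>` argument: during evaluation, the top k generated candidates are considered. A simplified sketch of such an @k check for the predicted fallacy class (illustrative only; the repository's evaluation also scores the generated premise texts):

```python
def hit_at_k(gold_classes: set, ranked_predictions: list, k: int) -> bool:
    """True if any of the top-k predicted fallacy classes matches a gold class."""
    return any(pred in gold_classes for pred in ranked_predictions[:k])


hit_at_k({"Fallacy A"}, ["Fallacy B", "Fallacy A"], k=1)  # False
hit_at_k({"Fallacy A"}, ["Fallacy B", "Fallacy A"], k=2)  # True
```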

### Consistency

To measure the LLM consistency by prompting the LLMs to re-classify the fallacy over their generated fallacious premises, use the `run-get-consistency.py` script:

```bash
python run-get-consistency.py llama <file> <prompt-template> <prefix> [--dev]
python run-get-consistency.py gpt4 <file> <prompt-template> <prefix> [--dev] [--overwrite]
```

Arguments:

| Name | Description | Example |
| --- | --- | --- |
| `<file>` | Path to the input file within the `predictions/generate-classify` directory | `missci_gen_cls--p4-connect-D_70b__testk-1.jsonl` |
| `<prompt-template>` | Path to the prompt template (relative to the `prompt_templates` directory) | `cls_with_premise/classify-D.txt` |
| `<prefix>` | Prefix to be used when storing the results to avoid naming conflicts | `_p4-D` |
| `--dev` | If set, only instances of the validation set are used (otherwise test instances) | `--dev` |
| `--overwrite` | If set, existing GPT 4 predictions are not re-used but re-generated | `--overwrite` |

Example:

To assess the consistency of Llama 2 using the Definition prompt template, run:

```bash
python run-get-consistency.py llama missci_gen_cls--p4-connect-D_70b__testk-1.jsonl cls_with_premise/classify-D.txt _p4-D
```

To parse and evaluate the resulting outputs, run:

```bash
python run-get-consistency.py consistency-parse missci_p4-D_cls_with_premise--classify-D_70b__test.jsonl
```
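Conceptually, consistency asks whether the LLM assigns the same fallacy class when it is shown its own generated premise again. A minimal sketch of this comparison (illustrative only; the repository's aggregation may differ):

```python
def consistency(original_classes: list, reclassified_classes: list) -> float:
    """Fraction of instances where the fallacy class predicted during argument
    reconstruction agrees with the class predicted when re-classifying the
    LLM's own generated fallacious premise."""
    pairs = list(zip(original_classes, reclassified_classes))
    return sum(a == b for a, b in pairs) / len(pairs)


consistency(["Fallacy A", "Fallacy B"], ["Fallacy A", "Fallacy C"])  # 0.5
```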

### Fallacy classification (over gold premises)

To prompt LLMs to classify the fallacies over the provided gold fallacious premises, run the `run-fallacy-classification-with-gold-premise.py` script:

```bash
python run-fallacy-classification-with-gold-premise.py llama <prompt-template> <model-size> [<seed>] [--dev]
python run-fallacy-classification-with-gold-premise.py gpt4 <prompt-template> [--dev] [--overwrite]
```

A list of available prompts is provided in the `cls_with_premise` directory. Parsing and evaluation are shown in the example below.

Arguments:

| Name | Description | Example |
| --- | --- | --- |
| `<prompt-template>` | Path to the prompt template (relative to the `prompt_templates` directory) | `cls_with_premise/classify-D.txt` |
| `<model-size>` | Model size for Llama 2 | One of `70b`, `13b`, `7b` |
| `<seed>` | Optional random seed (default: 1) | `42` |
| `--dev` | If set, only instances of the validation set are used (otherwise test instances) | `--dev` |
| `--overwrite` | If set, existing GPT 4 predictions are not re-used but re-generated | `--overwrite` |

Example:

To run Llama 2 using the Definition prompt template, run:

```bash
python run-fallacy-classification-with-gold-premise.py llama cls_with_premise/classify-D.txt 70b
```

To parse and evaluate the results, run:

```bash
python run-fallacy-classification-with-gold-premise.py parse-llm-output missci_cls_with_premise--classify-D_70b__test.jsonl
```
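The `parse-llm-output` step maps the model's free-text answer back onto the known fallacy classes before scoring. A minimal sketch of such a mapping (illustrative only; the class names are placeholders and the repository's parser is more elaborate):

```python
def extract_fallacy_class(llm_answer: str, known_classes: list) -> str | None:
    """Return the first known fallacy class mentioned in the raw LLM answer,
    or None if the answer cannot be mapped to any class."""
    lowered = llm_answer.lower()
    for fallacy in known_classes:
        if fallacy.lower() in lowered:
            return fallacy
    return None


extract_fallacy_class(
    "The argument commits Fallacy A because ...", ["Fallacy A", "Fallacy B"]
)  # "Fallacy A"
```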

### Fallacy classification (without premise)

To prompt LLMs to classify the fallacies without fallacious premises, run the `run-fallacy-classification-without-premise.py` script:

```bash
python run-fallacy-classification-without-premise.py llama <prompt-template> <model-size> [<seed>] [--dev]
python run-fallacy-classification-without-premise.py gpt4 <prompt-template> [--dev] [--overwrite]
```

A list of available prompts is provided in the `cls_without_premise` directory.

Arguments:

| Name | Description | Example |
| --- | --- | --- |
| `<prompt-template>` | Path to the prompt template (relative to the `prompt_templates` directory) | `cls_without_premise/p4-connect-cls-D.txt` |
| `<model-size>` | Model size for Llama 2 | One of `70b`, `13b`, `7b` |
| `<seed>` | Optional random seed (default: 1) | `42` |
| `--dev` | If set, only instances of the validation set are used (otherwise test instances) | `--dev` |
| `--overwrite` | If set, existing GPT 4 predictions are not re-used but re-generated | `--overwrite` |

Example:

To run Llama 2 using the Definition prompt template, run:

```bash
python run-fallacy-classification-without-premise.py llama cls_without_premise/p4-connect-cls-D.txt 70b
```

To parse and evaluate the results, run:

```bash
python run-fallacy-classification-without-premise.py parse-llm-output missci_cls_without_premise--p4-connect-cls-D_70b__test.jsonl
```

## Citation

When using our dataset or code, please cite us with:

```bibtex
@inproceedings{glockner-etal-2024-missci,
    title = "Missci: Reconstructing Fallacies in Misrepresented Science",
    author = "Glockner, Max  and
      Hou, Yufang  and
      Nakov, Preslav  and
      Gurevych, Iryna",
    editor = "Ku, Lun-Wei  and
      Martins, Andre  and
      Srikumar, Vivek",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-long.240",
    doi = "10.18653/v1/2024.acl-long.240",
    pages = "4372--4405"
}
```

This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.