ethz-spylab/hallucinated-citations

Check for probably-hallucinated references in arxiv papers

Some vibe-coded scripts (thanks, Claude Code) for finding references in arXiv papers that:

  1. match the title of a real arxiv paper
  2. have a completely different author list

Some manual checks suggest that most of these references are LLM-hallucinated.

See this blog post for some analysis of the data.
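
The two criteria above (title matches a real arXiv paper, author list is completely different) can be sketched as follows. This is a minimal illustration, not the code in `find_hallucinated_refs.py`: the function names, the `db` layout (normalized title mapped to a list of author names), and the "last word of a name is the surname" rule are all assumptions made for the example.

```python
from __future__ import annotations

import re
import unicodedata


def _ascii_lower(text: str) -> str:
    # Strip accents and lowercase so "Müller" and "Muller" compare equal.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii").lower()


def normalize_title(title: str) -> str:
    # Collapse punctuation and whitespace so minor formatting differences
    # between a citation and the arXiv record do not prevent a match.
    return re.sub(r"[^a-z0-9]+", " ", _ascii_lower(title)).strip()


def surnames(authors: list[str]) -> set[str]:
    # ASSUMPTION: names look like "First Last" or "F. Last", so the last
    # alphabetic token is treated as the surname.
    result = set()
    for name in authors:
        tokens = re.findall(r"[a-z]+", _ascii_lower(name))
        if tokens:
            result.add(tokens[-1])
    return result


def is_suspicious(ref_title: str, ref_authors: list[str],
                  db: dict[str, list[str]]) -> bool:
    # `db` is a hypothetical mapping: normalized title -> real author names.
    real_authors = db.get(normalize_title(ref_title))
    if real_authors is None:
        return False  # Title is not a known arXiv paper: criterion 1 fails.
    cited = surnames(ref_authors)
    if not cited:
        return False  # No parsable authors: nothing to compare.
    # Criterion 2: not a single surname in common with the real paper.
    return cited.isdisjoint(surnames(real_authors))
```

Requiring zero overlap in surnames, rather than an exact author-list match, keeps truncated lists ("et al.") and reordered authors from being flagged. A real pipeline would likely need more heuristics than this, for example to handle "Last, First" name order.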

[Figure: Growth of hallucinated refs over time]

Dependencies

Steps to acquire the data

  1. Download metadata for all arxiv papers (about 1.5GB as of September 2025)
     foo@bar:~$ ./dl_arxiv_oai.sh
  2. Build a much smaller DB (arxiv_papers.json) containing the metadata we actually need
     foo@bar:~$ python build_arxiv_db.py
  3. Download all arxiv PDFs since 2000 and extract refs (about 30GB). WARNING: THIS BILLS YOUR AWS ACCOUNT
     foo@bar:~$ python dl_pdfs_and_extract_refs.py
  4. Find potential hallucinated refs with a bunch of heuristics
     foo@bar:~$ python find_hallucinated_refs.py
  5. Filter false positives using Claude. WARNING: THIS BILLS YOUR ANTHROPIC ACCOUNT
     foo@bar:~$ python filter_hallucinated_refs.py
  6. Plot the results
     foo@bar:~$ python show_results.py
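
Since two of the steps above bill external accounts, a small driver that refuses to run them without explicit approval may be handy. The sketch below is not part of the repository: `STEPS`, `runnable_steps`, and `run_pipeline` are hypothetical names, and it assumes the scripts are run in the listed order from the repository root, with each step depending on the previous one.

```python
import subprocess

# (command, billed service or None), in the order listed above.
STEPS = [
    (["./dl_arxiv_oai.sh"], None),
    (["python", "build_arxiv_db.py"], None),
    (["python", "dl_pdfs_and_extract_refs.py"], "AWS"),
    (["python", "find_hallucinated_refs.py"], None),
    (["python", "filter_hallucinated_refs.py"], "Anthropic"),
    (["python", "show_results.py"], None),
]


def runnable_steps(approved=frozenset()):
    """Return the commands that may run, stopping at the first billed
    step whose service is not in `approved`."""
    commands = []
    for command, bills in STEPS:
        if bills is not None and bills not in approved:
            break  # ASSUMPTION: later steps need this step's output.
        commands.append(command)
    return commands


def run_pipeline(approved=frozenset(), dry_run=True):
    """Print each permitted command and, unless dry_run, execute it."""
    commands = runnable_steps(approved)
    for command in commands:
        print("$", " ".join(command))
        if not dry_run:
            # check=True aborts the pipeline if a step fails.
            subprocess.run(command, check=True)
    return commands
```

Calling `run_pipeline()` only prints the first two commands; `run_pipeline({"AWS", "Anthropic"}, dry_run=False)` would execute all six and incur the charges mentioned in the warnings.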
