Some vibe-coded scripts (thanks Claude-Code) for finding references in arxiv papers that:
- match the title of a real arxiv paper
- have a completely different author list
Some manual checks suggest that most of these references are LLM-hallucinated.
See this blog post for some analysis of the data.
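The two criteria at the top (same title as a real arxiv paper, completely different author list) can be sketched roughly as below. This is a minimal illustration, not the code in `find_hallucinated_refs.py`: the helper names (`normalize_title`, `surnames`, `is_suspicious`), the surname-only comparison, and the shape of `db` (normalized title mapped to a list of author names) are all assumptions made for the example.

```python
import re
import unicodedata


def normalize_title(text: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", no_accents.lower()).strip()


def surnames(authors: list[str]) -> set[str]:
    """Assumption: the last token of each name is the surname."""
    result = set()
    for name in authors:
        tokens = normalize_title(name).split()
        if tokens:
            result.add(tokens[-1])
    return result


def is_suspicious(ref_title: str, ref_authors: list[str], db: dict[str, list[str]]) -> bool:
    """True if the title matches a known paper but no surname overlaps.

    `db` is a hypothetical mapping: normalized title -> real author names.
    """
    real_authors = db.get(normalize_title(ref_title))
    if real_authors is None:
        return False  # title is not a known arxiv paper
    cited = surnames(ref_authors)
    if not cited:
        return False  # no authors extracted, so nothing to compare
    return cited.isdisjoint(surnames(real_authors))
```

For example, with `db = {normalize_title("Attention Is All You Need"): ["Ashish Vaswani", "Noam Shazeer"]}`, a reference citing that title with authors `["J. Smith", "A. Kumar"]` is flagged, while one citing `["A. Vaswani"]` is not. A check this strict would still produce false positives (for instance, two different papers sharing a title), which is presumably why a separate filtering step exists.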
- Download metadata for all arxiv papers (about 1.5GB as of September 2025)
foo@bar:~$ ./dl_arxiv_oai.sh
- Build a much smaller DB (arxiv_papers.json) containing the metadata we actually need
foo@bar:~$ python build_arxiv_db.py
- Download all arxiv PDFs since 2000 and extract refs (about 30GB). WARNING: THIS BILLS YOUR AWS ACCOUNT
foo@bar:~$ python dl_pdfs_and_extract_refs.py
- Find potential hallucinated refs with a bunch of heuristics
foo@bar:~$ python find_hallucinated_refs.py
- Filter false positives using Claude. WARNING: THIS BILLS YOUR ANTHROPIC ACCOUNT
foo@bar:~$ python filter_hallucinated_refs.py
- Plot the results
foo@bar:~$ python show_results.py
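
The steps above run in a fixed order, and two of them cost money. If you would rather not type them one by one, a small wrapper along these lines could run them in sequence and ask for confirmation before each billed step. This is only a sketch and not part of the repo: `STEPS`, `run_pipeline`, and the prompt wording are made up for illustration; only the script names come from the list above.

```python
import subprocess

# (command, billed account or None), in the order listed above.
STEPS = [
    (["./dl_arxiv_oai.sh"], None),
    (["python", "build_arxiv_db.py"], None),
    (["python", "dl_pdfs_and_extract_refs.py"], "AWS"),
    (["python", "find_hallucinated_refs.py"], None),
    (["python", "filter_hallucinated_refs.py"], "Anthropic"),
    (["python", "show_results.py"], None),
]


def run_pipeline(run=subprocess.run, ask=input):
    """Run each step in order and return the commands that were executed.

    Stops at the first billed step the user does not confirm with "y".
    `run` and `ask` are injectable so the flow can be dry-run or tested.
    """
    executed = []
    for cmd, billed in STEPS:
        if billed is not None:
            prompt = f"{' '.join(cmd)} bills your {billed} account. Continue? [y/N] "
            if ask(prompt).strip().lower() != "y":
                break
        run(cmd, check=True)  # raises CalledProcessError if a step fails
        executed.append(cmd)
    return executed
```

Calling `run_pipeline()` from the repo directory runs the real scripts; passing `run=lambda cmd, check: print(cmd)` gives a dry run that only prints what would be executed.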