
SynthWorlds: Controlled Parallel Worlds for Disentangling Reasoning and Knowledge in Language Models


This repo contains the experiment and evaluation code, along with the dataset, for SynthWorlds, a framework and benchmark for disentangling task reasoning from parametric knowledge in language models (LMs).

📚 SynthWorlds Data

The dataset consists of two parallel corpora:

  • SynthWorld-RM (Real-Mapped): grounded in real-world entities that are likely to be in LMs' parametric knowledge.
  • SynthWorld-SM (Synthetic-Mapped): grounded in synthetic entities for which LMs have no parametric knowledge.

SynthWorld-SM and SynthWorld-RM each contain 6,290 documents, over 1.5M tokens, and 161K facts. On top of these corpora, we provide mirrored tasks with matched reasoning complexity: (1) Multi-hop Question Answering, with 1.2K task instances; and (2) Page Navigation (i.e., navigating from a start page to a goal page using only the hyperlinks on each page), with 1K task instances. The corpora are constructed using the Wikidata knowledge base.

We quantify the knowledge advantage gap as the performance difference between the real-mapped (RM) and synthetic-mapped (SM) settings.
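For example, the sketch below (in Python, with made-up accuracy numbers) computes the gap for a hypothetical model:

# Illustrative only: the accuracies below are hypothetical, not reported results.
acc_rm = 0.80  # accuracy on real-mapped (RM) tasks
acc_sm = 0.55  # accuracy on synthetic-mapped (SM) tasks

knowledge_advantage_gap = acc_rm - acc_sm
print(f"Knowledge advantage gap: {knowledge_advantage_gap:.2f}")  # prints 0.25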

The datasets and tasks are located in the datasets folder (with one file per instance in datasets_disagg).

🚀 Quick Start

This repo uses uv to manage dependencies. After installing uv, run the following commands to install the dependencies and activate the virtual environment.

# install dependencies (creates virtual environment) and sync to latest environment
uv sync --prerelease=allow

# activate venv environment
source .venv/bin/activate

In particular, we use langchain for calling LLMs with tools, litellm for accessing models across providers, and langfuse for tracking LLM calls.

Setting the environment variables

Next, set the relevant LLM API keys in a .env environment variable file. See the litellm documentation for the environment variable name used by each provider.

OPENAI_API_KEY=""
GEMINI_API_KEY=""
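
If you want to verify that the keys in .env are visible to Python, a minimal check (assuming python-dotenv is installed in the environment; the repo's own scripts may load .env differently) is:

# Sanity check: confirm the keys from .env are set in the process environment.
import os
from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env
for key in ("OPENAI_API_KEY", "GEMINI_API_KEY"):
    print(key, "set" if os.environ.get(key) else "MISSING")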

To check that things work properly, run:

python run_scripts/run_qa.py \
  dataset=qa-sm \
  agent=qa-rc-agent \
  llm.model=openai/gpt-5-mini \
  llm.params.temperature=1 \
  num_workers=1 \
  max_instances=2

Running Experiments

To run experiments on the SynthWorlds datasets, we provide the following scripts:

bash run_scripts/run_qa.sh
bash run_scripts/run_nav.sh

🤖 Baselines

Multi-hop QA

In our experiments, we evaluate three primary baselines (example invocations follow the list):

  1. Closed-book (i.e., qa-no-rag-agent in run_conf/agent), where the model has no access to documents and answers directly from its parametric knowledge.
  2. One-step RAG (i.e., hipporag-agent in run_conf/agent), where the model retrieves supporting documents once before answering.
  3. IRCoT + RAG (i.e., hipporag-ircot-agent in run_conf/agent), which interleaves retrieval with chain-of-thought reasoning, enabling iterative reasoning and retrieval steps.
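
For example (mirroring the quick-start command above; the dataset and model values here are placeholders), each baseline can be selected via the agent override:

# Closed-book
python run_scripts/run_qa.py dataset=qa-sm agent=qa-no-rag-agent llm.model=openai/gpt-5-mini

# One-step RAG
python run_scripts/run_qa.py dataset=qa-sm agent=hipporag-agent llm.model=openai/gpt-5-mini

# IRCoT + RAG
python run_scripts/run_qa.py dataset=qa-sm agent=hipporag-ircot-agent llm.model=openai/gpt-5-mini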

Page Navigation

We evaluate an agent equipped with two function-calling tools: click_link, which allows the agent to click any link on the current page, and backtrack, which allows the agent to return to a previously visited page (see the sketch after the list below). We evaluate the agent under two observation conditions:

  1. Links Only (wikinav-agent-links-only), where the agent observes only the set of outgoing links on each page.
  2. Content + Links (wikinav-agent), where the agent observes both the outgoing links and the full page text.
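
Below is a minimal, hypothetical sketch of what the two tools could look like using langchain's @tool decorator (the repo uses langchain for tool-calling LLMs); the toy page graph and state handling here are illustrative assumptions, not the repo's actual implementation.

# Hypothetical sketch of the two navigation tools; not the repo's implementation.
from langchain_core.tools import tool

PAGES = {"Start": ["A", "B"], "A": ["Goal"], "B": [], "Goal": []}  # toy corpus: page -> outgoing links
history = ["Start"]  # visited pages, most recent last

@tool
def click_link(title: str) -> str:
    """Click a link on the current page and move to that page."""
    current = history[-1]
    if title not in PAGES[current]:
        return f"No link to '{title}' on page '{current}'."
    history.append(title)
    return f"Now on page '{title}'. Links: {PAGES[title]}"

@tool
def backtrack() -> str:
    """Return to the previously visited page."""
    if len(history) > 1:
        history.pop()
    return f"Now on page '{history[-1]}'. Links: {PAGES[history[-1]]}"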

Viewing the Data

You can run the following code to load the data.

from synthworld_experiments.datasets import QAWikiDataset, WikiNavDataset

qa_rm = QAWikiDataset.from_json("datasets/mhqa/qa-rm.json")
nav_rm = WikiNavDataset.from_json("datasets/wikinav/wikinav-rm.json")

print("Number of qa instances:", len(qa_rm))
print("Number of nav instances:", len(nav_rm))

We also provide loader functions to load the datasets directly from Hugging Face.

from synthworld_experiments.datasets import QAWikiDataset, WikiNavDataset
from synthworld_experiments.loader import load_qa_dataset, load_nav_dataset

qa_sm: QAWikiDataset = load_qa_dataset("sm")
qa_rm: QAWikiDataset = load_qa_dataset("rm")
nav_sm: WikiNavDataset = load_nav_dataset("sm")
nav_rm: WikiNavDataset = load_nav_dataset("rm")

We also provide two Streamlit UIs for exploring the data. These read from the datasets_disagg folder.


# To see qa data
streamlit run visualizers/dataset/mhqa.py

# To see nav data
streamlit run visualizers/dataset/wikinav.py

📖 Citation

If you use SynthWorlds in your research, please cite our paper:

@article{gu2025synthworld,
  title={SynthWorlds: Controlled Parallel Worlds for Disentangling Reasoning and Knowledge in Language Models}, 
  author={Ken Gu and Advait Bhat and Mike A Merrill and Robert West and Xin Liu and Daniel McDuff and Tim Althoff},
  journal={arXiv preprint arXiv:2510.24427},
  year={2025},
  url={https://arxiv.org/abs/2510.24427}, 
}
