This repository serves as the benchmarking system for efforts to provide an agentic interface to cBioPortal.org. It is designed to evaluate various "agents" (like MCP servers or standalone APIs) that answer questions about cancer genomics data.
🏆 View the Leaderboard to see current benchmark results.
The system provides a modular CLI to:
- Ask single questions to different agents.
- Batch process a set of questions.
- Benchmark agents against a gold-standard dataset, automatically evaluating their accuracy using an LLM judge.
The system currently supports the following agent types via the `--agent-type` flag:

- `mcp-clickhouse`: The original Model Context Protocol (MCP) agent, connected to a ClickHouse database.
- `mcp-navigator-agent`: The cBioPortal MCP agent service (an HTTP API wrapper around MCP).
- `cbio-nav-null`: A baseline/testing agent (or a specific implementation hosted at a URL).
- `cbio-qa-null`: Another baseline/testing agent, similar to `cbio-nav-null` but using a different configuration.
```bash
# Create Python 3.13 virtual environment
uv venv .venv --python 3.13
source .venv/bin/activate

# Install dependencies in editable mode
uv sync --editable
```

Create a `.env` file or export the following environment variables:
General:

- `ANTHROPIC_API_KEY`: Required for the LLM judge (evaluation). Alternatively, use AWS Bedrock with `--use-bedrock` and `--aws-profile`.

For `mcp-navigator-agent`:

- `CBIOPORTAL_MCP_AGENT_URL`: URL of the cBioPortal MCP agent API (e.g., `http://localhost:8080`).

For `cbio-nav-null`:

- `NULL_NAV_URL`: URL of the agent API (e.g., `http://localhost:5000`).

For `cbio-qa-null`:

- `NULL_QA_URL`: URL of the agent API (e.g., `http://localhost:5002`).

For `mcp-clickhouse`:

- `MCP_CLICKHOUSE_AGENT_URL`: URL of the MCP ClickHouse agent API (e.g., `http://localhost:8080`).

Optional (tracing):

- `PHOENIX_API_KEY`: API key for Arize Phoenix tracing.
- `PHOENIX_COLLECTOR_ENDPOINT`: Tracing collector endpoint.
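For reference, a `.env` for a typical local setup might look like the following. All keys and URLs are placeholders; set only the variables your chosen `--agent-type` actually needs:

```shell
# LLM judge (alternatively, pass --use-bedrock with --aws-profile)
ANTHROPIC_API_KEY=your-anthropic-key

# Agent endpoints (only the one matching your --agent-type is required)
CBIOPORTAL_MCP_AGENT_URL=http://localhost:8080
NULL_NAV_URL=http://localhost:5000
NULL_QA_URL=http://localhost:5002
MCP_CLICKHOUSE_AGENT_URL=http://localhost:8080

# Optional: Arize Phoenix tracing
PHOENIX_API_KEY=your-phoenix-key
PHOENIX_COLLECTOR_ENDPOINT=https://your-collector-endpoint
```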
The benchmark command is the main way to evaluate an agent. It automates generation, evaluation, and leaderboard updates.
```bash
# Run benchmark for the cBioPortal MCP agent
cbioportal-mcp-qa benchmark --agent-type mcp-navigator-agent --questions 1-5

# Run benchmark for the null agent
cbioportal-mcp-qa benchmark --agent-type cbio-nav-null --questions 1-5

# Run benchmark for the direct MCP connection
cbioportal-mcp-qa benchmark --agent-type mcp-clickhouse
```

What happens:
- Questions are loaded from `input/autosync-public.csv`.
- The specified agent generates answers.
- Answers are saved to `results/{agent_type}/{YYYYMMDD}/answers/`.
- `simple_eval.py` evaluates the answers against the expected output (using `Navbot Expected Link` as the ground truth).
- Results are saved to `results/{agent_type}/{YYYYMMDD}/eval/`.
- `LEADERBOARD.md` is updated with the latest scores.
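If you script against these outputs, the dated directory convention can be reconstructed with a small helper like the one below (this helper is illustrative and not part of the package):

```python
from datetime import date
from pathlib import Path

def result_dirs(agent_type: str, run_date: date) -> tuple[Path, Path]:
    """Build the answers/ and eval/ output paths for one benchmark run,
    following the results/{agent_type}/{YYYYMMDD}/ convention."""
    base = Path("results") / agent_type / run_date.strftime("%Y%m%d")
    return base / "answers", base / "eval"

answers_dir, eval_dir = result_dirs("mcp-clickhouse", date(2024, 6, 1))
print(answers_dir)  # results/mcp-clickhouse/20240601/answers
```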
Reproducibility testing measures how consistently an agent answers the same questions across multiple runs. Answers are compared using semantic equivalence -- two answers are considered equivalent if they convey the same factual information, even if worded differently.
```bash
# Run benchmark with 3 reproducibility runs (recommended)
cbioportal-mcp-qa benchmark --agent-type mcp-clickhouse --questions 1-5 --reproducibility-runs 3

# Run with 5 reproducibility runs for more statistical confidence
cbioportal-mcp-qa benchmark --agent-type mcp-clickhouse --questions 1-10 -r 5
```

How it works:
- The first run's answers are reused from the main benchmark (no extra API call).
- Additional runs (2 through N) generate fresh answers for the same questions.
- All pairwise combinations of runs are compared by an LLM judge for semantic equivalence.
- A `reproducibility_score` is added to the evaluation results and leaderboard.
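To make the pairwise comparison concrete, here is a minimal sketch of how a score over N runs can be computed. The real pipeline uses an LLM judge for equivalence; here an exact string match stands in for it, and the function name is illustrative:

```python
from itertools import combinations

def reproducibility_score(runs, equivalent):
    """Average agreement over all C(N, 2) run pairs and all questions.

    runs: list of N runs, each a list of per-question answers.
    equivalent: predicate deciding whether two answers match.
    """
    pairs = list(combinations(runs, 2))  # all pairwise run combinations
    n_questions = len(runs[0])
    matches = sum(
        equivalent(run_a[q], run_b[q])
        for run_a, run_b in pairs
        for q in range(n_questions)
    )
    return matches / (len(pairs) * n_questions)

# Toy example: 3 runs over 2 questions, exact match standing in for the judge
runs = [["42 studies", "yes"], ["42 studies", "no"], ["42 studies", "yes"]]
score = reproducibility_score(runs, lambda a, b: a == b)
print(score)  # 4 of 6 pairwise comparisons agree
```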
You can also run individual components manually.
```bash
# Ask using the cBioPortal MCP agent
cbioportal-mcp-qa ask "How many studies are there?" --agent-type mcp-navigator-agent

# Ask using a null agent
cbioportal-mcp-qa ask "How many studies are there?" --agent-type cbio-nav-null
```

Generate answers without running the full benchmark evaluation:

```bash
cbioportal-mcp-qa batch input/autosync-public.csv --questions 1-10 --output-dir my_results/
```

Run the evaluation script on existing output files:
```bash
python simple_eval.py \
    --input-csv input/autosync-public.csv \
    --answers-dir my_results/ \
    --answer-column "Navbot Expected Link"
```

To integrate a new agent into the benchmarking system:
1. **Create a new client class**: In `src/cbioportal_mcp_qa/`, create a new Python file (e.g., `my_new_agent_client.py`) with a class that inherits from `BaseQAClient` and implements the `ask_question` and `get_sql_queries_markdown` methods.

2. **Register the client in `llm_client.py`**: Open `src/cbioportal_mcp_qa/llm_client.py`:
   - Import your new client class.
   - Add a new `elif` condition in the `get_qa_client` factory function to return an instance of your new client when a specific `--agent-type` string is provided.
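A skeleton for the new client class might look like the following. The method signatures here are illustrative; match them to the actual `BaseQAClient` interface in `base_client.py`, and the `MY_NEW_AGENT_URL` variable is a hypothetical example:

```python
# Illustrative skeleton for src/cbioportal_mcp_qa/my_new_agent_client.py.
# In the real file, inherit from the package's base class:
#   from .base_client import BaseQAClient
#   class MyNewAgentClient(BaseQAClient): ...
import os

class MyNewAgentClient:
    """Client for a hypothetical HTTP agent reachable at MY_NEW_AGENT_URL."""

    def __init__(self, **kwargs):
        self.base_url = os.environ.get("MY_NEW_AGENT_URL", "http://localhost:9000")

    def ask_question(self, question: str) -> str:
        # A real implementation would call the agent's API here, e.g.
        # requests.post(f"{self.base_url}/ask", json={"question": question}).
        raise NotImplementedError

    def get_sql_queries_markdown(self) -> str:
        # Return any SQL the agent executed, formatted as markdown ("" if none).
        return ""
```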
   ```python
   # Example in src/cbioportal_mcp_qa/llm_client.py
   from .my_new_agent_client import MyNewAgentClient

   # ...

   def get_qa_client(agent_type: str = "mcp-clickhouse", **kwargs) -> BaseQAClient:
       if agent_type == "mcp-clickhouse":
           return MCPClickHouseClient(**kwargs)
       elif agent_type == "cbio-nav-null":
           return CBioAgentNullClient(**kwargs)
       elif agent_type == "my-new-agent":  # Your new agent type
           return MyNewAgentClient(**kwargs)
       else:
           raise ValueError(f"Unknown agent type: {agent_type}")
   ```
3. **Update `AGENT_COLUMN_MAPPING` in `benchmark.py`**: In `src/cbioportal_mcp_qa/benchmark.py`, add an entry to the `AGENT_COLUMN_MAPPING` dictionary. This maps your new `agent_type` to the column in `input/autosync-public.csv` (or other benchmark CSV) that contains the expected answer for evaluation.

   ```python
   # Example in src/cbioportal_mcp_qa/benchmark.py
   AGENT_COLUMN_MAPPING = {
       "mcp-clickhouse": "Navbot Expected Link",
       "cbio-nav-null": "Navbot Expected Link",
       "my-new-agent": "My New Agent Expected Answer Column",  # Your agent's expected answer column
   }
   ```
4. **Add configuration (if any)**: If your new agent requires specific environment variables or CLI options, update the Configuration section in `README.md` and add `click.option` decorators in `src/cbioportal_mcp_qa/main.py` as needed.
- `src/cbioportal_mcp_qa/`: Source code.
  - `main.py`: CLI entry point.
  - `benchmark.py`: Benchmarking workflow logic.
  - `evaluation.py`: Core evaluation logic (LLM judge).
  - `base_client.py`: Abstract base class for agents.
  - `null_agent_client.py`: Client for `cbio-nav-null`.
  - `llm_client.py`: Client for `mcp-clickhouse`.
- `input/`: Benchmark datasets (e.g., `autosync-public.csv`).
- `results/`: Generated answers and evaluation reports.
- `simple_eval.py`: Wrapper script for running evaluation manually.
- `agents/`: Docker Compose configurations for running external agent services, such as `docker-compose.yml` for `cbio-null-agent`.