
[Tracking]: Storybook MCP Dev validation #97

@valentinpalkovic

Description

Problem statement

The Dev Toolset in @storybook/addon-mcp is an experimental MCP server released in a previous cycle to help agents auto-generate and link Storybook stories for UI development. Despite a lack of thorough validation, the package has become the de facto public face of our MCP efforts, growing rapidly to ~50k weekly npm downloads. This leaves us with:

  1. Unknown Quality: We lack concrete data to confirm that agent-generated stories are correct, functional, or adhere to user patterns (CSF3, Next.js). The quality of the core workflow is unverified at scale.
  2. Reputational Risk: A widely used prototype that delivers a poor or inconsistent experience actively damages user trust in Storybook's AI strategy.
  3. Black Box: We are blind to the major points of failure (edge cases, agent compatibility, token cost) in the most-used MCP component.

Milestones

M1: Eval Framework Expansion and Benchmarking

Owner: @valentinpalkovic
Complete by: @date

  • Adapt the existing Docs eval (cost, duration, turns) to track metrics specific to story generation quality: successful Storybook build, successful story render, successful component test (if a play function is autogenerated)
  • Add coverage as one of the dev metrics
  • Add copilot-cli as an additional agent
  • Define Benchmarks: must cover high-risk, real-world complexity across different scenarios (one way to model this matrix and the new metrics is sketched after this list):
    • Write component and story (from scratch)
    • Write only story (component already exists)
    • Modify components that have stories
    • Modify existing story
    • Simple Atoms (Button)
    • Composite Components (Card with sub-components)
    • Components requiring Mocking (MSW, module mocks)
    • Components requiring callbacks or play functions (user interaction)
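
A minimal sketch of how the expanded metrics and the benchmark matrix could be modeled; the type names and shapes below are assumptions for illustration, not the existing eval framework's API.

```ts
// Sketch only: type names and shapes are assumptions, not the existing Docs eval API.

/** Existing Docs eval metrics plus the new story-generation quality checks. */
interface DevEvalResult {
  agent: 'claude-code' | 'cursor' | 'copilot-cli';
  scenario: BenchmarkScenario;
  // Carried over from the Docs eval
  costUsd: number;
  durationMs: number;
  turns: number;
  // New dev-specific metrics
  storybookBuildSucceeded: boolean;
  storyRenderSucceeded: boolean;
  componentTestPassed?: boolean; // only when a play function was autogenerated
  coverage?: number;             // 0..1, if coverage is collected for the run
}

/** Benchmark matrix: task type x component complexity. */
type TaskType =
  | 'write-component-and-story'     // from scratch
  | 'write-story-only'              // component already exists
  | 'modify-component-with-stories'
  | 'modify-existing-story';

type ComponentComplexity =
  | 'simple-atom'           // e.g. Button
  | 'composite'             // e.g. Card with sub-components
  | 'requires-mocking'      // MSW, module mocks
  | 'requires-interaction'; // callbacks / play functions

interface BenchmarkScenario {
  task: TaskType;
  complexity: ComponentComplexity;
  fixtureRepo: string; // path to the fixture project the scenario runs against
}
```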

M2: Execute & Analyze Benchmarks

Owner: @valentinpalkovic
Complete by: @date

  • Pin a specific model for Claude Code instead of relying on its default, so runs use a consistent model
  • Run all x scenarios across Claude Code, Cursor, and Copilot CLI (a sketch of this run matrix follows below). Use the Conversation Visualizer to analyze failure modes and agent decision-making.
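
A rough sketch of the M2 run loop, assuming the existing eval script exposes (or can be wrapped in) a single `runEval` entry point; the pinned model ids are placeholders, not recommendations. It reuses the `DevEvalResult` and `BenchmarkScenario` types from the M1 sketch.

```ts
// Sketch only: `runEval` is a placeholder for the real eval entry point, and the
// pinned model ids are placeholders to be replaced with the agreed-upon models.

const AGENTS = ['claude-code', 'cursor', 'copilot-cli'] as const;
type Agent = (typeof AGENTS)[number];

// Pin one model per agent so results stay comparable across runs instead of
// depending on whichever default each CLI resolves to on a given day.
const PINNED_MODELS: Record<Agent, string> = {
  'claude-code': '<pinned-claude-model>',
  cursor: '<pinned-cursor-model>',
  'copilot-cli': '<pinned-copilot-model>',
};

async function runBenchmarkMatrix(
  scenarios: BenchmarkScenario[],
  runEval: (opts: { agent: Agent; model: string; scenario: BenchmarkScenario }) => Promise<DevEvalResult>,
): Promise<DevEvalResult[]> {
  const results: DevEvalResult[] = [];
  for (const agent of AGENTS) {
    for (const scenario of scenarios) {
      // Each agent runs the full scenario matrix with its pinned model.
      results.push(await runEval({ agent, model: PINNED_MODELS[agent], scenario }));
    }
  }
  return results;
}
```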

M3: Fixes and Prompt Engineering Optimizations

Owner: @mention
Complete by: @date

  • Implement code fixes based on M2 analysis
  • Refine the story-generation prompts so agents use the Dev toolset effectively

Gap in Evals

  • We don't directly report whether our MCP server was called, or with which parameters
  • Extract, in a clean way, what the MCP server did (actions taken, tools used, parameters passed); a possible report shape is sketched below
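
One possible report shape for closing this gap, assuming tool calls can be extracted from each agent's transcript; the field names are assumptions, not an existing format.

```ts
// Sketch only: field names are assumptions; the actual transcript format differs per agent.

interface McpToolCall {
  tool: string;                    // e.g. the story-generation tool name
  params: Record<string, unknown>; // arguments the agent passed
  isError: boolean;                // whether the server returned an error result
  durationMs?: number;
}

interface McpUsageReport {
  serverCalled: boolean; // was @storybook/addon-mcp invoked at all?
  calls: McpToolCall[];  // every tool invocation, in order
}

/** Collapse the extracted tool calls of one run into a report we can log per eval. */
function summarizeMcpUsage(calls: McpToolCall[]): McpUsageReport {
  return { serverCalled: calls.length > 0, calls };
}
```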

Questions

  • Collaboration of dev and docs tools: How should they work together?
  • Evals for x different scenarios instead of writing a script for real-world repositories: Is this the right approach?
  • Do we need a different Google Sheet for tracking these evals?
  • Adjust the eval script to run multiple evals at once: How should this be implemented?
  • How do we respect an existing Storybook project's conventions?

========= Everything below this line is strictly nice-to-have =========

Open questions

  • Where should we document the Dev workflow, setup and known limitations?

Nice-to-haves

  • Research automated ways to score "story quality" beyond just rendering success (AST diffing, static analysis, functional metrics); a static-analysis sketch follows this list
  • Add runtime-evaluation MCP tools that give LLMs real-time information about a rendered story, so the LLM gains self-healing capabilities in case of preview or play-function errors, accessibility violations, etc.
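
As a starting point for the static-analysis idea, a sketch using the TypeScript compiler API to pull two cheap quality signals out of a generated CSF3 file; the heuristics (counting exported consts, spotting a `play` property) are assumptions and would need hardening against real-world story files.

```ts
// Sketch only: rough heuristics for static "story quality" signals in a CSF3 file.
import * as ts from 'typescript';

interface StoryQualitySignals {
  exportedStories: number;  // named `export const` declarations (stories in CSF3)
  hasPlayFunction: boolean; // at least one `play:` property, i.e. an interaction test
}

function analyzeStoryFile(source: string): StoryQualitySignals {
  const file = ts.createSourceFile('Component.stories.tsx', source, ts.ScriptTarget.Latest, true);
  let exportedStories = 0;
  let hasPlayFunction = false;

  const visit = (node: ts.Node): void => {
    // CSF3: each named `export const Primary = { ... }` is a story.
    if (
      ts.isVariableStatement(node) &&
      node.modifiers?.some((m) => m.kind === ts.SyntaxKind.ExportKeyword)
    ) {
      exportedStories += node.declarationList.declarations.length;
    }
    // Any `play:` property suggests an interaction/component test was generated.
    if (ts.isPropertyAssignment(node) && node.name.getText(file) === 'play') {
      hasPlayFunction = true;
    }
    ts.forEachChild(node, visit);
  };
  visit(file);

  return { exportedStories, hasPlayFunction };
}
```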

Related issues that may be resolved by this project

  • issue1
  • issue2
