
[Tracking]: Storybook MCP Dev validation #97

@valentinpalkovic

Description

Problem statement

The Dev Toolset in @storybook/addon-mcp is an experimental MCP server released in a previous cycle to help agents auto-generate and link Storybook stories for UI development. Despite a lack of thorough validation, the package has become the de facto public face of our MCP efforts, growing rapidly to ~50k weekly npm downloads. This leaves us with:

  1. Unknown Quality: We lack concrete data to confirm that agent-generated stories are correct, functional, or adhere to user patterns (CSF3, Next.js). The quality of the core workflow is unverified at scale.
  2. Reputational Risk: A widely used prototype that delivers a poor or inconsistent experience actively damages user trust in Storybook's AI strategy.
  3. Black Box: We are blind to the major points of failure (edge cases, agent compatibility, token cost) in the most-used MCP component.

Milestones

M1: Eval Framework Expansion and Benchmarking

Owner: @valentinpalkovic
Complete by: @date

  • Adapt the existing Docs eval (cost, duration, turns) to track metrics specific to story generation quality: successful Storybook build, successful story render, successful component test (if a play function is autogenerated)
  • Add coverage as one of the dev metrics
  • Add copilot-cli as an additional agent
  • Define Benchmarks: must cover high-risk, real-world complexity across different scenarios (one way to model this matrix and the new metrics is sketched after this list):
    • Write component and story (from scratch)
    • Write only story (component already exists)
    • Modify components that have stories
    • Modify existing story
    • Simple Atoms (Button)
    • Composite Components (Card with sub-components)
    • Components requiring Mocking (MSW, module mocks)
    • Components requiring callbacks or play functions (user interaction)
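
A minimal sketch of how the expanded metrics and the benchmark matrix could be modeled; the type names and shapes below are assumptions for illustration, not the existing eval framework's API.

```ts
// Sketch only: type names and shapes are assumptions, not the existing Docs eval API.

/** Existing Docs eval metrics plus the new story-generation quality checks. */
interface DevEvalResult {
  agent: 'claude-code' | 'cursor' | 'copilot-cli';
  scenario: BenchmarkScenario;
  // Carried over from the Docs eval
  costUsd: number;
  durationMs: number;
  turns: number;
  // New dev-specific metrics
  storybookBuildSucceeded: boolean;
  storyRenderSucceeded: boolean;
  componentTestPassed?: boolean; // only when a play function was autogenerated
  coverage?: number;             // 0..1, if coverage is collected for the run
}

/** Benchmark matrix: task type x component complexity. */
type TaskType =
  | 'write-component-and-story'     // from scratch
  | 'write-story-only'              // component already exists
  | 'modify-component-with-stories'
  | 'modify-existing-story';

type ComponentComplexity =
  | 'simple-atom'           // e.g. Button
  | 'composite'             // e.g. Card with sub-components
  | 'requires-mocking'      // MSW, module mocks
  | 'requires-interaction'; // callbacks / play functions

interface BenchmarkScenario {
  task: TaskType;
  complexity: ComponentComplexity;
  fixtureRepo: string; // path to the fixture project the scenario runs against
}
```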

M2: Execute & Analyze Benchmarks

Owner: @valentinpalkovic
Complete by: @date

  • Pin a specific model for Claude Code instead of relying on its default, so runs use a consistent model
  • Run all x scenarios across Claude Code, Cursor, and Copilot CLI (a sketch of this run matrix follows below). Use the Conversation Visualizer to analyze failure modes and agent decision-making.
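
A rough sketch of the M2 run loop, assuming the existing eval script exposes (or can be wrapped in) a single `runEval` entry point; the pinned model ids are placeholders, not recommendations. It reuses the `DevEvalResult` and `BenchmarkScenario` types from the M1 sketch.

```ts
// Sketch only: `runEval` is a placeholder for the real eval entry point, and the
// pinned model ids are placeholders to be replaced with the agreed-upon models.

const AGENTS = ['claude-code', 'cursor', 'copilot-cli'] as const;
type Agent = (typeof AGENTS)[number];

// Pin one model per agent so results stay comparable across runs instead of
// depending on whichever default each CLI resolves to on a given day.
const PINNED_MODELS: Record<Agent, string> = {
  'claude-code': '<pinned-claude-model>',
  cursor: '<pinned-cursor-model>',
  'copilot-cli': '<pinned-copilot-model>',
};

async function runBenchmarkMatrix(
  scenarios: BenchmarkScenario[],
  runEval: (opts: { agent: Agent; model: string; scenario: BenchmarkScenario }) => Promise<DevEvalResult>,
): Promise<DevEvalResult[]> {
  const results: DevEvalResult[] = [];
  for (const agent of AGENTS) {
    for (const scenario of scenarios) {
      // Each agent runs the full scenario matrix with its pinned model.
      results.push(await runEval({ agent, model: PINNED_MODELS[agent], scenario }));
    }
  }
  return results;
}
```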

M3: Fixes and Prompt Engineering Optimizations

Owner: @mention
Complete by: @date

  • Implement code fixes based on M2 analysis
  • Refine the story-generation prompts so agents use the Dev toolset effectively

Gap in Evals

  • We don't directly report whether our MCP server was called, or with which parameters
  • Extract, in a clean way, what the MCP server did (actions taken, tools used, parameters passed); a possible report shape is sketched below
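
One possible report shape for closing this gap, assuming tool calls can be extracted from each agent's transcript; the field names are assumptions, not an existing format.

```ts
// Sketch only: field names are assumptions; the actual transcript format differs per agent.

interface McpToolCall {
  tool: string;                    // e.g. the story-generation tool name
  params: Record<string, unknown>; // arguments the agent passed
  isError: boolean;                // whether the server returned an error result
  durationMs?: number;
}

interface McpUsageReport {
  serverCalled: boolean; // was @storybook/addon-mcp invoked at all?
  calls: McpToolCall[];  // every tool invocation, in order
}

/** Collapse the extracted tool calls of one run into a report we can log per eval. */
function summarizeMcpUsage(calls: McpToolCall[]): McpUsageReport {
  return { serverCalled: calls.length > 0, calls };
}
```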

Questions

  • Collaboration of dev and docs tools: How should they work together?
  • Evals for x different scenarios instead of writing a script for real-world repositories: Is this the right approach?
  • Do we need a different Google Sheet for tracking these evals?
  • Adjust the eval script to run multiple evals at once: How should this be implemented?
  • How do we respect an existing Storybook project's conventions?

========= Everything below this line is strictly nice-to-have =========

Open questions

  • Where should we document the Dev workflow, setup and known limitations?

Nice-to-haves

  • Research automated ways to score "story quality" beyond just rendering success (AST diffing, static analysis, functional metrics); a static-analysis sketch follows this list
  • Add runtime-evaluation MCP tools that give LLMs real-time information about a rendered story, so the LLM gains self-healing capabilities in case of preview or play-function errors, accessibility violations, etc.
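
As a starting point for the static-analysis idea, a sketch using the TypeScript compiler API to pull two cheap quality signals out of a generated CSF3 file; the heuristics (counting exported consts, spotting a `play` property) are assumptions and would need hardening against real-world story files.

```ts
// Sketch only: rough heuristics for static "story quality" signals in a CSF3 file.
import * as ts from 'typescript';

interface StoryQualitySignals {
  exportedStories: number;  // named `export const` declarations (stories in CSF3)
  hasPlayFunction: boolean; // at least one `play:` property, i.e. an interaction test
}

function analyzeStoryFile(source: string): StoryQualitySignals {
  const file = ts.createSourceFile('Component.stories.tsx', source, ts.ScriptTarget.Latest, true);
  let exportedStories = 0;
  let hasPlayFunction = false;

  const visit = (node: ts.Node): void => {
    // CSF3: each named `export const Primary = { ... }` is a story.
    if (
      ts.isVariableStatement(node) &&
      node.modifiers?.some((m) => m.kind === ts.SyntaxKind.ExportKeyword)
    ) {
      exportedStories += node.declarationList.declarations.length;
    }
    // Any `play:` property suggests an interaction/component test was generated.
    if (ts.isPropertyAssignment(node) && node.name.getText(file) === 'play') {
      hasPlayFunction = true;
    }
    ts.forEachChild(node, visit);
  };
  visit(file);

  return { exportedStories, hasPlayFunction };
}
```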

Related issues that may be resolved by this project

  • issue1
  • issue2
