Problem statement
The Dev Toolset in @storybook/addon-mcp is an experimental MCP server released in a previous cycle to help agents auto-generate and link Storybook stories for UI development. Despite a lack of thorough validation, the package has become the de facto public face of our MCP efforts, growing rapidly to ~50k weekly npm downloads. This leaves us with:
- Unknown Quality: We lack concrete data to confirm that agent-generated stories are correct, functional, or adhere to users' existing patterns (CSF3/Next). The quality of the core workflow is unverified at scale.
- Reputational Risk: A widely used prototype that delivers a poor or inconsistent experience actively damages user trust in Storybook's AI strategy.
- Black Box: We are blind to the major points of failure (edge cases, agent compatibility, token cost) in the most-used MCP component.
Milestones
M1: Eval Framework Expansion and Benchmarking
Owner: @valentinpalkovic
Complete by: @date
- Adapt the existing Docs eval (cost, duration, turns) to track metrics specific to story-generation quality: successful Storybook build, successful story render, and a passing component test when a play function is auto-generated (see the metrics sketch after this list)
- Add coverage as one of the dev metrics
- Add copilot-cli as an additional agent
- Define Benchmarks: Must cover high-risk, real-world complexity across different scenarios:
- Write component and story (from scratch)
- Write only story (component already exists)
- Modify components that have stories
- Modify existing story
- Simple Atoms (Button)
- Composite Components (Card with sub-components)
- Components requiring Mocking (MSW, module mocks)
- Components requiring callbacks or play functions (user interaction)
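To make the build/render/test metrics concrete, here is a minimal sketch of what a per-scenario record and its checks could look like. The field names and shell commands are illustrative assumptions for this issue, not the existing eval framework's API:

```ts
// Sketch of per-scenario metrics, extending the existing Docs eval fields
// (cost, duration, turns) with story-generation quality signals.
import { execSync } from "node:child_process";

interface StoryGenMetrics {
  scenario: string;
  agent: "claude-code" | "cursor" | "copilot-cli";
  costUsd: number;
  durationMs: number;
  turns: number;
  buildSucceeded: boolean;   // `storybook build` exits 0
  storyRenders: boolean;     // test-runner render/smoke pass
  playTestPasses?: boolean;  // only when a play function was generated
  coveragePercent?: number;  // dev metric added in M1
}

// Returns true when the command exits 0 in the given project directory.
function check(cmd: string, cwd: string): boolean {
  try {
    execSync(cmd, { cwd, stdio: "ignore" });
    return true;
  } catch {
    return false;
  }
}

// Assumed commands; real projects may wire the build and test runner differently.
function collectBuildAndTestMetrics(projectDir: string) {
  const buildSucceeded = check("npx storybook build", projectDir);
  const storyRenders = buildSucceeded && check("npx test-storybook", projectDir);
  return { buildSucceeded, storyRenders };
}
```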
M2: Execute & Analyze Benchmarks
Owner: @valentinpalkovic
Complete by: @date
- Use a predefined model for Claude Code instead of its default, so that runs use a deterministic model
- Run all x scenarios across Claude Code, Cursor, and Copilot CLI (a run-matrix sketch follows below). Use the Conversation Visualizer to analyze failure modes and agent decision-making.
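As a rough illustration of the run matrix, the sketch below pins a model per agent and expands scenarios × agents into individual runs. The model IDs and scenario names are placeholders, not a committed list:

```ts
// Hypothetical run-matrix config for M2: every scenario runs against every
// agent with an explicitly pinned model so results are reproducible.
interface AgentConfig {
  name: "claude-code" | "cursor" | "copilot-cli";
  model: string; // pinned instead of the agent's default
}

const agents: AgentConfig[] = [
  { name: "claude-code", model: "<pinned-anthropic-model>" },
  { name: "cursor", model: "<pinned-model>" },
  { name: "copilot-cli", model: "<pinned-model>" },
];

const scenarios = [
  "write-component-and-story",
  "write-story-only",
  "modify-component-with-story",
  "modify-existing-story",
] as const;

// Cross product = all eval runs to execute and then inspect
// with the Conversation Visualizer.
const runs = scenarios.flatMap((scenario) =>
  agents.map((agent) => ({ scenario, agent })),
);
```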
M3: Fixes and Prompt Engineering Optimizations
Owner: @mention
Complete by: @date
- Implement code fixes based on M2 analysis
- Refine the story-generation prompts so the Dev Toolset is used effectively
Gap in Evals
- We aren't directly reporting whether our MCP server was called, or with which parameters
- Extract cleanly what the MCP server did (actions taken, tools used, parameters passed); a possible report shape is sketched below
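One way to close this gap, assuming the harness records a flat list of tool-call events (an assumed shape, not an existing format), is a small summarizer like this:

```ts
// Sketch of extracting MCP activity from an eval transcript.
interface ToolCallEvent {
  tool: string;                    // e.g. a tool exposed by the Dev Toolset
  params: Record<string, unknown>; // arguments the agent passed
  isError?: boolean;
}

interface McpUsageReport {
  mcpWasCalled: boolean;
  calls: { tool: string; params: Record<string, unknown> }[];
  errorCount: number;
}

function summarizeMcpUsage(
  events: ToolCallEvent[],
  mcpToolNames: Set<string>, // tool names registered by @storybook/addon-mcp
): McpUsageReport {
  const mcpCalls = events.filter((e) => mcpToolNames.has(e.tool));
  return {
    mcpWasCalled: mcpCalls.length > 0,
    calls: mcpCalls.map(({ tool, params }) => ({ tool, params })),
    errorCount: mcpCalls.filter((e) => e.isError).length,
  };
}
```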
Questions
- Collaboration of dev and docs tools: How should they work together?
- Evals for x different scenarios instead of writing a script for real-world repositories: Is this the right approach?
- Do we need a different Google Sheet for tracking these evals?
- Adjust the eval script to run multiple evals at once: How should this be implemented? (One possible approach is sketched after this list.)
- How do we respect a project's existing Storybook conventions?
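On running multiple evals at once, one possible shape is a small concurrency-limited batch runner. The `runEval` entry point and the concurrency default below are assumptions, not the current script's API:

```ts
// Concurrency-limited batch runner: a fixed number of workers drain a shared queue.
type EvalRun = { scenario: string; agent: string };
type EvalResult = { run: EvalRun; ok: boolean };

// Placeholder for the real eval entry point; replace with the actual script's API.
async function runEval(run: EvalRun): Promise<EvalResult> {
  return { run, ok: true };
}

async function runAll(runs: EvalRun[], concurrency = 3): Promise<EvalResult[]> {
  const results: EvalResult[] = [];
  const queue = [...runs];
  const workers = Array.from({ length: concurrency }, async () => {
    while (queue.length > 0) {
      const run = queue.shift()!;
      results.push(await runEval(run));
    }
  });
  await Promise.all(workers);
  return results;
}
```

Capping concurrency keeps parallel agent sessions from skewing cost and duration metrics too much.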
========= Everything below this line is strictly nice-to-have =========
Open questions
- Where should we document the Dev workflow, setup and known limitations?
Nice-to-haves
- Research automated ways to score "story quality" beyond just rendering success (AST diffing, static analysis, functional metrics); a static-analysis sketch follows below
- Add runtime evaluation MCP tools that give LLMs real-time information about a rendered story, so the LLM gains self-healing capabilities for preview/play-function errors, accessibility violations, and so on
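As a starting point for scoring beyond rendering success, a static-analysis heuristic over the story file's AST could count named story exports and detect play functions. This is only a sketch of one possible signal, not an agreed-upon quality metric:

```ts
// Rough static "story quality" signal using the TypeScript compiler API.
import * as ts from "typescript";

function analyzeStoryFile(source: string) {
  const sf = ts.createSourceFile(
    "Component.stories.tsx",
    source,
    ts.ScriptTarget.Latest,
    true,
  );
  let namedStoryExports = 0;
  let hasPlayFunction = false;

  const visit = (node: ts.Node) => {
    // CSF3 stories are exported const declarations, e.g. `export const Primary = { ... }`.
    if (
      ts.isVariableStatement(node) &&
      node.modifiers?.some((m) => m.kind === ts.SyntaxKind.ExportKeyword)
    ) {
      namedStoryExports += node.declarationList.declarations.length;
    }
    // Any `play` property anywhere in the file is treated as an interaction test.
    if (ts.isPropertyAssignment(node) && node.name.getText(sf) === "play") {
      hasPlayFunction = true;
    }
    ts.forEachChild(node, visit);
  };
  visit(sf);

  return { namedStoryExports, hasPlayFunction };
}
```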
Related issues that may be resolved by this project
- issue1
- issue2