Conversation

@valentinpalkovic (Contributor)
Running a dev-context eval now yields a full coverage report alongside the existing build/test/typecheck outputs. A new Coverage page renders totals and syntax-highlighted, expandable per-file views with line/branch indicators, and the CLI now reports coverage percentages.

Additionally, the test flow is now decomposed into focused helpers (run tests, parse results, write artifacts, compute coverage).

[Screenshots attached: four captures of the new Coverage page, 2025-12-11, 21:38–21:50]

Copilot AI review requested due to automatic review settings December 11, 2025 20:54
@changeset-bot bot commented Dec 11, 2025

⚠️ No Changeset found

Latest commit: 1d75460

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

💥 An error occurred when fetching the changed packages and changesets in this PR
Some errors occurred when validating the changesets config:
The package or glob expression "@storybook/mcp-eval*" is specified in the `ignore` option but it is not found in the project. You may have misspelled the package name or provided an invalid glob expression. Note that glob expressions must be defined according to https://www.npmjs.com/package/micromatch.

@pkg-pr-new bot commented Dec 11, 2025

npm i https://pkg.pr.new/storybookjs/mcp/@storybook/addon-mcp@110
npm i https://pkg.pr.new/storybookjs/mcp/@storybook/mcp@110

commit: 1d75460

@codecov bot commented Dec 11, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87.82%. Comparing base (c2c7920) to head (1d75460).
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #110   +/-   ##
=======================================
  Coverage   87.82%   87.82%           
=======================================
  Files          19       19           
  Lines         427      427           
  Branches      122      122           
=======================================
  Hits          375      375           
  Misses          8        8           
  Partials       44       44           

☔ View full report in Codecov by Sentry.

@valentinpalkovic force-pushed the valentin/implementing-coverage-for-dev-evals branch from c432181 to 421035d on December 11, 2025 20:58
Copilot AI left a comment

Pull request overview

This PR implements comprehensive coverage reporting for dev evaluations, adding full end-to-end coverage tracking with visual presentation. The implementation decomposes the test flow into focused, reusable helpers (run tests, parse results, write artifacts, compute coverage), adds a new Coverage UI page with syntax-highlighted, expandable per-file views that include line/branch indicators, and reports coverage percentages in the CLI.

Key changes:

  • Added istanbul-lib-coverage integration for coverage computation
  • Created modular test pipeline with separate concerns (run, parse, write, coverage)
  • Added interactive Coverage UI component with syntax highlighting via react-syntax-highlighter
  • Coverage metrics are conditionally collected only for dev evaluations
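
The conditional collection described in the last bullet could be sketched as below. The `isDevEvaluation` helper and script names come from the file summary in this review, but the context shape is an assumption, not the PR's actual code:

```typescript
// Sketch only: the real evaluation context shape in the PR is not shown here.
type EvaluationContext = { kind: 'dev' | 'build' };

// Mirrors the eval/lib/context-utils.ts utility described in the review.
function isDevEvaluation(context: EvaluationContext): boolean {
  return context.kind === 'dev';
}

// Dev evaluations use the coverage-collecting script; others skip the overhead.
function pickTestScript(context: EvaluationContext): string {
  return isDevEvaluation(context) ? 'eval:test:coverage' : 'eval:test';
}
```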

Reviewed changes

Copilot reviewed 18 out of 20 changed files in this pull request and generated 10 comments.

Summary per file:

  • eval/lib/evaluations/coverage.ts: New module to compute coverage from istanbul data, extract per-file metrics, and write coverage artifacts
  • eval/lib/evaluations/parse-tests.ts: Extracted test result parsing logic from test-stories.ts into a focused helper
  • eval/lib/evaluations/run-tests.ts: Extracted test execution logic into a reusable function
  • eval/lib/evaluations/write-story-artifacts.ts: Extracted artifact writing logic for story results and a11y violations
  • eval/lib/evaluations/result-types.ts: New shared types for test results, coverage summary, and a11y violations
  • eval/lib/evaluations/test-stories.ts: Refactored to orchestrate the new modular helpers with conditional coverage collection
  • eval/lib/context-utils.ts: New utility to identify dev evaluations
  • eval/templates/result-docs/coverage.tsx: New React component for rendering coverage with syntax highlighting and interactive file expansion
  • eval/templates/result-docs/summary.tsx: Updated to display coverage metrics in the summary UI
  • eval/templates/evaluation/results/coverage.mdx: New MDX page to render coverage in Storybook docs
  • eval/templates/project/package.json: Added eval:test:coverage script for coverage collection
  • eval/types/istanbul-lib-coverage.d.ts: Type definitions for the istanbul-lib-coverage API
  • eval/types.ts: Added coverage field to EvaluationSummary type
  • eval/tsconfig.json: Added types directory to includes
  • eval/package.json: Added istanbul-lib-coverage and react-syntax-highlighter dependencies
  • eval/eval.ts: Added CLI output for coverage metrics
  • eval/lib/save/google-sheet.ts: Added coverageLines field to sheets data
  • eval/README.md: Updated documentation to reflect the new coverage metric
  • pnpm-lock.yaml: Lockfile updates for new dependencies
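
For orientation, the kind of totals coverage.ts derives can be sketched against istanbul's coverage-final.json shape, where each file's `s` map records hit counts per statement id. This is a hand-rolled sketch under that assumption; the PR itself uses istanbul-lib-coverage for this:

```typescript
// Sketch assuming istanbul's JSON shape; the PR delegates to istanbul-lib-coverage.
type FileCov = { s: Record<string, number> }; // statement id -> hit count

// Percentage of statements with at least one recorded hit, across all files.
function statementPct(files: Record<string, FileCov>): number {
  let covered = 0;
  let total = 0;
  for (const file of Object.values(files)) {
    for (const hits of Object.values(file.s)) {
      total += 1;
      if (hits > 0) covered += 1;
    }
  }
  return total === 0 ? 0 : (covered / total) * 100;
}
```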
Files not reviewed (1)
  • pnpm-lock.yaml: Language not supported
Comments suppressed due to low confidence (1)

eval/lib/evaluations/coverage.ts:87

  • This use of variable 'normalizedTotal' always evaluates to true.
		if (normalizedTotal) {

cwd: projectPath,
},
});
await runTests({ projectPath, resultsPath } as ExperimentArgs, testScript);
Copilot AI commented Dec 11, 2025

The cast to ExperimentArgs is potentially unsafe as it only passes projectPath and resultsPath but the ExperimentArgs type requires additional fields (experimentPath, evalPath, verbose, hooks, uploadId, evalName, context, agent). Consider either creating a minimal type for runTests parameters or ensuring all required fields are passed.
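
One way to act on this suggestion is to narrow the parameter type with `Pick`, so the call site needs no cast. The field list and body here are illustrative, based only on the names mentioned in the comment above:

```typescript
// Illustrative shape; the real ExperimentArgs in the PR has more fields.
interface ExperimentArgs {
  projectPath: string;
  resultsPath: string;
  experimentPath: string;
  evalPath: string;
  verbose: boolean;
}

// runTests only needs two fields, so accept exactly those.
type RunTestsArgs = Pick<ExperimentArgs, 'projectPath' | 'resultsPath'>;

function runTests(args: RunTestsArgs, testScript: string): string {
  // Placeholder body; the real helper spawns the test script in args.projectPath.
  return `${testScript} @ ${args.projectPath} -> ${args.resultsPath}`;
}

// No `as ExperimentArgs` cast needed at the call site:
const summary = runTests({ projectPath: '/proj', resultsPath: '/results' }, 'eval:test');
```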

Copilot AI review requested due to automatic review settings December 11, 2025 21:05
Copilot AI left a comment

Pull request overview

Copilot reviewed 18 out of 20 changed files in this pull request and generated 6 comments.

Files not reviewed (1)
  • pnpm-lock.yaml: Language not supported
Comments suppressed due to low confidence (1)

eval/lib/evaluations/coverage.ts:87

  • This use of variable 'normalizedTotal' always evaluates to true.
		coverage = {

Comment on lines +34 to +35
storyAssertions[storyId] = {
status: assertionResult.status,
Copilot AI commented Dec 11, 2025

The storyAssertions object will only keep the status of the last assertion for each storyId if multiple assertions have the same storyId. This could lead to incorrect test counts if a story has multiple test assertions. Consider whether this is the intended behavior, or if you should aggregate statuses (e.g., mark as failed if any assertion fails) or track all assertions per story.

Suggested change

    // Before:
    storyAssertions[storyId] = {
      status: assertionResult.status,

    // After: aggregate statuses; if any assertion fails, mark as failed,
    // otherwise keep the worst status seen so far.
    const prevStatus = storyAssertions[storyId]?.status;
    let newStatus = assertionResult.status;
    if (prevStatus) {
      // Priority: failed > todo > skipped > passed
      const statusPriority = { failed: 3, todo: 2, skipped: 1, passed: 0 };
      const prevPriority = statusPriority[prevStatus] ?? -1;
      const newPriority = statusPriority[newStatus] ?? -1;
      newStatus = prevPriority > newPriority ? prevStatus : newStatus;
    }
    storyAssertions[storyId] = {
      status: newStatus,

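The aggregation in that suggestion can also be factored into a standalone helper; a sketch using the same failed > todo > skipped > passed priority:

```typescript
type TestStatus = 'failed' | 'todo' | 'skipped' | 'passed';

// Same priority order as the suggestion above: failed > todo > skipped > passed.
const statusPriority: Record<TestStatus, number> = {
  failed: 3,
  todo: 2,
  skipped: 1,
  passed: 0,
};

// Keep the "worst" of two statuses, so any failing assertion marks the story failed.
function worstStatus(prev: TestStatus | undefined, next: TestStatus): TestStatus {
  if (prev === undefined) return next;
  return statusPriority[prev] > statusPriority[next] ? prev : next;
}

// Folding all of one story's assertion statuses down to a single status:
const aggregated = (['passed', 'skipped', 'failed'] as TestStatus[])
  .reduce<TestStatus | undefined>(worstStatus, undefined);
```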
Comment on lines +27 to +29
const coverageData = JSON.parse(
await fs.readFile(finalCoveragePath, 'utf8'),
);
Contributor

I prefer the modern dynamic import of JSON files now supported in Node over readFile+parse. It gives nicer errors and provides type safety when the path is static; for dynamic paths it's just about being consistent. There are a bunch of instances of this already in the code base that you can find, but it's something like this, if my memory is correct:

Suggested change
const coverageData = JSON.parse(
await fs.readFile(finalCoveragePath, 'utf8'),
);
const { default: coverageData } = await import(finalCoveragePath, { with: { type: 'json' } });

"eval:lint": "eslint .",
"preview": "vite preview",
"eval:test": "vitest run --reporter json --outputFile ../results/tests.json",
"eval:test:coverage": "vitest run --coverage --reporter json --outputFile ../results/tests.json",
Contributor

nit

Suggested change
"eval:test:coverage": "vitest run --coverage --reporter json --outputFile ../results/tests.json",
"eval:test:coverage": "pnpm run eval:test --coverage",

Comment on lines +151 to +155
p.log.message(
cov
? `📊 Coverage: lines ${formatCov(cov.lines)}, statements ${formatCov(cov.statements)}, branches ${formatCov(cov.branches)}, functions ${formatCov(cov.functions)}`
: '📊 Coverage: (not collected)',
);
Contributor

I think the purpose of these log lines is to give a very quick overview of the result, and this is a bit too verbose for that right now.

I'd suggest something that outputs:

Coverage: ✅ 96 %
OR
Coverage: ⚠️ 80 %
OR
Coverage: ❌ 60 %

With some high and low watermarks, like: ❌ < 70 % < ⚠️ < 90 % < ✅
(I'm not married to those numbers at all)

I would probably also just not show "(not collected)" at all in that scenario, because as a user it makes me think I configured something wrong - maybe?
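
The watermark idea sketches out to something like this; the 70/90 thresholds are the ones floated above and purely illustrative:

```typescript
// Hypothetical badge formatter; 70/90 watermarks are the reviewer's strawman numbers.
function coverageBadge(pct: number): string {
  const icon = pct >= 90 ? '✅' : pct >= 70 ? '⚠️' : '❌';
  return `${icon} ${Math.round(pct)} %`;
}

// e.g. p.log.message(`Coverage: ${coverageBadge(96)}`)
```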
