Implemented end-to-end coverage reporting for dev evals #110
base: main
Conversation
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

```
@@           Coverage Diff           @@
##             main     #110   +/-   ##
=======================================
  Coverage   87.82%   87.82%
=======================================
  Files          19       19
  Lines         427      427
  Branches      122      122
=======================================
  Hits          375      375
  Misses          8        8
  Partials       44       44
```

☔ View full report in Codecov by Sentry.
Force-pushed from c432181 to 421035d
Pull request overview
This PR implements comprehensive coverage reporting for dev evaluations, adding full end-to-end coverage tracking with visual presentation. The implementation decomposes the test flow into focused, reusable helpers (run tests, parse results, write artifacts, compute coverage), adds a new Coverage UI page with syntax-highlighted, expandable per-file views with line/branch indicators, and includes CLI reporting of coverage percentages.
Key changes:
- Added istanbul-lib-coverage integration for coverage computation
- Created modular test pipeline with separate concerns (run, parse, write, coverage); see the sketch after this list
- Added interactive Coverage UI component with syntax highlighting via react-syntax-highlighter
- Coverage metrics are conditionally collected only for dev evaluations
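
As a rough illustration of how these pieces fit together, here is a hypothetical sketch of the orchestration in test-stories.ts; the module paths come from the file table below, but the function names, signatures, and ExperimentArgs fields are assumptions rather than the PR's actual code:

```ts
// Hypothetical sketch only: names and signatures are assumed, not taken from the PR.
import { runTests } from './run-tests';
import { parseTestResults } from './parse-tests';
import { writeStoryArtifacts } from './write-story-artifacts';
import { computeCoverage } from './coverage';
import { isDevEvaluation } from '../context-utils';
import type { ExperimentArgs } from '../../types';

export async function testStories(args: ExperimentArgs): Promise<void> {
  // Coverage is only collected for dev evaluations.
  const collectCoverage = isDevEvaluation(args.context);
  const testScript = collectCoverage ? 'eval:test:coverage' : 'eval:test';

  await runTests(args, testScript);             // run the vitest suite
  const results = await parseTestResults(args); // parse results/tests.json
  await writeStoryArtifacts(args, results);     // per-story results + a11y violations

  if (collectCoverage) {
    await computeCoverage(args);                // compute and write coverage artifacts
  }
}
```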
Reviewed changes
Copilot reviewed 18 out of 20 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| eval/lib/evaluations/coverage.ts | New module to compute coverage from istanbul data, extract per-file metrics, and write coverage artifacts |
| eval/lib/evaluations/parse-tests.ts | Extracted test result parsing logic from test-stories.ts into a focused helper |
| eval/lib/evaluations/run-tests.ts | Extracted test execution logic into a reusable function |
| eval/lib/evaluations/write-story-artifacts.ts | Extracted artifact writing logic for story results and a11y violations |
| eval/lib/evaluations/result-types.ts | New shared types for test results, coverage summary, and a11y violations |
| eval/lib/evaluations/test-stories.ts | Refactored to orchestrate the new modular helpers with conditional coverage collection |
| eval/lib/context-utils.ts | New utility to identify dev evaluations |
| eval/templates/result-docs/coverage.tsx | New React component for rendering coverage with syntax highlighting and interactive file expansion |
| eval/templates/result-docs/summary.tsx | Updated to display coverage metrics in the summary UI |
| eval/templates/evaluation/results/coverage.mdx | New MDX page to render coverage in Storybook docs |
| eval/templates/project/package.json | Added eval:test:coverage script for coverage collection |
| eval/types/istanbul-lib-coverage.d.ts | Type definitions for istanbul-lib-coverage API |
| eval/types.ts | Added coverage field to EvaluationSummary type |
| eval/tsconfig.json | Added types directory to includes |
| eval/package.json | Added istanbul-lib-coverage and react-syntax-highlighter dependencies |
| eval/eval.ts | Added CLI output for coverage metrics |
| eval/lib/save/google-sheet.ts | Added coverageLines field to sheets data |
| eval/README.md | Updated documentation to reflect new coverage metric |
| pnpm-lock.yaml | Lockfile updates for new dependencies |
Files not reviewed (1)
- pnpm-lock.yaml: Language not supported
Comments suppressed due to low confidence (1)
eval/lib/evaluations/coverage.ts:87
- This use of variable 'normalizedTotal' always evaluates to true.
```ts
if (normalizedTotal) {
```
```ts
    cwd: projectPath,
  },
});
await runTests({ projectPath, resultsPath } as ExperimentArgs, testScript);
```
Copilot AI commented on Dec 11, 2025
The cast to ExperimentArgs is potentially unsafe as it only passes projectPath and resultsPath but the ExperimentArgs type requires additional fields (experimentPath, evalPath, verbose, hooks, uploadId, evalName, context, agent). Consider either creating a minimal type for runTests parameters or ensuring all required fields are passed.
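
One way to address this, sketched under the assumption that runTests only reads projectPath and resultsPath (the Pick-based RunTestsArgs type is hypothetical, not part of the PR):

```ts
// Hypothetical narrowed parameter type: declare only what runTests actually needs,
// so call sites don't have to cast a partial object to the full ExperimentArgs.
type RunTestsArgs = Pick<ExperimentArgs, 'projectPath' | 'resultsPath'>;

export async function runTests(
  { projectPath, resultsPath }: RunTestsArgs,
  testScript: string,
): Promise<void> {
  // ... spawn `testScript` with cwd set to projectPath, writing under resultsPath
}

// The call site then type-checks without the unsafe `as ExperimentArgs` cast:
await runTests({ projectPath, resultsPath }, testScript);
```

Since ExperimentArgs is structurally a superset of RunTestsArgs, callers that already hold a full ExperimentArgs can still pass it directly.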
Co-authored-by: Copilot <[email protected]>
Pull request overview
Copilot reviewed 18 out of 20 changed files in this pull request and generated 6 comments.
Files not reviewed (1)
- pnpm-lock.yaml: Language not supported
Comments suppressed due to low confidence (1)
eval/lib/evaluations/coverage.ts:87
- This use of variable 'normalizedTotal' always evaluates to true.
```ts
coverage = {
```
```ts
storyAssertions[storyId] = {
  status: assertionResult.status,
```
Copilot AI commented on Dec 11, 2025
The storyAssertions object will only keep the status of the last assertion for each storyId if multiple assertions have the same storyId. This could lead to incorrect test counts if a story has multiple test assertions. Consider whether this is the intended behavior, or if you should aggregate statuses (e.g., mark as failed if any assertion fails) or track all assertions per story.
Suggested change:

```diff
- storyAssertions[storyId] = {
-   status: assertionResult.status,
+ // Aggregate statuses: if any assertion fails, mark as failed; otherwise, keep the worst status
+ const prevStatus = storyAssertions[storyId]?.status;
+ let newStatus = assertionResult.status;
+ if (prevStatus) {
+   // Priority: failed > todo > skipped > passed
+   const statusPriority = { failed: 3, todo: 2, skipped: 1, passed: 0 };
+   const prevPriority = statusPriority[prevStatus] ?? -1;
+   const newPriority = statusPriority[newStatus] ?? -1;
+   newStatus = prevPriority > newPriority ? prevStatus : newStatus;
+ }
+ storyAssertions[storyId] = {
+   status: newStatus,
```
```ts
const coverageData = JSON.parse(
  await fs.readFile(finalCoveragePath, 'utf8'),
);
```
I prefer the modern dynamic import of JSON files now supported in Node over readFile+parse. It gives nicer errors and provides type safety when the path is static; with dynamic paths it's just about being consistent. There are a bunch of instances of this already in the code base that you can find, but it's something like this, if my memory is correct:
Suggested change:

```diff
- const coverageData = JSON.parse(
-   await fs.readFile(finalCoveragePath, 'utf8'),
- );
+ const { default: coverageData } = await import(finalCoveragePath, { with: { type: 'json' } });
```
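
(One caveat if this route is taken: dynamic import() resolves module specifiers rather than plain filesystem paths, so an absolute path like finalCoveragePath may need to be converted with url.pathToFileURL(...).href first, especially on Windows.)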
| "eval:lint": "eslint .", | ||
| "preview": "vite preview", | ||
| "eval:test": "vitest run --reporter json --outputFile ../results/tests.json", | ||
| "eval:test:coverage": "vitest run --coverage --reporter json --outputFile ../results/tests.json", |
nit
| "eval:test:coverage": "vitest run --coverage --reporter json --outputFile ../results/tests.json", | |
| "eval:test:coverage": "pnpm run eval:test --coverage", |
```ts
p.log.message(
  cov
    ? `📊 Coverage: lines ${formatCov(cov.lines)}, statements ${formatCov(cov.statements)}, branches ${formatCov(cov.branches)}, functions ${formatCov(cov.functions)}`
    : '📊 Coverage: (not collected)',
);
```
I think the purpose of these log lines is to give a very quick overview of the result, and this is a bit too verbose for that right now.
I'd suggest something that outputs:
Coverage: ✅ 96 %
OR
Coverage: ⚠️ 80 %
OR
Coverage: ❌ 60 %
With some high and low watermarks, like: ❌ < 70 % <
(I'm not married to those numbers at all)
I would probably also just not show "(not collected)" at all in that scenario, because it makes me think that I, as a user, configured something wrong - maybe?
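
A minimal sketch of that idea, assuming cov.lines carries an istanbul-style pct field, and with purely illustrative watermarks (the reviewer explicitly left the numbers open):

```ts
// Illustrative thresholds only.
const HIGH_WATERMARK = 90;
const LOW_WATERMARK = 70;

function formatCoverageLine(cov?: { lines: { pct: number } }): string | undefined {
  if (!cov) return undefined; // omit the line entirely when coverage wasn't collected
  const pct = Math.round(cov.lines.pct);
  const icon = pct >= HIGH_WATERMARK ? '✅' : pct >= LOW_WATERMARK ? '⚠️' : '❌';
  return `Coverage: ${icon} ${pct} %`;
}

const coverageLine = formatCoverageLine(cov);
if (coverageLine) p.log.message(coverageLine);
```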
Running a dev-context eval now yields a full coverage report alongside the existing build/test/typecheck outputs. A new Coverage page renders totals and syntax-highlighted, expandable per-file views with line/branch indicators, and the CLI now reports coverage percentages.
Additionally, the test flow is now decomposed into focused helpers (run tests, parse results, write artifacts, compute coverage).
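
For reference, the coverage totals described above can be derived from Vitest's istanbul JSON output roughly like this; a sketch of the istanbul-lib-coverage API (assuming the usual @types shape), not necessarily the PR's exact coverage.ts:

```ts
import {
  createCoverageMap,
  createCoverageSummary,
  type CoverageMapData,
} from 'istanbul-lib-coverage';

// `coverageData` is the parsed contents of the coverage JSON that
// `vitest run --coverage` emits (e.g. coverage/coverage-final.json).
export function summarizeCoverage(coverageData: CoverageMapData) {
  const map = createCoverageMap(coverageData);
  const totals = createCoverageSummary();

  for (const file of map.files()) {
    // Each file summary exposes lines/statements/branches/functions, each with a pct field.
    totals.merge(map.fileCoverageFor(file).toSummary());
  }

  return {
    lines: totals.lines.pct,
    statements: totals.statements.pct,
    branches: totals.branches.pct,
    functions: totals.functions.pct,
  };
}
```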