-
Notifications
You must be signed in to change notification settings - Fork 5
Add narrow vs wide parquet performance benchmark tutorial #40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add narrow vs wide parquet performance benchmark tutorial #40
Conversation
Interactive browser-based benchmarks comparing narrow (691MB, 11.6M rows) vs wide (275MB, 2.5M rows) parquet schemas for iSamples OpenContext data. Features: - Three benchmarks: entity counts, site aggregation, material distribution - Multiple runs with median timing for reliability - Environment info display (browser, connection type) - Data validity checks - Technical notes on pitfalls (caching, cold starts, memory limits) This page answers Eric's question about whether the 2-3x speedup seen locally also holds "over the wire" with HTTP range requests via DuckDB-WASM. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds an interactive browser-based performance benchmark tutorial comparing narrow vs wide parquet schema formats for the iSamples property graph data. The tutorial uses DuckDB-WASM to execute queries via HTTP range requests and demonstrates the performance differences between the two schema approaches.
Key changes:
- Interactive benchmarking suite with three different query types (entity counts, site aggregation, material distribution)
- Real-time performance measurement with cold/warm run tracking and median calculation
- Data validity checks to confirm schema equivalence
- Environment information display and comprehensive technical documentation
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Calculate medians | ||
| const median = arr => { | ||
| const sorted = [...arr].sort((a, b) => a - b); | ||
| return sorted[Math.floor(sorted.length / 2)]; |
Copilot
AI
Dec 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The median calculation is incorrect for arrays with an even number of elements. For an array with 3 elements (odd), Math.floor(3/2) = 1 correctly returns the middle element. However, if the number of runs changes to an even number, this implementation would not calculate the true median (average of the two middle values).
Consider using a more robust median calculation:
const median = arr => {
const sorted = [...arr].sort((a, b) => a - b);
const mid = Math.floor(sorted.length / 2);
return sorted.length % 2 === 0
? (sorted[mid - 1] + sorted[mid]) / 2
: sorted[mid];
};| return sorted[Math.floor(sorted.length / 2)]; | |
| const mid = Math.floor(sorted.length / 2); | |
| return sorted.length % 2 === 0 | |
| ? (sorted[mid - 1] + sorted[mid]) / 2 | |
| : sorted[mid]; |
| const median = arr => { | ||
| const sorted = [...arr].sort((a, b) => a - b); | ||
| return sorted[Math.floor(sorted.length / 2)]; | ||
| }; |
Copilot
AI
Dec 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The median calculation is duplicated from Benchmark 1 (lines 236-239). Consider extracting this function to avoid code duplication and improve maintainability. The same issue with even-length arrays also applies here.
| const median = arr => { | |
| const sorted = [...arr].sort((a, b) => a - b); | |
| return sorted[Math.floor(sorted.length / 2)]; | |
| }; | |
| // Median calculation extracted for reuse and correctness | |
| function median(arr) { | |
| // (function defined above) | |
| return (sorted[mid - 1] + sorted[mid]) / 2; | |
| } else { | |
| return sorted[mid]; | |
| } | |
| } |
| const median = arr => { | ||
| const sorted = [...arr].sort((a, b) => a - b); | ||
| return sorted[Math.floor(sorted.length / 2)]; | ||
| }; |
Copilot
AI
Dec 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The median calculation is duplicated from Benchmarks 1 and 2. Consider extracting this function to avoid code duplication and improve maintainability. The same issue with even-length arrays also applies here.
| const median = arr => { | |
| const sorted = [...arr].sort((a, b) => a - b); | |
| return sorted[Math.floor(sorted.length / 2)]; | |
| }; | |
| // Median calculation extracted and fixed for even-length arrays | |
| function median(arr) { | |
| const sorted = [...arr].sort((a, b) => a - b); | |
| const mid = Math.floor(sorted.length / 2); | |
| if (sorted.length % 2 === 0) { | |
| return (sorted[mid - 1] + sorted[mid]) / 2; | |
| } else { | |
| return sorted[mid]; | |
| } | |
| } |
| //| echo: false | ||
| envInfo = { | ||
| const ua = navigator.userAgent; | ||
| const browser = ua.includes('Chrome') ? 'Chrome' : ua.includes('Firefox') ? 'Firefox' : ua.includes('Safari') ? 'Safari' : 'Unknown'; |
Copilot
AI
Dec 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The browser detection logic is overly simplistic and may incorrectly identify browsers. For example, Chrome's user agent string contains "Safari", so the order matters. Consider checking for more specific patterns or using a library for accurate browser detection.
The current logic will work for most cases since Chrome is checked first, but Edge (Chromium-based) would be detected as "Chrome" rather than "Edge", which may not be the desired behavior.
| const browser = ua.includes('Chrome') ? 'Chrome' : ua.includes('Firefox') ? 'Firefox' : ua.includes('Safari') ? 'Safari' : 'Unknown'; | |
| // Improved browser detection logic | |
| let browser = 'Unknown'; | |
| if (ua.includes('Edg/')) { | |
| browser = 'Edge'; | |
| } else if (ua.includes('Chrome/') && !ua.includes('Edg/')) { | |
| browser = 'Chrome'; | |
| } else if (ua.includes('Firefox/')) { | |
| browser = 'Firefox'; | |
| } else if (ua.includes('Safari/') && !ua.includes('Chrome/') && !ua.includes('Edg/')) { | |
| browser = 'Safari'; | |
| } |
| wideCold: wideTimes[0], | ||
| wideMedian: wideMedian, | ||
| wideAll: wideTimes, | ||
| speedup: narrowMedian / wideMedian |
Copilot
AI
Dec 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The speedup calculation narrowMedian / wideMedian could produce Infinity or NaN if wideMedian is 0 or if either value is undefined. While unlikely in practice, consider adding a guard condition or handling this edge case gracefully in the display logic to prevent confusing output.
| wideCold: wideTimes[0], | ||
| wideMedian: wideMedian, | ||
| wideAll: wideTimes, | ||
| speedup: narrowMedian / wideMedian |
Copilot
AI
Dec 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The speedup calculation narrowMedian / wideMedian could produce Infinity or NaN if wideMedian is 0. The same issue exists in Benchmark 1 (line 252).
| speedup: narrowMedian / wideMedian | |
| speedup: wideMedian === 0 ? null : narrowMedian / wideMedian |
| wideCold: wideTimes[0], | ||
| wideMedian: wideMedian, | ||
| wideAll: wideTimes, | ||
| speedup: narrowMedian / wideMedian |
Copilot
AI
Dec 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The speedup calculation narrowMedian / wideMedian could produce Infinity or NaN if wideMedian is 0. The same issue exists in Benchmarks 1 and 2.
| speedup: narrowMedian / wideMedian | |
| speedup: wideMedian === 0 ? null : narrowMedian / wideMedian |
Combined Review: Codex + Copilot AnalysisSummaryCopilot gave a positive high-level overview. Codex provided detailed code analysis identifying several methodology and implementation issues worth addressing. Critical Issues Identified by Codex
Nice-to-Have Improvements
Decision NeededThese are valid methodological concerns. Options:
What's the preferred approach? |
Fixes based on Codex code review: - Lazy data loading: DB initialization now gated behind button click - Sequential execution: Benchmarks wait for previous ones to complete - Improved cold/warm methodology: Warm median excludes cold run - Error handling: Try/catch with user-visible error display - Pinned DuckDB version: @0.7.1 instead of @latest - Updated methodology docs to reflect changes 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Fixes Applied (commit 9f7ebc6)Addressed all critical issues from the Codex review: 1. Lazy Data Loading ✅
2. Sequential Benchmark Execution ✅
3. Improved Cold/Warm Methodology ✅
4. Error Handling ✅
5. Pinned DuckDB Version ✅
Updated Methodology DocumentationReady for re-review! |
Addresses remaining issues from Codex code review:
1. Error handling in render blocks - validityCheck and all three
benchmarks now gracefully show error messages instead of crashing
when they return {error}
2. Summary aggregation guards - allResults now filters out failed
benchmarks, shows "N/A" for errors, handles null avgSpeedup
3. Documentation drift - Updated Pitfalls table to clarify we report
warm-run median (excluding cold run)
4. warmMedian calculation - Fixed to average two elements instead of
floor selection which overstated times
All Codex review issues now resolved.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
Third Round: Error Handling FixesCodex reviewed the second round of fixes and found remaining issues with error handling in render blocks. These have now been addressed: Fixed (commit 43896b1):
All Codex Review Issues Resolved ✅The implementation now handles error cases gracefully and the documentation accurately reflects the methodology. |
The tutorial was merged in PR isamplesorg#40 but wasn't added to the sidebar. This adds it to the "Getting Started" section. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
The tutorial was merged in PR #40 but wasn't added to the sidebar. This adds it to the "Getting Started" section. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude <[email protected]>
Summary
Context
This addresses Eric's question about whether the 2-3x speedup seen locally also holds "over the wire" with HTTP range requests via DuckDB-WASM.
Test plan
🤖 Generated with Claude Code