-
Notifications
You must be signed in to change notification settings - Fork 1k
Fix(rust-tokenizer)!: return token vector in tokenize even on failure
#5155
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix(rust-tokenizer)!: return token vector in tokenize even on failure
#5155
Conversation
Benchmark for 634067dClick to view benchmark
|
8577417 to
11a95bc
Compare
Benchmark for fc86a5eClick to view benchmark
|
11a95bc to
f8b0829
Compare
Benchmark for 76b382cClick to view benchmark
|
Parsing BenchmarkBenchmark hidden because not significant (8): parse_sqlglot_tpch, parse_sqlglot_short, parse_sqlglot_long, parse_sqlglot_crazy, parse_sqlglotrs_tpch, parse_sqlglotrs_short, parse_sqlglotrs_long, parse_sqlglotrs_crazy Optimization Benchmark
Benchmark hidden because not significant (3): optimize_tpch, optimize_condition_10, optimize_condition_100 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR addresses issue #5148 by ensuring that tokenization always returns a token vector—even when an error occurs—so that partial token lists can be inspected.
- Updated the Rust tokenizer (tokenizer.rs) to return a tuple with tokens and an optional error message.
- Adjusted the Python binding (tokens.py) to process the new return type and raise TokenError if needed while preserving partial tokens.
- Added a new test (test_tokens.py) to verify that partial token lists are correctly returned on tokenization failure, and updated the CI workflow (.github/workflows/benchmark-sqlglot.yml) for consistent environment setup.
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/test_tokens.py | New test added to assert the behavior of partial token lists on failure. |
| sqlglotrs/src/tokenizer.rs | Modified tokenization to return a tuple (tokens, error_msg) instead of raising immediately. |
| sqlglot/tokens.py | Updated RS tokenizer integration to unpack the tuple and raise TokenError if needed. |
| .github/workflows/benchmark-sqlglot.yml | Adjusted workflow steps to reset and reinstall dependencies for reproducible benchmarking. |
| PyException::new_err(format!("Error tokenizing '{}': {}", e.context, e.message)) | ||
| }) | ||
| let tokenize_result = state.tokenize(); | ||
| match tokenize_result { |
Copilot
AI
Jun 2, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider adding a comment that explains the new behavior of returning a tuple (partial tokens and an optional error message) so that API consumers are aware of how to handle tokenization failures.
| return tokens | ||
| except Exception as e: | ||
| raise TokenError(str(e)) | ||
| tokens, error_msg = self._RS_TOKENIZER.tokenize(sql, self._rs_dialect_settings) |
Copilot
AI
Jun 2, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update the RS tokenizer integration documentation or comments to note the new tuple return type, ensuring that downstream users understand that partial tokens are available even after a failure.
| - name: Checkout main branch into subdir | ||
| run: | | ||
| git fetch origin main | ||
| git worktree add main-branch origin/main |
Copilot
AI
Jun 2, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Consider adding a comment explaining the rationale for resetting the virtual environment to aid future maintainers in understanding the workflow steps for reproducible benchmarks.
| git worktree add main-branch origin/main | |
| git worktree add main-branch origin/main | |
| # Reset the virtual environment to ensure a clean and consistent environment | |
| # when switching between branches. This is critical for reproducible benchmarks. |
Fixes #5148