Skip to content

Conversation

@georgesittas
Copy link
Collaborator

Fixes #5148

@github-actions
Copy link
Contributor

github-actions bot commented Jun 2, 2025

Benchmark for 634067d

Click to view benchmark
Test Base PR %
long 211.1±3.00µs 217.8±22.74µs +3.17%

@georgesittas georgesittas force-pushed the jo/always_return_tokens_in_sqlglotrs_tokenize branch from 8577417 to 11a95bc Compare June 2, 2025 14:52
@github-actions
Copy link
Contributor

github-actions bot commented Jun 2, 2025

Benchmark for fc86a5e

Click to view benchmark
Test Base PR %
long 210.0±4.42µs 212.2±1.92µs +1.05%

@georgesittas georgesittas force-pushed the jo/always_return_tokens_in_sqlglotrs_tokenize branch from 11a95bc to f8b0829 Compare June 2, 2025 15:06
@github-actions
Copy link
Contributor

github-actions bot commented Jun 2, 2025

Benchmark for 76b382c

Click to view benchmark
Test Base PR %
long 209.2±4.92µs 214.0±3.68µs +2.29%

@github-actions
Copy link
Contributor

github-actions bot commented Jun 2, 2025

Parsing Benchmark

Benchmark hidden because not significant (8): parse_sqlglot_tpch, parse_sqlglot_short, parse_sqlglot_long, parse_sqlglot_crazy, parse_sqlglotrs_tpch, parse_sqlglotrs_short, parse_sqlglotrs_long, parse_sqlglotrs_crazy


Optimization Benchmark

Benchmark bench_optimize_pr bench_optimize_main
optimize_condition_1000 2.87 sec 2.84 sec: 1.01x faster
Geometric mean (ref) 1.01x faster

Benchmark hidden because not significant (3): optimize_tpch, optimize_condition_10, optimize_condition_100

@benfdking benfdking requested a review from Copilot June 2, 2025 16:56
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR addresses issue #5148 by ensuring that tokenization always returns a token vector—even when an error occurs—so that partial token lists can be inspected.

  • Updated the Rust tokenizer (tokenizer.rs) to return a tuple with tokens and an optional error message.
  • Adjusted the Python binding (tokens.py) to process the new return type and raise TokenError if needed while preserving partial tokens.
  • Added a new test (test_tokens.py) to verify that partial token lists are correctly returned on tokenization failure, and updated the CI workflow (.github/workflows/benchmark-sqlglot.yml) for consistent environment setup.

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
tests/test_tokens.py New test added to assert the behavior of partial token lists on failure.
sqlglotrs/src/tokenizer.rs Modified tokenization to return a tuple (tokens, error_msg) instead of raising immediately.
sqlglot/tokens.py Updated RS tokenizer integration to unpack the tuple and raise TokenError if needed.
.github/workflows/benchmark-sqlglot.yml Adjusted workflow steps to reset and reinstall dependencies for reproducible benchmarking.

PyException::new_err(format!("Error tokenizing '{}': {}", e.context, e.message))
})
let tokenize_result = state.tokenize();
match tokenize_result {
Copy link

Copilot AI Jun 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding a comment that explains the new behavior of returning a tuple (partial tokens and an optional error message) so that API consumers are aware of how to handle tokenization failures.

Copilot uses AI. Check for mistakes.
return tokens
except Exception as e:
raise TokenError(str(e))
tokens, error_msg = self._RS_TOKENIZER.tokenize(sql, self._rs_dialect_settings)
Copy link

Copilot AI Jun 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update the RS tokenizer integration documentation or comments to note the new tuple return type, ensuring that downstream users understand that partial tokens are available even after a failure.

Copilot uses AI. Check for mistakes.
- name: Checkout main branch into subdir
run: |
git fetch origin main
git worktree add main-branch origin/main
Copy link

Copilot AI Jun 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Consider adding a comment explaining the rationale for resetting the virtual environment to aid future maintainers in understanding the workflow steps for reproducible benchmarks.

Suggested change
git worktree add main-branch origin/main
git worktree add main-branch origin/main
# Reset the virtual environment to ensure a clean and consistent environment
# when switching between branches. This is critical for reproducible benchmarks.

Copilot uses AI. Check for mistakes.
@georgesittas georgesittas merged commit f6124c6 into main Jun 2, 2025
8 checks passed
@georgesittas georgesittas deleted the jo/always_return_tokens_in_sqlglotrs_tokenize branch June 2, 2025 17:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Ability for rust tokenizer to also efficiently handle partial text streams

3 participants