Fix(rust-tokenizer)!: return token vector in `tokenize` even on failure #5155

georgesittas · 2025-06-02T14:40:42Z

github-actions · 2025-06-02T14:43:43Z

Benchmark for `634067d`

Click to view benchmark

Test	Base	PR	%
long	211.1±3.00µs	217.8±22.74µs	+3.17%

github-actions · 2025-06-02T14:56:05Z

Benchmark for `fc86a5e`

Click to view benchmark

Test	Base	PR	%
long	210.0±4.42µs	212.2±1.92µs	+1.05%

github-actions · 2025-06-02T15:09:25Z

Benchmark for `76b382c`

Click to view benchmark

Test	Base	PR	%
long	209.2±4.92µs	214.0±3.68µs	+2.29%

github-actions · 2025-06-02T15:18:17Z

Parsing Benchmark

Benchmark hidden because not significant (8): parse_sqlglot_tpch, parse_sqlglot_short, parse_sqlglot_long, parse_sqlglot_crazy, parse_sqlglotrs_tpch, parse_sqlglotrs_short, parse_sqlglotrs_long, parse_sqlglotrs_crazy

Optimization Benchmark

Benchmark	bench_optimize_pr	bench_optimize_main
optimize_condition_1000	2.87 sec	2.84 sec: 1.01x faster
Geometric mean	(ref)	1.01x faster

Benchmark hidden because not significant (3): optimize_tpch, optimize_condition_10, optimize_condition_100

Copilot

Pull Request Overview

This PR addresses issue #5148 by ensuring that tokenization always returns a token vector—even when an error occurs—so that partial token lists can be inspected.

Updated the Rust tokenizer (tokenizer.rs) to return a tuple with tokens and an optional error message.
Adjusted the Python binding (tokens.py) to process the new return type and raise TokenError if needed while preserving partial tokens.
Added a new test (test_tokens.py) to verify that partial token lists are correctly returned on tokenization failure, and updated the CI workflow (.github/workflows/benchmark-sqlglot.yml) for consistent environment setup.

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File	Description
tests/test_tokens.py	New test added to assert the behavior of partial token lists on failure.
sqlglotrs/src/tokenizer.rs	Modified tokenization to return a tuple (tokens, error_msg) instead of raising immediately.
sqlglot/tokens.py	Updated RS tokenizer integration to unpack the tuple and raise TokenError if needed.
.github/workflows/benchmark-sqlglot.yml	Adjusted workflow steps to reset and reinstall dependencies for reproducible benchmarking.

Copilot · 2025-06-02T16:57:25Z

sqlglotrs/src/tokenizer.rs

-            PyException::new_err(format!("Error tokenizing '{}': {}", e.context, e.message))
-        })
+        let tokenize_result = state.tokenize();
+        match tokenize_result {


Consider adding a comment that explains the new behavior of returning a tuple (partial tokens and an optional error message) so that API consumers are aware of how to handle tokenization failures.

Copilot · 2025-06-02T16:57:25Z

sqlglot/tokens.py

-            return tokens
-        except Exception as e:
-            raise TokenError(str(e))
+        tokens, error_msg = self._RS_TOKENIZER.tokenize(sql, self._rs_dialect_settings)


Update the RS tokenizer integration documentation or comments to note the new tuple return type, ensuring that downstream users understand that partial tokens are available even after a failure.

Copilot · 2025-06-02T16:57:25Z

.github/workflows/benchmark-sqlglot.yml

      - name: Checkout main branch into subdir
        run: |
          git fetch origin main
          git worktree add main-branch origin/main


[nitpick] Consider adding a comment explaining the rationale for resetting the virtual environment to aid future maintainers in understanding the workflow steps for reproducible benchmarks.

Suggested change

git worktree add main-branch origin/main

git worktree add main-branch origin/main

# Reset the virtual environment to ensure a clean and consistent environment

# when switching between branches. This is critical for reproducible benchmarks.

georgesittas requested review from benfdking, izeigerman and tobymao June 2, 2025 14:40

georgesittas force-pushed the jo/always_return_tokens_in_sqlglotrs_tokenize branch from 8577417 to 11a95bc Compare June 2, 2025 14:52

Fix(rust-tokenizer)!: return token vector in tokenize even on failure

f8b0829

georgesittas force-pushed the jo/always_return_tokens_in_sqlglotrs_tokenize branch from 11a95bc to f8b0829 Compare June 2, 2025 15:06

benfdking requested a review from Copilot June 2, 2025 16:56

benfdking approved these changes Jun 2, 2025

View reviewed changes

Copilot AI reviewed Jun 2, 2025

View reviewed changes

georgesittas merged commit f6124c6 into main Jun 2, 2025
8 checks passed

georgesittas deleted the jo/always_return_tokens_in_sqlglotrs_tokenize branch June 2, 2025 17:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix(rust-tokenizer)!: return token vector in `tokenize` even on failure #5155

Fix(rust-tokenizer)!: return token vector in `tokenize` even on failure #5155

Uh oh!

georgesittas commented Jun 2, 2025

Uh oh!

github-actions bot commented Jun 2, 2025

Uh oh!

github-actions bot commented Jun 2, 2025

Uh oh!

github-actions bot commented Jun 2, 2025

Uh oh!

github-actions bot commented Jun 2, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Jun 2, 2025

Uh oh!

Copilot AI Jun 2, 2025

Uh oh!

Copilot AI Jun 2, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Fix(rust-tokenizer)!: return token vector in tokenize even on failure #5155

Fix(rust-tokenizer)!: return token vector in tokenize even on failure #5155

Uh oh!

Conversation

georgesittas commented Jun 2, 2025

Uh oh!

github-actions bot commented Jun 2, 2025

Benchmark for 634067d

Uh oh!

github-actions bot commented Jun 2, 2025

Benchmark for fc86a5e

Uh oh!

github-actions bot commented Jun 2, 2025

Benchmark for 76b382c

Uh oh!

github-actions bot commented Jun 2, 2025

Parsing Benchmark

Optimization Benchmark

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Copilot AI Jun 2, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jun 2, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jun 2, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Fix(rust-tokenizer)!: return token vector in `tokenize` even on failure #5155

Fix(rust-tokenizer)!: return token vector in `tokenize` even on failure #5155

Benchmark for `634067d`

Benchmark for `fc86a5e`

Benchmark for `76b382c`