feat: implement iter_arrow for skip, take and step iterables #7972

Edge-Explorer · 2026-01-30T05:47:13Z

This commit optimizes streaming operations by implementing _iter_arrow for SkipExamplesIterable, TakeExamplesIterable, and StepExamplesIterable.

Key Changes:

Fast Batch Processing: Enabled batch-level slicing for .skip(n) and .take(n) on streaming datasets, bypassing slow row-by-row iteration.
Optimized Sharding: Updated StepExamplesIterable (used in distributed training) to use Arrow's .take() to extract multiple records from a batch simultaneously.
State Preservation: Updated _init_state_dict and load_state_dict so that checkpointing and resumption keep working when iterating through the Arrow path.

Performance Impact:

Skipping or taking examples in streaming mode is expected to be noticeably faster. These operations now stay on the "Arrow path" and no longer convert each row to a Python dictionary, which reduces data loading overhead, especially for large-scale training jobs.

Testing:

Added 6 new unit tests to tests/test_iterable_dataset.py to verify:

Functional correctness for skip, take, and step using Arrow iteration.
Reliable state checkpointing and resumption after partial iteration.
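
The resumption property these tests check can be illustrated with a toy example. This is a sketch under stated assumptions, not the library's implementation or the PR's tests: `ResumableTake` is a hypothetical stand-in class, plain Python lists stand in for Arrow batches, and the state keys `batch_idx` and `num_taken` are invented for illustration.

```python
class ResumableTake:
    """Toy take-n iterable over batches with a saveable position (hypothetical)."""

    def __init__(self, batches, n):
        self.batches = batches  # list of lists, standing in for Arrow tables
        self.n = n
        self._state = {"batch_idx": 0, "num_taken": 0}

    def state_dict(self):
        # Return a copy so later iteration does not mutate the checkpoint
        return dict(self._state)

    def load_state_dict(self, state):
        self._state = dict(state)

    def __iter__(self):
        while self._state["batch_idx"] < len(self.batches):
            remaining = self.n - self._state["num_taken"]
            if remaining <= 0:
                return
            batch = self.batches[self._state["batch_idx"]][:remaining]
            # Advance the state before yielding, so a checkpoint taken after
            # receiving this batch resumes at the next one.
            self._state["batch_idx"] += 1
            self._state["num_taken"] += len(batch)
            yield batch


batches = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]

# Reference run without interruption
full = list(ResumableTake(batches, 7))

# Partial run: consume one batch, checkpoint, then resume in a fresh object
source = ResumableTake(batches, 7)
first = next(iter(source))
checkpoint = source.state_dict()

resumed = ResumableTake(batches, 7)
resumed.load_state_dict(checkpoint)
rest = list(resumed)
```

A resumption test then asserts that the partial run followed by the resumed run yields the same batches as the uninterrupted run.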

This commit optimizes streaming operations by implementing _iter_arrow for SkipExamplesIterable, TakeExamplesIterable, and StepExamplesIterable.

Key Changes:
- Enabled fast batch-level processing for .skip(n) and .take(n) on streaming datasets.
- Optimized distributed sharding (StepExamplesIterable) to use Arrow's .take() for picking multiple records from a batch simultaneously.
- Updated _init_state_dict and load_state_dict to ensure seamless checkpointing while using Arrow iteration.

Performance Impact: Users will see significant speedups when skipping or taking examples in streaming mode, as the dataset no longer needs to fall back to row-by-row Python dictionary conversion for these operations.

Testing: Added 6 new unit tests to tests/test_iterable_dataset.py covering functional correctness and state resumption for all three iterable types.
