@Edge-Explorer Edge-Explorer commented Jan 30, 2026

This commit optimizes streaming operations by implementing _iter_arrow for SkipExamplesIterable, TakeExamplesIterable, and StepExamplesIterable.

Key Changes:

  • Fast Batch Processing: Enabled batch-level slicing for .skip(n) and .take(n) on streaming datasets, bypassing slow row-by-row iteration.
  • Optimized Sharding: Updated StepExamplesIterable (used in distributed training) to use Arrow's .take() to extract multiple records from a batch simultaneously.
  • State Preservation: Updated _init_state_dict and load_state_dict so that checkpointing and resumption keep working when Arrow iteration is used.
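The batch-level slicing behind .skip(n) and .take(n) can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: plain Python lists stand in for Arrow tables, and `skip_batches` and `take_batches` are hypothetical helper names. With pyarrow, the list slices would presumably become `Table.slice(offset, length)` calls.

```python
from typing import Iterable, Iterator, List

# Assumption: a list of ints stands in for one Arrow batch (pyarrow.Table).
Batch = List[int]


def skip_batches(batches: Iterable[Batch], n: int) -> Iterator[Batch]:
    """Drop the first n rows, slicing at most one batch instead of iterating rows."""
    remaining = n
    for batch in batches:
        if remaining >= len(batch):
            # The whole batch lies inside the skipped prefix.
            remaining -= len(batch)
            continue
        # With Arrow this would be batch.slice(remaining).
        yield batch[remaining:]
        remaining = 0


def take_batches(batches: Iterable[Batch], n: int) -> Iterator[Batch]:
    """Yield batches until n rows have been emitted, truncating the last one."""
    remaining = n
    for batch in batches:
        if remaining <= 0:
            break
        # With Arrow this would be batch.slice(0, remaining).
        out = batch[:remaining]
        remaining -= len(out)
        yield out


batches = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
skipped = list(skip_batches(batches, 5))
taken = list(take_batches(batches, 6))
```

In both helpers only the batch that straddles the boundary is sliced; every other batch is either dropped or passed through whole.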

Performance Impact:

Skipping or taking examples in streaming mode should be noticeably faster. By staying on the "Arrow path" and not converting each row to a Python dictionary, these operations add less data loading overhead, which matters most for large-scale training jobs.

Testing:

Integrated 6 new unit tests into tests/test_iterable_dataset.py to verify:

  • Functional correctness for skip, take, and step using Arrow iteration.
  • Reliable state checkpointing and resumption after partial iteration.
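A resumption test of the kind described above could look roughly like this. It is a sketch under stated assumptions, not the tests added in this PR: `SkippingStream` is a hypothetical stand-in for a skip iterable, lists stand in for Arrow batches, and the state layout (`batch_idx`, `skipped`) is invented for illustration.

```python
from typing import Dict, Iterator, List


class SkippingStream:
    """Hypothetical skip-n stream over batches that can be checkpointed and resumed."""

    def __init__(self, batches: List[List[int]], n_skip: int) -> None:
        self.batches = batches
        self.n_skip = n_skip
        # Assumed state layout: next batch to read, and rows skipped so far.
        self._state: Dict[str, int] = {"batch_idx": 0, "skipped": 0}

    def state_dict(self) -> Dict[str, int]:
        return dict(self._state)

    def load_state_dict(self, state: Dict[str, int]) -> None:
        self._state = dict(state)

    def __iter__(self) -> Iterator[List[int]]:
        while self._state["batch_idx"] < len(self.batches):
            batch = self.batches[self._state["batch_idx"]]
            to_skip = min(self.n_skip - self._state["skipped"], len(batch))
            # Update state before yielding, so a checkpoint taken after
            # receiving a batch resumes at the following batch.
            self._state["batch_idx"] += 1
            self._state["skipped"] += to_skip
            if to_skip < len(batch):
                yield batch[to_skip:]


batches = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]

# Reference run without interruption.
full = list(SkippingStream(batches, 4))

# Partial run: consume one batch, checkpoint, then resume in a fresh object.
stream = SkippingStream(batches, 4)
first = next(iter(stream))
checkpoint = stream.state_dict()

resumed = SkippingStream(batches, 4)
resumed.load_state_dict(checkpoint)
rest = list(resumed)
```

The check is that the partial run plus the resumed run reproduces the uninterrupted run exactly, with no batch repeated or lost.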

This commit optimizes streaming operations by implementing _iter_arrow for SkipExamplesIterable, TakeExamplesIterable, and StepExamplesIterable.

Key Changes:
- Enabled fast batch-level processing for .skip(n) and .take(n) on streaming datasets.
- Optimized distributed sharding (StepExamplesIterable) to use Arrow's .take() for picking multiple records from a batch simultaneously.
- Updated _init_state_dict and load_state_dict to ensure seamless checkpointing while using Arrow iteration.
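For the step case, picking several records from one batch amounts to computing the matching local indices and gathering them in one call. The sketch below assumes the step iterable keeps rows whose global index satisfies `index % step == offset`; `step_batches` is a hypothetical helper, and lists stand in for Arrow batches. With pyarrow, the gather would presumably be `Table.take(indices)`.

```python
from typing import Iterable, Iterator, List


def step_batches(batches: Iterable[List[int]], step: int, offset: int) -> Iterator[List[int]]:
    """Keep rows whose global index i satisfies i % step == offset, one gather per batch."""
    seen = 0  # rows consumed from earlier batches
    for batch in batches:
        # First local index whose global index (seen + local) matches the offset.
        first = (offset - seen) % step
        indices = list(range(first, len(batch), step))
        seen += len(batch)
        if indices:
            # With Arrow this would be batch.take(indices).
            yield [batch[i] for i in indices]


batches = [[0, 1, 2, 3, 4, 5], [6, 7, 8, 9]]
picked = list(step_batches(batches, step=2, offset=1))
```

Tracking `seen` keeps the selection aligned across batch boundaries, so the result does not depend on how the rows happen to be split into batches.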

Performance Impact:
Users will see significant speedups when skipping or taking examples in streaming mode, as the dataset no longer needs to fall back to row-by-row Python dictionary conversion for these operations.

Testing:
Added 6 new unit tests to tests/test_iterable_dataset.py covering functional correctness and state resumption for all three iterable types.