feat: add clean_newline utility for hyphenated line breaks (#2513) #4339

Open
DevAbdullah90 wants to merge 1 commit into Unstructured-IO:main from DevAbdullah90:DevAbdullah90/feat/clean-newline

DevAbdullah90 commented Apr 16, 2026

Problem

Issue #2513 requests a utility function that rejoins hyphenated words split across line breaks (e.g., "re- \nsearch" → "research"). Such splits are common in document partitioning, where layout-preserving text extraction introduces artificial breaks inside words.

Solution

This PR adds the clean_newline function to unstructured/cleaners/core.py.

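  • Logic: Uses the regex r"(\w+)-\s+(\w+)" to find a word fragment, a hyphen, whitespace, and the continuing fragment, then rejoins the two fragments without the hyphen or whitespace.
  • Flexibility: Because \s+ matches one or more whitespace characters, it handles spaces, tabs, and newlines (in any combination) between the hyphen and the word continuation. A hyphen with no whitespace after it, as in "well-known", is left alone.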
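
The approach described above can be sketched as follows. Only the regex is quoted from this description; the function body, the signature, and the `HYPHEN_BREAK_RE` name are assumptions for illustration and may differ from the code in the PR.

```python
import re

# Pattern quoted from the PR description. Everything else in this block is a
# hypothetical reconstruction, not the code added to unstructured/cleaners/core.py.
HYPHEN_BREAK_RE = re.compile(r"(\w+)-\s+(\w+)")


def clean_newline(text: str) -> str:
    """Rejoin word fragments separated by a hyphen followed by whitespace."""
    # Keep both captured fragments and drop the hyphen and the whitespace run.
    return HYPHEN_BREAK_RE.sub(r"\1\2", text)


print(clean_newline("re- \nsearch"))      # research
print(clean_newline("re-search"))         # re-search (no whitespace after the hyphen)
print(clean_newline("pre- and post-op"))  # preand post-op
```

The last call shows a trade-off of a purely pattern-based approach: a suspended hyphen such as "pre- and" is indistinguishable from a line-break hyphen, so it is also joined.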

Changes

  • Added clean_newline to unstructured/cleaners/core.py.
  • Added test cases to test_unstructured/cleaners/test_core.py covering various indentation and newline scenarios.

Verification

  • Added parameterized unit tests in test_unstructured/cleaners/test_core.py.
  • Verified all core cleaning tests pass (91 passed).
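
A table-driven check in the spirit of the parameterized tests might look like the sketch below. The cases and the `run_cases` helper are hypothetical, not copied from test_unstructured/cleaners/test_core.py, and `clean_newline` is re-declared with an assumed body so the block runs on its own. In the real suite, the same (input, expected) pairs would typically be passed to `pytest.mark.parametrize`.

```python
import re


def clean_newline(text: str) -> str:
    # Assumed implementation built on the regex quoted in this PR description.
    return re.sub(r"(\w+)-\s+(\w+)", r"\1\2", text)


# Hypothetical (input, expected) pairs covering newline and indentation variants.
CASES = [
    ("re- \nsearch", "research"),              # space, then newline
    ("docu-\nment", "document"),               # bare newline
    ("infor-\n    mation", "information"),     # newline plus indentation
    ("hyphen-\tated", "hyphenated"),           # tab
    ("multi-\n\nline text", "multiline text"), # blank line in between
    ("well-known", "well-known"),              # ordinary hyphen is untouched
    ("no hyphen here", "no hyphen here"),      # nothing to change
]


def run_cases() -> int:
    """Assert every case and return how many were checked."""
    for raw, expected in CASES:
        actual = clean_newline(raw)
        assert actual == expected, f"{raw!r}: got {actual!r}, expected {expected!r}"
    return len(CASES)


print(run_cases())  # 7
```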

Fixes #2513

uv run python -m pytest test_unstructured/cleaners/test_core.py