Describe the bug
passing unstructured.cleaners.core.group_bullet_paragraph to UnstructuredBaseLoader's post_processors will cause the code to break, because group_bullet_paragraph returns a List[str], and unstructured.documents.elements.Text.apply() method checks the output of group_bullet_paragraph, and throws an error if it is not str, see here:
if not isinstance(cleaned_text, str): # pyright: ignore[reportUnnecessaryIsInstance]
raise ValueError("Cleaner produced a non-string output.")
To Reproduce
loader = UnstructuredFileLoader("some_file_that_has_bullet_points.pdf",
mode="elements",
pdf_infer_table_structure=True,
skip_infer_table_types=['jpg', 'png', 'xls', 'xlsx'],
show_progress=True,
post_processors=[group_bullet_paragraph]
)
docs = loader.load()
Expected behavior
The list of strings should be joined.
Proposing replacing:
if not isinstance(cleaned_text, str): # pyright: ignore[reportUnnecessaryIsInstance]
raise ValueError("Cleaner produced a non-string output.")
with something like:
if isinstance(cleaned_text, list):
cleaned_text = " ".join(cleaned_text)
if not isinstance(cleaned_text, str): # pyright: ignore[reportUnnecessaryIsInstance]
raise ValueError("Cleaner produced a non-string output.")
Describe the bug
passing
unstructured.cleaners.core.group_bullet_paragraphtoUnstructuredBaseLoader'spost_processorswill cause the code to break, becausegroup_bullet_paragraphreturns aList[str], andunstructured.documents.elements.Text.apply()method checks the output ofgroup_bullet_paragraph, and throws an error if it is notstr, see here:To Reproduce
Expected behavior
The list of strings should be joined.
Proposing replacing:
with something like: