Problems when parsing Chinese PDF documents #2999
Comments
Hi @WangJiaxin-x - do you have an example document available that we could use to replicate this? Thanks!
OK, let me give an example. I will provide two documents: one is the raw PDF file, and the other is the JSON that I generated with the code below. The bug is that some elements should be

import json
from typing import Iterable, Optional

from unstructured.documents.elements import Element
from unstructured.partition.pdf import partition_pdf
from unstructured.staging.base import elements_to_json, _fix_metadata_field_precision, elements_to_dicts

elements = partition_pdf(filename=r"C:\Users\A\Desktop\test.pdf",
                         languages=["chi_sim"])  # bytes -> BinaryIO


def elements_to_json_chi(
    elements: Iterable[Element],
    filename: Optional[str] = None,
    indent: int = 4,
    encoding: str = "utf-8",
) -> Optional[str]:
    """Saves a list of elements to a JSON file if filename is specified.

    Otherwise, return the list of elements as a string.
    """
    # -- serialize `elements` as a JSON array (str) --
    precision_adjusted_elements = _fix_metadata_field_precision(elements)
    element_dicts = elements_to_dicts(precision_adjusted_elements)
    # ensure_ascii=False keeps Chinese characters readable instead of \uXXXX escapes
    json_str = json.dumps(element_dicts, ensure_ascii=False, indent=indent, sort_keys=True)
    if filename is not None:
        with open(filename, "w", encoding=encoding) as f:
            f.write(json_str)
        return None
    return json_str


elements_to_json_chi(elements, filename="./test_json.json")

Also, it shows that in package
Thank you for the example! We're tracking this and will investigate as soon as we can.
OK, thanks for your help!
Hi, when I use
partition_type(file=io.BytesIO(file.file.read()), languages=["chi_sim"])
to parse Chinese PDF documents, I found that the paragraph text is split up so that each line becomes a separate element. Another problem is that the element type isn't accurate: it should be UncategorizedText but is actually Title.
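To make the behavior described above easier to quantify, a minimal sketch like the following could tally element types in the exported JSON. It assumes each entry carries a "type" key, as in the output of elements_to_dicts; the helper name count_element_types and the sample data are hypothetical and used only for illustration.

```python
import json
from collections import Counter


def count_element_types(json_str: str) -> Counter:
    """Tally the "type" field across a JSON array of element dicts."""
    element_dicts = json.loads(json_str)
    # Entries without a "type" key are grouped under a placeholder label.
    return Counter(d.get("type", "<missing>") for d in element_dicts)


# Hypothetical sample standing in for the real exported JSON:
# three single-line elements, two of them labeled Title.
sample = json.dumps(
    [
        {"type": "Title", "text": "第一行"},
        {"type": "Title", "text": "第二行"},
        {"type": "UncategorizedText", "text": "第三行"},
    ],
    ensure_ascii=False,
)

counts = count_element_types(sample)
print(counts.most_common())
```

On the real output, the contents of ./test_json.json (read with encoding="utf-8") could be passed in place of the sample. An unusually high share of Title elements relative to the visible headings in the PDF would be consistent with the misclassification reported here.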