Problems when parsing Chinese PDF documents #2999

Open
WangJiaxin-x opened this issue May 10, 2024 · 4 comments
Labels
bug Something isn't working needs follow up

Comments

@WangJiaxin-x

WangJiaxin-x commented May 10, 2024

Hi, when I use partition_type(file=io.BytesIO(file.file.read()), languages=["chi_sim"]) to parse Chinese PDF documents, the paragraph text is split so that each line becomes a separate element. Another problem is that the element type is inaccurate: it should be UncategorizedText but is actually Title.
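To make the line-splitting symptom concrete, here is a minimal sketch of one possible post-processing workaround, not the library's own behavior. It assumes the elements have been converted to dicts with a `text` key, and it assumes a line ending in sentence-final punctuation closes a paragraph. The function name `merge_lines` and the sample data are hypothetical.

```python
from typing import Dict, List

# Assumption: a line ending in one of these characters closes a paragraph.
SENTENCE_END = ("。", "！", "？", ".", "!", "?")


def merge_lines(element_dicts: List[Dict[str, str]]) -> List[str]:
    """Join consecutive line-level texts until a line ends a sentence."""
    paragraphs: List[str] = []
    buffer = ""
    for element in element_dicts:
        text = element.get("text", "").strip()
        if not text:
            continue
        buffer += text  # Chinese text needs no space between joined lines
        if text.endswith(SENTENCE_END):
            paragraphs.append(buffer)
            buffer = ""
    if buffer:  # keep trailing text that never reached a sentence end
        paragraphs.append(buffer)
    return paragraphs


# Hypothetical line-level output; real element dicts carry more keys.
lines = [
    {"type": "Title", "text": "这是第一段的第一行，"},
    {"type": "UncategorizedText", "text": "这是第一段的第二行。"},
    {"type": "UncategorizedText", "text": "这是第二段。"},
]
print(merge_lines(lines))
```

This heuristic ignores element types, so a genuine title without closing punctuation would be merged into the paragraph that follows it.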

@WangJiaxin-x WangJiaxin-x added the bug Something isn't working label May 10, 2024
@MthwRobinson
Contributor

Hi @WangJiaxin-x - do you have an example document available that we could use to replicate this? Thanks!

@idiotTest

OK, here is an example. I am attaching two documents: the raw PDF file and the JSON I get from the code below. The bug is that some elements should be Title but are UncategorizedText in the result. The result also shows that some paragraphs are not recognized as paragraphs: as you can see in the JSON, each line of a paragraph is recognized as its own element, so one paragraph is split into many elements. I don't think this is a good result.
Looking forward to your reply. Thanks!

import json
from typing import Iterable, Optional

from unstructured.documents.elements import Element
from unstructured.partition.pdf import partition_pdf
from unstructured.staging.base import _fix_metadata_field_precision, elements_to_dicts

# Partition the PDF from a local path; "chi_sim" is the Tesseract code for Simplified Chinese.
elements = partition_pdf(filename=r"C:\Users\A\Desktop\test.pdf",
                         languages=["chi_sim"])


def elements_to_json_chi(
        elements: Iterable[Element],
        filename: Optional[str] = None,
        indent: int = 4,
        encoding: str = "utf-8",
) -> Optional[str]:
    """Saves a list of elements to a JSON file if filename is specified.

    Otherwise, return the list of elements as a string.
    """
    # -- serialize `elements` as a JSON array (str) --
    precision_adjusted_elements = _fix_metadata_field_precision(elements)
    element_dicts = elements_to_dicts(precision_adjusted_elements)
    json_str = json.dumps(element_dicts, ensure_ascii=False, indent=indent, sort_keys=True)

    if filename is not None:
        with open(filename, "w", encoding=encoding) as f:
            f.write(json_str)
        return None

    return json_str


elements_to_json_chi(elements, filename="./test_json.json")

Also, this shows that the function elements_to_json in the package unstructured.staging.base has an encoding problem with Chinese text. I think the ensure_ascii parameter of json.dumps should be False.
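The effect of `ensure_ascii` can be seen with the standard library alone. This is a small illustration with a made-up record, not the code inside `elements_to_json`. Both outputs are valid JSON and decode to the same data; the difference is only whether the Chinese characters stay readable in the file.

```python
import json

# Made-up record for illustration only.
record = {"text": "中文文档"}

# Default (ensure_ascii=True): non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(record)

# ensure_ascii=False: characters are written as-is, so the output file
# must be opened with an encoding that can represent them, such as UTF-8.
readable = json.dumps(record, ensure_ascii=False)

print(escaped)
print(readable)

# Both strings round-trip to the same Python object.
assert json.loads(escaped) == json.loads(readable) == record
```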
test.pdf
test_json.json

@MthwRobinson
Contributor

Thank you for the example! We're tracking this and will investigate as soon as we can.

@idiotTest

OK, thanks for your help!

3 participants