Problems when parsing Chinese PDF documents #2999

WangJiaxin-x · 2024-05-10T08:08:37Z

Hi, when I use partition_type(file=io.BytesIO(file.file.read()), languages=["chi_sim"]) to parse Chinese PDF documents, the result splits each paragraph into separate lines, with each line returned as its own element. Another problem is that the element type is inaccurate: text that should be UncategorizedText is labeled Title.


MthwRobinson · 2024-05-10T12:22:06Z

Hi @WangJiaxin-x - do you have an example document available that we could use to replicate this? Thanks!

idiotTest · 2024-05-11T03:31:49Z

OK, here is an example. I'm attaching two files: the raw PDF and the JSON produced by the code below. One bug is that some elements that should be Title come out as UncategorizedText. The result also shows that paragraphs are not recognized as units: as you can see in the JSON, each line of a paragraph is returned as a separate element, so one paragraph is split into many elements. I don't think this is a good result.
Looking forward to your reply. Thanks!

import json
from typing import Iterable, Optional

from unstructured.documents.elements import Element
from unstructured.partition.pdf import partition_pdf
from unstructured.staging.base import _fix_metadata_field_precision, elements_to_dicts

elements = partition_pdf(filename=r"C:\Users\A\Desktop\test.pdf",
                         languages=["chi_sim"])  # Simplified Chinese


def elements_to_json_chi(
        elements: Iterable[Element],
        filename: Optional[str] = None,
        indent: int = 4,
        encoding: str = "utf-8",
) -> Optional[str]:
    """Saves a list of elements to a JSON file if filename is specified.

    Otherwise, return the list of elements as a string.
    """
    # -- serialize `elements` as a JSON array (str) --
    precision_adjusted_elements = _fix_metadata_field_precision(elements)
    element_dicts = elements_to_dicts(precision_adjusted_elements)
    json_str = json.dumps(element_dicts, ensure_ascii=False, indent=indent, sort_keys=True)

    if filename is not None:
        with open(filename, "w", encoding=encoding) as f:
            f.write(json_str)
        return None

    return json_str


elements_to_json_chi(elements, filename="./test_json.json")

Also, the elements_to_json function in unstructured.staging.base seems to have an encoding problem with Chinese text. I think the ensure_ascii parameter of the JSON serialization call (json.dumps) should be False.
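To illustrate the ensure_ascii point, here is a minimal sketch using only the Python standard library. The element dict is a made-up stand-in for one entry of the JSON output, not real output from unstructured.

```python
import json

# Hypothetical element dict (an assumption for illustration only).
element = {"type": "Title", "text": "中文标题"}

# Default behaviour: ensure_ascii=True escapes every non-ASCII character
# as a \uXXXX sequence, so the Chinese text is unreadable in the file.
escaped = json.dumps(element)

# With ensure_ascii=False the characters are written as-is; the file
# must then be opened with a Unicode encoding such as UTF-8.
readable = json.dumps(element, ensure_ascii=False)

print(escaped)
print(readable)

# Both strings decode back to the same data, so this is a readability
# difference in the serialized text, not data loss.
assert json.loads(escaped) == json.loads(readable) == element
```

The escaped form is still valid JSON and round-trips correctly, but it is hard to inspect by eye, which is why the wrapper above passes ensure_ascii=False.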
test.pdf
test_json.json

MthwRobinson · 2024-05-13T12:34:15Z

Thank you for the example! We're tracking this and will investigate as soon as we can.

idiotTest · 2024-05-14T08:02:24Z

OK, thanks for your help!

WangJiaxin-x added the bug Something isn't working label May 10, 2024

MthwRobinson added the awaiting-response label May 10, 2024

WangJiaxin-x closed this as completed May 11, 2024

WangJiaxin-x reopened this May 11, 2024

WangJiaxin-x closed this as completed May 11, 2024

WangJiaxin-x reopened this May 11, 2024

MthwRobinson removed the awaiting-response label May 13, 2024

MthwRobinson added the needs follow up label May 13, 2024
