Skip to content

Remove IDs from HTML code #4012

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Jun 11, 2025
Merged

Remove IDs from HTML code #4012

merged 11 commits into from
Jun 11, 2025

Conversation

plutasnyy
Copy link
Contributor

@plutasnyy plutasnyy commented Jun 5, 2025

In this pull request parent-child relationship for elements generated with v2 parser is based on actual element IDs instead of IDs baked somewhere in the HTML script.
With some extra bug fixing it allowed for significantly simplifying json -> HTML script

@@ -134,9 +136,7 @@ def get_ontology_to_unstructured_type_mapping() -> dict[str, Element]:
for tag in element_type().allowed_tags
}
CSS_CLASS_TO_ELEMENT_TYPE_MAP: Dict[str, Type[ontology.OntologyElement]] = {
element_type().css_class_name: element_type
for element_type in ALL_ONTOLOGY_ELEMENT_TYPES
for tag in element_type().allowed_tags
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how was this line of code working before?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was copy-paste form the mapping above (which is more specific), here we just did more loops, but the 'tag' is not used anywhere. After the change, the mapping is the same, just unnecessary loops are removed:

CSS_CLASS_TO_ELEMENT_TYPE_MAP_BEFORE: Dict[str, Type[ontology.OntologyElement]] = {
    element_type().css_class_name: element_type
    for element_type in ALL_ONTOLOGY_ELEMENT_TYPES
    for tag in element_type().allowed_tags
}

CSS_CLASS_TO_ELEMENT_TYPE_MAP_AFTER: Dict[str, Type[ontology.OntologyElement]] = {
    element_type().css_class_name: element_type for element_type in ALL_ONTOLOGY_ELEMENT_TYPES
}
print(CSS_CLASS_TO_ELEMENT_TYPE_MAP_BEFORE == CSS_CLASS_TO_ELEMENT_TYPE_MAP_AFTER)
# Output: True

This is because some CSS classes can have multiple HTML tags assigned eg.

class Heading(OntologyElement):
    description: str = Field("Section headings (levels 1-6)", frozen=True)
    elementType: ElementTypeEnum = Field(ElementTypeEnum.text, frozen=True)
    allowed_tags: List[str] = Field(["h1", "h2", "h3", "h4", "h5", "h6"], frozen=True)

In this case, we were creating a 6 times entry: "Heading": OntologyElement.Heading

@plutasnyy
Copy link
Contributor Author

Removed detection_origin from unit tests for 2 reasons:

  • This field is intentionally excluded from __eq__ anyway
  • It doesn't work? Or work in a different way than I would expect. This field is excluded from to_dict(), but this code:
from unstructured.documents.elements import NarrativeText, ElementMetadata
metadata = ElementMetadata(
    text_as_html='<p class="NarrativeText">DEALER ONLY</p>',
)
metadata.detection_origin = "vlm_partitioner"
element = NarrativeText(
        text="DEALER ONLY",
        detection_origin="Test",
        metadata=metadata,
    )
print(element.metadata.detection_origin)

Print None as this line guards assigning debug field

For some reason, it was assigned in the unit test (even though I got None = None locally), possibly due to some version mismatch.

I think it needs another debugging session and is not a blocker for this ticket (this change didn't touch detection origin)

@plutasnyy
Copy link
Contributor Author

I have verified if the script produces the same outputs, and the only difference is in the order of classes in CSS + br is generated with a space (which I assume is fine, as it is rendered in the same way)
image

@plutasnyy plutasnyy self-assigned this Jun 9, 2025
@plutasnyy plutasnyy marked this pull request as ready for review June 9, 2025 15:44
Comment on lines 263 to 264
if parent_id not in id_to_element_mapping:
raise ValueError(f"Parent element {parent_id} not found in id_to_element_mapping")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add a unit test for the new exception?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried, but I noticed that they make no sense. parent_id is always set before (if isn't it is generated) and ontology always returns some element so I just removed raising exception part

@badGarnet
Copy link
Collaborator

hold on while I bump our base image to fix the dockerfile cve

@plutasnyy plutasnyy added this pull request to the merge queue Jun 11, 2025
Merged via the queue into main with commit ec209c6 Jun 11, 2025
66 of 67 checks passed
@plutasnyy plutasnyy deleted the fix-element-ids branch June 11, 2025 12:26
yuming-long pushed a commit that referenced this pull request Jun 13, 2025
In this pull request parent-child relationship for elements generated
with v2 parser is based on actual element IDs instead of IDs baked
somewhere in the HTML script.
With some extra bug fixing it allowed for significantly simplifying json
-> HTML script
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants