Commit 3f87946

feat: add DocumentData type (#4031)
In scenarios where a large amount of data describes the document as a whole rather than individual elements in it, it may be preferable to store that data in a single location instead of duplicating it across every element (as we do for smaller metadata such as filename or filetype). This PR adds a DocumentData element type that can be used to capture this data once per document.
1 parent 6866fda commit 3f87946
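
The trade-off described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the library's code: `Text` and `Title` here are simplified stand-ins, and `build_elements`, the sample strings, and the `metadata` dict layout are hypothetical. Only the `DocumentData` class name and its `category` value come from this commit.

```python
from dataclasses import dataclass, field


# Simplified stand-ins for the library's element classes (assumption:
# the real classes carry far more behavior than shown here).
@dataclass
class Text:
    text: str
    metadata: dict = field(default_factory=dict)
    category = "UncategorizedText"  # plain class attribute, not a dataclass field


class Title(Text):
    category = "Title"


class DocumentData(Text):
    category = "DocumentData"


def build_elements(filename: str, large_payload: str) -> list[Text]:
    """Hypothetical helper: small metadata is copied onto every element,
    while the large document-level payload is stored exactly once."""
    shared = {"filename": filename}
    return [
        DocumentData(text=large_payload, metadata=dict(shared)),
        Title(text="Quarterly report", metadata=dict(shared)),
        Text(text="Revenue grew.", metadata=dict(shared)),
    ]


payload = "x" * 10_000
elements = build_elements("report.html", payload)

# Consumers can pick out the document-level element by its category.
doc_data = [e for e in elements if e.category == "DocumentData"]
```

The design choice is that per-element metadata stays small, and a consumer that needs the large payload looks for the single element whose category is `DocumentData`.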

4 files changed: 14 additions, 2 deletions

CHANGELOG.md

Lines changed: 2 additions & 1 deletion
@@ -1,8 +1,9 @@
-## 0.18.1-dev0
+## 0.18.1
 
 ### Enhancements
 
 ### Features
+- **Add DocumentData element type** This is helpful in scenarios where there is large data that does not make sense to represent across each element in the document.
 
 ### Fixes
 - The `encoding` property of the `_CsvPartitioningContext` is now properly used.

unstructured/__version__.py

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = "0.18.1-dev0" # pragma: no cover
+__version__ = "0.18.1" # pragma: no cover

unstructured/documents/elements.py

Lines changed: 10 additions & 0 deletions
@@ -640,6 +640,7 @@ class ElementType:
     PAGE_NUMBER = "PageNumber"
     CODE_SNIPPET = "CodeSnippet"
     FORM_KEYS_VALUES = "FormKeysValues"
+    DOCUMENT_DATA = "DocumentData"
 
     @classmethod
     def to_dict(cls):
@@ -975,6 +976,14 @@ class FormKeysValues(Text):
     category = "FormKeysValues"
 
 
+class DocumentData(Text):
+    """An element for capturing document-level data,
+    particularly for large data that does not make sense to
+    represent across each element in the document."""
+
+    category = "DocumentData"
+
+
 TYPE_TO_TEXT_ELEMENT_MAP: dict[str, type[Text]] = {
     ElementType.TITLE: Title,
     ElementType.SECTION_HEADER: Title,
@@ -1013,6 +1022,7 @@ class FormKeysValues(Text):
     ElementType.CODE_SNIPPET: CodeSnippet,
     ElementType.PAGE_NUMBER: PageNumber,
     ElementType.FORM_KEYS_VALUES: FormKeysValues,
+    ElementType.DOCUMENT_DATA: DocumentData,
 }
 
 
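
Registering the new class in `TYPE_TO_TEXT_ELEMENT_MAP` matters for any code that resolves a type string back to an element class. The sketch below shows that lookup pattern in isolation. It is an assumption-laden illustration: the stand-in classes, the hypothetical `element_from_dict` helper, the `{"type": ..., "text": ...}` dict shape, and the fallback to `Text` for unknown types are not taken from this commit.

```python
# Simplified stand-ins; the real classes live in unstructured/documents/elements.py.
class Text:
    category = "UncategorizedText"

    def __init__(self, text: str) -> None:
        self.text = text


class DocumentData(Text):
    category = "DocumentData"


# A one-entry version of the registry extended by this commit.
TYPE_TO_TEXT_ELEMENT_MAP: dict[str, type[Text]] = {
    "DocumentData": DocumentData,
}


def element_from_dict(item: dict) -> Text:
    """Hypothetical helper: resolve the class from the type string.

    Falling back to Text for unregistered types is an assumption made
    for this sketch, not documented library behavior.
    """
    cls = TYPE_TO_TEXT_ELEMENT_MAP.get(item["type"], Text)
    return cls(item["text"])


restored = element_from_dict({"type": "DocumentData", "text": "large payload"})
unknown = element_from_dict({"type": "NotRegistered", "text": "other"})
```

Without the map entry, a lookup like this would have no way to return a `DocumentData` instance for the `"DocumentData"` type string.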

unstructured/partition/html/convert.py

Lines changed: 1 addition & 0 deletions
@@ -216,6 +216,7 @@ def _inject_html_element_attrs(self, element_html: Tag) -> None:
     ElementType.PAGE_NUMBER: ElementHtml,
     ElementType.CODE_SNIPPET: ElementHtml,
     ElementType.FORM_KEYS_VALUES: ElementHtml,
+    ElementType.DOCUMENT_DATA: ElementHtml,
 }
 
 