Releases: Unstructured-IO/unstructured
0.18.1
0.17.11-dev1
What's Changed
- Matches prefix to verify presence of DOCX,PPTX,XLSX files instead of standard file names by @srisudarsan in #3959
- manual trigger of workflows to publish new image and new vers tag in … by @luke-kucing in #3965
- chore: deprecate stage_for_label_studio by @qued in #3968
- build: remove test and dev deps from docker image by @qued in #3969
- feat: convenience unstructured-get-json.sh update by @cragwolfe in #3971
- chore: allow changing default output dir for unstructured-get-json.sh by @cragwolfe in #3973
- chore: add html path to ingest-test-fixtures-update-pr by @cragwolfe in #3977
- fix: hi_res PDF parsing: only uncategorized text for extracted elements by @cragwolfe in #3975
- Fix sort_page_element. ensures that sorting is stable and not random. by @pprados in #3978
- Update pdfminer_utils.py by @Nathan-GoSupply in #3974
- fix cve by @potter-potter in #3989
- fix: Add missing diffstat command to test_json_to_html CI job by @mpolomdeepsense in #3992
- fix: failing build by @mpolomdeepsense in #3993
- fix: properly handle the case when an element's text is None by @badGarnet in #3995
- fix: Fix for Pillow error when extracting PNG images by @awalker4 in #3998
- fix: throw validation error when json is passed with invalid unstructured json by @jordan-homan in #4002
- Replace Serverless API to Platform announcement on README page by @ron-unstructured in #4003
- fix: resolve warnings of logger library by @emmanuel-ferdman in #3999
- chore: script to verify unstructured image outbound connectivity by @cragwolfe in #4008
- resolve CVEs and HF issue by @luke-kucing in #4009
- Feat/bump inference by @badGarnet in #4013
- Bump requests to address CVEs by @PastelStorm in #4015
- Drop Python 3.9 support due to dependency conflicts by @PastelStorm in #4017
- Remove IDs from HTML code by @plutasnyy in #4012
- fix chunking text None type has no attribute strip by @yuming-long in #4018
- recompile on arm64 to get minimum reqs by @badGarnet in #4020
New Contributors
- @srisudarsan made their first contribution in #3959
- @Nathan-GoSupply made their first contribution in #3974
- @jordan-homan made their first contribution in #4002
- @emmanuel-ferdman made their first contribution in #3999
- @PastelStorm made their first contribution in #4015
Full Changelog: 0.17.2...0.17.11-dev1
0.17.2
Enhancements
- Add image_url of images in html partitioner. <img> tags with non-data content include a new image_url metadata field with the content of the src attribute.
- Use lxml instead of bs4 to parse hOCR data. lxml is much faster than bs4 given the hOCR data format is regular (guaranteed because it is programmatically generated).
- Bump numpy to >2, and upgrade paddlepaddle, unstructured-paddleocr, and onnx so they are compatible with numpy>2.
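The image_url rule above can be illustrated with a small standard-library sketch. It is not the partitioner's implementation (nothing here uses unstructured or lxml); the class and function names are hypothetical, and "non-data content" is assumed to mean a src that is not an inline data: URI.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect src values from <img> tags, skipping inline data: URIs."""

    def __init__(self):
        super().__init__()
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        # Assumption: only a src that is not a data: URI counts as an image URL.
        if src and not src.strip().lower().startswith("data:"):
            self.image_urls.append(src)


def collect_image_urls(html):
    """Return the non-data src values of all <img> tags, in document order."""
    parser = ImgSrcCollector()
    parser.feed(html)
    parser.close()
    return parser.image_urls
```

In this sketch an embedded base64 image contributes nothing, while a linked image contributes its src string unchanged.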
Fixes
- Fix an image inside a div or span tag with no text being labeled "UncategorizedText" with no .text. It is now annotated as Image.
What's Changed
- feat: support extracting image url in html by @ryannikolaidis in #3955
- feat: use lxml instead of bs4 to parse hOCR data by @badGarnet in #3960
- Feat/bump numpy to 2 by @badGarnet in #3961
- Image within div or span with no text is annotated as Image by @ajjimeno in #3962
Full Changelog: 0.17.0...0.17.2
0.17.0
What's Changed
- feat: include images when partitioning html by @ryannikolaidis in #3945
- fix: pass extract image args to all partitioners by @ryannikolaidis in #3950
- feat: allow passing down of ocr agent and table agent by @badGarnet in #3954
- Feat/remove reference of PageLayout.elements by @badGarnet in #3943
Full Changelog: 0.16.25...0.17.0
0.16.25
0.16.24
Enhancements
- Support dynamic partitioner file type registration. Use create_file_type to create a new file type that can be handled in unstructured, and register_partitioner to register your own partitioner for any file type.
- extract_image_block_types now also works for CamelCase element type names. Previously, NarrativeText and similar CamelCase element types could not be extracted using this parameter in partition. Now figures for those elements can be extracted like Image and Table elements.
- Use block matrix to reduce peak memory usage for pdf/image partition.
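The general idea behind registering a partitioner for a new file type can be sketched as a decorator-based registry. This is a plain-Python illustration of the pattern only: it does not call create_file_type or register_partitioner, and every name in it (register_for, partition_by_name, the .foo extension) is hypothetical.

```python
from pathlib import Path

# Hypothetical registry: maps a lowercase file extension to a partitioner callable.
_PARTITIONERS = {}


def register_for(extension):
    """Decorator that registers a function as the partitioner for an extension."""

    def decorator(func):
        _PARTITIONERS[extension.lower()] = func
        return func

    return decorator


@register_for(".foo")
def partition_foo(text):
    # Toy partitioner: one element per non-empty line.
    return [line.strip() for line in text.splitlines() if line.strip()]


def partition_by_name(filename, text):
    """Dispatch to whichever partitioner is registered for the file's extension."""
    ext = Path(filename).suffix.lower()
    try:
        partitioner = _PARTITIONERS[ext]
    except KeyError:
        raise ValueError(f"no partitioner registered for {ext!r}") from None
    return partitioner(text)
```

Keying the registry on a normalized extension keeps dispatch independent of filename casing; a real implementation would presumably also consider MIME types.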
Features
- Add JSON elements to HTML converter - Converts JSON elements file into an HTML file.
Fixes
0.16.23
0.16.22
0.16.21
Enhancements
- Use password to load PDF with all modes.
- Use vectorized logic to merge inferred and extracted layouts. The layout merging logic is refactored with the new LayoutElements data structure and the numpy library to improve compute performance and make the logic clearer.
- Add PDF Miner configuration. PDF Miner can now be configured via the pdfminer_line_overlap, pdfminer_word_margin, pdfminer_line_margin and pdfminer_char_margin parameters added to the partition method.
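A minimal sketch of passing the four PDF Miner settings named above. It assumes partition_pdf accepts them as keyword arguments, as the note states for the partition method; the wrapper function name and the numeric values are illustrative only (the values mirror pdfminer.six's documented LAParams defaults), and the import is deferred so the snippet loads without unstructured installed.

```python
# Illustrative values only; tune them for your documents.
PDFMINER_OPTIONS = {
    "pdfminer_line_overlap": 0.5,
    "pdfminer_word_margin": 0.1,
    "pdfminer_line_margin": 0.5,
    "pdfminer_char_margin": 2.0,
}


def partition_with_pdfminer_options(filename):
    """Partition a PDF, forwarding the PDF Miner settings as keyword arguments."""
    # Deferred import: requires the unstructured package with PDF extras.
    from unstructured.partition.pdf import partition_pdf

    return partition_pdf(filename=filename, **PDFMINER_OPTIONS)
```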
Features
Fixes
- Fix file type detection for NDJSON files. NDJSON files were being detected as JSON due to having the same mime-type.
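Because the mime-type alone cannot separate the two formats, one way to tell them apart is by content. The following is a hedged sketch of such a check using only the standard library; it is not the detection logic used by unstructured, and the function name is hypothetical.

```python
import json


def looks_like_ndjson(text):
    """Return True if text is several newline-delimited JSON values rather than one JSON document."""
    try:
        json.loads(text)
        return False  # Parses as a single JSON document.
    except json.JSONDecodeError:
        pass
    lines = [line for line in text.splitlines() if line.strip()]
    if len(lines) < 2:
        return False
    try:
        for line in lines:
            json.loads(line)  # Every non-empty line must be valid JSON on its own.
    except json.JSONDecodeError:
        return False
    return True
```

A file holding exactly one JSON value on one line is valid as both formats, so this sketch treats it as plain JSON.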
0.16.20
Enhancements
Features
Fixes
- Fix a security issue where rst and org files could read files in the local filesystem. Certain filetypes could 'include' or 'import' local files into their content, allowing partitioning of arbitrary files from the local filesystem. Partitioning of these files is now sandboxed.