Releases · jsvine/pdfplumber · GitHub

28 Feb 01:13

jsvine

v0.5.3

Fixed

Allow import pdfplumber to succeed even if ImageMagick is not installed.


27 Feb 05:12

jsvine

v0.5.2

Added

Access to curve points. (E.g., page.curves[0]["points"].)
Ability for .draw_line to draw curve points.

Changed

Disaggregated "min_words_vertical" (default: 3) and "min_words_horizontal" (default: 1), removing "text_word_threshold".
Internally, made utils.decimalize a bit more robust; now throws errors on non-decimalizable items.
Now explicitly ignoring some (obscure) pdfminer object attributes.
Changed the raw input for .draw_line from a bounding box to ((x, y), (x, y)), for consistency with curve["points"] and with Pillow's underlying method.
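
To illustrate the .draw_line input change, here is a minimal sketch of converting an old-style bounding box into the new pair-of-points form. The helper name bbox_to_line_points and the (x0, top, x1, bottom) ordering are assumptions made for illustration; they are not taken from pdfplumber's source.

```python
from typing import Tuple

Point = Tuple[float, float]
BBox = Tuple[float, float, float, float]


def bbox_to_line_points(bbox: BBox) -> Tuple[Point, Point]:
    """Hypothetical helper: map (x0, top, x1, bottom) to ((x0, top), (x1, bottom)).

    The bounding-box ordering is an assumption, not confirmed by the release notes.
    """
    x0, top, x1, bottom = bbox
    return ((x0, top), (x1, bottom))


# A horizontal line expressed the old way (as a bounding box) ...
old_style = (10, 20, 110, 20)
# ... and the same line as a pair of (x, y) points.
new_style = bbox_to_line_points(old_style)
print(new_style)
```

A sequence of (x, y) tuples is also the shape that Pillow's ImageDraw.line accepts, which is the consistency the note refers to.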

Fixed

Fixed a typo bug that occurred when .rect_edges was called before .edges.


26 Feb 16:06

jsvine

v0.5.1

Added

Quick-draw PageImage methods: .draw_vline, .draw_vlines, .draw_hline, and .draw_hlines.
Boolean parameter keep_blank_chars for .extract_words(...) and TableFinder settings.

Changed

Increased default text_tolerance and intersection_tolerance TableFinder values from 1 to 3.
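
As a rough illustration of what a larger tolerance means in practice, the sketch below compares two coordinates under the old and new default values. The within_tolerance function is hypothetical, and the inclusive comparison is an assumption; the release notes do not describe how pdfplumber applies these settings internally.

```python
def within_tolerance(a: float, b: float, tolerance: float) -> bool:
    """Hypothetical check: treat two coordinates as matching if they differ
    by no more than `tolerance` (inclusive comparison is an assumption)."""
    return abs(a - b) <= tolerance


OLD_DEFAULT = 1  # previous default, per the release note
NEW_DEFAULT = 3  # new default, per the release note

# Two coordinates that are 2 units apart.
matches_old = within_tolerance(100, 102, OLD_DEFAULT)
matches_new = within_tolerance(100, 102, NEW_DEFAULT)
print(matches_old, matches_new)
```

Under these assumptions, coordinates 2 units apart fail to match at the old default but match at the new one, so a looser default tends to group more nearby elements together.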

Fixed

Properly handle conversion of PDFs with transparency to pillow images.
Properly handle pandas DataFrames as inputs to multi-draw commands (e.g., PageImage.draw_rects(...)).


25 Feb 19:20

jsvine

v0.5.0

Completely overhauls the approach to table extraction.
Adds visual debugging.
See CHANGELOG.md for details.


09 Mar 12:57

jsvine

v0.4.0

Adds Page.extract_words(...), inspired by @jsfenfen's coalesce_words.py
Adds Page.filter(...)
Adds height/width properties to CroppedPage
Shifts idiom from .from_path to .open, and makes PDF class compatible with with statements.
Fixes a memory leak (caused by misuse of atexit)
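
The .open idiom and with-statement support follow the standard Python context-manager pattern. The sketch below uses a hypothetical Document class to show the pattern; it is not pdfplumber's PDF class and opens no real file.

```python
class Document:
    """Hypothetical stand-in for a PDF-like object (not pdfplumber's PDF class)."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.closed = False

    @classmethod
    def open(cls, path: str) -> "Document":
        # Factory method mirroring the ".open" idiom; no file is actually read here.
        return cls(path)

    def close(self) -> None:
        self.closed = True

    def __enter__(self) -> "Document":
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> bool:
        # Always release resources on exit; returning False lets exceptions propagate.
        self.close()
        return False


with Document.open("example.pdf") as doc:
    closed_inside = doc.closed  # still open inside the block

closed_after = doc.closed  # closed automatically when the block exits
print(closed_inside, closed_after)
```

With pdfplumber itself, the equivalent usage would be a with block around pdfplumber.open(...), so the file is released when the block exits instead of relying on an atexit hook.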


07 Mar 01:27

jsvine

v0.3.1

Quickfix to v0.3.0; changes get_text(...) -> extract_text(...) for symmetry's sake.


07 Mar 01:10

jsvine

v0.3.0

A ton of improvements and new features:

Shifts to a lazy-loading paradigm, so that you don't have to process an entire PDF just to access one page.
Strips out pandas requirement and usage. This results in a 3x-ish speedup for within_bbox and similar methods, thanks to short-circuiting & operators.
Moves from floats to Decimals to improve accuracy of equality comparisons.
Moves to a more modular architecture, adds Container, Page, and CroppedPage classes.
Adds Page.crop(...).
Adds Page.extract_table(...) for Tabula-like functionality.
Adds PDF.metadata property.
Adds derived properties Container.rect_edges and Container.edges, decomposing each rectangle into its constituent lines.
Renames collate_chars(...) to get_text(...) (while retaining a reference to the former).
Enriches the command-line tool's JSON output to include PDF metadata and page dimensions. [https://github.com//issues/3]
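
The move from floats to Decimals matters because binary floating-point arithmetic can make two values that should be equal compare as unequal. A minimal sketch using only the standard library (the values are illustrative, not taken from pdfplumber):

```python
from decimal import Decimal

# Binary floats cannot represent 0.1, 0.2, or 0.3 exactly,
# so the sum does not compare equal to 0.3.
float_equal = (0.1 + 0.2) == 0.3

# Decimals built from strings keep the exact decimal values,
# so the same comparison holds.
decimal_equal = (Decimal("0.1") + Decimal("0.2")) == Decimal("0.3")

print(float_equal, decimal_equal)
```

Checks such as whether two edges share a coordinate depend on exact equality, which is where this difference shows up.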
