v0.5.2
### Added
- Access to `curve` points. (E.g., `page.curves[0]["points"]`.)
- Ability for `.draw_line` to draw `curve` points.

### Changed
- Disaggregated "min_words_vertical" (default: 3) and "min_words_horizontal" (default: 1), removing "text_word_threshold".
- Internally, made `utils.decimalize` a bit more robust; now throws errors on non-decimalizable items.
- Now explicitly ignoring some (obscure) `pdfminer` object attributes.
- Raw input for `.draw_line` from a bounding box to `((x, y), (x, y))`, for consistency with `curve["points"]` and with `Pillow`'s underlying method.

### Fixed
- Fixed typo bug when `.rect_edges` is called before `.edges`
jsvine committed Feb 27, 2017
1 parent 6d2d010 commit b44f2dc
Showing 11 changed files with 406 additions and 44 deletions.
14 changes: 14 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,20 @@ All notable changes to this project will be documented in this file. Currently g

The format is based on [Keep a Changelog](http://keepachangelog.com/).

## [0.5.2] — 2017-02-27
### Added
- Access to `curve` points. (E.g., `page.curves[0]["points"]`.)
- Ability for `.draw_line` to draw `curve` points.

### Changed
- Disaggregated "min_words_vertical" (default: 3) and "min_words_horizontal" (default: 1), removing "text_word_threshold".
- Internally, made `utils.decimalize` a bit more robust; now throws errors on non-decimalizable items.
- Now explicitly ignoring some (obscure) `pdfminer` object attributes.
- Raw input for `.draw_line` from a bounding box to `((x, y), (x, y))`, for consistency with `curve["points"]` and with `Pillow`'s underlying method.

### Fixed
- Fixed typo bug when `.rect_edges` is called before `.edges`

## [0.5.1] — 2017-02-26
### Added
- Quick-draw `PageImage` methods: `.draw_vline`, `.draw_vlines`, `.draw_hline`, and `.draw_hlines`.
34 changes: 28 additions & 6 deletions README.md
@@ -1,4 +1,4 @@
# PDFPlumber `v0.5.1`
# PDFPlumber `v0.5.2`

Plumb a PDF for detailed information about each text character, rectangle, and line. Plus: Table extraction and visual debugging.

@@ -102,6 +102,7 @@ Each instance of `pdfplumber.PDF` and `pdfplumber.Page` provides access to four
- `.annos`, each representing a single annotation-text character.
- `.lines`, each representing a single 1-dimensional line.
- `.rects`, each representing a single 2-dimensional rectangle.
- `.curves`, each representing a series of connected points.

Each object is represented as a simple Python `dict`, with the following properties:

@@ -130,7 +131,7 @@ Each object is represented as a simple Python `dict`, with the following propert

| Property | Description |
|----------|-------------|
|`page_number`| Page number on which this character was found.|
|`page_number`| Page number on which this line was found.|
|`height`| Height of line.|
|`width`| Width of line.|
|`x0`| Distance of left-side extremity from left side of page.|
@@ -147,7 +148,7 @@ Each object is represented as a simple Python `dict`, with the following propert

| Property | Description |
|----------|-------------|
|`page_number`| Page number on which this character was found.|
|`page_number`| Page number on which this rectangle was found.|
|`height`| Height of rectangle.|
|`width`| Width of rectangle.|
|`x0`| Distance of left side of rectangle from left side of page.|
@@ -160,6 +161,24 @@ Each object is represented as a simple Python `dict`, with the following propert
|`linewidth`| Thickness of line.|
|`object_type`| "rect"|

#### `curve` properties

| Property | Description |
|----------|-------------|
|`page_number`| Page number on which this curve was found.|
|`points`| Points — as a list of `(x, top)` tuples — describing the curve.|
|`height`| Height of curve's bounding box.|
|`width`| Width of curve's bounding box.|
|`x0`| Distance of curve's left-most point from left side of page.|
|`x1`| Distance of curve's right-most point from left side of the page.|
|`y0`| Distance of curve's lowest point from bottom of page.|
|`y1`| Distance of curve's highest point from bottom of page.|
|`top`| Distance of curve's highest point from top of page.|
|`bottom`| Distance of curve's lowest point from top of page.|
|`doctop`| Distance of curve's highest point from top of document.|
|`linewidth`| Thickness of line.|
|`object_type`| "curve"|
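
As a rough illustration of how the `points` list relates to the bounding-box properties in the table above, the sketch below builds a hypothetical `curve` dict (the coordinates are invented) and derives a box from its points. The helper name `points_to_bbox` is an assumption made for this example and is not part of `pdfplumber`; in the library, the box values are supplied by `pdfminer` rather than by code like this.

```python
from decimal import Decimal

# Hypothetical curve object shaped like the property table above.
# All coordinate values are invented for illustration.
curve = {
    "object_type": "curve",
    "page_number": 1,
    "points": [
        (Decimal("10"), Decimal("50")),
        (Decimal("20"), Decimal("30")),
        (Decimal("35"), Decimal("45")),
    ],
}

def points_to_bbox(points):
    """Return (x0, top, x1, bottom) enclosing a list of (x, top) points."""
    xs = [x for x, _ in points]
    tops = [top for _, top in points]
    return (min(xs), min(tops), max(xs), max(tops))

x0, top, x1, bottom = points_to_bbox(curve["points"])
width = x1 - x0       # horizontal extent of the points
height = bottom - top  # vertical extent of the points
```

Because `top` is measured from the top of the page, the smallest `top` value is the curve's highest point.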

Additionally, both `pdfplumber.PDF` and `pdfplumber.Page` provide access to two derived lists of objects: `.rect_edges` (which decomposes each rectangle into its four lines) and `.edges` (which combines `.rect_edges` with `.lines`).

## Visual debugging
@@ -191,7 +210,7 @@ You can pass explicit coordinates or any `pdfplumber` PDF object (e.g., char, li

| Single-object method | Bulk method | Description |
|----------------------|-------------|-------------|
|`im.draw_line(line, stroke={color}, stroke_width=1)`| `im.draw_lines(list_of_lines, **kwargs)`| Draws a line from a `line`-like object, or a 4-tuple bounding box.|
|`im.draw_line(line, stroke={color}, stroke_width=1)`| `im.draw_lines(list_of_lines, **kwargs)`| Draws a line from a `line`, `curve`, or a 2-tuple of 2-tuples (e.g., `((x, y), (x, y))`).|
|`im.draw_vline(location, stroke={color}, stroke_width=1)`| `im.draw_vlines(list_of_locations, **kwargs)`| Draws a vertical line at the x-coordinate indicated by `location`.|
|`im.draw_hline(location, stroke={color}, stroke_width=1)`| `im.draw_hlines(list_of_locations, **kwargs)`| Draws a horizontal line at the y-coordinate indicated by `location`.|
|`im.draw_rect(bbox_or_obj, fill={color}, stroke={color}, stroke_width=1)`| `im.draw_rects(list_of_rects, **kwargs)`| Draws a rectangle from a `rect`, `char`, etc., or 4-tuple bounding box.|
@@ -243,7 +262,8 @@ By default, `extract_tables` uses the page's vertical and horizontal lines (or r
"snap_tolerance": 3,
"join_tolerance": 3,
"edge_min_length": 3,
"text_word_threshold": 3,
"min_words_vertical": 3,
"min_words_horizontal": 1,
"keep_blank_chars": False,
"text_tolerance": 3,
"text_x_tolerance": None,
@@ -263,7 +283,8 @@ By default, `extract_tables` uses the page's vertical and horizontal lines (or r
|`"snap_tolerance"`| Parallel lines within `snap_tolerance` pixels will be "snapped" to the same horizontal or vertical position.|
|`"join_tolerance"`| Line segments on the same infinite line, and whose ends are within `join_tolerance` of one another, will be "joined" into a single line segment.|
|`"edge_min_length"`| Edges shorter than `edge_min_length` will be discarded before attempting to reconstruct the table.|
|`"text_word_threshold"`| When using the `text` strategy, at least `text_word_threshold` words must share the same alignment.|
|`"min_words_vertical"`| When using `"vertical_strategy": "text"`, at least `min_words_vertical` words must share the same alignment.|
|`"min_words_horizontal"`| When using `"horizontal_strategy": "text"`, at least `min_words_horizontal` words must share the same alignment.|
|`"keep_blank_chars"`| When using the `text` strategy, consider `" "` chars to be *parts* of words and not word-separators.|
|`"text_tolerance"`, `"text_x_tolerance"`, `"text_y_tolerance"`| When the `text` strategy searches for words, it will expect the individual letters in each word to be no more than `text_tolerance` pixels apart.|
|`"intersection_tolerance"`, `"intersection_x_tolerance"`, `"intersection_y_tolerance"`| When combining edges into cells, orthogonal edges must be within `intersection_tolerance` pixels to be considered intersecting.|
@@ -290,6 +311,7 @@ Both `vertical_strategy` and `horizontal_strategy` accept the following options:

- [Using `extract_table` on a California Worker Adjustment and Retraining Notification (WARN) report](examples/notebooks/extract-table-ca-warn-report.ipynb). Demonstrates basic visual debugging and table extraction.
- [Using `extract_table` on the FBI's National Instant Criminal Background Check System PDFs](examples/notebooks/extract-table-nics.ipynb). Demonstrates how to use visual debugging to find optimal table extraction settings. Also demonstrates `Page.crop(...)` and `Page.extract_text(...)`
- [Inspecting and visualizing `curve` objects](examples/notebooks/ag-energy-roundup-curves.ipynb).

## Acknowledgments / Contributors

277 changes: 277 additions & 0 deletions examples/notebooks/ag-energy-roundup-curves.ipynb

Large diffs are not rendered by default.

Binary file added examples/pdfs/ag-energy-round-up-2017-02-24.pdf
Binary file not shown.
2 changes: 1 addition & 1 deletion pdfplumber/_version.py
@@ -1,2 +1,2 @@
version_info = (0, 5, 1)
version_info = (0, 5, 2)
__version__ = '.'.join(map(str, version_info))
2 changes: 1 addition & 1 deletion pdfplumber/container.py
@@ -40,7 +40,7 @@ def annos(self):

@property
def rect_edges(self):
if hasattr(self, "_rect_edges"): return self._edges
if hasattr(self, "_rect_edges"): return self._rect_edges
rect_edges_gen = (utils.rect_to_edges(r) for r in self.rects)
self._rect_edges = list(chain(*rect_edges_gen))
return self._rect_edges
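
The one-line fix above concerns a cached property that returned the wrong attribute on a cache hit. A minimal, self-contained sketch of the corrected pattern follows; the `Container` class and the simplified `rect_to_edges` stand-in are assumptions for illustration and do not reproduce the real `pdfplumber` classes.

```python
from itertools import chain

def rect_to_edges(rect):
    """Simplified stand-in for `utils.rect_to_edges`: one dict per side."""
    return [dict(rect, side=side) for side in ("top", "bottom", "left", "right")]

class Container(object):
    def __init__(self, rects):
        self.rects = rects

    @property
    def rect_edges(self):
        # On a cache hit, return the attribute that was actually cached.
        # The pre-fix code returned `self._edges` here, which is only set
        # once `.edges` has been accessed.
        if hasattr(self, "_rect_edges"):
            return self._rect_edges
        rect_edges_gen = (rect_to_edges(r) for r in self.rects)
        self._rect_edges = list(chain(*rect_edges_gen))
        return self._rect_edges

container = Container([{"x0": 0, "top": 0, "x1": 10, "bottom": 5}])
first = container.rect_edges   # computed and cached
second = container.rect_edges  # served from the cache
```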
29 changes: 18 additions & 11 deletions pdfplumber/display.py
@@ -88,16 +88,18 @@ def reset(self):
def copy(self):
return self.__class__(self.page, self.original)

def draw_line(self, points_or_line,
def draw_line(self, points_or_obj,
stroke=DEFAULT_STROKE,
stroke_width=DEFAULT_STROKE_WIDTH):
if isinstance(points_or_line, (tuple, list)):
points = points_or_line
if isinstance(points_or_obj, (tuple, list)):
points = points_or_obj
elif type(points_or_obj) == dict and "points" in points_or_obj:
points = points_or_obj["points"]
else:
obj = points_or_line
points = (obj["x0"], obj["top"], obj["x1"], obj["bottom"])
obj = points_or_obj
points = ((obj["x0"], obj["top"]), (obj["x1"], obj["bottom"]))
self.draw.line(
self._reproject_bbox(points),
list(map(self._reproject, points)),
fill=stroke,
width=stroke_width
)
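
The input handling in the new `draw_line` can be read as a normalization step that turns three kinds of input into one list of points. The sketch below isolates that step under a made-up name, `to_points`; it leaves out the reprojection to image coordinates and the drawing call, and it uses `isinstance` where the diff uses a `type(...) ==` comparison.

```python
def to_points(points_or_obj):
    """Normalize draw_line-style input to a list of (x, top) points."""
    # Explicit points, e.g. ((x, y), (x, y)), are used as given.
    if isinstance(points_or_obj, (tuple, list)):
        return list(points_or_obj)
    # Curve-like objects carry their own list of points.
    if isinstance(points_or_obj, dict) and "points" in points_or_obj:
        return list(points_or_obj["points"])
    # Line-like objects are reduced to their two end points.
    obj = points_or_obj
    return [(obj["x0"], obj["top"]), (obj["x1"], obj["bottom"])]

from_tuple = to_points(((0, 0), (5, 5)))
from_curve = to_points({"points": [(1, 2), (3, 4), (5, 6)]})
from_line = to_points({"x0": 1, "top": 2, "x1": 3, "bottom": 4})
```

A flat 4-tuple such as `(x0, top, x1, bottom)` is no longer a valid input under this scheme, which matches the changelog entry about raw input for `.draw_line`.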
@@ -165,10 +167,10 @@ def draw_rect(self, bbox_or_obj,

if stroke_width > 0:
segments = [
(x0, top, x1, top), # top
(x0, bottom, x1, bottom), # bottom
(x0, top, x0, bottom), # left
(x1, top, x1, bottom), # right
((x0, top), (x1, top)), # top
((x0, bottom), (x1, bottom)), # bottom
((x0, top), (x0, bottom)), # left
((x1, top), (x1, bottom)), # right
]
self.draw_lines(
segments,
@@ -195,7 +197,12 @@ def draw_circle(self, center_or_obj,
(obj["top"] + obj["bottom"]) / 2
)
cx, cy = center
bbox = (cx - radius, cy - radius, cx + radius, cy + radius)
bbox = self.decimalize((
cx - radius,
cy - radius,
cx + radius,
cy + radius
))
self.draw.ellipse(
self._reproject_bbox(bbox),
fill,
47 changes: 36 additions & 11 deletions pdfplumber/page.py
@@ -21,22 +21,22 @@ def __init__(self, pdf, page_obj, page_number=None, initial_doctop=0):
self.initial_doctop = self.decimalize(initial_doctop)

cropbox = page_obj.attrs.get("CropBox", page_obj.attrs.get("MediaBox"))
self.cropbox = tuple(map(self.decimalize, cropbox))
self.cropbox = self.decimalize(cropbox)

if self.rotation in [ 90, 270 ]:
self.bbox = tuple(map(self.decimalize, (
self.bbox = self.decimalize((
min(cropbox[1], cropbox[3]),
min(cropbox[0], cropbox[2]),
max(cropbox[1], cropbox[3]),
max(cropbox[0], cropbox[2]),
)))
))
else:
self.bbox = tuple(map(self.decimalize, (
self.bbox = self.decimalize((
min(cropbox[0], cropbox[2]),
min(cropbox[1], cropbox[3]),
max(cropbox[0], cropbox[2]),
max(cropbox[1], cropbox[3]),
)))
))

def decimalize(self, x):
return utils.decimalize(x, self.pdf.precision)
@@ -69,11 +69,33 @@ def parse_objects(self):
idc = self.initial_doctop
pno = self.page_number

def process_object(obj):
def point2coord(pt):
x, y = pt
return (
d(x),
h - d(y)
)

IGNORE = [
"bbox",
"matrix",
"_text",
"_objs",
"groups",
"stream",
"colorspace",
"imagemask",
"pts",
]

NON_DECIMALIZE = [
"fontname", "name", "upright",
]

attr = dict((k, d(v)) for k, v in obj.__dict__.items()
if isinstance(v, (float, int, string_types))
and k[0] != "_")
def process_object(obj):
attr = dict((k, (v if k in NON_DECIMALIZE else d(v)))
for k, v in obj.__dict__.items()
if k not in IGNORE)

kind = re.sub(lt_pat, "", obj.__class__.__name__).lower()
attr["object_type"] = kind
@@ -82,6 +104,9 @@ def process_object(obj):
if hasattr(obj, "get_text"):
attr["text"] = obj.get_text()

if kind == "curve":
attr["points"] = list(map(point2coord, obj.pts))

if attr.get("y0") != None:
attr["top"] = h - attr["y1"]
attr["bottom"] = h - attr["y0"]
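
The `point2coord` helper added in this file converts each raw point, whose `y` is measured from the bottom of the page, into an `(x, top)` pair measured from the top. A standalone sketch of that flip is shown below; the factory function `make_point2coord` and the page height of 792 are assumptions for illustration, and `Decimal(repr(...))` stands in for the page's `decimalize` method.

```python
from decimal import Decimal

def make_point2coord(page_height):
    """Build a converter from bottom-origin (x, y) to top-origin (x, top)."""
    def point2coord(pt):
        x, y = pt
        # Subtracting from the page height flips the vertical axis.
        return (Decimal(repr(x)), page_height - Decimal(repr(y)))
    return point2coord

point2coord = make_point2coord(Decimal("792"))  # hypothetical page height
raw_points = [(100, 700.5), (150, 650)]         # invented sample points
points = list(map(point2coord, raw_points))
```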
@@ -145,7 +170,7 @@ def objects(self):
return self._objects

cropped = CroppedPage(self)
cropped.bbox = tuple(map(self.decimalize, bbox))
cropped.bbox = self.decimalize(bbox)
return cropped

def within_bbox(self, bbox):
@@ -162,7 +187,7 @@ def objects(self):
return self._objects

cropped = CroppedPage(self)
cropped.bbox = tuple(map(self.decimalize, bbox))
cropped.bbox = self.decimalize(bbox)
return cropped

def filter(self, test_function):
17 changes: 9 additions & 8 deletions pdfplumber/table.py
@@ -4,6 +4,8 @@

DEFAULT_SNAP_TOLERANCE = 3
DEFAULT_JOIN_TOLERANCE = 3
DEFAULT_MIN_WORDS_VERTICAL = 3
DEFAULT_MIN_WORDS_HORIZONTAL = 1

def move_to_avg(objs, orientation):
"""
@@ -87,7 +89,7 @@ def get_group(edge):
return edges

def words_to_edges_h(words,
word_threshold=3):
word_threshold=DEFAULT_MIN_WORDS_HORIZONTAL):
"""
Find (imaginary) horizontal lines that connect the tops of at least `word_threshold` words.
"""
@@ -117,7 +119,7 @@ def words_to_edges_h(words,
return edges

def words_to_edges_v(words,
word_threshold=3):
word_threshold=DEFAULT_MIN_WORDS_VERTICAL):
"""
Find (imaginary) vertical lines that connect the left, right, or center of at least `word_threshold` words.
"""
@@ -213,7 +215,7 @@ def intersections_to_cells(intersections):

def edge_connects(p1, p2):
def edges_to_set(edges):
return set(map(tuple, [ x.items() for x in edges ]))
return set(map(utils.obj_to_bbox, edges))

if p1[0] == p2[0]:
common = edges_to_set(intersections[p1]["v"])\
@@ -395,7 +397,8 @@ def char_in_bbox(char, bbox):
"snap_tolerance": DEFAULT_SNAP_TOLERANCE,
"join_tolerance": DEFAULT_JOIN_TOLERANCE,
"edge_min_length": 3,
"text_word_threshold": 3,
"min_words_vertical": DEFAULT_MIN_WORDS_VERTICAL,
"min_words_horizontal": DEFAULT_MIN_WORDS_HORIZONTAL,
"keep_blank_chars": False,
"text_tolerance": 3,
"text_x_tolerance": None,
@@ -505,7 +508,7 @@ def v_edge_desc_to_edge(desc):
edge_type="lines")
elif v_strat == "text":
v_base = words_to_edges_v(words,
word_threshold=settings["text_word_threshold"])
word_threshold=settings["min_words_vertical"])
elif v_strat == "explicit":
v_base = []

@@ -539,7 +542,7 @@ def h_edge_desc_to_edge(desc):
edge_type="lines")
elif h_strat == "text":
h_base = words_to_edges_h(words,
word_threshold=settings["text_word_threshold"])
word_threshold=settings["min_words_horizontal"])
elif h_strat == "explicit":
h_base = []

@@ -553,5 +556,3 @@ def h_edge_desc_to_edge(desc):
)
return utils.filter_edges(edges,
min_length=settings["edge_min_length"])
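
Since `"text_word_threshold"` is removed in favor of two separate keys, callers with older settings dicts may want to translate them. The helper below is a hypothetical sketch (the name `migrate_table_settings` is not part of `pdfplumber`); it copies the old threshold to both new keys, which mirrors the pre-0.5.2 behavior in which one value governed both the vertical and horizontal text strategies.

```python
def migrate_table_settings(old_settings):
    """Hypothetical helper: replace `text_word_threshold` with the two new keys."""
    new_settings = dict(old_settings)  # leave the caller's dict untouched
    if "text_word_threshold" in new_settings:
        threshold = new_settings.pop("text_word_threshold")
        # Explicitly supplied new-style keys take precedence.
        new_settings.setdefault("min_words_vertical", threshold)
        new_settings.setdefault("min_words_horizontal", threshold)
    return new_settings

old = {"vertical_strategy": "text", "text_word_threshold": 2}
new = migrate_table_settings(old)
```

Settings that never specified the old key are returned unchanged and pick up the new defaults (3 vertical, 1 horizontal) from the library.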


20 changes: 14 additions & 6 deletions pdfplumber/utils.py
@@ -69,15 +69,24 @@ def decode_text(s):
return ''.join(PDFDocEncoding[o] for o in ords)

def decimalize(v, q=None):
if isinstance(v, numbers.Integral):
# If already a decimal, just return itself
if isinstance(v, Decimal):
return v
# If tuple/list passed, bulk-convert
elif isinstance(v, (tuple, list)):
return type(v)(decimalize(x, q) for x in v)
# Convert int-like
elif isinstance(v, numbers.Integral):
return Decimal(int(v))
if isinstance(v, numbers.Real):
# Convert float-like
elif isinstance(v, numbers.Real):
if q != None:
return Decimal(repr(v)).quantize(Decimal(repr(q)),
rounding=ROUND_HALF_UP)
else:
return Decimal(repr(v))
return v
else:
raise ValueError("Cannot convert {0} to Decimal.".format(v))

def is_dataframe(collection):
cls = collection.__class__
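
To show the behavior of the revised `decimalize` in isolation, here is a standalone copy with its imports and a few sample calls. The logic follows the new side of the diff, with `is not None` substituted for `!= None`; the sample inputs are invented.

```python
import numbers
from decimal import Decimal, ROUND_HALF_UP

def decimalize(v, q=None):
    # Already a Decimal: return unchanged.
    if isinstance(v, Decimal):
        return v
    # Tuples and lists are converted element by element, keeping their type.
    elif isinstance(v, (tuple, list)):
        return type(v)(decimalize(x, q) for x in v)
    # Int-like values convert exactly.
    elif isinstance(v, numbers.Integral):
        return Decimal(int(v))
    # Float-like values go through repr() and are optionally quantized.
    elif isinstance(v, numbers.Real):
        if q is not None:
            return Decimal(repr(v)).quantize(Decimal(repr(q)),
                                             rounding=ROUND_HALF_UP)
        return Decimal(repr(v))
    else:
        raise ValueError("Cannot convert {0} to Decimal.".format(v))

bbox = decimalize((0, 0, 612.0, 792.0))  # bulk conversion of a bounding box
rounded = decimalize(1.005, q=0.01)      # quantized to two decimal places

try:
    decimalize("Helvetica")              # strings are no longer passed through
    error_raised = False
except ValueError:
    error_raised = True
```

The new `ValueError` on strings is presumably why `page.py` now lists string-valued attributes such as `fontname` in `NON_DECIMALIZE`.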
@@ -117,8 +126,7 @@ def objects_to_bbox(objects):
max(map(itemgetter("bottom"), objects)),
)

def rect_to_bbox(rect):
return (rect["x0"], rect["top"], rect["x1"], rect["bottom"])
obj_to_bbox = itemgetter("x0", "top", "x1", "bottom")

def bbox_to_rect(bbox):
return {
@@ -267,7 +275,7 @@ def clip_obj(obj, bbox, score=None):
return copy

def n_points_intersecting_bbox(objs, bbox):
bbox = tuple(map(decimalize, bbox))
bbox = decimalize(bbox)
objs = to_list(objs)
scores = (obj_inside_bbox_score(obj, bbox) for obj in objs)
return list(scores)
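
The `obj_to_bbox` helper introduced earlier in this file, and its use in `edges_to_set` in `table.py`, can be tried out on their own. The sketch below uses invented edge dicts. One plausible benefit of comparing edges by bounding box, which the commit does not state, is that a 4-tuple is hashable and ignores unrelated keys such as a `points` list.

```python
from operator import itemgetter

# Same definition as in the diff: pull the four box coordinates from a dict.
obj_to_bbox = itemgetter("x0", "top", "x1", "bottom")

def edges_to_set(edges):
    return set(map(obj_to_bbox, edges))

# Invented vertical edges meeting at two hypothetical intersection points.
v_edges_at_p1 = [
    {"x0": 10, "top": 0, "x1": 10, "bottom": 50, "orientation": "v"},
    {"x0": 10, "top": 60, "x1": 10, "bottom": 90, "orientation": "v"},
]
v_edges_at_p2 = [
    {"x0": 10, "top": 0, "x1": 10, "bottom": 50, "orientation": "v"},
]

# An edge shared by both points shows up in the set intersection.
common = edges_to_set(v_edges_at_p1).intersection(edges_to_set(v_edges_at_p2))
```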