v0.5.2

### Added
- Access to `curve` points. (E.g., `page.curves[0]["points"]`.)
- Ability for `.draw_line` to draw `curve` points.

### Changed
- Disaggregated "min_words_vertical" (default: 3) and "min_words_horizontal" (default: 1), removing "text_word_threshold".
- Internally, made `utils.decimalize` a bit more robust; now throws errors on non-decimalizable items.
- Now explicitly ignoring some (obscure) `pdfminer` object attributes.
- Raw input for `.draw_line` from a bounding box to `((x, y), (x, y))`, for consistency with `curve["points"]` and with `Pillow`'s underlying method.

### Fixed
- Fixed typo bug when `.rect_edges` is called before `.edges`
jsvine · Feb 27, 2017 · b44f2dc
1 parent 6d2d010
commit b44f2dc
Showing 11 changed files with 406 additions and 44 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -4,6 +4,20 @@ All notable changes to this project will be documented in this file. Currently g
 
 The format is based on [Keep a Changelog](http://keepachangelog.com/).
 
+## [0.5.2] — 2017-02-27
+### Added
+- Access to `curve` points. (E.g., `page.curves[0]["points"]`.)
+- Ability for `.draw_line` to draw `curve` points.
+
+### Changed
+- Disaggregated "min_words_vertical" (default: 3) and "min_words_horizontal" (default: 1), removing "text_word_threshold".
+- Internally, made `utils.decimalize` a bit more robust; now throws errors on non-decimalizable items.
+- Now explicitly ignoring some (obscure) `pdfminer` object attributes.
+- Raw input for `.draw_line` from a bounding box to `((x, y), (x, y))`, for consistency with `curve["points"]` and with `Pillow`'s underlying method.
+
+### Fixed
+- Fixed typo bug when `.rect_edges` is called before `.edges`
+
 ## [0.5.1] — 2017-02-26
 ### Added
 - Quick-draw `PageImage` methods: `.draw_vline`, `.draw_vlines`, `.draw_hline`, and `.draw_hlines`.

diff --git a/README.md b/README.md
@@ -1,4 +1,4 @@
-# PDFPlumber `v0.5.1`
+# PDFPlumber `v0.5.2`
 
 Plumb a PDF for detailed information about each text character, rectangle, and line. Plus: Table extraction and visual debugging.
 
@@ -102,6 +102,7 @@ Each instance of `pdfplumber.PDF` and `pdfplumber.Page` provides access to four
 - `.annos`, each representing a single annotation-text character.
 - `.lines`, each representing a single 1-dimensional line.
 - `.rects`, each representing a single 2-dimensional rectangle.
+- `.curves`, each representing a series of connected points.
 
 Each object is represented as a simple Python `dict`, with the following properties:
 
@@ -130,7 +131,7 @@ Each object is represented as a simple Python `dict`, with the following propert
 
 | Property | Description |
 |----------|-------------|
-|`page_number`| Page number on which this character was found.|
+|`page_number`| Page number on which this line was found.|
 |`height`| Height of line.|
 |`width`| Width of line.|
 |`x0`| Distance of left-side extremity from left side of page.|
@@ -147,7 +148,7 @@ Each object is represented as a simple Python `dict`, with the following propert
 
 | Property | Description |
 |----------|-------------|
-|`page_number`| Page number on which this character was found.|
+|`page_number`| Page number on which this rectangle was found.|
 |`height`| Height of rectangle.|
 |`width`| Width of rectangle.|
 |`x0`| Distance of left side of rectangle from left side of page.|
@@ -160,6 +161,24 @@ Each object is represented as a simple Python `dict`, with the following propert
 |`linewidth`| Thickness of line.|
 |`object_type`| "rect"|
 
+#### `curve` properties
+
+| Property | Description |
+|----------|-------------|
+|`page_number`| Page number on which this curve was found.|
+|`points`| Points — as a list of `(x, top)` tuples — describing the curve.|
+|`height`| Height of curve's bounding box.|
+|`width`| Width of curve's bounding box.|
+|`x0`| Distance of curve's left-most point from left side of page.|
+|`x1`| Distance of curve's right-most point from left side of the page.|
+|`y0`| Distance of curve's lowest point from bottom of page.|
+|`y1`| Distance of curve's highest point from bottom of page.|
+|`top`| Distance of curve's highest point from top of page.|
+|`bottom`| Distance of curve's lowest point from top of page.|
+|`doctop`| Distance of curve's highest point from top of document.|
+|`linewidth`| Thickness of line.|
+|`object_type`| "curve"|
+
 Additionally, both `pdfplumber.PDF` and `pdfplumber.Page` provide access to two derived lists of objects: `.rect_edges` (which decomposes each rectangle into its four lines) and `.edges` (which combines `.rect_edges` with `.lines`). 
 
 ## Visual debugging
@@ -191,7 +210,7 @@ You can pass explicit coordinates or any `pdfplumber` PDF object (e.g., char, li
 
 | Single-object method | Bulk method | Description |
 |----------------------|-------------|-------------|
-|`im.draw_line(line, stroke={color}, stroke_width=1)`| `im.draw_lines(list_of_lines, **kwargs)`| Draws a line from a `line`-like object, or a 4-tuple bounding box.|
+|`im.draw_line(line, stroke={color}, stroke_width=1)`| `im.draw_lines(list_of_lines, **kwargs)`| Draws a line from a `line`, `curve`, or a 2-tuple of 2-tuples (e.g., `((x, y), (x, y))`).|
 |`im.draw_vline(location, stroke={color}, stroke_width=1)`| `im.draw_vlines(list_of_locations, **kwargs)`| Draws a vertical line at the x-coordinate indicated by `location`.|
 |`im.draw_hline(location, stroke={color}, stroke_width=1)`| `im.draw_hlines(list_of_locations, **kwargs)`| Draws a horizontal line at the y-coordinate indicated by `location`.|
 |`im.draw_rect(bbox_or_obj, fill={color}, stroke={color}, stroke_width=1)`| `im.draw_rects(list_of_rects, **kwargs)`| Draws a rectangle from a `rect`, `char`, etc., or 4-tuple bounding box.|
@@ -243,7 +262,8 @@ By default, `extract_tables` uses the page's vertical and horizontal lines (or r
     "snap_tolerance": 3,
     "join_tolerance": 3,
     "edge_min_length": 3,
-    "text_word_threshold": 3,
+    "min_words_vertical": 3,
+    "min_words_horizontal": 1,
     "keep_blank_chars": False,
     "text_tolerance": 3,
     "text_x_tolerance": None,
@@ -263,7 +283,8 @@ By default, `extract_tables` uses the page's vertical and horizontal lines (or r
 |`"snap_tolerance"`| Parallel lines within `snap_tolerance` pixels will be "snapped" to the same horizontal or vertical position.|
 |`"join_tolerance"`| Line segments on the same infinite line, and whose ends are within `join_tolerance` of one another, will be "joined" into a single line segment.|
 |`"edge_min_length"`| Edges shorter than `edge_min_length` will be discarded before attempting to reconstruct the table.|
-|`"text_word_threshold"`| When using the `text` strategy, at least `text_word_threshold` words must share the same alignment.|
+|`"min_words_vertical"`| When using `"vertical_strategy": "text"`, at least `min_words_vertical` words must share the same alignment.|
+|`"min_words_horizontal"`| When using `"horizontal_strategy": "text"`, at least `min_words_horizontal` words must share the same alignment.|
 |`"keep_blank_chars"`| When using the `text` strategy, consider `" "` chars to be *parts* of words and not word-separators.|
 |`"text_tolerance"`, `"text_x_tolerance"`, `"text_y_tolerance"`| When the `text` strategy searches for words, it will expect the individual letters in each word to be no more than `text_tolerance` pixels apart.|
 |`"intersection_tolerance"`, `"intersection_x_tolerance"`, `"intersection_y_tolerance"`| When combining edges into cells, orthogonal edges most be within `intersection_tolerance` pixels to be considered intersecting.|
@@ -290,6 +311,7 @@ Both `vertical_strategy` and `horizontal_strategy` accept the following options:
 
 - [Using `extract_table` on a California Worker Adjustment and Retraining Notification (WARN) report](examples/notebooks/extract-table-ca-warn-report.ipynb). Demonstrates basic visual debugging and table extraction.
 - [Using `extract_table` on the FBI's National Instant Criminal Background Check System PDFs](examples/notebooks/extract-table-nics.ipynb). Demonstrates how to use visual debugging to find optimal table extraction settings. Also demonstrates `Page.crop(...)` and `Page.extract_text(...)`
+- [Inspecting and visualizing `curve` objects](examples/notebooks/ag-energy-roundup-curves.ipynb).
 
 ## Acknowledgments / Contributors
 

diff --git a/examples/notebooks/ag-energy-roundup-curves.ipynb b/examples/notebooks/ag-energy-roundup-curves.ipynb
diff --git a/examples/pdfs/ag-energy-round-up-2017-02-24.pdf b/examples/pdfs/ag-energy-round-up-2017-02-24.pdf
diff --git a/pdfplumber/_version.py b/pdfplumber/_version.py
@@ -1,2 +1,2 @@
-version_info = (0, 5, 1)
+version_info = (0, 5, 2)
 __version__ = '.'.join(map(str, version_info))
diff --git a/pdfplumber/container.py b/pdfplumber/container.py
@@ -40,7 +40,7 @@ def annos(self):
 
     @property
     def rect_edges(self):
-        if hasattr(self, "_rect_edges"): return self._edges
+        if hasattr(self, "_rect_edges"): return self._rect_edges
         rect_edges_gen = (utils.rect_to_edges(r) for r in self.rects)
         self._rect_edges = list(chain(*rect_edges_gen))
         return self._rect_edges
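Note on the `rect_edges` fix: the pre-fix line checked for `_rect_edges` but returned `_edges`, an attribute that is not set until `.edges` has been accessed. The sketch below reproduces the corrected caching pattern in isolation. The `Container` class and `rect_to_edges` helper here are simplified stand-ins written for illustration, not the library's actual code.

```python
from itertools import chain


def rect_to_edges(rect):
    # Hypothetical stand-in for utils.rect_to_edges: split a rect into 4 edges.
    x0, top, x1, bottom = rect["x0"], rect["top"], rect["x1"], rect["bottom"]
    return [
        {"x0": x0, "top": top, "x1": x1, "bottom": top},        # top edge
        {"x0": x0, "top": bottom, "x1": x1, "bottom": bottom},  # bottom edge
        {"x0": x0, "top": top, "x1": x0, "bottom": bottom},     # left edge
        {"x0": x1, "top": top, "x1": x1, "bottom": bottom},     # right edge
    ]


class Container:
    # Simplified stand-in: only holds rects and the cached property.
    def __init__(self, rects):
        self.rects = rects

    @property
    def rect_edges(self):
        # Return the cache under the same name it was stored under.
        if hasattr(self, "_rect_edges"):
            return self._rect_edges
        rect_edges_gen = (rect_to_edges(r) for r in self.rects)
        self._rect_edges = list(chain(*rect_edges_gen))
        return self._rect_edges


c = Container([{"x0": 0, "top": 0, "x1": 10, "bottom": 5}])
first = c.rect_edges   # computed and cached
second = c.rect_edges  # served from the cache, without touching `.edges`
```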

diff --git a/pdfplumber/display.py b/pdfplumber/display.py
@@ -88,16 +88,18 @@ def reset(self):
     def copy(self):
         return self.__class__(self.page, self.original)
 
-    def draw_line(self, points_or_line,
+    def draw_line(self, points_or_obj,
         stroke=DEFAULT_STROKE,
         stroke_width=DEFAULT_STROKE_WIDTH):
-        if isinstance(points_or_line, (tuple, list)):
-            points = points_or_line
+        if isinstance(points_or_obj, (tuple, list)):
+            points = points_or_obj
+        elif type(points_or_obj) == dict and "points" in points_or_obj:
+            points = points_or_obj["points"]
         else:
-            obj = points_or_line
-            points = (obj["x0"], obj["top"], obj["x1"], obj["bottom"])
+            obj = points_or_obj
+            points = ((obj["x0"], obj["top"]), (obj["x1"], obj["bottom"]))
         self.draw.line(
-            self._reproject_bbox(points),
+            list(map(self._reproject, points)),
             fill=stroke,
             width=stroke_width
         )
@@ -165,10 +167,10 @@ def draw_rect(self, bbox_or_obj,
 
         if stroke_width > 0:
             segments = [
-                (x0, top, x1, top), # top
-                (x0, bottom, x1, bottom), # bottom
-                (x0, top, x0, bottom), # left
-                (x1, top, x1, bottom), # right
+                ((x0, top), (x1, top)), # top
+                ((x0, bottom), (x1, bottom)), # bottom
+                ((x0, top), (x0, bottom)), # left
+                ((x1, top), (x1, bottom)), # right
             ]
             self.draw_lines(
                 segments,
@@ -195,7 +197,12 @@ def draw_circle(self, center_or_obj,
                 (obj["top"] + obj["bottom"]) / 2
             )
         cx, cy = center
-        bbox = (cx - radius, cy - radius, cx + radius, cy + radius)
+        bbox = self.decimalize((
+            cx - radius,
+            cy - radius,
+            cx + radius,
+            cy + radius
+        ))
         self.draw.ellipse(
             self._reproject_bbox(bbox),
             fill,
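Note on the new `draw_line` input handling: the three accepted input shapes (explicit points, a `curve`-like dict with `"points"`, or a `line`-like dict) all reduce to a list of `(x, y)` pairs before drawing. A minimal sketch of that normalization step follows. `to_points` is a hypothetical name, the reprojection and Pillow call are left out, and `isinstance` is used here in place of the diff's `type(...) == dict` check.

```python
def to_points(points_or_obj):
    # Hypothetical helper mirroring the branching in draw_line.
    # Explicit points: a tuple/list of (x, y) pairs is passed through.
    if isinstance(points_or_obj, (tuple, list)):
        return list(points_or_obj)
    # Curve-like dict: use its "points" list directly.
    if isinstance(points_or_obj, dict) and "points" in points_or_obj:
        return list(points_or_obj["points"])
    # Line-like dict: build the two endpoints from its bounding box.
    obj = points_or_obj
    return [(obj["x0"], obj["top"]), (obj["x1"], obj["bottom"])]


explicit = to_points(((0, 0), (10, 10)))
from_curve = to_points({"points": [(0, 0), (5, 8), (10, 0)]})
from_line = to_points({"x0": 1, "top": 2, "x1": 3, "bottom": 4})
```

Because a curve may carry more than two points, the result is a point list rather than a 4-tuple bounding box, which is the shape change described in the changelog.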

diff --git a/pdfplumber/page.py b/pdfplumber/page.py
@@ -21,22 +21,22 @@ def __init__(self, pdf, page_obj, page_number=None, initial_doctop=0):
         self.initial_doctop = self.decimalize(initial_doctop)
 
         cropbox = page_obj.attrs.get("CropBox", page_obj.attrs.get("MediaBox"))
-        self.cropbox = tuple(map(self.decimalize, cropbox))
+        self.cropbox = self.decimalize(cropbox)
 
         if self.rotation in [ 90, 270 ]:
-            self.bbox = tuple(map(self.decimalize, (
+            self.bbox = self.decimalize((
                 min(cropbox[1], cropbox[3]),
                 min(cropbox[0], cropbox[2]),
                 max(cropbox[1], cropbox[3]),
                 max(cropbox[0], cropbox[2]),
-            )))
+            ))
         else:
-            self.bbox = tuple(map(self.decimalize, (
+            self.bbox = self.decimalize((
                 min(cropbox[0], cropbox[2]),
                 min(cropbox[1], cropbox[3]),
                 max(cropbox[0], cropbox[2]),
                 max(cropbox[1], cropbox[3]),
-            )))
+            ))
 
     def decimalize(self, x):
         return utils.decimalize(x, self.pdf.precision)
@@ -69,11 +69,33 @@ def parse_objects(self):
         idc = self.initial_doctop
         pno = self.page_number
 
-        def process_object(obj):
+        def point2coord(pt):
+            x, y = pt
+            return (
+                d(x),
+                h - d(y)
+            )
+
+        IGNORE = [
+            "bbox",
+            "matrix",
+            "_text",
+            "_objs",
+            "groups",
+            "stream",
+            "colorspace",
+            "imagemask",
+            "pts",
+        ]
+
+        NON_DECIMALIZE = [
+            "fontname", "name", "upright",
+        ]
 
-            attr = dict((k, d(v)) for k, v in obj.__dict__.items()
-                if isinstance(v, (float, int, string_types))
-                    and k[0] != "_")
+        def process_object(obj):
+            attr = dict((k, (v if k in NON_DECIMALIZE else d(v)))
+                for k, v in obj.__dict__.items()
+                    if k not in IGNORE)
 
             kind = re.sub(lt_pat, "", obj.__class__.__name__).lower()
             attr["object_type"] = kind
@@ -82,6 +104,9 @@ def process_object(obj):
             if hasattr(obj, "get_text"):
                 attr["text"] = obj.get_text()
 
+            if kind == "curve":
+                attr["points"] = list(map(point2coord, obj.pts))
+
             if attr.get("y0") != None:
                 attr["top"] = h - attr["y1"]
                 attr["bottom"] = h - attr["y0"]
@@ -145,7 +170,7 @@ def objects(self):
                 return self._objects
 
         cropped = CroppedPage(self)
-        cropped.bbox = tuple(map(self.decimalize, bbox))
+        cropped.bbox = self.decimalize(bbox)
         return cropped
 
     def within_bbox(self, bbox):
@@ -162,7 +187,7 @@ def objects(self):
                 return self._objects
 
         cropped = CroppedPage(self)
-        cropped.bbox = tuple(map(self.decimalize, bbox))
+        cropped.bbox = self.decimalize(bbox)
         return cropped
 
     def filter(self, test_function):
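Note on `point2coord`: it converts each raw `(x, y)` point, measured from the bottom of the page, into an `(x, top)` pair measured from the top, which is the form documented for `curve["points"]`. The sketch below shows the same flip in isolation. It assumes `h` is the page height and uses `Decimal(str(...))` as a simplified substitute for the page's `decimalize` method; `make_point2coord` is a hypothetical wrapper.

```python
from decimal import Decimal


def make_point2coord(page_height):
    # Assumption: h is the page height, as in parse_objects.
    h = Decimal(str(page_height))

    def point2coord(pt):
        x, y = pt
        # x is unchanged; y is flipped so it measures distance from the top.
        return (Decimal(str(x)), h - Decimal(str(y)))

    return point2coord


point2coord = make_point2coord(792)
raw_pts = [(72, 720), (100.5, 700.25)]
coords = list(map(point2coord, raw_pts))
```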

diff --git a/pdfplumber/table.py b/pdfplumber/table.py
@@ -4,6 +4,8 @@
 
 DEFAULT_SNAP_TOLERANCE = 3
 DEFAULT_JOIN_TOLERANCE = 3
+DEFAULT_MIN_WORDS_VERTICAL = 3
+DEFAULT_MIN_WORDS_HORIZONTAL = 1
 
 def move_to_avg(objs, orientation):
     """
@@ -87,7 +89,7 @@ def get_group(edge):
     return edges
 
 def words_to_edges_h(words,
-    word_threshold=3):
+    word_threshold=DEFAULT_MIN_WORDS_HORIZONTAL):
     """
     Find (imaginary) horizontal lines that connect the tops of at least `word_threshold` words.
     """
@@ -117,7 +119,7 @@ def words_to_edges_h(words,
     return edges
 
 def words_to_edges_v(words,
-    word_threshold=3):
+    word_threshold=DEFAULT_MIN_WORDS_VERTICAL):
     """
     Find (imaginary) vertical lines that connect the left, right, or center of at least `word_threshold` words.
     """
@@ -213,7 +215,7 @@ def intersections_to_cells(intersections):
 
     def edge_connects(p1, p2):
         def edges_to_set(edges):
-            return set(map(tuple, [ x.items() for x in edges ]))
+            return set(map(utils.obj_to_bbox, edges))
 
         if p1[0] == p2[0]:
             common = edges_to_set(intersections[p1]["v"])\
@@ -395,7 +397,8 @@ def char_in_bbox(char, bbox):
     "snap_tolerance": DEFAULT_SNAP_TOLERANCE,
     "join_tolerance": DEFAULT_JOIN_TOLERANCE,
     "edge_min_length": 3,
-    "text_word_threshold": 3,
+    "min_words_vertical": DEFAULT_MIN_WORDS_VERTICAL,
+    "min_words_horizontal": DEFAULT_MIN_WORDS_HORIZONTAL,
     "keep_blank_chars": False,
     "text_tolerance": 3,
     "text_x_tolerance": None,
@@ -505,7 +508,7 @@ def v_edge_desc_to_edge(desc):
                 edge_type="lines")
         elif v_strat == "text":
             v_base = words_to_edges_v(words,
-                word_threshold=settings["text_word_threshold"])
+                word_threshold=settings["min_words_vertical"])
         elif v_strat == "explicit":
             v_base = []
 
@@ -539,7 +542,7 @@ def h_edge_desc_to_edge(desc):
                 edge_type="lines")
         elif h_strat == "text":
             h_base = words_to_edges_h(words,
-                word_threshold=settings["text_word_threshold"])
+                word_threshold=settings["min_words_horizontal"])
         elif h_strat == "explicit":
             h_base = []
 
@@ -553,5 +556,3 @@ def h_edge_desc_to_edge(desc):
             )
         return utils.filter_edges(edges,
             min_length=settings["edge_min_length"])
-
-
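Note on the settings change: since `"text_word_threshold"` is removed, settings dicts written for 0.5.1 need the two new keys instead. The sketch below is one possible way to translate an old dict. `migrate_table_settings` is a hypothetical helper and not part of the library; it copies the old value to both new keys, since the old single threshold applied to both orientations.

```python
def migrate_table_settings(settings):
    # Hypothetical helper: translate 0.5.1-style settings to 0.5.2 keys.
    migrated = dict(settings)  # leave the caller's dict untouched
    if "text_word_threshold" in migrated:
        old = migrated.pop("text_word_threshold")
        # Explicit new-style values, if already present, take precedence.
        migrated.setdefault("min_words_vertical", old)
        migrated.setdefault("min_words_horizontal", old)
    return migrated


old_settings = {"vertical_strategy": "text", "text_word_threshold": 4}
new_settings = migrate_table_settings(old_settings)
```

Settings that never set `"text_word_threshold"` pass through unchanged and pick up the new defaults (3 vertical, 1 horizontal).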
diff --git a/pdfplumber/utils.py b/pdfplumber/utils.py
@@ -69,15 +69,24 @@ def decode_text(s):
         return ''.join(PDFDocEncoding[o] for o in ords)
 
 def decimalize(v, q=None):
-    if isinstance(v, numbers.Integral):
+    # If already a decimal, just return itself
+    if isinstance(v, Decimal):
+        return v
+    # If tuple/list passed, bulk-convert
+    elif isinstance(v, (tuple, list)):
+        return type(v)(decimalize(x, q) for x in v)
+    # Convert int-like
+    elif isinstance(v, numbers.Integral):
         return Decimal(int(v))
-    if isinstance(v, numbers.Real):
+    # Convert float-like
+    elif isinstance(v, numbers.Real):
         if q != None:
             return Decimal(repr(v)).quantize(Decimal(repr(q)),
                 rounding=ROUND_HALF_UP)
         else:
             return Decimal(repr(v))
-    return v
+    else:
+        raise ValueError("Cannot convert {0} to Decimal.".format(v))
 
 def is_dataframe(collection):
     cls = collection.__class__
@@ -117,8 +126,7 @@ def objects_to_bbox(objects):
         max(map(itemgetter("bottom"), objects)),
     )
 
-def rect_to_bbox(rect):
-    return (rect["x0"], rect["top"], rect["x1"], rect["bottom"])
+obj_to_bbox = itemgetter("x0", "top", "x1", "bottom")
 
 def bbox_to_rect(bbox):
     return {
@@ -267,7 +275,7 @@ def clip_obj(obj, bbox, score=None):
     return copy
 
 def n_points_intersecting_bbox(objs, bbox):
-    bbox = tuple(map(decimalize, bbox))
+    bbox = decimalize(bbox)
     objs = to_list(objs)
     scores = (obj_inside_bbox_score(obj, bbox) for obj in objs)
     return list(scores)
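Note on the reworked `decimalize`: it now recurses into tuples and lists (which is why the `tuple(map(self.decimalize, ...))` call sites above could be simplified) and raises on anything it cannot convert. The following standalone sketch reproduces the function as it appears in this diff, with `is not None` substituted for `!= None`, together with the new `obj_to_bbox` getter.

```python
import numbers
from decimal import Decimal, ROUND_HALF_UP
from operator import itemgetter


def decimalize(v, q=None):
    # Already a Decimal: return unchanged.
    if isinstance(v, Decimal):
        return v
    # Tuple/list: convert each item, preserving the container type.
    elif isinstance(v, (tuple, list)):
        return type(v)(decimalize(x, q) for x in v)
    # Int-like values.
    elif isinstance(v, numbers.Integral):
        return Decimal(int(v))
    # Float-like values, optionally quantized to precision q.
    elif isinstance(v, numbers.Real):
        if q is not None:
            return Decimal(repr(v)).quantize(Decimal(repr(q)),
                                             rounding=ROUND_HALF_UP)
        return Decimal(repr(v))
    else:
        raise ValueError("Cannot convert {0} to Decimal.".format(v))


# Same definition as in the diff: pull the bbox fields out of an object dict.
obj_to_bbox = itemgetter("x0", "top", "x1", "bottom")

bbox = decimalize((1, 2.5, 10, 20))
rounded = decimalize([0.125], q=0.01)
edge_bbox = obj_to_bbox({"x0": 0, "top": 1, "x1": 5, "bottom": 1, "width": 5})

try:
    decimalize("not a number")
    raised = False
except ValueError:
    raised = True
```

Since `obj_to_bbox` returns a plain tuple, its results are hashable, which is what `edges_to_set` in `table.py` relies on when it builds a set from edges.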