Any way to detect formatting? #1106

enrac5 · 2024-03-08T17:32:57Z

Hi there, I am parsing a PDF with tables and I'd like to detect formatting such as italics and bold in the text. Is that possible (or does anyone have a hack for it), and if so, how?

Edit: I have this code snippet that works for characters:

import pdfplumber

pdf_path = "/tmp/Foo_1.pdf"

pdf = pdfplumber.open(pdf_path)
page = pdf.pages[0]

# Each char is a dict; "fontname" holds the name of the font it is set in.
for char in page.chars:
    print(char["fontname"])

Which is great, but how do I do this for a given table?

jsvine · 2024-03-11T01:30:40Z

jsvine
Mar 11, 2024
Maintainer

You've got the right idea about finding bold/italic through the fontname. Re. your final question, however, it depends on what you mean by "do this for a given table". Could you expand on what you mean, and perhaps provide an example PDF?
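A minimal sketch of the fontname approach, assuming the PDF's fonts follow the common convention of putting style words such as "Bold", "Italic", or "Oblique" in the font name. That convention is not guaranteed for every PDF, and `font_style` is a hypothetical helper, not part of pdfplumber.

```python
def font_style(fontname):
    """Guess (is_bold, is_italic) from a font name.

    Heuristic only: it relies on style words appearing in the name.
    """
    # Embedded subset fonts are often prefixed, e.g. "ABCDEF+Helvetica-Bold";
    # drop the prefix so it cannot produce a false match.
    name = fontname.split("+", 1)[-1].lower()
    is_bold = "bold" in name
    is_italic = "italic" in name or "oblique" in name
    return is_bold, is_italic


for name in ["ABCDEF+Helvetica-Bold", "TimesNewRomanPS-ItalicMT", "Arial"]:
    print(name, font_style(name))
```

With pdfplumber this could be applied per character, e.g. `font_style(char["fontname"])` while looping over `page.chars`. Fonts that abbreviate the style (such as "-Bd" or "-It") would be missed, so it is worth printing the distinct font names in a document first.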

11 replies

enrac5 Mar 12, 2024
Author

Hi @jsvine, thank you for the response. When I try:

pdf_path = "/path_to/test_doc_as_pdf_001.pdf"
pdf = pdfplumber.open(pdf_path)
page = pdf.pages[0]
table = page.extract_table()

for row in table:
    for cell in row:
        if cell is not None:
            words = page.crop(cell).extract_words(extra_attrs=["fontname"])
            print(words)

I get:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[117], line 9
      7 for cell in row:
      8     if cell is not None:
----> 9         words = page.crop(cell).extract_words(extra_attrs=["fontname"])
     10         print(words)

File /opt/homebrew/lib/python3.11/site-packages/pdfplumber/page.py:475, in Page.crop(self, bbox, relative, strict)
    472 def crop(
    473     self, bbox: T_bbox, relative: bool = False, strict: bool = True
    474 ) -> "CroppedPage":
--> 475     return CroppedPage(self, bbox, relative=relative, strict=strict)

File /opt/homebrew/lib/python3.11/site-packages/pdfplumber/page.py:610, in CroppedPage.__init__(self, parent_page, crop_bbox, crop_fn, relative, strict)
    607     crop_bbox = (x0 + o_x0, top + o_top, x1 + o_x0, bottom + o_top)
    609 if strict:
--> 610     test_proposed_bbox(crop_bbox, parent_page.bbox)
    612 def _crop_fn(objs: T_obj_list) -> T_obj_list:
    613     return crop_fn(objs, crop_bbox)

File /opt/homebrew/lib/python3.11/site-packages/pdfplumber/page.py:576, in test_proposed_bbox(bbox, parent_bbox)
    575 def test_proposed_bbox(bbox: T_bbox, parent_bbox: T_bbox) -> None:
--> 576     bbox_area = utils.calculate_area(bbox)
    577     if bbox_area == 0:
...
---> 69     left, top, right, bottom = bbox
     70     if left > right or top > bottom:
     71         raise ValueError(f"{bbox} has a negative width or height.")

ValueError: too many values to unpack (expected 4)

What am I doing wrong?
test_doc_as_pdf_001.pdf

enrac5 Mar 12, 2024
Author

Hmm, it looks like crop() is expecting a bounding box, so maybe I'm using the function incorrectly? Each row of the extracted table is a list of strings, and I'm not sure it contains any bounding box info.

enrac5 Mar 12, 2024
Author

Is there some way to get the bounding box for a given cell? When you extract a table with the library, the result is a list of lists of strings.

enrac5 Mar 12, 2024
Author

Ah, it looks like I needed to use find_tables(...), not the extraction methods. Hmmm, this is much better than where I was, thank you!
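For anyone landing here later: extract_table() returns only the cell text, whereas find_tables() returns table objects whose cells are bounding boxes that can be passed to page.crop(). A rough sketch of that pattern follows; `fonts_per_cell` and `demo` are hypothetical helper names, and the code assumes each row exposes a `.cells` list of bounding boxes (or None for missing cells), as in the loop used later in this thread.

```python
def fonts_per_cell(page, table):
    """Return one list per table row holding the set of font names in each cell.

    `page` is expected to behave like a pdfplumber page and `table` like an
    item returned by page.find_tables(); missing (None) cells give an empty set.
    """
    result = []
    for row in table.rows:
        row_fonts = []
        for cell in row.cells:
            if cell is None:
                row_fonts.append(set())
                continue
            # `cell` is a bounding box, which is what crop() expects.
            cropped = page.crop(cell)
            row_fonts.append(
                {c["fontname"] for c in cropped.chars if c["text"] != " "}
            )
        result.append(row_fonts)
    return result


def demo(pdf_path):
    # Imported here so the helper above can be read and tested on its own.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[0]
        for table in page.find_tables():
            print(fonts_per_cell(page, table))
```

Calling `demo("/path_to/test_doc_as_pdf_001.pdf")` would print the fonts found in each cell of each table on the first page.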

Answer selected by enrac5

jsvine Mar 12, 2024
Maintainer

Yep, and glad to hear it!

enrac5 Mar 12, 2024
Author

I do have one more dumb question. If I want to check the text in a cell, how can I do that? Like if I have this loop:

for row in table.rows:
    for cell in row.cells:
        if cell is not None:
            cropped = page.crop(cell)
            # Fonts used by the non-space characters in this cell
            fonts = {c["fontname"] for c in cropped.chars if c["text"] != " "}
            # Text of the cell, split out per font
            texts = {
                font: cropped.filter(lambda obj: obj.get("fontname") == font).extract_text()
                for font in fonts
            }
            print(texts)

and I want to do a small sanity check on the text of the first cell in the row to determine if I want to process it or not, how do I access it?

jsvine Mar 12, 2024
Maintainer

Right after the cropped = line, you can do text_to_check = cropped.extract_text().
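Put together, that suggestion might look like the sketch below. The rule in `should_process` is a made-up placeholder (skip rows whose first cell is blank or reads "Total"), and `rows_to_process` is likewise a hypothetical helper; `page` and `table` are assumed to come from pdfplumber as in the loop above.

```python
def should_process(first_cell_text):
    """Placeholder rule: skip rows whose first cell is blank or says 'Total'."""
    text = (first_cell_text or "").strip()
    return bool(text) and text.lower() != "total"


def rows_to_process(page, table):
    """Yield the rows of `table` whose first cell passes should_process()."""
    for row in table.rows:
        first = row.cells[0] if row.cells else None
        if first is None:
            continue  # no first cell to inspect
        # Crop to the first cell and read its text, as suggested above.
        text_to_check = page.crop(first).extract_text()
        if should_process(text_to_check):
            yield row
```

The per-font loop from the question can then run over `rows_to_process(page, table)` instead of `table.rows`, so rows that fail the check are never processed.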
