Need to identify "correct" carriage returns #1130

enrac5 · 2024-04-24T18:32:25Z

I have a table that has some text in one of the columns (see attached). The text is split up into paragraphs (usually representing dialogue). I need to be able to correctly identify which sections of text are paragraphs. Right now, using the extract_tables(...) method, the text in the third column has line breaks for each line, which makes sense, but makes paragraph detection difficult. Any ideas on how I can correctly identify the separate blocks of text?

enrac5 (Author) · 2024-04-30T18:22:33Z

Hi @jsvine, any thoughts on this one?

jsvine (Maintainer) · 2024-05-15T18:26:15Z

Passing { "text_layout": True } may help. E.g.:

table = page.extract_table({ "text_layout": True })
print(table[0][2])

Returns:

 CAMERA FOLLOWS IN
 FRONT OF MS LANE AS
 SHE WALKS.       
                  
 BEATLES (VO)     
 (overlaps) (singing) (through
 speakers) ...and shout!
                  
 MS LANE          
 (singing) ...twist and shout!
 (off) oy vay I’m tired. <sighs>
 (on) <hums> <grunts>

For more on how to adjust the layout parameters, see the "When layout=True" portion of the readme. Keep in mind that text-extraction parameters need to be prefixed with text_ when they are passed in the table extraction settings.
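Once the cell text preserves its layout, the whitespace-only lines can serve as paragraph separators. Here is a minimal sketch of that idea, assuming a blank line always marks a paragraph boundary in your document. The helper name split_paragraphs is hypothetical and not part of pdfplumber, and the sample string is copied from the first lines of the output above rather than read from a PDF:

```python
def split_paragraphs(cell_text):
    """Group the lines of a layout-preserved cell into paragraphs.

    Assumption: a whitespace-only line separates two paragraphs.
    This helper is illustrative and not part of pdfplumber.
    """
    paragraphs = []
    current = []
    for line in cell_text.splitlines():
        if line.strip():
            # Non-blank line: it belongs to the current paragraph.
            current.append(line.strip())
        elif current:
            # Blank line: close the paragraph collected so far.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        # Flush the final paragraph if the text does not end with a blank line.
        paragraphs.append(" ".join(current))
    return paragraphs


# Sample text standing in for table[0][2] from the example above.
sample = (
    " CAMERA FOLLOWS IN\n"
    " FRONT OF MS LANE AS\n"
    " SHE WALKS.       \n"
    "                  \n"
    " BEATLES (VO)     \n"
    " (overlaps) (singing) (through\n"
    " speakers) ...and shout!\n"
)

paragraphs = split_paragraphs(sample)
for paragraph in paragraphs:
    print(paragraph)
```

In practice you would pass table[0][2] (or any other cell) instead of the sample string. If the speaker name should stay separate from the dialogue, keep the stripped lines as a list per paragraph instead of joining them with spaces.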
