ADD script to create a simplified version of hocr-files #152

JKamlah · 2019-07-26T11:45:08Z

A script to create a simplified version of hocr-files.
It contains two main functions:

set a new maximum level of typesetting and remove the lower ones
remove unneeded properties
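
For illustration, the first function (capping the typesetting level) can be pictured as flattening every element at the chosen level so that its children disappear and only their text remains. The following is a minimal sketch using the standard library's `xml.etree.ElementTree`, not the code from this PR; the sample markup and the helper name `flatten` are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical hOCR fragment: one line containing two words.
LINE = ET.fromstring(
    '<span class="ocr_line" title="bbox 0 0 100 20">'
    '<span class="ocrx_word" title="bbox 0 0 40 20">Hello</span> '
    '<span class="ocrx_word" title="bbox 50 0 100 20">world</span>'
    '</span>'
)


def flatten(node):
    """Replace all children of `node` by their whitespace-normalized text."""
    text = " ".join("".join(node.itertext()).split())
    for child in list(node):   # copy the child list, we modify it while looping
        node.remove(child)
    node.text = text


flatten(LINE)                  # the line is now the lowest remaining level
print(ET.tostring(LINE, encoding="unicode"))
```

The sketch does nothing special for words hyphenated across two lines; that case would need handling of its own.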

zuphilip

I think this fits well into the scope of the hocr-tools and looks good in general. Thank you @JKamlah for this nice PR! CC @kba @stweil @tmbdev for comments about a new script.

Some comments from my review are below, and we may want to test it a bit further. We may have to do something more about words that are split across two lines by a hyphen. Moreover, if the file carries information about glyphs and alternatives, the text content may repeat some words.

Finally, the README would need to be updated as well.

zuphilip · 2019-07-27T11:30:05Z

hocr-simplify

+            if key in args.remove_properties:
+                if args.verbose:
+                    print("Replaced :{}".format(title))
+                title = title.replace(prop + ";", "").strip()


This does not work when the property is the last one (no semi-colon then).

Alternatively, you can also try something like this, which looks much shorter (code not yet tested):

title = node.get("title")
title = re.sub(r"\s*\b(%s)\b[^;]*;?" % "|".join(args.remove_properties), "", title)

BTW don't you have to save it back in the doc somehow?

You could use https://github.com/kba/hocr-spec-python/blob/master/hocr_spec/spec.py#L530 to parse the properties

Yeah, but we don't need to parse it in detail; we just have to delete the parameters that are no longer needed, together with their values.

Thanks for the suggestions. I reworked this part, but without a regexp. Also, I had to replace the double quotation marks with single ones.

node.set('title', ';'.join([prop.replace("\"","'") for prop in title.split(";") if prop.strip().split(None, 1)[0] not in args.remove_properties]))
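
The split-based filtering in the line above can be unpacked into a small standalone function. This is a sketch under assumptions, not the PR's final code: the helper name `remove_properties` and the sample title are made up, and empty pieces (for example after a trailing semicolon) are skipped so that taking the first token never fails.

```python
def remove_properties(title, unwanted):
    """Drop every `key value...` entry of an hOCR title whose key is unwanted."""
    kept = []
    for prop in title.split(";"):
        prop = prop.strip()
        if not prop:                    # empty piece, e.g. after a trailing ';'
            continue
        key = prop.split(None, 1)[0]    # first token is the property name
        if key not in unwanted:
            kept.append(prop)
    return "; ".join(kept)


title = "bbox 10 20 110 60; x_wconf 95; x_fsize 12"
print(remove_properties(title, {"x_wconf"}))
print(remove_properties(title, {"x_fsize"}))   # last property, no ';' after it
```

Because the title is split on semicolons rather than searched for `prop + ";"`, the last property is handled the same way as all the others.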

zuphilip · 2019-07-27T11:33:08Z

hocr-simplify

+
+parser.add_argument('file', nargs='?', default=sys.stdin)
+parser.add_argument('-t', '--typesetting', type=str,
+                    choices=['glyph', 'word', 'line', 'par', 'carea', 'page'],


Does the choice glyph do anything for simplification? I haven't seen an hOCR example with an element inside an ocr-glyph.

I thought I would need it to remove char choices, but I've implemented that in another place. So I removed the "glyph" typesetting option.

zuphilip · 2019-07-27T11:37:01Z

hocr-simplify

+parser.add_argument('-r', '--remove-properties', nargs='+',
+                    help='List of properties: {}'.format(','.join(properties)))
+parser.add_argument('fileout', nargs='?',
+                    help="Outputpath, default: print to terminal")


s/Outputpath/Output path/

(Also in the comment below.)

zuphilip · 2019-07-27T11:41:30Z

hocr-simplify

+    for node in doc.xpath("//*[@title]"):
+        title = node.get("title")
+        for prop in title.split(";"):
+            (key, args) = prop.strip().split(None, 1)


Why do you use None here and not the white-space character to split key and value?

To be fair, I took this part from hocr-cut.
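
For context on the question above: passing `None` as the separator makes `str.split` treat any run of whitespace (spaces, tabs, newlines) as one delimiter, whereas `" "` splits on exactly one space character. A short illustration with made-up property strings:

```python
spaced = "bbox   10 20 110 60"      # several spaces after the key
tabbed = "x_wconf\t95"              # key and value separated by a tab

print(spaced.split(None, 1))        # key and value, extra spaces swallowed
print(spaced.split(" ", 1))         # value keeps the remaining leading spaces
print(tabbed.split(None, 1))        # tab counts as whitespace
print(tabbed.split(" ", 1))         # no space found, so nothing is split
```

So `split(None, 1)` is the more forgiving choice when the spacing inside `title` is not guaranteed.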

zuphilip · 2019-07-27T11:58:48Z

hocr-simplify

+                if args.verbose:
+                    print("Replaced :{}".format(title))
+                title = title.replace(prop + ";", "").strip()
+


We also have to update the ocr-capabilities meta tag.
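
One possible way to keep the meta tag in sync is to recompute it from the classes that are still present after simplification. The sketch below uses `xml.etree.ElementTree` on a made-up document; the helper name `update_capabilities` is hypothetical, and keeping every `ocrp_*` capability untouched is an assumption of this example, not something the PR specifies.

```python
import xml.etree.ElementTree as ET

# Hypothetical simplified document: word, par and carea elements are gone.
HOCR = """<html><head>
<meta name="ocr-capabilities" content="ocr_page ocr_carea ocr_par ocr_line ocrx_word"/>
</head><body>
<div class="ocr_page" title="bbox 0 0 100 100">
<span class="ocr_line" title="bbox 0 0 100 20">Hello world</span>
</div></body></html>"""


def update_capabilities(root):
    """Keep only the element capabilities whose class still occurs in the document."""
    used = set()
    for node in root.iter():
        used.update(node.get("class", "").split())
    for meta in root.iter("meta"):
        if meta.get("name") == "ocr-capabilities":
            kept = [cap for cap in meta.get("content", "").split()
                    if cap in used or cap.startswith("ocrp_")]  # assumption: leave ocrp_* alone
            meta.set("content", " ".join(kept))


root = ET.fromstring(HOCR)
update_capabilities(root)
capabilities = root.find("head/meta").get("content")
print(capabilities)
```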

zuphilip · 2019-07-27T12:03:44Z

hocr-simplify

+              'imagemd5', 'lpageno', 'ppageno', 'nlp', 'order', 'poly',
+              'scan_res', 'textangle', 'x_bboxes', 'x_font', 'x_fsize',
+              'x_confs', 'x_scanner', 'x_source', 'x_wconf']
+


It would also be nice to have an option to delete the id and/or dir parameters, but those are attributes of their own rather than properties inside title.

Removing attributes is now implemented.

zuphilip · 2019-07-27T12:05:41Z

test/hocr-simplify/hocr-simplify.tsht

+TESTDATA="../testdata"
+SIMPLEFILE="./tess.simple.hocr"
+
+plan 5


That is the number of test cases, i.e. it should be 2 here.

Changed plan 5 to plan 3. I added two more test cases, with the new char choice options.

kba

I would appreciate some documentation to understand the use cases better. Some more examples would make it easier to test more extensively and to catch edge cases like the ones @zuphilip lists.

But in general, it LGTM.

zuphilip · 2019-07-29T18:32:31Z

One use case is to make the hOCR output of tesseract and ocropy look more alike. Then, in a complex workflow where you used ocropy before, you can use tesseract + hocr-simplify instead.

JKamlah

Thank you for the great review. I hope I have fixed all the problems mentioned. In the new version I added some new features:

remove attributes
remove empty contents
remove choices

The idea behind simple hOCR is, as @zuphilip said, to make the output look more alike, to optimize the size for one's needs, and to allow deriving a new version without running the OCR again. E.g. this could be handy if someone works with tesseract output with char choices.

JKamlah added 5 commits July 26, 2019 11:53

ADD script to create a simplified version of hocr-files.

ba74b3e

Refactored code.

7385e5a

Style fixes.

4f0a271

Added ws and removed another.

9160877

ADD test case for hocr-simplify

50f4855

zuphilip mentioned this pull request Jul 27, 2019

ADD script to create a simplified version of hocr-files. UB-Mannheim/hocr-tools#23

Closed

zuphilip reviewed Jul 27, 2019

kba approved these changes Jul 29, 2019

JKamlah added 7 commits August 5, 2019 15:44

FIX remove properties, ADD meta information correction, ADD remove ch…

e264c2f

…oices (only for tesseract output atm), ADD remove attributes, e.g. id, title.

FIX char encoding, ADD remove-empty-contents which removes empty cont…

be4bb77

…ents or only containing whitespaces,.

ADD remove choices, which removes all lstm_choices (tesseract only),F…

4fb6a4c

…IX typesetting format problems, REWORK string format 97-98.

FIX max char in line.

95fa53f

REWORK help messages and comments.

009d746

FIX encoding read and write.

a762023

ADD new tests and testfiles.

4c44dea

JKamlah commented Aug 7, 2019

JKamlah added 3 commits August 9, 2019 13:33

README update

cbc78fa

Add information for hocr-simplify

README add EOL

a050fbc

README DEL two ws.

6a4e4ff
