ADD script to create a simplified version of hocr-files #152
I think this fits well into the scope of the hocr-tools and looks good in general. Thank you @JKamlah for this nice PR! CC @kba @stweil @tmbdev for comments about a new script.
Some comments from my review are below, and we may want to test it a little further. We may have to do something more about words that are split across two lines by a hyphen. Moreover, if we have information about glyphs and alternatives, the text content may repeat some words, etc.
Finally, the README would need to be updated as well.
hocr-simplify
Outdated
if key in args.remove_properties:
    if args.verbose:
        print("Replaced :{}".format(title))
    title = title.replace(prop + ";", "").strip()
This does not work when the property is the last one (no semi-colon then).
Alternatively, you can also try something like this, which looks much shorter (code not yet tested):
title = node.get("title")
title = re.sub(r"\s?(%s)\s+[^;]*;?" % "|".join(args.remove_properties), "", title)
BTW don't you have to save it back in the doc somehow?
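A minimal sketch of the regex idea, including the write-back to the element. To stay self-contained it uses the standard library's xml.etree.ElementTree rather than whatever parser the script uses; the helper name strip_properties and the sample title are made up for illustration and are not part of hocr-tools.

```python
import re
import xml.etree.ElementTree as ET


def strip_properties(title, remove):
    """Drop the named hOCR properties (and their values) from a title string.

    Hypothetical helper for illustration only.
    """
    if not remove:
        return title
    names = "|".join(re.escape(name) for name in remove)
    # A property starts at the beginning of the string or right after a ';'.
    # Its value runs up to the next ';' or the end, so the last property
    # (which has no trailing ';') is matched as well.
    pattern = r"(?:^|;)\s*(?:%s)(?:\s+[^;]*)?(?=;|$)" % names
    stripped = re.sub(pattern, "", title)
    # Removing the first property leaves a leading ';' behind.
    return stripped.lstrip("; ").strip()


node = ET.fromstring(
    '<span class="ocrx_word" title="bbox 0 0 10 10; x_wconf 93">word</span>'
)
# The new string has to be stored back on the element explicitly.
node.set("title", strip_properties(node.get("title"), ["x_wconf"]))
print(node.get("title"))  # bbox 0 0 10 10
```

The lookahead keeps the following ';' in place, so two adjacent properties can both be removed in one pass.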
You could use https://github.com/kba/hocr-spec-python/blob/master/hocr_spec/spec.py#L530 to parse the properties
Yeah, but we don't need to parse it in detail; we just have to delete the parameters that are no longer needed, together with their values.
Thanks for the suggestions. I reworked this part, but without a regexp. Also, I had to replace the double quotation marks with single ones.
node.set('title', ';'.join([prop.replace("\"","'") for prop in title.split(";") if prop.strip().split(None, 1)[0] not in args.remove_properties]))
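For readers who find the one-liner dense, the same split-and-filter idea can be sketched as a small function. This is an illustration under assumptions, not the code in the PR: the name filter_title is made up, the quote replacement is left out, and properties are re-joined with "; " instead of ";". It also skips empty segments, which can occur when a title ends with a semicolon and would otherwise make split(None, 1)[0] raise an IndexError.

```python
def filter_title(title, remove):
    """Keep only the properties whose key is not listed in `remove`.

    Hypothetical helper for illustration only.
    """
    kept = []
    for prop in title.split(";"):
        prop = prop.strip()
        if not prop:
            # A trailing ';' produces an empty segment; skip it.
            continue
        key = prop.split(None, 1)[0]
        if key not in remove:
            kept.append(prop)
    return "; ".join(kept)


print(filter_title("bbox 0 0 10 10; x_wconf 93;", {"x_wconf"}))  # bbox 0 0 10 10
```

Because the title is split on ";" first, the position of a property (first, middle, or last) no longer matters.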
hocr-simplify
Outdated
parser.add_argument('file', nargs='?', default=sys.stdin)
parser.add_argument('-t', '--typesetting', type=str,
                    choices=['glyph', 'word', 'line', 'par', 'carea', 'page'],
Is the choice glyph doing anything for simplification? I haven't seen an hOCR example where there was an element inside an ocr-glyph.
I thought I would need it to remove char choices, but I've implemented that in another place, so I removed the "glyph" typesetting option.
hocr-simplify
Outdated
parser.add_argument('-r', '--remove-properties', nargs='+',
                    help='List of properties: {}'.format(','.join(properties)))
parser.add_argument('fileout', nargs='?',
                    help="Outputpath, default: print to terminal")
s/Outputpath/Output path/
(Also in the comment below.)
Solved.
hocr-simplify
Outdated
for node in doc.xpath("//*[@title]"):
    title = node.get("title")
    for prop in title.split(";"):
        (key, args) = prop.strip().split(None, 1)
Why do you use None here and not the white-space character to split key and value?
To be fair, I took this part from hocr-cut.
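For context on the question, here is a short illustration of how the two forms of str.split differ. The sample property string is made up, with a doubled space and a tab to make the difference visible.

```python
prop = "bbox  10 20\t30 40"  # two spaces after the key, a tab inside the value

# With None as separator, any run of whitespace (spaces, tabs, newlines)
# separates the key from the value.
key, value = prop.split(None, 1)
print(repr(key), repr(value))  # 'bbox' '10 20\t30 40'

# With an explicit " ", only the first single space counts as a separator,
# so the second space stays attached to the value.
key_sp, value_sp = prop.split(" ", 1)
print(repr(key_sp), repr(value_sp))  # 'bbox' ' 10 20\t30 40'
```

Splitting on None is therefore the more forgiving choice when titles come from different OCR engines with inconsistent spacing.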
hocr-simplify
Outdated
    if args.verbose:
        print("Replaced :{}".format(title))
    title = title.replace(prop + ";", "").strip()
We also have to update the ocr-capabilities meta tag.
Solved.
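How the PR solved this is not shown in the thread. As a rough sketch of what such an update could look like, assuming the simplification drops whole element classes and the meta tag should stop advertising them: the helper name prune_capabilities is hypothetical, and the sample head omits the XHTML namespace that real hOCR files usually carry.

```python
import xml.etree.ElementTree as ET


def prune_capabilities(content, removed):
    """Return the capabilities string without the classes listed in `removed`.

    Hypothetical helper for illustration only.
    """
    return " ".join(cap for cap in content.split() if cap not in removed)


head = ET.fromstring(
    '<head><meta name="ocr-capabilities" '
    'content="ocr_page ocr_carea ocr_par ocr_line ocrx_word"/></head>'
)
meta = head.find("meta[@name='ocr-capabilities']")
# Assumption: the word level was flattened away, so ocrx_word is dropped.
meta.set("content", prune_capabilities(meta.get("content"), {"ocrx_word"}))
print(meta.get("content"))  # ocr_page ocr_carea ocr_par ocr_line
```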
              'imagemd5', 'lpageno', 'ppageno', 'nlp', 'order', 'poly',
              'scan_res', 'textangle', 'x_bboxes', 'x_font', 'x_fsize',
              'x_confs', 'x_scanner', 'x_source', 'x_wconf']
It would be nice to also have an option to delete the id and/or dir parameters, but those are attributes on their own (not part of title).
Removing attributes is now implemented
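The implementation itself is not quoted here. A minimal sketch of attribute removal, assuming it simply walks the tree and drops the named attributes from every element, could look like this. The function name remove_attributes and the sample markup are illustrative, and the standard library parser is used to keep the example self-contained.

```python
import xml.etree.ElementTree as ET


def remove_attributes(root, names):
    """Delete the given attributes from `root` and all of its descendants.

    Hypothetical helper for illustration only.
    """
    for node in root.iter():
        for name in names:
            # pop() with a default does nothing if the attribute is absent.
            node.attrib.pop(name, None)


root = ET.fromstring(
    '<div class="ocr_page" id="page_1">'
    '<span class="ocr_line" id="line_1" dir="ltr" title="bbox 0 0 10 10">text</span>'
    '</div>'
)
remove_attributes(root, ["id", "dir"])
print(ET.tostring(root, encoding="unicode"))
```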
TESTDATA="../testdata"
SIMPLEFILE="./tess.simple.hocr"

plan 5
That is the number of test cases, i.e. should be 2 here.
Changed plan 5 to plan 3. I added two more test cases for the new char choice options.
I would appreciate some documentation to understand the use cases better. Some more examples would make it easier to test more extensively to catch edge cases like @zuphilip lists.
But in general, it LGTM.
One use case is to make the hOCR output of tesseract and ocropy look more alike. Then, in a complex workflow where you previously used ocropy, you can use tesseract + hocr-simplify instead.
…oices (only for tesseract output atm), ADD remove attributes, e.g. id, title.
…ents or only containing whitespaces,.
…IX typesetting format problems, REWORK string format 97-98.
Thank you for the great review. I hope I have fixed all the problems mentioned. In the new version I added some new features:
- remove attributes
- remove empty contents
- remove choices
The idea behind simple hOCR is, as @zuphilip said, to make the outputs look more alike, to optimize the file size for one's needs, and to allow deriving a new version without performing the OCR again. For example, this could be handy if someone works with tesseract outputs with char choices.
Add information for hocr-simplify
A script to create a simplified version of hocr-files.
It contains two main functions: