More fine-grained control of "pixels is space" for Import OCR #8957

dobratzp · 2024-10-30T19:35:33Z

Some PGS or VOBSUB subtitles use fonts in which specific letters overhang more than others.

It would be nice to be able to specify the space before or after certain letters during the OCR process.

The "No of pixels is space" control can be used to mostly get there, but with some fonts it is not possible to get all the words to split correctly.

I suggest adding the following options to the OCR window near "No of pixels is space":

left overhang, number of pixels, set of characters
right overhang, number of pixels, set of characters
left underhang, number of pixels, set of characters
right underhang, number of pixels, set of characters

For example:
No of pixels is space: 6
left overhang pixels 1, characters: y j J w W A
right overhang pixels 1, characters: f t w W A
left underhang pixels 1, characters: '
right underhang pixels 1, characters: J 1 I
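To make the proposal concrete, here is a minimal sketch of how such settings could adjust the gap test. This is not Subtitle Edit's actual code. The names `SpacingRules`, `adjusted_gap`, and `is_space` are hypothetical. The sketch assumes that an overhang adds its pixels back to the measured gap, that an underhang subtracts them, and that a gap counts as a space when it is at least "No of pixels is space".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpacingRules:
    """Hypothetical container for the proposed settings (values from the example above)."""
    pixels_is_space: int = 6
    left_overhang_px: int = 1
    left_overhang_chars: frozenset = frozenset("yjJwWA")
    right_overhang_px: int = 1
    right_overhang_chars: frozenset = frozenset("ftwWA")
    left_underhang_px: int = 1
    left_underhang_chars: frozenset = frozenset("'")
    right_underhang_px: int = 1
    right_underhang_chars: frozenset = frozenset("J1I")


def adjusted_gap(prev_char: str, next_char: str, raw_gap: int, rules: SpacingRules) -> int:
    """Correct the measured pixel gap between two recognized glyphs.

    Assumption: an overhanging glyph intrudes into the gap, so its pixels are
    added back; an underhanging glyph leaves extra room, so its pixels are removed.
    """
    gap = raw_gap
    if prev_char in rules.right_overhang_chars:
        gap += rules.right_overhang_px
    if next_char in rules.left_overhang_chars:
        gap += rules.left_overhang_px
    if prev_char in rules.right_underhang_chars:
        gap -= rules.right_underhang_px
    if next_char in rules.left_underhang_chars:
        gap -= rules.left_underhang_px
    return gap


def is_space(prev_char: str, next_char: str, raw_gap: int, rules: SpacingRules) -> bool:
    """Assumed rule: insert a space when the corrected gap reaches the threshold."""
    return adjusted_gap(prev_char, next_char, raw_gap, rules) >= rules.pixels_is_space


rules = SpacingRules()
# A 4-pixel gap between "f" and "j" is corrected to 6 and treated as a space.
f_j_is_space = is_space("f", "j", 4, rules)
# The same 4-pixel gap between "o" and "f" is unchanged and is not a space.
o_f_is_space = is_space("o", "f", 4, rules)
```

With these assumed rules, a narrow gap between a right-overhanging letter and a left-overhanging letter is still recognized as a space, while the same measured gap between ordinary letters inside a word is not.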

To work around the issue, I usually increase the pixels per space until I don't see any spaces in the middle of words, and then use multi-character matches to force a space between any words that are incorrectly joined together.

For example, in the phrase "of just", the f overhangs more to the right and the j overhangs more to the left.

Creating a multi-match for "of just" fixes that specific sequence of characters, but it has some limitations. There could be another character sequence that is a supersequence of those same characters, such as "off of just". In that case you have to delete the "of just" multi-match, create the "off of just" multi-match, and then go back and re-create the "of just" multi-match. Also, a multi-match is considered a single character for the purposes of determining italics and non-italics. This could create a situation where italics are incorrectly turned into non-italics. Finally, it is not possible to specify that some of the characters within the multi-match are italic and some are not.

I believe the variable spacing between letters in the source image is intentional and is typically referred to as kerning.

See also:
https://en.wikipedia.org/wiki/Kerning
