More fine-grained control of "pixels is space" for Import OCR #8957

Open
dobratzp opened this issue Oct 30, 2024 · 0 comments

Some PGS or VOBSUB subtitles use fonts where specific letters overhang more than others.

It would be nice to be able to specify the space before or after certain letters during the OCR process.

The "No of pixels is space" control can be used to mostly get there, but with some fonts it is not possible to get all the words to split correctly.

I suggest adding the following options to the OCR window near "No of pixels is space":

  • left overhang, number of pixels, set of characters
  • right overhang, number of pixels, set of characters
  • left underhang, number of pixels, set of characters
  • right underhang, number of pixels, set of characters

For example:
No of pixels is space: 6
left overhang pixels 1, characters: y j J w W A
right overhang pixels 1, characters: f t w W A
left underhang pixels 1, characters: '
right underhang pixels 1, characters: J 1 I
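
To make the intended behavior concrete, here is a minimal sketch of how these settings could feed into the space decision. It is not Subtitle Edit code: the function names, the idea that both neighboring characters are already recognized when the gap is evaluated, the rule that overhang adds to the measured gap while underhang subtracts from it, and the `>=` comparison against the threshold are all assumptions made for illustration.

```python
# Hypothetical sketch of the proposed per-character spacing adjustment.
# All names here are illustrative; none come from Subtitle Edit itself.

PIXELS_IS_SPACE = 6  # the existing "No of pixels is space" value


def parse_chars(chars: str, pixels: int) -> dict[str, int]:
    """Map each space-separated character to its pixel adjustment."""
    return {c: pixels for c in chars.split()}


# Values taken from the example settings above.
LEFT_OVERHANG = parse_chars("y j J w W A", 1)
RIGHT_OVERHANG = parse_chars("f t w W A", 1)
LEFT_UNDERHANG = parse_chars("'", 1)
RIGHT_UNDERHANG = parse_chars("J 1 I", 1)


def effective_gap(prev_char: str, next_char: str, measured_gap: int) -> int:
    """Adjust the measured pixel gap between two recognized characters.

    Assumption: an overhanging letter eats into the gap, so its overhang
    is added back; an underhanging letter inflates the gap, so its
    underhang is subtracted.
    """
    gap = measured_gap
    gap += RIGHT_OVERHANG.get(prev_char, 0)
    gap += LEFT_OVERHANG.get(next_char, 0)
    gap -= RIGHT_UNDERHANG.get(prev_char, 0)
    gap -= LEFT_UNDERHANG.get(next_char, 0)
    return gap


def is_space(prev_char: str, next_char: str, measured_gap: int) -> bool:
    """Decide whether the gap counts as a space (assumes a >= threshold)."""
    return effective_gap(prev_char, next_char, measured_gap) >= PIXELS_IS_SPACE
```

Under these assumptions, a 4-pixel gap between "f" and "j" is adjusted to 6 and counts as a space, while the same 4-pixel gap between two characters with no adjustment does not.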

To work around the issue, I usually increase the pixels per space until I don't see any spaces in the middle of words and then use multi-character matches to force a space between any words that are incorrectly joined together.

For example, in the phrase "of just", the f overhangs more to the right and the j overhangs more to the left.

Creating a multi-match for "of just" fixes that specific sequence of characters, but has some limitations. If a longer sequence that contains the same characters, such as "off of just", also needs a multi-match, you have to delete the "of just" multi-match, create the "off of just" multi-match, and then go back and re-create the "of just" multi-match. Also, a multi-match is treated as a single character when determining italics and non-italics, which can cause italics to be turned into non-italics incorrectly. Finally, it is not possible to specify that some of the characters within a multi-match are italic and some are not.

I believe the variable spacing between letters in the source image is intentional and is typically referred to as kerning.

See also:
https://en.wikipedia.org/wiki/Kerning
