IMO we should strive to support much more validation and repair features in ocrd_validators.page_validator – esp. functionality known from PRImA Converter and Validator (PCV) and HTR United VX (HTRVX).
From PCV:
-val-rules <Rule1[,Rule2,...]>: Defines what to validate (optional)
Note: If no rules are defined, everything
is validated.
Available rules (use comma only; no spaces):
General Checks:
VALIDATE_GTSID_DEFINED
VALIDATE_LAYERS
Reading Order:
VALIDATE_READING_ORDER_DEFINED
VALIDATE_READING_ORDER_COMPLETE
VALIDATE_TYPE_OF_REGIONS_IN_READING_ORDER
Region Related Checks:
VALIDATE_REGIONS_WITHIN_DOCUMENT_BOUNDARIES
VALIDATE_REGIONS_WITHIN_BORDER
VALIDATE_REGIONS_DONT_OVERLAP
CALCULATE_REGION_OVERLAP_AREA
VALIDATE_REGION_WITHIN_PARENT_REGION
VALIDATE_NO_INTERSECTING_POLYGON_LINES
VALIDATE_GHOST_REGIONS
VALIDATE_PRINTSPACE
VALIDATE_PENDING_REGIONS
VALIDATE_COMPONENTS_INSIDE_REGIONS (requires image)
VALIDATE_MISSING_ELEMENTS
VALIDATE_NESTED_REGIONS
Text Related Checks:
VALIDATE_TEXT_DEFINED
VALIDATE_UNICODE_TEXT_DEFINED
VALIDATE_DEPRECATED_CHARACTERS
VALIDATE_REPLACEMENT_CHARACTER
VALIDATE_PENDING_CHARACTER
VALIDATE_TEXT_CONTENT
Other:
STRUCTURAL_INTEGRITY
-val-params <INI file>: Load additional validation parameters (optional)
-remove <Filter1[,Filter2,...]>: Remove layout objects (optional).
Available filters (use comma only; no spaces):
REGIONS,NESTED_REGIONS,TEXT_LINES,WORDS,GLYPHS,READING_ORDER,LAYERS
-remove-ghosts <Filter1[,Filter2,...]>: Remove ghost objects (optional)
Ghosts are regions, text lines,
words or glyphs without outline.
Available filters (use comma only; no spaces):
REGIONS,TEXT_LINES,WORDS,GLYPHS,ALL
-convert-text <XML file with rules>: Text content conversion (optional)
-apply-offset <offsetX,offsetY>: Move all layout objects by specified offset
(optional)
Example: -10,20 (no spaces!)
-scale <scaleX,scaleY>: Scale all layout objects by specified factor
Use 'auto' for scaleX and/or scaleY to scale using
the difference between image and XML dimensions.
(optional) (done after apply-offset)
Example: 0.5,0.5 (no spaces!)
-rotate <degrees>: Rotates all polygon points of all layout objects clockwise
around the centre of the page.
-refine-outlines: Refine region outlines. Applies to conversion from non-PAGE
formats (e.g. ALTO) if supported.
General Checks:
VALIDATE_GTSID_DEFINED
trivial
repair: uuid, or rather based on file name?
VALIDATE_LAYERS
Checks if all regions are assigned to
layers (only if there exists at least
one layer).
Not entirely sure we need this, and what it is used for.
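Reading Order:
VALIDATE_READING_ORDER_DEFINED
trivial
repair: see page-ensure-readingorder
VALIDATE_READING_ORDER_COMPLETE
Some text regions are missing in the
reading order.
trivial
but not so easy to repair (would require merging existing RO with generated entries)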
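To illustrate why the completeness check itself is trivial, here is a rough sketch using only the standard library. It assumes the 2019-07-15 PAGE namespace and that reading order entries reference regions via a regionRef attribute; the function name and the sample document are made up for illustration and are not part of ocrd_validators.

```python
import xml.etree.ElementTree as ET

# Assumption: PAGE 2019-07-15 namespace; other schema versions need another URI.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}


def regions_missing_from_reading_order(page_xml):
    """Return IDs of TextRegions (at any depth) not referenced in the reading order."""
    root = ET.fromstring(page_xml)
    region_ids = [r.get("id") for r in root.iterfind(".//pc:TextRegion", NS)]
    # Collect every regionRef attribute, regardless of the group type it sits in.
    referenced = {
        el.get("regionRef") for el in root.iter() if el.get("regionRef") is not None
    }
    return [rid for rid in region_ids if rid not in referenced]


# Hypothetical minimal document: r2 is not listed in the reading order.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="p.tif" imageWidth="100" imageHeight="100">
    <ReadingOrder>
      <OrderedGroup id="ro">
        <RegionRefIndexed index="0" regionRef="r1"/>
      </OrderedGroup>
    </ReadingOrder>
    <TextRegion id="r1"><Coords points="0,0 10,0 10,10"/></TextRegion>
    <TextRegion id="r2"><Coords points="0,20 10,20 10,30"/></TextRegion>
  </Page>
</PcGts>"""

missing = regions_missing_from_reading_order(SAMPLE)
```

The hard part remains the repair: deciding where the missing IDs should be merged into the existing order.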
VALIDATE_TYPE_OF_REGIONS_IN_READING_ORDER
There are one or more regions within
the reading order that shouldn’t be
there (only paragraphs, headings,
drop-capitals, catch-words and TOC-
entries are supposed to be in the
reading order).
trivial
repair: maybe just a filter?
Region Related Checks:
VALIDATE_REGIONS_WITHIN_DOCUMENT_BOUNDARIES
already covered by check_coords
repair: see ocrd-segment-repair
VALIDATE_REGIONS_WITHIN_BORDER
already covered by check_coords
repair: see ocrd-segment-repair
VALIDATE_REGIONS_DONT_OVERLAP
trivial
repair: not trivial, but see ocrd-segment-repair with plausibilize=true and ocrd-cis-ocropy-clip
CALCULATE_REGION_OVERLAP_AREA
trivial
VALIDATE_REGION_WITHIN_PARENT_REGION
already covered by check_coords
repair: see ocrd-segment-repair
VALIDATE_NO_INTERSECTING_POLYGON_LINES
One or more polygons have
intersecting lines and therefore
contain loops.
i.e. self-intersection
already covered by check_coords
repair: see ocrd-segment-repair
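For reference, a self-intersection test can be sketched in plain Python as below. This is only an illustration of the check (pairwise test of non-adjacent edges, quadratic in the number of points), not a description of how check_coords does it. The function names are hypothetical, and the polygon is assumed to be implicitly closed, with points given as (x, y) tuples.

```python
from itertools import combinations


def _orient(a, b, c):
    """Sign of the cross product (b - a) x (c - a): 1, -1, or 0 if collinear."""
    val = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (val > 0) - (val < 0)


def _on_segment(a, b, c):
    """True if c (already known to be collinear with a-b) lies between a and b."""
    return (min(a[0], b[0]) <= c[0] <= max(a[0], b[0])
            and min(a[1], b[1]) <= c[1] <= max(a[1], b[1]))


def _segments_intersect(p1, p2, q1, q2):
    o1, o2 = _orient(p1, p2, q1), _orient(p1, p2, q2)
    o3, o4 = _orient(q1, q2, p1), _orient(q1, q2, p2)
    if o1 != o2 and o3 != o4:
        return True  # the segments cross or touch
    # Collinear cases: an endpoint of one segment lies on the other segment.
    return ((o1 == 0 and _on_segment(p1, p2, q1))
            or (o2 == 0 and _on_segment(p1, p2, q2))
            or (o3 == 0 and _on_segment(q1, q2, p1))
            or (o4 == 0 and _on_segment(q1, q2, p2)))


def is_self_intersecting(points):
    """Check all pairs of edges that do not share a vertex."""
    n = len(points)
    edges = [(points[i], points[(i + 1) % n]) for i in range(n)]
    for i, j in combinations(range(n), 2):
        if j == i + 1 or (i == 0 and j == n - 1):
            continue  # neighbouring edges always meet at their shared vertex
        if _segments_intersect(*edges[i], *edges[j]):
            return True
    return False


square = [(0, 0), (10, 0), (10, 10), (0, 10)]
bowtie = [(0, 0), (10, 10), (10, 0), (0, 10)]
```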
VALIDATE_GHOST_REGIONS
Looks for ghost regions (regions
without outline)
not sure what exactly this means: zero area? negligible area? no coords at all (which would already be syntactically invalid)?
VALIDATE_PRINTSPACE
Checks if regions of type page-
number, signature-mark, marginalia
or catch-word are NOT within the
print space
trivial
repair: shrink PrintSpace?
VALIDATE_PENDING_REGIONS
There are text regions without
parent (e.g. word without text line)
How is that even possible syntactically?
VALIDATE_COMPONENTS_INSIDE_REGIONS (requires image)
One or more regions contain
connected components that are
partly outside the region.
not difficult
repair: see ocrd-cis-ocropy-clip, ocrd-cis-ocropy-resegment (for textlines) and functions postprocess/morphmasks in ocrd-detectron2
VALIDATE_MISSING_ELEMENTS
Some text elements have no child
elements
again, unclear what that entails
Text Related Checks:
VALIDATE_TEXT_DEFINED
Some text regions have no text
ground-truth.
trivial
VALIDATE_UNICODE_TEXT_DEFINED
Some text regions have plain text
defined but not Unicode text.
trivial
VALIDATE_DEPRECATED_CHARACTERS
Deprecated characters are characters that were linked to a private Unicode code point but now have
a dedicated slot in the normal Unicode sections. The filter corrects such changes
doable
VALIDATE_REPLACEMENT_CHARACTER
Checks for occurrences of the
replacement character (Unicode
+FFFD) in text element
trivial
VALIDATE_PENDING_CHARACTER
Checks for occurrences of pending
characters (Unicode +F51C) in text
elements
trivial
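Both character checks boil down to scanning the text content for fixed code points. A minimal sketch follows; the function name and the report format are hypothetical, and only the two code points (U+FFFD and U+F51C) come from the PCV description above.

```python
# Code points named in the PCV rule descriptions.
SUSPICIOUS = {
    "\uFFFD": "replacement character",
    "\uF51C": "pending character",
}


def find_suspicious_characters(text):
    """Return (index, code point, description) for each flagged character."""
    return [
        (i, f"U+{ord(ch):04X}", SUSPICIOUS[ch])
        for i, ch in enumerate(text)
        if ch in SUSPICIOUS
    ]


hits = find_suspicious_characters("ab\uFFFDc")
```

A validator would run this over every TextEquiv string and report the element ID along with the hits.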
VALIDATE_TEXT_CONTENT
Checks the content of text elements
for inconsistencies (e.g. spaces in
words, trailing line breaks, non
matching text of child and parent
text objects)
already covered by page_textequiv_consistency and page_textequiv_strategy
repair: see page-textequiv-lines-to-regions, page-textequiv-words-to-lines, and function page_update_higher_textequiv_levels in all our *-recognize processors
Other:
STRUCTURAL_INTEGRITY
For XML validation errors, XML
reader warnings (e.g. old PAGE
format) and wrong image
dimensions.
would need to look at the exact implementation, but we should indeed pass any errors from the schema-backed (generateds) parser and present them as (actionable) exceptions
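-remove <Filter1[,Filter2,...]>: Remove layout objects (optional).
trivial
-remove-ghosts <Filter1[,Filter2,...]>: Remove ghost objects (optional)
Available filters (use comma only; no spaces):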
REGIONS,TEXT_LINES,WORDS,GLYPHS,ALL
Ghosts are regions, text lines,
words or glyphs without outline.
see above
-convert-text <XML file with rules>: Text content conversion (optional)
To apply a filter use -convert-text in the command line call and provide the file name of the XML file
containing the filter rules as additional argument. The XML file must have the following format:
Each parameter element contains a replacement rule. The sortIndex attribute specifies in which
order the rules will be applied. The id attribute must be unique (easiest to use the same value as the
sort index). The description is optional but helps to understand the rules. The actual rule is encoded
in the value attribute. The general format is “HHHH[,HHHH,...]:=[HHHH,HHHH,...]”. HHHH is a
Unicode character represented as 4 digit hexadecimal number. In the example above “0065:=0061”
means ‘replace all characters e with character a’. To replace a character sequence separate the
single characters by comma. The same applies for the right-hand side (the replacement character or
sequence). It is also possible to remove characters by leaving the right-hand side empty (e.g.
“0074:=” to delete all ts).
Sounds like a lot of effort for little gain. People could write their own XSLTs and text processors. But perhaps there are already tons of existing patterns, so supporting this mechanism does have merit?
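To gauge the effort: interpreting the rule strings themselves is small. The sketch below covers only the value syntax quoted above ("HHHH[,HHHH,...]:=[HHHH,HHHH,...]"). Reading the surrounding XML file and honouring sortIndex are left out, the function names are hypothetical, and rules are assumed to be passed already in their intended order.

```python
def parse_rule(rule):
    """Split 'HHHH[,HHHH,...]:=[HHHH,...]' into (search, replacement) strings."""
    left, sep, right = rule.partition(":=")
    if not sep or not left:
        raise ValueError(f"malformed rule: {rule!r}")

    def decode(side):
        # Each comma-separated item is one character given as a hex code point.
        return "".join(chr(int(code, 16)) for code in side.split(",") if code)

    return decode(left), decode(right)


def apply_rules(text, rules):
    """Apply the rules one after another, in the order given."""
    for rule in rules:
        search, replacement = parse_rule(rule)
        text = text.replace(search, replacement)
    return text


converted = apply_rules("text", ["0065:=0061"])  # replace every e with a
```

An empty right-hand side deletes the matched sequence, so "0074:=" removes every t.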
-apply-offset <offsetX,offsetY>: Move all layout objects by specified offset
(optional)
Example: -10,20 (no spaces!)
trivial
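As an illustration of how little is involved, shifting one PAGE points string ("x1,y1 x2,y2 ...") could look like the sketch below. The function name is hypothetical. Applying it to every Coords and Baseline element in a document, and deciding what to do with coordinates that end up negative or outside the image, is not shown.

```python
def offset_points(points, dx, dy):
    """Shift every 'x,y' pair in a PAGE points string by (dx, dy)."""
    shifted = []
    for pair in points.split():
        x, y = pair.split(",")
        # No clamping: results may become negative or exceed the image size.
        shifted.append(f"{int(x) + dx},{int(y) + dy}")
    return " ".join(shifted)


moved = offset_points("10,20 30,40", -10, 20)
```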
-refine-outlines: Refine region outlines. Applies to conversion from non-PAGE
formats (e.g. ALTO) if supported.
not sure what that entails
From HTRVX:
-s, --segmonto False Apply Segmonto Zoning verification
--zone TEXT None Provide a custom zone to control zone types instead of Segmonto
--line TEXT None Provide a custom line to control Line types instead of Segmonto
tbh, I don't understand that part
-e, --check-empty False Check for empty lines or empty zones
-r, --raise-empty False Warns but not fails if empty lines or empty zones are found
see above
-x, --xsd False Apply XSD Schema verification
see above
-i, --check-image False Check if the image link in the XML points to the right path
already covered