add gt-labelling to OcrdMets API #783

Open
bertsky opened this issue Jan 21, 2022 · 1 comment

bertsky commented Jan 21, 2022

Note: In METS, the labels are a flat sequence of gt:state elements, each with a @prop value from the above-mentioned schema file, with one such sequence (wrapped in its own dmdSec) per page.

   <mets:dmdSec ID="DMDGT_0001">
      <mets:mdWrap MDTYPE="OTHER" OTHERMDTYPE="GT">
         <mets:xmlData>
            <gt:gt>
               <gt:state prop="granularity/physical/document-related/word"/>
               <gt:state prop="granularity/physical/document-related/text-line"/>
               <gt:state prop="granularity/physical/document-related/region"/>
               <gt:state prop="data-attributes/document-related/visual/text/font/multi-font/typefaces"/>
               <gt:state prop="data-attributes/document-related/visual/text/font/multi-font/font-sizes"/>
               <gt:state prop="data-attributes/language/mixed"/>
               <gt:state prop="condition/production-related/document-faults/ink-from-facing"/>
               <gt:state prop="condition/wear/additions/informative/annotations"/>
               <gt:state prop="condition/production-related/document-characteristics/low-contrast"/>
               <gt:state prop="condition/acquisition/method-flaws/imaging/uneven-illumination"/>
            </gt:gt>
         </mets:xmlData>
      </mets:mdWrap>
   </mets:dmdSec>

Each of these dmdSec elements is then referenced via @DMDID from the corresponding page in the physical structMap.
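
To make the lookup concrete, here is a minimal sketch of resolving labels per page with nothing but the standard library. It is not existing core API: the function name `labels_by_page`, the sample document and the page IDs are made up for illustration, and the `gt` namespace URI is an assumption (taken from the `externalId` used further down in this issue).

```python
import xml.etree.ElementTree as ET

NS = {
    "mets": "http://www.loc.gov/METS/",
    "gt": "http://www.ocr-d.de/GT/",  # assumed namespace URI for gt:*
}

# Hypothetical, shortened METS document for illustration only
SAMPLE_METS = """
<mets:mets xmlns:mets="http://www.loc.gov/METS/" xmlns:gt="http://www.ocr-d.de/GT/">
  <mets:dmdSec ID="DMDGT_0001">
    <mets:mdWrap MDTYPE="OTHER" OTHERMDTYPE="GT">
      <mets:xmlData>
        <gt:gt>
          <gt:state prop="granularity/physical/document-related/region"/>
          <gt:state prop="data-attributes/language/mixed"/>
        </gt:gt>
      </mets:xmlData>
    </mets:mdWrap>
  </mets:dmdSec>
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div TYPE="page" ID="PHYS_0001" DMDID="DMDGT_0001"/>
      <mets:div TYPE="page" ID="PHYS_0002"/>
    </mets:div>
  </mets:structMap>
</mets:mets>
"""


def labels_by_page(root):
    """Return a dict mapping physical page ID to its list of GT labels."""
    # Collect the gt:state/@prop values of every dmdSec, keyed by dmdSec ID
    by_dmd = {}
    for dmd in root.iterfind("mets:dmdSec", NS):
        by_dmd[dmd.get("ID")] = [
            state.get("prop") for state in dmd.iterfind(".//gt:state", NS)
        ]
    # Follow @DMDID (a whitespace-separated IDREFS list) from each page div
    result = {}
    pages = "mets:structMap[@TYPE='PHYSICAL']//mets:div[@TYPE='page']"
    for div in root.iterfind(pages, NS):
        labels = []
        for dmdid in div.get("DMDID", "").split():
            labels.extend(by_dmd.get(dmdid, []))
        result[div.get("ID")] = labels
    return result


page_labels = labels_by_page(ET.fromstring(SAMPLE_METS))
print(page_labels)
```

Pages without a @DMDID simply get an empty list here; whether the real API should return an empty list or omit such pages is an open design question.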

IMO in core we first need some additional API to support that, for example (in analogy to pageId):

OcrdMets.get_gt_labelling(self, for_fileIds=None) # returns dict of file ID to label list
OcrdMets.get_gt_labelling_for_file(self, ocrd_file) # returns label list
OcrdMets.set_gt_labelling_for_file(self, labels, ocrd_file) # takes label list
# but also:
OcrdMets.add_file(self, ... labels=None, ...) # add full label list
OcrdMets.find_files(self, ... labels=None, ...) # filter by label list (match any)
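
As a rough illustration of the "match any" filter semantics, assuming the labelling is already available as the dict of file ID to label list that `get_gt_labelling` would return (the helper name `filter_by_labels` and the file IDs are hypothetical, not part of core):

```python
def filter_by_labels(labelling, wanted):
    """Return the file IDs whose label list shares at least one label with `wanted`.

    labelling: dict mapping file ID to list of label strings
    wanted: iterable of label strings (match any)
    """
    wanted = set(wanted)
    return [
        file_id
        for file_id, labels in labelling.items()
        if wanted.intersection(labels)
    ]


# Hypothetical labelling, shaped like the proposed get_gt_labelling() result
labelling = {
    "FILE_0001": ["data-attributes/language/mixed"],
    "FILE_0002": ["condition/wear/additions/informative/annotations"],
    "FILE_0003": [],
}
matches = filter_by_labels(
    labelling,
    ["data-attributes/language/mixed", "granularity/physical/document-related/word"],
)
print(matches)
```

This sketch compares labels by exact string equality only; matching on path prefixes (e.g. everything under `condition/`) would be a separate design decision.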

What's your opinion, @kba?

Perhaps, instead of parsing the labels from the METS, we could also see to it that OCR-D mirrors them in the parsed PAGE-XML, i.e. OcrdPage.

For example as:

  <MetadataItem type="imageProperties" name="gt-labelling">
    <Labels externalModel="https://github.com/OCR-D/gt-labelling/blob/master/xsd_schema/OCR-D_GT_schema.xsd" externalId="http://www.ocr-d.de/GT/">
      <Label value="granularity/physical/document-related/word"/>
      <Label value="granularity/physical/document-related/text-line"/>
      <Label value="granularity/physical/document-related/region"/>
      <Label value="data-attributes/document-related/visual/text/font/multi-font/typefaces"/>
      <Label value="data-attributes/document-related/visual/text/font/multi-font/font-sizes"/>
      <Label value="data-attributes/language/mixed"/>
      <Label value="condition/production-related/document-faults/ink-from-facing"/>
      <Label value="condition/wear/additions/informative/annotations"/>
      <Label value="condition/production-related/document-characteristics/low-contrast"/>
      <Label value="condition/acquisition/method-flaws/imaging/uneven-illumination"/>
    </Labels>
  </MetadataItem>

This would make it easier to access the labels from a processor or PAGE viewer.
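
For instance, reading the mirrored labels back would then only need the PAGE file itself. The following is a sketch using the standard library rather than the OcrdPage API; the function name `gt_labels_from_page` and the shortened sample are hypothetical, and the PAGE namespace version (2019-07-15) is an assumption about the files at hand.

```python
import xml.etree.ElementTree as ET

# Assumed PAGE namespace version; adjust to the files at hand
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
NS = {"pc": PAGE_NS}

# Hypothetical, shortened PAGE document mirroring the proposal above
SAMPLE_PAGE = f"""
<PcGts xmlns="{PAGE_NS}">
  <Metadata>
    <MetadataItem type="imageProperties" name="gt-labelling">
      <Labels externalId="http://www.ocr-d.de/GT/">
        <Label value="granularity/physical/document-related/region"/>
        <Label value="data-attributes/language/mixed"/>
      </Labels>
    </MetadataItem>
    <MetadataItem type="processingStep" name="other">
      <Labels>
        <Label value="unrelated"/>
      </Labels>
    </MetadataItem>
  </Metadata>
</PcGts>
"""


def gt_labels_from_page(root):
    """Return the Label values of the MetadataItem named 'gt-labelling'."""
    xpath = "pc:Metadata/pc:MetadataItem[@name='gt-labelling']/pc:Labels/pc:Label"
    return [label.get("value") for label in root.iterfind(xpath, NS)]


page_gt_labels = gt_labels_from_page(ET.fromstring(SAMPLE_PAGE))
print(page_gt_labels)
```

Selecting on `@name='gt-labelling'` keeps labels from other MetadataItem entries (such as processing steps) out of the result.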

Originally posted by @bertsky in hnesk/browse-ocrd#36 (comment)


bertsky commented Sep 15, 2022

@tboenig perhaps relevant for gt-guideline-examples etc.
