base file group other than OCR-D-IMG #7

Closed
bertsky opened this issue Oct 2, 2020 · 5 comments


Contributor

bertsky commented Oct 2, 2020

I have a METS here which does contain a fileGrp OCR-D-IMG, but not comprising all physical pages. This gives me:

INFO ocrd.resolver.workspace_from_nothing - Writing METS to /tmp/ocrd-core-ttugo0kk/mets.xml
Traceback (most recent call last):
  File "ocrd_browser/ui/window.py", line 88, in _open
    self.page_list.set_document(self.document)
  File "ocrd_browser/ui/page_browser.py", line 39, in set_document
    self.model = PageListStore(self.document)
  File "ocrd_browser/ui/page_store.py", line 56, in __init__
    file = str(file_lookup[page_id])
KeyError: 'f00037100714864'

So I dug into ocrd_browser.ui.page_store and thought it might be sufficient to just check page_id in file_lookup before appending a row to the Gtk list. But this raises bigger questions:

  1. Why should the initial view be restricted to pages contained in OCR-D-IMG at all? This could easily just be empty. With practical library systems, the initial image fileGrp could realistically be called MAX, ORIGINAL or something else instead. My understanding of this program is that it should try to present a view of all physical pages (at least initially, before selecting a fileGrp explicitly). So how about presenting all structMap entries sorted by their @ORDER (if present) or @ID with the first fptr that shows up?

  2. How do you change to a different fileGrp? ui.view.base has a View.use_file_group property fixed to OCR-D-IMG.
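To illustrate the ordering proposed in question 1, here is a minimal sketch that reads the physical structMap with the standard library, sorts pages by @ORDER (falling back to @ID), and takes the first fptr of each page. The sample METS snippet, the IDs in it and the function name pages_in_order are made up for illustration; this is not how ocrd_browser or ocrd_models do it.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
NS = {"mets": METS_NS}

# Hypothetical, heavily trimmed METS document (IDs are invented).
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div TYPE="page" ID="PHYS_0002" ORDER="2">
        <mets:fptr FILEID="MAX_0002"/>
      </mets:div>
      <mets:div TYPE="page" ID="PHYS_0001" ORDER="1">
        <mets:fptr FILEID="MAX_0001"/>
        <mets:fptr FILEID="OCR-D-IMG_0001"/>
      </mets:div>
      <mets:div TYPE="page" ID="PHYS_0003"/>
    </mets:div>
  </mets:structMap>
</mets:mets>"""


def pages_in_order(mets_xml):
    """Return (page ID, first FILEID or None) for every physical page."""
    root = ET.fromstring(mets_xml)
    pages = []
    for struct_map in root.findall("mets:structMap", NS):
        if struct_map.get("TYPE") != "PHYSICAL":
            continue
        for div in struct_map.iter("{%s}div" % METS_NS):
            if div.get("TYPE") == "page":
                pages.append(div)

    def sort_key(div):
        order = div.get("ORDER")
        if order is not None and order.isdigit():
            return (0, int(order), "")    # pages with a numeric @ORDER first
        return (1, 0, div.get("ID", ""))  # then the rest, sorted by @ID

    result = []
    for div in sorted(pages, key=sort_key):
        fptr = div.find("mets:fptr", NS)  # first fptr that shows up, if any
        file_id = fptr.get("FILEID") if fptr is not None else None
        result.append((div.get("ID"), file_id))
    return result


for page_id, file_id in pages_in_order(SAMPLE_METS):
    print(page_id, file_id)
```

A page without any fptr still appears in the result, paired with None, so the view could list it even though there is nothing to show for it yet.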

Owner

hnesk commented Oct 13, 2020

That's a valid question, and I had the problem myself (original fileGrp not named OCR-D-IMG).
Solutions to your questions:

  1. The fileGrp to display with PageListStore should not be hardcoded to OCR-D-IMG, but should be selectable like in ViewImages. As a default it should try:
     • The first of a (configurable) list of preferred fileGrps to display as images (OCR-D-IMG, MAX, ORIGINAL) which has a mime-type matching image/*.
     • The first fileGrp (sorted by ???? maybe string length, because derived images usually have a more complex name than the original?) which has a mime-type matching image/*.
     • If there is no match according to the "page_id in file_lookup" logic you described, display a "missing image" icon.
  2. View.use_file_group is overridden in ViewXml and ViewImages. OCR-D-IMG is just the default value for all possible views. The use_file_group implementations in these views are actually quite robust and take the user selection and the availability of the selected fileGrp into account. I think the way to go is to base PageListStore on the same implementation.
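The default selection described in point 1 could be sketched roughly as follows. This assumes the fileGrps are already available as a plain dict mapping group name to the set of MIME types it contains; the names PREFERRED_GROUPS and pick_image_file_group are hypothetical and not part of ocrd_browser, and the "shortest name wins" fallback is only the heuristic floated above.

```python
# Hypothetical default list, taken from the names mentioned in this thread.
PREFERRED_GROUPS = ("OCR-D-IMG", "MAX", "ORIGINAL")


def pick_image_file_group(group_mimetypes, preferred=PREFERRED_GROUPS):
    """Pick a fileGrp to show as page images, or None if there is none.

    group_mimetypes maps each fileGrp name to the set of MIME types
    of the files it contains (an assumed, simplified input format).
    """
    # Only groups that contain at least one image/* file are candidates.
    image_groups = [
        name
        for name, mimetypes in group_mimetypes.items()
        if any(mimetype.startswith("image/") for mimetype in mimetypes)
    ]
    # 1. The first preferred group that actually holds images wins.
    for name in preferred:
        if name in image_groups:
            return name
    # 2. Otherwise fall back to the shortest group name (ties broken
    #    alphabetically), guessing that derived groups have longer names.
    if image_groups:
        return min(image_groups, key=lambda name: (len(name), name))
    # 3. No image group at all: the caller shows a placeholder or empty view.
    return None


print(pick_image_file_group({
    "OCR-D-BIN": {"image/png"},
    "MAX": {"image/jpeg"},
    "OCR-D-OCR": {"application/vnd.prima.page+xml"},
}))
```

Passing a different preferred tuple is one way the list could be made configurable.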

What do you think?

Contributor Author

bertsky commented Oct 13, 2020

2: Oh, I see! Yes, sounds reasonable to base the initial view on that as well.

1: Yes, this would be very intuitive behaviour and easy to use IMHO. Or (instead of the second criterion) one could even start with an empty view if the first criterion (fixed/configured list of preferred groups) does not yield any images.

hnesk added a commit that referenced this issue Oct 14, 2020
hnesk added a commit that referenced this issue Oct 14, 2020
Owner

hnesk commented Oct 14, 2020

I hope I have fixed most of the points now, except the selectable file group, which is quite difficult to implement (but absolutely worth it) and now handled in #9

> Why should the initial view be restricted to pages contained in OCR-D-IMG

The initial view file_group is now determined by the algorithm outlined in my comment, point 1., the (quite dirty) implementation is here

> So how about presenting all structMap entries sorted by their @ORDER (if present) or @ID with the first fptr that shows up?

The page browser now uses all page_ids from ocrd_models.ocrd_mets.OcrdMets.physical_pages (but without taking @ORDER into account) to determine which pages "exist". It then tries to find matching image files from a given file_group to display. If no image is found for the page a "missing-image" icon is displayed.
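As a rough illustration of that lookup, the sketch below builds one row per physical page and substitutes a placeholder when the chosen file group has no image for it, which avoids the KeyError from the original report. The function build_page_rows, the MISSING_IMAGE constant and the sample IDs are hypothetical; the real PageListStore is a Gtk list store and works differently in detail.

```python
# Hypothetical placeholder standing in for the "missing-image" icon.
MISSING_IMAGE = "missing-image"


def build_page_rows(page_ids, file_lookup):
    """Return one (page_id, image) row per physical page.

    page_ids: all physical page IDs of the document, in display order.
    file_lookup: dict mapping page ID to an image file of the chosen
        file group; pages without an image are simply absent.
    """
    rows = []
    for page_id in page_ids:
        # dict.get never raises, unlike file_lookup[page_id].
        rows.append((page_id, file_lookup.get(page_id, MISSING_IMAGE)))
    return rows


rows = build_page_rows(
    ["PHYS_0001", "PHYS_0002", "PHYS_0003"],
    {"PHYS_0001": "MAX/0001.jpg", "PHYS_0003": "MAX/0003.jpg"},
)
for row in rows:
    print(row)
```

Iterating over the page IDs rather than over the files is the design point here: a page stays visible even when the selected file group has no image for it.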

Contributor Author

bertsky commented Oct 15, 2020

> I hope I have fixed most of the points now, except the selectable file group, which is quite difficult to implement (but absolutely worth it) and now handled in #9

Great work!

> The initial view file_group is now determined by the algorithm outlined in my comment, point 1., the (quite dirty) implementation is here

Wow, you even have a heuristic for the length of the candidate fileGrps in there!

> The page browser now uses all page_ids from ocrd_models.ocrd_mets.OcrdMets.physical_pages (but without taking @ORDER into account) to determine which pages "exist". It then tries to find matching image files from a given file_group to display. If no image is found for the page a "missing-image" icon is displayed.

Works perfectly, many thanks!

Owner

hnesk commented Oct 20, 2020

I will close this now, for the rest see #9

hnesk closed this as completed Oct 20, 2020