workspace clone: always copy local-only file paths #1149

Open
bertsky opened this issue Dec 11, 2023 · 3 comments
@bertsky
Collaborator

bertsky commented Dec 11, 2023

When you run ocrd workspace clone /some/path/to/mets.xml (without the indiscriminate download option) on a workspace that contains local files, the following happens:

  1. a mets:file with remote FLocat will still keep its (now defunct) local FLocat
  2. a mets:file with only local path FLocat will not be copied

IMO, workspace clone from a relative path should either always copy all local files, or at least copy the ones in 2 (and remove the local refs in 1).
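The second, narrower proposal can be sketched in plain Python. This is only an illustration under stated assumptions: `FileEntry` and `clone_entries` are hypothetical stand-ins and not part of OCR-D core, and a real fix would operate on the METS file entries directly.

```python
import shutil
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional


@dataclass
class FileEntry:
    """Hypothetical stand-in for a mets:file (not the OCR-D core class)."""
    file_id: str
    local_filename: Optional[str] = None  # relative local-path FLocat, if any
    url: Optional[str] = None             # remote FLocat, if any


def clone_entries(entries: List[FileEntry], src_dir, dst_dir) -> List[FileEntry]:
    """Apply the proposed clone behaviour to a list of file entries."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    for entry in entries:
        if entry.local_filename is None:
            continue  # remote-only file: nothing to do without --download
        if entry.url is not None:
            # Case 1: the remote ref survives, so drop the local ref that
            # would be defunct in the cloned workspace.
            entry.local_filename = None
            continue
        # Case 2: local-only file, so it must be copied or it is lost.
        src = src_dir / entry.local_filename
        dst = dst_dir / entry.local_filename
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
    return entries
```

With this behaviour, every entry in the cloned workspace has at least one working reference: either a remote FLocat or a local file that was actually copied.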

Copying the content files could also attempt CoW (copy-on-write, i.e. near zero-cost) copies where the filesystem permits it.
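A minimal sketch of such a CoW attempt with fallback, using only the standard library. The helper name `copy_cow` is hypothetical. The `FICLONE` request number is the Linux value on common architectures such as x86 and ARM (an assumption; it differs on some others), and it only succeeds on filesystems that support reflinks (for example Btrfs or XFS). Everywhere else the code falls back to a regular copy.

```python
import shutil

try:
    import fcntl  # POSIX only
except ImportError:
    fcntl = None

# Linux ioctl request for reflink cloning (assumed value for x86/ARM).
FICLONE = 0x40049409


def copy_cow(src, dst) -> bool:
    """Copy src to dst, preferring a reflink; return True if reflinked."""
    if fcntl is not None:
        try:
            with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
                # Ask the kernel to share the data blocks of src with dst.
                fcntl.ioctl(fdst.fileno(), FICLONE, fsrc.fileno())
            shutil.copystat(src, dst)
            return True
        except OSError:
            pass  # unsupported filesystem or platform: fall through
    # Regular copy; overwrites the empty dst left by a failed attempt.
    shutil.copy2(src, dst)
    return False
```

The return value only reports which path was taken; callers get an identical file either way.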

@bertsky
Collaborator Author

bertsky commented May 24, 2024

Also:

When you run ocrd workspace clone --download /some/path/to/mets.xml (with the download option) on a workspace that contains local files, the following happens:

  1. a mets:file with only local path FLocat will get an additional remote FLocat with an absolute path (combining the baseurl prefix with the relative path).

@bertsky
Collaborator Author

bertsky commented May 24, 2024

@kba this is a severe problem IMO.

@bertsky
Collaborator Author

bertsky commented Aug 25, 2024

Another example of this (trying to get ocrd_tesserocr tests to work on v3):

    @fixture
    def workspace_kant_binarized(tmpdir):
        initLogging()
        with pushd_popd(tmpdir):
>           yield Resolver().workspace_from_url(METS_KANT_BINARIZED, dst_dir=tmpdir, download=True)

test/conftest.py:15: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../core/src/ocrd/resolver.py:229: in workspace_from_url
    workspace.download_file(f)
../core/src/ocrd/workspace.py:222: in download_file
    f.local_filename = self.resolver.download_to_directory(self.directory, f.url, subdir=f.fileGrp, basename=basename)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
E               FileNotFoundError: File path passed as 'url' to download_to_directory does not exist: 'OCR-D-GT-WORD/INPUT_0017.xml

Because METS_KANT_BINARIZED is only a local workspace to "download" from, the baseurl mechanism does not work: by the time the download is attempted, the information about where the absolute path was is already gone.
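To illustrate the missing piece, here is a sketch of resolving a relative file reference against the directory of the source METS while that location is still known. `resolve_source_path` is a hypothetical helper, not an existing OCR-D function, and the set of schemes treated as remote is an assumption.

```python
from pathlib import Path
from urllib.parse import urlparse

REMOTE_SCHEMES = ("http", "https", "ftp")  # assumed set of remote schemes


def resolve_source_path(ref: str, src_mets: str) -> str:
    """Return a reference that stays valid outside the source workspace."""
    if urlparse(ref).scheme in REMOTE_SCHEMES:
        return ref  # genuinely remote: leave untouched
    path = Path(ref)
    if path.is_absolute():
        return str(path)
    # Relative local path: anchor it at the directory of the source METS.
    return str(Path(src_mets).resolve().parent / path)
```

Resolving a path like the one in the traceback this way would yield an absolute path under the source workspace, instead of a relative one that no longer exists from the destination directory.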
