Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle mismatched HDF5 locking flags #178

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

jl-wynen
Copy link
Member

@jl-wynen jl-wynen commented Feb 5, 2025

When opening a file multiple times in the same process, HDF5 compares the flags used and raises an error if there is a mismatch. This PR adds a mechanism to adapt to mismatches and try to open the file in a compatible way.

The error can occur, e.g., when opening a file on a read-only filesystem multiple times. The first time, _open_nexus_file falls back to locking=False. The second time, if the file is still open, the old _open_nexus_file_from_path(locking=locking) would raise an error. _attempt_to_open_without_locking handles this error.

I encountered this in the LoKI tests in https://git.esss.dk/dmsc-nightly/dmsc-nightly/-/merge_requests/87. It seems that the raw data files are opened multiple times in the same process.

Note that file locking in HDF5 is about concurrent accesses from multiple processes. Locks have no effect when opening a file multiple times in the same process. This is where HDF5's flag consistency checks come into play.

@jl-wynen jl-wynen force-pushed the locking-with-file-already-open branch from c6850d0 to 5be5fd6 Compare February 5, 2025 14:13
@nvaytet
Copy link
Member

nvaytet commented Feb 6, 2025

Is it because we are using load_nexus_component and load_nexus_data to load different parts of the file and some of this can happen in parallel with Dask?
Screenshot_20250206_103909

@@ -93,7 +101,7 @@ def _open_nexus_file(
file_path: FilePath | NeXusFile | NeXusGroup,
definitions: Mapping | None | NoNewDefinitionsType = NoNewDefinitions,
*,
locking: bool | None = None,
locking: bool | str | None | NoLockingIfNeededType = NoLockingIfNeeded,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just have bool | str | None here?

None and NoLockingIfNeeded have the same effect right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. None means that h5py should figure out what to. But NoLockingIfNeeded means that we use our own special logic. If a user passes None, it means that they want h5py's default, so we don't fall back to opening without locking.

At the moment, there is no way to specify this parameter from the outside. But that may need to change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it now 👍

return _open_nexus_file_from_path(
file_path,
definitions,
locking=None if locking is NoLockingIfNeeded else locking,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we using swmr=True when opening the hdf5 files in the nightly tests? We should, it will help making the file access more robust.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How so?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because when we use swmr semantics we tell hdf5 that we expect another (swmr) writing process, and that can avoid locking.

If we don't tell hdf5 that we expect another writing process, hdf5 has to place a (read) lock on the file to tell the writing process that someone is reading without swmr.

Can't say for sure it will make a difference for us in this case, but since swmr is used by the writing process it's just the most sane way of doing it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and that can avoid locking.

No, it can't. It still needs to lock while opening the file. Otherwise, processes would have to synchronise in some other way. See https://github.com/HDFGroup/hdf5/blob/e9ab45f0f4d7240937d5f88055f6c217da80f0d4/doxygen/dox/file-locking.dox

If we don't tell hdf5 that we expect another writing process, hdf5 has to place a (read) lock on the file to tell the writing process that someone is reading without swmr.

As we discussed in person, in our case, the reader cannot communicate with the writer because they are on different systems which only have a one way connection (effectively). So the writer won't know that someone is reading the file - neither by writing SWMR flags into the file or by acquiring a file lock.

In general, our workflows assume that their input files are complete. If the files are not complete, the workflow might crash. Or worse, run to completion but use inconsistent data. For example, it might read detector counts for 100 pulses but monitor counts for 50 pulses because the monitor data has not been fully written. So using SWMR doesn't really protect us. And as I said above, it has no effect on the writer. The writer never knows whether there are any readers and it can never be crashed or corrupt files because of a reader thanks to the separation of filesystems. So I don't see any benefit to using SWMR.

Copy link
Contributor

@jokasimr jokasimr Feb 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are probably right that there are some flags or locking mechanisms used even when swmr is turned on. (This section from the document you linked: https://github.com/HDFGroup/hdf5/blob/e9ab45f0f4d7240937d5f88055f6c217da80f0d4/doxygen/dox/file-locking.dox#L46-L65)

But in principle swmr should require less coordination and might in some situations use fewer flags or locks than non-swmr read access.
Obeying that access pattern is likely to mean we will run into fewer of these or similar issues.
Also consider that the details here might change in future versions of hdf5, and if they do it's good if we communicate our intent as clear as possible when using the library, and our intent is to read, possibly while someone is writing to the file - that is - swmr.

We probably still have to disable file locking, I'm not saying we don't.

What drawbacks you see with using swmr when reading from the ess filesystem?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Selected
Development

Successfully merging this pull request may close these issues.

3 participants