-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle mismatched HDF5 locking flags #178
base: main
Are you sure you want to change the base?
Conversation
c6850d0
to
5be5fd6
Compare
@@ -93,7 +101,7 @@ def _open_nexus_file( | |||
file_path: FilePath | NeXusFile | NeXusGroup, | |||
definitions: Mapping | None | NoNewDefinitionsType = NoNewDefinitions, | |||
*, | |||
locking: bool | None = None, | |||
locking: bool | str | None | NoLockingIfNeededType = NoLockingIfNeeded, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just have bool | str | None
here?
None
and NoLockingIfNeeded
have the same effect right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. None
means that h5py should figure out what to. But NoLockingIfNeeded
means that we use our own special logic. If a user passes None
, it means that they want h5py's default, so we don't fall back to opening without locking.
At the moment, there is no way to specify this parameter from the outside. But that may need to change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it now 👍
return _open_nexus_file_from_path( | ||
file_path, | ||
definitions, | ||
locking=None if locking is NoLockingIfNeeded else locking, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we using swmr=True
when opening the hdf5
files in the nightly tests? We should, it will help making the file access more robust.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How so?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because when we use swmr semantics we tell hdf5 that we expect another (swmr) writing process, and that can avoid locking.
If we don't tell hdf5 that we expect another writing process, hdf5 has to place a (read) lock on the file to tell the writing process that someone is reading without swmr.
Can't say for sure it will make a difference for us in this case, but since swmr is used by the writing process it's just the most sane way of doing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and that can avoid locking.
No, it can't. It still needs to lock while opening the file. Otherwise, processes would have to synchronise in some other way. See https://github.com/HDFGroup/hdf5/blob/e9ab45f0f4d7240937d5f88055f6c217da80f0d4/doxygen/dox/file-locking.dox
If we don't tell hdf5 that we expect another writing process, hdf5 has to place a (read) lock on the file to tell the writing process that someone is reading without swmr.
As we discussed in person, in our case, the reader cannot communicate with the writer because they are on different systems which only have a one way connection (effectively). So the writer won't know that someone is reading the file - neither by writing SWMR flags into the file or by acquiring a file lock.
In general, our workflows assume that their input files are complete. If the files are not complete, the workflow might crash. Or worse, run to completion but use inconsistent data. For example, it might read detector counts for 100 pulses but monitor counts for 50 pulses because the monitor data has not been fully written. So using SWMR doesn't really protect us. And as I said above, it has no effect on the writer. The writer never knows whether there are any readers and it can never be crashed or corrupt files because of a reader thanks to the separation of filesystems. So I don't see any benefit to using SWMR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are probably right that there are some flags or locking mechanisms used even when swmr is turned on. (This section from the document you linked: https://github.com/HDFGroup/hdf5/blob/e9ab45f0f4d7240937d5f88055f6c217da80f0d4/doxygen/dox/file-locking.dox#L46-L65)
But in principle swmr should require less coordination and might in some situations use fewer flags or locks than non-swmr read access.
Obeying that access pattern is likely to mean we will run into fewer of these or similar issues.
Also consider that the details here might change in future versions of hdf5, and if they do it's good if we communicate our intent as clear as possible when using the library, and our intent is to read, possibly while someone is writing to the file - that is - swmr.
We probably still have to disable file locking, I'm not saying we don't.
What drawbacks you see with using swmr when reading from the ess filesystem?
When opening a file multiple times in the same process, HDF5 compares the flags used and raises an error if there is a mismatch. This PR adds a mechanism to adapt to mismatches and try to open the file in a compatible way.
The error can occur, e.g., when opening a file on a read-only filesystem multiple times. The first time,
_open_nexus_file
falls back tolocking=False
. The second time, if the file is still open, the old_open_nexus_file_from_path(locking=locking)
would raise an error._attempt_to_open_without_locking
handles this error.I encountered this in the LoKI tests in https://git.esss.dk/dmsc-nightly/dmsc-nightly/-/merge_requests/87. It seems that the raw data files are opened multiple times in the same process.
Note that file locking in HDF5 is about concurrent accesses from multiple processes. Locks have no effect when opening a file multiple times in the same process. This is where HDF5's flag consistency checks come into play.