-
Notifications
You must be signed in to change notification settings - Fork 1
Handle mismatched HDF5 locking flags #178
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -30,6 +30,14 @@ class NoNewDefinitionsType: ... | |
NoNewDefinitions = NoNewDefinitionsType() | ||
|
||
|
||
class NoLockingIfNeededType: | ||
def __repr__(self) -> str: | ||
return "NoLockingIfNeeded" | ||
|
||
|
||
NoLockingIfNeeded = NoLockingIfNeededType() | ||
|
||
|
||
def load_component( | ||
location: NeXusLocationSpec, | ||
*, | ||
|
@@ -93,7 +101,7 @@ def _open_nexus_file( | |
file_path: FilePath | NeXusFile | NeXusGroup, | ||
definitions: Mapping | None | NoNewDefinitionsType = NoNewDefinitions, | ||
*, | ||
locking: bool | None = None, | ||
locking: bool | str | None | NoLockingIfNeededType = NoLockingIfNeeded, | ||
) -> AbstractContextManager[snx.Group]: | ||
if isinstance(file_path, getattr(NeXusGroup, '__supertype__', type(None))): | ||
if ( | ||
|
@@ -106,32 +114,63 @@ def _open_nexus_file( | |
return nullcontext(file_path) | ||
|
||
try: | ||
return _open_nexus_file_from_path(file_path, definitions, locking=locking) | ||
return _open_nexus_file_from_path( | ||
file_path, | ||
definitions, | ||
locking=None if locking is NoLockingIfNeeded else locking, | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Are we using There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. How so? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Because when we use swmr semantics we tell hdf5 that we expect another (swmr) writing process, and that can avoid locking. If we don't tell hdf5 that we expect another writing process, hdf5 has to place a (read) lock on the file to tell the writing process that someone is reading without swmr. Can't say for sure it will make a difference for us in this case, but since swmr is used by the writing process it's just the most sane way of doing it. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
No, it can't. It still needs to lock while opening the file. Otherwise, processes would have to synchronise in some other way. See https://github.com/HDFGroup/hdf5/blob/e9ab45f0f4d7240937d5f88055f6c217da80f0d4/doxygen/dox/file-locking.dox
As we discussed in person, in our case, the reader cannot communicate with the writer because they are on different systems which only have a one way connection (effectively). So the writer won't know that someone is reading the file - neither by writing SWMR flags into the file or by acquiring a file lock. In general, our workflows assume that their input files are complete. If the files are not complete, the workflow might crash. Or worse, run to completion but use inconsistent data. For example, it might read detector counts for 100 pulses but monitor counts for 50 pulses because the monitor data has not been fully written. So using SWMR doesn't really protect us. And as I said above, it has no effect on the writer. The writer never knows whether there are any readers and it can never be crashed or corrupt files because of a reader thanks to the separation of filesystems. So I don't see any benefit to using SWMR. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You are probably right that there are some flags or locking mechanisms used even when swmr is turned on. (This section from the document you linked: https://github.com/HDFGroup/hdf5/blob/e9ab45f0f4d7240937d5f88055f6c217da80f0d4/doxygen/dox/file-locking.dox#L46-L65) But in principle swmr should require less coordination and might in some situations use fewer flags or locks than non-swmr read access. We probably still have to disable file locking, I'm not saying we don't. What drawbacks you see with using swmr when reading from the ess filesystem? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Also keep in mind, that this code is used for all input NeXus files, not just the ones that actually are written in SWMR mode on ESS hardware. Can we just keep in simple? We can always enable SWMR if needed later. I don't want to do anything extra that has currently no effect, just in case our data storage gets changed fundamentally and we can start using the feature. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As I see it using swmr when reading from the ess filesystem is "keeping it simple" because that adheres to the contract with the people writing the files and that might prevent unforseen issues.
This is not the mechanism we should use to enforce that. In my opinion that should be done on a workflow by workflow basis. Some workflows could rather easily support that case by, for example, masking events in regions without monitor data. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. But this PR was not the best place for me to raise this. It's a question for the integration testing repo. |
||
) | ||
except OSError as err: | ||
if err.errno == errno.EROFS: | ||
# Failed to open because the filesystem is read-only. | ||
# (According to https://www.ioplex.com/%7Emiallen/errcmpp.html | ||
# this error code is universal.) | ||
# | ||
# On ESS machines, this happens for network filesystems of data that was | ||
# ingested into SciCat, including raw data. | ||
# In this case, it is safe to open the file without locking because: | ||
# - For raw files, they were written on a separate machine and are synced | ||
# with the one running reduction software. So there cannot be concurrent | ||
# write and read accesses to the same file on the same filesystem. | ||
# The ground truth on the filesystem used by the file writer is protected | ||
# and cannot be corrupted by our reader. | ||
# - For processed data, the file was copied to the read-only filesystem. | ||
# So the copy we are opening was not written by HDF5 directly and thus | ||
# locking has no effect anyway. | ||
# | ||
# When running on user machines, disabling locking can potentially corrupt | ||
# files. But the risk is minimal because very few users will have read-only | ||
# filesystems and do concurrent reads and writes. | ||
if _attempt_to_open_without_locking(err, locking): | ||
return _open_nexus_file_from_path(file_path, definitions, locking=False) | ||
raise | ||
|
||
|
||
# On ESS machines, some network filesystems are read-only. | ||
# E.g., data that was ingested into SciCat, including raw data. | ||
# HDF5 fails to open such files because it cannot lock the files. | ||
# In this case, it is safe(*) to open the file without locking because: | ||
# | ||
# - For raw files, they were written on a separate machine and are synced | ||
# with the one running reduction software. So there cannot be concurrent | ||
# write and read accesses to the same file on the same filesystem. | ||
# The ground truth on the filesystem used by the file writer is protected | ||
# and cannot be corrupted by our reader. | ||
# - For processed data, the file was copied to the read-only filesystem. | ||
# So the copy we are opening was not written by HDF5 directly and thus | ||
# locking has no effect anyway. | ||
# | ||
# When running on user machines, disabling locking can potentially corrupt | ||
# files. But the risk is minimal because very few users will have read-only | ||
# filesystems and do concurrent reads and writes. | ||
# | ||
# (*) Files on the read-only filesystem may still change while a file is open for | ||
# reading if they get updated from the original file. E.g., when reading a file that is | ||
# currently being written to. This can crash the reader. But our code is anyway not set | ||
# up to deal with changing files, so the added risk is not significant. | ||
# | ||
# See https://github.com/HDFGroup/hdf5/blob/e9ab45f0f4d7240937d5f88055f6c217da80f0d4/doxygen/dox/file-locking.dox | ||
# about HDF5 file locking. | ||
def _attempt_to_open_without_locking( | ||
err: OSError, locking: bool | str | None | NoLockingIfNeededType | ||
) -> bool: | ||
if locking is not NoLockingIfNeeded: | ||
return False # Respect user's choice. | ||
if err.errno == errno.EROFS: | ||
# Read-only filesystem. | ||
# (According to https://www.ioplex.com/%7Emiallen/errcmpp.html | ||
# this error code is universal.) | ||
return True | ||
|
||
# HDF5 tracks file locking flags internally within a single process. | ||
# If the same file is opened multiple times, we can get a flag mismatch. | ||
# We can try opening without locking, maybe this matches the original flags. | ||
if "file locking flag values don't match" in err.args[0]: | ||
return True | ||
if "file locking 'ignore disabled locks' flag values don't match" in err.args[0]: | ||
return True | ||
return False | ||
|
||
|
||
def _open_nexus_file_from_path( | ||
file_path: FilePath, | ||
definitions: Mapping | None | NoNewDefinitionsType, | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just have
bool | str | None
here?None
andNoLockingIfNeeded
have the same effect right?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No.
None
means that h5py should figure out what to. ButNoLockingIfNeeded
means that we use our own special logic. If a user passesNone
, it means that they want h5py's default, so we don't fall back to opening without locking.At the moment, there is no way to specify this parameter from the outside. But that may need to change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it now 👍