Skip to content

Handle mismatched HDF5 locking flags #178

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Feb 7, 2025
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
83 changes: 61 additions & 22 deletions src/ess/reduce/nexus/_nexus_loader.py
Original file line number Diff line number Diff line change
Expand Up @@ -30,6 +30,14 @@ class NoNewDefinitionsType: ...
NoNewDefinitions = NoNewDefinitionsType()


class NoLockingIfNeededType:
def __repr__(self) -> str:
return "NoLockingIfNeeded"


NoLockingIfNeeded = NoLockingIfNeededType()


def load_component(
location: NeXusLocationSpec,
*,
Expand Down Expand Up @@ -93,7 +101,7 @@ def _open_nexus_file(
file_path: FilePath | NeXusFile | NeXusGroup,
definitions: Mapping | None | NoNewDefinitionsType = NoNewDefinitions,
*,
locking: bool | None = None,
locking: bool | str | None | NoLockingIfNeededType = NoLockingIfNeeded,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just have bool | str | None here?

None and NoLockingIfNeeded have the same effect right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. None means that h5py should figure out what to. But NoLockingIfNeeded means that we use our own special logic. If a user passes None, it means that they want h5py's default, so we don't fall back to opening without locking.

At the moment, there is no way to specify this parameter from the outside. But that may need to change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it now 👍

) -> AbstractContextManager[snx.Group]:
if isinstance(file_path, getattr(NeXusGroup, '__supertype__', type(None))):
if (
Expand All @@ -106,32 +114,63 @@ def _open_nexus_file(
return nullcontext(file_path)

try:
return _open_nexus_file_from_path(file_path, definitions, locking=locking)
return _open_nexus_file_from_path(
file_path,
definitions,
locking=None if locking is NoLockingIfNeeded else locking,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we using swmr=True when opening the hdf5 files in the nightly tests? We should, it will help making the file access more robust.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How so?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because when we use swmr semantics we tell hdf5 that we expect another (swmr) writing process, and that can avoid locking.

If we don't tell hdf5 that we expect another writing process, hdf5 has to place a (read) lock on the file to tell the writing process that someone is reading without swmr.

Can't say for sure it will make a difference for us in this case, but since swmr is used by the writing process it's just the most sane way of doing it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and that can avoid locking.

No, it can't. It still needs to lock while opening the file. Otherwise, processes would have to synchronise in some other way. See https://github.com/HDFGroup/hdf5/blob/e9ab45f0f4d7240937d5f88055f6c217da80f0d4/doxygen/dox/file-locking.dox

If we don't tell hdf5 that we expect another writing process, hdf5 has to place a (read) lock on the file to tell the writing process that someone is reading without swmr.

As we discussed in person, in our case, the reader cannot communicate with the writer because they are on different systems which only have a one way connection (effectively). So the writer won't know that someone is reading the file - neither by writing SWMR flags into the file or by acquiring a file lock.

In general, our workflows assume that their input files are complete. If the files are not complete, the workflow might crash. Or worse, run to completion but use inconsistent data. For example, it might read detector counts for 100 pulses but monitor counts for 50 pulses because the monitor data has not been fully written. So using SWMR doesn't really protect us. And as I said above, it has no effect on the writer. The writer never knows whether there are any readers and it can never be crashed or corrupt files because of a reader thanks to the separation of filesystems. So I don't see any benefit to using SWMR.

Copy link
Contributor

@jokasimr jokasimr Feb 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are probably right that there are some flags or locking mechanisms used even when swmr is turned on. (This section from the document you linked: https://github.com/HDFGroup/hdf5/blob/e9ab45f0f4d7240937d5f88055f6c217da80f0d4/doxygen/dox/file-locking.dox#L46-L65)

But in principle swmr should require less coordination and might in some situations use fewer flags or locks than non-swmr read access.
Obeying that access pattern is likely to mean we will run into fewer of these or similar issues.
Also consider that the details here might change in future versions of hdf5, and if they do it's good if we communicate our intent as clear as possible when using the library, and our intent is to read, possibly while someone is writing to the file - that is - swmr.

We probably still have to disable file locking, I'm not saying we don't.

What drawbacks you see with using swmr when reading from the ess filesystem?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What drawbacks you see with using swmr when reading from the ess filesystem?

  • Overhead. I couldn't find anything quantifying the overhead, only a release note saying that it was recently reduced but not eliminated.
  • Miscommunication. You say 'our intent is to read, possibly while someone is writing to the file'. But the second clause is not correct. We do not support that case.

Also keep in mind, that this code is used for all input NeXus files, not just the ones that actually are written in SWMR mode on ESS hardware.

Can we just keep in simple? We can always enable SWMR if needed later. I don't want to do anything extra that has currently no effect, just in case our data storage gets changed fundamentally and we can start using the feature.

Copy link
Contributor

@jokasimr jokasimr Feb 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I see it using swmr when reading from the ess filesystem is "keeping it simple" because that adheres to the contract with the people writing the files and that might prevent unforseen issues.

Miscommunication. You say 'our intent is to read, possibly while someone is writing to the file'. But the second clause is not correct. We do not support that case.

This is not the mechanism we should use to enforce that. In my opinion that should be done on a workflow by workflow basis. Some workflows could rather easily support that case by, for example, masking events in regions without monitor data.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this PR was not the best place for me to raise this. It's a question for the integration testing repo.

)
except OSError as err:
if err.errno == errno.EROFS:
# Failed to open because the filesystem is read-only.
# (According to https://www.ioplex.com/%7Emiallen/errcmpp.html
# this error code is universal.)
#
# On ESS machines, this happens for network filesystems of data that was
# ingested into SciCat, including raw data.
# In this case, it is safe to open the file without locking because:
# - For raw files, they were written on a separate machine and are synced
# with the one running reduction software. So there cannot be concurrent
# write and read accesses to the same file on the same filesystem.
# The ground truth on the filesystem used by the file writer is protected
# and cannot be corrupted by our reader.
# - For processed data, the file was copied to the read-only filesystem.
# So the copy we are opening was not written by HDF5 directly and thus
# locking has no effect anyway.
#
# When running on user machines, disabling locking can potentially corrupt
# files. But the risk is minimal because very few users will have read-only
# filesystems and do concurrent reads and writes.
if _attempt_to_open_without_locking(err, locking):
return _open_nexus_file_from_path(file_path, definitions, locking=False)
raise


# On ESS machines, some network filesystems are read-only.
# E.g., data that was ingested into SciCat, including raw data.
# HDF5 fails to open such files because it cannot lock the files.
# In this case, it is safe(*) to open the file without locking because:
#
# - For raw files, they were written on a separate machine and are synced
# with the one running reduction software. So there cannot be concurrent
# write and read accesses to the same file on the same filesystem.
# The ground truth on the filesystem used by the file writer is protected
# and cannot be corrupted by our reader.
# - For processed data, the file was copied to the read-only filesystem.
# So the copy we are opening was not written by HDF5 directly and thus
# locking has no effect anyway.
#
# When running on user machines, disabling locking can potentially corrupt
# files. But the risk is minimal because very few users will have read-only
# filesystems and do concurrent reads and writes.
#
# (*) Files on the read-only filesystem may still change while a file is open for
# reading if they get updated from the original file. E.g., when reading a file that is
# currently being written to. This can crash the reader. But our code is anyway not set
# up to deal with changing files, so the added risk is not significant.
#
# See https://github.com/HDFGroup/hdf5/blob/e9ab45f0f4d7240937d5f88055f6c217da80f0d4/doxygen/dox/file-locking.dox
# about HDF5 file locking.
def _attempt_to_open_without_locking(
err: OSError, locking: bool | str | None | NoLockingIfNeededType
) -> bool:
if locking is not NoLockingIfNeeded:
return False # Respect user's choice.
if err.errno == errno.EROFS:
# Read-only filesystem.
# (According to https://www.ioplex.com/%7Emiallen/errcmpp.html
# this error code is universal.)
return True

# HDF5 tracks file locking flags internally within a single process.
# If the same file is opened multiple times, we can get a flag mismatch.
# We can try opening without locking, maybe this matches the original flags.
if "file locking flag values don't match" in err.args[0]:
return True
if "file locking 'ignore disabled locks' flag values don't match" in err.args[0]:
return True
return False


def _open_nexus_file_from_path(
file_path: FilePath,
definitions: Mapping | None | NoNewDefinitionsType,
Expand Down
68 changes: 67 additions & 1 deletion tests/nexus/nexus_loader_test.py
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,8 @@
import scippnexus as snx

from ess.reduce import nexus
from ess.reduce.nexus.types import NeXusLocationSpec
from ess.reduce.nexus._nexus_loader import NoLockingIfNeeded
from ess.reduce.nexus.types import FilePath, NeXusLocationSpec

year_zero = sc.datetime('1970-01-01T00:00:00')

Expand Down Expand Up @@ -625,3 +626,68 @@ def compute_component_position_returns_input_if_no_depends_on() -> None:
dg = sc.DataGroup(position=sc.vector([1, 2, 3], unit='m'))
result = nexus.compute_component_position(dg)
assert result is dg


# Some filesystems (e.g., for raw files at ESS) are read-only and
# h5py cannot open files on these systems with file locks.
# We cannot reasonably emulate this within Python tests.
# So the following tests only check the behaviour on a basic level.
# The tests use the private `_open_nexus_file` directly to focus on what matters.
#
# A file may already be open in this or another process.
# We should still be able to open it
@pytest.mark.parametrize(
"locks",
[
(False, False),
(True, True),
(None, None),
(NoLockingIfNeeded, NoLockingIfNeeded),
# These are ok because the second user adapts to the first:
(False, NoLockingIfNeeded),
(None, NoLockingIfNeeded),
# This would fail on a read-only filesystem because the second case can't adapt:
(NoLockingIfNeeded, None),
],
)
def test_open_nexus_file_multiple_times(tmp_path: Path, locks: tuple[Any, Any]) -> None:
from ess.reduce.nexus._nexus_loader import _open_nexus_file

path = FilePath(tmp_path / "file.nxs")
with snx.File(path, "w"):
pass
with _open_nexus_file(path, locking=locks[0]) as f1:
with _open_nexus_file(path, locking=locks[1]) as f2:
assert f1.name == f2.name


@pytest.mark.parametrize(
"locks",
[
(True, False),
(True, None),
(False, True),
(False, None),
(None, True),
(None, False),
# On a read-only filesystem, this would work:
(NoLockingIfNeeded, False),
# This could be supported, but it could cause problems because the first
# user expects the file to be locked.
(True, NoLockingIfNeeded),
# Same as above but with roles reversed:
(NoLockingIfNeeded, True),
],
)
def test_open_nexus_file_with_mismatched_locking(
tmp_path: Path, locks: tuple[Any, Any]
) -> None:
from ess.reduce.nexus._nexus_loader import _open_nexus_file

path = FilePath(tmp_path / "file.nxs")
with snx.File(path, "w"):
pass

with _open_nexus_file(path, locking=locks[0]):
with pytest.raises(OSError, match="flag values don't match"):
_ = _open_nexus_file(path, locking=locks[1])