📰 Custom Issue
TLDR; `AncillaryVariable` data payloads are read multiple times during the merge process. Cubes often carry `AncillaryVariable`s the same shape as their data (which can be BIG), and this causes a lot of I/O and very slow merge times.
Details
When merging multiple cubes that contain an `AncillaryVariable`, both the ancillary metadata and the variable data are compared for equality (iris/lib/iris/_merge.py, lines 448 to 449 in 94b80d0):

```python
if self.ancillary_variables_and_dims != other.ancillary_variables_and_dims:
    msgs.append("cube.ancillary_variables differ")
```
via the method inherited from the `_DimensionalMetadata` base class (iris/lib/iris/coords.py, lines 665 to 669 in 94b80d0):

```python
# data values comparison
if eq and eq is not NotImplemented:
    eq = iris.util.array_equal(
        self._core_values(), other._core_values(), withnans=True
    )
```
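To see why this data-values comparison is costly, here is a minimal sketch (stand-in classes, not iris itself) of what happens when an `__eq__` that realises the full payload is applied pairwise across merge candidates:

```python
# Sketch: each equality check realises the data payload, and merge-style
# candidate matching compares cubes pairwise, multiplying the reads.
import itertools

class FakeAncillaryVariable:
    """Stand-in for AncillaryVariable whose data is lazily loaded."""
    reads = 0  # counts how many times the payload is read from "disk"

    def __init__(self, name):
        self.metadata = name

    def _core_values(self):
        # Simulates reading a large array from disk.
        FakeAncillaryVariable.reads += 1
        return [0] * 1000

    def __eq__(self, other):
        # Mirrors the _DimensionalMetadata pattern: metadata check, then a
        # full data-values comparison that forces the payload to load.
        return (self.metadata == other.metadata
                and self._core_values() == other._core_values())

ancils = [FakeAncillaryVariable("status_flag") for _ in range(5)]
# Pairwise comparisons, roughly as merge does across candidate cubes:
all(a == b for a, b in itertools.combinations(ancils, 2))
print(FakeAncillaryVariable.reads)  # 5 objects -> 10 pairs -> 20 reads
```

With real lazy data the reads are full-size array loads, which is where the I/O hit described below comes from.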
This makes sense in the context of merging: by design, merge will only expand scalar variables and expects all the other dimensional-metadata-like objects on the cube to be the same.
However, `AncillaryVariable`s often form some sort of status-flag data for the cube data, and in that case the user likely wants to concatenate them into a single cube (assuming, for instance, that they have separate files per timestep for a variable). This can be achieved by adding a new axis to the cube and ancillary variable prior to concatenation, as detailed in #6790.
However, in this case the merge process still checks every `Cube`'s ancillary variable against every other candidate `Cube`'s. As `AncillaryVariable`s are often the same size as the cube data (which can be very large), the ancillary data is repeatedly read in from disk during the merge. This is a potentially big I/O hit and can result in very slow merge times for large datasets.
A workaround for a specific case was to patch the `__eq__` operator on `AncillaryVariable` to check only the metadata:

```python
import iris
import iris.coords

# Custom equality operator for the AncillaryVariable class:
def ancil_eq(self, other):
    if other is self:
        return True
    if hasattr(other, "metadata"):
        # Metadata-only comparison; skips reading the data payload.
        return self.metadata == other.metadata
    return NotImplemented

# Patch the AncillaryVariable __eq__ operator:
orig_eq_method = iris.coords.AncillaryVariable.__eq__
iris.coords.AncillaryVariable.__eq__ = ancil_eq

iris.load(...)

# Revert the patch:
iris.coords.AncillaryVariable.__eq__ = orig_eq_method
```
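An exception-safe variant of this patch is to wrap it in a context manager so the original `__eq__` is restored even if the load raises. The `patched_attr` helper below is illustrative, not part of iris; it is demonstrated on a dummy class so the sketch is self-contained:

```python
from contextlib import contextmanager

@contextmanager
def patched_attr(obj, name, replacement):
    """Temporarily replace obj.name, restoring it on exit."""
    original = getattr(obj, name)
    setattr(obj, name, replacement)
    try:
        yield
    finally:
        # Restored even if the body raises.
        setattr(obj, name, original)

# Usage against iris would look like (assuming ancil_eq from above):
#   with patched_attr(iris.coords.AncillaryVariable, "__eq__", ancil_eq):
#       cubes = iris.load(...)

# Self-contained demonstration on a dummy class:
class Dummy:
    def __eq__(self, other):
        return False

metadata_only = lambda self, other: True
d1, d2 = Dummy(), Dummy()
with patched_attr(Dummy, "__eq__", metadata_only):
    assert d1 == d2          # patched behaviour in effect
assert not (d1 == d2)        # original behaviour restored
```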
Obviously, this only works in specific cases where the user knows it is safe to ignore the value comparison of the AncillaryVariable data.
Potential solution
It might be possible to pass a `check_ancils` flag to `iris.cube.CubeList.merge()` (and `.merge_cube()`) in the same way as `iris.cube.CubeList.concatenate()`. This would allow the user to optionally turn off the comparison of ancillary data (just compare the metadata) if they are confident it is safe to do so with their data files and they intend to concatenate multiple cubes with an `AncillaryVariable` into a single `Cube`.
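As a rough sketch of what such a switch might look like, the function below compares ancillary variables either fully or metadata-only. The name `ancillary_variables_match` and the stand-in objects are illustrative, not the actual iris internals:

```python
from types import SimpleNamespace

def ancillary_variables_match(cube_a, cube_b, check_ancils=True):
    """Compare ancillary variables, optionally skipping the data payload."""
    a = cube_a.ancillary_variables_and_dims
    b = cube_b.ancillary_variables_and_dims
    if not check_ancils:
        # Metadata-only comparison: cheap, no data values are realised.
        return ([(av.metadata, dims) for av, dims in a]
                == [(av.metadata, dims) for av, dims in b])
    # Full comparison, including data values (the current behaviour).
    return a == b

# Demonstration with stand-in objects: same metadata, different data.
av1 = SimpleNamespace(metadata="status_flag", data=[1, 2])
av2 = SimpleNamespace(metadata="status_flag", data=[3, 4])
cube_a = SimpleNamespace(ancillary_variables_and_dims=[(av1, (0,))])
cube_b = SimpleNamespace(ancillary_variables_and_dims=[(av2, (0,))])

print(ancillary_variables_match(cube_a, cube_b))                      # False
print(ancillary_variables_match(cube_a, cube_b, check_ancils=False))  # True
```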