
Datatree prototyping #360


Open
wants to merge 11 commits into base: main
2 changes: 2 additions & 0 deletions src/xradio/datatree/__init__.py
@@ -0,0 +1,2 @@
from xradio.datatree.datatree_builder import DatatreeBuilder
import xradio.datatree.datatree_accessor # noqa
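
For context, importing the package is enough to register the accessor, because the datatree_accessor module calls xarray.register_datatree_accessor at import time. A minimal usage sketch (the zarr path below is illustrative only):

import xarray
import xradio.datatree  # noqa: F401 - side effect: registers the "msa" accessor

dt = xarray.open_datatree("/path/to/some.vis_datatree.zarr", engine="zarr")
dt.msa  # MeasurementSetAccessor instance for the root node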
136 changes: 136 additions & 0 deletions src/xradio/datatree/datatree_accessor.py
@@ -0,0 +1,136 @@
import xarray
from xarray import DataTree

from xradio.datatree.datatree_builder import VALID_OPTIONS

class DatasetNotFound(ValueError):
    pass

class InvalidAccessorLocation(ValueError):
    pass


VISIBILITY_DATASET_TYPES = {"visibility", "spectrum", "wvr"}
SECONDARY_DATASET_TYPES = {"antenna", "field_and_source", "gain_curve", "system_calibration", "weather"}
DATASET_TYPES = VISIBILITY_DATASET_TYPES | SECONDARY_DATASET_TYPES


@xarray.register_datatree_accessor("msa")
Collaborator

Hm, I would at least drop the "a". This needs to be memorable - an alternative might be msv4? Though I'm not sure what we will do with image or calibration data models then.

Author

msa is an acronym for Measurement Set Accessor, which signals its purpose fairly clearly IMO. To me, ms or msv4 implies that the Measurement Set is an attribute of the main Datatree structure, when actually the Measurement Set is equivalent to the Datatree.

Regarding image and calibration data models, it should be possible to create separate accessors for them. Having written the previous paragraph, dt.cala and dt.imga seem aesthetically unappealing to me => dt.cal and dt.{img,image} are nicer IMO.

Maybe dt.ms is a good compromise for visibility data. Other ideas:

  • dt.msx
  • dt.imgx
  • dt.calx

which phonetically might imply the accessor idea.

Collaborator @scpmw, Feb 13, 2025

It's effectively a namespace, so I'm still somewhat partial to xrd. msx also has a nice ring to it.

Author

It's effectively a namespace, so I'm still somewhat partial to xrd. msx also has a nice ring to it.

I would argue that xrd is an implementation of the spec, while ms{v4} is the spec.

Author @sjperkins, Feb 13, 2025

As the example details a geo accessor (presumably for geosciences), how about using a radio astronomy related acronym:

  • xra: xarray radio astronomy
  • rax: radio astronomy xarray/axxessor

ra may be too close to right ascension.

class MeasurementSetAccessor:
    _dt: DataTree
    _option: float
    _remove_suffix: bool

    def __init__(self, datatree: DataTree):
        self._dt = datatree
        root = self._dt.root
Collaborator

Wondering whether this would also be the place for a schema check... Could see both sides: it'd be nice if you could just go ds.msv4 and immediately have some likely assumptions checked. However, it might become bothersome if we'd end up artificially gating something behind it. We'd likely want to provide some sort of "skip schema check" environment.

Author

I suspect that this is called on every Datatree node, so there might be performance implications. Another place to do the checks might be on the accessor: datatree.ms.validate(), for example.
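
A minimal sketch of what such an explicit, on-demand check could look like, assuming a hypothetical validate() method added to the accessor class below (the check itself is illustrative, not the actual MSv4 schema):

    def validate(self) -> None:
        # Hypothetical: runs only when the user asks, so accessor creation stays cheap
        node_type = self._dt.attrs.get("type")
        if node_type not in DATASET_TYPES:
            raise ValueError(
                f"{self._dt.path}: unexpected 'type' attribute {node_type!r}, "
                f"expected one of {sorted(DATASET_TYPES)}")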

Collaborator

I think the accessor gets created lazily - so the check would "only" happen the first time you use the accessor on any particular datatree node. Which would seem reasonable to me...

Author

Just did a quick check and it seems that it creates an accessor for each node. Filtering doesn't seem to create new nodes (or accessors), but admittedly I haven't tried anything complicated re: filtering.

$ python src/xradio/datatree/dt_convert.py ~/data/VLBA_TL016B_split_lsrk.vis.zarr/
Datatree Build Strategy
  * Datatree proposal option:'1.0'
  * Data in '/home/simon/data/VLBA_TL016B_split_lsrk.vis.zarr' will be copied to
    '/home/simon/data/VLBA_TL016B_split_lsrk.vis_datatree.zarr'
  * '/home/simon/data/VLBA_TL016B_split_lsrk.vis_datatree.zarr' will be overwritten
  * Metadata will be consolidated at the Datatree root
dt = xarray.open_datatree("/home/simon/data/VLBA_TL016B_split_lsrk.vis_datatree.zarr", engine="zarr")
Creating MS Accessor /
Creating MS Accessor /VLBA_TL016B_split_0
Creating MS Accessor /VLBA_TL016B_split_1
Creating MS Accessor /VLBA_TL016B_split_2
Creating MS Accessor /VLBA_TL016B_split_3
Creating MS Accessor /VLBA_TL016B_split_0/antenna_xds
Creating MS Accessor /VLBA_TL016B_split_0/correlated_xds
Creating MS Accessor /VLBA_TL016B_split_0/field_and_source_xds_base
Creating MS Accessor /VLBA_TL016B_split_0/gain_curve_xds
Creating MS Accessor /VLBA_TL016B_split_0/system_calibration_xds
Creating MS Accessor /VLBA_TL016B_split_0/weather_xds
Creating MS Accessor /VLBA_TL016B_split_1/antenna_xds
Creating MS Accessor /VLBA_TL016B_split_1/correlated_xds
Creating MS Accessor /VLBA_TL016B_split_1/field_and_source_xds_base
Creating MS Accessor /VLBA_TL016B_split_1/gain_curve_xds
Creating MS Accessor /VLBA_TL016B_split_1/system_calibration_xds
Creating MS Accessor /VLBA_TL016B_split_1/weather_xds
Creating MS Accessor /VLBA_TL016B_split_2/antenna_xds
Creating MS Accessor /VLBA_TL016B_split_2/correlated_xds
Creating MS Accessor /VLBA_TL016B_split_2/field_and_source_xds_base
Creating MS Accessor /VLBA_TL016B_split_2/gain_curve_xds
Creating MS Accessor /VLBA_TL016B_split_2/system_calibration_xds
Creating MS Accessor /VLBA_TL016B_split_2/weather_xds
Creating MS Accessor /VLBA_TL016B_split_3/antenna_xds
Creating MS Accessor /VLBA_TL016B_split_3/correlated_xds
Creating MS Accessor /VLBA_TL016B_split_3/field_and_source_xds_base
Creating MS Accessor /VLBA_TL016B_split_3/gain_curve_xds
Creating MS Accessor /VLBA_TL016B_split_3/system_calibration_xds
Creating MS Accessor /VLBA_TL016B_split_3/weather_xds
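
For reference, the per-node behaviour can be reproduced with a toy accessor: a print in __init__ fires once per node the accessor is requested on. This is a sketch, not the actual instrumentation used to produce the output above:

import xarray

@xarray.register_datatree_accessor("demo")
class DemoAccessor:
    def __init__(self, datatree):
        # Accessor construction happens per DataTree node
        print(f"Creating accessor for {datatree.path}")
        self._dt = datatree

dt = xarray.DataTree.from_dict({"/a": xarray.Dataset(), "/b": xarray.Dataset()})
dt["a"].demo  # prints: Creating accessor for /a
dt["b"].demo  # prints: Creating accessor for /b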


        # option and remove_suffix would be unnecessary in a complete solution
        # but are used here to delegate based on the different prototype options
        # and configurations
        try:
            option = float(root.attrs["__datatree_proposal_option__"])
        except (KeyError, ValueError):
            option = 1.0

        if option not in VALID_OPTIONS:
            raise ValueError(f"{option} is not a valid DataTree Proposal option {VALID_OPTIONS}")

        self._option = option

        try:
            remove_suffix = bool(root.attrs["__datatree_proposal_remove_suffix__"])
        except (KeyError, ValueError):
            remove_suffix = False

        self._remove_suffix = remove_suffix

    def _maybe_rename_dataset(self, dataset: str) -> str:
        return dataset.replace("_xds", "") if self._remove_suffix else dataset

    @property
    def option(self) -> float:
        """ DataTree proposal option getter """
        return self._option

    @option.setter
    def option(self, value: float) -> None:
        """ DataTree proposal option setter """
        if value not in VALID_OPTIONS:
            raise ValueError(f"{value} is not a valid DataTree Proposal option {VALID_OPTIONS}")

        self._option = value
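
Hypothetical usage of the option property above, assuming a tree opened as in the console output earlier (the path is illustrative and 2.5 is assumed to be in VALID_OPTIONS):

import xarray
import xradio.datatree  # noqa: F401 - registers the "msa" accessor

dt = xarray.open_datatree("/path/to/some.vis_datatree.zarr", engine="zarr")
print(dt.msa.option)  # falls back to 1.0 when the root attribute is absent
dt.msa.option = 2.5   # switch the navigation strategy for this accessor

Since xarray caches accessor instances per object, the assignment above only affects the accessor of the node it was made on.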

    def _get_optional_dataset(self, dataset: str) -> DataTree | None:
        name = self._maybe_rename_dataset(dataset)

        if self._dt.attrs.get("type") not in {"visibility", "spectrum", "wvr"}:
            raise InvalidAccessorLocation(
                f"{self._dt.path} is not a visibility node. "
                f"There is no {name} dataset associated with it.")

        try:
            if self._option == 1.0:
                other_ds = self._dt.siblings[name]
            elif self._option == 2.0:
                other_ds = self._dt.children[name]
            elif self._option == 2.5:
                other_ds = self._dt.children[name]
            elif self._option == 3.0:
                partition_str = self._dt.path.split("/")[-1]
                partition = int(partition_str[len("xds"):])
                other_ds = self._dt.root[f"/{name}/xds{partition}"]
            else:
                raise ValueError(f"Invalid option {self._option}")
        except KeyError:
            return None

        return other_ds

    def field_and_source(self, data_group_name="base") -> DataTree | None:
        return self._get_optional_dataset(f"field_and_source_xds_{data_group_name}")

    @property
    def gain_curve(self) -> DataTree | None:
        return self._get_optional_dataset("gain_curve_xds")

    @property
    def phase_calibration(self) -> DataTree | None:
        return self._get_optional_dataset("phase_calibration_xds")

    @property
    def system_calibration(self) -> DataTree | None:
        return self._get_optional_dataset("system_calibration_xds")

    @property
    def weather(self) -> DataTree | None:
        return self._get_optional_dataset("weather_xds")

    @property
    def antennas(self) -> DataTree:
        """ Access the antenna dataset """
        if self._dt.attrs.get("type") not in VISIBILITY_DATASET_TYPES:
            raise InvalidAccessorLocation(
                f"{self._dt.path} is not a visibility node. "
                f"There is no antenna dataset connected with it.")

        name = self._maybe_rename_dataset("antenna_xds")

        try:
            if self._option == 1.0:
                antenna_ds = self._dt.siblings[name]
            elif self._option == 2.0:
                antenna_ds = self._dt.children[name]
            elif self._option == 2.5:
                antenna_ds = self._dt.root[name]
Collaborator

To be clear - proposal 2.5 was meant to suggest that datasets such as antenna_xds could appear either below the correlated data dataset or at any node above. That's why it's 2.5: it is a generalisation. And the same logic should apply to weather etc. (a shared weather table doesn't sound crazy at all).

Author

Thanks for clarifying. My immediate thought here is that the flexibility is going to be confusing for software development, but it is probably solvable with links (which have not been explored in this PR yet).

Would it be possible to divide secondary datasets into shareable/non-shareable categories, where the former are located as children of the tree root, while the rest are children or siblings of the partition?

I like option 2.5 as it uses Datatree's coordinate inheritance, which should work seamlessly with @Jan-Willem's work on interpolating visibility coordinates in secondary datasets. This seems to work on the single VLBA dataset I've been testing on, but it probably needs to be smoke tested on the rest of the MSv4 test data.

This might be better to discuss in person in the WG.
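
As a side note on what coordinate inheritance buys here: a coordinate defined on a parent node is visible from its descendants when the dimensions align. A small sketch with made-up data (the variable and coordinate names are illustrative):

import xarray

dt = xarray.DataTree.from_dict({
    "/": xarray.Dataset(coords={"antenna_name": ["ea01", "ea02"]}),
    "/weather_xds": xarray.Dataset({"TEMPERATURE": ("antenna_name", [280.0, 281.0])}),
})
# The child's dataset view includes the antenna_name coordinate inherited from the root
print(dt["weather_xds"].dataset)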

Author

but it is probably solvable with links (which have not been explored in this PR yet).

One of the findings I take from this prototype is that it's fairly easy to navigate the various options using the standard DataTree methods (root, parent, children, siblings).

Author

To be clear - proposal 2.5 was meant to suggest that datasets such as antenna_xds could appear either below the correlated data dataset or at any node above. That's why it's 2.5: it is a generalisation. And the same logic should apply to weather etc. (a shared weather table doesn't sound crazy at all).

Maybe this is possible without links. What about this idea?

  • If a partition has a child/sibling antenna dataset then that takes precedence over a global antenna dataset
  • Same logic applies for other secondary datasets

This could be enforced via the dt.msa.antenna accessor.

Author @sjperkins, Feb 13, 2025

Maybe this is possible without links. What about this idea?

  • If a partition has a child/sibling antenna dataset then that takes precedence over a global antenna dataset
  • Same logic applies for other secondary datasets

This could be enforced via the dt.msa.antenna accessor.

The danger with this approach may be global/partition datasets getting out of sync. Perhaps it should be enforced as global xor partition by both the schema checker and the accessor.

This would be a relaxation of the following:

Would it be possible to divide secondary datasets into shareable/non-shareable categories, where the former are located as children of the tree root, while the rest are children or siblings of the partition?

Collaborator

Maybe this is possible without links. What about this idea?

  • If a partition has a child/sibling antenna dataset then that takes precedence over a global antenna dataset
  • Same logic applies for other secondary datasets

Yes, that was the proposal. I'm giving pseudo-code on Confluence:

# Option 2.5
for parent in pathlib.parents(path):
    if 'ANTENNA' in parent.children:
        antenna = parent.children['ANTENNA']
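
A runnable version of that lookup might look like the sketch below, using only DataTree's children and parent attributes; the child name "antenna_xds" follows the naming used elsewhere in this PR:

from xarray import DataTree

def find_antenna(node: DataTree) -> DataTree | None:
    # Walk from the current node towards the root and return the first
    # antenna dataset found, so a partition-local antenna_xds takes
    # precedence over one shared at a higher level.
    current = node
    while current is not None:
        if "antenna_xds" in current.children:
            return current.children["antenna_xds"]
        current = current.parent
    return None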

Collaborator

And just to be clear - I would just propose to support this in the accessor, but not necessarily go out of our way to actually make systematic use of it. It's just to retain the optimisation option.

            elif self._option == 3.0:
                partition_str = self._dt.path.split("/")[-1]
                partition = int(partition_str[len("xds"):])
                antenna_ds = self._dt.root[f"/{name}/xds{partition}"]
            else:
                raise ValueError(f"Invalid option {self._option}")
        except KeyError as e:
            raise DatasetNotFound(f"No antenna dataset found relative to {self._dt.path}") from e

        return antenna_ds
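
Putting the pieces together, a hypothetical usage sketch against one of the partitions from the console output above; it assumes the option 1.0 layout produced there, where secondary datasets are siblings of the correlated data node and the correlated node carries a visibility type attribute:

import xarray
import xradio.datatree  # noqa: F401 - registers the "msa" accessor
from xradio.datatree.datatree_accessor import DatasetNotFound

dt = xarray.open_datatree(
    "/home/simon/data/VLBA_TL016B_split_lsrk.vis_datatree.zarr", engine="zarr")
correlated = dt["VLBA_TL016B_split_0/correlated_xds"]

weather = correlated.msa.weather                            # DataTree or None
field_and_source = correlated.msa.field_and_source("base")  # DataTree or None

try:
    antennas = correlated.msa.antennas
except DatasetNotFound:
    antennas = None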
