-
Notifications
You must be signed in to change notification settings - Fork 10
Datatree prototyping #360
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Datatree prototyping #360
Changes from all commits
4b463b9
b4341b6
d75f7f9
335827e
720bda2
b5205cf
6b2abe7
6efad6a
330996f
eeeb30d
b4c2b4b
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
from xradio.datatree.datatree_builder import DatatreeBuilder | ||
import xradio.datatree.datatree_accessor # noqa |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,136 @@ | ||
import xarray | ||
from xarray import DataTree | ||
|
||
from xradio.datatree.datatree_builder import VALID_OPTIONS | ||
|
||
class DatasetNotFound(ValueError): | ||
pass | ||
|
||
class InvalidAccessorLocation(ValueError): | ||
pass | ||
|
||
|
||
VISIBILITY_DATASET_TYPES = {"visibility", "spectrum", "wvr"} | ||
SECONDARY_DATASET_TYPES = {"antenna", "field_and_source", "gain_curve", "system_calibration", "weather"} | ||
DATASET_TYPES = VISIBILITY_DATASET_TYPES | SECONDARY_DATASET_TYPES | ||
|
||
|
||
@xarray.register_datatree_accessor("msa") | ||
class MeasurementSetAccessor: | ||
_dt: DataTree | ||
_option: float | ||
_remove_suffix: bool | ||
|
||
def __init__(self, datatree: DataTree): | ||
self._dt = datatree | ||
root = self._dt.root | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Wondering whether this would also be the place for a schema check... Could see both sides: I'd be nice if you could just go There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I suspect that this is called on every Datatree node so their might be performance implications. Another place to do the checks might be on the accessor. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think the accessor gets created lazily - so the check would "only" happen the first time you use the accessor on any particular datatree node. Which would seem reasonable to me... There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Just did a quick check and it seems that it creates an accessor for each node. Filtering doesn't not seem to create new nodes (or accessors) but admittedly I haven't tried anything complicated re: filtering.
|
||
|
||
# option and remove_suffix would be unnecessary in a complete solution | ||
# but are used here to delegate based on the different prototype options | ||
# and configurations | ||
try: | ||
option = float(root.attrs["__datatree_proposal_option__"]) | ||
except (KeyError, ValueError): | ||
option = 1.0 | ||
|
||
if option not in VALID_OPTIONS: | ||
raise ValueError(f"{option} is not a valid DataTree Proposal option {VALID_OPTIONS}") | ||
|
||
self._option = option | ||
|
||
try: | ||
remove_suffix = bool(root.attrs["__datatree_proposal_remove_suffix__"]) | ||
except (KeyError, ValueError): | ||
remove_suffix = False | ||
|
||
self._remove_suffix = remove_suffix | ||
|
||
def _maybe_rename_dataset(self, dataset: str) -> str: | ||
return dataset.replace("_xds", "") if self._remove_suffix else dataset | ||
|
||
@property | ||
def option(self) -> float: | ||
""" DataTree proposal option getter """ | ||
return self._option | ||
|
||
@option.setter | ||
def option(self, value: float) -> None: | ||
""" DataTree proposal option setter """ | ||
if value not in VALID_OPTIONS: | ||
raise ValueError(f"{value} is not a valid DataTree Proposal option {VALID_OPTIONS}") | ||
|
||
self._option = value | ||
|
||
def _get_optional_dataset(self, dataset: str) -> DataTree | None: | ||
name = self._maybe_rename_dataset(dataset) | ||
|
||
if self._dt.attrs.get("type") not in {"visibility", "spectrum", "wvr"}: | ||
raise InvalidAccessorLocation( | ||
f"{self._dt.path} is not a visibility node. " | ||
f"There is no {name} dataset associated with it.") | ||
|
||
try: | ||
if self._option == 1.0: | ||
other_ds = self._dt.siblings[name] | ||
elif self._option == 2.0: | ||
other_ds = self._dt.children[name] | ||
elif self._option == 2.5: | ||
other_ds = self._dt.children[name] | ||
elif self._option == 3.0: | ||
partition_str = self._dt.path.split("/")[-1] | ||
partition = int(partition_str[len("xds"):]) | ||
other_ds = self._dt.root[f"/{name}/xds{partition}"] | ||
else: | ||
raise ValueError(f"Invalid option {self._option}") | ||
except KeyError as e: | ||
return None | ||
|
||
return other_ds | ||
|
||
def field_and_source(self, data_group_name="base") -> DataTree | None: | ||
return self._get_optional_dataset(f"field_and_source_xds_{data_group_name}") | ||
|
||
@property | ||
def gain_curve(self) -> DataTree | None: | ||
return self._get_optional_dataset("gain_curve_xds") | ||
|
||
@property | ||
def phase_calibration(self) -> DataTree | None: | ||
return self._get_optional_dataset("phase_calibration_xds") | ||
|
||
@property | ||
def system_calibration(self) -> DataTree | None: | ||
return self._get_optional_dataset("system_calibration_xds") | ||
|
||
@property | ||
def weather(self) -> DataTree | None: | ||
return self._get_optional_dataset("weather_xds") | ||
|
||
@property | ||
def antennas(self) -> DataTree: | ||
""" Access the antenna dataset """ | ||
if self._dt.attrs.get("type") not in VISIBILITY_DATASET_TYPES: | ||
raise InvalidAccessorLocation( | ||
f"{self._dt.path} is not a visibility node. " | ||
f"There is no antenna dataset connected with it.") | ||
|
||
name = self._maybe_rename_dataset("antenna_xds") | ||
|
||
try: | ||
if self._option == 1.0: | ||
antenna_ds = self._dt.siblings[name] | ||
elif self._option == 2.0: | ||
antenna_ds = self._dt.children[name] | ||
elif self._option == 2.5: | ||
antenna_ds = self._dt.root[name] | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. To be clear - proposal 2.5 was meant to suggest that datasets such as There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks for clarifying. My immediate thought here is that the flexibility is going to be confusing for software development, but probably solvable with links (which are not been explored in this PR, yet). Would it be possible to divide secondary datasets into shareable/non-shareable categories where the former are located as a child of the tree root, while the rest are children or siblings of the partition? I like option 2.5 as it uses Datatree's coordinate inheritance which should work seamlessly with @Jan-Willem 's work on interpolating visibility coordinates in secondary dataset. This seems to work on the single VLBA I've been testing on but probably needs to be smoke tested on the rest of the MSv4 test data. This might be better to discuss in person in the wG There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
One of the findings I take from this prototype is that it's fairly easy to navigate the various options using the standard There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Maybe this is possible without links. What about this idea?
This could be enforced via the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
The danger with this approach maybe global/parition datasets getting out of sync. Perhaps it should be enforced as global xor partition by both the schema checker and accessor. This would be a relaxation of the following:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Yes, that was the proposal. I'm giving pseudo-code on Confluence:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. And just to be clear - I would just propose to support this in the accessor, but not necessarily go out of our way to actually make systematic use of it. It's just to retain the optimisation option. |
||
elif self._option == 3.0: | ||
partition_str = self._dt.path.split("/")[-1] | ||
partition = int(partition_str[len("xds"):]) | ||
antenna_ds = self._dt.root[f"/{name}/xds{partition}"] | ||
else: | ||
raise ValueError(f"Invalid option {self._option}") | ||
except KeyError as e: | ||
raise DatasetNotFound(f"No antenna dataset found relative to {self._dt.path}") from e | ||
|
||
return antenna_ds | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, would at least drop the "a". This needs to be memorable - alternative might be
msv4
? Though not sure what we will do with image or calibration datamodels then.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
msa
is an acronym forMeasurement Set Accessor
which signals its purpose fairly clearly IMO. To mems
ormsv4
impllies that the Measurement Set is an attribute of the main Datatree structure, when actually the Measurement Set is equivalent to the Datatee.Regarding
image
andcalibration
data models, it should be possible to create separate accessors for them. After, having written the previous paragraph,dt.cala
anddt.imga
seem aesthetically unappealing to me =>dt.cal
anddt.{img,image}
are nicer IMO.maybe
dt.ms
is a good compromise for visibility data. Other ideas:dt.msx
dt.imgx
dt.calx
which phonetically might imply the accessor idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I's effectively a namespace, so I'm still somewhat partial to
xrd
.msx
also has a nice ring to it.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would argue that
xrd
is an implementation of the spec, whilems{v4}
is the spec.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As the example details a
geo
accessor (presumably for geosciences) how about using a radio astronomy related acronoym:xra
:xarray radio astronomy
rax
:radio astronomy xarray/axxessor
ra
may be to close toright ascension