Skip to content
This repository was archived by the owner on Oct 24, 2024. It is now read-only.

Why are tree nodes and DataArrays treated equally by __getitem__() and __setitem__()? #211

Closed
shaprann opened this issue Jan 26, 2023 · 3 comments

Comments

@shaprann
Copy link

Hello everyone! I have recently discovered your project and I'm very excited about the direction it is going! I see a lot of ways it can help me to better organize my Xarray Datasets and write cleaner code.

Being new to DataTree, I was quite confused when I discovered that my Datasets are not actually stored in the tree nodes. It took me some time to dig into the development history of the project and discover that this behavior was mainly introduced because of the name collision problem. After reading this page, the commit history, and a lot of related discussions, I believe I can now more or less understand why the things were implemented the way they are.

However, I can't stop thinking about one thing: why does DataTree even allow us to access DataArrays directly like this dt["data_array"] in the first place? As a new user I would expect that __getitem__() and __setitem__() would only work to navigate the tree nodes, and that's it. Intuitively I would then access a particular DataArray via the Dataset variable: dt.ds["data_array"].

In fact, I was wondering about this behavior even before I knew about the name collision issue, namely when I was learning how to polulate my tree with data. It turned out that creating a new node via dt["level1/level2/level3"] = something would either create a new node at level3 as expected, or in some cases it would instead create a new Dataset at level2 and put my data into a DataArray named "level3", which I found quite confusing.

Since a logical separation of tree nodes and DataArrays could help solve the name collision problem (at least before we try to write the result to NetCDF) and in my opinion would be more intuitive, I assume that this "equivalence" of tree nodes and DataArrays has a very strong reason to be implemented, thus I feel like I'm missing something important. I'd be thankful if someone could point me to the right direction on this one. I need to clarify, that even though I am an active Xarray user, I am not particularly familiar with the NetCDF format or with the CF conventions. So maybe my confusion comes from the fact that I am viewing this issue mostly from the Xarray user perspective.

@TomNicholas
Copy link
Member

TomNicholas commented Jan 27, 2023

Hi @shaprann - thanks for taking the time to raise this thoughtful issue! Let me try and respond to your individual points.

Being new to DataTree, I was quite confused when I discovered that my Datasets are not actually stored in the tree nodes.

The design of DataTree is somewhat similar to xarray.Dataset itself in this sense - I would be interested to know if you were similarly confused to discover that xarray.Dataset doesn't actually store DataArray objects? (Instead it actually stores Variable objects and reconstructs DataArray objects on __getitem__.)

It took me some time to dig into the development history

For reference, the first mention of the name collision problem was here, and the original suggestion for storing variables at the node level was here.

It turned out that creating a new node via dt["level1/level2/level3"] = something would either create a new node at level3 as expected, or in some cases it would instead create a new Dataset at level2 and put my data into a DataArray named "level3", which I found quite confusing.

I'm sorry to hear this was confusing. The basic rule here was intended to be that if "something" is a mapping (Dataset or DataTree) you are creating a new node, if it's a single variable-like object you are creating a DataArray. There is definitely a usability issue here in that the __setitem__ method is very overloaded. (And if we implemented #200 it would be further overloaded still...)

To what extent do you think we might solve this confusion through better documentation?


However, I can't stop thinking about one thing: why does DataTree even allow us to access DataArrays directly like this dt["data_array"] in the first place? As a new user I would expect that __getitem__() and __setitem__() would only work to navigate the tree nodes, and that's it. Intuitively I would then access a particular DataArray via the Dataset variable: dt.ds["data_array"].

I assume that this "equivalence" of tree nodes and DataArrays has a very strong reason to be implemented

I don't know about "very strong" reasons, but there were definitely some reasons:

  1. Matching file format specifications

This design maps onto Zarr model, as in:

>>> root['foo/bar']
<zarr.hierarchy.Group '/foo/bar'>
>>> root['foo/bar/baz']
<zarr.core.Array '/foo/bar/baz' (10000, 10000) int32>

In conjunction with each node's contents being entirely separate, this means there is a one-to-one correspondence between a single Zarr store and a single DataTree object. This is a very very nice property to be able to guarantee.

I am not particularly familiar with the NetCDF format

The requirement that node names and variable names don't collide is also present in netcdf, and I think comes from the fact that a HDF5 file is basically pointing to folders in an actual filesystem underneath (zarr groups are stored in a similar way on disk too).

or with the CF conventions

Those we are not actually obeying in datatree (or in xarray). Technically obeying them would imply "inheritance" of individual variables from parent nodes, which is a whole other topic.

  1. Neatly solving the name collision problem

There are two main options for avoiding name collisions:
(a) store variables and nodes to as separate mappings (i.e. dt vs dt.ds), or
(b) store at the same level but restrict choices of name.

In order to have a simple correspondence with Zarr/netCDF (see (1)) you need to restrict the names. If you restrict the names it would be weird to store them in seemingly-unrelated mappings, i.e. you should use (b). In that case dt alone as a mapping interface then is technically sufficient, and dt.ds is just for convenience.

  1. Allowing you to pretend a node is a Dataset

Accessing DataArrays via dt['var'] allows you to mostly treat a node like a Dataset, and if it turns out to contain children then dataset method operations are automatically mapped down to them too. If you have a node with no children (i.e. a leaf node), it literally behaves identically to an xarray.Dataset but has some extra tree-related properties and methods. This also means that your code can look like dt['child']['var'] instead of dt['child'].ds['var']. (I acknowledge that there is some redundancy in treating a node like a dataset and also having a .ds property.)


@shaprann do you have a specific suggestion of how you think it should work instead? Forbid variable access on dt in favour of access only via dt.ds?

If we were to make a change like this it should happen now, before integration into xarray upstream.

cc @shoyer @jhamman @keewis @dcherian for opinions

@shaprann
Copy link
Author

Hey @TomNicholas, thank you for such a detailed response, it totally answers my question and gives me some valuable insights into the project as a whole!

Actually, this

This design maps onto Zarr model

is already enough for me to justify that access syntax, I just wasn't that much aware of the Zarr data model.

But my confusion also came from the fact that I conceptually thought of a DataTree object as a "container" for multiple Datasets, similar to how a Dataset is a "container" for multiple DataArrays, at least from the user point of view. However, now I realize that a DataTree object actually represents the current node, which can be thought as a "replacement" for the Dataset proper. So it totally makes sense that we can access DataArrays using __getitem__(). What led to this initial confusion was the lack of conceptual separation between the "DataTree" as a whole and a particular "NodeDataset" (which essentially acts like an xr.Dataset), since in current implementation both are the same thing.

I believe that this issue can be closed, since my original question has been answered.

As for your last question, I think I'd better join the the discussion in #80 because it would be a more appropriate place. Thank you for the answer once again!

@TomNicholas
Copy link
Member

I'm glad that answered your question @shaprann !

What led to this initial confusion was the lack of conceptual separation between the "DataTree" as a whole and a particular "NodeDataset" (which essentially acts like an xr.Dataset), since in current implementation both are the same thing.

Yes, I should try to make this clearer in the documentation somewhere.

Given that we already have a place to discuss this in #80, I'm going to close this for now.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

2 participants