-
Notifications
You must be signed in to change notification settings - Fork 42
Why are tree nodes and DataArrays treated equally by __getitem__() and __setitem__()? #211
Comments
Hi @shaprann - thanks for taking the time to raise this thoughtful issue! Let me try and respond to your individual points.
The design of
For reference, the first mention of the name collision problem was here, and the original suggestion for storing variables at the node level was here.
I'm sorry to hear this was confusing. The basic rule here was intended to be that if "something" is a mapping ( To what extent do you think we might solve this confusion through better documentation?
I don't know about "very strong" reasons, but there were definitely some reasons:
This design maps onto Zarr model, as in: >>> root['foo/bar']
<zarr.hierarchy.Group '/foo/bar'>
>>> root['foo/bar/baz']
<zarr.core.Array '/foo/bar/baz' (10000, 10000) int32> In conjunction with each node's contents being entirely separate, this means there is a one-to-one correspondence between a single Zarr store and a single
The requirement that node names and variable names don't collide is also present in netcdf, and I think comes from the fact that a HDF5 file is basically pointing to folders in an actual filesystem underneath (zarr groups are stored in a similar way on disk too).
Those we are not actually obeying in datatree (or in xarray). Technically obeying them would imply "inheritance" of individual variables from parent nodes, which is a whole other topic.
There are two main options for avoiding name collisions: In order to have a simple correspondence with Zarr/netCDF (see (1)) you need to restrict the names. If you restrict the names it would be weird to store them in seemingly-unrelated mappings, i.e. you should use (b). In that case
Accessing DataArrays via @shaprann do you have a specific suggestion of how you think it should work instead? Forbid variable access on If we were to make a change like this it should happen now, before integration into xarray upstream. |
Hey @TomNicholas, thank you for such a detailed response, it totally answers my question and gives me some valuable insights into the project as a whole! Actually, this
is already enough for me to justify that access syntax, I just wasn't that much aware of the Zarr data model. But my confusion also came from the fact that I conceptually thought of a DataTree object as a "container" for multiple Datasets, similar to how a Dataset is a "container" for multiple DataArrays, at least from the user point of view. However, now I realize that a DataTree object actually represents the current node, which can be thought as a "replacement" for the Dataset proper. So it totally makes sense that we can access DataArrays using I believe that this issue can be closed, since my original question has been answered. As for your last question, I think I'd better join the the discussion in #80 because it would be a more appropriate place. Thank you for the answer once again! |
I'm glad that answered your question @shaprann !
Yes, I should try to make this clearer in the documentation somewhere. Given that we already have a place to discuss this in #80, I'm going to close this for now. |
Hello everyone! I have recently discovered your project and I'm very excited about the direction it is going! I see a lot of ways it can help me to better organize my Xarray Datasets and write cleaner code.
Being new to DataTree, I was quite confused when I discovered that my Datasets are not actually stored in the tree nodes. It took me some time to dig into the development history of the project and discover that this behavior was mainly introduced because of the name collision problem. After reading this page, the commit history, and a lot of related discussions, I believe I can now more or less understand why the things were implemented the way they are.
However, I can't stop thinking about one thing: why does DataTree even allow us to access DataArrays directly like this
dt["data_array"]
in the first place? As a new user I would expect that__getitem__()
and__setitem__()
would only work to navigate the tree nodes, and that's it. Intuitively I would then access a particular DataArray via the Dataset variable:dt.ds["data_array"]
.In fact, I was wondering about this behavior even before I knew about the name collision issue, namely when I was learning how to polulate my tree with data. It turned out that creating a new node via
dt["level1/level2/level3"] = something
would either create a new node at level3 as expected, or in some cases it would instead create a new Dataset at level2 and put my data into a DataArray named "level3", which I found quite confusing.Since a logical separation of tree nodes and DataArrays could help solve the name collision problem (at least before we try to write the result to NetCDF) and in my opinion would be more intuitive, I assume that this "equivalence" of tree nodes and DataArrays has a very strong reason to be implemented, thus I feel like I'm missing something important. I'd be thankful if someone could point me to the right direction on this one. I need to clarify, that even though I am an active Xarray user, I am not particularly familiar with the NetCDF format or with the CF conventions. So maybe my confusion comes from the fact that I am viewing this issue mostly from the Xarray user perspective.
The text was updated successfully, but these errors were encountered: