More control over lazy data creation (chunking) #5401
Replies: 17 comments
-
Hi @TomekTrzeciak. Thanks for the suggestion. In which operations do you want to control chunking? You can already pass a pre-created dask array to the cube constructor, or assign it into cube.data.
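For reference, a minimal sketch of those two existing routes (the shape, chunk sizes and names are invented purely for illustration):

```python
import dask.array as da
import numpy as np
from iris.cube import Cube

# Pre-create a lazy array with exactly the chunking you want ...
lazy = da.zeros((120, 720, 1440), chunks=(1, 720, 1440), dtype=np.float32)

# ... and either pass it to the cube constructor ...
cube = Cube(lazy, long_name="example_data")

# ... or assign it into an existing cube's data.
cube.data = lazy
```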
-
Constructing the cube directly is not very convenient. I guess reassigning cube.data could be an option, but it is still rather awkward to write something like this:
instead of just:
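(The original code snippets are missing here. A guess at the kind of contrast meant; the file name, variable name and especially the chunks keyword are all hypothetical:)

```python
import dask.array as da
import iris
import netCDF4

# The awkward route: load normally, then rebuild the lazy data by hand with the wanted chunks.
cube = iris.load_cube("data.nc", "air_temperature")
nc_var = netCDF4.Dataset("data.nc").variables["air_temperature"]
cube.data = da.from_array(nc_var, chunks=(1, 720, 1440))

# Versus a one-liner, if load accepted a (hypothetical) chunking keyword:
# cube = iris.load_cube("data.nc", "air_temperature", chunks=(1, 720, 1440))
```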
-
Regarding load, options will depend on the source format. The "field-based" file formats (FF, PP, GRIB) deal only in 2D fields, and they don't have any efficient access to subregions of a 2D field (i.e. the format code can only load a whole field and then extract from it). Thus, the natural chunksize is the whole field, and I don't think there will ever be any practical use for chunking differently in those cases. But I guess you are talking about netCDF?
There is certainly scope for controlling that: for instance, the chunk reduction assumes C-order contiguity, so it will be worst-case if earlier dimensions vary faster in the file. So I think we are talking about adding a chunk-control keyword to the netcdf loader.
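A sketch of the kind of reduction meant (not the actual Iris code): the chunk shape is shrunk by dividing the leading dimensions first, which suits C-ordered data but is the worst choice when earlier dimensions vary fastest on disk:

```python
from math import prod

def reduce_chunks_c_order(shape, max_elems):
    """Halve leading dimensions until the chunk is small enough, keeping trailing dims whole."""
    chunks = list(shape)
    for i in range(len(chunks)):
        while chunks[i] > 1 and prod(chunks) > max_elems:
            chunks[i] = (chunks[i] + 1) // 2
    return tuple(chunks)

print(reduce_chunks_c_order((120, 720, 1440), max_elems=10_000_000))
# -> (8, 720, 1440): the time dimension is split up, whole lat/lon fields stay contiguous
```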
-
Yes, I think an extra keyword passed through from the load API to the netcdf loader would be all that's needed.
-
I had a quick look. Unfortunately, there is no support for additional args/kwargs in the generic load functions, …
In the netcdf-specific loader, we currently have …
Will this work for your purposes?
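(The snippet referred to above is missing here. For orientation only: the format-specific entry point is roughly as below, and any chunk-control argument on it would be a hypothetical addition, not current API.)

```python
from iris.fileformats.netcdf import load_cubes

# Existing netCDF-specific loader (a generator of cubes).
cubes = list(load_cubes(["data.nc"]))

# A chunk-control argument here, e.g. load_cubes(["data.nc"], chunks={"time": 1}),
# would be a new, hypothetical parameter.
```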
-
@pp-mo, exposing chunks in … I think … An alternative could be to use a context manager to set/pass backend options without bloating the top-level APIs. I've noticed that there already exists …
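A minimal sketch of that context-manager idea; none of this is existing Iris API, it just illustrates setting backend options without adding new load keywords:

```python
import contextlib
import threading

# Hypothetical thread-local store of chunking options, consulted by the netCDF loader.
_OPTIONS = threading.local()

@contextlib.contextmanager
def chunk_options(**chunks):
    """Set per-dimension chunk sizes for loads performed inside the `with` block."""
    previous = getattr(_OPTIONS, "chunks", None)
    _OPTIONS.chunks = chunks
    try:
        yield
    finally:
        _OPTIONS.chunks = previous

# Hypothetical usage -- a loader would read _OPTIONS.chunks when creating lazy arrays:
with chunk_options(time=1, latitude=720):
    pass  # e.g. cubes = iris.load("data.nc")
```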
-
Hi @TomekTrzeciak, thanks for sticking with this.
I don't think there is any serious reason to oppose additional load controls. I just thought it sounded like more trouble to get such a change agreed. My concern is that, to be useful, I think we need to be able to specify chunking of individual file variables (see why below...). This means that the controls can't be expressed in terms of core Iris concepts such as cube identity, which then looks rather different to the 'save' case.
I don't see any sensible way around this, as you can't easily predict what a given load will produce, or which Iris objects relate to which parts of a source file, because Iris itself doesn't make any simple guarantees about those behaviours: if data changes, you can't reliably know beforehand how it will merge, how many cubes are returned, or in what order -- see for example #3314.
The reason I think we need a flexible control solution is that we do need to cope with large AuxCoords, which are often larger than the data variables. That is exactly why we implemented lazy loading for AuxCoords. So it means we will want to control chunking of those variables too. In the near future, we also expect to be dealing with large unstructured grids, which will present the same problem.
I think it could be fine, if we can design a default behaviour that enables us to simplify the simple cases. I'm just a bit wary, as it isn't immediately obvious to me how that can work.
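To make the per-file-variable point concrete, such a control would probably have to be keyed on names in the source file rather than on cubes -- something like this purely illustrative mapping (not an actual Iris structure):

```python
# Hypothetical per-variable chunking spec, keyed by netCDF variable name,
# covering data variables and large auxiliary-coordinate variables alike.
chunking = {
    "air_temperature": {"time": 1, "latitude": 720, "longitude": 1440},
    "surface_altitude": {"latitude": 720, "longitude": 1440},  # a big AuxCoord variable
}
```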
-
Hints of progress?
-
Cross-copied from #3357
-
Updated in understanding (mine, anyway) ...
This is true, but I think quite rare, as stated.
Though that may be true for the abstract 'as_lazy_data' call, I now think that is probably not so for netcdf data. Frustratingly, I can't find a clear statement of this anywhere.
-
Hi @pp-mo. One possible alternative solution -- or a part-solution -- would be to support the specification, by the user, of a chunking hint. There would be a default setting, of course: the canonical one (whatever that might be). I adopted this approach in a Python utility I developed for writing multiple compressed variables to netCDF files with different chunking strategies. Looking at the code, I can see that my utility supported the following chunking hints: …
Without digging around in the low-level code, I can't remember off the top of my head what each of these hints led to in terms of chunking policy. But that doesn't matter here; you'd obviously choose a variety of hints suitable for chunking dask arrays in different ways. Your solution might want to fall back on a default chunking hint/policy for those cases when the user (or calling program) doesn't specify chunk sizes explicitly. It sounds like Iris is already implementing a default policy, even if it is just the dask default. Anyway, just thought I'd throw this into the mix, although it might not be suitable for the current use-case. (PS: If you did want to snoop around the code, I can sort that out - it's in a private MetO repo.)
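Without knowing the actual hints used in that utility, a sketch of the general idea, with purely invented hint names (not the ones from that code):

```python
def chunks_from_hint(shape, hint):
    """Translate a named chunking hint into an explicit chunk shape (illustrative only)."""
    if hint == "contiguous":   # one chunk holding the whole variable
        return tuple(shape)
    if hint == "timeseries":   # whole series along the first (time) dimension
        return (shape[0],) + tuple(1 for _ in shape[1:])
    if hint == "maps":         # one 2D horizontal field per chunk
        return tuple(1 for _ in shape[:-2]) + tuple(shape[-2:])
    raise ValueError(f"unknown hint: {hint}")

print(chunks_from_hint((120, 720, 1440), "maps"))
# -> (1, 720, 1440)
```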
-
Interested in this too? @cpelley
-
Following some more recent experiences, I'm changing my mind on this. My key motivating example: …
So, I now believe we really do need to enable user chunking control in such cases, …
-
I believe that #4448 is also a very similar problem, possibly with a similar solution.
-
Hot news! I wrote a draft of something that I'm hoping may be usable for this: #4572
-
See also #4994, since xarray already offers chunking control. Though that does not provide variable-specific control as suggested in #4572.
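For comparison, xarray's load-time chunking control looks like this (file and dimension names invented); the chunks mapping applies to every variable using those dimensions, hence no per-variable control:

```python
import xarray as xr

# One chunk per timestep, whole horizontal fields; applies to all variables in the file.
ds = xr.open_dataset("data.nc", chunks={"time": 1})
```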
-
I hope we can consider this closed since #5588.
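For anyone arriving here later: the mechanism merged via #5588 is a context manager on the netCDF loader; to the best of my understanding it is used roughly as below, but check the Iris documentation for the exact signature:

```python
import iris
from iris.fileformats.netcdf.loader import CHUNK_CONTROL

# Dimension chunk sizes by name, optionally restricted to a named file variable.
with CHUNK_CONTROL.set("air_temperature", time=1, latitude=720, longitude=1440):
    cube = iris.load_cube("data.nc", "air_temperature")
```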
-
Currently, it is not possible to control how the lazy data gets chunked. It is also not possible to change that afterwards (dask's rechunk function does not change the original chunking, it only adds additional split/merge operations on top of it). While the default choice of chunking might be OK in some cases, in others it might be unsuitable, and it would be useful to allow for user choice in this respect.
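To illustrate the rechunk point: the rechunked array below still reads the data in the original small chunks and only reassembles them afterwards; it does not change how the underlying chunks are created (the in-memory array here is purely for illustration):

```python
import dask.array as da
import numpy as np

original = da.from_array(np.zeros((1000, 1000)), chunks=(10, 10))  # 10,000 small chunk tasks
regrouped = original.rechunk((500, 500))

# The rechunked array presents 4 output chunks ...
print(regrouped.chunks)  # ((500, 500), (500, 500))

# ... but its task graph still contains all the original 10x10 pieces underneath.
print(len(original.__dask_graph__()), len(regrouped.__dask_graph__()))
```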