More control over lazy data creation (chunking) #5401

TomekTrzeciak · 2019-06-18T12:39:35Z

TomekTrzeciak
Jun 18, 2019

Currently, it is not possible to control how the lazy data gets chunked. It is also not possible to change that afterwards (dask rechunk function does not change the original chunking, it only adds additional split/merge operations on top of it). While the default choice of chunking might be OK in some cases, in other it might be unsuitable and it would be useful to allow for user choice in this respect.

pp-mo · 2019-06-20T16:18:15Z

pp-mo
Jun 20, 2019
Maintainer

Hi @TomekTrzeciak . Thanks for suggesting.

In what operations do you want to control chunking ?

You can already pass a pre-created dask array to the cube constructor, or assign it into cube.data.
We also have chunking control when saving to netcdf : iris.fileformats.netcdf.save supports a 'chunksize' keyword, so that can also appear as a kwarg in an iris.save call.

0 replies

TomekTrzeciak · 2019-06-20T17:47:48Z

TomekTrzeciak
Jun 20, 2019
Author

In what operations do you want to control chunking ?

iris.load & co.

You can already pass a pre-created dask array to the cube constructor, or assign it into cube.data.

Constructing the cube directly is not very convenient. I guess reassigning cube.data could be an option, but still rather awkward to write something like this:

cube = iris.load_cube(filepath)
cube.data = dask.array.from_array(netCDF4.Dataset(filepath)[cube.var_name], chunks=-1)

instead of just:

cube = iris.load_cube(filepath, chunks=-1)

0 replies

pp-mo · 2019-06-21T09:49:30Z

pp-mo
Jun 21, 2019
Maintainer

Regarding load, options will depend on the source format.

The "field-based" file formats (FF, PP, GRIB) deal only in 2D fields, and they don't have any efficient access to subregions of a 2d field (i.e. the format code can only load a whole field then extract from it). Thus, the natural chunksize is the whole field, and I don't think there will ever be any practical use for chunking differently in those cases.

But I guess you are talking about netCDF ???
What we currently have there is basically determined by this code and this code .
In summary :

take a whole variable as a chunk,
.. or use file chunking if set
.. and reduce the earlier dimensions if resulting chunks are impracticably large

There is certainly scope for controlling that : for instance, the chunk reduction assumes c-order contiguity, so will be worst-case if earlier dimensions vary faster in the file.

So I think we are talking about adding a chunk-control keyword to the netcdf loader.
That could be pretty simple, as the underlying interfaces already have suitable controls in them.
Is that what you mean ?

0 replies

TomekTrzeciak · 2019-06-21T10:55:58Z

TomekTrzeciak
Jun 21, 2019
Author

So I think we are talking about adding a chunk-control keyword to the netcdf loader.
That could be pretty simple, as the underlying interfaces already have suitable controls in them.
Is that what you mean ?

Yes, I think extra keyword passed through from load api to netcdf loader would be all that's needed.

0 replies

pp-mo · 2019-06-21T11:33:54Z

pp-mo
Jun 21, 2019
Maintainer

extra keyword passed through from load api to netcdf loader

I had a quick look. Unfortunately, there is no support for additional args/kwargs in the generic load functions, iris.load/load_cube/load_cubes/load_raw.
I guess that's because it's hard to associate controls with individual load items, unlike save where it makes sense to provide additional controls on a per-cube basis.
So, I hope this is ok for your expectations, I think it will be much harder to extend the iris.load API.

adding a chunk-control keyword to the netcdf loader

In the netcdf-specific loader, we currently have iris.fileformats.netcdf.load_cubes(filenames, callback=None).
I'd expect to add a new facility there.
How about something like ...
iris.fileformats.netcdf.load_cubes(filenames, callback=None, filepath_varname_chunks={})
Then use like

from iris.fileformats.netcdf import load_cubes as nc_load_cubes
cubes = nc_load_cubes(filename, {(filename, 'var1'): (32, -1, 16)})

Will this work for your purposes?
And generally, what does your usecase look like -- we'll be needing a testcase anyway ...
Feel free to propose something.

0 replies

TomekTrzeciak · 2019-06-25T15:40:06Z

TomekTrzeciak
Jun 25, 2019
Author

@pp-mo, exposing chunks in iris.fileformats.netcdf.load_cubes could be one way, but this api is quite low level (returns a cube generator, so one needs to be careful how this gets passed around in user code, probably best to convert to something like CubeList immediately).

I think iris.load/load_cube/load_cubes/load_raw could support additional kwargs the same way that iris.save does - the backends that don't support certain kwargs could simply ignore them. But I can see how this leads to passing arguments through several levels of APIs, so arguably not the nicest pattern.

An alternative could be to use a context manager to set/pass backend options without bloating top level APIs. I've noticed that there already exists iris.config.netcdf context manager, so that could be a possible place to add this (or another one could be added specifically for dask if that's preferable). Then, the user code could be like this:

with iris.config.netcdf.context(chunks=-1):
    cube = iris.load_cube('dataset.nc')

0 replies

pp-mo · 2019-06-25T17:49:23Z

pp-mo
Jun 25, 2019
Maintainer

Hi @TomekTrzeciak thanks for sticking with this.

I think iris.load/load_cube/load_cubes/load_raw could support additional kwargs the same way that iris.save does

I don't think there is any serious reason to oppose additional load controls. I just thought it sounded like more trouble to get such a change agreed.
I myself certainly don't object in principle to passing keywords down from the main 'load' functions. I think that is probably the most obvious way to do it.

My concern is that, to be useful, I think we need to be able to specify chunking of individual file variables (see why below...). This means that the controls can't be expressed in terms of core Iris concepts such as cube identity, which then looks rather different to the 'save' case.
As I said...

it's hard to associate controls with individual load items, unlike save where it makes sense

I don't see any sensible way around this, as you can't easily predict what a given load will produce, or which Iris objects relate to which parts of a source file -- because Iris itself doesn't make any simple guarantees about those behaviours : If data changes, you can't reliably know beforehand how it will merge, how many cubes are returned, or in what order -- see for example #3314.

The reason I think we need a flexible control solution is that we do need to cope with large AuxCoords -- often larger than the data variables. That is exactly why we implemented lazy loading for AuxCoords. So it means we will want to control chunking of those variables too. In the near future, we also expect to be dealing with large unstructured grids, which will present the same problem.

I think it could be fine, if we can design a default behaviour that enables us to simplify the simple cases. I'm just a bit wary, as it isn't immediately obvious to me how that can work.

0 replies

pp-mo · 2019-07-16T18:33:29Z

pp-mo
Jul 16, 2019
Maintainer

Hints of progress ?
#3357 (comment)

0 replies

pp-mo · 2019-07-19T09:06:35Z

pp-mo
Jul 19, 2019
Maintainer

Cross-copied from #3357
I'm now trying to retrospectively separate these 2 discussions ! ...

@TomekTrzeciak Xarray has a nice idea there to accept chunks given as a mapping of dimension name to chunk size.

@pp-mo A really nice spot 💐, but it still has the same problem of being tied to the low-level file encoding, which must then be both known (to the user) and stable : In this case we need dimension names, which don't actually appear anywhere in the iris/CF data representation.
If we could use names of dimension coordinates, that would help loads with this. But I think we can only do that if we first "pre-load", then "re-load" the file. 😢

Also, technically, as the occurrence and order of dimensions will differ between variables, you might still want dimensions divided differently for different variables. But I agree that is probably an obscure case : In principle you could even have an X(r:74, t:20000, y:1500, x:2000) and a Y(x:2000, t:20000, r:74), but I think it would be really rare -- maybe a usecase in ocean data ??

0 replies

pp-mo · 2019-07-19T09:50:37Z

pp-mo
Jul 19, 2019
Maintainer

Updated in understanding (mine, anyway) ...

occurrence and order of dimensions will differ between variables

This is true, but I think quite rare, as stated.
However, where I said above ...

the chunk reduction assumes c-order contiguity, so will be worst-case if earlier dimensions vary faster in the file.

Though that may be true for the abstract 'as_lazy_data' call, I now think that is probably not so for netcdf data.
In at least the overwhelming majority of files (all the files I've seen), you can assume C-ordering of dimension variables. And that really ought to inform chunking policy. However, the dask auto-chunking policy is agnostic on dimension order,
Which at present seems to me a good reason for not relying on it totally (retaining our own policy).

Frustratingly, I can't find a clear statement of this anywhere.
I think it is not guaranteed for HDF5 data generally, as that has a very general access model.
It might well be "a given" for data created via netcdf interface. But I can't find any clear statement of that.

0 replies

rockdoc · 2019-07-30T14:04:52Z

rockdoc
Jul 30, 2019

Hi @pp-mo One possible alternative solution -- or a part-solution -- would be to support the specification, by the user, of a chunking hint. There would be a default setting, of course: the canonical one (whatever that might be). I adopted this approach in a Python utility I developed for writing multiple compressed variables to netCDF files with different chunking strategies.

Looking at the code, I can see that my utility supported the following chunking hints:

# Chunk size adjustment hints
HINT_INTEGER_MULTIPLE_OF_ONE_CHUNK = 1
HINT_DECREMENT_BY_ONE_FROM_LEFT    = 2
HINT_INCREMENT_BY_ONE_FROM_RIGHT   = 3
HINT_SET_TO_ONE_FROM_LEFT          = 4
HINT_SET_TO_MAX_FROM_RIGHT         = 5

Without digging around in the low-level code, I can't remember off the top of my head what each of these hints led to in terms of chunking policy. But that doesn't matter here; you'd obviously choose a variety of hints suitable for chunking dask arrays in different ways.

Your solution might want to fall back on a default chunking hint/policy for those cases when the user (or calling program) doesn't specify chunk sizes explicitly. It sounds like Iris is already implementing a default policy, even if it is just the dask default.

Anyways, just thought I'd throw this into the mix, although it might not be suitable for the current use-case.

(PS: If you did want to snoop around the code, I can sort that out - it's in a private MetO repo.)

0 replies

pp-mo · 2019-08-21T15:11:38Z

pp-mo
Aug 21, 2019
Maintainer

Interested this too ? @cpelley

0 replies

pp-mo · 2022-02-07T12:02:39Z

pp-mo
Feb 7, 2022
Maintainer

Following some more recent experiences, I'm changing my mind on this.
Although it can't be done "in Iris terms", I'm now convinced by a few examples I've seen that only user chunking can usefully fix some of the performance limitations.

My key motivating example:

a loaded cube merges multiple input files : each contains a 3d datacube, dimensions (nz, ny, nz), for a single timestep : result is a time sequence --> dimensions (nt, nz, ny, nx)
the user extracted a single vertical level --> cube[:, i_lev] --> (nt, ny, nx)
when fetched, this attempts to fetch the whole data and only then extract a single vertical level, which exceeds machine memory
when saved to a result file, this takes about 20 x longer than equivalent data of the same size

So, I now believe we really do need to enable user chunking control in such cases,
but for reasons above, it will need to be in file-specific terms.
Also for reasons above, I think it makes sense and will be needed to apply chunking controls

(a) to specific file variables, and
(b) to set chunk sizes for specific file dimensions

0 replies

pp-mo · 2022-02-07T12:32:46Z

pp-mo
Feb 7, 2022
Maintainer

key motivating example:

I believe that #4448 is also a very similar problem, with possibly a similar solution

0 replies

pp-mo · 2022-02-09T12:40:48Z

pp-mo
Feb 9, 2022
Maintainer

Hot news! I wrote a draft something that I'm hoping may be useable for this : #4572

0 replies

pp-mo · 2022-10-27T13:45:59Z

pp-mo
Oct 27, 2022
Maintainer

See also : #4994 -- since xarray already offers chunking control. Though, that does not provide variable-specific control as suggested in #4572.
I'm not saying that this is the way to solve all problems in future -- we should probably still add chunking control to Iris.
But it may be a useful stopgap.

0 replies

trexfeathers · 2024-06-21T13:37:33Z

trexfeathers
Jun 21, 2024
Maintainer

I hope we can consider this closed since #5588

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

More control over lazy data creation (chunking) #5401

{{title}}

Replies: 17 comments

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

More control over lazy data creation (chunking) #5401

TomekTrzeciak Jun 18, 2019

Replies: 17 comments

pp-mo Jun 20, 2019 Maintainer

TomekTrzeciak Jun 20, 2019 Author

pp-mo Jun 21, 2019 Maintainer

TomekTrzeciak Jun 21, 2019 Author

pp-mo Jun 21, 2019 Maintainer

TomekTrzeciak Jun 25, 2019 Author

pp-mo Jun 25, 2019 Maintainer

pp-mo Jul 16, 2019 Maintainer

pp-mo Jul 19, 2019 Maintainer

pp-mo Jul 19, 2019 Maintainer

rockdoc Jul 30, 2019

pp-mo Aug 21, 2019 Maintainer

pp-mo Feb 7, 2022 Maintainer

pp-mo Feb 7, 2022 Maintainer

pp-mo Feb 9, 2022 Maintainer

pp-mo Oct 27, 2022 Maintainer

trexfeathers Jun 21, 2024 Maintainer

TomekTrzeciak
Jun 18, 2019

pp-mo
Jun 20, 2019
Maintainer

TomekTrzeciak
Jun 20, 2019
Author

pp-mo
Jun 21, 2019
Maintainer

TomekTrzeciak
Jun 21, 2019
Author

pp-mo
Jun 21, 2019
Maintainer

TomekTrzeciak
Jun 25, 2019
Author

pp-mo
Jun 25, 2019
Maintainer

pp-mo
Jul 16, 2019
Maintainer

pp-mo
Jul 19, 2019
Maintainer

pp-mo
Jul 19, 2019
Maintainer

rockdoc
Jul 30, 2019

pp-mo
Aug 21, 2019
Maintainer

pp-mo
Feb 7, 2022
Maintainer

pp-mo
Feb 7, 2022
Maintainer

pp-mo
Feb 9, 2022
Maintainer

pp-mo
Oct 27, 2022
Maintainer

trexfeathers
Jun 21, 2024
Maintainer