-
Notifications
You must be signed in to change notification settings - Fork 94
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Dataset::read as #374
base: master
Are you sure you want to change the base?
Add Dataset::read as #374
Conversation
0a689a2
to
9cdcb52
Compare
window_size: (usize, usize), | ||
buffer_size: (usize, usize), | ||
e_resample_alg: Option<ResampleAlg>, | ||
) -> Result<Buffer3D<T>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code here is heavily inspired by RasterBand::read_as
since a lot of the logic is similar. The main change is that the RasterIO
function on dataset needs some direction about which bands to read ( the bands
param).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I follow this comment, but I'm wondering, is it not possible to specify which bands to read? IOW, is reading all the bands the only option here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's definitely possible. I'll try to expose this in a way that make it easy to read all bands if you don't want to bother, but also gives you the power to be more specific.
src/dataset.rs
Outdated
T::gdal_ordinal(), | ||
band_count as i32, | ||
bands.as_mut_ptr() as *mut c_int, | ||
// We want to read a HWC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit arbitrary, but it seems to make more sense than the default CHW because other rust image library (e.g. the png crate) expect image data a HWC. Happy to get feedback / revise this though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not familiar with this terminology. Does "CHW" mean "channel/height/width" and "HWC" mean "height/width/channel"?
How does this compare to how Rasterio handles reading multiple bands at once?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the jargon. Yes indeed CHW is for channel/height/width
.
How does this compare to how Rasterio handles reading multiple bands at once?
Do you mean the C++ GDALDataset::RasterIO
API or the python rasterio
project ? I'll try to answer for both:
GDALDataset::RasterIO
lets you specify this through thenPixelSpace
,nLineSpace
,nBandSpace
paramsrasterio
DatasetReader.read
will return data in channel-first format (channel/heigh/width). Numpy makes it trivial to change that though (calldata.transpose(1, 2, 0)
). I feel this is going to be a bit more involved in rust so having a good default would make sense - but maybe with the option to change it.
src/dataset.rs
Outdated
/// * `buffer_size` - the desired size of the 'Buffer' | ||
/// * `e_resample_alg` - the resample algorithm used for the interpolation. Default: `NearestNeighbor`. | ||
/// | ||
/// The returned buffer contains the image data in HWC order. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think "HWC" is common jargon in the GDAL/EO world... Do you mean "rows/cols/band" order?
|
||
#[derive(Debug)] | ||
pub struct Buffer3D<T: GdalType> { | ||
pub size: (usize, usize, usize), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Document how to interpret the size/shape?
window_size: (usize, usize), | ||
buffer_size: (usize, usize), | ||
e_resample_alg: Option<ResampleAlg>, | ||
) -> Result<Buffer3D<T>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking out loud here: Why don't we just expose RasterIO
instead? If you want to read all the cells in rows/cols/band order, this gets the job done, but how common is that use case compared to needing something more flexible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also wondering if this should be paired with exposing AdviseRead
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good points. One thing I wasn't sure is if georust/gdal
aims to be mostly a thin wrapper around GDAL C++ API or focus more on ergonomics (like the rasterio
python project).
For example currently RasterBand
doesn't expose GDALRasterBand::RasterIO
but instead exposes a number of read_as
function that also make some choices (e.g. they return images in width / height format and I don't think they offer the possibility to get it in height/width format.
I do like the idea of exposing RasterIO (and AdviseRead possibly). Then I feel it would make sense to also expose these on the RasterBand
type then.
And then maybe still keep some high-level read_as
functions that simplify common usage patterns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing I wasn't sure is if georust/gdal aims to be mostly a thin wrapper around GDAL C++ API or focus more on ergonomics (like the rasterio python project).
I'm not sure about that either! 😅 Actually, it's more like a thin wrapper around the C API, with inspiration from the C++ API, while leveraging safety benefits of Rust when possible. I personally would like to focus more on the ergonomics, but there are good, well-considered arguments against that. There have been lots of cooks in this kitchen, and lacking a project BDFL, my default position is to try to be consistent, yet look for ways burn-down friction and entropy along the way.
RasterBand doesn't expose GDALRasterBand::RasterIO but instead exposes a number of read_as function that also make some choices
I'm too ignorant of the project history to offer an explanation for that. It probably best scratched an itch at the time.
they return images in width / height format and I don't think they offer the possibility to get it in height/width format.
This may be a simple situation of convention. I've worked on a number of different geospatial raster processing systems, and I can't think of one that doesn't assum row-major order.
I should also note that if ndarray
is turned on, you have all the same shape manipulating view capabilities as you do in numpy.
Something that might help is to understand the use case for this PR. Is calling RasterBand::read*
multiple times to construct your tensor a real bottleneck? Or is this just a convenience? In my work code, we do that and assemble the results into the shape we want using ndarray methods. Duplicating all the read*
methods on Dataset
has some code smell to it, except that there's precedence in the C++ library. But my beef with it gets back to your original question: is the goal just a wrapper, or do we aim for better ergonomics?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good questions, let me try to give some more motivations.
The reason I started working on this is because I was working on a xyz tiler service (similar to rio-tiler) to serve XYZ tiles to a web frontend from cloud-optimized GeoTiffs stored on a blobstore. This is a case where I always want to read all bands of the tiff because either the tiffs are RGB or they have more than 3 bands but then I'm gonna do some band combinations to turn into 3 bands for visualization.
I do already have a similar service in Python and I'm using gdal.Dataset.ReadAsArray
from the GDAL python bindings and so I was looking for the same thing in rust. Since this function is also available in the GDAL C API (GDALDatasetRasterIO
), it felt quite natural to have it in the rust bindings.
Now, granted I could do this by reading band-by-band and recombining with ndarray
, but:
- I'd like to avoid the
ndarray
dependency since I'm not using it elsewhere - Even with
ndarray
, at some point when I want to turn thendarray
into aheight x width x channel
buffer to write to a PNG encoder, this will involve an extra copy I feel - Without
ndarray
, the logic of recombining the bands in an image - even if simple - feels kind of cumbersome to write and something that should be provided by the GDAL wrapper - I haven't benchmarked this in a while, but I think reading band by band is less efficient than reading a whole band at once, depending on the compresion/interleaving of the TIFF file. In particular, when using JPEG compression, reading band by band will still require reading and uncompressiong all the bands and discard everything but the requested band. If I understand correctly, this is what this comment in the GDAL code is referring to
- I am also not sure if reading band by band could cause one network request (when reading from a blobstore) for each band instead of one request for all bands when reading by block. GDAL does have some caching logic, so I'm unsure about this one.
All in all, if the goal is for this crate to be a wrapper around the C API (that seem like a good guideline to me) but inspired by the GDAL C++ API, then I feel exposing Dataset::RasterIO
make sense because it's available on both of those API.
I get the code duplication argument though, but then I would almost argue that if we have to pick one function to expose, it should be RasterIO
on the Dataset
and not on the RasterBand
. Because the Dataset
one lets you do a single band read (by passing the band) while the opposite is not true (you can't read the whole dataset from RasterBand
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@julienr Interesting and helpful context. Thanks for taking the time to write it up! I hope I didn't come across as questioning your need; I just thought for the sake of one of the other contributors having the use case fleshed out would help with thinking about the best long-term API.
I have a few additional reflections on the above, in case it's helpful (ignore if not), but I'll need to defer to a more senior contributor to determine the appropriate direction.
I'd like to avoid the ndarray dependency since I'm not using it elsewhere
We have a rust-based XYZ tiler as well, and I avoided ndarray
for a while, but found it's worth the price in the end. It's just too damn convenient. And it's impressively efficient, at least in our use cases. The API has a learning curve, but being able to re-shape things and zip
arrays (and even try out parallelization with rayon) has been worth it to us.
at some point when I want to turn the ndarray into a height x width x channel buffer to write to a PNG encoder, this will involve an extra copy I feel
FWIW, we create an in-memory Dataset where we assemble the bands, apply color correction, etc., and then use the built-in GDAL PNG encoder to render. Negligible overhead compared to the S3 network transfer. And GDAL appears to take care of any reshaping needed.
the logic of recombining the bands in an image - even if simple - feels kind of cumbersome to write and something that should be provided by the GDAL wrapper
I agree, but even with the C/C++ API, GDAL doesn't help much, unless you start looking at GDAL Warp and the pixel functions. I aspire to write some wrappers for that API, but I've run away from the dragons therein several times 😅! But maybe I'm missing a part of the API here. 🤔 I've found ndarray
more of a help here.
I haven't benchmarked this in a while, but I think reading band by band is less efficient than reading a whole band at once
That may be so. I wonder if it depends on how big your blocks are, and if allocating B x W x H
contiguous memory is more efficient than B x (W x H)
. 🤔
In particular, when using JPEG compression, reading band by band will still require reading and uncompressing all the bands and discard everything but the requested band.
Maybe I'm wrong, but if it's truly a COG format with TILED=YES
, I don't think this is the case. We're certainly not seeing that behavior in our system, in our use cases. We also use the Kakadu JP2K driver and haven't had problems with range reads as long as it's internally tiled. But I can't claim comprehensive experience or analysis here. It's definitely been "fast enough" and definitely isn't downloading the full raster.
I feel exposing Dataset::RasterIO make sense because it's available on both of those API.
Based on your arguments, I agree. But I'd go all the way and expose as much of the API as you can without leaking unsafe
.
I get the code duplication argument though,
I think it's a minor consideration. We could consider traits for a common interface and share functions as we're able, but a lot of this is just baked into GDAL.
I'm not qualified to approve this PR, but a couple other questions that come to mind are:
|
@metasim thanks for the thorough review and the comments ! I'll convert this to draft because I think some more feedback/discussion on the API design is required to make this good.
Should we adopt the same strategy as for
I need to think about this a bit more. I feel this would require turning the
Indeed, this isn't well specified. It isn't well specified currently for (I feel I've adressed the other 2 questions in my other comments) |
So I haven't made up my mind on the API here, but I think we should offer some API for it. Pixel-interleaved images are a good argument (although the data will probably be cached unless you're reading too much at once), and in general I'd like us to have an equivalent for every GDAL API. |
I think doing it in this PR is probably best... we already have the conversation going here. |
9cdcb52
to
a11a53a
Compare
I've pushed an updated version that expose an API that is very close to GDAL RasterIO (including choosing the interleaving) without sacrificing safety. The only thing I haven't adressed (yet) is the I also see that the CI is failing, but I think this isn't this branch fault but more than recent clippy is more nitpicky that old versions. |
Rebasing on #426 should fix the linting |
This is similar to RasterBand::read_as, but it is reading the whole image at once. Similar to how GDAL has RasterIO on both the dataset and band level.
a11a53a
to
76fdac5
Compare
CHANGES.md
if knowledge of this change could be valuable to users.This adds a
Dataset::read_as
function that is the dataset counterpart toRasterBand::read_as
. and reads the whole raster at once. This avoids requiring multiple calls toRasterBand::read_as
.