Asynchronous GeoTIFF reader #13
I started a prototype here: https://github.com/developmentseed/aiocogeo-rs. It's able to read all TIFF and GeoTIFF metadata and decode a JPEG tile, though general read/decompressor support is not yet implemented. I definitely plan to implement full-tile-decoding support, so that we can use it from Python efficiently. It's not at all reliant on the … One question is how to marry both synchronous and async read APIs. One good reference is the …
Hmmm.... I wanted to use tiffs in …
Basically: put all non-I/O-related code (working on …
Oh wow, thanks @feefladder for all that work (and the PR at image-rs/image-tiff#245)!
Not sure either, but I'd definitely like it if most of the async code could be put in … Eventually though, it would be great to have less fragmentation on different ways of reading GeoTIFFs in Rust, and have all the core implementations live upstream in …
Actually, I did some work on georaster based on the async stuff... lemme see if I can make that into a PR... EDIT: ah, it turns out my files didn't save properly...
My PR over at image-rs/image-tiff#245 got rejected because it was too big and I didn't create an issue first. Then, for a proposal I had for more control (image-rs/image-tiff#250), there is no maintainer bandwidth to support my proposed changes (also here). So I don't know if it is feasible in the near future to build async on top of image-tiff. Anyone here have ideas for a good way forward?
@feefladder, first off, I really wish I had your energy to do such deep technical work in the (Geo)TIFF Rust ecosystem. I can understand your frustration as an open source contributor wanting to make meaningful changes, but I also feel for the image-tiff maintainer who has to review a 1000+ line PR for a crate used by 47k+ downstream projects... Personally, I'd prefer not to fragment the TIFF ecosystem by having X different implementations in X repos, which is why I've suggested pushing things to … Of course, I don't expect anyone to trust my obviously biased intentions on getting …
To add to this, I would love to figure out a good architecture such that we could use some of the decoding parts of … But maybe there's a way to use aiocogeo-rs for the data fetching and tag decoding, but reuse image-tiff for the image decoding? That's what I was heading towards before I ran out of time (I'm doing a lot of work on open source Rust vector projects).
Thanks for the uplifting responses :) I think the aiocogeo approach is a good one - having an async TIFF reader specifically for COGs. That could have a smaller code/exec footprint than image-tiff without needing to worry about arcane TIFFs (since being only for COGs is in the name). What I was currently thinking of is to put all lessons learned from …
How should it read tags? Read all images, including all tags, DoS-style? I think it makes sense to read only the relevant tags for the current overview (thinking COG) and the geo tags. Then, I don't really see a way in which one could read the desired part of an image in fewer than 3 sequential requests (given we know which overview we want to load). Doing some testing, I've found that finding the nth IFD normally fits within the (16 kB) buffer from the first request. Then, when reading tag data (~60 MB, further in the file), the pre-existing buffer is cleared, making further reads into other tag data slow. Now, to speed things up, I would keep the first fetch around, since it contains all IFDs, but then that is reader-implemented. There is a slight problem also in the design there, since async doesn't mean concurrent. That is, if we make the decoder reader-agnostic (having

```rust
pub struct ChunkDecoder<R: AsyncRead + AsyncSeek> {
    image: Arc<Image>,
    reader: R,
}
```

) and then "clone" the reader for concurrently reading multiple chunks - which is actually not really cloning, since it only clones the current buffer and not the pending read/seek - is that too ugly? That's what … At the end of the day, I do still think that image-rs/image-tiff#250 is a way forward, where the decoding functionality is exposed without dependence on a specific reader (or creating a mem_reader) <- I think there's still quite some design needed, and I would rather build on top of a fork where the needed functionality from …
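For what it's worth, the "shared metadata, cheap per-chunk reader handle" idea can be sketched synchronously with only the standard library (the `Image` and `ChunkDecoder` types here are hypothetical stand-ins; the real version would be generic over `AsyncRead + AsyncSeek` instead of `Read + Seek`):

```rust
use std::io::{self, Read, Seek, SeekFrom};
use std::sync::Arc;

// Hypothetical parsed-IFD metadata, shared between chunk decoders.
struct Image {
    tile_offsets: Vec<u64>,
    tile_byte_counts: Vec<u64>,
}

// Generic over any seekable reader; "cloning" a decoder only clones the
// cheap reader handle, never the parsed metadata behind the Arc.
struct ChunkDecoder<R> {
    image: Arc<Image>,
    reader: R,
}

impl<R: Read + Seek> ChunkDecoder<R> {
    // Fetch the raw (still compressed) bytes of one tile.
    fn read_chunk(&mut self, index: usize) -> io::Result<Vec<u8>> {
        let offset = self.image.tile_offsets[index];
        let len = self.image.tile_byte_counts[index] as usize;
        self.reader.seek(SeekFrom::Start(offset))?;
        let mut buf = vec![0u8; len];
        self.reader.read_exact(&mut buf)?;
        Ok(buf)
    }
}
```

Concurrent reads would then hand each task its own `R` plus an `Arc<Image>` clone, sidestepping the shared-buffer problem described above.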
Thanks @weiji14 for reading my meta and discord concerns! Indeed, I would like to be able to direct my (free time/thesis) efforts more in a direction I want. Then, making a reader that secretly has these license terms and is then used here would seem a bit weird - would such a thing be accepted here (say, I PR a dependency for …)?
Comparing to a recent development in … The hard design question here is finding the right interface to communicate the needs within the file, and properly control resource usage. (We want to avoid the need for buffers to have arbitrary size; the decoder must work with a constant bound of additional in-memory data besides the requested tags and decoded data, for medical images and GeoTIFF.) The interfaces are a little brittle since the control flow is quite inverted. Maybe there's a better way to write that as a trait with higher-ranked trait bounds now.
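That inverted control flow can be illustrated with a tiny pull-style sketch (all names hypothetical): the decoder never performs I/O itself, it hands the caller a byte range to fetch and waits to be fed the bytes, so its extra in-memory data stays bounded by the size of the requested range:

```rust
use std::ops::Range;

// What the decoder reports back to the caller on each turn.
enum Poll {
    NeedData(Range<u64>), // "fetch me exactly these bytes"
    Done(Vec<u8>),        // decoded output
}

// A single pending tile read; the caller drives it to completion.
struct TileRequest {
    offset: u64,
    len: u64,
    buf: Option<Vec<u8>>,
}

impl TileRequest {
    fn new(offset: u64, len: u64) -> Self {
        TileRequest { offset, len, buf: None }
    }

    fn poll(&mut self) -> Poll {
        match self.buf.take() {
            None => Poll::NeedData(self.offset..self.offset + self.len),
            Some(bytes) => Poll::Done(bytes), // real decoding would happen here
        }
    }

    // The caller feeds in the bytes it fetched for the requested range.
    fn feed(&mut self, bytes: Vec<u8>) {
        self.buf = Some(bytes);
    }
}
```

The caller's loop is the "inverted" part: it matches on `Poll::NeedData`, performs the fetch however it likes (sync file, async HTTP, ...), and feeds the result back in.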
So I've mumbled up something over here in the past week:
If I understand correctly, it's something like this (also in the readme):

```rust
// how HeroicKatora would do it if I understand correctly:
#[tokio::test]
async fn test_concurrency_recover() {
    let decoder = CogDecoder::from_url("https://enormous-cog.com")
        .await
        .expect("Decoder should build");
    decoder
        .read_overviews(vec![0])
        .await
        .expect("decoder should read ifds");
    // get a chunk from the highest resolution image
    let chunk_1 = decoder.get_chunk(42, 0).unwrap(); // the future doesn't hold any reference to `&self`
    // get a chunk from a lower resolution image
    if let OverviewNotLoadedError(chunk_err) = decoder.get_chunk(42, 5).unwrap_err() {
        // read_overviews changes state of the decoder to LoadingIfds
        decoder.read_overviews(chunk_err).await;
        // scope of async `&mut self` ends here
    }
    let chunk_2 = decoder.get_chunk(42, 5).unwrap();
    let data = (chunk_1.await, chunk_2.await);
}
```

So far, I'm not really using the …
From your readme
I'm not sure if you intend that, now that the arrow2 crate is deprecated and it kinda splintered the community 😅
ah yes, sort of: The idea was that my crate would be a try-out where the implementation is tested, to be pulled in upstream later and then deprecated. I don't know about arrow2 having splintered the community, that is not what I wanted to do, obviously :)
That's also why I decided not to build it on top of aiocogeo-rs, because I want to be a (rather big) changelog away from … If I understand correctly, some of the async code from …
I think if we can keep the initial scope to COG and ignore strip TIFFs, we should be able to keep the maintenance burden low enough to be stable. Ideally we have one project in the georust sphere (not necessarily in the georust org) that implements this. Personally, I don't see any progress on the …
I think it should read all …
What do you mean by buffer? Are you working with a …? I want to support COGs, and that means making remote files a first-class citizen. I think the best way to do this is via the …
We could make wrappers around the …
Under my proposed usage of …
You mean that …
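To illustrate the remote-first idea (hypothetical trait names; the real object_store `get_range` is async and returns `Bytes`, whereas this is a std-only sync sketch): a reader that only serves byte ranges, plus the typical first step of a COG read, fetching a fixed-size prefix that, for well-formed COGs, usually contains the header and all IFDs:

```rust
use std::ops::Range;

// Hypothetical object_store-style interface: the reader knows nothing
// about buffers, it just serves byte ranges on demand (over HTTP this
// would be a single Range request).
trait RangeReader {
    fn len(&self) -> u64;
    fn get_range(&self, range: Range<u64>) -> Vec<u8>;
}

// 16 kB, matching the first-fetch size discussed above.
const HEADER_PREFETCH: u64 = 16 * 1024;

// First request of a COG read: grab a fixed-size prefix (or the whole
// file if it is smaller than the prefetch window).
fn fetch_header<R: RangeReader>(reader: &R) -> Vec<u8> {
    let end = reader.len().min(HEADER_PREFETCH);
    reader.get_range(0..end)
}

// In-memory backend standing in for an HTTP/S3 store in tests.
struct InMemory(Vec<u8>);

impl RangeReader for InMemory {
    fn len(&self) -> u64 {
        self.0.len() as u64
    }
    fn get_range(&self, range: Range<u64>) -> Vec<u8> {
        self.0[range.start as usize..range.end as usize].to_vec()
    }
}
```

Making the trait the only I/O boundary means local files, HTTP, and object stores all plug in the same way.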
You may want to look at ArrowReaderBuilder, which is a wrapper around both a sync and async interface to work with Parquet files.
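Transplanted here, the parquet pattern looks roughly like this (all names hypothetical): one generic builder owns the shared configuration, and the sync/async split is just the type parameter, much like `ArrowReaderBuilder` wrapping both interfaces:

```rust
// One generic builder holds all options shared by sync and async paths.
struct CogReaderBuilder<T> {
    input: T,
    overviews: Vec<usize>, // e.g. which overview levels to decode
}

// Stand-in for a blocking input; an async flavour would be a second
// input type wrapping an object store handle.
struct SyncInput;

impl<T> CogReaderBuilder<T> {
    fn new(input: T) -> Self {
        CogReaderBuilder { input, overviews: vec![0] }
    }
    // Shared option, written once, available to both flavours.
    fn with_overviews(mut self, overviews: Vec<usize>) -> Self {
        self.overviews = overviews;
        self
    }
}

struct CogReader {
    overviews: Vec<usize>,
}

impl CogReaderBuilder<SyncInput> {
    // The sync flavour builds directly; an async flavour would expose
    // `async fn build(self)` on CogReaderBuilder<AsyncInput> instead.
    fn build(self) -> CogReader {
        let _ = self.input; // real code would open/consume the input here
        CogReader { overviews: self.overviews }
    }
}
```

The payoff is that option-handling code exists exactly once, so the sync and async readers can't drift apart.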
I think aiocogeo would likely want to stick with a generic, popular, free license like MIT or Apache 2. |
I think the initial implementation for simplicity should load all tag data. I think a later implementation could improve performance by automatically loading all inline tag data and storing references to the byte ranges of values for large tags. But at least in a COG all IFDs should be up front I believe, so I think you gain a lot in simplicity by reading all tags, for not a lot of performance hit.
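The "inline values now, byte ranges for big tags" idea might look like the following sketch (hypothetical names; it assumes classic little-endian TIFF, where an IFD entry has 4 value bytes that instead hold an offset when the value doesn't fit):

```rust
use std::ops::Range;

// A tag value either fits inline in the IFD entry, or we only remember
// where in the file to fetch it from later.
#[derive(Debug, PartialEq)]
enum TagValue {
    Inline(Vec<u8>),
    Deferred(Range<u64>),
}

// Classic TIFF rule: an IFD entry carries 4 value bytes (8 in BigTIFF);
// if the value is larger, those bytes hold a u32 offset into the file.
fn classify(byte_len: u64, value_field: [u8; 4]) -> TagValue {
    if byte_len <= 4 {
        TagValue::Inline(value_field[..byte_len as usize].to_vec())
    } else {
        let offset = u32::from_le_bytes(value_field) as u64;
        TagValue::Deferred(offset..offset + byte_len)
    }
}
```

An initial "load everything" implementation would simply resolve every `Deferred` eagerly; the optimized version resolves them lazily on first access.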
Personally, given the huge apparent lack of maintenance bandwidth in … And …
I think it's simpler for the reader itself to have no knowledge of buffers, but rather just to fetch specific byte ranges as needed. A wrapper around the raw byte range reader can provide request buffering for IFDs. Essentially this …
I'm not sure if it ever got upstreamed or if …
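A sketch of that separation (hypothetical names, std-only): the raw reader just serves ranges, and a wrapper keeps the first fetched prefix around so repeated reads within it, typical when walking IFDs, cost no extra requests:

```rust
use std::cell::RefCell;
use std::ops::Range;

// The raw reader: no buffering, just "give me these bytes".
trait RangeReader {
    fn get_range(&self, range: Range<u64>) -> Vec<u8>;
}

// Test backend that counts how many "network" requests were issued.
struct CountingReader {
    data: Vec<u8>,
    fetches: RefCell<u32>,
}

impl RangeReader for CountingReader {
    fn get_range(&self, range: Range<u64>) -> Vec<u8> {
        *self.fetches.borrow_mut() += 1;
        self.data[range.start as usize..range.end as usize].to_vec()
    }
}

// The wrapper provides the IFD buffering: it fetches a prefix once and
// serves any read that falls inside it from memory.
struct PrefixBuffered<R> {
    inner: R,
    prefix: Vec<u8>,
}

impl<R: RangeReader> PrefixBuffered<R> {
    fn new(inner: R, prefix_len: u64) -> Self {
        let prefix = inner.get_range(0..prefix_len);
        PrefixBuffered { inner, prefix }
    }

    fn get_range(&self, range: Range<u64>) -> Vec<u8> {
        if range.end <= self.prefix.len() as u64 {
            self.prefix[range.start as usize..range.end as usize].to_vec()
        } else {
            self.inner.get_range(range)
        }
    }
}
```

Because buffering lives entirely in the wrapper, the same policy works over any backend that can serve byte ranges.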
Rewrite https://github.com/geospatial-jeff/aiocogeo in Rust!

We'll probably want to leave all the non-filesystem I/O to object_store, and focus on decoding IFDs asynchronously. Help is most appreciated.