How one can cache Dataset #425

REASY · 2023-07-21T02:19:25Z

Hello, team,

I have a slippy server, i.e. an HTTP server implemented with gdal-rs that serves slippy tiles. The actual rasters are partitioned into many Cloud Optimized GeoTIFF (COG) files with overviews. At a high level, I extract the tile information from a request that looks like /:prefix/:layer/:z/:x/:y and map it to an overview and an offset to read from the COG. My COG files are stored in S3 and I access them through vsis3. At the beginning of each request I open a Dataset; at the end it is implicitly closed when it is dropped. Interestingly, if I query the same slippy tile twice, only the first request has high latency; the second one is much faster (is it because of the VSI cache?):

2023-07-21T02:13:12.185123Z  INFO tokio-runtime-worker ThreadId(03) qartez_slippy_server::routes: src/routes.rs:169: Read and prepared a tile for .../20/179207/418903.png from /vsis3/.../color/geotiff/5600_13090.tif in 475 ms
2023-07-21T02:13:42.265197Z  INFO tokio-runtime-worker ThreadId(02) qartez_slippy_server::routes: src/routes.rs:169: Read and prepared a tile for .../20/179207/418903.png from /vsis3/.../color/geotiff/5600_13090.tif in 3 ms
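For readers unfamiliar with the request-to-window mapping described above, here is a minimal sketch of how /:z/:x/:y could be turned into a bounding box and an overview level. It assumes the rasters are in Web Mercator (EPSG:3857), 256-pixel tiles, and overviews that each halve the resolution of the previous level; `tile_bounds`, `tile_resolution`, `pick_overview` and the 0.1 m base resolution are hypothetical names and values, not taken from the server in question.

```rust
use std::f64::consts::PI;

/// Half the width of the Web Mercator (EPSG:3857) world extent, in metres.
const ORIGIN_SHIFT: f64 = PI * 6_378_137.0;

/// Bounds of an XYZ ("slippy") tile as (min_x, min_y, max_x, max_y) in EPSG:3857.
fn tile_bounds(z: u32, x: u32, y: u32) -> (f64, f64, f64, f64) {
    let tile_size = 2.0 * ORIGIN_SHIFT / 2f64.powi(z as i32);
    let min_x = -ORIGIN_SHIFT + f64::from(x) * tile_size;
    let max_y = ORIGIN_SHIFT - f64::from(y) * tile_size; // y grows southwards
    (min_x, max_y - tile_size, min_x + tile_size, max_y)
}

/// Ground resolution (metres per pixel) of a 256-pixel tile at zoom `z`.
fn tile_resolution(z: u32) -> f64 {
    2.0 * ORIGIN_SHIFT / (256.0 * 2f64.powi(z as i32))
}

/// Picks the coarsest level that is still at least as fine as `wanted_res`.
/// Level 0 is the full-resolution image; level n is assumed to be 2^n coarser.
fn pick_overview(base_res: f64, overview_count: u32, wanted_res: f64) -> u32 {
    let mut level = 0u32;
    while level < overview_count && base_res * 2f64.powi(level as i32 + 1) <= wanted_res {
        level += 1;
    }
    level
}

fn main() {
    // Tile coordinates taken from the log lines above.
    let (z, x, y) = (20, 179_207, 418_903);
    println!("bounds: {:?}", tile_bounds(z, x, y));
    // Hypothetical COG: 0.1 m base resolution with 5 overview levels.
    println!("overview: {}", pick_overview(0.1, 5, tile_resolution(z)));
}
```

The bounding box would then be converted to a pixel window with the dataset's geotransform, scaled by the chosen overview factor.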

Does it make sense in such a scenario to cache the C descriptor of the Dataset and reuse it? Or should VSI_CACHE_SIZE together with GDAL_CACHEMAX be enough?

Thank you.


lnicola · 2023-07-21T06:17:44Z

Yeah, it's a bit unfortunate. GDAL doesn't allow you to read from a Dataset from multiple threads at once, even though cURL could probably support it just fine.

So I think your options are to either:

- open and close the dataset on each read, which will incur a good bit of overhead (the TLS handshake and reading the IFDs, I guess)
- have a thread or pool of threads where each opens the file, gets a read request from a channel, does the actual read, sends the results back, then loops; this should work pretty well, but you'll be storing duplicate data in the GDAL cache
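The second option could be sketched roughly as follows with `std::sync::mpsc`. This is only an illustration under stated assumptions: `FakeDataset`, `ReadRequest` and `spawn_reader` are hypothetical names, and `FakeDataset` stands in for `gdal::Dataset` so that the example runs without GDAL; a real worker would open the COG and read pixel data instead.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread;

/// Stand-in for `gdal::Dataset` (assumption: a real worker opens the COG here).
struct FakeDataset {
    path: String,
}

impl FakeDataset {
    fn open(path: &str) -> Self {
        FakeDataset { path: path.to_owned() }
    }

    /// Pretends to read a window and returns a label instead of pixel data.
    fn read_window(&self, x: usize, y: usize) -> String {
        format!("{}@{},{}", self.path, x, y)
    }
}

/// A read request plus the channel on which the worker sends the result back.
struct ReadRequest {
    x: usize,
    y: usize,
    reply: Sender<String>,
}

/// Spawns a worker that opens the dataset once and then serves requests until
/// every `Sender<ReadRequest>` has been dropped.
fn spawn_reader(path: &str) -> Sender<ReadRequest> {
    let (tx, rx) = channel::<ReadRequest>();
    let path = path.to_owned();
    thread::spawn(move || {
        // Opened once; the dataset never leaves this thread.
        let dataset = FakeDataset::open(&path);
        for request in rx {
            let tile = dataset.read_window(request.x, request.y);
            // The requester may have given up waiting; a failed send is ignored.
            let _ = request.reply.send(tile);
        }
    });
    tx
}

fn main() {
    let reader = spawn_reader("/vsis3/bucket/example.tif"); // hypothetical path
    let (reply_tx, reply_rx) = channel();
    reader.send(ReadRequest { x: 3, y: 7, reply: reply_tx }).unwrap();
    println!("{}", reply_rx.recv().unwrap());
}
```

A pool would run several such workers per file. Since each worker holds its own open dataset, cached blocks are duplicated across workers, which is the trade-off mentioned above.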

I should probably ask on the mailing list for clarification, though.

rouault · 2023-08-19T20:08:17Z

Starting with GDAL 3.6.0, if the GDAL_NUM_THREADS config option is set, reading a window of interest that intersects multiple tiles at once from a TIFF/COG file will use multithreaded decompression (cf https://github.com/OSGeo/gdal/blob/v3.6.0/NEWS.md), and in GDAL 3.7.0 this was further improved to trigger parallel network requests.

lnicola · 2023-08-20T07:50:31Z

I don't think multi-threaded decoding helps in this case (a tile server), since each request will read a single block if everything is set up properly. But we can't have everything just yet :⁠-⁠).

metasim · 2023-08-21T17:33:56Z

@REASY

Not sure if this could be considered canonical or even acceptable (YMMV), but we have a production tile server written in Axum + georust/gdal and have been caching without problems using this (GdalPath is an internal type which basically combines a GDAL VSI path + band specifiers):

use crate::raster::GdalPath;
use crate::Error;
use gdal::Dataset;
use moka::sync::Cache;
use once_cell::sync::Lazy;
use std::ops::Deref;
use std::sync::{Arc, Mutex};
use std::time::Duration;

pub(crate) struct DatasetCache(Cache<GdalPath, Arc<Mutex<Dataset>>>);

static INSTANCE: Lazy<DatasetCache> = Lazy::new(DatasetCache::new);

impl DatasetCache {
    fn new() -> Self {
        Self(
            Cache::builder()
                .time_to_idle(Duration::from_secs(3600))
                .max_capacity(5)
                .build(),
        )
    }
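    /// Returns the cached handle for `path`, opening the dataset on first use.
    pub(crate) fn dataset_for(path: &GdalPath) -> crate::Result<Arc<Mutex<Dataset>>> {
        // `try_get_with` runs the closure only on a cache miss; on failure it
        // hands back the closure's error wrapped in an `Arc`.
        let ds = INSTANCE.0.try_get_with(path.clone(), || {
            let ds: crate::Result<Dataset> = path.open();
            ds.map(|d| Arc::new(Mutex::new(d)))
                .map_err(|e| e.to_string())
        });
        // Turn the `Arc<String>` into an owned `String` for the crate error.
        ds.map_err(|e| Error::Unexpected(e.deref().clone()))
    }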
}

ChristianBeilschmidt · 2024-01-30T07:50:56Z

Isn't the problem that Datasets are not Send? You can wrap them in a Mutex so that they are Sync, but you cannot enforce Send.

There are shared datasets in GDAL, but we haven't implemented them since they cannot simply be used with all the stuff currently implemented for a dataset.

We have done the thread + channel thing that @lnicola mentioned 😆 .

EDIT: Was wrong, they are Send but subtypes like bands aren't. So for datasets, you are ready to go.
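To make the Send/Sync point concrete, here is a small sketch with compile-time checks. `Handle` is a hypothetical stand-in for a type that is Send but not Sync (which is what this thread says about Dataset); the example does not use the gdal crate. It relies on the standard library rule that `Mutex<T>` is Sync whenever `T` is Send.

```rust
use std::cell::Cell;
use std::sync::{Arc, Mutex};
use std::thread;

/// Compile-time checks: a call only compiles if `T` satisfies the bound.
fn assert_send<T: Send>() {}
fn assert_sync<T: Sync>() {}

/// Hypothetical handle that is `Send` but not `Sync`: it may move to another
/// thread, but must not be used from two threads at the same time.
struct Handle {
    reads: Cell<u32>,
}

impl Handle {
    fn read(&self) -> u32 {
        self.reads.set(self.reads.get() + 1);
        self.reads.get()
    }
}

/// Shares one handle between `n` threads; the mutex serializes the reads.
fn read_from_threads(n: u32) -> u32 {
    let shared = Arc::new(Mutex::new(Handle { reads: Cell::new(0) }));
    let workers: Vec<_> = (0..n)
        .map(|_| {
            let shared = Arc::clone(&shared);
            thread::spawn(move || {
                shared.lock().unwrap().read();
            })
        })
        .collect();
    for worker in workers {
        worker.join().unwrap();
    }
    let total = shared.lock().unwrap().reads.get();
    total
}

fn main() {
    assert_send::<Handle>();
    // assert_sync::<Handle>(); // would not compile: `Cell` is not `Sync`
    assert_sync::<Mutex<Handle>>(); // `Mutex<T>: Sync` only needs `T: Send`
    assert_send::<Arc<Mutex<Handle>>>();
    println!("{}", read_from_threads(4));
}
```

If `Handle` were not Send, none of the `Mutex`-based assertions would compile, which is why the Send question matters for the `Arc<Mutex<Dataset>>` approach.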

lnicola · 2024-01-30T08:10:30Z

Yeah, IIRC shared datasets are actually the opposite of the "open the file multiple times" trick. Instead, you (probably) get a mutex around each access, but end up with better cache utilization.

> In the beginning of request I open Dataset, in the end it is implicitly closed because of drop.

You can stick them in an Arc<Mutex<HashMap>> or something, of course. They don't have to disappear at the end of the scope.
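A minimal sketch of that idea, using only the standard library. `FakeDataset`, `DatasetCache` and `get_or_open` are hypothetical names; `FakeDataset` stands in for `gdal::Dataset` so the example runs without GDAL, and a real version would have to handle open errors.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Stand-in for `gdal::Dataset` (assumption: the real type is opened from a path).
struct FakeDataset {
    path: String,
}

/// Keeps opened datasets alive across requests, keyed by path.
struct DatasetCache {
    entries: Mutex<HashMap<String, Arc<Mutex<FakeDataset>>>>,
}

impl DatasetCache {
    fn new() -> Self {
        DatasetCache { entries: Mutex::new(HashMap::new()) }
    }

    /// Returns the cached handle for `path`, opening it on first use.
    fn get_or_open(&self, path: &str) -> Arc<Mutex<FakeDataset>> {
        let mut entries = self.entries.lock().unwrap();
        let handle = entries
            .entry(path.to_owned())
            .or_insert_with(|| Arc::new(Mutex::new(FakeDataset { path: path.to_owned() })))
            .clone();
        handle
    }
}

fn main() {
    let cache = DatasetCache::new();
    let first = cache.get_or_open("/vsis3/bucket/example.tif"); // hypothetical path
    let second = cache.get_or_open("/vsis3/bucket/example.tif");
    // Both requests share the same open dataset.
    println!("same handle: {}", Arc::ptr_eq(&first, &second));
    println!("path: {}", first.lock().unwrap().path);
}
```

Unlike the moka-based cache shown earlier in the thread, this version never evicts anything, so entries stay open until the cache itself is dropped. The outer lock is also held while a dataset is being opened, which blocks other lookups for that time.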

rouault · 2024-01-30T11:18:08Z

> Yeah, IIRC shared datasets are actually the opposite of the "open the file multiple times" trick. Instead, you (probably) get a mutex around each access, but end up with better cache utilization.

No, you don't. You just get the same dataset if you call GDALOpenShared() from the same thread that opened the initial one. Otherwise you'll get a different instance.

lnicola · 2024-01-30T11:22:40Z

Oh, right. Well that's an argument for Dataset not being Send, because otherwise you can open a shared one twice and pass it to a different thread, which is bad.

ChristianBeilschmidt · 2024-02-04T08:55:33Z

You can't call GDALOpenShared with this library at the moment. This is why we can say that Dataset: Send.

There would need to be a second type of dataset, e.g. SharedDataset, which would call GDALOpenShared under the hood but would then not be Send.

lnicola · 2024-02-04T08:59:19Z

You're right, there's even a note in the docs:

> Note that the GDAL_OF_SHARED option is removed from the set of allowed option because it subverts the Send implementation that allow passing the dataset the another thread. See #154.

lnicola mentioned this issue Aug 2, 2023: Added Send trait to vector structs #419 (open)

lnicola mentioned this issue Feb 22, 2024: Cannot share dataset safely between threads #522 (closed)
