Nginx Timeout Issue while building the load_datasets() Dictionary #103

Open
sjanga1736 opened this issue Nov 25, 2024 · 5 comments

sjanga1736 commented Nov 25, 2024

Hi Andrew,

I have 2000+ files included in the configuration, sourced from S3, with file sizes varying from 100 KB to 500 GB. On the initial loading of the dataset configuration (config.yaml), creating the dataset objects with `datasets = {d["name"]: Dataset.from_config(**d) for d in config["datasets"]}` runs into a 504 timeout error (even though nginx is configured with a 120s timeout). How can I resolve this?

Is there a better way to create the datasets on startup when there is a huge number of files?
or
Do I need to create the datasets explicitly by dataset name?

ajnisbet (Owner) commented Jan 3, 2025

How are your S3 files being added: are you mounting your S3 bucket to the local filesystem? Could you share your config.yaml file?

If all your S3 files are in the same projection, it would be fastest to prebuild a single VRT file referencing all the S3 files. Then opentopodata will see your dataset as a single file and not have to reach out to S3 when loading the datasets. There's an example of building a VRT from S3 files here: https://www.opentopodata.org/notes/cloud-storage/
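
For instance, a rough sketch along the lines of the linked page (the bucket name and `.tif` extension here are placeholders, and this assumes all the files share one projection, since a plain VRT can't mix CRSs):

```bash
# List the rasters as /vsis3/ paths without downloading them, then build
# one VRT that references every file. GDAL needs S3 credentials
# (e.g. AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY) to read /vsis3/ paths.
aws s3 ls --recursive s3://my-bucket/cloud_data/ \
  | awk '/\.tif$/ {print "/vsis3/my-bucket/" $4}' > files.txt

gdalbuildvrt -input_file_list files.txt dataset.vrt
```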

sjanga1736 (Author) commented Jan 28, 2025

This is a sample config file. The dataset projections differ by country, and I am mounting the S3 bucket to the EC2 instance (locally):

{ "access_control_allow_origin": "*", "datasets": [ { "name": "32750", "path": "cloud_data/32750/" }, { "name": "32752", "path": "cloud_data/32752/" }, { "name": "32753", "path": "cloud_data/32753/" }, { "name": "32754", "path": "cloud_data/32754/" }, { "name": "32755", "path": "cloud_data/32755/" }, { "name": "32756", "path": "cloud_data/32756/" }, { "name": "auckland", "path": "cloud_data/auckland_contours/auckland-1m-dem-2013-vrt/" }, { "name": "christchurch", "path": "cloud_data/christchurch_contours/christchurch_1m_dem_2018_vrt/" }, { "name": "hawkesbay", "path": "cloud_data/hawkesbay/" }, { "name": "CA_NoCAL_Wildfires_B4_2018", "path": "cloud_data/california_contours/california/CA_NoCAL_Wildfires_B4_2018/CA_NoCAL_Wildfires_B4_2018_vrt/" }, { "name": "CA_SanBernardinoCo_AreaA_2013", "path": "cloud_data/california_contours/california/CA_SanBernardinoCo_AreaA_2013/CA_SanBernardinoCo_AreaA_2013_vrt/" }, { "name": "CA_SanBernardinoCo_AreaB_2013", "path": "cloud_data/california_contours/california/CA_SanBernardinoCo_AreaB_2013/CA_SanBernardinoCo_AreaB_2013_vrt/" }, { "name": "CA_SanDiegoQL2_2014", "path": "cloud_data/california_contours/california/CA_SanDiegoQL2_2014/CA_SanDiegoQL2_2014_vrt/" }, { "name": "AZ_CORiverBasin_L1_2014", "path": "cloud_data/california_contours/california/AZ_CORiverBasin_L1_2014/AZ_CORiverBasin_L1_2014_vrt/" }, { "name": "AZ_ColoradoRiverLot2_2014", "path": "cloud_data/california_contours/california/AZ_ColoradoRiverLot2_2014/AZ_ColoradoRiverLot2_2014_vrt/" }, { "name": "CA_Santa_Clara_DEM_2020_9330", "path": "cloud_data/california_contours/california/CA_Santa_Clara_DEM_2020_9330/" }, { "name": "CA_Eastern_SanDiegoCo_2016", "path": "cloud_data/california_contours/california/CA_Eastern_SanDiegoCo_2016/CA_Eastern_SanDiegoCo_2016_vrt/" }, { "name": "San_Bernadino_County_Flood_Control_Lidar", "path": "cloud_data/california_contours/california/San_Bernadino_County_Flood_Control_Lidar/San_Bernadino_County_Flood_Control_Lidar_vrt/" }, { "name": "CA_YosemiteNP_2019_D19", "path": "cloud_data/california_contours/california/CA_YosemiteNP_2019_D19/CA_YosemiteNP_2019_D19_vrt/" }, { "name": "CA_CarrHirzDeltaFires_2019_B19", "path": "cloud_data/california_contours/california/CA_CarrHirzDeltaFires_2019_B19/CA_CarrHirzDeltaFires_2019_B19_vrt/" }, { "name": "OR_RogueSiskiyouNF_2019_B19", "path": "cloud_data/california_contours/california/OR_RogueSiskiyouNF_2019_B19/OR_RogueSiskiyouNF_2019_B19_vrt/" }, { "name": "CA_AZ_FEMA_R9_Lidar_2017_D18", "path": "cloud_data/california_contours/california/CA_AZ_FEMA_R9_Lidar_2017_D18/CA_AZ_FEMA_R9_Lidar_2017_D18_vrt/" }, { "name": "CA_SantaClaraCounty_2020_A20", "path": "cloud_data/california_contours/california/CA_SantaClaraCounty_2020_A20/CA_SantaClaraCounty_2020_A20_vrt/" }, { "name": "radiant_st_4018", "path": "cloud_data/radiant_st_4018/" }, { "name": "beach_haven_estate_2430", "path": "cloud_data/beach_haven_estate_2430/" }, { "name": "maple_lane_rise_3352", "path": "cloud_data/maple_lane_rise_3352/" }, { "name": "montana_estate_3764", "path": "cloud_data/montana_estate_3764/" }, { "name": "elan_4500", "path": "cloud_data/elan_4500/" }, { "name": "kingsgrove_7109", "path": "cloud_data/kingsgrove_7109/" }, { "name": "donaldson_close_5255", "path": "cloud_data/donaldson_close_5255/" }, { "name": "mount_terry_estate_2527", "path": "cloud_data/mount_terry_estate_2527/" }, { "name": "the_village_grove_2560", "path": "cloud_data/the_village_grove_2560/" } ], "max_locations_per_request": 1000 }

ajnisbet (Owner) commented

Gotcha. Are all 32 of these datasets VRTs?

Loading 32 VRTs via a mounted S3 bucket will take a while, though I'd expect it to take a bit less than 120s. If your mounting tool supports caching, you could make those options more aggressive.
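
For example, with Mountpoint for Amazon S3 (just one possible mounting tool; the bucket name and paths here are placeholders), something like:

```bash
# Hedged sketch: cache object data locally and keep metadata for 5 minutes,
# so repeated reads during dataset loading don't all go back to S3.
mount-s3 my-bucket /mnt/cloud_data \
  --cache /var/cache/mount-s3 \
  --metadata-ttl 300
```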

Otherwise, you could make a single GTI of these 32 datasets. Unlike VRTs, GTIs can handle projection differences: https://gdal.org/en/stable/drivers/raster/gti.html
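
A rough sketch of building one (assumes GDAL >= 3.9 for the GTI driver; files.txt is the same hypothetical listing of /vsis3/ paths as above):

```bash
# -src_srs_name records each tile's own CRS in an attribute field, which
# is what lets one index span mixed projections; -t_srs sets the CRS of
# the index geometries themselves. The .gti.gpkg extension tells GDAL to
# open the result with the GTI driver.
gdaltindex -f GPKG -t_srs EPSG:4326 -src_srs_name src_srs \
  index.gti.gpkg --optfile files.txt
```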


Unfortunately I don't have plans to add caching to opentopodata. I'm open to it in theory, but it would need a design that can rescan updated datasets.

Perhaps a config option like `no_changes_since: 2025-01-07T23:54:33`: OTD caches the dataset info, but if the cache is older than `no_changes_since` it gets rebuilt.
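
In config.yaml that could look something like this (purely a sketch of the idea, not an implemented option):

```yaml
# Hypothetical, unimplemented option: cached dataset info is rebuilt only
# if it predates this timestamp.
no_changes_since: 2025-01-07T23:54:33
datasets:
  - name: 32750
    path: cloud_data/32750/
```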

It would also need somewhere to persist this information: perhaps a second mounted volume.

I'll think about this design some more!

sjanga1736 (Author) commented Jan 29, 2025

  • The provided configuration file is just a sample; I have around 2000+ files (both static and dynamic: the static ones are big files on the order of 1 GB to 200 GB, and the dynamic ones are small, on the order of 10 MB, for different countries like AUS, NZL, Canada & USA).
  • Is there a way to optimize this, or is a GTI the only way?
  • Additionally, after creating the GTI, where should I place or specify the tile index?

ajnisbet (Owner) commented

Yah, scanning 2000 files sequentially over a cloud mount is gonna take a while.

In theory opentopodata could scan those files outside of an http request context, build a spatial index, and store that somewhere that persists between reloads. But that's what a GTI is!

You could store the tile index in S3 next to your datasets, e.g. at cloud_data/index/index.gti.
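
The corresponding config might then be a single dataset pointing at that folder (the dataset name here is a placeholder; the idea is that opentopodata only has to open the one index file on startup):

```yaml
datasets:
  - name: combined
    path: cloud_data/index/
```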
