Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Multiprocessing interpolation and reprojection #661

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

vschaffn
Copy link
Contributor

@vschaffn vschaffn commented Mar 6, 2025

Resolves #648.

Description

This PR introduces raster interpolation and reprojection using multiprocessing to optimize memory usage and improve performance.

Code changes

  • Multiprocessing interpolation : The function multiproc_interp_points leverages multiprocessing to create a separate process for each point to be interpolated. For each point, the function calculates the smallest possible raster window to open based on the interpolation method and point coordinates, and use the raster.crop method to open the window. The raster.interpolate method is then applied to the cropped raster window, allowing for efficient interpolation without loading the full raster into memory.
  • Multiprocessing reprojection: The function multiproc_reproject utilizes raster tiling (see Implement Tiling for Multiprocessing #652) to perform multiprocessing-based reprojection. A separate process is created for each tile of the raster. Each process opens the corresponding tile using the raster.icrop method, calculates the bounds of the reprojected tile, snaps the bounds to the target reprojected grid, and reprojects the tile using the raster.reproject method. The reprojected tiles are written separately to disk, with safeguards in place to prevent data overwriting.

Tests

  • Unit tests have been implemented to verify the correctness of both multiproc_interp_points and multiproc_reproject. The test results are compared against the behavior of the raster.interpolate and raster.reproject methods to ensure consistency.

Note:

Currently, there are some tile alignment issues when performing reprojection operations that involve more than just translation. Further investigation is required to address these challenges.

Difference between tiled reprojection and classic reprojection (exploradores_aster_dem):

  • Changed CRS:
    image

@rhugonnet
Copy link
Member

rhugonnet commented Mar 10, 2025

Hi @vschaffn @adebardo, nice to see the progress!

Summary: Upon reading the code in delayed_multiproc, I'm fairly sure that in many cases the reproject won't work, and that interp might work but won't be computationally optimal.

Why?
For reproject, it's due to shape deformation during CRS reprojection, you can't work simply on extents. You need the equivalent of the GeoGrid classes, a vectorized format of the grid shape. Because of the nature of the problem, writing these steps as embedded functions would probably take three times the code size including many loops and be hard to grasp, which is why I wrote it this way (and they also did in other packages, like OpenDataCube-Geo).
For interp, it's because loading many tiny windows will be extremely slow. It's better to load a whole block that fits in memory, and perform interpolation on all the points of this block at once.

Moving forward:
Joining my previous comment (#655 (comment)), I suggest to simply re-use all the internal steps of dask_reproject and dask_interp functions that were already developed and extensively tested, and only replace the bits about loading/tiling that work differently without Dask and using Multiprocessing instead.
For example, in dask_reproject, nothing uses Dask until this line:

blocks = darr.to_delayed().ravel()

So all the projected grid matching calculations to map the input chunks to output chunks written above that line (using the GeoGrid classes I mentioned) can simply stay the same 😉

@rhugonnet
Copy link
Member

Also note that Rasterio has reprojection errors that can appear (especially when comparing a tiled reprojection to a reprojection on the full raster), due to internal inconsistencies...
But those look stripped, like this: rasterio/rasterio#2052 (comment)

@rhugonnet
Copy link
Member

To clarify one more remark: In multiprocessing_reproject(), you'd have to slightly change the logic you currently have within your tile loop (line 404).

Due to shape deformations, the mapping of input-output tiles between any CRS requires more than an overlap value, and also depends on the input/output chunksizes (what the GeoGrid code mentioned above does).
So you'll have to loop for each output tile (always 1 by 1), and every time open a combination of input tiles (could be 1, 2, 3, 4, 5, ...).

You can essentially turn this list comprehension here into your loop for multiprocessing_reproject:

# Run the delayed reprojection, looping for each destination block

And let reproject_block stick the pieces of input tiles in the right places (you might not have a perfect square with all the tiles you opened), and give you the output:
def _delayed_reproject_per_block(

@rhugonnet
Copy link
Member

I should have thought about this last week (I was mostly off work and only popped for the comment above, so I didn't think about the big picture, sorry!):
As I just mentioned in GlacioHack/xdem#699, the current code for reproject developed by @vschaffn will still be extremely useful, as it is essentially the multiprocessing equivalent of dask.map_overlap that we need absolutely everywhere (all terrain attributes such as the slope, all kernel-based filters, re-gridding in Coreg.apply(), and much more 😊).

It is only reproject that is a tricky exception! 😅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[POC] first try of multiprocessing for reprojection 2/2
2 participants