Merge pull request #58 from annefou/chunking
Info for chunking to fix #52
clausmichele authored Feb 11, 2025
2 parents 042c6a2 + 7d317f9 commit 66352a5
Showing 14 changed files with 17,088 additions and 1 deletion.
@@ -3,6 +3,8 @@
## Learning Objectives
- Understand what cloud native data formats are
- Understand how the cloud does computing more efficiently
- Understand chunking
- Understand how chunking impacts performance


## Outline
@@ -13,6 +15,7 @@
- Examples
- Performance in the cloud
- Tiling
- Chunking
- Scaling
- Distributed Computing
- The Microsoft Planetary Computer Setup - State of the Art open source cloud native technology stack
@@ -30,7 +33,26 @@ Cloud native formats or cloud-optimized formats, are file formats specifically d
### Characteristics of cloud native data formats
Cloud-optimized mainly means optimized "read" access, including partial reads and parallel reads. The main characteristics common to cloud-optimized formats are:

- **Data Chunking:** Cloud native formats employ a chunk-based organization, where the data is divided into smaller chunks or blocks. This enables parallel processing and efficient retrieval of specific portions of the data, reducing the need to access the entire dataset.
- **Data Chunking**: When working with large data files or collections, it’s often impossible to load all the data into a single computer’s memory at once. In such cases, a data chunking approach can be highly effective. By dividing the dataset into smaller chunks, the data can be processed piece by piece without exceeding the computer's memory capacity. This approach is particularly useful for managing large datasets on a single machine and can also scale to distributed computing environments, such as cloud platforms or high-performance computing systems.

**Cloud native** formats employ a chunk-based organization, where the data is divided into smaller chunks or blocks. This enables parallel processing and efficient retrieval of specific portions of the data, reducing the need to access the entire dataset.

A **chunk** is the smallest atomic unit of a larger dataset that can be processed independently, enabling efficient data handling by dividing the dataset into manageable pieces without requiring the entire dataset to be loaded into memory.

The figure below visually explains the concept of chunking: on the left, a three-dimensional dataset (x, y, and time) is shown without chunks, while on the right, the same dataset is displayed with chunks highlighted.

| Dataset without chunking | Dataset with chunking |
| ---------------------------------------------------------------- | ------------------------------------------------------- |
| ![No Chunking](assets/notchunked.png "Dataset without chunking") | ![Chunking](assets/chunked.png "Dataset with chunking") |
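
The idea of processing a dataset chunk by chunk can be sketched in a few lines. The example below is a minimal illustration that uses only the Python standard library; the helper names (`chunk_slices`, `make_dataset`, `read_chunk`, `chunked_mean`), the toy (time, y, x) dataset, and the chunk shape are assumptions made for this sketch and are not taken from the exercises. A real workflow would typically rely on a chunk-aware library instead of nested lists.

```python
from itertools import product


def chunk_slices(shape, chunk_shape):
    """Yield one tuple of slices per chunk, covering an array of the given shape."""
    ranges = [range(0, size, step) for size, step in zip(shape, chunk_shape)]
    for starts in product(*ranges):
        yield tuple(
            slice(start, min(start + step, size))
            for start, step, size in zip(starts, chunk_shape, shape)
        )


def make_dataset(nt, ny, nx):
    """Build a toy (time, y, x) dataset as nested lists (hypothetical stand-in data)."""
    return [
        [[t * 100 + y * 10 + x for x in range(nx)] for y in range(ny)]
        for t in range(nt)
    ]


def read_chunk(data, slices):
    """Return the values of a single chunk as a flat list."""
    t_sl, y_sl, x_sl = slices
    return [value for plane in data[t_sl] for row in plane[y_sl] for value in row[x_sl]]


def chunked_mean(data, shape, chunk_shape):
    """Compute the mean while holding only one chunk of values at a time."""
    total = 0
    count = 0
    for slices in chunk_slices(shape, chunk_shape):
        values = read_chunk(data, slices)
        total += sum(values)
        count += len(values)
    return total / count


shape = (4, 6, 6)        # (time, y, x)
chunk_shape = (2, 3, 3)  # assumed chunk size for this sketch
dataset = make_dataset(*shape)
print(len(list(chunk_slices(shape, chunk_shape))), "chunks")
print("mean:", chunked_mean(dataset, shape, chunk_shape))
```

Only the running total and count are kept between iterations, so the peak working set is one chunk rather than the whole dataset.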


There are different ways to chunk data, depending on the nature of the dataset and the analysis requirements. Spatial chunking divides data based on geographical or spatial dimensions (e.g., longitude, latitude), which is ideal for geospatial datasets where the data is naturally distributed across space. Time-based chunking focuses on temporal dimensions (e.g., by day, month, or year), which is suitable for time-series data. Another approach is box chunking, where data is divided into fixed-size blocks (e.g., cubes or boxes), providing a balance between spatial and time-based chunking.

The choice of chunking strategy can significantly impact the efficiency of data access: spatial chunking is optimal for spatial queries, while time-based chunking improves access to time-series data. Using the right chunking strategy can reduce the computational overhead and improve the overall performance of data processing tasks.

The table below illustrates the two most common chunking strategies:

| Spatial chunking strategy | Box chunking strategy |
| ------------------------------------------------------------------------------- | ------------------------------------------------------------------- |
| ![Spatial Chunking](assets/spatialchunking.png "Dataset with spatial chunking") | ![Box Chunking](assets/boxchunking.png "Dataset with box chunking") |
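
To make the effect of the chunk layout on data access more concrete, the sketch below counts how many chunks a query has to touch under different chunk shapes. Everything in it is an assumption made for illustration: the dataset shape (365 time steps on a 1000 x 1000 grid), the three chunk shapes, the two queries, and the function name `chunks_touched`. The layout names are descriptive labels chosen for this sketch and are not a formal definition of the strategies named in the text.

```python
from math import prod


def chunks_touched(query, chunk_shape):
    """Count the chunks overlapping a query given as (start, stop) pairs per dimension."""
    per_dim = [
        (stop - 1) // size - start // size + 1
        for (start, stop), size in zip(query, chunk_shape)
    ]
    return prod(per_dim)


# Hypothetical chunk shapes for a (time, y, x) dataset of shape (365, 1000, 1000).
layouts = {
    "full map per time step": (1, 1000, 1000),
    "full time series per tile": (365, 100, 100),
    "box": (30, 250, 250),
}

# Two typical access patterns, as (start, stop) index ranges per dimension.
queries = {
    "map at one time step": ((10, 11), (0, 1000), (0, 1000)),
    "time series at one pixel": ((0, 365), (500, 501), (500, 501)),
}

for layout_name, chunk_shape in layouts.items():
    for query_name, query in queries.items():
        n = chunks_touched(query, chunk_shape)
        print(f"{layout_name:>26} | {query_name:<24} | {n} chunk(s)")
```

With these assumed shapes, each of the first two layouts needs a single chunk for one query and hundreds or a hundred for the other, while the box layout sits in between for both, which matches the trade-off described above.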

- **Internal Indexing:** These formats incorporate internal indexing structures that facilitate fast spatial and attribute queries. This enables efficient data access and retrieval operations without the need for extensive scanning or processing of the entire dataset.

@@ -80,6 +102,14 @@ Both horizontal and vertical scaling have their advantages and considerations. H

In common workflows, a combination of both approaches is used to ensure optimal speed and resource utilization while being able to keep the simplicity of a workflow.

## How to scale

There are many ways to handle scaling properly.
We will use two Pangeo exercises to understand __vertical scaling__ and __horizontal scaling__ using chunking and Dask.

[Exercise 2.4 chunking](./exercises/24_chunking.ipynb)

[Exercise 2.4 dask](./exercises/24_dask.ipynb)
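
Before opening the notebooks, the following sketch shows why chunking makes both kinds of scaling possible. It does not use Dask: a thread pool from the Python standard library stands in for a set of workers, and all function names (`split_into_chunks`, `partial_sum`, `combine`, `mean_sequential`, `mean_parallel`) as well as the data and chunk size are assumptions made for this illustration. The point is only that each chunk yields an independent partial result, so the same chunks can be processed one after another on a single worker or handed out to several workers.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(values, chunk_size):
    """Split a sequence into consecutive chunks of at most chunk_size items."""
    return [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]


def partial_sum(chunk):
    """Reduce one chunk to a small partial result: (sum, number of items)."""
    return sum(chunk), len(chunk)


def combine(partials):
    """Merge the per-chunk partial results into the overall mean."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


def mean_sequential(chunks):
    """One worker handles the chunks one after another."""
    return combine([partial_sum(chunk) for chunk in chunks])


def mean_parallel(chunks, workers=4):
    """Several workers (threads here, as a stand-in) each handle some of the chunks."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return combine(list(pool.map(partial_sum, chunks)))


values = list(range(1000))  # toy data for this sketch
chunks = split_into_chunks(values, chunk_size=100)
print("sequential:", mean_sequential(chunks))
print("parallel:  ", mean_parallel(chunks))
```

Both functions return the same mean because the per-chunk results are combined in the same way; only the way the chunks are distributed over workers differs.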

### Subscription vs. On-Demand usage

@@ -0,0 +1,7 @@
{
"type": "FeatureCollection",
"crs": { "type": "name", "properties": { "name": "urn:ogc:def:crs:OGC:1.3:CRS84" } },
"features": [
{ "type": "Feature", "properties": { "HYBAS_ID": 2090516090, "NEXT_DOWN": 2090516950, "NEXT_SINK": 2090012980, "MAIN_BAS": 2090012980, "DIST_SINK": 334.5, "DIST_MAIN": 334.5, "SUB_AREA": 419.1, "UP_AREA": 419.2, "PFAF_ID": 214040804, "ENDO": 0, "COAST": 0, "ORDER": 3, "SORT": 10988 }, "geometry": { "type": "Polygon", "coordinates": [ [ [ 11.075, 46.729166666666693 ], [ 11.072575547960094, 46.728813340928845 ], [ 11.069091118706622, 46.725353325737871 ], [ 11.048285590277802, 46.724646674262175 ], [ 11.042024739583358, 46.730876329210091 ], [ 11.041666666666691, 46.733333333333356 ], [ 11.038123575846377, 46.73424241807728 ], [ 11.0375, 46.745833333333358 ], [ 11.03664923773874, 46.749149237738742 ], [ 11.030017428927975, 46.750850762261308 ], [ 11.028599378797768, 46.756377156575546 ], [ 11.025567287868947, 46.76028951009117 ], [ 11.0244327121311, 46.764710489908879 ], [ 11.021400621202281, 46.768622843424502 ], [ 11.020833333333357, 46.770833333333357 ], [ 11.024291653103322, 46.771720716688392 ], [ 11.025567287868947, 46.781377156575545 ], [ 11.032766045464435, 46.789456176757838 ], [ 11.033900621202282, 46.793877156575547 ], [ 11.036932712131101, 46.797789510091171 ], [ 11.038067287868948, 46.80221048990888 ], [ 11.042586263020858, 46.807629055447073 ], [ 11.052210489908878, 46.808900621202284 ], [ 11.06028951009117, 46.816099378797766 ], [ 11.06471048990888, 46.817233954535617 ], [ 11.068622843424503, 46.820266045464436 ], [ 11.073043823242212, 46.82140062120228 ], [ 11.077974107530407, 46.825221761067731 ], [ 11.079733954535614, 46.830103895399333 ], [ 11.071295166015648, 46.839574517144122 ], [ 11.074432712131101, 46.843622843424505 ], [ 11.075567287868948, 46.846579996744815 ], [ 11.071400621202281, 46.851956176757838 ], [ 11.070266045464434, 46.859321424696205 ], [ 11.082766045464435, 46.872789510091174 ], [ 11.084041680230058, 46.882445949978326 ], [ 11.089710489908878, 46.88390062120228 ], [ 11.099959648980059, 46.893196953667562 ], [ 11.096400621202282, 
46.897789510091172 ], [ 11.095266045464435, 46.913437228732661 ], [ 11.103599378797767, 46.922789510091171 ], [ 11.104166666666691, 46.929166666666696 ], [ 11.114924452039954, 46.929519992404543 ], [ 11.11875, 46.933318752712701 ], [ 11.123286946614609, 46.928813340928848 ], [ 11.139924452039956, 46.929519992404543 ], [ 11.143408881293428, 46.932980007595511 ], [ 11.148288302951414, 46.933691067165825 ], [ 11.157980007595512, 46.943408881293429 ], [ 11.158691406250025, 46.948290337456626 ], [ 11.164242214626761, 46.953813340928846 ], [ 11.166666666666693, 46.954166666666694 ], [ 11.167024739583358, 46.951709662543429 ], [ 11.172575547960095, 46.946186659071209 ], [ 11.177424452039956, 46.945480007595513 ], [ 11.180908881293428, 46.942019992404539 ], [ 11.185757785373289, 46.941313340928843 ], [ 11.189242214626761, 46.937853325737876 ], [ 11.194123670789956, 46.93714192708336 ], [ 11.199646674262178, 46.931591118706621 ], [ 11.200353325737872, 46.910075547960098 ], [ 11.204519992404538, 46.905879720052113 ], [ 11.203813340928845, 46.901742214626765 ], [ 11.199646674262178, 46.897546386718773 ], [ 11.200353325737872, 46.893408881293432 ], [ 11.203813340928845, 46.889924452039956 ], [ 11.204519992404538, 46.885075547960099 ], [ 11.212486097547769, 46.877084011501765 ], [ 11.208686659071207, 46.873257785373291 ], [ 11.208333333333359, 46.858333333333363 ], [ 11.210543823242213, 46.857766045464437 ], [ 11.214456176757839, 46.854733954535618 ], [ 11.231377156575547, 46.853599378797767 ], [ 11.239456176757837, 46.846400621202285 ], [ 11.260543823242212, 46.845266045464435 ], [ 11.264583333333359, 46.842135281033009 ], [ 11.26875, 46.845364718967041 ], [ 11.272789510091172, 46.842233954535615 ], [ 11.323043823242214, 46.841099378797772 ], [ 11.331122843424506, 46.833900621202282 ], [ 11.343877156575548, 46.832766045464439 ], [ 11.347789510091172, 46.829733954535612 ], [ 11.353315904405409, 46.828315904405407 ], [ 11.354733954535616, 46.82278951009117 ], [ 
11.357766045464437, 46.818877156575546 ], [ 11.358900621202284, 46.814456176757837 ], [ 11.36609937879777, 46.806377156575543 ], [ 11.366666666666694, 46.804166666666688 ], [ 11.365779283311658, 46.800708346896727 ], [ 11.356122843424506, 46.7994327121311 ], [ 11.35208333333336, 46.796301947699675 ], [ 11.347916666666693, 46.799531385633706 ], [ 11.343877156575548, 46.796400621202281 ], [ 11.339456176757839, 46.795266045464437 ], [ 11.335543823242213, 46.792233954535618 ], [ 11.321806165907145, 46.791011895073808 ], [ 11.31306728786895, 46.781377156575545 ], [ 11.311932712131103, 46.776956176757835 ], [ 11.304733954535617, 46.768877156575549 ], [ 11.30359937879777, 46.76445617675784 ], [ 11.300567287868949, 46.760543823242216 ], [ 11.299149237738742, 46.755017428927978 ], [ 11.293622843424505, 46.753599378797766 ], [ 11.289710489908881, 46.750567287868947 ], [ 11.279166666666693, 46.75 ], [ 11.278813340928846, 46.739242214626763 ], [ 11.274646674262179, 46.735046386718778 ], [ 11.275353325737873, 46.730908881293431 ], [ 11.283319430881102, 46.722917344835096 ], [ 11.27951999240454, 46.719091118706622 ], [ 11.278813340928846, 46.705908881293425 ], [ 11.275353325737873, 46.702424452039956 ], [ 11.275, 46.7 ], [ 11.274032253689262, 46.696229383680581 ], [ 11.267413330078151, 46.691099378797766 ], [ 11.260289510091171, 46.692233954535617 ], [ 11.25491333007815, 46.69640062120228 ], [ 11.251956176757838, 46.695266045464436 ], [ 11.248043823242213, 46.692233954535617 ], [ 11.239456176757837, 46.691099378797766 ], [ 11.235543823242214, 46.688067287868947 ], [ 11.218622843424505, 46.686932712131103 ], [ 11.21471048990888, 46.683900621202284 ], [ 11.197789510091171, 46.682766045464433 ], [ 11.193877156575546, 46.679733954535614 ], [ 11.184220716688394, 46.678458319769987 ], [ 11.182413736979193, 46.671416219075546 ], [ 11.173043823242214, 46.663067287868948 ], [ 11.168622843424505, 46.661932712131097 ], [ 11.159270562065997, 46.653599378797765 ], [ 11.154985215928845, 
46.655143907335095 ], [ 11.154166666666692, 46.65833333333336 ], [ 11.15085076226131, 46.65918409559464 ], [ 11.15, 46.6625 ], [ 11.150353325737871, 46.669091118706618 ], [ 11.155876329210095, 46.674641927083357 ], [ 11.160757785373288, 46.675353325737873 ], [ 11.166668023003497, 46.68123406304256 ], [ 11.162853325737872, 46.685075547960096 ], [ 11.162146674262178, 46.689924452039953 ], [ 11.15451999240454, 46.697575547960092 ], [ 11.153813340928846, 46.702424452039956 ], [ 11.148290337456622, 46.707975260416688 ], [ 11.143408881293428, 46.708686659071205 ], [ 11.139924452039956, 46.712146674262179 ], [ 11.122575547960095, 46.712853325737875 ], [ 11.119091118706622, 46.716313340928842 ], [ 11.11424221462676, 46.717019992404538 ], [ 11.110757785373288, 46.720480007595512 ], [ 11.101742214626761, 46.721186659071208 ], [ 11.098257785373288, 46.724646674262175 ], [ 11.085075547960095, 46.725353325737871 ], [ 11.081591118706623, 46.728813340928845 ], [ 11.075, 46.729166666666693 ] ] ] } }
]
}