
Distributed mass balance simulations workflow #140

@JordiBolibar

Description


Right now there is no centralized way to perform distributed mass balance (MB) simulations using MBM.

Distributed simulations can currently be run with the following workflow:

# Create geodata object
geoData = mbm.geodata.GeoData(df_grid_monthly)

# Compute and save the gridded MB for a given year and glacier
path_glacier_dem = os.path.join(cfg.dataPath, path_xr_grids,
                                f"{glacier_name}_{year}.zarr")
geoData.gridded_MB_pred(df_grid_monthly,
                        loaded_model,
                        glacier_name,
                        year,
                        all_columns,
                        path_glacier_dem,
                        path_save_glw,
                        save_monthly_pred=True,
                        type_model='NN')

However, the part that generates the df_grid_monthly data, i.e. the distributed grids covering all glacier pixels that are fed to the NN, is not easily generalizable: right now it is limited to a single glacier and year. We should therefore incorporate this either as a new class within MBM or as extra functionality of an existing class.
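As a rough sketch of what such a class could look like: the name `DistributedGrids` and its methods below are purely illustrative and not part of the current MBM API; the `process` body is a stand-in for the real Dataset / convert_to_monthly / to_parquet pipeline.

```python
import os
from dataclasses import dataclass, field


@dataclass
class DistributedGrids:
    """Illustrative wrapper collecting the per-glacier, per-year
    grid-generation steps in one place (not part of MBM)."""
    output_dir: str
    saved_paths: list = field(default_factory=list)

    def grid_path(self, glacier_name: str, year: int) -> str:
        # One parquet file per glacier and year, as in the notebook workflow.
        return os.path.join(self.output_dir,
                            f"{glacier_name}_grid_{year}.parquet")

    def process(self, glacier_name: str, year: int) -> str:
        # Placeholder for the real work: build the Dataset, convert it to
        # monthly resolution, validate 'pcsr', then write the parquet file.
        path = self.grid_path(glacier_name, year)
        self.saved_paths.append(path)
        return path

    def run(self, glaciers, years) -> list:
        # Single flat iteration over all (glacier, year) pairs.
        return [self.process(g, y) for g in glaciers for y in years]
```

The main design question is whether this belongs in a new class like this or as a method on an existing class such as `GeoData` or `Dataset`.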

The current approach, based on the Switzerland notebook, involves a double for loop over all glaciers and years that applies the following function:

dataset_grid_yearly = mbm.data_processing.Dataset(
    cfg=cfg,
    data=df_grid_y,
    region_name='CH',
    region_id=11,
    data_path=cfg.dataPath + path_PMB_GLAMOS_csv)

# Convert to monthly time resolution
dataset_grid_yearly.convert_to_monthly(
    meta_data_columns=cfg.metaData,
    vois_climate=vois_climate + ['pcsr'],
    vois_topographical=voi_topographical,
)

# Ensure 'pcsr' column exists before saving
if 'pcsr' not in dataset_grid_yearly.data.columns:
    raise ValueError(
        f"'pcsr' column not found in dataset for glacier '{glacier_name}' in year {year}"
    )

# Save the dataset for the specific year
save_path = os.path.join(
    folder_path, f"{glacier_name}_grid_{year}.parquet")
print(f'Saving gridded dataset to: {save_path}')
dataset_grid_yearly.data.to_parquet(save_path,
                                    engine="pyarrow",
                                    compression="snappy")

We should create a wrapper function that automatically runs this simulation for multiple years and glaciers, avoiding the double for loop in Python (which is very slow) and implementing some form of parallelization.
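One possible shape for that wrapper: flatten the glacier/year combinations into a single task list and map it in parallel. The function names below are hypothetical, `build_grid` is a stand-in for the real per-combination pipeline, and a thread pool is used here only to keep the sketch simple; for CPU-bound work a process pool (e.g. `ProcessPoolExecutor` or joblib) would be the more likely choice.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def build_grid(task):
    glacier_name, year = task
    # Stand-in for the real per-combination work: build the Dataset,
    # convert it to monthly resolution, check 'pcsr', save to parquet.
    return f"{glacier_name}_grid_{year}.parquet"


def build_all_grids(glaciers, years, max_workers=4):
    # Replace the nested Python loops with one flat task list,
    # then map it across workers; map() preserves task order.
    tasks = list(product(glaciers, years))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(build_grid, tasks))
```

A per-task function like this also gives a natural unit for error handling, so one failing glacier/year combination does not abort the whole run.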

After this, we should also update the documentation accordingly and add a quick example and tutorial on how to run distributed simulations.

Metadata

Labels: MBM core, documentation, enhancement
