Right now there is no centralized way to perform distributed mass balance (MB) simulations using MBM.
Distributed simulations are currently possible with the following workflow:
```python
# Create geodata object
geoData = mbm.geodata.GeoData(df_grid_monthly)

# Compute and save gridded MB for a given year and glacier
path_glacier_dem = os.path.join(cfg.dataPath, path_xr_grids,
                                f"{glacier_name}_{year}.zarr")
geoData.gridded_MB_pred(df_grid_monthly,
                        loaded_model,
                        glacier_name,
                        year,
                        all_columns,
                        path_glacier_dem,
                        path_save_glw,
                        save_monthly_pred=True,
                        type_model='NN')
```

However, the part that generates the `df_grid_monthly` data, i.e. the distributed grids based on all glacier pixels that are fed to the NN, is not easily generalizable, and right now it is limited to a single glacier and year. Therefore, we should incorporate this either as a new class within MBM or as extra functionality of an existing class.
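To make the proposal concrete, below is a minimal sketch of what such a class could look like. All names here (`GriddedDataBuilder`, `tasks`) are assumptions for illustration, not existing MBM API; the idea is simply to encapsulate the glacier/year enumeration and output paths in one place instead of hand-written loops:

```python
import os

class GriddedDataBuilder:
    """Hypothetical interface sketch (names are not MBM API).

    Encapsulates the per-glacier, per-year grid generation so that
    callers no longer need to write the double loop themselves.
    """

    def __init__(self, glaciers, years, out_dir):
        self.glaciers = list(glaciers)
        self.years = list(years)
        self.out_dir = out_dir

    def tasks(self):
        """Yield one (glacier, year, output_path) tuple per combination."""
        for glacier_name in self.glaciers:
            for year in self.years:
                path = os.path.join(self.out_dir,
                                    f"{glacier_name}_grid_{year}.parquet")
                yield glacier_name, year, path
```

Each yielded task would then drive the grid-generation step shown above, and the task list is also a natural unit of work to distribute across workers.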
The current approach, based on the Switzerland notebook, involves a double for loop over all glaciers and years that applies the following function:
```python
dataset_grid_yearly = mbm.data_processing.Dataset(
    cfg=cfg,
    data=df_grid_y,
    region_name='CH',
    region_id=11,
    data_path=cfg.dataPath + path_PMB_GLAMOS_csv)

# Convert to monthly time resolution
dataset_grid_yearly.convert_to_monthly(
    meta_data_columns=cfg.metaData,
    vois_climate=vois_climate + ['pcsr'],
    vois_topographical=voi_topographical,
)

# Ensure 'pcsr' column exists before saving
if 'pcsr' not in dataset_grid_yearly.data.columns:
    raise ValueError(
        f"'pcsr' column not found in dataset for glacier '{glacier_name}' in year {year}"
    )

# Save the dataset for the specific year
save_path = os.path.join(
    folder_path, f"{glacier_name}_grid_{year}.parquet")
print(f'Saving gridded dataset to: {save_path}')
dataset_grid_yearly.data.to_parquet(save_path,
                                    engine="pyarrow",
                                    compression="snappy")
```

We should create a wrapper function that automatically runs this simulation for multiple years and glaciers, avoiding the double for loop in Python (which is very slow) and implementing some form of parallelization.
After this, we should also update the documentation accordingly and add a quick example and tutorial on how to run distributed simulations.