Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add functions to compute, plot, store the local hazard exceedence intensity and RP maps (new branch) #898

Merged
merged 17 commits into from
Jul 15, 2024

Conversation

ValentinGebhart
Copy link
Collaborator

@ValentinGebhart ValentinGebhart commented Jun 20, 2024

Changes proposed in this PR (new version of old PR #857):

  • addition of new function to compute local return period maps local_return_period and _loc_return_period for given hazard intensities, giving out a geodataframe
  • addition of function subplots_from_gdf in util.plot that plots different subplots from the columns of a gdf

Earlier parts removed (commented out in code) until agreement:

  • addition of new functions to store the hazard maps that are generated in the Hazard.plot_rp_intensity() function as raster files write_raster_local_exceedance_inten. (do we also need as netCDF files write_netcdf_local_exceedance_inten?)
  • addition of new function to store these local return period maps for given hazard intensities as raster files write_raster_local_return_periods or as NetCDF files write_netcdf_local_return_periods.

This PR fixes #854
(new version of old PR #857)

PR Author Checklist

PR Reviewer Checklist

Copy link
Member

@chahank chahank left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updated clean version of this PR! I think this will be a quite useful set of methods.

I have 2 general points that itch me at the moment.

  • I am not sure it makes sense that the plotting and saving methods always call and recalculate the values. Maybe we should rather store the values as a new attribute. Do you know how much computation this costs for large datasets? @emanuel-schmid @peanutfun : what do you think, would it be ok if the method local_return_period adds an attribute to the hazard object?
  • I find the method local_return_period and the method _loc_return_period quite hard to read at the moment. I cannot say exactly why, but probably a mix of not always clear variable names, hidden changing of variables (like the return_periods), and quite some loops.

I thus suggest to a) wait for @emanuel-schmid and @peanutfun opinion about the attribute b) rewrite the code to avoid updating the return periods in the back.

climada/hazard/base.py Outdated Show resolved Hide resolved
climada/hazard/base.py Outdated Show resolved Hide resolved
climada/hazard/base.py Outdated Show resolved Hide resolved
climada/hazard/base.py Outdated Show resolved Hide resolved
climada/hazard/base.py Outdated Show resolved Hide resolved
climada/hazard/io.py Outdated Show resolved Hide resolved
climada/hazard/test/test_base.py Outdated Show resolved Hide resolved
climada/hazard/test/test_io.py Outdated Show resolved Hide resolved
climada/hazard/io.py Outdated Show resolved Hide resolved
climada/hazard/io.py Outdated Show resolved Hide resolved
@peanutfun
Copy link
Member

I am not sure it makes sense that the plotting and saving methods always call and recalculate the values. Maybe we should rather store the values as a new attribute. Do you know how much computation this costs for large datasets? @emanuel-schmid @peanutfun : what do you think, would it be ok if the method local_return_period adds an attribute to the hazard object?

I agree that the values should not be calculated in the plotting method. Instead, the user should insert the data to plot into the appropriate function.

I very much dislike adding new attributes after the class initialization. I would return the data to the user, who should take care of "bookkeeping" themselves. They can then plug the values into a plotting function, store them separately, etc.

@chahank
Copy link
Member

chahank commented Jun 21, 2024

I agree that the values should not be calculated in the plotting method. Instead, the user should insert the data to plot into the appropriate function.

I very much dislike adding new attributes after the class initialization. I would return the data to the user, who should take care of "bookkeeping" themselves. They can then plug the values into a plotting function, store them separately, etc.

I also agree with you that adding new attributes after the class initialization is a bad habit. Here the point is that plotting the intensity return period on the map requires the centroids, and thus having it in the hazard object is convenient... not sure what the best solution is.

@ValentinGebhart
Copy link
Collaborator Author

I very much dislike adding new attributes after the class initialization. I would return the data to the user, who should take care of "bookkeeping" themselves. They can then plug the values into a plotting function, store them separately, etc.

I also agree with you that adding new attributes after the class initialization is a bad habit. Here the point is that plotting the intensity return period on the map requires the centroids, and thus having it in the hazard object is convenient... not sure what the best solution is.

One way of returning data to the user without losing the centroids could be to return a geodataframe?

@peanutfun
Copy link
Member

peanutfun commented Jun 21, 2024

@ValentinGebhart Great idea! Returning a GeoDataFrame with the return period information and the centroids geometry should contain all data necessary for the plot.

Users can then use the geopandas functions for plotting and writing. Maybe this solves all issues? 😅

@emanuel-schmid
Copy link
Collaborator

@ValentinGebhart I agree the idea is not bad - but returning a GeoDataFrame is still somewhat fuzzy. A cleaner solution would be to return a tuple (intensity return period, centroids geometry) or a class with two attributes, intensity_return_period and geometry.
This seems cleaner. Ideally any requirement should be explicit in a method's signature and not hidden away in an object that could contain anything at all, like a dict or a dataframe.

@chahank
Copy link
Member

chahank commented Jun 24, 2024

@emanuel-schmid I strongly disagree with this take as is. The proposed outputs (tuples or custom class) maybe would be cleaner from a programming point of view, but it would make the output mostly useless from a user perspective. I do not see the problem with giving back a geodataframe. Can you maybe explain why this would be a bad idea?

Besides, we already have a precedent with the method impact_at_reg that returns a DataFrame :D

@emanuel-schmid
Copy link
Collaborator

Yeah. Agreed. Although I must stress that the principle holds - in this particular case a dataframe seems alrightish. 😁

@chahank
Copy link
Member

chahank commented Jun 25, 2024

Yeah. Agreed. Although I must stress that the principle holds - in this particular case a dataframe seems alrightish. 😁

Indeed. I think there is a general question on how climada objects should return extra information. We had the questions about Centroids and things like pixel sizes, Impacts and regional values, or here Hazards and return period.

@ValentinGebhart
Copy link
Collaborator Author

ValentinGebhart commented Jun 28, 2024

In the new commit, local_return_period outputs a GeoDataFrame and gdf_meta (named tuple with plotting metadata information) and I add a function subplots_from_gdf to util.plot that plots subplots from the different columns of a gdf (using the gdf_meta information). I thought it would be nice to have this plot function as general as possible (so it could plot return periods and exceedance frequencies and more), but if this is too messy, we can also write a more specific "plot_return_period" function.

After discussion with Lukas, I create the meta data of the gdf as a named tuple, such that it is clear which attributes gdf_meta must have to correctly plot.

I removed the write functions (and their tests) because we might want to generalize them to general gdf. I made a backup if we change our minds.

@ValentinGebhart
Copy link
Collaborator Author

After the changes (see above comment) a new review is requested.

Copy link
Member

@chahank chahank left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this simplified version! It makes it much clearer.

There are a few main points to discuss:
a) is it computing what it should? The test looks incorrect to me, but I probably have mixed up something. Can you please enlighten me? Did it myself.
b) I am not sure whether the GdfMeta is a good idea.
c) I am not sure whether the chunking is working as intented (and whether it is needed at all.)

Comment on lines 460 to 461
threshold_intensities : np.array
Hazard intensities to consider.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the name threshold_intensities rather confusing, and the docstring does not really help. Can you be a bit more verbatim please?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about "test_intensities"? I agree, I'll update the docstrings

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm I think the word test would be quite confusing actually. Maybe just intensities? Or even avoid the plural and use intensity? Or keep threshold_intensities but with a clearer docstring.

climada/hazard/base.py Outdated Show resolved Hide resolved
Comment on lines 471 to 472
LOGGER.warning("The Hazard's frequency unit is %s and not %s which "
"most likely leads to incorrect results",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

most likely leads to incorrect results does not tell me much. What is meant by that? Why would having different units of frequencies lead to incorrect results? What one gets is simply return periods in the unit of the frequency.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I could just use the inverse of the frequency attribute. I now added using a if elif statement that can read years, months, and weeks. I don't know how to make this more generic and easier because the frequency attribute is just a string and not a "physical" frequency unit that can be converted eg to years.

#create gdf meta data
gdf_meta = GdfMeta(
name = 'Return Period',
unit = 'Years',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use the units of the frequency (or rather its inverse)?

climada/hazard/base.py Outdated Show resolved Hide resolved
climada/hazard/test/test_io.py Outdated Show resolved Hide resolved
climada/hazard/test/test_io.py Outdated Show resolved Hide resolved
climada/test/test_plot.py Outdated Show resolved Hide resolved
climada/util/test/test_plot.py Outdated Show resolved Hide resolved
Comment on lines 880 to 881
# Create GdfMeta class (as a named tuple) for GeoDataFrame meta data
GdfMeta = namedtuple('GdfMeta', ['name', 'unit', 'col_name', 'col_unit'])
Copy link
Member

@chahank chahank Jul 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm I am not sure whether this is a good idea. Maybe it is, maybe not.

@emanuel-schmid What do you think? Should we add some nametuple in the utility plot for a specific case of geopandas plotting?

Copy link
Member

@chahank chahank Jul 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To add more details: this is called very generically GdfMeta, but refers to only one very specific data set of geodataframes.

col_name is already in the dataframe, col_unit and unit are rather confusing, and name is a bit unclear to me too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggested this to pass information to the plot that is not explicitly contained in the GeoDataFrame. GdfMeta can be renamed, of course. But I think a namedtuple makes more sense than a dict, because you can be sure which keys/attributes are defined. col_name specifies the name of the column to plot. But I agree that the names can be adjusted and made clearer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of passing some meta data for the plotting. I am not sure that this is the perfect way this way :D At first I thought that GdfMeta some new property of GeoDataFrames. Also, is the current proposal general enough for a generic geopandas plotting method in CLIMADA? Or should it then be made more specific.

And, namedtuple makes sense indeed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ValentinGebhart, could we maybe find another way to put this information into the GeoDataFrame? @chahank and I had the following ideas when discussing this in person:

  • The col_name to plot can be a separate parameter, as users generally have to state which functions to plot in GeoDataFrame.plot() and similar functions
  • The col_unit can be added to the column name in parentheses or brackets. The plot function then reads this information is brackets/parentheses are found, otherwise it does not display a unit.

Copy link
Collaborator Author

@ValentinGebhart ValentinGebhart Jul 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the naming is not optimal. The included information should be: what quantity (+ units) is represented by the values in the gdf, what quantity (+ units) is represented by the values in the column labels. The rows always correspond to centroids. Maybe instead of name we could use quantity or value_type?

Otherwise, the plotting function should be "generic", in the sense that it produces one subplot per column of a generic gdf, using the values to color the different points, and using the gdf_meta varibale for figure and axis label. Of course, it is still using the geo_im_from_array util function, and I'm sure there are other ways to plot a gdf with several columns.
Let me know what you think

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@peanutfun sorry I didn't see your suggestion in time. We can definitely add the labels and units as parameter(s) to the plot function. Then the user has to know about what their gdf represents without us helping by providing the information, which I guess is ok but is a bit more work to the user. If we put the as defaults to the plot function, the plot function is not generic anymore.

Copy link
Collaborator

@emanuel-schmid emanuel-schmid Jul 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically along the same lines as @peanutfun
I'd suggest to have two parameters instead of gdf_meta: GdfMeta,
colbar_name: str and title_subplots: lambda.

To get the same result as implemented now when calling

subplots_from_gdf(
    gdf=gdf, 
    gdf_meta=GdfMeta(name='Return Period', unit='Years', col_name='Threshold Intensity', col_unit='m/s')
)

you would

subplots_from_gdf(
    gdf=gdf, 
    colbar_name="Return Period (Years)", 
    title_subplots=lambda column: f"Threshold Intensity: {column} m/s",
)

Seems better readable and more flexible

Copy link
Collaborator

@emanuel-schmid emanuel-schmid Jul 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consequently, I'd let the local_return_period return a triple

Returns
-------
gdf : GeoDataFrame
label : str
    GeoDataFrame label, for reporting and plotting
column_label : lambda
    Column label, for reporting and plotting

in case the user is not 100% sure what they did and is happy to get this kind of conformation. (Why not?)

@chahank
Copy link
Member

chahank commented Jul 11, 2024

In the new commit, local_return_period outputs a GeoDataFrame and gdf_meta (named tuple with plotting metadata information) and I add a function subplots_from_gdf to util.plot that plots subplots from the different columns of a gdf (using the gdf_meta information). I thought it would be nice to have this plot function as general as possible (so it could plot return periods and exceedance frequencies and more), but if this is too messy, we can also write a more specific "plot_return_period" function.

I like the generality! I only fear that the GdfMeta might make it not generic.

After discussion with Lukas, I create the meta data of the gdf as a named tuple, such that it is clear which attributes gdf_meta must have to correctly plot.

I removed the write functions (and their tests) because we might want to generalize them to general gdf. I made a backup if we change our minds.

Sounds good!

Copy link
Member

@chahank chahank left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work! Please finalize the cosmetics: the linter complains about trailing white spaces to be removed and some indention that is incorrect. There is one file that is modified by mistake I think.

Also, don't forget to add your name to the list of authors: https://github.com/CLIMADA-project/climada_python/blob/feature/write_haz_rp_maps2/AUTHORS.md

Comment on lines 35 to 37
from climada.test import get_test_file

HAZ_TEST_TC :Path = get_test_file('test_tc_florida')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is not part of this PR.

@ValentinGebhart
Copy link
Collaborator Author

Good work! Please finalize the cosmetics: the linter complains about trailing white spaces to be removed and some indention that is incorrect. There is one file that is modified by mistake I think.

Also, don't forget to add your name to the list of authors: https://github.com/CLIMADA-project/climada_python/blob/feature/write_haz_rp_maps2/AUTHORS.md

Thanks for the review(s)! I now understood what the trailing white space means :) For me this looks ready to merge, let me know if there is something else to do.

@chahank
Copy link
Member

chahank commented Jul 15, 2024

This is good, feel free to merge!

@ValentinGebhart ValentinGebhart merged commit d67c804 into develop Jul 15, 2024
12 of 18 checks passed
@ValentinGebhart ValentinGebhart deleted the feature/write_haz_rp_maps2 branch July 15, 2024 16:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants