
feature/issue-61: Resolve Dimensions with MetaData #68

Merged: 37 commits into develop from feature/issue-61 on Apr 13, 2022
Conversation

nlenssen2013
Contributor

GitHub Issue: #61

Description

SNDR and TROPOMI have variables without dimensions that need to be transferred

Overview of work done

Resolve cases where xarray_enhancements was not completely fixing dimensions

Overview of verification done

Summarize the testing and verification you've done. This includes unit tests or testing with specific data

Overview of integration done

Explain how this change was integration tested. Provide screenshots or logs if appropriate. An example of this would be a local Harmony deployment.

PR checklist:

  • Linted
  • Updated unit tests
  • Updated changelog
  • Integration testing

See Pull Request Review Checklist for pointers on reviewing this pull request

@skorper skorper changed the title Feature/issue 61 - Resolve Dimensions with MetaData feature/issue-61: Resolve Dimensions with MetaData Mar 22, 2022
```diff
@@ -1125,7 +1123,7 @@ def subset(file_to_subset, bbox, output_file, variables=None,  # pylint: disable
         ) for lat_var_name in lat_var_names
     ]
     chunks_dict = calculate_chunks(dataset)

+    print (lat_var_names)
```
Contributor

Can you remove this print statement and any others?

```diff
@@ -1688,6 +1687,98 @@ def test_get_time_OMI(self):
         assert "Time" in time_var_names[0]
         assert "Latitude" in lat_var_names[0]

+    def test_sndr_dims(self):
```
Contributor

It seems like this test contains a lot of duplicate code from the subsetter itself -- can you call a function from the subsetter and make assertions on the result rather than doing it this way?

Contributor Author

Trying to avoid calling the entire subset.subset method and instead just call the subset_with_bbox method. Will add assertions comparing the dimensions of the input vs. the output.
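
For illustration, a minimal sketch of that test shape, assuming a `subset_with_bbox` function that takes an opened dataset plus coordinate variable names and returns the subset result (the actual l2ss-py signature, arguments, and return type may differ):

```python
import numpy as np
import xarray as xr

from podaac.subsetter import subset


def test_sndr_dims(self):  # method to add inside the existing test class
    """Variables should not gain extra dimensions from a bbox subset."""
    dataset = xr.open_dataset("path/to/sndr_granule.nc", decode_times=False)

    # Record each variable's dimensions before subsetting.
    dims_before = {name: set(var.dims) for name, var in dataset.data_vars.items()}

    # Hypothetical call; argument names here are assumptions, not the exact API.
    result = subset.subset_with_bbox(
        dataset=dataset,
        lat_var_names=["lat"],
        lon_var_names=["lon"],
        bbox=np.array([[-180, 180], [-90, 90]]),
    )

    # Assert dimensions of input vs. output: no variable gains new dims.
    for name, var in result.data_vars.items():
        assert set(var.dims) <= dims_before[name]
```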

```python
while not var_dims:
    var_group_parent = var_group_parent.parent
    var_dims = list(var_group_parent.dimensions.keys())
var_dims = []
```
Contributor

Wouldn't `var_dims` already be an empty list? It seems like this line does nothing. Also, I'm a little concerned about this breaking functionality with our existing datasets.

Contributor Author

Yes, I removed the `var_dims` reassignment. Do you have an idea of which existing datasets might break? All the unit tests pass.
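
For context, here is a standalone sketch of the parent-group walk in the snippet above, using the netCDF4 library; the helper name and setup are illustrative, not the PR's actual code:

```python
import netCDF4 as nc


def resolve_var_dims(group, var_name):
    """Resolve a variable's dimensions, falling back to ancestor groups.

    Some SNDR/TROPOMI variables report no dimensions in their own group,
    so we walk up the group hierarchy until dimensions are found.
    """
    var_dims = list(group.variables[var_name].dimensions)
    parent = group
    while not var_dims:
        # Assumes some ancestor defines dimensions; the root group's
        # parent is None, so real code would need a guard here.
        parent = parent.parent
        var_dims = list(parent.dimensions.keys())
    return var_dims
```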

```python
    var for var in dataset.data_vars.keys()
    if var in variables and var not in group_vars and not var.startswith(tuple(lat_var_prefix))
])
group_vars.extend([
```
Contributor

Why did you pull this out of the `if variables` block? Wouldn't this do nothing, because the `var in variables` check would always be False?

Contributor Author

Re-added the `if variables` statement.
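
A sketch of the resulting shape, with the comprehension back inside the `if variables` guard (surrounding names follow the snippet above and are otherwise assumed):

```python
# Only filter against the requested variable list when one was provided;
# otherwise `var in variables` could never be True (or would fail on None).
if variables:
    group_vars.extend([
        var for var in dataset.data_vars.keys()
        if var in variables
        and var not in group_vars
        and not var.startswith(tuple(lat_var_prefix))
    ])
```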

@nlenssen2013 nlenssen2013 requested a review from skorper March 23, 2022 16:37
```python
if original_type != new_type:
    new_dataset[variable_name] = xr.apply_ufunc(cast_type, new_dataset[variable_name],
                                                str(original_type), dask='allowed',
                                                keep_attrs=True)

if partial_dim_in_in_vars and (indexers.keys() - dataset[variable_name].dims) and set(
```
Contributor

A couple questions:

  • Why is this only applicable to cases where there is no `_FillValue`?
  • If there is a partial overlap with the subset dims, will we lose that subset?

Contributor Author

  1. It is not, to my knowledge. The logic on line 201 needs to be applied to cases both with and without `_FillValue`. A few SNDR variables were not getting picked up by this logic, hence the excess dimensions.
  2. If I understand correctly, you mean the case where a variable has only one of the dimensions being subset? `/sat_vel` in SNDR has dimensions (atrack, spatial), and after a bbox subset the atrack dimension is smaller, as expected.

Contributor

> It is not, to my knowledge. The logic on line 201 needs to be applied to cases both with and without `_FillValue`. A few SNDR variables were not getting picked up by this logic, hence the excess dimensions.

Are you planning on making that change before we merge?

Contributor Author

> It is not, to my knowledge. The logic on line 201 needs to be applied to cases both with and without `_FillValue`. A few SNDR variables were not getting picked up by this logic, hence the excess dimensions.

> Are you planning on making that change before we merge?

The line that you commented on is the change. I was just explaining what the change does and what was lacking before.

Contributor

Are you referring to the following three lines, where this same logic is applied in the missing `_FillValue` case?

```python
new_dataset[variable_name] = indexed_var
new_dataset[variable_name].attrs = indexed_var.attrs
variable.attrs = indexed_var.attrs
```

My concern is that the block above is only supposed to be applied in cases where there is no FillValue. When there is a FillValue, data outside the provided bounds are replaced with the FillValue. If there is no FillValue, we use the original data, meaning data outside the bounds will still be in the subsetted result, which results in an incomplete subset. It seems like you're applying this logic to cases where there is a FillValue, meaning we will get an incomplete subset even when we don't have to.

> after a bbox subset the atrack dimension is smaller, as expected

This doesn't necessarily mean the data is fully subset. Our subset operation has two steps:

  1. Cut dimensions as much as possible, meaning dimension size itself will shrink. This is the "indexers" logic you see here
  2. Within the remaining data, mask values that are not within the bounds. This means data within the now reduced dimensions will be replaced with "nan" if they are outside of the bounds.
    • If the variable has FillValue, replace nan with FillValue
    • If the variable has no FillValue, replace nan with original values

Sorry for the essay; please let me know if this makes sense.
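
A toy xarray illustration of those two steps (made-up data and names, not the actual l2ss-py code):

```python
import numpy as np
import xarray as xr

# Toy swath: 4 along-track rows, 4 cross-track columns.
lat = xr.DataArray(np.array([-30.0, -10.0, 10.0, 30.0]), dims="atrack")
lon = xr.DataArray(np.array([-60.0, -20.0, 20.0, 60.0]), dims="xtrack")
data = xr.DataArray(np.arange(16.0).reshape(4, 4), dims=("atrack", "xtrack"))

# Bounding box: lat in [-15, 15], lon in [-30, 30].
cond = (lat >= -15) & (lat <= 15) & (lon >= -30) & (lon <= 30)  # 2-D mask

# Step 1: cut dimensions as much as possible (the "indexers" logic) --
# here, drop atrack rows where the condition is False everywhere.
indexers = {"atrack": np.where(cond.any(dim="xtrack").values)[0]}
cut = data.isel(indexers)                   # atrack shrinks from 4 to 2

# Step 2: within the remaining data, mask out-of-bounds values with nan.
masked = cut.where(cond.isel(indexers))

fill_value = -9999.0                        # stand-in for the _FillValue attribute
with_fill = masked.fillna(fill_value)       # has _FillValue: nan -> _FillValue
without_fill = masked.fillna(cut)           # no _FillValue: nan -> original values
```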

Contributor Author

Yes, thanks for this. The `.where` function adds the xtrack and atrack dimensions to all variables, so variables that do not have these dimensions and have no intersection need the extra dimensions removed. My understanding is that there was no logic for this case; the result should just be exactly the same as the original variable before subsetting. SNDR `/air_pres` is a case where

```python
if partial_dim_in_in_vars and (indexers.keys() - dataset[variable_name].dims) and set(indexers.keys()).intersection(dataset[variable_name].dims)
```

returns false, and

```python
if partial_dim_in_in_vars and (indexers.keys() - dataset[variable_name].dims) and set(indexers.keys()).intersection(new_dataset[variable_name].dims)
```

returns true.
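
To make the two conditions concrete, a toy walk-through (dimension names are illustrative, not read from the actual granule):

```python
indexers = {"atrack": [0, 1], "xtrack": [2, 3]}

# /air_pres originally has neither subset dimension...
original_dims = ("air_pres",)
# ...but gains atrack/xtrack once .where() broadcasts the 2-D mask onto it.
post_where_dims = ("air_pres", "atrack", "xtrack")

# Second clause: some indexer dims are missing from the original dims (truthy).
print(indexers.keys() - set(original_dims))                 # {'atrack', 'xtrack'}

# Third clause vs. the ORIGINAL dims: no overlap, so the condition is False.
print(set(indexers.keys()).intersection(original_dims))     # set()

# Third clause vs. the post-.where() dims: overlap exists, condition is True.
print(set(indexers.keys()).intersection(post_where_dims))   # {'atrack', 'xtrack'}
```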

Contributor

@nlenssen2013 Thanks for the explanation. I think my concerns have been alleviated and I will approve once you fix the conflict 🙂

@sonarqubecloud

sonarqubecloud bot commented Apr 13, 2022

Kudos, SonarCloud Quality Gate passed!

  • Bugs: A (0 Bugs)
  • Vulnerabilities: A (0 Vulnerabilities)
  • Security Hotspots: A (0 Security Hotspots)
  • Code Smells: A (0 Code Smells)
  • Coverage: 92.9%
  • Duplication: 0.0%

@skorper
Contributor

skorper commented Apr 13, 2022

@frankinspace SonarCloud is reporting code smell for this PR: https://sonarcloud.io/project/issues?resolved=false&severities=CRITICAL&sinceLeakPeriod=true&types=CODE_SMELL&pullRequest=68&id=podaac_l2ss-py&open=AX-3X9FKvbEZitqyh0cy

It seems the issue is cognitive complexity in the subset function. This PR only lightly touches this code, so I'm thinking we should open a separate issue for simplifying this function, merge this PR, and dismiss this SonarCloud failure. Thoughts?

@frankinspace
Member

I was thinking the same thing @skorper. We need to simplify that function but probably not in this PR

@skorper
Contributor

skorper commented Apr 13, 2022

#81

@skorper skorper merged commit aaad841 into develop Apr 13, 2022
@skorper skorper deleted the feature/issue-61 branch April 13, 2022 20:40