-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feature/issue-61: Resolve Dimensions with MetaData #68
Conversation
podaac/subsetter/subset.py
Outdated
@@ -1125,7 +1123,7 @@ def subset(file_to_subset, bbox, output_file, variables=None, # pylint: disable | |||
) for lat_var_name in lat_var_names | |||
] | |||
chunks_dict = calculate_chunks(dataset) | |||
|
|||
print (lat_var_names) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you remove this print statement and any others?
tests/test_subset.py
Outdated
@@ -1688,6 +1687,98 @@ def test_get_time_OMI(self): | |||
assert "Time" in time_var_names[0] | |||
assert "Latitude" in lat_var_names[0] | |||
|
|||
def test_sndr_dims(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like this test contains a lot of duplicate code from the subsetter itself -- can you call a function from the subsetter and make assertions on the result rather than doing it this way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Trying to minimize calling the entire subset.subset method and just calling the subset_with_bbox method. Will just add assertions to the dimensions input vs output
podaac/subsetter/subset.py
Outdated
while not var_dims: | ||
var_group_parent = var_group_parent.parent | ||
var_dims = list(var_group_parent.dimensions.keys()) | ||
var_dims = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't var_dims
already be an empty list? It seems like this line does nothing. Also I'm a little concerned about this breaking functionality with our existing datasets
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I removed the var_dims reassign. Do you have an idea of what existing datasets might break because all the unittests pass?
podaac/subsetter/subset.py
Outdated
var for var in dataset.data_vars.keys() | ||
if var in variables and var not in group_vars and not var.startswith(tuple(lat_var_prefix)) | ||
]) | ||
group_vars.extend([ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you pull this out of the if variables
block? Wouldn't this do nothing because the if var in variables
would always be False?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re-added the if variables statement
if original_type != new_type: | ||
new_dataset[variable_name] = xr.apply_ufunc(cast_type, new_dataset[variable_name], | ||
str(original_type), dask='allowed', | ||
keep_attrs=True) | ||
|
||
if partial_dim_in_in_vars and (indexers.keys() - dataset[variable_name].dims) and set( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple questions:
- Why is this only applicable to cases where there is no _FillValue?
- If there is a partial overlap with subset dims, will we lose that subset?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- It is not, to my knowledge. The logic in line 201 needs to be applied to cases with and without _FillValue. A few SNDR variables were not getting picked up in this logic, thus the excess dimensions
- If I understand correctly you mean the case when a variable has only one of the dimensions that are being subset? /sat_vel in SNDR has dimensions of (atrack,spatial) and after a bbox subset the atrack dimension is lower as expected
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not, to my knowledge. The logic in line 201 needs to be applied to cases with and without _FillValue. A few SNDR variables were not getting picked up in this logic, thus the excess dimensions
Are you planning on making that change before we merge?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not, to my knowledge. The logic in line 201 needs to be applied to cases with and without _FillValue. A few SNDR variables were not getting picked up in this logic, thus the excess dimensions
Are you planning on making that change before we merge?
The line that you commented on is the change. Was just explaining what the change was doing and what was lacking before
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you referring to the following 3 lines where this same logic is applied in the missing FillValue case?
l2ss-py/podaac/subsetter/xarray_enhancements.py
Lines 225 to 228 in 94f1630
new_dataset[variable_name] = indexed_var | |
new_dataset[variable_name].attrs = indexed_var.attrs | |
variable.attrs = indexed_var.attrs |
My concern is that the block above is only supposed to be applied in cases where this is no FillValue. When there is FillValue, data outside the provided bounds are replaced with FillValue. If there is no FillValue, what we use the original data, meaning data outside the bounds will still be in the subsetted result, which results in an incomplete subset. It seems like you're applying this logic to cases where there is a FillValue, meaning we will get an incomplete subset even when we don't have to.
after a bbox subset the atrack dimension is lower as expected
This doesn't necessarily mean the data is fully subset. Our subset operation has two steps:
- Cut dimensions as much as possible, meaning dimension size itself will shrink. This is the "indexers" logic you see here
- Within the remaining data, mask values that are not within the bounds. This means data within the now reduced dimensions will be replaced with "nan" if they are outside of the bounds.
- If the variable has FillValue, replace nan with FillValue
- If the variable has no FillValue, replace nan with original values
Sorry for the essay, please let me know if this makes sense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes thanks for this. The .where function adds the xtrack and atrack dimension to all variables, and variables that do not have these dimensions and have no intersection need to have the extra dimensions removed. My understanding is that there is no logic for this case and it really should just be exactly the same as the original variable before subsetting.
SNDR /air_pres is that
if partial_dim_in_in_vars and (indexers.keys() - dataset[variable_name].dims) and set( indexers.keys()).intersection(dataset[variable_name].dims)
returns false and:
if partial_dim_in_in_vars and (indexers.keys() - dataset[variable_name].dims) and set( indexers.keys()).intersection(new_dataset[variable_name].dims)
returns true
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nlenssen2013 Thanks for the explanation. I think my concerns have been alleviated and I will approve once you fix the conflict 🙂
Kudos, SonarCloud Quality Gate passed! |
@frankinspace SonarCloud is reporting code smell for this PR: https://sonarcloud.io/project/issues?resolved=false&severities=CRITICAL&sinceLeakPeriod=true&types=CODE_SMELL&pullRequest=68&id=podaac_l2ss-py&open=AX-3X9FKvbEZitqyh0cy It seems the issue is cognitive complexity in the |
I was thinking the same thing @skorper. We need to simplify that function but probably not in this PR |
Github Issue: #61
Description
SNDR and TROPOMI have variables without dimensions that need to be transfered
Overview of work done
Resolve issues that xarray_enhancements wasn't completely fixing dimensions
Overview of verification done
Summarize the testing and verification you've done. This includes unit tests or testing with specific data
Overview of integration done
Explain how this change was integration tested. Provide screenshots or logs if appropriate. An example of this would be a local Harmony deployment.
PR checklist:
See Pull Request Review Checklist for pointers on reviewing this pull request