Unexpected creation of subsetVariables attribute when loading a dataset #180

Open
honzaflash opened this issue Jul 29, 2024 · 3 comments · May be fixed by #256

honzaflash commented Jul 29, 2024

Describe the bug
When using EDDTableFromNcCFFiles, ERDDAP unexpectedly shows a subsetVariables attribute in the dataset's metadata, even though the attribute is neither added in the datasets.xml configuration nor present in the source netCDF file.
Furthermore, it seems to use source variable names, which means that if you rename the variables by using a different <destinationName>, the dataset won't load.
Also, when using the GenerateDatasetsXml tool, the subsetVariables attribute shows up in the commented "sourceAttributes" section (despite not being an attribute in the source file).

To Reproduce
I will add a separate comment with example files, xml, and instructions.

Expected behavior
The subsetVariables attribute continues to be generated, but using the variables' destination names.
When using the GenerateDatasetsXml scripts, subsetVariables is printed under <addAttributes> and not in the source attributes section.
This behavior should also be documented. I did not find any mention of the attribute being generated, except for SOS datasets.

Additional context
I have traced the problem for dataset xml generation.
I believe the "sourceAttributes" in the output xml come from here:

sb.append(writeAttsForDatasetsXml(false, dataSourceTable.globalAttributes(), " "));

Table class is used for the dataSourceTable:
dataSourceTable.readNcCF(sampleFileName, null, tStandardizeWhat, null, null, null);

readNcCF sets the global attribute to a value computed from other attributes:
globalAttributes.set("subsetVariables", subsetVars.toString()); // may be "", that's okay

EDITED: corrected sourceVariables -> subsetVariables

@honzaflash
Author

Steps to reproduce

  1. Run the ERDDAP container with necessary volumes pointing to the stuff above
  2. Inspect the dataset in ERDDAP - it has the global subsetVariables attribute
  3. Inspect the source file - the attribute is not present there.

Also

  1. With ERDDAP container running, connect to a shell in it using docker exec -it erddap-container bash
  2. Navigate to the necessary dir and run the GenerateDatasetsXml.sh script
  3. Choose EDDTableFromNcCFFiles, input the path to the linked nc file, and generate an XML template
  4. See that the output template has subsetVariables in the commented "sourceAttributes" section

@ChrisJohnNOAA
Contributor

@honzaflash Thanks for the detailed report and the example files!

I believe that in most of the places where you said sourceVariables, you meant subsetVariables. Let me know if that's wrong.

During GenerateDatasetsXml, ERDDAP automatically generates subsetVariables for all table dataset types (unless a dataset type has a special exception that I don't know of) if there are no source subsetVariables. This should be covered in the documentation.

Currently for EDDTableFromNcCFFiles, there's a complication (as you noted): the suggested subsetVariables are generated while the NcCF file is being read (in Table.java). This makes the suggested subsetVariables indistinguishable from source attributes to the rest of ERDDAP (and to GenerateDatasetsXml). It also means the code that generates the subsetVariables only knows the source variable names (not the destination names), which is why they use source variable names.
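
For readers who want to see the failure mode in isolation, here is a toy Python model of the situation described above. It is not ERDDAP code: `read_source_file`, `load_dataset`, and the variable names are all hypothetical, and the way the real reader chooses which variables to suggest is not modeled. It only shows that an attribute written by the reader looks like any other source attribute, and that validating it against destination names fails once a listed variable is renamed.

```python
# Toy model only: hypothetical names, not ERDDAP's actual classes or logic.

def read_source_file(source_vars, suggested_subset):
    """Pretend to read a file and return (variables, global attributes).

    Like the reader described above, it adds a computed subsetVariables
    entry to the global attributes, using the only names it knows:
    the source names.
    """
    global_attrs = {"title": "example"}  # stands in for real file attributes
    global_attrs["subsetVariables"] = ", ".join(suggested_subset)
    return list(source_vars), global_attrs


def load_dataset(source_vars, global_attrs, renames):
    """Apply sourceName -> destinationName renames, then check that every
    name listed in subsetVariables exists among the destination names."""
    dest_names = [renames.get(name, name) for name in source_vars]
    raw = global_attrs.get("subsetVariables", "")
    subset = [part.strip() for part in raw.split(",") if part.strip()]
    unknown = [name for name in subset if name not in dest_names]
    if unknown:
        raise ValueError(f"subsetVariables not found in dataset: {unknown}")
    return dest_names, subset
```

With no renames, `load_dataset` succeeds. Passing a rename such as `{"stn": "station"}` for a variable listed in subsetVariables raises `ValueError`, which mirrors the load failure reported in this issue.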

Based on checking code history, this has been ERDDAP's behavior for at least 12 years. That doesn't make it right, but it does make it likely folks are relying on this behavior. I'm looking into the best way to move towards the desired behavior.

I know you mentioned expected behavior is to continue generating the subsetVariables, but using destination names (instead of the current behavior to use source names). Is your desired behavior to generate the subsetVariables automatically at runtime (current behavior), during GenerateDatasetsXml (other Table dataset type behavior), or both?

@honzaflash
Author

@ChrisJohnNOAA Thank you for the response. (Yes, it appears I confused source and subset for some reason, I am glad you figured it out. I edited the post for anyone else discovering this in the future.)

I suppose I listed the expected behavior as just that -- what I would expect. Not necessarily a behavior that we need.

I was working on a Python script that generates the dataset.xml for hundreds of datasets like the one linked above. Some of the variables needed to be renamed (sourceName != destinationName), and that made ERDDAP quite upset when it happened to one of the subset variables, because its source name would no longer match any of the existing variables. I ended up working around this by generating the subsetVariables attribute myself and applying the renaming to it. I don't think the attribute is super important to us anyway, so it works OK for us this way.
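
A minimal sketch of what such a workaround could look like, for anyone hitting the same problem. This is not the script mentioned above: the function names and the example variable names are hypothetical. It assumes subsetVariables is a comma-separated list of names and that the result is written as an `<att>` element inside `<addAttributes>` in datasets.xml.

```python
from xml.sax.saxutils import escape


def rename_subset_variables(subset_variables, renames):
    """Map each source name in a comma-separated subsetVariables string
    to its destination name; names without a rename are kept as-is."""
    names = [n.strip() for n in subset_variables.split(",") if n.strip()]
    return ", ".join(renames.get(n, n) for n in names)


def subset_variables_att(subset_variables):
    """Format the value as an <att> element for an <addAttributes> block."""
    return f'<att name="subsetVariables">{escape(subset_variables)}</att>'


# Hypothetical example: "stn" is renamed to "station" in the dataset config.
renamed = rename_subset_variables("stn, lat, lon", {"stn": "station"})
att_line = subset_variables_att(renamed)
```

Setting the attribute explicitly this way keeps the listed names in step with whatever `<destinationName>` values the generator script assigns.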

It was just unexpected extra work and it took me a while to figure out what was even happening as I was somewhat new to the netCDF format and all the metadata conventions.

So to actually answer your question:

  • It was the runtime behavior that caused the actual issues, but we worked around it.
  • The GenerateDatasetsXml behavior confused me when trying to investigate (by including the subsetVariables in the comment with source attributes), but we don't rely on its output.
  • I understand that changing a behavior like this could be a breaking change for many people. Really, I just wanted to put this on your radar, and I think that documenting the behavior better would be a sufficient way to resolve this. At least for now.

@ChrisJohnNOAA ChrisJohnNOAA self-assigned this Dec 9, 2024
@ChrisJohnNOAA ChrisJohnNOAA linked a pull request Feb 11, 2025 that will close this issue