Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Inconsistent capitalization in metadata files and schema #112

Open
donutbrew opened this issue Feb 26, 2025 · 5 comments
Open

[BUG] Inconsistent capitalization in metadata files and schema #112

donutbrew opened this issue Feb 26, 2025 · 5 comments
Assignees
Labels
bug Something isn't working

Comments

@donutbrew
Copy link
Collaborator

Describe the bug

In v1.3.2 the fields src-isolate and src-host have been changed to capitalized src-Isolate and src-Host which break preparation scripts upstream. It is very difficult to deal with changing schemas.

Version

  • Version Number: [e.g. v1.2.0.]
  • SeqSender Version: Docker
  • OS: Linux

Additional context

See:

https://github.com/search?q=repo%3ACDCgov%2Fseqsender%20src-host&type=code
https://github.com/search?q=repo%3ACDCgov%2Fseqsender%20src-isolate&type=code

@donutbrew donutbrew added the bug Something isn't working label Feb 26, 2025
@leebrian
Copy link
Member

I think it's a good idea to fix these back to their original values and also to establish a convention where everything is always lower case, no spaces, using dashes, not underscores.

@dthoward96
Copy link
Collaborator

dthoward96 commented Feb 26, 2025

@donutbrew @leebrian The capitalization requirement has been present for those two fields since v1.2.0. This was also included in the release notes at the time. So, I think you might still be on version v1.1.0 as that capitalization has been a requirement since v1.2.0.

v1.2.0 was when I started making release notes and this is where I standardized the convention for naming things with the reason being that capitalization, underscores, etc. does matter for NCBI. NCBI also does not clarify where they check for capitalization and where they do not. So for some fields it would error out depending on capitalization or not while other fields they would automatically correct the capitalization for you. This would also change as errors for when submitting FLU was different from COV.

So I standardized everything 1:1 based on what documents they make available for when SeqSender is running the metadata validation tools added in v1.2.0. in order to reduce the chances of that happening. So I use a prefix with a dash to designate for SeqSender to understand what field each database goes for while everything after that is the expected capitalization and punctuation since almost all the fields by NCBI and GISAID use underscores.

"SeqSender Database Identifier"-"Database Column"

Where I can I've tried to keep everything lowercase and underscore but GenBank has weird modifier capitalizations for its source and comment file and they can throw weird errors. Its also not consistent across GUI/FTP/organism for when its checking and erroring out on these files. So, the capitalization requires for those two files are from how they're documented here.

With all that being said, you can still use the lowercase versions with SeqSender, when you run the submit command the --skip_validation flag can be added to skip SeqSender's metadata validation which was added in v1.2.0 to try and reduce the amount of submission errors that can happen.

@leebrian
Copy link
Member

Thanks for the details. So this is really a case of NCBI being inconsistent with their capitalization and we're just copying them? So we're consistent in our copying field names and cappitalization from upstream? (or do you consider NCBI downstream?)

@donutbrew
Copy link
Collaborator Author

At the same time, this is just the SeqSender metadata file we're talking about.... We've already got one header now called "gs-covv_sex" which is not a GISAID field name. It gets translated back to covv_gender for GISAID. I would argue that the user not having to know about these capitalizations or oddities of the different repositories is a value add of SeqSender. I guess my request is to drop the requirements on the user side and fix them for submission. We can also discuss if this isn't easily doable.

I do agree that it is frustrating that NCBI has these odd capitalizations. This would be good to put on a future agenda with them.

@dthoward96
Copy link
Collaborator

@leebrian @donutbrew Yeah, NCBI isn't really consistent (specifically GenBank). In v1.1.0 a lot of the metadata validation pieces were hand-made based on how we submitted them and what we were used to working. A good reference example of experiences we regularly dealt with around this was the "Sequencing Technology" field for SRA. Capitalization, characters, and spacing matters, so for example "Illumina Hiseq" would not be recognized because it was not spelled "Illumina HiSeq".

Since v1.2.0 we've been keeping consistent with copying from their references. That update also is what updated the BioSample section of SeqSender to actually query the BioSample metadata requirements page from the XML version and updates it based off that criteria to get an exact copy of what NCBI uses. I'm trying to be consistent with what the databases use upstream in order to try and not have a back and forth with NCBI/GISAID. The NCBI reference page for the source file that holds the fields for "Host" and "Isolate" has numerous mistakes. Here's just two examples:

  • "Isolation_source" appears as "Isolation source" on the webpage. In the test COV file they provide, it uses "isolation-source" while in the test FLU file they provide, it uses "Isolation_source". "Isolation_source" works for all, the other options are hit or miss if they work and not only that but some only work for some organisms.
  • "Country (geo_loc_name)" does not work. Its "geo_loc_name" only, required to be lowercase.

I agree that changing the GISAID fields for "gender" to "sex" is misleading based on how I'm trying to match the upstream to the downstream and that was only done in that specific case to meet the requirements of the EO. Otherwise, I'm trying to avoid having SeqSender be the one to have to figure out the capitalization nuance of NCBI because that can then quickly become pointing the blame at SeqSender for not having the correct mapping for every organism when submitting to GenBank.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants