-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] Inconsistent capitalization in metadata files and schema #112
Comments
I think it's a good idea to fix these back to their original values and also to establish a convention where everything is always lower case, no spaces, using dashes, not underscores. |
@donutbrew @leebrian The capitalization requirement has been present for those two fields since v1.2.0. This was also included in the release notes at the time. So, I think you might still be on version v1.1.0 as that capitalization has been a requirement since v1.2.0. v1.2.0 was when I started making release notes and this is where I standardized the convention for naming things with the reason being that capitalization, underscores, etc. does matter for NCBI. NCBI also does not clarify where they check for capitalization and where they do not. So for some fields it would error out depending on capitalization or not while other fields they would automatically correct the capitalization for you. This would also change as errors for when submitting FLU was different from COV. So I standardized everything 1:1 based on what documents they make available for when SeqSender is running the metadata validation tools added in v1.2.0. in order to reduce the chances of that happening. So I use a prefix with a dash to designate for SeqSender to understand what field each database goes for while everything after that is the expected capitalization and punctuation since almost all the fields by NCBI and GISAID use underscores. "SeqSender Database Identifier"-"Database Column" Where I can I've tried to keep everything lowercase and underscore but GenBank has weird modifier capitalizations for its source and comment file and they can throw weird errors. Its also not consistent across GUI/FTP/organism for when its checking and erroring out on these files. So, the capitalization requires for those two files are from how they're documented here. With all that being said, you can still use the lowercase versions with SeqSender, when you run the submit command the |
Thanks for the details. So this is really a case of NCBI being inconsistent with their capitalization and we're just copying them? So we're consistent in our copying field names and cappitalization from upstream? (or do you consider NCBI downstream?) |
At the same time, this is just the SeqSender metadata file we're talking about.... We've already got one header now called "gs-covv_sex" which is not a GISAID field name. It gets translated back to I do agree that it is frustrating that NCBI has these odd capitalizations. This would be good to put on a future agenda with them. |
@leebrian @donutbrew Yeah, NCBI isn't really consistent (specifically GenBank). In v1.1.0 a lot of the metadata validation pieces were hand-made based on how we submitted them and what we were used to working. A good reference example of experiences we regularly dealt with around this was the "Sequencing Technology" field for SRA. Capitalization, characters, and spacing matters, so for example "Illumina Hiseq" would not be recognized because it was not spelled "Illumina HiSeq". Since v1.2.0 we've been keeping consistent with copying from their references. That update also is what updated the BioSample section of SeqSender to actually query the BioSample metadata requirements page from the XML version and updates it based off that criteria to get an exact copy of what NCBI uses. I'm trying to be consistent with what the databases use upstream in order to try and not have a back and forth with NCBI/GISAID. The NCBI reference page for the source file that holds the fields for "Host" and "Isolate" has numerous mistakes. Here's just two examples:
I agree that changing the GISAID fields for "gender" to "sex" is misleading based on how I'm trying to match the upstream to the downstream and that was only done in that specific case to meet the requirements of the EO. Otherwise, I'm trying to avoid having SeqSender be the one to have to figure out the capitalization nuance of NCBI because that can then quickly become pointing the blame at SeqSender for not having the correct mapping for every organism when submitting to GenBank. |
Describe the bug
In v1.3.2 the fields
src-isolate
andsrc-host
have been changed to capitalizedsrc-Isolate
andsrc-Host
which break preparation scripts upstream. It is very difficult to deal with changing schemas.Version
Additional context
See:
https://github.com/search?q=repo%3ACDCgov%2Fseqsender%20src-host&type=code
https://github.com/search?q=repo%3ACDCgov%2Fseqsender%20src-isolate&type=code
The text was updated successfully, but these errors were encountered: