Support for GISAID data #63
Comments
For folks who are interested in this approach to using GISAID data with the modern mpox repo layout, you should run the commands above from inside the Note that the workflow is currently broken for GISAID data until #273 is resolved.
Locally resolving #273 by adding the missing column to the metadata did not fix the workflow because there are still several hardcoded columns in other rules or scripts of the workflow that the metadata doesn't have. The bigger issue is that the workflow expects the data to have been passed through the
I'd like us to avoid this need. I wrote "I don't think we expect (m)any users to write ingest pipelines, I see them as a framework for how the nextstrain team separates concerns for production builds." Spiking data into a nextstrain workflow is something we should support, with certain constraints. I think a good entry point to (any) workflow is to ask users to provide a merged metadata / sequences file (leveraging Mpox may be a great repo to push on this vision - there's lots of non-NCBI data and the workflow is relatively complex.
For users who want to use GISAID data with this workflow, the following steps work nearly as expected.
These steps assume you have downloaded:
Note that the biggest issue with the implementation above is that there is no transform command to convert GISAID's location field to the standard Nextstrain geographic columns (region, country, division, and location). As a result, the default Augur filter logic, which groups by country and year, prints a warning that it cannot find a "country" column and groups only by year. In Augur 16.0.0, this missing group-by column will produce an error instead of a warning, so we should consider implementing the transform for GISAID locations.
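As a rough illustration of what such a transform could look like, here is a minimal Python sketch. It is not an existing Augur or Nextstrain command. It assumes GISAID's location value is a single string with levels separated by " / " (for example "Europe / Germany / Berlin") and that the source column is named "Location"; both are assumptions to check against the actual download. The function names `split_gisaid_location` and `add_geo_columns` are hypothetical.

```python
# Hypothetical sketch: split a GISAID-style location string into the
# standard Nextstrain geographic columns. The "/" separator and the
# "Location" source column name are assumptions, not confirmed by this issue.

GEO_COLUMNS = ("region", "country", "division", "location")


def split_gisaid_location(value, separator="/"):
    """Return a dict with region, country, division, and location.

    Missing levels are filled with empty strings. Any levels beyond the
    fourth stay inside the "location" value.
    """
    if value:
        # Split at most three times so extra levels remain in "location".
        parts = [part.strip() for part in value.split(separator, len(GEO_COLUMNS) - 1)]
    else:
        parts = []
    parts += [""] * (len(GEO_COLUMNS) - len(parts))
    return dict(zip(GEO_COLUMNS, parts))


def add_geo_columns(rows, source_column="Location"):
    """Yield copies of metadata rows (dicts) with the four geographic columns added."""
    for row in rows:
        out = dict(row)
        out.update(split_gisaid_location(row.get(source_column, "")))
        yield out
```

Rows could come from `csv.DictReader` over the metadata TSV, with the result written back out before running the filter step, so that a "country" column exists for grouping.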
Given the commands above, however, I get the following tree from the workflow:
The very long branches also indicate that users will need to manage their own list of strains to exclude, since strain names will not match GenBank accessions.