Support for GISAID data #63

Open
huddlej opened this issue Jun 16, 2022 · 3 comments
Labels
documentation Improvements or additions to documentation

Comments

@huddlej
Contributor

huddlej commented Jun 16, 2022

For users who want to use GISAID data with this workflow, the following steps work nearly as expected.

These steps assume you have downloaded:

  • all sequences in FASTA format with whitespace replaced by underscore
  • patient metadata

```bash
# Change into phylogenetic workflow directory.
cd phylogenetic/

# Create a data directory to download files into.
mkdir -p data/

# Download sequences: data/gisaid_pox_2022_06_16_19.fasta
# Download patient metadata: data/gisaid_pox_2022_06_16_19.tsv
# Note: patient metadata lacks submitting/originating lab.

# Parse out metadata from sequence deflines.
augur parse \
  --sequences data/gisaid_pox_2022_06_16_19.fasta \
  --fields strain gisaid_epi_isl date \
  --output-sequences data/sequences.fasta \
  --output-metadata data/sequence_metadata.tsv

# Join sequence metadata with patient metadata.
csvtk --tabs join -f 1 \
  data/sequence_metadata.tsv \
  data/gisaid_pox_2022_06_16_19.tsv > data/metadata.tsv

# TODO: Need a transform for GISAID locations like the one we have for GenBank.

# Run workflow.
nextstrain build \
  --docker \
  --cpus 1 \
  . \
  --configfile defaults/mpxv/config.yaml \
  --config strain_id_field=strain display_strain_field=strain
```

Note that the biggest issue with the implementation above is that there is no transform command to convert GISAID's location field into the standard Nextstrain geographic columns (region, country, division, and location). As a result, the default Augur filter logic, which groups by country and year, prints a warning that it cannot find a "country" column and only groups by year. In Augur 16.0.0, this missing group-by column will produce an error message, so we should consider implementing the transform for GISAID locations.

Given the commands above, however, I get the following tree from the workflow:

[image: tree produced by the workflow]

The very long branches also indicate that users will need to manage their own list of strains to exclude, since strain names will not match GenBank accessions.
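
Regarding the missing GISAID location transform mentioned above, here is a minimal shell sketch of what it might look like. It assumes the metadata is a TSV with a header row and a column named `Location` whose values are slash-delimited hierarchies such as "Europe / Germany / Bavaria". Both the column name and the delimiter are assumptions, and `split_gisaid_location` is a hypothetical helper, not an existing Augur or Nextstrain command.

```bash
# Sketch only. Assumptions (not confirmed in this thread):
#   - input is a TSV with a header row, read from stdin
#   - the GISAID location column is named "Location" (override with $1)
#   - values look like "Region / Country / Division / Location"
# split_gisaid_location is a hypothetical helper, not an Augur command.
split_gisaid_location() {
  awk -F'\t' -v OFS='\t' -v name="${1:-Location}" '
    NR == 1 {
      # Locate the GISAID location column by its header name.
      for (i = 1; i <= NF; i++) if ($i == name) col = i
      if (!col) {
        print "ERROR: no \"" name "\" column in header" > "/dev/stderr"
        exit 1
      }
      # Rename the original column so it does not clash with the new one.
      $col = "gisaid_location"
      print $0, "region", "country", "division", "location"
      next
    }
    {
      # Split on "/" with optional surrounding spaces; missing levels stay empty.
      split($col, parts, / *\/ */)
      print $0, parts[1], parts[2], parts[3], parts[4]
    }
  '
}
```

It could be run as `split_gisaid_location < data/metadata.tsv > data/metadata_with_geo.tsv`, where the output path is only an example. Values with fewer than four levels leave the trailing columns empty, and no name normalization is attempted.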

@huddlej
Contributor Author

huddlej commented Aug 20, 2024

For folks who are interested in this approach to using GISAID data with the modern mpox repo layout, you should run the commands above from inside the phylogenetic directory of this repository. I have updated the nextstrain build command in the example above to reflect updates in the Nextstrain ecosystem.

Note that the workflow is currently broken for GISAID data until #273 is resolved.

@huddlej
Contributor Author

huddlej commented Aug 22, 2024

Locally resolving #273 by adding the missing column to the metadata did not fix the workflow, because several other rules and scripts in the workflow still hardcode columns that the metadata doesn't have. The bigger issue is that the workflow expects the data to have been passed through the ingest workflow which hints that maybe the better solution to the problem is to pass GISAID data through ingest first.

@jameshadfield
Member

> The bigger issue is that the workflow expects the data to have been passed through the ingest workflow which hints that maybe the better solution to the problem is to pass GISAID data through ingest first.

I'd like us to avoid this need. I wrote "I don't think we expect (m)any users to write ingest pipelines, I see them as a framework for how the nextstrain team separates concerns for production builds." Spiking data into a nextstrain workflow is something we should support, with certain constraints. I think a good entry point to (any) workflow is to ask users to provide a merged metadata / sequences file (leveraging augur merge, augur curate, whatever they are comfortable with) and then assert at the start of the workflow any requirements of that data (e.g. column X is needed). I think Cornelius' comment is a sensible guideline when building phylo workflows to avoid needing so many specific columns: "check for presence of that column and make the filter dependent on whether it's present or not."

Mpox may be a great repo to push on this vision - there's lots of non-NCBI data and the workflow is relatively complex.
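
The up-front assertion and the column presence check described above could be sketched roughly as follows. This is only an illustration: `has_column` and `require_columns` are hypothetical helper names, the column names are examples, and the metadata is assumed to be a TSV with a header row.

```bash
# Sketch only: has_column and require_columns are hypothetical helpers,
# and the column names used with them are examples.

# Succeed if the TSV header of file $1 contains a column named exactly $2.
has_column() {
  head -n 1 "$1" | tr '\t' '\n' | grep -Fxq -- "$2"
}

# Fail with a message listing every requested column that file $1 lacks.
require_columns() {
  local metadata="$1" missing="" column
  shift
  for column in "$@"; do
    has_column "$metadata" "$column" || missing="$missing $column"
  done
  if [ -n "$missing" ]; then
    echo "ERROR: $metadata is missing required column(s):$missing" >&2
    return 1
  fi
}
```

A workflow could call `require_columns data/metadata.tsv strain date` once at the start, and use `has_column data/metadata.tsv country` to decide whether `country` is passed to `augur filter --group-by`.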
