
Mutation analysis #348

Open · wants to merge 12 commits into base: master
Conversation

MoiraZuber
Collaborator

This PR aims to introduce a mutation analysis step to covariants. The goal is, for a given mutation and country, to estimate what percentage of the population has encountered that mutation before. This is done by combining case counts data (which estimates the number of cases per bi-weekly interval for each cluster in a given country) with mutation proportions taken from covSPECTRUM (which mutation has been seen at what percentage in a given cluster).

@vercel

vercel bot commented Sep 30, 2022

The latest updates on your projects:

covariants: ✅ Ready, preview updated Mar 23, 2023 at 4:52 PM (UTC)

@MoiraZuber
Collaborator Author

MoiraZuber commented Sep 30, 2022

The current script produces output by multiplying the bi-weekly case counts data with the mutation proportions queried from covSPECTRUM and adding up the results across different clusters. There are two output files created in web/data: perCountryMutationAnalysis.json which sorts the mutation proportions by country, and perMutationAnalysis.json, which sorts by mutation.
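A minimal sketch of this combination step (the data shapes, function names, and keys here are hypothetical, not the script's actual API):

```python
def combine(case_counts, mutation_proportions):
    """Multiply bi-weekly case counts per cluster by that cluster's mutation
    proportions, then sum across clusters, per country.

    case_counts:          {country: {cluster: [cases per bi-week, ...]}}
    mutation_proportions: {cluster: {mutation: proportion in [0, 1]}}
    """
    per_country = {}
    for country, clusters in case_counts.items():
        per_country[country] = {}
        for cluster, counts in clusters.items():
            for mutation, prop in mutation_proportions.get(cluster, {}).items():
                totals = per_country[country].setdefault(mutation, [0.0] * len(counts))
                for i, cases in enumerate(counts):
                    totals[i] += cases * prop
    return per_country

# The real script then writes this per-country dict to
# web/data/perCountryMutationAnalysis.json, and an analogous
# mutation-keyed dict to web/data/perMutationAnalysis.json.
```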

Points of discussion:

  • Which clusters to query for. Currently, the list of clusters is produced from all entries in clusters.py that have "type" == "variant" and "graphing" == True.
  • How to query from covSPECTRUM: Currently, mutations are obtained for a certain cluster by querying via snps. Is this reliable enough or would switching to Pango lineages be better? Are there other methods with which to query by cluster?
  • Mutations are currently queried with an arbitrary threshold (10%).
  • Defining mutations from clusters.py are currently not used.
    -> Which mutations do we use, and how do we sort mutations by importance?
  • Which countries to use? Could we average over neighboring countries?
  • Visualize results -> Plot by country or mutation? Currently, both files are produced.
  • Possible alternative ways of plotting?
  • Discuss whether the results appear to be a realistic estimate

Advanced points:

  • Could we introduce an estimate for reinfections and/or undercounting of cases?
  • Could we include Jesse Bloom Lab escape calculator?
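For reference, the cluster selection described in the first discussion point can be sketched like this (the toy dict stands in for the real clusters dict in clusters.py):

```python
# Toy stand-in for the clusters dict in clusters.py
clusters = {
    "Alpha":  {"type": "variant",  "graphing": True},
    "EU2":    {"type": "variant",  "graphing": False},
    "S:N501": {"type": "mutation", "graphing": True},
}

# Keep entries with "type" == "variant" and "graphing" == True
selected = [name for name, info in clusters.items()
            if info.get("type") == "variant" and info.get("graphing") is True]
# selected == ["Alpha"]
```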

@emmahodcroft
Collaborator

emmahodcroft commented Oct 7, 2022

Thank you for this Moira! This is a great start and I really appreciate your work!

Which clusters to query for. Currently, the list of clusters is produced from all entries in clusters.py that have "type" == "variant" and "graphing" == True.

I think this is the right set to query for ✔️

How to query from covSPECTRUM: Currently, mutations are obtained for a certain cluster by querying via snps. Is this reliable enough or would switching to Pango lineages be better? Are there other methods with which to query by cluster?

We should actually be able to query Nextstrain clades, which I think might be a nice starting point as we can avoid the nitty-gritty for the time being and see if we need to wade into it later. Should be in the format like "21L" (without the '(Omicron)') - here's a web link as an example: https://cov-spectrum.org/explore/Switzerland/AllSamples/Past6M/variants?nextstrainClade=21L&
We can pull this directly from nextstrain_name in clusters.py, which also omits the extra bit...... ok apparently I've switched format here over time... 🙁
I think we should convert nextstrain_name to just be the number-letter name and then we can use this. For example for Alpha, it's currently 20I (Alpha, V1) and we should change this to 20I. This also lines us up better for upcoming changes in Nextstrain.
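As a sketch, deriving the short clade name and building a clade query might look like this (the API base URL and exact parameter spelling are assumptions, modelled on the web link above):

```python
import urllib.parse

# Deriving the short clade name from the current display name:
clade = "20I (Alpha, V1)".split(" ")[0]   # -> "20I"

def build_clade_url(clade, country):
    # Placeholder endpoint; the script's actual CoV-Spectrum/LAPIS
    # endpoint may differ.
    base = "https://lapis.cov-spectrum.org/open/v1/sample/aa-mutations"
    params = {"nextstrainClade": clade, "country": country}
    return base + "?" + urllib.parse.urlencode(params)
```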

Mutations are currently queried with an arbitrary threshold (10%).

This seems fine for now, I think we're most interested in the more common mutations people encountered

Defining mutations from clusters.py are currently not used.

I actually think this should be ok too.... since hopefully all defining mutations are in there and at high percents!

-> Which mutations do we use and how do we sort mutations by importance?

Let's get a list of the positions in S in the receptor binding domain (RBD) and N terminal domain (NTD) and pull just these out - they should be the most critical.

Which countries to use? Could we average over neighboring countries?

Ohhh, interesting point. Let's keep that one on the back burner. For now how about starting with the list I put in the slack?

Visualize results -> Plot by country or mutation? Currently, both files are produced.

Hm, good to produce both as I'm not sure what we may want long term, but intuitively I think it would be interesting to plot by country first - what's the estimated breakdown of which mutations have been seen?

At the moment I'm not sure whether this will go on the site or not, so would you mind adding some simple plots of the files you're generating - so it outputs them as image files? No need for anything too fancy - but this will give us an idea as we go of what we're looking at!

I see you've got a bi-weekly mutation percent, I wonder if we should also try querying CoV-Spectrum by intervals (you can specify dates in the query) or we are doing good enough already - what do you think? Could try both and see if it makes a difference, perhaps?

Do you think that gives some good next steps?

@MoiraZuber
Collaborator Author

MoiraZuber commented Oct 26, 2022

When querying with nextstrainClade (extracted with clusters[clus]["display_name"].split(" ")[0] to remove the part in parentheses) instead of snps, the following clusters yield valid responses:

Alpha (nextstrainClade 20I), Beta (nextstrainClade 20H), Gamma (nextstrainClade 20J), Delta (nextstrainClade 21A), 21I.Delta (nextstrainClade 21I), 21J.Delta (nextstrainClade 21J), 21K.Omicron (nextstrainClade 21K), 21L.Omicron (nextstrainClade 21L), 22A (nextstrainClade 22A), 22B (nextstrainClade 22B), 22C (nextstrainClade 22C), 22D (nextstrainClade 22D), Kappa (nextstrainClade 21B), Eta (nextstrainClade 21D), Iota (nextstrainClade 21F), 21GLambda (nextstrainClade 21G), 21H (nextstrainClade 21H), EU1 (nextstrainClade 20E), Epsilon (nextstrainClade 21C)

These are not found on covSPECTRUM:

20BS732 (nextstrainClade 20B/S:732A), 20AS126 (nextstrainClade 20A/S:126A), S439 (nextstrainClade 20A/S:439K), S677HRobin1 (nextstrainClade S:677H.Robin1), S677PPelican (nextstrainClade S:677P.Pelican), EU2 (nextstrainClade 20A.EU2), S98 (nextstrainClade 20A/S:98F), S80 (nextstrainClade 20C/S:80Y), S626 (nextstrainClade 20B/S:626S), S1122 (nextstrainClade 20B/S:1122L)

@emmahodcroft Do these missing ones have different nextstrainClades than what is in clusters[clus]["display_name"] or do they just not have a counterpart on covSPECTRUM?

@MoiraZuber
Collaborator Author

Let's get a list of the positions in S in the receptor binding domain (RBD) and N terminal domain (NTD) and pull just these out - they should be the most critical.

Where could I get this list from?

@MoiraZuber
Collaborator Author

I see you've got a bi-weekly mutation percent, I wonder if we should also try querying CoV-Spectrum by intervals (you can specify dates in the query) or we are doing good enough already - what do you think? Could try both and see if it makes a difference, perhaps?

I got this running (querying by 2-week intervals); however, this takes a lot of time. I was wondering whether we could download the data directly, sorted by date, instead of querying for each interval separately, but I couldn't figure out whether this is possible. Do you by chance know?
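Generating the 2-week windows for such interval queries could look like this (passing them as dateFrom/dateTo filters is how I'd expect the query to work, but that's an assumption):

```python
from datetime import date, timedelta

def biweekly_intervals(start, end):
    """Yield (date_from, date_to) pairs covering [start, end) in 2-week steps."""
    cur = start
    while cur < end:
        nxt = min(cur + timedelta(days=14), end)
        yield cur, nxt
        cur = nxt
```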

@MoiraZuber
Collaborator Author

Here are a few example plots:

[plots: S:Y505H, S:Y144-, S:T478K, S:S982A]

@emmahodcroft
Collaborator

Sorry Moira, somehow I missed this completely. Trying to come back to it a little now!

I think one step forward that's probably worth making is simplifying the Nextstrain names. However, I can't seem to get around to doing that, so I've done the next best thing: I've added nextstrain_clade, which is guaranteed to be just the two-digits-and-letter form (ex: 21A). This should give the same results, but it's a bit tidier and more reliable.

For the clusters that don't have Nextstrain names and so don't return hits, we can try to use Pango lineages... I've added this and am seeing if that works. Probably not so critical to include these, as they probably had few mutations that we're really interested in now, but if there's a simple way to include them, why not...
I think this works.

Where could I get this list from?

I think to keep it simple for now, I'm happy to go with a legit-looking paper's definition. For RBD, this article includes "Thr333–Gly526 of the SARS-CoV-2 RBD", and this article says "we identified the region of SARS-CoV-2 RBD at residues 331 to 524 of S protein".

For NTD, this article says "residues 15–305 of the SARS-CoV-2 spike protein ... were aligned against the NTD sequences", whereas this one says "NTD (aa 13–318)"

So I think anything in these ranges works! I have randomly picked 333–526 and 15–305 for the moment... So currently it only stores mutations in these two regions.
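A sketch of that position filter, using the picked ranges (RBD 333–526, NTD 15–305); the mutation-string parsing is a guess at the format used, e.g. "S:Y144-":

```python
import re

RBD = range(333, 527)   # Thr333–Gly526
NTD = range(15, 306)    # residues 15–305

def in_rbd_or_ntd(mutation):
    # Expects spike mutations like "S:Y144-" or "S:T478K";
    # anything else (other genes, unparsable strings) is excluded.
    m = re.match(r"S:[A-Z](\d+)", mutation)
    if not m:
        return False
    pos = int(m.group(1))
    return pos in RBD or pos in NTD
```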

I got this running (querying by 2-week intervals), however, this takes a lot of time

Hm, I've now forgotten what the original get_mutation_proportions_biweekly did... But I can see how doing 2-weekly queries takes a lot of time. I will see if I can figure out how this worked originally, if there's already a way to break things down by some kind of time element to see proportions changing over time... I suppose we know the proportion of the variant itself over time already, from CoV, right - so perhaps we can just use this? Multiplying the variant % by the mut % for that variant...
The only downside is if a variant got a particular mutation later on, and case numbers changed, then this will change...

I can also try to ask Chaoran if there's a better way, but I feel like I need to try and wrap my head around the scripts a little more before I formulate a question!


Going through the script I'm a little confused about the role country is playing in get_mutation_proportions -- it doesn't seem to go into the URL building, so are we getting different data for each country? Your plots show differences between countries so I figure something is happening, but I'm not 100% piecing it together.... but when I'm running pieces of the script, I'm getting the same proportion for each country 😅
I wonder if perhaps there's one more edit that didn't get pushed? (this might be messy now, since I'm about to push!)
I added something to include country in the build_url query which seems to work, I think...!
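Something along these lines, assuming the URL-building helper takes the query parameters as a dict (the names here are a hypothetical stand-in for the script's actual build_url):

```python
import urllib.parse

def build_url(base, query, country=None):
    # Add the country to the query string when given, so each country's
    # request returns country-specific mutation proportions.
    q = dict(query)
    if country is not None:
        q["country"] = country
    return base + "?" + urllib.parse.urlencode(q)
```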

@emmahodcroft
Collaborator

I got this running with the changes, through to making the plots (🎉)

I think, unfortunately, my take at the moment is that the case counting is probably messing us up quite badly. The proportion of cases detected vs real cases will vary from country to country, but perhaps it's more than I was anticipating.

For example, Denmark & Sweden had less 21K/BA.1 than most other countries, and bigger 21L/BA.2 waves. 21K has 144-, whereas 21L (and basically everything except 21K in Omicron) has 213G.

But if we look at 144- (proxy for 21K), you can see that Sweden/Denmark have as much exposure as on average in other countries... But I guess this may still be possible even if they had a smaller wave...? It's sensible that Portugal has more (from looking at waves/cases) and S Africa has less....
SY144-

If we look at 213G (proxy for 21L + then everything after that), we'd expect Denmark and Sweden to be on top. Denmark is - and Sweden maybe was pretty high in early 2022 (see the bump in early 2022) - but it's quite far from Denmark. I suppose the later rises are the waves of subsequent Omicron in other countries.
[plot: 213G]

Interestingly one can compare this to mutation 213G (all Omicron after 21K/BA.1) with 142D (same, except it was also in 21A.Delta...) - you can see how much longer the tails are to the right, due to Delta waves.
SG142D

So this tells me we are on the right track - I can see the patterns - but I think the scaling up of cases is possibly more important than I thought... as otherwise we presume that Denmark is basically the best-prepared country out there for every mutation 🙃
Will try and think if there's a way to tackle this without it becoming completely unfeasible...

@MoiraZuber
Collaborator Author

I really did forget to query for country. 🙈 How silly of me. Thanks for catching that! Interestingly it doesn't seem to make too much of a difference. I compared some of the plots, and most of them look rather similar. Here's one with a stronger difference I found: (pic above is new, pic below is old queries)

[plot: S:G142D (new query)]
[plot: S:G142D (old query)]

This means that a given variant won't vary strongly between countries, right? I'm just wondering if we could somehow make use of this to make things more efficient... 🤔 But I guess catching those differences is still potentially interesting enough that we shouldn't discard per-country queries entirely...

@emmahodcroft
Collaborator

@MoiraZuber in the above push I just added the one thing I (apparently) hadn't pushed up before - an additional catch to check that pango_lineages is present before trying to use it in the URL - if you could just check this seems sensible that would be great! I'm not even sure if we need it, but seemed better safe than sorry in case some don't have the Pango name.

@emmahodcroft
Collaborator

emmahodcroft commented Jan 23, 2023

From chatting with Richard: he suggests we modify how we're storing the cases. Instead of storing the cumulative number of cases that have seen mutation X, he suggests storing, for each 2-week period, the absolute number of cases that have seen mutation X (so just the cases in that 2-week period, calculated from the mutation frequency * case numbers).

The advantage of this is that from this we can get everything else (such as cumulative, just by summing) and have more flexibility in how we can incorporate reinfection (and even potentially in future immune fading) in the future - as the absolute number shouldn't change - just then how many we end up putting to "cumulative" or similar. (So all more complex calculations can be post-hoc to the calculation of how many people saw mutation X in a week as a very raw value - hope that makes sense!)
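In code, the switch Richard suggests might look like this (hypothetical names; the point is that cumulative values become a post-hoc sum over the raw per-period values):

```python
from itertools import accumulate

def absolute_exposures(case_counts, mutation_freqs):
    """Absolute number of cases that saw the mutation in each 2-week period
    (mutation frequency * case numbers, per period)."""
    return [cases * freq for cases, freq in zip(case_counts, mutation_freqs)]

def cumulative_exposures(absolute):
    # Cumulative counts are recovered by summing the absolute values,
    # so reinfection handling etc. can be layered on afterwards.
    return list(accumulate(absolute))
```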


Building on this, we can expect, certainly if we're working through 2022/Omicron, that we'll end up with more than 100% of the population infected - or that each person is infected more than once. (A rough value according to some studies in South Africa would be ~1.7 infections per person - but will vary per country a bit.)

Infection is Poisson distributed (a rare event that happens randomly*), so this allows us to calculate the probability of never having been infected as e^-1.7 (or whatever the overall infection rate is at any given point in time), and from that figure out how many of the "seeing mutation X" cases in a given week are "new" vs "reinfection" (aka don't add to the cumulative).
*If we're excluding that having previously seen a mutation protects you from infection, which we are for now!!

I hope this makes sense, but we can also chat through it! I'm still parsing it a bit in my brain too, but I think it sounds like a good idea - as we can then be much more flexible in how we 'sum up' the weekly 'absolute exposures'!
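A sketch of that Poisson adjustment (my reading of the idea above, not an agreed implementation): with a mean of lam infections per person so far, exp(-lam) of people have had zero prior infections, so that share of a given week's exposures counts as "new".

```python
import math

def split_new_vs_reinfection(weekly_exposures, mean_infections_so_far):
    # Under a Poisson model, exp(-lam) is the fraction of people with zero
    # prior infections; only their exposures add to the cumulative total.
    p_naive = math.exp(-mean_infections_so_far)
    new = weekly_exposures * p_naive
    return new, weekly_exposures - new
```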

@emmahodcroft
Collaborator

emmahodcroft commented Jan 23, 2023

Finally, thought it might be worth putting our checklist from Friday 20th Jan 23 in here so things could actually be checked off!

Top to-dos:

  • Check query for 22D as most of the defining mutations seem empty [Moira] (Seems to be fine)

  • Modify how we're storing # of cases to be absolute rather than cumulative (see comment)

  • Integrate the estimated case numbers from OWID

    • Figure out how to get data file
    • Figure out format
    • Pull out each model (IHME, ICL)
    • Adjust calculations to take in the number of cases from this rather than pre-calculated CoV data
    • Find new model that continues into 2023?? (Other two have stopped) [Emma] (Emma has asked in some Slacks but no response yet)
    • Re-run plots with new case data
  • Add India, Brazil, Thailand, Australia to the countries

  • Do the testing on the 3 test cases to see if we can see the difference we should see - I think (?) we can just use the plots we generate above? (Possible exception 21L)

  • Ask Chaoran about whether better way to get time data from CoV-Spec [Emma] (No way to do this, faster querying possible in a couple months)

@MoiraZuber
Collaborator Author

Cases are now not accumulated before storing to json.

Also, I downloaded the IHME model csv and ran the pipeline with the estimated case counts (instead of confirmed cases). This needs a new script (include_case_counts_estimates.py) which does essentially the same as include_case_counts.py, but for the estimated cases. Here's some example output:
[plots: S:Y144-, S:V213G, S:G142D]

The plots look very different now! Many curves go over 100%, so dealing with reinfection will be relevant.

TODOS:

  • Test other models? (currently only IHME since it is the most complete)
  • Automatic download of estimated cases csv file (currently manual download suffices since the file is not updated anymore)
  • Discuss whether results make sense
  • Tackle reinfection
