
Mutation analysis #348

Open · wants to merge 12 commits into base: master
Conversation

MoiraZuber
Collaborator

This PR aims to introduce a mutation analysis step to covariants. The goal is, for a given mutation and country, to estimate what percentage of the population has encountered that mutation before. This is done by combining case counts data (which estimates the number of cases per bi-weekly interval for each cluster in a given country) with mutation proportions taken from covSPECTRUM (which mutation has been seen at what percentage in a given cluster).

@vercel

vercel bot commented Sep 30, 2022

The latest updates on your projects:

covariants: ✅ Ready, preview updated Mar 23, 2023 at 4:52 PM (UTC)

@MoiraZuber
Collaborator Author

MoiraZuber commented Sep 30, 2022

The current script produces output by multiplying the bi-weekly case counts data with the mutation proportions queried from covSPECTRUM and adding up the results across different clusters. There are two output files created in web/data: perCountryMutationAnalysis.json which sorts the mutation proportions by country, and perMutationAnalysis.json, which sorts by mutation.
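A minimal sketch of this combination step (the data shapes, function names, and keys here are hypothetical, not the script's actual API):

```python
def combine(case_counts, mutation_proportions):
    """Multiply bi-weekly case counts per cluster by that cluster's mutation
    proportions, then sum across clusters, per country.

    case_counts:          {country: {cluster: [cases per bi-week, ...]}}
    mutation_proportions: {cluster: {mutation: proportion in [0, 1]}}
    """
    per_country = {}
    for country, clusters in case_counts.items():
        per_country[country] = {}
        for cluster, counts in clusters.items():
            for mutation, prop in mutation_proportions.get(cluster, {}).items():
                totals = per_country[country].setdefault(mutation, [0.0] * len(counts))
                for i, cases in enumerate(counts):
                    totals[i] += cases * prop
    return per_country

# The real script then writes this per-country dict to
# web/data/perCountryMutationAnalysis.json, and an analogous
# mutation-keyed dict to web/data/perMutationAnalysis.json.
```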

Points of discussion:

  • Which clusters to query for. Currently, the list of clusters is produced from all entries in clusters.py that have "type" == "variant" and "graphing" == True.
  • How to query from covSPECTRUM: Currently, mutations are obtained for a certain cluster by querying via snps. Is this reliable enough or would switching to Pango lineages be better? Are there other methods with which to query by cluster?
  • Mutations are currently queried with an arbitrary threshold (10%).
  • Defining mutations from clusters.py are currently not used.
    -> Which mutations do we use, and how do we sort mutations by importance?
  • Which countries to use? Could we average over neighboring countries?
  • Visualize results -> Plot by country or mutation? Currently, both files are produced.
  • Possible alternative ways of plotting?
  • Discuss whether the results appear to be a realistic estimate

Advanced points:

  • Could we introduce an estimate for reinfections and/or undercounting of cases?
  • Could we include Jesse Bloom Lab escape calculator?
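For reference, the cluster selection described in the first discussion point can be sketched like this (the toy dict stands in for the real clusters dict in clusters.py):

```python
# Toy stand-in for the clusters dict in clusters.py
clusters = {
    "Alpha":  {"type": "variant",  "graphing": True},
    "EU2":    {"type": "variant",  "graphing": False},
    "S:N501": {"type": "mutation", "graphing": True},
}

# Keep entries with "type" == "variant" and "graphing" == True
selected = [name for name, info in clusters.items()
            if info.get("type") == "variant" and info.get("graphing") is True]
# selected == ["Alpha"]
```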

@emmahodcroft
Collaborator

emmahodcroft commented Oct 7, 2022

Thank you for this Moira! This is a great start and I really appreciate your work!

Which clusters to query for. Currently, the list of clusters is produced from all entries in clusters.py that have "type" == "variant" and "graphing" == True.

I think this is the right set to query for ✔️

How to query from covSPECTRUM: Currently, mutations are obtained for a certain cluster by querying via snps. Is this reliable enough or would switching to Pango lineages be better? Are there other methods with which to query by cluster?

We should actually be able to query Nextstrain clades, which I think might be a nice starting point as we can avoid the nitty-gritty for the time being and see if we need to wade into it later. Should be in the format like "21L" (without the '(Omicron)') - here's a web link as an example: https://cov-spectrum.org/explore/Switzerland/AllSamples/Past6M/variants?nextstrainClade=21L&
We can pull this directly from nextstrain_name in clusters.py, which also omits the extra bit...... ok apparently I've switched format here over time... 🙁
I think we should convert nextstrain_name to just be the number-letter name and then we can use this. For example for Alpha, it's currently 20I (Alpha, V1) and we should change this to 20I. This also lines us up better for upcoming changes in Nextstrain.
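As a sketch, deriving the short clade name and building a clade query might look like this (the API base URL and exact parameter spelling are assumptions, modelled on the web link above):

```python
import urllib.parse

# Deriving the short clade name from the current display name:
clade = "20I (Alpha, V1)".split(" ")[0]   # -> "20I"

def build_clade_url(clade, country):
    # Placeholder endpoint; the script's actual CoV-Spectrum/LAPIS
    # endpoint may differ.
    base = "https://lapis.cov-spectrum.org/open/v1/sample/aa-mutations"
    params = {"nextstrainClade": clade, "country": country}
    return base + "?" + urllib.parse.urlencode(params)
```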

Mutations are currently queried with an arbitrary threshold (10%).

This seems fine for now, I think we're most interested in the more common mutations people encountered

Defining mutations from clusters.py are currently not used.

I actually think this should be ok too.... since hopefully all defining mutations are in there and at high percents!

-> Which mutations do we use and how do we sort mutations by importance?

Let's get a list of the positions in S in the receptor binding domain (RBD) and N terminal domain (NTD) and pull just these out - they should be the most critical.

Which countries to use? Could we average over neighboring countries?

Ohhh, interesting point. Let's keep that one on the back burner. For now how about starting with the list I put in the slack?

Visualize results -> Plot by country or mutation? Currently, both files are produced.

Hm, good to produce both as I'm not sure what we may want long term, but intuitively I think it would be interesting to plot by country first - what's the estimated breakdown of which mutations have been seen?

At the moment I'm not sure whether this will go on the site or not, so would you mind adding some simple plots of the files you're generating - so it outputs them as image files? No need for anything too fancy - but this will give us an idea as we go of what we're looking at!

I see you've got a bi-weekly mutation percent, I wonder if we should also try querying CoV-Spectrum by intervals (you can specify dates in the query) or we are doing good enough already - what do you think? Could try both and see if it makes a difference, perhaps?

Do you think that gives some good next steps?

@MoiraZuber
Collaborator Author

MoiraZuber commented Oct 26, 2022

When querying with nextstrainClade (extracted with clusters[clus]["display_name"].split(" ")[0] to remove the part in parentheses) instead of snps, the following clusters yield valid responses:

Alpha (nextstrainClade 20I), Beta (nextstrainClade 20H), Gamma (nextstrainClade 20J), Delta (nextstrainClade 21A), 21I.Delta (nextstrainClade 21I), 21J.Delta (nextstrainClade 21J), 21K.Omicron (nextstrainClade 21K), 21L.Omicron (nextstrainClade 21L), 22A (nextstrainClade 22A), 22B (nextstrainClade 22B), 22C (nextstrainClade 22C), 22D (nextstrainClade 22D), Kappa (nextstrainClade 21B), Eta (nextstrainClade 21D), Iota (nextstrainClade 21F), 21GLambda (nextstrainClade 21G), 21H (nextstrainClade 21H), EU1 (nextstrainClade 20E), Epsilon (nextstrainClade 21C)

These are not found on covSPECTRUM:

20BS732 (nextstrainClade 20B/S:732A), 20AS126 (nextstrainClade 20A/S:126A), S439 (nextstrainClade 20A/S:439K), S677HRobin1 (nextstrainClade S:677H.Robin1), S677PPelican (nextstrainClade S:677P.Pelican), EU2 (nextstrainClade 20A.EU2), S98 (nextstrainClade 20A/S:98F), S80 (nextstrainClade 20C/S:80Y), S626 (nextstrainClade 20B/S:626S), S1122 (nextstrainClade 20B/S:1122L)

@emmahodcroft Do these missing ones have different nextstrainClades than what is in clusters[clus]["display_name"] or do they just not have a counterpart on covSPECTRUM?

@MoiraZuber
Collaborator Author

Let's get a list of the positions in S in the receptor binding domain (RBD) and N terminal domain (NTD) and pull just these out - they should be the most critical.

Where could I get this list from?

@MoiraZuber
Collaborator Author

I see you've got a bi-weekly mutation percent, I wonder if we should also try querying CoV-Spectrum by intervals (you can specify dates in the query) or we are doing good enough already - what do you think? Could try both and see if it makes a difference, perhaps?

I got this running (querying by 2-week intervals); however, this takes a lot of time. I was wondering whether we could download the data directly, sorted by date, instead of querying for each interval separately, but I couldn't figure out whether this is possible. Do you by chance know?
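Generating the 2-week windows for such interval queries could look like this (passing them as dateFrom/dateTo filters is how I'd expect the query to work, but that's an assumption):

```python
from datetime import date, timedelta

def biweekly_intervals(start, end):
    """Yield (date_from, date_to) pairs covering [start, end) in 2-week steps."""
    cur = start
    while cur < end:
        nxt = min(cur + timedelta(days=14), end)
        yield cur, nxt
        cur = nxt
```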

@MoiraZuber
Collaborator Author

Here are a few example plots:

[plots: S:Y505H, S:Y144-, S:T478K, S:S982A]

@emmahodcroft
Collaborator

Sorry Moira, somehow I missed this completely. Trying to come back to it a little now!

I think one step forward that's probably worth making is simplifying the Nextstrain names. However, I can't seem to get around to doing that, so I've done the next best thing: I've added nextstrain_clade, which is guaranteed to be just the two-digits-and-letter form (ex: 21A). This should give the same results, but it's a bit tidier and more reliable.

For the clusters that don't have Nextstrain names and so don't return hits, we can try to use Pango lineages... I've added this and am seeing if that works. Probably not so critical to include these, as they probably had few mutations that we're really interested in now, but if there's a simple way to include them, why not...
I think this works.

Where could I get this list from?

I think to keep it simple for now, I'm happy to go with a legit-looking paper's definition. For RBD, this article includes "Thr333–Gly526 of the SARS-CoV-2 RBD", and this article says "we identified the region of SARS-CoV-2 RBD at residues 331 to 524 of S protein".

For NTD, this article says "residues 15–305 of the SARS-CoV-2 spike protein ... were aligned against the NTD sequences", whereas this one says "NTD (aa 13–318)"

So I think anything in these ranges works! I have randomly picked 333–526 and 15–305 for the moment... So currently it only stores mutations in these two regions.
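A sketch of that position filter, using the picked ranges (RBD 333–526, NTD 15–305); the mutation-string parsing is a guess at the format used, e.g. "S:Y144-":

```python
import re

RBD = range(333, 527)   # Thr333–Gly526
NTD = range(15, 306)    # residues 15–305

def in_rbd_or_ntd(mutation):
    # Expects spike mutations like "S:Y144-" or "S:T478K";
    # anything else (other genes, unparsable strings) is excluded.
    m = re.match(r"S:[A-Z](\d+)", mutation)
    if not m:
        return False
    pos = int(m.group(1))
    return pos in RBD or pos in NTD
```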

I got this running (querying by 2-week intervals), however, this takes a lot of time

Hm, I've now forgotten what the original get_mutation_proportions_biweekly did... But I can see how doing 2-weekly queries takes a lot of time. I will see if I can figure out how this worked originally, if there's already a way to break things down by some kind of time element to see proportions changing over time... I suppose we know the proportion of the variant itself over time already, from CoV, right - so perhaps we can just use this? Multiplying the variant % by the mut % for that variant...
The only downside is if a variant got a particular mutation later on, and case numbers changed, then this will change...

I can also try to ask Chaoran if there's a better way, but I feel like I need to try and wrap my head around the scripts a little more before I formulate a question!


Going through the script I'm a little confused about the role country is playing in get_mutation_proportions -- it doesn't seem to go into the URL building, so are we getting different data for each country? Your plots show differences between countries so I figure something is happening, but I'm not 100% piecing it together.... but when I'm running pieces of the script, I'm getting the same proportion for each country 😅
I wonder if perhaps there's one more edit that didn't get pushed? (this might be messy now, since I'm about to push!)
I added something to include country in the build_url query which seems to work, I think...!
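Something along these lines, assuming the URL-building helper takes the query parameters as a dict (the names here are a hypothetical stand-in for the script's actual build_url):

```python
import urllib.parse

def build_url(base, query, country=None):
    # Add the country to the query string when given, so each country's
    # request returns country-specific mutation proportions.
    q = dict(query)
    if country is not None:
        q["country"] = country
    return base + "?" + urllib.parse.urlencode(q)
```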

@emmahodcroft
Collaborator

I got this running with the changes, through to making the plots (🎉)

I think, unfortunately, my take at the moment is that the case counting is probably messing us up quite badly. The proportion of cases detected vs real cases will vary from country to country, but perhaps it's more than I was anticipating.

For example, Denmark & Sweden had less 21K/BA.1 than most other countries, and bigger 21L/BA.2 waves. 21K has 144-, whereas 21L (and basically everything except 21K in Omicron) has 213G.

But if we look at 144- (proxy for 21K), you can see that Sweden/Denmark have as much exposure as on average in other countries... But I guess this may still be possible even if they had a smaller wave...? It's sensible that Portugal has more (from looking at waves/cases) and S Africa has less....
SY144-

If we look at 213G (proxy for 21L + then everything after that), we'd expect Denmark and Sweden to be on top. Denmark is - and Sweden maybe was pretty high in early 2022 (see the bump in early 2022) - but it's quite far from Denmark. I suppose the later rises are the waves of subsequent Omicron in other countries.
[plot: 213G]

Interestingly one can compare this to mutation 213G (all Omicron after 21K/BA.1) with 142D (same, except it was also in 21A.Delta...) - you can see how much longer the tails are to the right, due to Delta waves.
SG142D

So this tells me we are on the right track - I can see the patterns - but I think the scaling up of cases is possibly more important than I thought... as otherwise we presume that Denmark is basically the best-prepared country out there for every mutation 🙃
Will try and think if there's a way to tackle this without it becoming completely unfeasible...

@MoiraZuber
Collaborator Author

I really did forget to query for country. 🙈 How silly of me. Thanks for catching that! Interestingly it doesn't seem to make too much of a difference. I compared some of the plots, and most of them look rather similar. Here's one with a stronger difference I found: (pic above is new, pic below is old queries)

[plot: S:G142D (new query)]
[plot: S:G142D (old query)]

This means that a given variant won't vary strongly between countries, right? I'm just wondering if we could somehow make use of this to make things more efficient... 🤔 But I guess catching those differences is still potentially interesting enough that we shouldn't discard per-country queries entirely...

@emmahodcroft
Collaborator

@MoiraZuber in the above push I just added the one thing I (apparently) hadn't pushed up before - an additional catch to check that pango_lineages is present before trying to use it in the URL - if you could just check this seems sensible that would be great! I'm not even sure if we need it, but seemed better safe than sorry in case some don't have the Pango name.

@emmahodcroft
Collaborator

emmahodcroft commented Jan 23, 2023

From chatting with Richard: he suggests we modify how we're storing the cases. Instead of storing the cumulative number of cases that have seen mutation X, he suggests storing, for each 2-week period, the absolute number of cases that have seen mutation X (so just the cases in that 2-week period, calculated from the mutation frequency * case numbers).

The advantage of this is that from this we can get everything else (such as cumulative, just by summing) and have more flexibility in how we can incorporate reinfection (and even potentially in future immune fading) in the future - as the absolute number shouldn't change - just then how many we end up putting to "cumulative" or similar. (So all more complex calculations can be post-hoc to the calculation of how many people saw mutation X in a week as a very raw value - hope that makes sense!)
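In code, the switch Richard suggests might look like this (hypothetical names; the point is that cumulative values become a post-hoc sum over the raw per-period values):

```python
from itertools import accumulate

def absolute_exposures(case_counts, mutation_freqs):
    """Absolute number of cases that saw the mutation in each 2-week period
    (mutation frequency * case numbers, per period)."""
    return [cases * freq for cases, freq in zip(case_counts, mutation_freqs)]

def cumulative_exposures(absolute):
    # Cumulative counts are recovered by summing the absolute values,
    # so reinfection handling etc. can be layered on afterwards.
    return list(accumulate(absolute))
```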


Building on this, we can expect, certainly if we're working through 2022/Omicron, that we'll end up with more than 100% of the population infected - or that each person is infected more than once. (A rough value according to some studies in South Africa would be ~1.7 infections per person - but will vary per country a bit.)

Infection is Poisson distributed (a rare event that happens randomly*), so this allows us to calculate the probability of never having been infected as e^-1.7 (or whatever the overall infection rate is at any given point in time), and from that figure out how many of the "seeing mutation X" cases in a given week are "new" vs "reinfection" (aka don't add to the cumulative).
*If we're excluding that having previously seen a mutation protects you from infection, which we are for now!!

I hope this makes sense, but we can also chat through it! I'm still parsing it a bit in my brain too, but I think it sounds like a good idea - as we can then be much more flexible in how we 'sum up' the weekly 'absolute exposures'!
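A sketch of that Poisson adjustment (my reading of the idea above, not an agreed implementation): with a mean of lam infections per person so far, exp(-lam) of people have had zero prior infections, so that share of a given week's exposures counts as "new".

```python
import math

def split_new_vs_reinfection(weekly_exposures, mean_infections_so_far):
    # Under a Poisson model, exp(-lam) is the fraction of people with zero
    # prior infections; only their exposures add to the cumulative total.
    p_naive = math.exp(-mean_infections_so_far)
    new = weekly_exposures * p_naive
    return new, weekly_exposures - new
```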

@emmahodcroft
Collaborator

emmahodcroft commented Jan 23, 2023

Finally, thought it might be worth putting our checklist from Friday 20th Jan 23 in here so things could actually be checked off!

Top to-dos:

  • Check query for 22D as most of the defining mutations seem empty [Moira] (Seems to be fine)

  • Modify how we're storing # of cases to be absolute rather than cumulative (see comment)

  • Integrate the estimated case numbers from OWID

    • Figure out how to get data file
    • Figure out format
    • Pull out each model (IHME, ICL)
    • Adjust calculations to take in the number of cases from this rather than pre-calculated CoV data
    • Find new model that continues into 2023?? (Other two have stopped) [Emma] (Emma has asked in some Slacks but no response yet)
    • Re-run plots with new case data
  • Add India, Brazil, Thailand, Australia to the countries

  • Do the testing on the 3 test cases to see if we can see the difference we should see - I think (?) we can just use the plots we generate above? (Possible exception 21L)

  • Ask Chaoran about whether better way to get time data from CoV-Spec [Emma] (No way to do this, faster querying possible in a couple months)

@MoiraZuber
Collaborator Author

Cases are now not accumulated before storing to json.

Also, I downloaded the IHME model csv and ran the pipeline with the estimated case counts (instead of confirmed cases). This needs a new script (include_case_counts_estimates.py) which does essentially the same as include_case_counts.py, but for the estimated cases. Here's some example output:
[plots: S:Y144-, S:V213G, S:G142D]

The plots look very different now! Many curves go over 100%, so dealing with reinfection will be relevant.

TODOS:

  • Test other models? (currently only IHME since it is the most complete)
  • Automatic download of estimated cases csv file (currently manual download suffices since the file is not updated anymore)
  • Discuss whether results make sense
  • Tackle reinfection
