
Is it possible to test the hypothesis that the significant variants are in a set of SNPs? #196

Open
garyzhubc opened this issue Jul 11, 2023 · 17 comments


@garyzhubc commented Jul 11, 2023

Is it possible to test the hypothesis that the significant variants are in a set of SNPs?

@garyzhubc (Author)

Not just sets from the susie_get_cs()

@garyzhubc garyzhubc changed the title Is it possible to test the probability that the significant variants are in a group of SNPs? Is it possible to test the probability that the significant variants are in a set of SNPs? Jul 18, 2023
@stephens999 (Contributor)

sorry I don't think we understand the question.

@garyzhubc garyzhubc changed the title Is it possible to test the probability that the significant variants are in a set of SNPs? Is it possible to test the hypothesis that the significant variants are in a set of SNPs? Jul 18, 2023
@garyzhubc (Author)

Can I test a hypothesis like "beta_2, beta_3 ≠ 0 and beta_1, beta_4 = 0"?

@stephens999 (Contributor)

perhaps the simplest way would be to sample from the (approximate) posterior distribution on beta, and then use that sample to compute the probability of any particular combination
of 0/non-zero beta values.
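
As an illustrative sketch of that idea (Python with a toy posterior, not susieR output; the sample matrix and inclusion probabilities here are hypothetical stand-ins for what susie_get_posterior_samples would return), the probability of a particular zero/non-zero pattern is just the fraction of posterior samples satisfying it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior samples of beta, shape (p, num_samples).
# In susieR these would come from susie_get_posterior_samples().
p, num_samples = 4, 10000
incl_prob = np.array([0.02, 0.95, 0.90, 0.02])  # toy inclusion probabilities
included = rng.random((p, num_samples)) < incl_prob[:, None]
b = included * rng.normal(1.0, 0.1, (p, num_samples))

# P(beta_1 = 0 and beta_4 = 0 and (beta_2 != 0 or beta_3 != 0)):
# the event holds in a sample iff that sample's betas match the pattern.
event = (b[0] == 0) & (b[3] == 0) & ((b[1] != 0) | (b[2] != 0))
print(event.mean())  # Monte Carlo estimate of the joint probability
```

With exact zeros in the samples (as susie's sparse posterior gives), comparing against 0 directly is well defined; the estimate's accuracy is limited only by the number of samples.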

@garyzhubc (Author) commented Jul 19, 2023

Something like this (test the hypothesis that significant variants are in the middle)?

posterior<-unlist(lapply(1:1000, function(i) (max(samp$b[40:60,i])>0 | min(samp$b[40:60,i])<0) & (samp$b[1,1:39] == 0 & samp$b[1,61:101] == 0)))

result in:

> mean(posterior)
[1] 0.9756098

@stephens999 (Contributor)

I think the code doesn't look quite right, but something like that, yes

@garyzhubc (Author)

I made an edit on the code. Could you tell me why it doesn't look right?

@stephens999 (Contributor)

you have samp$b[1,1:39]
but samp$b[40:60,i]

so the indices of b don't look consistent through your code

@garyzhubc (Author) commented Jul 20, 2023

Thanks for spotting this mistake. I'm now using:

> posterior<-unlist(lapply(1:num_samples, function(i) (max(samp$b[40:60,i])>0 | min(samp$b[40:60,i])<0) & all(c(samp$b[1:39,i] == 0, samp$b[61:101,i] == 0))))
> mean(posterior)
[1] 0
> posterior<-unlist(lapply(1:num_samples, function(i) (max(samp$b[1:39,i])>0 | min(samp$b[1:39,i])<0) & all(samp$b[40:101,i] == 0)))
> mean(posterior)
[1] 0
> posterior<-unlist(lapply(1:num_samples, function(i) (max(samp$b[61:101,i])>0 | min(samp$b[61:101,i])<0) & all(samp$b[1:60,i] == 0)))
> mean(posterior)
[1] 0

which I believe is consistent. However, these all have very small probabilities. Do you recommend, instead of testing all(samp$b[1:60,i] == 0), trying something like all(abs(samp$b[1:60,i]) < epsilon), or do you think what I'm doing is okay?

@garyzhubc (Author)

Can anyone respond to this issue please? @pcarbo

@pcarbo (Member) commented Oct 11, 2023

Working with very small probabilities is a common issue, and there are some ways to help with it. If you can share a reproducible example illustrating exactly what you are trying to do, I might be able to help you.

As a general piece of advice, I recommend starting with an example that is as simple as possible, e.g., an example with exactly 4 variables X.

@garyzhubc (Author) commented Oct 11, 2023

An example with four variables b1, b2, b3, b4: suppose I want to calculate P(b1 = 0 and b4 = 0 and (b2 != 0 or b3 != 0)) as the probability that the causal variant is in {b2, b3}. Shall I run the program below to get this probability?

samp<-susie_get_posterior_samples(res_, num_samples)
posterior23<-unlist(lapply(1:num_samples, function(i) (max(samp$b[2:3,i])>0 | min(samp$b[2:3,i])<0) & all(c(samp$b[1,i] == 0, samp$b[4,i] == 0))))
mean(posterior23)

If I rank posterior1, posterior2, posterior3, posterior12, posterior13, posterior23, posterior123 and select those with probability greater than 0.95, will I get the same credible sets as (default coverage = 0.95):

susie_get_cs(res_)

@pcarbo (Member) commented Oct 12, 2023

@garyzhubc This is not a reproducible example because some variables in your code (e.g., res_) are not defined.

@stephens999 (Contributor)

However, it looks like these all have very small probabilities. Do you recommend instead of testing all(samp$b[1:60,i] == 0) try something like all(abs(samp$b[1:60,i]) < epsilon), or do you think what I'm doing is okay?

I think what you are doing looks OK, and you are just getting the answer that there is a very small probability of the event you are looking at. You could also try the epsilon approach you suggested.

@garyzhubc (Author) commented Oct 21, 2023

I also tried using PIP directly instead of sampling.

prod(1-res$pip[1:34])*(1-prod(1-res$pip[35:68]))*prod(1-res$pip[69:101])

still gives probability zero. See #203 (comment)
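
One thing worth checking with a product like prod(1-pip) is whether the zero is genuine or just floating-point underflow. An illustrative Python sketch (with made-up PIP values, not the actual susie output) that performs the same three-block computation in log space, where underflow shows up as -inf rather than silently becoming 0:

```python
import math

def log_none(pips):
    # log P(no causal variant among these variables) = sum of log(1 - pip);
    # log1p keeps precision when pip is tiny.
    return sum(math.log1p(-q) for q in pips)

# Hypothetical PIPs: near-zero everywhere except one strong signal in the middle
pip = [1e-12] * 101
pip[50] = 0.999

# log[ P(none on the left) * P(at least one in the middle) * P(none on the right) ]
log_p = (log_none(pip[:34])
         + math.log1p(-math.exp(log_none(pip[34:68])))
         + log_none(pip[68:]))
print(math.exp(log_p))  # ~0.999 here; a truly zero probability gives log_p = -inf
```

If log_p is a large negative number rather than -inf, the probability is genuinely tiny but nonzero, and the log-space value can be reported directly.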

@garyzhubc (Author) commented Oct 21, 2023

@garyzhubc This is not a reproducible example because some variables in your code (e.g., res_) are not defined.

I could do the same on this example https://stephenslab.github.io/susieR/articles/sparse_susie_eval.html:

library(Matrix)
library(susieR)

create_sparsity_mat <- function(sparsity, n, p) {
  nonzero          <- round(n*p*(1-sparsity))
  nonzero.idx      <- sample(n*p, nonzero)
  mat              <- numeric(n*p)
  mat[nonzero.idx] <- 1
  matrix(mat, nrow=n, ncol=p)  # return the matrix explicitly
}
set.seed(1)
n <- 1000
p <- 1000
beta <- rep(0,p)
beta[c(1,300,400,1000)] <- 10
X.dense  <- create_sparsity_mat(0.99,n,p)
X.sparse <- as(X.dense,"CsparseMatrix")
y <- c(X.dense %*% beta + rnorm(n))
susie.sparse <- susie(X.sparse,y)

Using Monte Carlo sample from posterior:

num_samples<-10000
samp<-susie_get_posterior_samples(susie.sparse, num_samples)
posterior<-unlist(lapply(1:num_samples, function(i) (max(samp$b[341:680,i])>0 | min(samp$b[341:680,i])<0) & all(c(samp$b[1:340,i] == 0, samp$b[681:1000,i] == 0))))
mean(posterior)

Using PIP:

prod(1-susie.sparse$pip[1:340])*(1-prod(1-susie.sparse$pip[341:680]))*prod(1-susie.sparse$pip[681:1000])

Both give probability zero.

@pcarbo (Member) commented Oct 23, 2023

@garyzhubc I think the issue is that in your example all the inclusion probabilities are either 1 or very, very small, so it may be tricky to use a naive Monte Carlo sampling approach:

hist(log10(susie.sparse$alpha),n = 64)

One idea that comes to mind is importance sampling, but you might want to start with an example where the probabilities are less extreme.
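
A minimal sketch of that importance-sampling idea (a toy Python example on a standard normal, not susieR code): to estimate a rare-event probability, sample from a proposal concentrated on the event and reweight each hit by the density ratio between target and proposal.

```python
import math
import random

random.seed(1)
n = 100_000
mu_q = 4.0  # proposal N(4, 1), centered on the rare region

# Estimate P(X > 4) for X ~ N(0, 1); the true value is about 3.2e-5,
# far too rare to hit reliably with naive sampling at this n.
total = 0.0
for _ in range(n):
    x = random.gauss(mu_q, 1.0)
    if x > 4.0:
        # weight = p(x)/q(x) for two unit-variance normals: exp(8 - 4x)
        total += math.exp(8.0 - 4.0 * x)
est = total / n
print(est)  # close to 3.2e-5
```

The same principle would apply to the posterior here: sample configurations from a distribution that favors the event of interest, then correct with importance weights, ideally in log space.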
