Fine_Mapping_AD.Rmd

---
title: "<center><h1>Fine Mapping:</h1>Alzheimer's Disease</h1></center>" 
author: 
    "<div class='container'>
     <h3>Brian M. Schilder, Bioinformatician II<br>
     Raj Lab<br>
     Department of Neuroscience<br>
     Icahn School of Medicine at Mount Sinai<br>
     NYC, New York<br>
     </h3> 
     <a href='https://github.com/RajLabMSSM/Fine_Mapping' target='_blank'><img src='./echolocatoR/images/echo_logo_sm.png'></a> 
     <a href='https://github.com/RajLabMSSM' target='_blank'><img src='./web/images/github.png'></a> 
     <a class='item' href='https://rajlabmssm.github.io/RajLab_website/' target='_blank'>
        <img src='./web/images/brain-icon.png'>
        <span class='caption'>RAJ LAB</span>
     <a href='https://icahn.mssm.edu/' target='_blank'><img src='./web/images/sinai.png'></a>
     </div>"
date: "<br>Most Recent Update:<br> `r Sys.Date()`"
output:
 html_document:
    theme: cerulean
    highlight: zenburn
    code_folding: show
    toc: true
    toc_float: true
    smooth_scroll: true
    number_sections: false
    self_contained: true
    css: ./web/css/style.css
editor_options: 
  chunk_output_type: inline
--- 
 
```{r setup, message=F, warning=F, dpi = 600} 
source("echolocatoR/R/MAIN.R") 
# reticulate::conda_install("echoR",packages=c("fastparquet","tqdm","scikit-learn","bitarray","networkx")) 
reticulate::use_condaenv("echoR") 
finemap_results <- list()  
```


# Kunkle et. al. (2019)

## PTK2B/CLU

The PTK2B/CLU is a tricky locus with multiple peaks (see Table S6 of Kunkle et al. (2019) for GCTA-COJO results).
We therefore fine-mapped it a little differently:  
  1. Identify the minimum and maximum span of SNPs that are in LD (r2) with the lead common SNP within the LRRK2 gene.
  3. Fine-map each region separately

```{r PTK2B/CLU, eval=F}
quickstart_AD("CLU", dataset_name = "Kunkle_2019")
lead.snp <- subset(finemap_DT, leadSNP)
merged <-data.table:::merge.data.table(finemap_DT, 
                              data.table::data.table(r2=LD_matrix[lead.snp$SNP,]^2,
                                                     SNP=names(LD_matrix[lead.snp$SNP,])), 
                              by="SNP") %>% 
  subset(r2>0.1) 
dim(merged)
min_POS <- min(merged$POS) # 27373865
max_POS <- max(merged$POS)
```

## Fine-mapping {.tabset .tabset-fade .tabset-pills} 

```{r Kunkle_2019, results = 'asis', fig.height=10, fig.width=7, fig.show='hold'} 
dataset_name <- "Kunkle_2019"
top_SNPs <- import_topSNPs(
              topSS_path = Directory_info(dataset_name, "topSS"),
              sheet = "Supplementary Table 8",
              chrom_col = "CHR", 
              position_col = "POS", 
              snp_col="Top Associated SNV",
              pval_col="P_placeholder", 
              effect_col="Beta_placeholder", 
              gene_col="Lead SNV Gene",
              locus_col = "Lead SNV Gene",
              caption= "Kunkle et al. (2019) GWAS Summary Stats",
              group_by_locus = T) 
# Remove loci with notoriously difficult LD structure.
no_no_loci <- c("HLA-DRB1","WWOX")
top_SNPs <- subset(top_SNPs, !(Locus %in% no_no_loci))
# completed <- list.files("./Data/GWAS/Kunkle_2019/", 
#                         pattern = "ggbio.png", recursive = T)
# loci <- top_SNPs$Locus[!top_SNPs$Locus %in% dirname(dirname(completed))]

finemap_results[[dataset_name]] <- finemap_loci(top_SNPs = top_SNPs,
                                               loci = top_SNPs$Locus,
                                               trim_gene_limits = F,
                                               dataset_name = dataset_name,
                                               dataset_type = "GWAS",
                                               query_by ="tabix",
                                               subset_path = "auto", 
                                               finemap_methods = c("ABF","SUSIE","POLYFUN_SUSIE","FINEMAP"),
                                               force_new_subset = F,
                                               force_new_LD = F,
                                               force_new_finemap = F,  
                 fullSS_path = Directory_info(dataset_name, "fullSS.local"),
                 chrom_col = "Chromosome", 
                 position_col = "Position", 
                 snp_col = "MarkerName",
                 pval_col = "Pvalue", 
                 effect_col = "Beta", 
                 stderr_col = "SE",
                 A1_col = "Effect_allele",
                 A2_col = "Non_Effect_allele",
                 N_cases = 21982,
                 N_controls = 41944, 
                 proportion_cases = "calculate",
                 
                 # min_POS = min_POS,
                 # max_POS = min_POS,
                 
                 bp_distance = 500000*2, 
                 # plot_window = 500000,
                 download_reference = T, 
                 LD_reference = "UKB",
                 superpopulation = "EUR", 
                 plot_types = "simple",
                 LD_block = F,  
                 min_MAF = 0.001,
                 PP_threshold = .95,
                 n_causal = 5, 
                 remove_tmps = T,
                 server = F)  
```


# Marioni _2018

## Preprocess

Supp materials don't assign gene names, which makes it hard to compare loci across studies. 
Assigning each locus a name using the assigned genes from Kunkle_2019.
```{r Preprocess Marioni_2018, eval=F}
# kunkle <- readxl::read_excel(Directory_info("Kunkle_2019", "topSS"),  sheet = 9)
# kunkle <- kunkle %>% tidyr::separate("LD Block (GRCh37)",by=":|-", into=c("chr","start","end"))
# 
# marioni <- readxl::read_excel(Directory_info("Marioni_2018", "topSS"), sheet="Table S3")
# 
# merged <- merge(kunkle, marioni, by='chr')
# 
# merged <- merged %>% dplyr::mutate(within_block = ifelse(pos>=start & pos<=end,T,F))
# merged <- subset(merged, within_block, 
#                  select=c("chr","Top Associated SNV","Lead SNV Gene","GenomicLocus","uniqID","rsID"))top_SNPs$Gene
 
for(locus in unique(top_SNPs$Locus)){
  print(locus)
  top.sub <- subset(top_SNPs, Locus==locus)
  gene <- top.sub$Gene
  # rename folder
  file.rename(from = file.path("Data/GWAS/Marioni_2018",locus),
              to = file.path("Data/GWAS/Marioni_2018",gene))
  # rename multifinemap
  file.rename(from = file.path("Data/GWAS/Marioni_2018",gene,"Multi-finemap",
                               paste0(locus,"_Marioni_2018_Multi-finemap.tsv.gz")),
              to = file.path("Data/GWAS/Marioni_2018",gene,"Multi-finemap",
                               paste0(gene,"_Marioni_2018_Multi-finemap.tsv.gz")))
  # rename ggbio
  file.rename(from = file.path("Data/GWAS/Marioni_2018",gene,"Multi-finemap",
                               paste0(locus,"_ggbio.png")),
              to = file.path("Data/GWAS/Marioni_2018",gene,"Multi-finemap",
                               paste0(gene,"_ggbio.png")))
}
```

### PTK2B/CLU

```{r PTK2B/CLU, eval=F}
quickstart_AD("CLU", dataset_name = "Marioni_2018")
lead.snp <- subset(finemap_DT, leadSNP)
merged <-data.table:::merge.data.table(finemap_DT, 
                              data.table::data.table(r2=LD_matrix[lead.snp$SNP,]^2,
                                                     SNP=names(LD_matrix[lead.snp$SNP,])), 
                              by="SNP") %>% 
  subset(r2>0.1) 
dim(merged)
min_POS <- min(merged$POS) # 27373865
max_POS <- max(merged$POS)
# PTK2B range is 8:27195121-27238052 according to Kunkle et al. (2019)
```

## Fine-mapping {.tabset .tabset-fade .tabset-pills} 

```{r Marioni_2018, results = 'asis', fig.height=10, fig.width=7, fig.show='hold'} 
dataset_name <- "Marioni_2018"
top_SNPs <- import_topSNPs(
              topSS_path = Directory_info(dataset_name, "topSS"),
              sheet = "Table S3",
              chrom_col = "chr", 
              position_col = "pos", 
              snp_col="rsID",
              pval_col="p UKB + IGAP meta",    
              locus_col = "Kunkle_Gene",
              gene_col = "Kunkle_Gene",
              caption= "Marioni et al. (2018) GWAS Summary Stats",
              group_by_locus = T) 
# completed <- list.files("./Data/GWAS/Marioni_2018/",
#                         pattern = "_ggbio.png", recursive = T)
# top_SNPs <- subset(top_SNPs, !Locus %in% dirname(dirname(completed)))

finemap_results[[dataset_name]] <- finemap_loci(top_SNPs = top_SNPs,
                                               loci = top_SNPs$Locus,
                                               trim_gene_limits = F,
                                               dataset_name = dataset_name,
                                               dataset_type = "GWAS",
                                               query_by ="tabix",
                                               subset_path = "auto", 
                                               finemap_methods = c("ABF","SUSIE","POLYFUN_SUSIE","FINEMAP"),
                                               # finemap_methods = "FINEMAP",
                                               force_new_subset = F,
                                               force_new_LD = F,
                                               force_new_finemap = T,  
                 fullSS_path = Directory_info(dataset_name, "fullSS.local"),
                 chrom_col = "CHROM", 
                 position_col = "POS", 
                 snp_col = "SNP",
                 pval_col = "PVAL", 
                 effect_col = "BETA", 
                 stderr_col = "SE",
                 A1_col = "EFF",
                 A2_col = "NONEFF",
                 N_cases = 27696+14338+25580, # MATERNAL + PATERNAL PROXY CASES + AD CASES
                 N_controls = 260980, 
                 proportion_cases = "calculate",
                 
                 # min_POS = 27238052+1,
                 # max_POS = 27238052,
                 
                 bp_distance = 500000*2, 
                 # plot_window = 500000,
                 download_reference = T, 
                 LD_reference = "UKB",
                 superpopulation = "EUR", 
                 plot_types = "simple",
                 LD_block = F,  
                 min_MAF = 0.001,
                 PP_threshold = .95,
                 n_causal = 5, 
                 remove_tmps = T,
                 server = F)  
```


# Posthuma et al. (2018)

## Fine-mapping {.tabset .tabset-fade .tabset-pills} 

```{r Marioni_2018, results = 'asis', fig.height=10, fig.width=7, fig.show='hold'}  
dataset_name <- "Posthuma_2018"
top_SNPs <- import_topSNPs(
              topSS_path = Directory_info(dataset_name, "topSS"),
              sheet = "Table1_melted_phase3",
              chrom_col = "Chr", 
              position_col = "bp", 
              snp_col="SNP",
              pval_col="P", 
              effect_col = "Z",   
              locus_col = "Gene", 
              gene_col = "Gene",
              caption= "Posthuma et al. (2018) - Phase 3 - GWAS Summary Stats",
              group_by_locus = T) 
# completed <- list.files("./Data/GWAS/Posthuma_2018/",
#                         pattern = "_ggbio.png", recursive = T)
# top_SNPs <- subset(top_SNPs, !Locus %in% dirname(dirname(completed)))
top_SNPs <- subset(top_SNPs, !Locus %in% c("CLU","PTK2B"))

finemap_results[[dataset_name]] <- finemap_loci(top_SNPs = top_SNPs,
                                               loci = top_SNPs$Locus,
                                               trim_gene_limits = F,
                                               dataset_name = dataset_name,
                                               dataset_type = "GWAS",
                                               query_by ="tabix",
                                               subset_path = "auto",  
                                               finemap_methods = c("ABF","SUSIE","POLYFUN_SUSIE"),
                                               # finemap_methods = "FINEMAP",
                                               force_new_subset = T,
                                               force_new_LD = F,
                                               force_new_finemap = T,  
                 fullSS_path = Directory_info(dataset_name, "fullSS.local"),
                 chrom_col = "CHR", 
                 position_col = "BP", 
                 snp_col = "SNP",
                 pval_col = "P", 
                 effect_col = "BETA", 
                 stderr_col = "SE",
                 A1_col = "A1",
                 A2_col = "A2", 
                 freq_col = "EAF", 
                 MAF_col = "calculate",
                 N_cases = 71880 , # PROXY CASES + CASES
                 N_controls = 383378, 
                 proportion_cases = "calculate",
                 
                 # min_POS = 27238052+1,
                 # max_POS = 27238052,
                 
                 bp_distance = 500000*2, 
                 # plot_window = 500000,
                 download_reference = T, 
                 LD_reference = "UKB",
                 superpopulation = "EUR", 
                 plot_types = "simple",
                 LD_block = F,  
                 min_MAF = 0.001,
                 PP_threshold = .95,
                 n_causal = 5, 
                 remove_tmps = T,
                 server = F)  
```


## Summarise Results

### Biotypes

(1) Gather biotypes for each Credible Set SNPs
  - Identify any missense variants within CS SNPs.

```{r Biotypes}
merged_DT <- merge_finemapping_results(xlsx_path = "Data/GWAS/Kunkle_2019/Kunkle_2019_results.xlsx",
                                       biomart_annotation = T, dataset = "./Data/GWAS/Kunkle_2019")
# Consensus SNPs
createDT(subset(merged_DT, Consensus_SNP==T))
# Missense variants
createDT(subset(merged_DT, Support>0 & consequence_type_tv=="missense_variant"))
```

### sQTLs

(2) Splicing variant (look up splicing QTL datasets that _WE_ generated). sQTLs  

  - [Paper](https://www.nature.com/articles/s41588-018-0238-1#Sec30)
    + Table S10 contains the top sQTLs from ROSMAP.
  - [sQTLviz](https://rajlab.shinyapps.io/sQTLviz_ROSMAP/)
  - [Raw Data](https://www.radc.rush.edu)  
  
```{r sQTLs}
sqtls <- readxl::read_excel("./Data/QTL/sQTLviz/Raj2018_TableS10.xlsx")  
merged_DT <- merged_DT %>% dplyr::mutate(snp_id=paste0(CHR,":",POS)) %>% data.table::data.table()
sQTL_DT <- data.table:::merge.data.table(subset(merged_DT, Support>0),
                              data.table::data.table(sqtls), 
                              all.x = T,
                              by="snp_id") 
createDT(sQTL_DT)
```

## Regulatory Variants

(3) Regulatory - does it overlap chromatin accessibility in Glass data (or single cell) or histone modification and PLAC-seq? This class of variants will overlap TF binding sites, etc.  

```{r Regulatory Variants}

```


### QTLs

(4) Link to QTLs. any credible set SNPs in published QTL dataset.
  - Use eQTL Catalogue API
```{r QTLs}
# qtl_DT <- top_finemapped_loci(dataset="./Data/GWAS/Kunkle_2019/",
#                                 save_results=T, 
#                                 biomart=F)
```

### Deep Learning Predictions
(5) Check to see DeepSEA and SpliceAI scores.  

#### DeepSEA  
- Either use their server, or get pre-computed values from in silico mutagenesis?
```{r}
```

#### SpliceAI
```{r}
```

#### TFBS  
(6) identify Transcription Factor Binding motifs for known TFs. Not sure which database to use.  
  - Does he mean TFBS? SNPs aren't good for identifying motifs bc they're not the full sequence.
```{r}
```