There are two versions of DBSLMM: the tuning version and the deterministic version. The tuning version examines three different heritability choices and requires validation data to tune the heritability hyper-parameter. The deterministic version uses a single heritability estimate and fits the model directly in the training data, without separate validation data. Both versions require reference data to compute the SNP correlation matrix.
For binary traits, especially psychiatric diseases (e.g., major depression), DBSLMM usually outperforms some existing PGS construction methods.
Method Paper: Yang S#, Zhou X. Accurate and Scalable Construction of Polygenic Scores in Large Biobank Data Sets. Am J Hum Genet. 2020 May 7;106(5):679-693.
Benchmarking Paper: Yang S#*, Zhou X. PGS-server: accuracy, robustness and transferability of polygenic score methods for biobank scale studies. Brief Bioinform. 2022 Mar 10;23(2):bbac039.
Database paper: Cao C, Zhang S, Wang J, Tian M, Ji X, Huang D*, Yang S*, Gu N. PGS-Depot: a comprehensive resource for polygenic scores constructed by summary statistics based methods. Nucleic Acids Res. 2024 Jan 5;52(D1):D963-D971.
Webserver paper: Yang S#*, Ye X#, Ji X#, Li Z, Tian M, Huang P, Cao C. PGSFusion streamlines polygenic score construction and epidemiological applications in biobank-scale cohorts. bioRxiv.
Psycho-metabolic nexus: Guo X, Feng Y, Ji X, Jia N, Mainaiti A, Lai J, Wang Z*, Yang S*, Hu S*. Shared genetic architecture and bidirectional clinical risks with psycho-metabolic nexus. eBioMedicine. (in press)
PTRS for Stroke: Cai X, Li H, Cao X, Ma X, Zhu W, Xu L, Yang S, Yu R, Huang P. Integrating transcriptomic and polygenic risk scores to enhance predictive accuracy for ischemic stroke subtypes. Hum Genet. 2024 Nov 18. doi: 10.1007/s00439-024-02717-7.
We update software/DBSLMM.R and software/TUNE.R.
- integrates bigsnpr to estimate the heritability, select large-effect SNPs by clumping, and calculate $\hat{y}$ for the validation set
- provides 'map_hm3_plus.rds' and corresponding PLINK files for three ancestries from the 1000 Genomes Project
- fits the whole genome for three DBSLMM models with one R script
- provides the online server PGSFusion for constructing PGS and performing epidemiological applications in UKBB
- provides the online version https://www.pgs-server.com/.
- fixes a bug in model fitting
- fits DBSLMM under different heritabilities with a single clumping step
- uses a script to make the DBSLMM tuning version easier to use
- validates the heritability folds: 0.8, 1 and 1.2
- fits the LMM when a chromosome has no large-effect SNPs
- performs the external validation entirely in R code
In this version, we support the DBSLMM and LMM models in both the automatic and tuning versions. The download link of dbslmm is <>.
- download the executable file dbslmm
- download the reference panel from the 1000 Genomes Project (1kg)
- ensure the summary statistics are in GEMMA format
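# Summary statistics in GEMMA format
Summary_stat=/summary/statistics/in/GEMMA/format.txt
# Ancestry of the summary statistics: EUR, AFR or EAS
anc=EUR
BLOCK=${PACK_DIR}/block_data/${anc}
outpath=${DATA_DIR}/output
model=DBSLMM
# PACK_DIR, DATA_DIR, ref_panel and N must be set beforehand (see Tips for N)
Rscript ${PACK_DIR}/software/DBSLMM.R --summary ${Summary_stat} --dbslmm ${PACK_DIR}/dbslmm --type auto --model ${model} \
--reference ${ref_panel} --block ${BLOCK} --N ${N} --outPath ${outpath}/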
Tips
- If you set model=LMM, the DBSLMM-LMM model is fitted.
- If the summary statistics are not from EUR ancestry, you can set anc=AFR or anc=EAS.
- If the trait is quantitative, set N=n_sample.
- If the trait is binary, set N=${n_case},${n_control}. DBSLMM will use the effective sample size to estimate heritability. You can also treat a binary trait as a quantitative trait.
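For a binary trait, it can help to see what an effective sample size looks like before passing N=${n_case},${n_control}. The sketch below uses the common definition 4 / (1/n_case + 1/n_control); the exact formula applied inside DBSLMM.R is not stated here, so treat this as an assumption. The helper name neff and the case and control counts are hypothetical.

```shell
# Hypothetical helper: effective sample size of a case-control study.
# Assumes the common definition 4 / (1/n_case + 1/n_control), which may
# differ from what DBSLMM.R computes internally.
neff() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.0f\n", 4 / (1 / a + 1 / b) }'
}

# Illustrative counts only
n_case=5000
n_control=15000

# Value in the form expected for a binary trait
N=${n_case},${n_control}
echo "N=${N} effective_n=$(neff "${n_case}" "${n_control}")"
```

A balanced design gives an effective sample size equal to the total sample size, while an unbalanced design gives a smaller value.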
Summary_stat=/summary/statistics/in/GEMMA/format.txt
anc=EUR
BLOCK=${PACK_DIR}/block_data/${anc}
outpath=${DATA_DIR}/output
model=DBSLMM
# Use one variable so that DBSLMM.R and TUNE.R receive the same heritability folds
h2f=0.8,1,1.2
# Step 1: fit the model for each heritability fold
Rscript ${PACK_DIR}/software/DBSLMM.R --summary ${Summary_stat} --dbslmm ${PACK_DIR}/dbslmm --type tuning --model ${model} \
--reference ${ref_panel} --block ${BLOCK} --N ${N} --outPath ${outpath}/ --h2f ${h2f}
# Step 2: tune the heritability fold in the validation data
Summary_prefix=${Summary_stat%.txt}
Rscript ${PACK_DIR}/software/TUNE.R --dbslmm_eff ${Summary_prefix} --h2f ${h2f} \
--validation_g ${val_genotype} --validation_p ${val_phenotype}
Tips
- h2f in DBSLMM.R and TUNE.R MUST be the same.
- If you have covariates, you can specify the cov flag.
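To make the role of the h2f folds concrete, the sketch below lists the heritability values that a fold string would produce. It assumes that each fold multiplies the estimated heritability, which is a reading of "folds" that is not spelled out here. The helper name h2_grid and the heritability value 0.5 are hypothetical.

```shell
# Hypothetical helper: list candidate heritabilities for a fold string.
# Assumes candidate h2 = fold * estimated h2 (an assumption, see above).
# $1: estimated heritability, $2: comma-separated folds (same string as --h2f)
h2_grid() {
  awk -v h2="$1" -v folds="$2" 'BEGIN {
    n = split(folds, f, ",")
    for (i = 1; i <= n; i++) printf "%s\t%.3f\n", f[i], h2 * f[i]
  }'
}

# Illustrative estimate of 0.5 with the folds used in this document
h2_grid 0.5 0.8,1,1.2
```

Each output line holds a fold and the corresponding candidate heritability, separated by a tab.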
A reference panel is needed to construct the SNP correlation matrix; the 1000 Genomes Project can serve as the reference panel. Block information is also required. The code is in MAKELD.R.
mkld=/your/path/MAKELD.R
plink=/your/path/plink-1.9
outpath=/your/path/outpath/
# construct the SNP correlation matrix for each chromosome
for chr in $(seq 1 22)
do
  bfile=/your/path/chr${chr}
  blockfile=/your/path/chr${chr}.bed
  Rscript ${mkld} --bfile ${bfile} --blockfile ${blockfile} --plink ${plink} --outpath ${outpath} --chr ${chr}
done
The format of extsumm is the same as that of LDSC.
SNP N Z A1 A2
rs1983865 253135 3.79310344827586 T C
rs1983864 251364 -4.51612903225806 T G
rs12411954 253213 0.413793103448276 T C
rs7077266 250092 -0.92 T G
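If the external summary statistics report effect sizes and standard errors rather than Z scores, a table in the layout above can be derived with Z = BETA / SE. The sketch below is one way to do that with awk. The input layout (whitespace-separated with a header line: SNP A1 A2 BETA SE N) and the function name to_extsumm are assumptions for illustration.

```shell
# Hypothetical helper: convert a table with columns SNP A1 A2 BETA SE N
# (header on the first line) into the SNP N Z A1 A2 layout shown above.
# Rows with a non-positive standard error are skipped.
to_extsumm() {
  awk 'BEGIN { print "SNP\tN\tZ\tA1\tA2" }
       NR > 1 && $5 > 0 { printf "%s\t%s\t%.6g\t%s\t%s\n", $1, $6, $4 / $5, $2, $3 }' "$@"
}

# Illustrative input passed on standard input
printf 'SNP\tA1\tA2\tBETA\tSE\tN\nrs1983865\tT\tC\t0.011\t0.0029\t253135\n' | to_extsumm
```

The function reads a file when one is given as an argument and standard input otherwise, so the result can be piped through gzip to produce a file such as extsumm.ldsc.gz.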
We should use standardized SNP effect sizes. There are two options for specifying them. First, the variant ID is read from column 1, the allele code from the next column, and the standardized effect size for that allele from the column after the allele column. E.g.
esteff=/your/path/esteff.txt,1,2,3
Second, the variant ID, allele code and standardized effect size are read as in the first option, and the MAF of the named allele is read from the last column. E.g.
esteff=/your/path/esteff.txt,1,2,3,4
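Before running the external validation, it may be worth checking that the file named in esteff actually has as many columns as the specification refers to. The sketch below splits a specification of the form path,col,col,... and checks every line of the file. It is not how EXTERNAL.R parses the argument; the function name check_esteff and the demo file contents are hypothetical.

```shell
# Hypothetical helper: check that an esteff specification "path,c1,c2,..."
# points to a file with enough whitespace-separated columns on every line.
check_esteff() {
  spec=$1
  case "$spec" in
    *,*) ;;
    *) echo "expected path,col1,col2,..." >&2; return 2 ;;
  esac
  file=${spec%%,*}   # text before the first comma
  cols=${spec#*,}    # column numbers after the first comma
  need=$(printf '%s\n' "$cols" | tr ',' '\n' | sort -n | tail -n 1)
  awk -v need="$need" '
    NF < need { bad = 1; printf "line %d has %d columns, need %d\n", NR, NF, need }
    END { exit bad }' "$file"
}

# Illustrative three-column file: variant ID, allele, standardized effect size
demo=$(mktemp)
printf 'rs1983865 T 0.012\nrs1983864 T -0.015\n' > "$demo"
check_esteff "${demo},1,2,3" && echo "esteff spec OK"
rm -f "$demo"
```

The same file would fail the check with a four-column specification, because no MAF column is present.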
An example of running EXTERNAL.R is as follows:
external=/your/path/EXTERNAL.R
extsumm=/your/path/extsumm.ldsc.gz
esteff=/your/path/esteff.txt,1,2,3
LDpath=/your/path/LDpath/
outpath=/your/path/outpath/
Rscript ${external} --extsumm ${extsumm} --esteff ${esteff} --LDpath ${LDpath} --outpath ${outpath}