Skip to content
/ CONGA Public

CONGA: COpy Number Genotyping in Ancient genomes and low-coverage sequencing data

License

Notifications You must be signed in to change notification settings

asylvz/CONGA

Folders and files

NameName
Last commit message
Last commit date

Latest commit

58d4fe4 · Apr 14, 2023

History

87 Commits
Nov 24, 2022
Oct 11, 2019
Apr 14, 2023
Oct 11, 2019
Oct 11, 2019
Dec 21, 2022
Nov 24, 2022
Mar 8, 2023
Mar 9, 2022
Apr 27, 2020
Mar 17, 2022
Nov 27, 2021
Mar 17, 2022
Mar 17, 2022
Jun 1, 2020
Dec 9, 2019
Jul 30, 2022
May 28, 2020
Feb 15, 2022
Feb 20, 2022
Nov 24, 2021
Nov 24, 2021
Mar 17, 2022
Dec 9, 2019
Feb 15, 2022
Feb 15, 2022

Repository files navigation

CONGA

(COpy Number variation Genotyping in Ancient genomes and low-coverage sequencing data)

CONGA is a genotyping algorithm for Copy Number Variations (large deletions and duplications) in ancient genomes. It is tailored for calling homozygous and heterozygous CNV genotypes at low depths of coverage using read-depth and read-pair information from a BAM file with Illumina short single-end reads.

Please feel free to send me an e-mail (asoylev@gmail.com), or better yet open an issue for your questions.

Requirements

CONGA is developed and tested using Linux Ubuntu operating system

Installing development libraries (requires sudo access): "sudo apt-get install zlib1g-dev libbz2-dev liblzma-dev libcurl4-openssl-dev"

Downloading, compiling and running

git clone https://github.com/asylvz/CONGA --recursive
cd CONGA && make libs && make

./conga -i myinput.bam --ref human_g1k_v37.fasta --sonic human_g1k_v37.sonic  \
	--dels known_dels.bed --dups known_dups.bed --out myoutput
  • If you use a X86_64 Linux machine, you can directly use our binary file (under "Relases") after "sudo chmod 755 conga_v1.0_X86_64"

Compiling and running without sudo access

If you do not have root access to install liblzma and/or libbz2, you can compile htslib without CRAM support. However, the libz library is still required, please talk to your admin if it is not available on your system.

make nocram

./conga-nocram -i myinput.bam --ref human_g1k_v37.fasta --sonic human_g1k_v37.sonic  \
	--dels known_dels.bed --dups known_dups.bed --out myoutput

Docker Usage

Another alternative to run CONGA is using Docker

cd docker
docker build . -t conga:latest

Your image named "conga" should be ready. You can run CONGA using this image by

docker run --user=$UID -v /home/projects/conga:/input -v /home/projects/conga:/output conga -i /input/myinput.bam --sonic /input/human_g1k_v37.sonic --ref /input/human_g1k_v37.fasta --dels /input/known_dels.bed --dups /input/known_dups.bed --out /output/mydockertest

Alternatively, you can pull from Docker Hub:

docker pull asylvz/conga

SONIC file (required)

You need to input a SONIC file as input to CONGA (--sonic). This file contains some annotation based on the reference genome that you use. You can use one of the already created ones from: https://github.com/BilkentCompGen/sonic-prebuilt

  • human_g1k_v37.sonic: SONIC file for Human Reference Genome GRCh37 (1000 Genomes Project version)
  • ucsc_hg19.sonic: SONIC file for the human reference genome, UCSC version build hg19.
  • ucsc_hg38.sonic: SONIC file for the human reference genome build 38.

If you are working with a different reference genome, you need to create the SONIC file yourself. This is a straightforward process; please refer to the SONIC development repository: https://github.com/calkan/sonic/

Sample Genotype file (required)

You can use the "svcalls.sh" script under /scripts to generate CNV calls from the 1K Phase 3 SV call set

1	668630		850204
1	963826		974172
1	1171539		1179729
1	1249799		1265722
1	2374226		2379823
...
  • The columns are "Chromosome Name" (TAB) "Start Position of a CNV" (TAB) "End Postion of a CNV"
  • This file should be seperate for duplications and deletions if both are to be genotyped.

Sample Mappability File (optional)

1	63913643	63913648	0.2
1	63913648	63913649	0.25
1	63913649	63913653	0.5
1	63913653	63913659	0.333333
...

Using a mappability file (--mappability) increases the accuracy of CONGA's predictions. We used the 100-mer mappability file from http://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodeDCC/wgEncodeMapability/ and converted the bigWig file into a BED file using "bigWigToBedGraph".

  • The columns are "Chromosome Name" (TAB) "Start Position of a CNV" (TAB) "End Postion of a CNV" (TAB) "Mappability value"
    • Note that the mappability value should be between [0,1], where lower values indicate lower mappability intervals, i.e., repeat-rich regions, etc.

All parameters

--input 		[BAM file]         : Input files in sorted and indexed BAM format. (required)
--out   		[output prefix]    : Prefix for the output file names. (required)
--ref   		[reference genome] : Reference genome in FASTA format. (required)
--sonic 		[sonic file]       : SONIC file that contains assembly annotations. (required)
--dels          	[bed file]         : Known deletion SVs in bed format
--dups          	[bed file]         : Known duplication SVs in bed format
--mappability   	[bed file]         : Mappability file in BED format
--first-chr     	[chromosome index] : The index of the first chromosome for genotyping in your BAM.
--last-chr      	[chromosome index] : The index of the last chromosome for genotyping in your BAM.
--min-read-length	[integer]	   : Minimum length of a read to be processed for RP (default: 60 bps)
--min-sv-size		[integer]	   : Minimum length of a CNV (default: 1000 bps)
--min-mapq		[integer]	   : Minimum mapping quality threshold for reads (default: -1)
--c-score               [float]            : Minimum c-score to filter variants (More conservative with lower values, default: 0.5).
--rp                    [integer]          : Enable split-read and set minimum read-pair support for a duplication (Suggested for >5x only).

Information:
--version                  		   : Print version and exit.
--help 		                           : Print this help screen and exit.

Citation

Arda Söylev, Sevim Seda Çokoglu, Dilek Koptekin, Can Alkan, and Mehmet Somel. "CONGA: Copy number variation genotyping in ancient genomes and low-coverage sequencing data." PLOS Computational Biology 18, no. 12 (2022): e1010788. https://doi.org/10.1371/journal.pcbi.1010788