Phase a VCF into 1000 genomes SNPs by NYCG.
The Snakefile takes as input a series of phased SNP VCFs (one per chromosome), and the NW output folders. It checks all aligned reads, assigns them a phase, and determines statistics for the most likely phase. It then carries out this re-phasing.
Final output, is a re-phased vcf in res/vcf.
Given the different post-processing steps, running NWs in its entirety requires some steps:
-
Run the
ntk-scan-snakemake
pipeline which repeatedly calls Nahrwhals (found in another repo):ssh seneca cd /g/korbel/hoeps/projects/nahr/ntk_interesting_loci/ntk-scan-snakemake ## Edit configs conda activate nahrwhals-big
This will produce an
all.tsv
. -
Convert
all.tsv
toall.vcf
using:R nahrwhals::write_vcf('/path/to/all.tsv', '/path/to/all.vcf')
If you also want to phase with respect to an external phasing, edit the config/Snake.config.json
to indicate binaries for a lot of tools, and links to the reference fasta, NYGC VCF, and NAHRwhals output path and vcf. Files that have to be custom-made for this workflow are:
-
region_file
: (can sometimes be recycled from "intervals.txt" from Nahrwhals)chr1-103543499-103821114 chr1-108148088-108537579 chr1-109662781-109726958 ...
-
sample_file
:GM19129 GM19434 HG00171 ...
Ensure you have a working R environment loaded. If on Seneca, use:
```
conda deactivate # (a few times if necessary)
conda activate nahrwhals-big
```
Run the workflow with:
```
snakemake --cores n
```
After running the Snakefile, you need to post-process the results to merge the BCF files:
- First, convert the
NW_flipped_50000.vcf
produced from the Snakemake part to a BCF file:bcftools view -O b NW_flipped_50000.vcf -o NW_flipped_50000.bcf tabix -p bcf NW_flipped_50000.bcf
- Concatenate the individual
allsample.vcfs
produced from the Snakemake. (This is untested, maybe need some more flags or so)bcftools concat /path/to/nygc_subsets/*_allsamples* -O z -o allregions_allsamples.vcf.gz
- Run the BCF merging script with the generated BCF files:
./merge_bcfs.sh custom_header.txt NW_flipped_50000.bcf allregions_allsamples.vcf.gz merged_output.vcf.gz