Skip to content

baozg/phased-assembly-check

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

18 Commits
 
 
 
 

Repository files navigation

phased-assembly-check

1. Haplotype-resolved ctg

  • Trio
#! Two identical name

yak count -b37 -t16 -o pat.yak <(zcat pat_1.fq.gz pat_2.fq.gz) <(zcat pat_1.fq.gz pat_2.fq.gz)
yak count -b37 -t16 -o mat.yak <(zcat mat_1.fq.gz mat_2.fq.gz) <(zcat mat_1.fq.gz mat_2.fq.gz)
hifiasm -o HG002.asm -t32 -1 pat.yak -2 mat.yak HG002-HiFi.fa.gz
  • HiC
hifiasm -o HG002.asm -t32 --h1 read1.fq.gz --h2 read2.fq.gz HG002-HiFi.fq.gz
  • Evaluation
## yak (https://github.com/lh3/yak)

yak trioeval pat.yak mat.yak HG002.asm.trio.hap1.p_ctg.fa > HG002.asm.hap1.trioeval
yak trioeval pat.yak mat.yak HG002.asm.trio.hap2.p_ctg.fa > HG002.asm.hap2.trioeval

grep "^S"  HG002.asm.hap1.trioeval|cut -f2-4,9|awk '{print $0"\tpat"}' >> F1.trio.tsv
grep "^S"  HG002.asm.hap2.trioeval|cut -f2-4,9|awk '{print $0"\tmat"}' >> F1.trio.tsv

## Merqury

Follow the https://github.com/marbl/merqury/wiki/3.-Phasing-assessment-with-hap-mers

Plot kmer with yak result


data <- read.table("./F1.trio.tsv",header=F,sep="\t")
colnames(data) <- c("ctg","Pat","Mat","Len","Type")
data$Len <- data$Len/1e6

library(ggplot2)
library(ggsci)

pal <- pal_locuszoom()(5)

p <- ggplot()+
  geom_point(data=data,aes(x=Pat,y=Mat,size=Len,color=Type),alpha=0.5)+
  scale_color_manual(values=c("#D43F3AFF","#357EBD"))+
  theme_bw()

2. Haplotype-aware scaffolding

Remove Haplotype-specific HiC reads (with Trio data)

VGP case

meryl-lookup -memory 2 -exclude -mers pat.meryl -sequence $read1 -sequence2 $read2 -r2
mat.R2.fastq.gz | pigz -c > mat.R1.fastq.gz
meryl-lookup -memory 2 -exclude -mers mat.meryl -sequence $read1 -sequence2 $read2 -r2
pat.R2.fastq.gz | pigz -c > pat.R1.fastq.gz 

# Using https://github.com/marbl/merqury/blob/master/trio/exclude_reads.sh

bash /data/software/Merqury/merqury/trio/exclude_reads.sh mat.hapmer.meryl f1.hic.R1.fq.gz f1.hic.R2.fq.gz pat

Scaffold two haplotype together and check (recommend for HiC only)

Update: Now you could scaffold haplotypes with HapHiC for any level of heterzygosity and curated based on HiC contact map, genome alignment, repeat annotation and assembly graph if possible

Medium heterozygosity or uneven heterzygoisty distribution

If your sample's heterozygosity is medium (ex. < 1%) or have long ROH (backcrossing or seg), combining two haplotype and run HiC scaffolding will have lots of empty region of HiC heatmap (abminguity mapping of homozygous region). You may need to scaffolding sepreate using HiC reads for chromsome first,then combine haplotype for haplotype-specific inversion check. Or you can use all HiC reads (no MQ filter) for chrosome scaffolding and check the haplotype-specfic inversion with filtering (MQ>1 or MQ>10). This is how we scaffold the tetraploid potato Genome architecture and tetrasomic inheritance of autotetraploid potato

High heterozygosity

  1. run juicer + 3d-dna / AllHiC (MQ>1) Run HapHiC
  2. Loading into JuiceBox for visualization
    • False duplication (Examples are based on the hifiasm 0.15.4 for high het plant genome with trio data)
    • Misplaced haplotype / uncomplete haplotype scaffolding If you find the hap1 ctg have more strong interaction with the hap2, and trioeval support it, you need manually move this hap1 ctg to hap2 image
    • Haplotype-specific inversion
      • MQ>1 filter for haplotype interaction, kepp the heatmap inside the haplotype is normal.

Examples

Fig1 Move the h1tg000041l to maternal hap

seqName #matKmer #patKmer #pat-pat #pat-mat #mat-pat #mat-mat seqLen
h1tg000041l 160 10644 77 83 82 10561 6152553
h1tg000053l 50105 62 50059 45 45 17 5159680
h1tg000050l 17945 27 17924 20 20 7 2082689

Fig2 Conclusion: Remove the h1tg000015l and h1tg000028l

seqName #patKmer #patKmer #pat-pat #pat-mat #mat-pat #mat-mat seqLen
h1tg000015l 87 164 51 35 36 128 2770227
h1tg000078l 42 66 24 18 18 47 1476449
h1tg000028l 7532 87 7478 53 53 34 3799565

Fig3 Conclusion: Remove the h1tg000059l and h1tg000056l, 52l vs (54l+13l) maybe real

seqName #patKmer #patKmer #pat-pat #pat-mat #mat-pat #mat-mat seqLen
h1tg000054l 6908 10 6899 8 8 2 1004624
h1tg000013l 25643 330 25547 95 96 234 8177266
h1tg000059l 57 245 25 32 32 212 1963293
h1tg000056l 12 29 4 7 7 22 685156
h1tg000052l 25905 219 25816 88 88 131 5467035

Fig4 Conclusion: utg overlap. Find the utg in the gfa and remove rerun the utg trioeval and decide.

seqName #patKmer #patKmer #pat-pat #pat-mat #mat-pat #mat-mat seqLen
h1tg000031l 17138 12244 17075 63 63 12180 4988700
h1tg000029l 44167 139 44067 99 99 40 7648617

Fig5 Conclusion: utg overlap. Find the utg in the gfa and remove.

seqName #patKmer #patKmer #pat-pat #pat-mat #mat-pat #mat-mat seqLen
h1tg000017l 39991 89 39929 61 61 28 5312873
h1tg000064l 1572 5187 1540 31 32 5155 1725115

3. Quality check

  • Trioeval for hamming error and switch error
Type Hap Switch Hamming
Trio (0.14.2) hap1 0.49% 0.59%
Trio (0.14.2) hap2 0.56% 1.13%
Trio-check hap1 0.44% 0.39%
Trio-check hap2 0.55% 1.12%
Trio(0.16.1) hap1 0.46% 0.46%
Trio (0.16.1) hap2 0.55% 0.57%
HiC (0.16.1) hap1 0.51% 0.46%
HiC (0.16.1) hap2 0.49% 0.51%
dual (0.16.1) hap1 0.51% 8.51%
dual (0.16.1) hap2 0.50% 13.69%
  • QV estimation for two haplotype
  • Check the haplotype synteny and confirm the utg (chhylp123/hifiasm#159)
  • Check the coverage of two haplotype
  • KAT spectrum and completeness for boths haps and combianed one
  • Run read-mapping to two haplotype assembly with flagger for evulation duplication / collapsed /

4.Citation

Scaffold separately first then mapping all reads for haplotype specific

Bao, Z., Li, C., Li, G., Wang, P., Peng, Z., Cheng, L., ... & Zhou, Q. (2022). Genome architecture and tetrasomic inheritance of autotetraploid potato. Molecular Plant, 15(7), 1211-1226.

Merge all haplotype and mapping all reads

Huang, Y., Wang, H., Zhu, Y. et al. THP9 enhances seed protein content and nitrogen-use efficiency in maize. Nature (2022). https://doi.org/10.1038/s41586-022-05441-2

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published