Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement checking purge_dups output for size and telomer motifs #157

Open
MartinPippel opened this issue Jan 22, 2025 · 1 comment
Open

Comments

@MartinPippel
Copy link
Contributor

Is your feature request related to a problem? Please describe.
Purge_dups removes to much sequence. Mostly repetitive contigs are removed e.g. those can be full microchromosomes.

Describe the solution you'd like

  • there is no automatic solution yet - just manual inspection
  • example: a bird with many microchromosomes
  • here are the steps to check:
  1. have a look at the longest contigs and why (column 5) they were removed. Usually JUNK, HAPLOTIG and HIGHCOV are fine. But the OVL and REPEAT can be problematic.
## get the 10-longest contigs/contig-parts that were removed from the dups.bed file
awk '{s=$3-$2; print s"\t"$0}' ../purge_dups/Alectoris_graeca_hifiasm-purged-default_hap0.dups.bed | sort -k1,1n | tail
944826  ptg000109l      0       944826  REPEAT  ptg000051l
970055  ptg000007l      3798988 4769043 OVLP    ptg000004l
977989  ptg000110l      0       977989  REPEAT  ptg000049l
990078  ptg000126l      0       990078  REPEAT  ptg000117l
1097851 ptg000038l      0       1097851 REPEAT  ptg000036l
1292995 ptg000028l      0       1292995 OVLP    ptg000021l
1446387 ptg000049l      2677086 4123473 OVLP    ptg000046l
1605877 ptg000106l      0       1605877 REPEAT  ptg000049l
2810732 ptg000093l      0       2810732 REPEAT  ptg000019l
2903137 ptg000026l      0       2903137 REPEAT  ptg000036l
  1. check for the telomere motif at the start and end of the those contigs
## 
## assume asm.fa is a link to the primary assembly that went into purge_dups 
## 1.) split assembly by contigs - not really necessary, but the telomer check will be quicker 
seqkit split -i asm.fa 
## 2.) check for telomer motif 
c=asm.fa.split/asm.part_ptg000093l.fa && seqkit seq -w 180 $c | head -n 4 && echo && seqkit seq -w 180 $c | tail -3
>ptg000093l
TGGTGGGGGACAGCGTGGGGACGTTGTCAGGGTGTTGTTGGGGACGTCACGGGGACGTCTGTATGTAACGATTGCCGTGGGGACGTCCGTGGGGAAAGCCATGGAGACGTGGATGCTGGGGACGTCCATAAGAAGGTCACGAGGGATATTGTGGGGACATCCGTGGGGACGTCGTTGGGA
AGTTGTGGGGGGACGTTGTGGGGGTGTTATGGGGACGGCGTGGGGACGGCGTCGGGGAGTGACATGGGGACATCCGCGGACGCGCTGTCGGTGTATGAGCGGGGACAGCTGTAGGAGCGTCATTGGGATGTTGGTGGGGACATCCATAGGGATGTCGTGAGAGACGTTGTGGGGACATCG
TGGGTACAGCGTAGGGACATCCGAAGGATATAGTTAGGGACGGCACGGGGACATCTGTGGGGACATCCGTAGGGATGTTGTGGGGACGGCACGGGGACATCCATGGGGACAGCGTTGGAATATCGCTGGGGACGTCGTGGGGACATTGTCAGGACATGCGTGGGGTCGTCATGGGGACAT

GTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTAAGGGTTAGGGTTAGGGTTAGGGTTGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGTTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGG
TTAGGGTTAGGGTTAGGGTAGGGTTGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTT
AGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAG
  1. check if the alignments paf file for the OVL classified contigs make sense. You can grep the contigs from the PAF file and sort them by column 12, which is the Alignment-QV column. If all alignments have a QV of 0, then it is likely that the contig overlaps are just repeat induced and should probably not be trimmed. You can can also sort by column 10 Number of residue matches and 11 Alignment block length. If there is a big difference in those numbers as in the following example (113098 vs 1056612), then the ptg000007l should probably not be trimmed by 970Kb.
# e.g. for 970055  ptg000007l      3798988 4769043 OVLP    ptg000004l
grep -e "ptg000004l" ../../../nobackup/01_ebp-assembly-workflow/nxf-work/24/7ccafbd3bbdc5d549891992f42e4f4/Alectoris_graeca_hifiasm-purged-default_hap0.paf | grep -e "ptg000007l" | sort -k11,11rn | head | column -t -s $'\t'
ptg000004l:1-59435011  59435011  145      1034146  +  ptg000007l:1-4769043   4769043   3798988  4765267  113098  1056612  0  tp:A:S  cm:i:8481  s1:i:95693  dv:f:0.0144  rl:i:3412112
ptg000007l:1-4769043   4769043   3798988  4765267  +  ptg000004l:1-59435011  59435011  145      1034146  115151  1055292  0  tp:A:S  cm:i:8624  s1:i:98152  dv:f:0.0100  rl:i:1182923
ptg000007l:1-4769043   4769043   3909544  4046873  +  ptg000004l:1-59435011  59435011  2643     143031   6213    140437   0  tp:A:S  cm:i:386   s1:i:5722   dv:f:0.0458  rl:i:1182923
ptg000007l:1-4769043   4769043   3801503  3937854  +  ptg000004l:1-59435011  59435011  110701   239953   5959    136396   0  tp:A:S  cm:i:360   s1:i:4854   dv:f:0.0518  rl:i:1182923
ptg000004l:1-59435011  59435011  114740   243979   +  ptg000007l:1-4769043   4769043   3801503  3937842  6043    136384   0  tp:A:S  cm:i:375   s1:i:4938   dv:f:0.0431  rl:i:3412112
ptg000007l:1-4769043   4769043   3801503  3937149  +  ptg000004l:1-59435011  59435011  114740   243286   6202    135690   0  tp:A:S  cm:i:379   s1:i:5096   dv:f:0.0491  rl:i:1182923
ptg000004l:1-59435011  59435011  1741     134781   +  ptg000007l:1-4769043   4769043   3920758  4046706  5977    133086   0  tp:A:S  cm:i:367   s1:i:4872   dv:f:0.0495  rl:i:3412112
ptg000004l:1-59435011  59435011  2643     135516   +  ptg000007l:1-4769043   4769043   3909544  4035329  6033    132920   0  tp:A:S  cm:i:368   s1:i:4929   dv:f:0.0493  rl:i:3412112
ptg000004l:1-59435011  59435011  118599   243979   +  ptg000007l:1-4769043   4769043   3801324  3933805  5676    132525   0  tp:A:S  cm:i:353   s1:i:4570   dv:f:0.0446  rl:i:3412112
ptg000007l:1-4769043   4769043   3801503  3933638  +  ptg000004l:1-59435011  59435011  118778   243812   5818    132178   0  tp:A:S  cm:i:359   s1:i:4712   dv:f:0.0502  rl:i:1182923

Additional context
I am not aware of an easy fix for this - we might need to further file the PAF file and create proper alignment chains with the purge_dups pipeline.
If the contig numbers are managable (<1000?), then I usually keep the REPEAT-tagged contigs (up to a certain minimum threshold e.g. 50Kb or 100Kb ..?) and the untrimmed OVL-tagged contigs. Later in the manual curation I decide if those contigs should have been removed and then dups.bed file becomes handy to select the correct cutting thresholds.

@MartinPippel
Copy link
Contributor Author

MartinPippel commented Jan 24, 2025

here is a quick and dirty script to exclude the HAPLOTIGS, HIGHCOV and JUNK contigs. It keeps all OVLP-tagged contigs (i.e. the full length, nothing is cut) and removes all REPEAT-tagged contigs below a certain threshold (in this example 100K).

  1. link the assembly into the current working directory (cwd) to asm.fa
  2. link the purge_dups bed file into the cwd to dups.bed
  3. split asm.fa into contigs via seqkit split -i asm.fa
  4. run the following script, modify the variable MIN_LEN to your needs, which sets the minimum contig length for REPEAT-tagged contigs, i.e. smaller contigs will be removed
MIN_LEN=100000
IFS=$'\n'; for x in $(cat dups.bed); 
do 
    IFS=$'\t' read -a array <<< "$x"; 
    len=$((${array[2]}-${array[1]}))


    if [[ ${array[3]} == 'HAPLOTIG' || ${array[3]} == 'HIGHCOV' || ${array[3]} == 'JUNK' ]] # add a max length filter for HAPLOTIG TOO??
    then 
        echo -e "REMOVE1 $len\t$x"
        if [[ -f asm.fa.split/asm.part_${array[0]}.fa ]]
        then 
            mv asm.fa.split/asm.part_${array[0]}.fa asm.fa.split/asm.part_${array[0]}.fa.discard
        fi
    elif [[ ${array[3]} == 'OVLP' ]]   # OVL keep all overlaps ?
    then 
        echo -e "KEEP_IT1 $len\t$x"    
    elif [[ ${len} -lt ${MIN_LEN} ]]  # REPEAT - get rid of the small ones  
    then 
        echo -e "REMOVE2 $len\t$x"
        if [[ -f asm.fa.split/asm.part_${array[0]}.fa ]]
        then 
            mv asm.fa.split/asm.part_${array[0]}.fa asm.fa.split/asm.part_${array[0]}.fa.discard
        fi
    else 
        echo -e "KEEP_IT2 $len\t$x"
    fi
done
cat asm.fa.split/*fa > asm2.fa
cat asm.fa.split/*fa.discard > asm2_discard.fa

seqkit faidx asm2.fa 
seqkit faidx asm2_discard.fa

seqkit stats asm2.fa asm2_discard.fa
  1. check the stats and look at the files asm2.fa.fai and asm2_discard.fa.fai to validate your expected results

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant