**Is your feature request related to a problem? Please describe.**
Purge_dups removes too much sequence. Mostly repetitive contigs are removed; these can be full microchromosomes.
**Describe the solution you'd like**
There is no automatic solution yet, just manual inspection.

Example: a bird with many microchromosomes.

Here are the steps to check:

1. Have a look at the longest removed contigs and why they were removed (the tag column of dups.bed, column 4). Usually JUNK, HAPLOTIG and HIGHCOV are fine, but OVLP and REPEAT can be problematic. See the sketch after this list for a quick way to rank the removed intervals.
2. Check for the telomere motif at the start and end of those contigs.
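For the first step, a minimal sketch (assuming a dups.bed with contig, start, end and tag in the first four columns):

```bash
## list the removed intervals, longest first
## printed columns: interval length, tag, contig, start, end
awk -F'\t' -v OFS='\t' '{print $3-$2, $4, $1, $2, $3}' dups.bed | sort -k1,1nr | head -20
```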
For the telomere check (step 2):

```bash
## assume asm.fa is a link to the primary assembly that went into purge_dups

## 1.) split assembly by contigs - not really necessary, but the telomere check will be quicker
seqkit split -i asm.fa

## 2.) check for the telomere motif
c=asm.fa.split/asm.part_ptg000093l.fa && seqkit seq -w 180 $c | head -n 4 && echo && seqkit seq -w 180 $c | tail -n 3
>ptg000093l
TGGTGGGGGACAGCGTGGGGACGTTGTCAGGGTGTTGTTGGGGACGTCACGGGGACGTCTGTATGTAACGATTGCCGTGGGGACGTCCGTGGGGAAAGCCATGGAGACGTGGATGCTGGGGACGTCCATAAGAAGGTCACGAGGGATATTGTGGGGACATCCGTGGGGACGTCGTTGGGA
AGTTGTGGGGGGACGTTGTGGGGGTGTTATGGGGACGGCGTGGGGACGGCGTCGGGGAGTGACATGGGGACATCCGCGGACGCGCTGTCGGTGTATGAGCGGGGACAGCTGTAGGAGCGTCATTGGGATGTTGGTGGGGACATCCATAGGGATGTCGTGAGAGACGTTGTGGGGACATCG
TGGGTACAGCGTAGGGACATCCGAAGGATATAGTTAGGGACGGCACGGGGACATCTGTGGGGACATCCGTAGGGATGTTGTGGGGACGGCACGGGGACATCCATGGGGACAGCGTTGGAATATCGCTGGGGACGTCGTGGGGACATTGTCAGGACATGCGTGGGGTCGTCATGGGGACAT

GTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTAAGGGTTAGGGTTAGGGTTAGGGTTGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGTTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGG
TTAGGGTTAGGGTTAGGGTAGGGTTGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTT
AGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAG
```

In this example the telomere motif (TTAGGG) is clearly visible at the end of the contig (last three lines) but absent from the start.
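To screen all split contigs at once, a rough sketch along these lines can help (the 2 kb window is an arbitrary choice; counting the reverse complement CCCTAA as well catches motifs on the other strand):

```bash
## count TTAGGG/CCCTAA telomere repeats in the first and last 2 kb of each split contig
for f in asm.fa.split/*.fa; do
    s=$(seqkit subseq -r 1:2000 "$f" | seqkit seq -w 0 | grep -v '^>' | grep -o -e TTAGGG -e CCCTAA | wc -l)
    e=$(seqkit subseq -r -2000:-1 "$f" | seqkit seq -w 0 | grep -v '^>' | grep -o -e TTAGGG -e CCCTAA | wc -l)
    echo -e "$(basename "$f" .fa)\t$s\t$e"
done | sort -k2,2nr
```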
Check whether the alignments in the PAF file for the OVLP-classified contigs make sense. You can grep the contigs from the PAF file and sort the lines by column 12, the mapping quality. If all alignments have a quality of 0, then the contig overlaps are likely just repeat-induced and should probably not be trimmed. You can also compare column 10 (number of residue matches) with column 11 (alignment block length). If there is a big difference between those numbers, e.g. 113098 residue matches within an alignment block of 1056612 bp, then the contig (here ptg000007l) should probably not be trimmed by 970 kb.
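A minimal sketch of these checks (the PAF file name is a placeholder; use the one your purge_dups run produced):

```bash
## all alignments of one OVLP-tagged contig, sorted by mapping quality (column 12)
PAF=asm.split.self.paf
grep -w ptg000007l $PAF | sort -t$'\t' -k12,12n | less -S

## residue matches (column 10) vs alignment block length (column 11);
## a large gap between the two suggests a repeat-induced overlap
grep -w ptg000007l $PAF | awk -F'\t' -v OFS='\t' '{print $10, $11, $12}' | sort -k1,1nr | head
```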
**Additional context**
I am not aware of an easy fix for this; we might need to further filter the PAF file and create proper alignment chains within the purge_dups pipeline.
If the contig numbers are manageable (<1000?), then I usually keep the REPEAT-tagged contigs (down to a certain minimum length threshold, e.g. 50 kb or 100 kb?) and the untrimmed OVLP-tagged contigs. Later, during manual curation, I decide whether those contigs should have been removed, and then the dups.bed file comes in handy for selecting the correct cutting thresholds.
Here is a quick and dirty script to exclude the HAPLOTIG-, HIGHCOV- and JUNK-tagged contigs. It keeps all OVLP-tagged contigs (i.e. at full length, nothing is cut) and removes all REPEAT-tagged contigs below a certain threshold (in this example 100 kb).
1. Link the assembly into the current working directory (cwd) as asm.fa.
2. Link the purge_dups bed file into the cwd as dups.bed.
3. Split asm.fa into contigs via `seqkit split -i asm.fa`.
4. Run the following script and adjust the variable MIN_LEN to your needs; it sets the minimum contig length for REPEAT-tagged contigs, i.e. smaller REPEAT contigs will be removed.
```bash
MIN_LEN=100000

IFS=$'\n'
for x in $(cat dups.bed); do
    IFS=$'\t' read -a array <<< "$x"
    len=$(( ${array[2]} - ${array[1]} ))
    if [[ ${array[3]} == 'HAPLOTIG' || ${array[3]} == 'HIGHCOV' || ${array[3]} == 'JUNK' ]]  # add a max length filter for HAPLOTIG too??
    then
        echo -e "REMOVE1 $len\t$x"
        if [[ -f asm.fa.split/asm.part_${array[0]}.fa ]]
        then
            mv asm.fa.split/asm.part_${array[0]}.fa asm.fa.split/asm.part_${array[0]}.fa.discard
        fi
    elif [[ ${array[3]} == 'OVLP' ]]  # OVLP - keep all overlaps?
    then
        echo -e "KEEP_IT1 $len\t$x"
    elif [[ ${len} -lt ${MIN_LEN} ]]  # REPEAT - get rid of the small ones
    then
        echo -e "REMOVE2 $len\t$x"
        if [[ -f asm.fa.split/asm.part_${array[0]}.fa ]]
        then
            mv asm.fa.split/asm.part_${array[0]}.fa asm.fa.split/asm.part_${array[0]}.fa.discard
        fi
    else
        echo -e "KEEP_IT2 $len\t$x"  # REPEAT-tagged contigs above MIN_LEN are kept
    fi
done
```
```bash
## concatenate the kept and the discarded contigs and index both sets
cat asm.fa.split/*fa > asm2.fa
cat asm.fa.split/*fa.discard > asm2_discard.fa
seqkit faidx asm2.fa
seqkit faidx asm2_discard.fa
seqkit stats asm2.fa asm2_discard.fa
```
Check the stats and look at the files asm2.fa.fai and asm2_discard.fa.fai to validate that the results match your expectations.
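For a quick sanity check of the totals (column 2 of a .fai index is the sequence length):

```bash
## total bp and contig counts for kept vs discarded sequence
awk '{s+=$2} END{print "kept:", s, "bp in", NR, "contigs"}' asm2.fa.fai
awk '{s+=$2} END{print "discarded:", s, "bp in", NR, "contigs"}' asm2_discard.fa.fai
```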