Heliano in EDTA? #524

Open
chestnutbt opened this issue Dec 5, 2024 · 4 comments

Comments

@chestnutbt

Hi Prof. Ou. Thank you very much for creating this tool! I would like to ask whether you would consider including Heliano (https://github.com/Zhenlisme/heliano) in the EDTA workflow. I have been doing some tests, and it seems that Heliano and HelitronScanner find very different Helitron sets in a genome. The overlap between the Heliano and HelitronScanner outputs is very low.

@oushujun
Owner

oushujun commented Dec 5, 2024 via email

@chestnutbt
Author

Hi!
I honestly cannot judge. As far as I understand, HelitronScanner first scans for 5' and 3' Helitron terminal sequences and then pairs them, assuming that the sequence between two Helitron ends is a Helitron (with a maximum distance of 20 kb between ends). Heliano first scans for Helitron transposase ORFs, then searches for 5' and 3' ends close to those ORFs, and uses those 5' and 3' end sequences to find more Helitron elements. In principle the Heliano strategy sounds more reliable, because it first finds autonomous elements and then searches for non-autonomous elements. However, I don't know whether the Helitron transposase ORF scan can make mistakes, or whether finding no Helitron transposases in a genome (according to Heliano) strictly means that the genome has no Helitrons.

What I have seen is this: if I run Heliano and HelitronScanner (using EDTA_raw.pl) to find Helitrons in Arabidopsis TAIR10, Heliano finds around 200 Helitrons and HelitronScanner finds around 300. 95 are common to both (not all of them are exactly the same sequences; there are some discrepancies). The TAIR10 TE annotation has more than 10k sequences annotated as Helitrons (the annotation was made using RepeatMasker and the RepBase database, if I am not wrong). 94% of the Helitrons found by HelitronScanner and 97% of those found by Heliano are annotated as Helitrons in the TAIR10 TE GFF file.
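To make the "common" count concrete, here is a minimal sketch of how such an overlap could be counted. It assumes each tool's output has already been converted to (chrom, start, end) tuples, and it calls two predictions common when their overlap covers at least 80% of the longer one. The threshold, the function names, and the toy coordinates are illustrative assumptions only; none of this is part of Heliano, HelitronScanner, or EDTA.

```python
from collections import defaultdict


def overlap_fraction(a, b):
    """Overlap length divided by the length of the longer interval.

    Intervals are (start, end) pairs with 1-based inclusive coordinates.
    """
    overlap = min(a[1], b[1]) - max(a[0], b[0]) + 1
    if overlap <= 0:
        return 0.0
    longer = max(a[1] - a[0] + 1, b[1] - b[0] + 1)
    return overlap / longer


def count_common(set_a, set_b, min_frac=0.8):
    """Count predictions in set_a that have a close match in set_b.

    min_frac is a hypothetical cutoff chosen for illustration.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in set_b:
        by_chrom[chrom].append((start, end))

    common = 0
    for chrom, start, end in set_a:
        candidates = by_chrom.get(chrom, [])
        if any(overlap_fraction((start, end), c) >= min_frac for c in candidates):
            common += 1
    return common


# Toy coordinates, invented for this example (not real TAIR10 results).
heliano = [("Chr1", 1000, 6000), ("Chr1", 20000, 25000), ("Chr2", 500, 3000)]
scanner = [("Chr1", 1100, 6050), ("Chr1", 40000, 52000), ("Chr2", 500, 9000)]

print(count_common(heliano, scanner))  # prints 1
```

With a strict cutoff, the Chr2 pair is not counted as common because the two predictions share a start but differ a lot in length, which is the kind of discrepancy I mentioned above.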

@oushujun
Owner

There are pros and cons to both approaches. HelitronScanner's assumption that two adjacent ends make a Helitron is too bold and can easily include non-Helitron sequences in the prediction. The approach of first finding Helitron ORFs and then extending to full-length elements helps to reduce false identifications, but it assumes that all Helitrons in a genome have at least one full-length element with ORFs, which is the other extreme and most likely underestimates the Helitron content of a genome.

@chestnutbt
Author

Hi Prof. Ou. Thank you very much for your comment. This is why I wonder whether these two tools could be complementary rather than competing with each other. Perhaps the two together could offer a more accurate prediction? Or would it be the opposite, and the Helitron prediction would get worse?
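One way to explore this might be to label every candidate by which tool supports it, so that each class can be checked separately against a reference annotation. A rough sketch, assuming the predictions are already loaded as (chrom, start, end) tuples; the function names, the labels, and the simple any-overlap rule are my own assumptions, not features of either tool or of EDTA:

```python
def overlaps(a, b):
    """True if two (chrom, start, end) intervals share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def label_by_support(heliano, scanner):
    """Tag each candidate as found by both tools or by only one of them.

    Candidates found by both tools are reported once, with the Heliano
    coordinates (an arbitrary choice made for this sketch).
    """
    labeled = []
    for h in heliano:
        tag = "both" if any(overlaps(h, s) for s in scanner) else "heliano_only"
        labeled.append((h, tag))
    for s in scanner:
        if not any(overlaps(s, h) for h in heliano):
            labeled.append((s, "helitronscanner_only"))
    return labeled


# Toy coordinates, invented for this example.
heliano_hits = [("Chr1", 1000, 6000), ("Chr2", 500, 3000)]
scanner_hits = [("Chr1", 1100, 6050), ("Chr3", 100, 900)]

for interval, tag in label_by_support(heliano_hits, scanner_hits):
    print(interval, tag)
```

If the "only" classes turned out to be mostly annotated as Helitrons in a reference, that would suggest the tools are complementary; if not, combining them would mainly add false positives.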
