Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

What does the -r option do in gffread? #133

Open
zigge90 opened this issue Mar 21, 2024 · 1 comment
Open

What does the -r option do in gffread? #133

zigge90 opened this issue Mar 21, 2024 · 1 comment

Comments

@zigge90
Copy link

zigge90 commented Mar 21, 2024

Hi,

Can someone explain to me what the -r option does in gffread?

I have read that it:
"only shows transcripts overlapping coordinate range .. (on chromosome/contig , strand if provided)"

So I thought that an additional file was needed to be provided in order for this option to work. However when I run the command without providing anything other than my annotation and my reference genome, gffread filters down from ~37 000 genes to ~19 000 genes. But I don't understand how this works since I am not providing any information about what chromosome positions to filter out.

The reason I ask is because I have an annotation that seems to have a lot of pseudo genes or isoforms that I want to get rid of. I have tried the options -V, -J and --no-pseudo but they don't really do much difference. However, when i tried the -r option just as a random test, this made a huge difference to my list of transcripts but now I don't really understand why?

This is the simple command that I ran:

gffread my_gff.gtf -r -R -F -g my_genome.fasta -y RF_prots.fa -w RF_exons.fa

Grateful for any support!

@gpertea
Copy link
Owner

gpertea commented Mar 21, 2024

-r should take a chromosome:start-end genomic location specification, with an optional strand prefix or suffix, e.g. -r chr1:9000-12000 should only show transcripts overlapping that genomic range on chr1 (no matter the strand), but if you only want the transcripts on the reverse strand in that region, you can write -r -chr1:9000-12000 instead.
There is an example usage for that option at the bottom of this wiki page: https://github.com/gpertea/stringtie/wiki/Extracting-bundle-data-for-debugging

As for what you got there, I am afraid the way I handle the argument there is not that smart, when using -r without intentionally providing any genomic region specification, like you did, the code takes whatever token that follows there as the genomic region and attempts to parse it, but it fails silently if it could not make sense of it (it was -R in your example command line).

This is a very basic region filter and not very efficient, and as you can see it only takes one genomic region currently. You could use bedtools intersect or other specialized tools if you want, say, an intersection with a set of genomic regions specified in a BED file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants