Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Errors in the start and end positions of some sequences in the output gff file cause frameshift mutation #15

Open
fangrunzhao opened this issue Oct 8, 2024 · 3 comments

Comments

@fangrunzhao
Copy link

Hi Vizueta,

Thank you for developing a favorable tool for gene family annotation, but now I have some problems.

Some sequences of this *_genomic_and_annotated_proteins_trimmed_idseqsclustered.gff3 file will have the wrong open reading frame.

Here is part of this *_genomic_and_annotated_proteins_trimmed_idseqsclustered.gff3 file:

_chr001 AnnotGFF gene 12515613 12515776 . - . ID=_chr001.341;blastphmmer;annot;Pos:1-54
_chr001 AnnotGFF mRNA 12515613 12515776 . - . ID=_chr001.341.1;Parent=_chr001.341;blastphmmer;annot;Pos:1-54
_chr001 AnnotGFF CDS 12515613 12515773 . - 2 ID=_chr001.341.1_cds_1;Parent=
_chr001.341.1;blastphmmer;annot;Pos:1-54

The following is part of the original genome gff file (the input gff file needed to run):

_chr001 . gene 12515609 12515775 . - . ID=_chr001.341
_chr001 . mRNA 12515609 12515775 . - . ID=_chr001.341.1;Parent=_chr001.341
_chr001 . exon 12515609 12515775 . - . ID=_chr001.341.1_exon_1;Parent=
_chr00>
_chr001 . CDS 12515609 12515775 . - 2 ID=_chr001.341.1_cds_1;Parent=*_chr001>

You can see that these two identical sequences have different start and end positions. This will result in errors when extracting proteins or CDS using gffread, such as frameshift mutation. I think this is caused by the program after trimming the sequence, is this a bug please?

Best,
Run Zhao Fang

@Vizueta
Copy link
Collaborator

Vizueta commented Oct 9, 2024

Hi again,

I need more insights to see if it is a bug or a problem with your input gff, which might be corrected by BITACORA and thus the difference in the positions.

Can you check that the predicted protein sequence (in the fasta file), is the same or different from the protein in your original annotation?

Also, using the new BITACORA GFF3 file, have you tried to predict the protein sequence using "gffread", if yes, does the resulting protein have the incorrect reading frame resulting in stop codons in the sequence?

Finally, I use the following script to extract protein sequences from the gff
perl Scripts/Tools/gff2fasta_v3.pl genome.fasta anotation.gff OutPrefix

Could you see if the correct reading frame is extracted from your input gff and BITACORA gff file?

Best,
Joel

@fangrunzhao
Copy link
Author

Hi Vizueta,

I checked the predicted protein sequence and some of the protein sequence output by BITACORA is different from the original annotated sequence. Specifically, some of the protein sequences are shorter than the original protein sequences, perhaps trimmed? such as,

predicted fasta file

rna-XM_042113013.1
QSVLGEKGRRIRELTSVVQKRFNIPENSVDLYAEKVATRGLCAIAQAESLRYKLIGGLAVRRACYGVLRFIMESGARGCEVVVSGKLRGQRAKSMKFVDGLMIHSGDPCNDYVNTATRHVLLRQGVLGIKVKIMLPWDQQGKNGPKKPQPDHILVTEPKDEPVPLEPTSDVRSIAPVPIPQAVPAVA*
original annotation fasta file
rna-XM_042113013.1
MAVNNISKKRKFVGDGVFKAELNEFLTRELAEDGYSGVEVRVTPTKSEIIIMATRTQSVLGEKGRRIRELTSVVQKRFNIPENSVDLYAEKVATRGLCAIAQAESLRYKLIGGLAVRRACYGVLRFIMESGARGCEVVVSGKLRGQRAKSMKFVDGLMIHSGDPCNDYVNTATRHVLLRQGVLGIKVKIMLPWDQQGKNGPKKPQPDHILVTEPKDEPVPLEPTSDVRSIAPVPIPQAVPAVA*

I used gffread to extract proteins and CDS from BITACORA GFF file and this extracts the sequences successfully. However, some of the protein sequences have incorrect reading frame resulting in stop codons within the sequence. I looked at the extracted CDS sequences and some of the CDS sequences are not multiples of 3 in length. Therefore, I compared the BITACORA gff file with the original GFFfile and as mentioned earlier the errors.

I have also used the script you mentioned to extract protein sequences and CDS sequences, which likewise will have stop codons in the protein sequences. However, the protein sequences extracted using the script and gffread do not have the same error sequences.

Best,
Run Zhao Fang

@fangrunzhao
Copy link
Author

Here's the data I used for testing.
my_test_data.zip

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants