-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Errors in the start and end positions of some sequences in the output gff file cause frameshift mutation #15
Comments
Hi again, I need more insights to see if it is a bug or a problem with your input gff, which might be corrected by BITACORA and thus the difference in the positions. Can you check that the predicted protein sequence (in the fasta file), is the same or different from the protein in your original annotation? Also, using the new BITACORA GFF3 file, have you tried to predict the protein sequence using "gffread", if yes, does the resulting protein have the incorrect reading frame resulting in stop codons in the sequence? Finally, I use the following script to extract protein sequences from the gff Could you see if the correct reading frame is extracted from your input gff and BITACORA gff file? Best, |
Hi Vizueta, I checked the predicted protein sequence and some of the protein sequence output by BITACORA is different from the original annotated sequence. Specifically, some of the protein sequences are shorter than the original protein sequences, perhaps trimmed? such as, predicted fasta file
I used gffread to extract proteins and CDS from BITACORA GFF file and this extracts the sequences successfully. However, some of the protein sequences have incorrect reading frame resulting in stop codons within the sequence. I looked at the extracted CDS sequences and some of the CDS sequences are not multiples of 3 in length. Therefore, I compared the BITACORA gff file with the original GFFfile and as mentioned earlier the errors. I have also used the script you mentioned to extract protein sequences and CDS sequences, which likewise will have stop codons in the protein sequences. However, the protein sequences extracted using the script and gffread do not have the same error sequences. Best, |
Here's the data I used for testing. |
Hi Vizueta,
Thank you for developing a favorable tool for gene family annotation, but now I have some problems.
Some sequences of this *_genomic_and_annotated_proteins_trimmed_idseqsclustered.gff3 file will have the wrong open reading frame.
Here is part of this *_genomic_and_annotated_proteins_trimmed_idseqsclustered.gff3 file:
_chr001 AnnotGFF gene 12515613 12515776 . - . ID=_chr001.341;blastphmmer;annot;Pos:1-54
_chr001 AnnotGFF mRNA 12515613 12515776 . - . ID=_chr001.341.1;Parent=_chr001.341;blastphmmer;annot;Pos:1-54
_chr001 AnnotGFF CDS 12515613 12515773 . - 2 ID=_chr001.341.1_cds_1;Parent=_chr001.341.1;blastphmmer;annot;Pos:1-54
The following is part of the original genome gff file (the input gff file needed to run):
_chr001 . gene 12515609 12515775 . - . ID=_chr001.341
_chr001 . mRNA 12515609 12515775 . - . ID=_chr001.341.1;Parent=_chr001.341
_chr001 . exon 12515609 12515775 . - . ID=_chr001.341.1_exon_1;Parent=_chr00>
_chr001 . CDS 12515609 12515775 . - 2 ID=_chr001.341.1_cds_1;Parent=*_chr001>
You can see that these two identical sequences have different start and end positions. This will result in errors when extracting proteins or CDS using gffread, such as frameshift mutation. I think this is caused by the program after trimming the sequence, is this a bug please?
Best,
Run Zhao Fang
The text was updated successfully, but these errors were encountered: