what's the problem with duplicated CDS ? #35
Comments
The latter, i.e. removing all but the first duplicate, can be done using
system "grep '>' all.fas | uniq -d > duplicated_CDS.txt";
# reduce duplicated CDS to the first (hopefully longest) occurrence
if (-s "duplicated_CDS.txt") {
# grep all duplicated CDS entries from FASTA file
system "for P in `cat duplicated_CDS.txt`; do grep -A1 -m1 \"\$P\" all.fas; done > duplicated_CDS.first.fa";
# compile a pattern to match all duplicated gene ids
# remove all duplicated entries from all.fas
system "PAT=\$(cat duplicated_CDS.txt | tr '\\n' '|'); cat all.fas | tr \"\\n\" \"#\" | sed \"s/#>/\\n>/g\" | grep -v -P \"\${PAT%|}\" | tr '#' '\\n' > all.fas.no-duplicates";
# join both files into a new all.fas
system "cat all.fas.no-duplicates duplicated_CDS.first.fa > all.fas";
}
This should be done BEFORE the domclust call, i.e. right after all.fas was created!
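A more compact alternative, sketched here under assumptions and not taken from the CopraRNA code base: the same "keep only the first record per header" pruning can be done in a single awk pass. The function name `keep_first_cds` is hypothetical. Records are keyed on the complete header line, mirroring the `grep '>'` check above, and multi-line sequences are handled because every sequence line inherits the keep/skip decision of its header.

```shell
# Hypothetical helper (not part of CopraRNA): print a FASTA file with only
# the first record for each distinct header line.
keep_first_cds() {
    # On a header line, keep the record only if this header was not seen before;
    # the sequence lines that follow inherit the decision of that header.
    awk '/^>/ { keep = !seen[$0]++ } keep' "$1"
}

# Possible usage: write to a new file first, then replace all.fas
# keep_first_cds all.fas > all.fas.dedup && mv all.fas.dedup all.fas
```

Unlike the grep-based loop, this does not build a regular expression from the IDs, so an ID that is a prefix of another ID cannot match the wrong record.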
Hi Martin, it's been years now, but off the top of my head I think this may have been an issue with domclust, i.e. if duplicate CDS exist, then domclust fails. This is just a best guess.
Hi @PatrickRWright (happy new year, btw), thanks for the quick reply. So pruning duplicated entries to one should fix the issue, right? Otherwise, we are a bit doomed, since even E. coli nowadays shows multiple CDS for some genes...
Is that a sequence-based check or an identifier-based check?
ID-based... it greps the CDS-FASTA-IDs and checks for duplicates within the |
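As a sketch of such an ID-based check (an illustration only, not the code in homology_intaRNA.pl): `uniq -d` reports duplicates only when they sit on adjacent lines, so sorting the headers first also catches duplicates that are not next to each other in all.fas. The function name `list_duplicated_headers` is hypothetical.

```shell
# Hypothetical helper (not part of CopraRNA): list every FASTA header line
# that occurs more than once in the given file, adjacent or not.
list_duplicated_headers() {
    grep '^>' "$1" | sort | uniq -d
}

# Possible usage:
# list_duplicated_headers all.fas > duplicated_CDS.txt
```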
Possibly. It seems to me (at least for E. coli) that the genome file sometimes lists, after the full CDS, also subsequences of it as CDS (based on alternative start codons?)
I would guess so but I'm not sure. How was this triggered? Were there errored runs?
We are currently migrating the webserver to a new cloud-based computation platform and thus have to reinstall everything. With that, I also updated the genome files, and E. coli runs now fail with an error due to the CDS duplicates. Thus, I am currently working around it and would suggest incorporating the change into the dev branch for integration into CopraRNA3...
Ok, so I think it depends on how many sequences will need to be removed for CopraRNA to work from the technical side. How many duplicates are we talking about for E. coli? I think the relevance of simply removing many duplicates can best be evaluated by @JensGeorg
They are few (6 genes), and the suggestion from above doesn't remove all CDS of these genes but keeps the first CDS version (which, from a first inspection, is the longest CDS covering the whole gene and not only parts of it). So I would assume the impact to be small to non-existent, but it keeps the workflow running and makes it more robust. @JensGeorg what do you think?
I think keeping one is better than removing them completely.
Hi @PatrickRWright @JensGeorg
CopraRNA checks for duplicated CDS within the all.fas file:
CopraRNA/coprarna_aux/homology_intaRNA.pl
Line 341 in fdbca79
why? what is the problem with them? I find that even the recent E.coli genome has duplicated CDS, like
can this be ignored? if not: can we just prune the all.fas to the first occurrence of each CDS?
thanks,
Martin