what's the problem with duplicated CDS ? #35

martin-raden · 2022-01-12T16:25:42Z

CopraRNA checks for duplicated CDS within the all.fas file:

CopraRNA/coprarna_aux/homology_intaRNA.pl

Line 341 in fdbca79

system "grep '>' all.fas | uniq -d > duplicated_CDS.txt";

why? what is the problem with them? I find that even the recent E.coli genome has duplicated CDS, like

bash-4.2$ bzgrep -A10 -B10 b0149 ~/data/ncbi-refseq-gbk/NC_000913.gb.bz2
                     LQMLGALEGERLSAQGQKMAALGNDPRLAAMLVSAKNDDEAATAAKIAAILEEPPRMG
                     NSDLGVAFSRNQPAWQQRSQQLLKRLNVRGGEADSSLIAPLLAGAFADRIARRRGQDG
                     RYQLANGMGAMLDANDALSRHEWLIAPLLLQGSASPDARILLALLVDIDELVQRCPQL
                     VQQSDTVEWDDAQGTLKAWRRLQIGQLTVKVQPLAKPSEDELHQAMLNGIRDKGLSVL
                     NWTAEAEQLRLRLLCAAKWLPEYDWPAVDDESLLAALETWLLPHMTGVHSLRGLKSLD
                     IYQALRGLLDWGMQQRLDSELPAHYTVPTGSRIAIRYHEDNPPALAVRMQEMFGEATN
                     PTIAQGRVPLVLELLSPAQRPLQITRDLSDFWKGAYREVQKEMKGRYPKHVWPDDPAN
                     TAPTRRTKKYS"
     gene            164730..167264
                     /gene="mrcB"
                     /locus_tag="b0149"
                     /gene_synonym="ECK0148; pbpF; ponB"
                     /db_xref="ASAP:ABE-0000516"
                     /db_xref="ECOCYC:EG10605"
                     /db_xref="GeneID:944843"
     CDS             164730..167264
                     /gene="mrcB"
                     /locus_tag="b0149"
                     /gene_synonym="ECK0148; pbpF; ponB"
                     /EC_number="2.4.1.129"
                     /codon_start=1
                     /transl_table=11
                     /product="peptidoglycan glycosyltransferase/peptidoglycan
                     DD-transpeptidase MrcB"
                     /protein_id="NP_414691.1"
                     /db_xref="UniProtKB/Swiss-Prot:P02919"
                     /db_xref="ASAP:ABE-0000516"
                     /db_xref="ECOCYC:EG10605"
--
                     MLSARPLGVQPRGGVISPQPAFMQLVRQELQAKLGDKVKDLSGVKIFTTFDSVAQDAA
                     EKAAVEGIPALKKQRKLSDLETAIVVVDRFSGEVRAMVGGSEPQFAGYNRAMQARRSI
                     GSLAKPATYLTALSQPKIYRLNTWIADAPIALRQPNGQVWSPQNDDRRYSESGRVMLV
                     DALTRSMNVPTVNLGMALGLPAVTETWIKLGVPKDQLHPVPAMLLGALNLTPIEVAQA
                     FQTIASGGNRAPLSALRSVIAEDGKVLYQSFPQAERAVPAQAAYLTLWTMQQVVQRGT
                     GRQLGAKYPNLHLAGKTGTTNNNVDTWFAGIDGSTVTITWVGRDNNQPTKLYGASGAM
                     SIYQRYLANQTPTPLNLVPPEDIADMGVDYDGNFVCSGGMRILPVWTSDPQSLCQQSE
                     MQQQPSGNPFDQSSQPQQQPQQQPAQQEQKDSDGVAGWIKDMFGSN"
     CDS             164865..167264
                     /gene="mrcB"
                     /locus_tag="b0149"
                     /gene_synonym="ECK0148; pbpF; ponB"
                     /codon_start=1
                     /transl_table=11
                     /product="PBP1Bgamma"
                     /protein_id="YP_010051172.1"
                     /db_xref="ASAP:ABE-0000516"
                     /db_xref="ECOCYC:EG10605"
                     /db_xref="GeneID:944843"
                     /translation="MPRKGKGKGKGRKPRGKRGWLWLLLKLAIVFAVLIAIYGVYLDQ
                     KIRSRIDGKVWQLPAAVYGRMVNLEPDMTISKNEMVKLLEATQYRQVSKMTRPGEFTV

can this be ignored? if not: can we just prune the all.fas to the first occurrence of each CDS?

thanks,
Martin

The text was updated successfully, but these errors were encountered:

martin-raden · 2022-01-12T16:45:12Z

the latter, ie removing all but the first duplicate, can be done using

system "grep '>' all.fas | uniq -d > duplicated_CDS.txt";
# remove duplicated CDS to the first (hopefully longest)
if (-s "duplicated_CDS.txt") {
  # grep all duplicated CDS entries from FASTA file
  system "for P in `cat duplicated_CDS.txt`; do grep -A1 -m1 \"\$P\" all.fas; done > duplicated_CDS.first.fa";
  # compile a pattern to match all duplicated gene ids
  # remove all duplicated entries from all.fas
  system "PAT=\$(cat duplicated_CDS.txt | tr '\\n' '|'); cat all.fas | tr \"\\n\" \"#\" | sed \"s/#>/\\n>/g\" | grep -v -P  \"\${PAT%|}\" | tr '#' '\\n'  > all.fas.no-duplicates";
  # join both files into a new all.fas
  system "cat all.fas.no-duplicates duplicated_CDS.first.fa > all.fas";
}

which should be done BEFORE doing the domclust call, ie. after all.fas was created!

PatrickRWright · 2022-01-13T07:39:21Z

Hi Martin, its been years now but off the top of my head I think this may have been an issue with domclust. i.e. if duplicate CDS exist, then domclust fails. This is just a best guess.

martin-raden · 2022-01-13T09:12:29Z

Hi @PatrickRWright (happy new year, btw), thanks for the quick reply. So pruning duplicated entries to one should fix the issue right? Otherwise, we are a bit doomed, since even E.coli nowadays shows multiple CDS for some genes...

JensGeorg · 2022-01-13T09:43:09Z

is that a sequence based check or an identifier based check?

…

On Thu, Jan 13, 2022 at 10:12 AM Martin Raden ***@***.***> wrote: Hi @PatrickRWright <https://github.com/PatrickRWright> (happy new year, btw), thanks for the quick reply. So pruning duplicated entries to one should fix the issue right? Otherwise, we are a bit doomed, since even E.coli nowadays shows multiple CDS for some genes... — Reply to this email directly, view it on GitHub <#35 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AH44J4MZG4VTFJXCZPTMSADUV2JQRANCNFSM5LZNFFMQ> . Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>. You are receiving this because you were mentioned.Message ID: ***@***.***>

martin-raden · 2022-01-13T09:53:21Z

ID-based... it greps the CDS-FASTA-IDs and checks for duplicates within the all.fas file (which is used in domclust etc)

martin-raden · 2022-01-13T09:55:32Z

eventually, it seems to me (at least for e.coli) that the genome file lists sometimes after the full CDS also subsequences as CDS (based on alternative start codons?)

PatrickRWright · 2022-01-13T13:41:56Z

Hi @PatrickRWright (happy new year, btw), thanks for the quick reply. So pruning duplicated entries to one should fix the issue right? Otherwise, we are a bit doomed, since even E.coli nowadays shows multiple CDS for some genes...

I would guess so but I'm not sure. How was this triggered? Were there errored runs?

martin-raden · 2022-01-13T14:27:03Z

we are currently migrating the webserver to a new cloud-based computation platform and thus have to reinstall everything. with that, I also updated the genome files, which now provides e.coli runs with an error due to the CDS duplicates. thus, I am currently working around it and would suggest to incorporate the change into the dev branch for integration into CopraRNA3...

PatrickRWright · 2022-01-13T14:40:25Z

Ok, so I think it depends on the extent of sequences that will need to be removed for CopraRNA to work from the technical side. How many duplicates are we talking about for E. coli? I think relevance of simply removing many duplicates can best be evaluated by @JensGeorg
I would suggest that you have a look at what is duplicated and why and then assess what the consequences of "straight forward" removal would be. This requires biological domain knowledge.
If the duplicates are few (<10) I assume the relevance in regard to results is minimal.

martin-raden · 2022-01-13T14:43:22Z

they are few (6 genes) and the suggestion from above doesnt remove all CDS from these genes but keeps the first CDS version (which is from first inspection the longest CDS covering the whole gene and not only parts of it)

so I would assume the impact small to non-existing but it keeps the workflow running and more robust.

@JensGeorg what do you think?

PatrickRWright · 2022-01-13T14:44:40Z

I think keeping one is better than removing them completely.

This was referenced Jan 13, 2022

prune duplicated CDS from all.fas #36

Merged

domclust crashes ... #26

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

what's the problem with duplicated CDS ? #35

what's the problem with duplicated CDS ? #35

martin-raden commented Jan 12, 2022

martin-raden commented Jan 12, 2022 •

edited

Loading

PatrickRWright commented Jan 13, 2022

martin-raden commented Jan 13, 2022

JensGeorg commented Jan 13, 2022 via email

martin-raden commented Jan 13, 2022

martin-raden commented Jan 13, 2022

PatrickRWright commented Jan 13, 2022

martin-raden commented Jan 13, 2022

PatrickRWright commented Jan 13, 2022

martin-raden commented Jan 13, 2022

PatrickRWright commented Jan 13, 2022

what's the problem with duplicated CDS ? #35

what's the problem with duplicated CDS ? #35

Comments

martin-raden commented Jan 12, 2022

martin-raden commented Jan 12, 2022 • edited Loading

PatrickRWright commented Jan 13, 2022

martin-raden commented Jan 13, 2022

JensGeorg commented Jan 13, 2022 via email

martin-raden commented Jan 13, 2022

martin-raden commented Jan 13, 2022

PatrickRWright commented Jan 13, 2022

martin-raden commented Jan 13, 2022

PatrickRWright commented Jan 13, 2022

martin-raden commented Jan 13, 2022

PatrickRWright commented Jan 13, 2022

martin-raden commented Jan 12, 2022 •

edited

Loading