Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

what's the problem with duplicated CDS ? #35

Open
martin-raden opened this issue Jan 12, 2022 · 11 comments
Open

what's the problem with duplicated CDS ? #35

martin-raden opened this issue Jan 12, 2022 · 11 comments

Comments

@martin-raden
Copy link
Collaborator

Hi @PatrickRWright @JensGeorg

CopraRNA checks for duplicated CDS within the all.fas file:

system "grep '>' all.fas | uniq -d > duplicated_CDS.txt";

why? what is the problem with them? I find that even the recent E.coli genome has duplicated CDS, like

bash-4.2$ bzgrep -A10 -B10 b0149 ~/data/ncbi-refseq-gbk/NC_000913.gb.bz2
                     LQMLGALEGERLSAQGQKMAALGNDPRLAAMLVSAKNDDEAATAAKIAAILEEPPRMG
                     NSDLGVAFSRNQPAWQQRSQQLLKRLNVRGGEADSSLIAPLLAGAFADRIARRRGQDG
                     RYQLANGMGAMLDANDALSRHEWLIAPLLLQGSASPDARILLALLVDIDELVQRCPQL
                     VQQSDTVEWDDAQGTLKAWRRLQIGQLTVKVQPLAKPSEDELHQAMLNGIRDKGLSVL
                     NWTAEAEQLRLRLLCAAKWLPEYDWPAVDDESLLAALETWLLPHMTGVHSLRGLKSLD
                     IYQALRGLLDWGMQQRLDSELPAHYTVPTGSRIAIRYHEDNPPALAVRMQEMFGEATN
                     PTIAQGRVPLVLELLSPAQRPLQITRDLSDFWKGAYREVQKEMKGRYPKHVWPDDPAN
                     TAPTRRTKKYS"
     gene            164730..167264
                     /gene="mrcB"
                     /locus_tag="b0149"
                     /gene_synonym="ECK0148; pbpF; ponB"
                     /db_xref="ASAP:ABE-0000516"
                     /db_xref="ECOCYC:EG10605"
                     /db_xref="GeneID:944843"
     CDS             164730..167264
                     /gene="mrcB"
                     /locus_tag="b0149"
                     /gene_synonym="ECK0148; pbpF; ponB"
                     /EC_number="2.4.1.129"
                     /codon_start=1
                     /transl_table=11
                     /product="peptidoglycan glycosyltransferase/peptidoglycan
                     DD-transpeptidase MrcB"
                     /protein_id="NP_414691.1"
                     /db_xref="UniProtKB/Swiss-Prot:P02919"
                     /db_xref="ASAP:ABE-0000516"
                     /db_xref="ECOCYC:EG10605"
--
                     MLSARPLGVQPRGGVISPQPAFMQLVRQELQAKLGDKVKDLSGVKIFTTFDSVAQDAA
                     EKAAVEGIPALKKQRKLSDLETAIVVVDRFSGEVRAMVGGSEPQFAGYNRAMQARRSI
                     GSLAKPATYLTALSQPKIYRLNTWIADAPIALRQPNGQVWSPQNDDRRYSESGRVMLV
                     DALTRSMNVPTVNLGMALGLPAVTETWIKLGVPKDQLHPVPAMLLGALNLTPIEVAQA
                     FQTIASGGNRAPLSALRSVIAEDGKVLYQSFPQAERAVPAQAAYLTLWTMQQVVQRGT
                     GRQLGAKYPNLHLAGKTGTTNNNVDTWFAGIDGSTVTITWVGRDNNQPTKLYGASGAM
                     SIYQRYLANQTPTPLNLVPPEDIADMGVDYDGNFVCSGGMRILPVWTSDPQSLCQQSE
                     MQQQPSGNPFDQSSQPQQQPQQQPAQQEQKDSDGVAGWIKDMFGSN"
     CDS             164865..167264
                     /gene="mrcB"
                     /locus_tag="b0149"
                     /gene_synonym="ECK0148; pbpF; ponB"
                     /codon_start=1
                     /transl_table=11
                     /product="PBP1Bgamma"
                     /protein_id="YP_010051172.1"
                     /db_xref="ASAP:ABE-0000516"
                     /db_xref="ECOCYC:EG10605"
                     /db_xref="GeneID:944843"
                     /translation="MPRKGKGKGKGRKPRGKRGWLWLLLKLAIVFAVLIAIYGVYLDQ
                     KIRSRIDGKVWQLPAAVYGRMVNLEPDMTISKNEMVKLLEATQYRQVSKMTRPGEFTV

can this be ignored? if not: can we just prune the all.fas to the first occurrence of each CDS?

thanks,
Martin

@martin-raden
Copy link
Collaborator Author

martin-raden commented Jan 12, 2022

the latter, ie removing all but the first duplicate, can be done using

system "grep '>' all.fas | uniq -d > duplicated_CDS.txt";
# remove duplicated CDS to the first (hopefully longest)
if (-s "duplicated_CDS.txt") {
  # grep all duplicated CDS entries from FASTA file
  system "for P in `cat duplicated_CDS.txt`; do grep -A1 -m1 \"\$P\" all.fas; done > duplicated_CDS.first.fa";
  # compile a pattern to match all duplicated gene ids
  # remove all duplicated entries from all.fas
  system "PAT=\$(cat duplicated_CDS.txt | tr '\\n' '|'); cat all.fas | tr \"\\n\" \"#\" | sed \"s/#>/\\n>/g\" | grep -v -P  \"\${PAT%|}\" | tr '#' '\\n'  > all.fas.no-duplicates";
  # join both files into a new all.fas
  system "cat all.fas.no-duplicates duplicated_CDS.first.fa > all.fas";
}

which should be done BEFORE doing the domclust call, ie. after all.fas was created!

@PatrickRWright
Copy link
Owner

Hi Martin, its been years now but off the top of my head I think this may have been an issue with domclust. i.e. if duplicate CDS exist, then domclust fails. This is just a best guess.

@martin-raden
Copy link
Collaborator Author

Hi @PatrickRWright (happy new year, btw), thanks for the quick reply. So pruning duplicated entries to one should fix the issue right? Otherwise, we are a bit doomed, since even E.coli nowadays shows multiple CDS for some genes...

@JensGeorg
Copy link
Contributor

JensGeorg commented Jan 13, 2022 via email

@martin-raden
Copy link
Collaborator Author

ID-based... it greps the CDS-FASTA-IDs and checks for duplicates within the all.fas file (which is used in domclust etc)

@martin-raden
Copy link
Collaborator Author

eventually, it seems to me (at least for e.coli) that the genome file lists sometimes after the full CDS also subsequences as CDS (based on alternative start codons?)

@PatrickRWright
Copy link
Owner

Hi @PatrickRWright (happy new year, btw), thanks for the quick reply. So pruning duplicated entries to one should fix the issue right? Otherwise, we are a bit doomed, since even E.coli nowadays shows multiple CDS for some genes...

I would guess so but I'm not sure. How was this triggered? Were there errored runs?

@martin-raden
Copy link
Collaborator Author

we are currently migrating the webserver to a new cloud-based computation platform and thus have to reinstall everything. with that, I also updated the genome files, which now provides e.coli runs with an error due to the CDS duplicates. thus, I am currently working around it and would suggest to incorporate the change into the dev branch for integration into CopraRNA3...

@PatrickRWright
Copy link
Owner

Ok, so I think it depends on the extent of sequences that will need to be removed for CopraRNA to work from the technical side. How many duplicates are we talking about for E. coli? I think relevance of simply removing many duplicates can best be evaluated by @JensGeorg
I would suggest that you have a look at what is duplicated and why and then assess what the consequences of "straight forward" removal would be. This requires biological domain knowledge.
If the duplicates are few (<10) I assume the relevance in regard to results is minimal.

@martin-raden
Copy link
Collaborator Author

they are few (6 genes) and the suggestion from above doesnt remove all CDS from these genes but keeps the first CDS version (which is from first inspection the longest CDS covering the whole gene and not only parts of it)

so I would assume the impact small to non-existing but it keeps the workflow running and more robust.

@JensGeorg what do you think?

@PatrickRWright
Copy link
Owner

I think keeping one is better than removing them completely.

This was referenced Jan 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants