Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

parsing from sRNAbench data #53

Open
lsumoy opened this issue Feb 19, 2019 · 10 comments
Open

parsing from sRNAbench data #53

lsumoy opened this issue Feb 19, 2019 · 10 comments

Comments

@lsumoy
Copy link

lsumoy commented Feb 19, 2019

Hi,

We recently detected that gff output form mirtop stats gives unexpected results. We got two entries for canonical (variant=NA for many miRNAs instead of a single one (one of the two in fact is a n add-3p=-1 isoform but the miRTop generated gff tags it as NA (canonical).

In many instances wo results come out annotated as 'NA? (i.e. canonical or mature reference in miRBase) in our parsed output:

example: in the case of miR-151a-5p
http://www.mirbase.org/cgi-bin/mature.pl?mature_acc=MIMAT0004697 the canonical reference lecenseplate ID is PhkUD0w

9589 isomiR HC_BST_RA_L001_10 hsa-mir-151a hsa-miR-151a-5p PhkUD0 TCGAGGAGCTCACAGTCT
9593 isomiR HC_BST_RA_L001_10 hsa-mir-151a hsa-miR-151a-5p PhkUD0w TCGAGGAGCTCACAGTCTAGT
expression
9589 169
9593 1551

Looking at gff files generated by mirtop (please see highlighted (**) entries that have NA as variant):

grep -w 'PhkUD0' HC_BST_RA_L001_10.gff
hsa-mir-151a miRBasev22 isomiR 10 29 . + . Read=TCGAGGAGCTCACAGTCTA; UID=PhkUD0@2; Name=hsa-miR-151a-5p; Parent=hsa-mir-151a; Variant=iso_3p:+1; Cigar=19M; Expression=14; Filter=Pass; Hits=2
hsa-mir-151a miRBasev22 ref_miRNA 10 28 . + . Read=TCGAGGAGCTCACAGTCT; UID=PhkUD0; Name=hsa-miR-151a-5p; Parent=hsa-mir-151a; Variant=NA; Cigar=18M; Expression=169; Filter=Pass; Hits=2
hsa-mir-151b miRBasev22 isomiR 59 78 . + . Read=TCGAGGAGCTCACAGTCTA; UID=PhkUD0@2; Name=hsa-miR-151b; Parent=hsa-mir-151b; Variant=iso_3p:-2; Cigar=19M; Expression=14; Filter=Pass; Hits=2
hsa-mir-151b miRBasev22 isomiR 59 77 . + . Read=TCGAGGAGCTCACAGTCT; UID=PhkUD0; Name=hsa-miR-151b; Parent=hsa-mir-151b; Variant=iso_3p:-3; Cigar=18M; Expression=169; Filter=Pass; Hits=2
hsa-mir-151b miRBasev22 isomiR 59 80 . + . Read=TCGAGGAGCTCACAGTCTAAA; UID=PhkUD0@; Name=hsa-miR-151b; Parent=hsa-mir-151b; Variant=iso_add:+2; Cigar=19MAM; Expression=1; Filter=Pass; Hits=1

grep -w 'PhkUD0w' HC_BST_RA_L001_10.gff
hsa-mir-151a miRBasev22 isomiR 10 32 . + . Read=TCGAGGAGCTCACAGTCTAGTA; UID=PhkUD0w@2; Name=hsa-miR-151a-5p; Parent=hsa-mir-151a; Variant=iso_3p:+1; Cigar=22M; Expression=657; Filter=Pass; Hits=1
hsa-mir-151a miRBasev22 isomiR 10 33 . + . Read=TCGAGGAGCTCACAGTCTAGTAA; UID=PhkUD0w@1; Name=hsa-miR-151a-5p; Parent=hsa-mir-151a; Variant=iso_add:+1; Cigar=22MA; Expression=56; Filter=Pass; Hits=1
hsa-mir-151a miRBasev22 isomiR 10 34 . + . Read=TCGAGGAGCTCACAGTCTAGTAAA; UID=PhkUD0w@; Name=hsa-miR-151a-5p; Parent=hsa-mir-151a; Variant=iso_add:+2; Cigar=22MAA; Expression=2; Filter=Pass; Hits=1
hsa-mir-151a miRBasev22 ref_miRNA 10 31 . + . Read=TCGAGGAGCTCACAGTCTAGT; UID=PhkUD0w; Name=hsa-miR-151a-5p; Parent=hsa-mir-151a; Variant=NA; Cigar=21M; Expression=1551; Filter=Pass; Hits=1

Steps to reproduce the problem.

/imppc/labs/lslab/jsanchez/python_environment/mirtop_jsanchez/bin/mirtop gff --sps hsa --hairpin /imppc/labs/lslab/aluna/opt/sRNAtoolboxDB/libs/hairpin.fa --gtf /imppc/labs/lslab/aluna/opt/hsa.gff3 --format srnabench -o /imppc/labs/lslab/share/data/proc_data/20190208_LungCancer_Serum_small_RNAseq/2.Analysis_20190208/LungCancer_Serum_PE/6.2.isomiR_miRTop/HC_BST_RA_L001_10 /imppc/labs/lslab/share/data/proc_data/20190208_LungCancer_Serum_small_RNAseq/2.Analysis_20190208/LungCancer_Serum_PE/6.1.isomiR_sRNAbenchtoolbox/HC_BST_RA_L001_10 2> /imppc/labs/lslab/share/data/proc_data/20190208_LungCancer_Serum_small_RNAseq/2.Analysis_20190208/LungCancer_Serum_PE/6.2.isomiR_miRTop/HC_BST_RA_L001_10_logfile.txt

/imppc/labs/lslab/jsanchez/python_environment/mirtop_jsanchez/bin/mirtop stats -o /imppc/labs/lslab/share/data/proc_data/20190208_LungCancer_Serum_small_RNAseq/2.Analysis_20190208/LungCancer_Serum_PE/6.2.isomiR_miRTop/HC_BST_RA_L001_10/stats /imppc/labs/lslab/share/data/proc_data/20190208_LungCancer_Serum_small_RNAseq/2.Analysis_20190208/LungCancer_Serum_PE/6.2.isomiR_miRTop/HC_BST_RA_L001_10/HC_BST_RA_L001_10.gff 2>> /imppc/labs/lslab/share/data/proc_data/20190208_LungCancer_Serum_small_RNAseq/2.Analysis_20190208/LungCancer_Serum_PE/6.2.isomiR_miRTop/HC_BST_RA_L001_10_logfile.txt

Specifications like the version of the project, operating system, or hardware.

We are running this on:
mirtop (0.3.17)
python2.7
debian8.10
linux

@lpantano
Copy link
Contributor

Hi,

thanks for reporting this. This may be fixed in the devel version. Can you provide with these two files:

  • microRNAannotation.txt
  • reads.annotation

You can create them to contain only these two sequences. I can test it here, and see what is going on.

As well, you can try to use the devel version yourself.

Thanks a lot for spotting this.

Cheers

@lsumoy
Copy link
Author

lsumoy commented Feb 20, 2019

Here are the specific entries for the example in the two sRNAbench output files:

microRNAannotation.txt lines:
TCGAGGAGCTCACAGTCTAGT hsa-miR-151a-5p hsa-mir-151a exact - 1551 597.3388956586441 351.1619134905494
TCGAGGAGCTCACAGTCT hsa-miR-151a-5p$hsa-miR-151b hsa-mir-151a$hsa-mir-151b exact$lv3p|lv3pT|lv3p#-3 - 169 65.08721687060661 38.263290380337104

reads.annotation.txt lines:
TCGAGGAGCTCACAGTCTAGT 1551 351.1619134905494 mature#sense mature#hsa-miR-151a-5p#sense#hsa-mir-151a,11,31 1
TCGAGGAGCTCACAGTCT 169 38.263290380337104 mature#sense mature#hsa-miR-151a-5p#sense#hsa-mir-151a,11,28$mature#hsa-miR-151b#sense#hsa-mir-151b,60,77 2

reads.annotation.txt
microRNAannotation.txt

@lpantano
Copy link
Contributor

lpantano commented Feb 20, 2019 via email

@lpantano
Copy link
Contributor

Hi,

now that I am looking better to the sRNAbench output, I see that there is where the two reads are labeled as exact for hsa-miR-151a-5p. I assume the order is relevant in the columns indicating the miRNA and the variants.

So, for instance, read:

TCGAGGAGCTCACAGTCTAGT says is hsa-mir-151a exact

and read:

TCGAGGAGCTCACAGTCT says is hsa-mir-151a$hsa-mir-151b exact$lv3p|lv3pT|lv3p#-3, and I assume the first variant type (exact) is for hsa-mir-151a and the second (lv3p) for hsa-mir-151b.

mirtop is only parsing that. I assume the variants should be switched, it would be worthy to have the opinion of @mlhack on this.

Cheers

@JFsanchezherrero
Copy link

Hi Lorena,
I reply here to your first question.

It is version 22 for mirBase in this case.

##gff-version 3
##date 2018-3-5

#Chromosomal coordinates of Homo sapiens microRNAs
#microRNAs: miRBase v22
#genome-build-id: GRCh38
#genome-build-accession: NCBI_Assembly:GCA_000001405.15

Thank you very much

@lsumoy
Copy link
Author

lsumoy commented Feb 20, 2019

This was done with mirBase 22.

Looking it up it seems the shorter form is the exact for miR-151b and the longer one is the exact for 151a. The fact that the shorter could be at the same time a add-3p:-3 of miR-151a leads to the confusion.

Beyond issues with files formats and data parsing to me this means that it is tricky to perform analysis grouping by isomiR class and that it makes all the sense to do it at the individual sequence (license plate) level. If not it represents a nightmare in the few cases were there is redundancy in terms of annotation. I suppose similar issue happen with alternative transcript isoform and splicing analysis in mRNA.

@mlhack
Copy link

mlhack commented Feb 20, 2019

Hi all, thanks for detecting this. it really seems that the labels are incorrectly switched. i will check and fix asap.
Cheers,
Michael

@lpantano
Copy link
Contributor

Hi @lsumoy,

Just to follow up on your comment:

Yes, it is difficult to group by anything for these cases, that is the reason the format should give you enough information to do what it may best, like work with sequence levels, or even filter these cases.

It would be a good thing if you have somebody there to contribute in the code to create matrix with unique uid, where each sequence is represented only once and is labeling the sequences with all possible sources. If you are willing to add this feature, let me know, I can help with details!

Thanks!

@mlhack
Copy link

mlhack commented Feb 27, 2019

Hi all,
i fixed this issue. now, when one sequence was assigned to more than one reference sequence with different isomiR labels, those are in the correct order.
maybe one should try to guess the most likely scenario instead of making multiple labeling.
for example, in miRGeneDB, there is no hsa-mir-151b (correct @BastianFromm ?) and therefore the short sequence might be likely a 3' length variant of hsa-mir-151a. another probability would be to check for the abundance of the reference sequences assigning the read only to the most abundant one.
Michael

@lpantano
Copy link
Contributor

lpantano commented Feb 27, 2019 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants