parsing from sRNAbench data #53

lsumoy · 2019-02-19T17:42:17Z

Hi,

We recently detected that the gff output from mirtop stats gives unexpected results. For many miRNAs we get two entries labeled canonical (Variant=NA) instead of a single one. One of the two is in fact an add-3p=-1 isoform, but the mirtop-generated gff tags it as NA (canonical).

In many instances two results come out annotated as 'NA' (i.e. canonical or mature reference in miRBase) in our parsed output.

Example: for miR-151a-5p (http://www.mirbase.org/cgi-bin/mature.pl?mature_acc=MIMAT0004697), the canonical reference license plate ID is PhkUD0w.

9589 isomiR HC_BST_RA_L001_10 hsa-mir-151a hsa-miR-151a-5p PhkUD0 TCGAGGAGCTCACAGTCT
9593 isomiR HC_BST_RA_L001_10 hsa-mir-151a hsa-miR-151a-5p PhkUD0w TCGAGGAGCTCACAGTCTAGT
expression
9589 169
9593 1551

Looking at the gff files generated by mirtop (please see the entries that have Variant=NA):

grep -w 'PhkUD0' HC_BST_RA_L001_10.gff
hsa-mir-151a miRBasev22 isomiR 10 29 . + . Read=TCGAGGAGCTCACAGTCTA; UID=PhkUD0@2; Name=hsa-miR-151a-5p; Parent=hsa-mir-151a; Variant=iso_3p:+1; Cigar=19M; Expression=14; Filter=Pass; Hits=2
hsa-mir-151a miRBasev22 ref_miRNA 10 28 . + . Read=TCGAGGAGCTCACAGTCT; UID=PhkUD0; Name=hsa-miR-151a-5p; Parent=hsa-mir-151a; Variant=NA; Cigar=18M; Expression=169; Filter=Pass; Hits=2
hsa-mir-151b miRBasev22 isomiR 59 78 . + . Read=TCGAGGAGCTCACAGTCTA; UID=PhkUD0@2; Name=hsa-miR-151b; Parent=hsa-mir-151b; Variant=iso_3p:-2; Cigar=19M; Expression=14; Filter=Pass; Hits=2
hsa-mir-151b miRBasev22 isomiR 59 77 . + . Read=TCGAGGAGCTCACAGTCT; UID=PhkUD0; Name=hsa-miR-151b; Parent=hsa-mir-151b; Variant=iso_3p:-3; Cigar=18M; Expression=169; Filter=Pass; Hits=2
hsa-mir-151b miRBasev22 isomiR 59 80 . + . Read=TCGAGGAGCTCACAGTCTAAA; UID=PhkUD0@; Name=hsa-miR-151b; Parent=hsa-mir-151b; Variant=iso_add:+2; Cigar=19MAM; Expression=1; Filter=Pass; Hits=1

grep -w 'PhkUD0w' HC_BST_RA_L001_10.gff
hsa-mir-151a miRBasev22 isomiR 10 32 . + . Read=TCGAGGAGCTCACAGTCTAGTA; UID=PhkUD0w@2; Name=hsa-miR-151a-5p; Parent=hsa-mir-151a; Variant=iso_3p:+1; Cigar=22M; Expression=657; Filter=Pass; Hits=1
hsa-mir-151a miRBasev22 isomiR 10 33 . + . Read=TCGAGGAGCTCACAGTCTAGTAA; UID=PhkUD0w@1; Name=hsa-miR-151a-5p; Parent=hsa-mir-151a; Variant=iso_add:+1; Cigar=22MA; Expression=56; Filter=Pass; Hits=1
hsa-mir-151a miRBasev22 isomiR 10 34 . + . Read=TCGAGGAGCTCACAGTCTAGTAAA; UID=PhkUD0w@; Name=hsa-miR-151a-5p; Parent=hsa-mir-151a; Variant=iso_add:+2; Cigar=22MAA; Expression=2; Filter=Pass; Hits=1
hsa-mir-151a miRBasev22 ref_miRNA 10 31 . + . Read=TCGAGGAGCTCACAGTCTAGT; UID=PhkUD0w; Name=hsa-miR-151a-5p; Parent=hsa-mir-151a; Variant=NA; Cigar=21M; Expression=1551; Filter=Pass; Hits=1

Steps to reproduce the problem.

/imppc/labs/lslab/jsanchez/python_environment/mirtop_jsanchez/bin/mirtop gff --sps hsa --hairpin /imppc/labs/lslab/aluna/opt/sRNAtoolboxDB/libs/hairpin.fa --gtf /imppc/labs/lslab/aluna/opt/hsa.gff3 --format srnabench -o /imppc/labs/lslab/share/data/proc_data/20190208_LungCancer_Serum_small_RNAseq/2.Analysis_20190208/LungCancer_Serum_PE/6.2.isomiR_miRTop/HC_BST_RA_L001_10 /imppc/labs/lslab/share/data/proc_data/20190208_LungCancer_Serum_small_RNAseq/2.Analysis_20190208/LungCancer_Serum_PE/6.1.isomiR_sRNAbenchtoolbox/HC_BST_RA_L001_10 2> /imppc/labs/lslab/share/data/proc_data/20190208_LungCancer_Serum_small_RNAseq/2.Analysis_20190208/LungCancer_Serum_PE/6.2.isomiR_miRTop/HC_BST_RA_L001_10_logfile.txt

/imppc/labs/lslab/jsanchez/python_environment/mirtop_jsanchez/bin/mirtop stats -o /imppc/labs/lslab/share/data/proc_data/20190208_LungCancer_Serum_small_RNAseq/2.Analysis_20190208/LungCancer_Serum_PE/6.2.isomiR_miRTop/HC_BST_RA_L001_10/stats /imppc/labs/lslab/share/data/proc_data/20190208_LungCancer_Serum_small_RNAseq/2.Analysis_20190208/LungCancer_Serum_PE/6.2.isomiR_miRTop/HC_BST_RA_L001_10/HC_BST_RA_L001_10.gff 2>> /imppc/labs/lslab/share/data/proc_data/20190208_LungCancer_Serum_small_RNAseq/2.Analysis_20190208/LungCancer_Serum_PE/6.2.isomiR_miRTop/HC_BST_RA_L001_10_logfile.txt

Specifications like the version of the project, operating system, or hardware.

We are running this on:
mirtop (0.3.17)
python2.7
debian8.10
linux

lpantano · 2019-02-20T12:51:02Z

Hi,

Thanks for reporting this. This may be fixed in the devel version. Can you provide these two files:

microRNAannotation.txt
reads.annotation

You can reduce them to contain only these two sequences. I can test it here and see what is going on.

You can also try the devel version yourself.

Thanks a lot for spotting this.

Cheers

lsumoy · 2019-02-20T13:18:00Z

Here are the specific entries for the example in the two sRNAbench output files:

microRNAannotation.txt lines:
TCGAGGAGCTCACAGTCTAGT hsa-miR-151a-5p hsa-mir-151a exact - 1551 597.3388956586441 351.1619134905494
TCGAGGAGCTCACAGTCT hsa-miR-151a-5p$hsa-miR-151b hsa-mir-151a$hsa-mir-151b exact$lv3p|lv3pT|lv3p#-3 - 169 65.08721687060661 38.263290380337104

reads.annotation.txt lines:
TCGAGGAGCTCACAGTCTAGT 1551 351.1619134905494 mature#sense mature#hsa-miR-151a-5p#sense#hsa-mir-151a,11,31 1
TCGAGGAGCTCACAGTCT 169 38.263290380337104 mature#sense mature#hsa-miR-151a-5p#sense#hsa-mir-151a,11,28$mature#hsa-miR-151b#sense#hsa-mir-151b,60,77 2

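reads.annotation.txt <https://github.com/miRTop/mirtop/files/2884720/reads.annotation.txt>
microRNAannotation.txt <https://github.com/miRTop/mirtop/files/2884709/microRNAannotation.txt>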

lpantano · 2019-02-20T14:59:58Z

Sorry, another question: is this miRBase 21 or 22?


lpantano · 2019-02-20T15:18:14Z

Hi,

Now that I am looking more closely at the sRNAbench output, I see that this is where the two reads are labeled as exact for hsa-miR-151a-5p. I assume the order is relevant in the columns indicating the miRNA and the variants.

So, for instance, read:

TCGAGGAGCTCACAGTCTAGT is reported as hsa-mir-151a exact

and read:

TCGAGGAGCTCACAGTCT is reported as hsa-mir-151a$hsa-mir-151b exact$lv3p|lv3pT|lv3p#-3, and I assume the first variant type (exact) is for hsa-mir-151a and the second (lv3p) for hsa-mir-151b.

mirtop is only parsing that. I assume the variants should be switched. It would be worth having the opinion of @mlhack on this.
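
To make that positional assumption concrete, here is a minimal Python sketch. It is not mirtop's actual parser: `pair_labels` is a hypothetical helper, and the column order (sequence, mature names, hairpin names, variant labels, a placeholder, read count) is only inferred from the two microRNAannotation.txt rows posted above.

```python
def pair_labels(line):
    """Split one microRNAannotation.txt row into one record per reference.

    Assumes whitespace-separated columns and '$' as the separator between
    alternative references, pairing names and variant labels by position.
    """
    fields = line.split()
    seq, matures, hairpins, variants = fields[:4]
    count = int(fields[5])  # fields[4] is the '-' placeholder column

    names = matures.split("$")
    parents = hairpins.split("$")
    labels = variants.split("$")
    if not (len(names) == len(parents) == len(labels)):
        raise ValueError("unequal number of '$'-separated entries: " + line)

    return [
        {"seq": seq, "mature": n, "hairpin": p, "variant": v, "count": count}
        for n, p, v in zip(names, parents, labels)
    ]


row = (
    "TCGAGGAGCTCACAGTCT hsa-miR-151a-5p$hsa-miR-151b "
    "hsa-mir-151a$hsa-mir-151b exact$lv3p|lv3pT|lv3p#-3 "
    "- 169 65.08721687060661 38.263290380337104"
)
pairs = pair_labels(row)
for pair in pairs:
    print(pair["mature"], pair["variant"])
```

With positional pairing, the short read ends up labeled `exact` for hsa-miR-151a-5p, so whatever order sRNAbench writes is what ends up in the gff.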

Cheers

JFsanchezherrero · 2019-02-20T16:37:55Z

Hi Lorena,
I am replying here to your question.

It is miRBase version 22 in this case.

##gff-version 3
##date 2018-3-5

#Chromosomal coordinates of Homo sapiens microRNAs
#microRNAs: miRBase v22
#genome-build-id: GRCh38
#genome-build-accession: NCBI_Assembly:GCA_000001405.15

Thank you very much

lsumoy · 2019-02-20T16:40:24Z

This was done with miRBase 22.

Looking it up, it seems the shorter form is the exact for miR-151b and the longer one is the exact for 151a. The fact that the shorter one could at the same time be an add-3p:-3 of miR-151a leads to the confusion.

Beyond the issues with file formats and data parsing, to me this means that it is tricky to perform analysis grouping by isomiR class, and that it makes much more sense to do it at the individual sequence (license plate) level. Otherwise it becomes a nightmare in the few cases where there is redundancy in the annotation. I suppose similar issues arise with alternative transcript isoforms and splicing analysis in mRNA.

mlhack · 2019-02-20T22:10:45Z

Hi all, thanks for detecting this. It really seems that the labels are incorrectly switched. I will check and fix it asap.
Cheers,
Michael

lpantano · 2019-02-26T22:24:33Z

Hi @lsumoy,

Just to follow up on your comment:

Yes, it is difficult to group by anything in these cases. That is why the format should give you enough information to do whatever works best for you, such as working at the sequence level or even filtering these cases out.

It would be great if somebody on your side could contribute code to create a matrix keyed by unique UID, where each sequence is represented only once and is labeled with all its possible sources. If you are willing to add this feature, let me know and I can help with the details!
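
A rough illustration of such a table, as a minimal Python sketch. This is not existing mirtop code: `parse_attributes` and `collapse_by_uid` are hypothetical names, and the sketch assumes a single-sample gff in which `Expression` holds one integer, as in the lines posted in this thread.

```python
def parse_attributes(column):
    """Turn 'Key=Value; Key=Value' into a dict."""
    attrs = {}
    for item in column.split(";"):
        item = item.strip()
        if item:
            key, _, value = item.partition("=")
            attrs[key] = value
    return attrs


def collapse_by_uid(gff_lines):
    """Return {UID: row}, one row per sequence with all (Name, Variant) sources."""
    table = {}
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        # The first 8 columns are positional; everything after is attributes.
        fields = line.rstrip("\n").split(None, 8)
        attrs = parse_attributes(fields[8])
        row = table.setdefault(
            attrs["UID"],
            {"read": attrs["Read"],
             "expression": int(attrs["Expression"]),
             "sources": []},
        )
        row["sources"].append((attrs["Name"], attrs["Variant"]))
    return table


lines = [
    "hsa-mir-151a miRBasev22 ref_miRNA 10 28 . + . Read=TCGAGGAGCTCACAGTCT; "
    "UID=PhkUD0; Name=hsa-miR-151a-5p; Parent=hsa-mir-151a; Variant=NA; "
    "Cigar=18M; Expression=169; Filter=Pass; Hits=2",
    "hsa-mir-151b miRBasev22 isomiR 59 77 . + . Read=TCGAGGAGCTCACAGTCT; "
    "UID=PhkUD0; Name=hsa-miR-151b; Parent=hsa-mir-151b; Variant=iso_3p:-3; "
    "Cigar=18M; Expression=169; Filter=Pass; Hits=2",
]
matrix = collapse_by_uid(lines)
print(matrix["PhkUD0"]["sources"])
```

Each sequence is counted once (here 169 reads for PhkUD0) while both candidate annotations stay attached to it, so downstream code can decide whether to keep, split, or filter ambiguous reads.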

Thanks!

mlhack · 2019-02-27T08:52:46Z

Hi all,
I fixed this issue. Now, when one sequence is assigned to more than one reference sequence with different isomiR labels, the labels are in the correct order.
Maybe one should try to guess the most likely scenario instead of assigning multiple labels.
For example, in miRGeneDB there is no hsa-mir-151b (correct, @BastianFromm?), and therefore the short sequence is likely a 3' length variant of hsa-mir-151a. Another possibility would be to check the abundance of the reference sequences and assign the read only to the most abundant one.
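
The abundance idea could be sketched as follows in Python. This is only an illustration under stated assumptions, not something sRNAbench or mirtop implements: `pick_most_abundant` is a hypothetical helper, and the counts are the two read counts quoted earlier in this thread, taken as the canonical counts of each reference.

```python
def pick_most_abundant(candidates, reference_counts):
    """Return the candidate reference whose canonical sequence has most reads.

    References missing from reference_counts count as 0; on a tie the
    first candidate in the list wins.
    """
    return max(candidates, key=lambda name: reference_counts.get(name, 0))


# Assumption: canonical read counts per reference, from the example above.
# Note that 169 is the count of the ambiguous read itself, which miRBase 22
# treats as the hsa-miR-151b reference according to the earlier comment.
reference_counts = {"hsa-miR-151a-5p": 1551, "hsa-miR-151b": 169}

ambiguous_read = "TCGAGGAGCTCACAGTCT"
candidates = ["hsa-miR-151a-5p", "hsa-miR-151b"]
assigned = pick_most_abundant(candidates, reference_counts)
print(ambiguous_read, "->", assigned)
```

Under these assumptions the short read would be assigned to hsa-miR-151a-5p only, which agrees with the miRGeneDB-based guess.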
Michael

lpantano · 2019-02-27T17:18:13Z

Thank you for the fix. All of these ideas sound good to me, and we are trying to test some of them with simulated data. Using other, more robust databases such as miRGeneDB will certainly help as well. Cheers

JFsanchezherrero mentioned this issue Sep 30, 2019

parsing from sRNAbench data (II) #60
