Duplicates from miraligner #45

Open
jonahcullen opened this issue Apr 26, 2023 · 2 comments

Comments

@jonahcullen

Hello, I have a quick question/clarification about duplicate miRNA IDs. In the duplicates matrix, a large number of miRNAs are called duplicates because they share a license plate, but when you look at the annotation, the difference seems to come only from the ordering of the variants (e.g. iso_snv,iso_3p:-1 vs iso_3p:-1,iso_snv). Should these actually be considered different miRNAs? In the couple of examples I looked at, it appears that half of the samples list the variants in one order and the other half in a different order, both with the same license plate. I wrote some code to adjust for that, which resulted in 0 duplicates from the inputs.
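To make the ordering issue concrete, here is a minimal sketch on made-up rows. The miRNA name, the UID strings, and the helper `variant_key` are all hypothetical and not part of the pipeline; the sketch only illustrates one way to find entries that share a miRNA and UID and differ purely in the order of their variants.

```python
from collections import defaultdict

# Invented rows: (miRNA, Variant, UID). Names and UIDs are placeholders.
rows = [
    ("mir-A", "iso_snv,iso_3p:-1", "uid1"),
    ("mir-A", "iso_3p:-1,iso_snv", "uid1"),
    ("mir-A", "iso_5p:+1", "uid2"),
]

def variant_key(variant):
    """Order-independent form of a comma-separated Variant string."""
    return ",".join(sorted(v.strip() for v in variant.split(",")))

# Collect the distinct Variant spellings seen for each (miRNA, UID, key).
spellings = defaultdict(set)
for mirna, variant, uid in rows:
    spellings[(mirna, uid, variant_key(variant))].add(variant)

# Keep only groups whose spellings differ purely by ordering.
ordering_only = {k: sorted(v) for k, v in spellings.items() if len(v) > 1}
for key, variants in ordering_only.items():
    print(key, variants)
```

Any group reported here would be a candidate for merging rather than a true duplicate, assuming the variant order carries no meaning.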

@JFsanchezherrero
Member

Hi there,

We decided to discard these isomiRs as they could be potential mistakes or real duplicates. We trust the miraligner and/or miRTop annotation and haven't investigated further.

In some real data we have analyzed, we always find that these isomiRs have spurious counts, so we decided to discard them, but it might be worth checking them further or taking them into account. The pipeline generates this duplicates matrix so that the end user can decide what to do with them.

Do you have any examples of how to solve this issue? It would be great to have a look.

Best regards

@jonahcullen
Author

jonahcullen commented Apr 27, 2023

Right, okay. Perhaps I am just confused about how the ordering of the annotation, given the same UID, could be spurious? I started digging into this because I found 2x as many duplicates as non-duplicates. I don't have an exact number, but the vast majority of those duplicates have 0 counts, though some are fairly high.

I modified the isomir matrix generator function by including

isomirs = [
    'iso_3p', 'iso_add3p', 'iso_5p', 'iso_add5p', 'iso_snv',
    'iso_snv_central', 'iso_snv_central_offset', 'iso_snv_central_supp',
    'iso_snv_seed', 'NA'
]

# Sort key: rank each variant by the position of its type
# (the text before ':') in the isomirs list above.
def sort_key(element):
    return isomirs.index(element.split(':')[0])

and then, instead of

data['unique_id'] = data.apply(lambda data: data['miRNA'] + '&' + data['Variant'] + '&' + data['UID'], axis=1)

using

data['unique_id'] = data.apply(lambda row:
    row['miRNA']
    + '&'
    + ','.join(sorted(row['Variant'].split(','), key=sort_key))
    + '&'
    + row['UID'],
    axis=1
)
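For anyone who wants to try the idea outside the pipeline, the following is a self-contained sketch of the same normalization using plain strings. It is not the pipeline's code: `make_unique_id` is a hypothetical helper, and two details are my own assumptions rather than part of the change above: variant types missing from the list sort last instead of raising an error, and the full variant string is used as a tiebreaker so that two variants of the same type always come out in the same order.

```python
# Canonical order of variant types, copied from the list in this comment.
ISOMIRS = [
    'iso_3p', 'iso_add3p', 'iso_5p', 'iso_add5p', 'iso_snv',
    'iso_snv_central', 'iso_snv_central_offset', 'iso_snv_central_supp',
    'iso_snv_seed', 'NA',
]
RANK = {name: i for i, name in enumerate(ISOMIRS)}

def sort_key(element):
    """Rank by variant type; unknown types sort last (an assumption)."""
    kind = element.split(':')[0].strip()
    return (RANK.get(kind, len(ISOMIRS)), element)

def make_unique_id(mirna, variant, uid):
    """Hypothetical helper: build miRNA&Variant&UID with ordered variants."""
    parts = [v.strip() for v in variant.split(',')]
    return '&'.join([mirna, ','.join(sorted(parts, key=sort_key)), uid])

# Two invented entries that differ only in variant order.
a = make_unique_id('mir-A', 'iso_snv,iso_3p:-1', 'uid1')
b = make_unique_id('mir-A', 'iso_3p:-1,iso_snv', 'uid1')
print(a == b, a)
```

With a DataFrame that has the miRNA, Variant, and UID columns, the helper could presumably be applied row by row in the same way as the lambda above.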


2 participants