Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

check which exon #450

Open
primary-11 opened this issue Nov 14, 2024 · 3 comments
Open

check which exon #450

primary-11 opened this issue Nov 14, 2024 · 3 comments

Comments

@primary-11
Copy link

Hello, I would like to see which specific exons undergo alternative splicing for these genes. Taking SE.MATS.JCEC as an example, I’m currently using the Ensembl website to examine all transcripts of the corresponding gene and then check which exon of which transcript undergoes splicing. However, sometimes I can't find this information. Are there any tools that can annotate this, or other software I could use to view it?

@EricKutschera
Copy link
Contributor

You could use bedtools intersect to check the overlap of the SE events with the --gtf you used to run rMATS. The SE.MATS.JCEC.txt file can be converted to bed format with a python script copied below. You can filter the gtf down to exon rows with grep. Here are the commands I used:

python se_to_bed.py --in-mats SE.MATS.JCEC.txt --out-bed se.bed
grep "$(printf '\t')exon$(printf '\t')" gencode.v46.primary_assembly.annotation.gtf > exon.gtf
bedtools intersect -wo -a exon.gtf -b se.bed > se_results.txt

An example line of se_results.txt is:

chr1	HAVANA	exon	24738	24886	.	-	.	gene_id "ENSG00000227232.6"; transcript_id "ENST00000488147.2"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-201"; exon_number 1; exon_id "ENSE00003507205.2"; level 2; transcript_support_level "NA"; hgnc_id "HGNC:38034"; ont "PGO:0000005"; tag "basic"; tag "Ensembl_canonical"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";	chr1	18912	24891	88891	0	-	149

The last few columns show the SE event coordinates (chr1 18912 24891) and the event ID (88891). The gene, transcript, and exon info are also in that line: gene_id "ENSG00000227232.6"; transcript_id "ENST00000488147.2"; exon_number 1

Here's se_to_bed.py

import argparse


def parse_args():
    parser = argparse.ArgumentParser(
        description='Convert SE.MATS.JC[EC].txt to BED format')
    parser.add_argument('--in-mats', required=True, help='SE.MATS input file')
    parser.add_argument('--out-bed', required=True, help='BED output file')

    return parser.parse_args()


def se_to_bed(in_mats, out_bed):
    score = '0'
    with open(out_bed, 'wt') as out_handle:
        with open(in_mats, 'rt') as in_handle:
            for line_i, line in enumerate(in_handle):
                columns = line.rstrip('\n').split('\t')
                if line_i == 0:
                    headers = columns
                    continue

                row = dict(zip(headers, columns))
                out_columns = [
                    row['chr'], row['exonStart_0base'], row['exonEnd'],
                    row['ID'], score, row['strand']
                ]
                out_handle.write('{}\n'.format('\t'.join(out_columns)))


def main():
    args = parse_args()
    se_to_bed(args.in_mats, args.out_bed)


if __name__ == '__main__':
    main()

@primary-11
Copy link
Author

Thank you very much! I have successfully identified the specific locations of exon skipping events. Additionally, I would like to annotate the RI, MXE, A3SS, and A5SS files. I attempted to modify your script to achieve this, but I was unsuccessful. Could you please provide assistance with this?

@EricKutschera
Copy link
Contributor

This is the SE specific part of the script: row['exonStart_0base'], row['exonEnd'],

It selects the coordinates of the skipped exon to use when checking for overlapping exons from the gtf. For the other event types you can use other column names to get the coordinates to use for the overlap check. This post shows what the coordinate column names represent: #158

For A5SS and A3SS you could use row['longExonStart_0base'], row['longExonEnd'],
For RI: row['riExonStart_0base'], row['riExonEnd'],

For MXE there are two separate exons to check. You could write two lines to the bed file for each event:

out_columns = [
    row['chr'], row['1stExonStart_0base'], row['1stExonEnd'],
    '{}_1st'.format(row['ID']), score, row['strand']
]
out_handle.write('{}\n'.format('\t'.join(out_columns)))

out_columns = [
    row['chr'], row['2ndExonStart_0base'], row['2ndExonEnd'],
    '{}_2nd'.format(row['ID']), score, row['strand']
]
out_handle.write('{}\n'.format('\t'.join(out_columns)))

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants