check which exon #450

primary-11 · 2024-11-14T01:23:43Z

Hello, I would like to see which specific exons undergo alternative splicing for these genes. Taking SE.MATS.JCEC as an example, I’m currently using the Ensembl website to examine all transcripts of the corresponding gene and then check which exon of which transcript undergoes splicing. However, sometimes I can't find this information. Are there any tools that can annotate this, or other software I could use to view it?

EricKutschera · 2024-11-14T14:00:15Z

You could use bedtools intersect to check the overlap of the SE events with the --gtf you used to run rMATS. The SE.MATS.JCEC.txt file can be converted to bed format with a python script copied below. You can filter the gtf down to exon rows with grep. Here are the commands I used:

python se_to_bed.py --in-mats SE.MATS.JCEC.txt --out-bed se.bed
grep "$(printf '\t')exon$(printf '\t')" gencode.v46.primary_assembly.annotation.gtf > exon.gtf
bedtools intersect -wo -a exon.gtf -b se.bed > se_results.txt

An example line of se_results.txt is:

chr1	HAVANA	exon	24738	24886	.	-	.	gene_id "ENSG00000227232.6"; transcript_id "ENST00000488147.2"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-201"; exon_number 1; exon_id "ENSE00003507205.2"; level 2; transcript_support_level "NA"; hgnc_id "HGNC:38034"; ont "PGO:0000005"; tag "basic"; tag "Ensembl_canonical"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";	chr1	18912	24891	88891	0	-	149

The last few columns show the SE event coordinates (chr1 18912 24891) and the event ID (88891). The gene, transcript, and exon info are also in that line: gene_id "ENSG00000227232.6"; transcript_id "ENST00000488147.2"; exon_number 1

Here's se_to_bed.py

import argparse


def parse_args():
    parser = argparse.ArgumentParser(
        description='Convert SE.MATS.JC[EC].txt to BED format')
    parser.add_argument('--in-mats', required=True, help='SE.MATS input file')
    parser.add_argument('--out-bed', required=True, help='BED output file')

    return parser.parse_args()


def se_to_bed(in_mats, out_bed):
    score = '0'
    with open(out_bed, 'wt') as out_handle:
        with open(in_mats, 'rt') as in_handle:
            for line_i, line in enumerate(in_handle):
                columns = line.rstrip('\n').split('\t')
                if line_i == 0:
                    headers = columns
                    continue

                row = dict(zip(headers, columns))
                out_columns = [
                    row['chr'], row['exonStart_0base'], row['exonEnd'],
                    row['ID'], score, row['strand']
                ]
                out_handle.write('{}\n'.format('\t'.join(out_columns)))


def main():
    args = parse_args()
    se_to_bed(args.in_mats, args.out_bed)


if __name__ == '__main__':
    main()

primary-11 · 2024-11-15T12:20:26Z

Thank you very much! I have successfully identified the specific locations of exon skipping events. Additionally, I would like to annotate the RI, MXE, A3SS, and A5SS files. I attempted to modify your script to achieve this, but I was unsuccessful. Could you please provide assistance with this?

EricKutschera · 2024-11-15T18:18:28Z

This is the SE specific part of the script: row['exonStart_0base'], row['exonEnd'],

It selects the coordinates of the skipped exon to use when checking for overlapping exons from the gtf. For the other event types you can use other column names to get the coordinates to use for the overlap check. This post shows what the coordinate column names represent: #158

For A5SS and A3SS you could use row['longExonStart_0base'], row['longExonEnd'],
For RI: row['riExonStart_0base'], row['riExonEnd'],

For MXE there are two separate exons to check. You could write two lines to the bed file for each event:

out_columns = [
    row['chr'], row['1stExonStart_0base'], row['1stExonEnd'],
    '{}_1st'.format(row['ID']), score, row['strand']
]
out_handle.write('{}\n'.format('\t'.join(out_columns)))

out_columns = [
    row['chr'], row['2ndExonStart_0base'], row['2ndExonEnd'],
    '{}_2nd'.format(row['ID']), score, row['strand']
]
out_handle.write('{}\n'.format('\t'.join(out_columns)))

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

check which exon #450

check which exon #450

primary-11 commented Nov 14, 2024

EricKutschera commented Nov 14, 2024

primary-11 commented Nov 15, 2024

EricKutschera commented Nov 15, 2024

check which exon #450

check which exon #450

Comments

primary-11 commented Nov 14, 2024

EricKutschera commented Nov 14, 2024

primary-11 commented Nov 15, 2024

EricKutschera commented Nov 15, 2024