Creating Metafusion Reference Files (gene_bed and gene_info)

pintoa1-mskcc edited this page Aug 10, 2023 · 1 revision

Gene_Bed

This file is generated from our desired final Ensembl version (v75). All Metafusion annotation, scoring, and transcript identification is drawn from this file. It is also used to convert as many fusions as possible to a single Ensembl version before clustering.

To create this file, see final_generate_v75_gene_bed.R.

Note: This script takes several hours to run because it iterates over every transcript in the GTF.

Overview:

  1. Add introns to the GTF with the AGAT tool; the output is a GFF3 file.
  2. Convert the GFF3 file to a BED file with gff2bed.
  3. For every transcript ID in the BED file, assign indexes to its regions by iterating over the introns, and reformat the region information into the format Metafusion accepts:
    • CDS becomes cds
    • UTR becomes either utr5 or utr3
    • an exon in any transcript without a CDS region becomes utr5 or utr3, depending on strand
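
Step 3 can be illustrated with a minimal Python sketch for a single transcript. This is not the logic of final_generate_v75_gene_bed.R, only an illustration under stated assumptions: the function name `relabel_transcript`, the tuple layout, and the `noncoding_label` argument are hypothetical; the index is assumed to start at 0 and advance after each intron; UTRs are assumed to be utr5 before the first CDS in transcription order and utr3 after it; and exon rows of coding transcripts are assumed to be dropped because CDS and UTR rows already cover them. The strand rule for non-coding exons is not spelled out on this page, so the caller has to supply it.

```python
def relabel_transcript(features, strand, noncoding_label):
    """Relabel the regions of one transcript (illustrative sketch only).

    features: list of (start, end, feature_type) tuples, where feature_type
              is one of "CDS", "UTR", "exon", "intron".
    strand: "+" or "-".
    noncoding_label: dict mapping strand to "utr5" or "utr3" for exons of
              transcripts without a CDS (hypothetical argument; the real
              rule lives in the R script).
    Returns a list of (start, end, label, index) in transcription order.
    """
    # Transcription order: ascending coordinates on "+", descending on "-".
    ordered = sorted(features, key=lambda f: f[0], reverse=(strand == "-"))
    has_cds = any(ftype == "CDS" for _, _, ftype in ordered)

    seen_cds = False
    index = 0  # assumption: 0-based, advanced after each intron
    relabeled = []
    for start, end, ftype in ordered:
        if ftype == "intron":
            relabeled.append((start, end, "intron", index))
            index += 1
        elif ftype == "CDS":
            seen_cds = True
            relabeled.append((start, end, "cds", index))
        elif ftype == "UTR":
            # assumption: UTR before the first CDS is 5', after it is 3'
            label = "utr3" if seen_cds else "utr5"
            relabeled.append((start, end, label, index))
        elif ftype == "exon" and not has_cds:
            relabeled.append((start, end, noncoding_label[strand], index))
        # exon rows of coding transcripts are skipped (assumption)
    return relabeled


# Made-up coordinates for a two-exon coding transcript.
example = [
    (100, 150, "UTR"),
    (151, 200, "CDS"),
    (201, 300, "intron"),
    (301, 350, "CDS"),
    (351, 400, "UTR"),
]
placeholder_rule = {"+": "utr5", "-": "utr3"}  # placeholder, not the documented rule
for row in relabel_transcript(example, "+", placeholder_rule):
    print(row)
```

Sorting by start coordinate in reverse for minus-strand transcripts lets the same loop label both strands, since "before the first CDS" is then always evaluated in transcription order.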

Gene_info

This file is used to rename genes reported by different callers to a single Symbol. Each gene name to be renamed has an entry under "Synonyms" and is replaced by the corresponding "Symbol". The original Metafusion uses NCBI nomenclature symbols, but this can cause problems if your gene_bed does not follow NCBI nomenclature. For this reason, we generate the gene_info file with the script make_gene_info_for_forte.R. The script takes a "primary" GTF, which should be the file used to make your gene_bed, as well as the GTFs used by the FORTE callers.

Since Ensembl IDs tend to be more stable than gene names, the output gene_info file maps gene_ids (stored in the Synonyms column) to gene_names (stored in Symbol). The file starts as every unique gene_name:gene_id pairing from your primary GTF. Any gene_id from the other GTFs that does not exist in your primary GTF is appended to the end of the gene_info file as a unique gene_name:gene_id pairing. Versioned gene_ids from the other GTFs are added as synonyms by stripping the version and merging on the matching primary GTF gene_id.

For this reason, the output of make_cff_from_forte.R differs from the standard CFF format: it assigns the ENSG ID to t_gene1 and t_gene2.

In this way, if a gene_id exists in your primary GTF, the gene is renamed to the gene_name from your primary GTF. If a gene_id does not exist in your primary GTF, the ENSG ID is converted to the gene_name from the caller's original GTF.
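
The merge described in this section can be sketched in Python as follows. This is an illustration, not a port of make_gene_info_for_forte.R: the function names are hypothetical, the output is simplified to (Symbol, Synonym) pairs rather than the full gene_info layout, primary gene_ids are assumed to be unversioned, and a version is assumed to be everything after the first "." in a gene_id. The gene names and IDs in the example are made up.

```python
def strip_version(gene_id):
    """Drop a trailing version, e.g. 'ID.12' -> 'ID' (assumed convention)."""
    return gene_id.split(".", 1)[0]


def build_gene_info(primary_pairs, other_pairs):
    """Return ordered, de-duplicated (symbol, synonym) rows.

    primary_pairs: (gene_name, gene_id) pairs from the primary GTF.
    other_pairs:   (gene_name, gene_id) pairs from the callers' GTFs.
    """
    rows, seen = [], set()
    symbol_by_id = {}

    def add(row):
        if row not in seen:
            seen.add(row)
            rows.append(row)

    # 1. Start from every unique gene_name:gene_id pairing in the primary GTF.
    for name, gene_id in primary_pairs:
        add((name, gene_id))
        symbol_by_id.setdefault(gene_id, name)

    # 2. Merge the other GTFs on the version-stripped gene_id.
    for name, gene_id in other_pairs:
        base_id = strip_version(gene_id)
        if base_id in symbol_by_id:
            # Known gene: the caller's ID becomes a synonym of the primary symbol.
            add((symbol_by_id[base_id], gene_id))
        else:
            # Unknown gene: appended with the caller's own gene_name.
            add((name, gene_id))
    return rows


def rename(gene_id, rows):
    """Look up the Symbol for a gene_id; return the ID unchanged if unknown."""
    lookup = {synonym: symbol for symbol, synonym in rows}
    return lookup.get(gene_id, gene_id)


# Made-up names and IDs for illustration only.
primary = [("GENE_A", "ENSG_0001"), ("GENE_B", "ENSG_0002")]
callers = [("GENE_A_OLD", "ENSG_0001.5"), ("GENE_C", "ENSG_0003.1")]
gene_info = build_gene_info(primary, callers)
print(rename("ENSG_0001.5", gene_info))  # resolves to the primary GTF's name
print(rename("ENSG_0003.1", gene_info))  # falls back to the caller's name
```

Keying the lookup on gene_id rather than gene_name is what makes the ENSG IDs in t_gene1 and t_gene2 resolvable regardless of which caller's GTF produced them.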
