Returns nearest gene names and numbers associated with one or more SNP inputs
After not finding a simple solution for what I needed, I wrote this very simple Python script using the Entrez module from Biopython. It queries the dbSNP database at the National Library of Medicine and writes two CSV files. Feel free to use either the script or the notebook; instructions for both are below.
jupyter notebook:
-- provide the SNPs as a list of rs### IDs
-- change the output file names where they are defined:
GeneOutputFile = './nearestGenes.csv'
SNPInfoOutputFile = './SNPInfo.csv'
command line script:
-- pass the SNPs as space-separated rs### IDs in a single argument
-- the output file path is an argument. To change the file names appended to that path, edit lines 127 and 128 of the script, where the files are created
Output filepaths are parameters you provide.
default for nearest gene file = current directory './nearestGenes.csv'
default for comprehensive SNP file = current directory './SNPInfo.csv'
Nearest gene information (as per ncbi)
GeneInfo schema:
GENE_ID : int ID#
NAME : string common name
ACC : string NC_000...
snpID : string rs###
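As a rough illustration of how the nearest-gene file could be consumed, the sketch below groups genes by SNP using only the standard library. It assumes the CSV has a header row with the four column names listed in the GeneInfo schema; `genes_by_snp` is a hypothetical helper, not part of the script, and the `ACC` and `snpID` values are the schema's own placeholders.

```python
import csv
import io
from collections import defaultdict


def genes_by_snp(csv_file):
    """Group (GENE_ID, NAME) pairs by snpID from an open CSV file object."""
    grouped = defaultdict(list)
    for row in csv.DictReader(csv_file):
        grouped[row["snpID"]].append((int(row["GENE_ID"]), row["NAME"]))
    return dict(grouped)


# In-memory stand-in for nearestGenes.csv; ACC and snpID are placeholders.
sample = io.StringIO(
    "GENE_ID,NAME,ACC,snpID\n"
    "55144,LRRC8D,NC_000...,rs###\n"
)
print(genes_by_snp(sample))  # {'rs###': [(55144, 'LRRC8D')]}
```

For a real file, pass `open('./nearestGenes.csv', newline='')` instead of the in-memory sample.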
Comprehensive information about each SNP
SNPInfo schema:
ACC : string
ALLELE : Bool
ALLELE_ORIGIN : string rs###, though blank most of the time
CHR : string chromosome snp found on
CHRPOS : string chr:bp
CHRPOS_PREV_ASSM : string chr:bp from previous assembly
CHRPOS_SORT : string chromosome position with leading zeros (000 in front)
CITED_SORT : string
CLINICAL_SIGNIFICANCE : string
CLINICAL_SORT : string
CREATEDATE : date the SNP record was created
DOCSUM : string summary of all columns
FXN_CLASS : string, comma-separated (e.g. 'intron_variant,genic_downstream_transcript_variant')
GENES : list of dictionaries of genes (e.g. [{'NAME': 'LRRC8D', 'GENE_ID': '55144'}])
GLOBAL_MAFS : list of dictionaries (e.g. {'STUDY': '1000Genomes', 'FREQ': 'C=0.049521/.....)
GLOBAL_POPULATION : string
GLOBAL_SAMPLESIZE : string
HANDLE : string, biobanks reporting the SNP? (e.g. 1000GENOMES,EVA_UK10K_TWINSUK,....)
IDList : list of UIDs returned by the initial get_idList(snp) call
MERGED_SORT : string
ORIG_BUILD : string
SNP_CLASS : string (e.g. 'snv')
SNP_ID : string
SNP_ID_SORT : string
SPDI : string (e.g. 'NC_000005.10:95305173:T:C')
SS : string (e.g. '10263196,82343251,82652005,112205736,165508732')
SUSPECTED : string
TAX_ID : string
TEXT : string (e.g. 'MergedRs=6896334')
UPDATEDATE : Date
UPD_BUILD : string
VALIDATED : string (e.g. 'by-frequency,by-alfa,by-cluster')
snpID : string rs###
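To make two of the structured fields concrete, here is a minimal sketch that splits an SPDI string into its parts and flattens the GENES list of one record into rows shaped like the GeneInfo schema. `parse_spdi`, `Spdi` and `gene_rows` are hypothetical names, not functions from the script. The sketch assumes that GeneInfo's ACC is simply copied from the SNP record and that the SPDI field holds a single colon-separated entry.

```python
from typing import NamedTuple


class Spdi(NamedTuple):
    sequence: str  # reference sequence accession
    position: int  # 0-based position in the SPDI convention
    deleted: str   # deleted (reference) sequence
    inserted: str  # inserted (alternate) sequence


def parse_spdi(spdi):
    """Split one 'sequence:position:deleted:inserted' string into fields."""
    sequence, position, deleted, inserted = spdi.split(":")
    return Spdi(sequence, int(position), deleted, inserted)


def gene_rows(snp_record):
    """Turn the GENES list of one SNP record into GeneInfo-style rows."""
    return [
        {
            "GENE_ID": int(gene["GENE_ID"]),
            "NAME": gene["NAME"],
            "ACC": snp_record["ACC"],      # assumption: same ACC as the SNP
            "snpID": snp_record["snpID"],
        }
        for gene in snp_record.get("GENES", [])
    ]


# Gene and SPDI values come from the schema examples; ACC and snpID are placeholders.
record = {
    "ACC": "NC_000...",
    "snpID": "rs###",
    "GENES": [{"NAME": "LRRC8D", "GENE_ID": "55144"}],
}
print(gene_rows(record))
print(parse_spdi("NC_000005.10:95305173:T:C"))
```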
from Bio import Entrez
from collections import OrderedDict
import pandas as pd
You will need:
email address : NCBI requires this, but you don't need an account; they like to keep track of who is using their databases
file path for SNPs : alternatively, create a list named snp_list so the functions work
output file path : defaults to your current directory, with nearestGenes.csv and SNPInfo.csv as file names
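If the SNPs live in a file, one possible way to build snp_list is sketched below. The file layout (rs IDs separated by whitespace or newlines) and the helper name `load_snp_list` are assumptions for illustration; the notebook may read its input differently.

```python
import re
from pathlib import Path

# An rs ID is the letters "rs" followed by digits.
RS_ID = re.compile(r"rs\d+")


def load_snp_list(path):
    """Read whitespace-separated rs IDs from a text file, dropping repeats."""
    tokens = Path(path).read_text().split()
    bad = [token for token in tokens if not RS_ID.fullmatch(token)]
    if bad:
        raise ValueError(f"not rs IDs: {bad}")
    # dict.fromkeys removes duplicates while keeping first-seen order
    return list(dict.fromkeys(tokens))
```

It could then be used as `snp_list = load_snp_list('snps.txt')`, where `snps.txt` is a hypothetical file name.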
Run the function:
SNPInfo, GeneInfo = get_closest_genes(snp_list)
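For readers curious what a lookup like this might involve, the sketch below shows a two-step dbSNP query: an esearch call to resolve the rs ID to UIDs, then an esummary call to fetch the records. This is not the notebook's get_closest_genes; `fetch_snp_summaries` is a hypothetical helper, and the layout of the parsed esummary result (`DocumentSummarySet` / `DocumentSummary`) is an assumption about what Biopython returns for the snp database. The Entrez module is passed in as a parameter so the function can be tried offline with a stub.

```python
def fetch_snp_summaries(rs_id, entrez):
    """Return the dbSNP document summaries for one rs ID.

    `entrez` is expected to be Biopython's Bio.Entrez module with
    Entrez.email already set.
    """
    # Step 1: resolve the rs ID to dbSNP UIDs.
    handle = entrez.esearch(db="snp", term=rs_id)
    id_list = entrez.read(handle)["IdList"]
    handle.close()
    if not id_list:
        return []

    # Step 2: fetch the summaries for those UIDs in one request.
    handle = entrez.esummary(db="snp", id=",".join(id_list))
    result = entrez.read(handle)
    handle.close()

    # Assumed result layout for the snp database.
    return list(result["DocumentSummarySet"]["DocumentSummary"])
```

With Biopython installed, this might be called as `fetch_snp_summaries('rs11955986', Entrez)` after `from Bio import Entrez` and setting `Entrez.email`; note that doing so sends live requests to NCBI.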
Note: the file path is used for both output files, so give only the folder you want to store them in, without a file name. To change the file names from the defaults nearestGenes.csv and SNPInfo.csv, edit lines 127 and 128 of snp_to_gene.py. Then cd into the folder containing snp_to_gene.py and run:
python snp_to_gene.py --h  # for help
python snp_to_gene.py --snplist rs11955986 rs17498135 --email your_email_address --filepath any/path/you/choose
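The command-line interface above could be parsed along the following lines. This is a sketch built from the flags shown in the example command, not the script's actual code: which arguments are required, the `.` default for the folder, and the helper names `build_parser` and `output_paths` are all assumptions.

```python
import argparse
import os


def build_parser():
    """Parser accepting the three flags used in the example command."""
    parser = argparse.ArgumentParser(
        description="Look up the nearest genes for one or more SNPs."
    )
    parser.add_argument("--snplist", nargs="+", required=True,
                        help="space-separated rs IDs")
    parser.add_argument("--email", required=True,
                        help="contact email passed to NCBI")
    parser.add_argument("--filepath", default=".",
                        help="output folder, without a file name")
    return parser


def output_paths(folder):
    """Join the folder with the two default file names."""
    return (
        os.path.join(folder, "nearestGenes.csv"),
        os.path.join(folder, "SNPInfo.csv"),
    )
```

Because `nargs="+"` collects every value up to the next flag, `args.snplist` arrives as a Python list that can be handed straight to the lookup code.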