Set2Gaussian: Embedding Gene Sets as Gaussian Distributions for Large-scale Gene Set Analysis
Set2Gaussian is a network-based embedding approach which takes a set of gene sets as input and output an embedding representation for each gene set. It could embed more than 10,000 gene sets. Instead of embedding as single points, Set2Gaussian embeds each gene set as a Gaussian distribution in order to model the diverse functions within a gene set.
Set2Gaussian: Embedding Gene Sets as Gaussian Distributions for Large-scale Gene Set Analysis. Sheng Wang, Emily Flynn, Russ B. Altman.
We provide the dataset and embeddings of 13,886 gene sets from NCI, Reactome, and MSigDB figshare A sample dataset is in the data folder. network.txt is the network in the following format:
node1 node2 weight
...
node_set.txt is the node set in the following format:
set1 node1
set1 node2
...
An example is in src/Grep_run_all_methods.py. It takes data/network.txt and data/node_set.txt as input. First replace them with your network and gene sets.
cd src
python Grep_run_all_methods.py
The embeddings will be saved in output_embed
- python 2.7 (with slight modification, python 3.6 can also be used to run our tool)
- python packages (numpy 1.14+, scipy 1.1+, networkx 2.3+, tensorflow 1.14.0)
For questions about the data and code, please contact [email protected]. We will do our best to provide support and address any issues. We appreciate your feedback!