Ward clustering to identify Internal Node branch length outliers using Gene Scores (WINGS)
Authors: Melissa R. McGuirl, Samuel Pattillo Smith, Björn Sandstede, Sohini Ramachandran
This repository contains MATLAB code for WINGS, an algorithm that clusters phenotypes using gene scores as the input feature vectors with a thresholded Ward hierarchical clustering algorithm. The identified clusters are ranked by significance.
WINGS(gene_scores, outdir, scale, plotOn, saveOn, cluster_alg, cluster_dist, sz_thresh)
gene_scores is a .txt or .csv file containing a matrix of PEGASUS gene score feature vectors of size (M + 1) x (N + 1), where M is the number of genes and N is the number of phenotypes. The first column contains the gene names/labels and the first row contains the phenotype names/labels.
outdir is the directory where the user wants to save the output images and files (optional, default = './').
scale is the scale on which the scores will be clustered. Options: "raw" (non-transformed) or "log10" (-log10 transformed). NOTE: WE ASSUME THE INPUT GENE SCORE MATRIX IS NOT TRANSFORMED. (optional, default = 'log10').
plotOn is a plot indicator variable. If plotOn = 1, this function will plot and save the dendrogram and sorted branch length figure (optional, default = 1).
saveOn is a save indicator variable. If saveOn = 1, this function will save text files with (1) the estimated significant clusters with corresponding branch length threshold and (2) all clusters ranked by branch length. (optional, default = 1).
cluster_alg is the linkage method that defines how the distance between clusters is measured in the hierarchical clustering algorithm. Options include "average", "centroid", "complete", "median", "single", "ward", and "weighted". See "doc linkage" (method) for more details. (optional, default = "ward").
cluster_dist is the distance used to measure differences between phenotypes. Example options include "euclidean", "squaredeuclidean", "seuclidean", "cityblock", "cosine", "hamming", "jaccard", or any other metric that is accepted by pdist. (optional, default = "euclidean").
sz_thresh is the size threshold used to define a cluster. Clusters of size greater than sz_thresh are not recognized as "true clusters." (optional, default = ceil(N/3)).
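To make the roles of scale, cluster_dist, cluster_alg, and sz_thresh concrete, the following is a minimal sketch of the underlying MATLAB calls on a toy matrix. It is not the WINGS implementation: the toy scores are randomly generated, the variable names are hypothetical, and the branch length ranking and significance steps are not reproduced here.

```matlab
% Toy gene score matrix (hypothetical data): M = 5 genes x N = 6 phenotypes,
% with values in (0, 1) standing in for raw gene scores.
rng(0);
scores = rand(5, 6);

% scale = 'log10': apply the -log10 transform to the raw scores.
X = -log10(scores);

% Phenotypes are the objects being clustered, so each phenotype must be a row.
D = pdist(X', 'euclidean');   % cluster_dist: pairwise phenotype distances
Z = linkage(D, 'ward');       % cluster_alg: (N-1) x 3 hierarchical cluster tree

% Default size threshold as documented above.
N = size(scores, 2);
sz_thresh = ceil(N / 3);

% Third column of Z holds the height at which each pair of clusters merges.
merge_heights = Z(:, 3);
```

Note that MATLAB documents the "ward", "centroid", and "median" linkage methods as appropriate for Euclidean distances only, which is worth keeping in mind when changing cluster_dist.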
Dendrogram and branch length figures (if plotOn = 1), and text files containing (1) the estimated significant clusters with the corresponding branch length threshold and (2) all clusters ranked by branch length (if saveOn = 1).
cd src
WINGS('../examples/sim_example_small.csv', '../outputs', 'raw', 1, 1, 'ward', 'euclidean', 4)
Sample inputs are contained in the 'examples' folder, and sample outputs and figures are contained in the 'outputs' folder. 'sim_example.csv' and 'sim_example_small.csv' are toy simulations. 'UKBiobank_small_raw.txt' (raw PEGASUS scores from 26 phenotypes), 'UKBiobank_expanded_raw.txt' (raw PEGASUS scores from 87 phenotypes), and 'UKBiobank_expanded_log10.txt' (log10-transformed PEGASUS scores from 87 phenotypes; note that WINGS assumes input data is on the raw scale) contain the empirical PEGASUS scores for the UKBiobank data (as described in the paper). The sample results in 'outputs' were produced by the sample run above.
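For readers preparing their own input, here is a minimal sketch of the expected file layout (gene names in the first column, phenotype names in the first row) and one way to inspect such a file in MATLAB. The file name, gene names, phenotype names, and scores below are hypothetical, and WINGS itself may parse its input differently.

```matlab
% Write a tiny hypothetical input file: 2 genes x 3 phenotypes.
fname = fullfile(tempdir, 'toy_gene_scores.csv');
fid = fopen(fname, 'w');
fprintf(fid, 'gene,pheno1,pheno2,pheno3\n');   % first row: phenotype labels
fprintf(fid, 'geneA,0.01,0.50,0.20\n');        % first column: gene labels
fprintf(fid, 'geneB,0.30,0.04,0.90\n');
fclose(fid);

% Read it back, treating the first column as row names.
T = readtable(fname, 'ReadRowNames', true);
scores = table2array(T);                   % M x N numeric matrix
genes = T.Properties.RowNames;             % M x 1 cell array of gene labels
phenotypes = T.Properties.VariableNames;   % 1 x N cell array of phenotype labels
```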
The script used to generate simulation data is gene.score.array.simulator.py, available in the src folder.
MATLAB with the Statistics and Machine Learning Toolbox
Corresponding Author for Code: Melissa R. McGuirl, [email protected]