rcCAE is a convolutional autoencoder (CAE) based method for detecting tumor clones and copy number alterations from single-cell DNA sequencing data.
- Python 3.8+.
First, download rcCAE from GitHub and change into its directory:
git clone https://github.com/zhyu-lab/rccae
cd rccae
Create a new environment named "rccae":
conda create --name rccae python=3.8.15
Then activate it:
conda activate rccae
To compile the source code successfully, make sure g++-5 is installed on your machine. You can install it with the following commands:
sudo apt install gcc-5 g++-5
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-5 20
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-5 20
If your current compiler is not g++-5, switch the current compiler to gcc-5 and g++-5:
sudo update-alternatives --config gcc
sudo update-alternatives --config g++
Then install the required packages and compile the code:
python -m pip install -r ./cae/requirements.txt
cd prep
sudo cmake .
make
cd ..
chmod +x run_rccae.sh ./hmm/run_SCHMM.sh
If the installed PyTorch version does not match your CUDA version, check the compatibility between PyTorch and CUDA versions at https://pytorch.org/get-started/previous-versions/ and install an appropriate PyTorch version on your machine.
After the installation completes, you can reset the compiler using the same commands:
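To see which PyTorch build and CUDA version the active environment actually has before consulting that page, a short check like the following may help. It is a minimal sketch: `describe_torch` is a hypothetical helper name that is not part of rcCAE, and it relies only on `torch.__version__`, `torch.version.cuda` and `torch.cuda.is_available()`.

```python
def describe_torch():
    """Return the installed PyTorch version and CUDA details, or note that torch is missing."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in the active environment.
        return {"installed": False}
    return {
        "installed": True,
        "torch_version": torch.__version__,
        "built_for_cuda": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    for key, value in describe_torch().items():
        print(f"{key}: {value}")
```

If `built_for_cuda` differs from the CUDA version supported by your driver, or `cuda_available` is `False` on a GPU machine, reinstalling a matching PyTorch build is the likely remedy.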
sudo update-alternatives --config gcc
sudo update-alternatives --config g++
The “./prep/bin/prepInput” command is used to obtain read counts, GC-content and mappability data.
To run the command successfully, you need to obtain or create the following items:
- A merged BAM file (10X Genomics) containing sequencing data of all cells or separate BAM files of single cells
- A FASTA file defining the reference sequence
- A BIGWIG file for calculating mappability scores
- A barcode file listing the barcodes of all cells or names of all BAM files to be analyzed
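Before launching prepInput, it can save time to confirm that the four inputs listed above exist on disk. The following is a minimal sketch of such a check; `missing_inputs` is a hypothetical helper, not part of rcCAE, and it tests only for existence, not for valid file contents.

```python
from pathlib import Path


def missing_inputs(bam, ref, bigwig, barcodes):
    """Return a description of every required input that does not exist on disk."""
    required = {
        "BAM file or directory": Path(bam),
        "reference FASTA": Path(ref),
        "mappability BIGWIG": Path(bigwig),
        "barcode file": Path(barcodes),
    }
    return [f"{label}: {path}" for label, path in required.items() if not path.exists()]


if __name__ == "__main__":
    # Placeholder paths taken from the examples in this README.
    problems = missing_inputs("/path/to/sample.bam", "/path/to/hg19.fa",
                              "/path/to/hg19.bw", "/path/to/barcodes.txt")
    for problem in problems:
        print("missing", problem)
```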
A reference sequence file in .fasta format can be downloaded from the UCSC Genome Browser.
Mappability files in .bw format for human genomes are available from the UCSC Genome Browser. Users can also generate their own mappability files using gem-library and the wigToBigWig utility.
Here is an example of creating a mappability file from a reference sequence (suppose “$gemlibrary” and “$bigwig” are the paths where gem-library and wigToBigWig are installed, respectively):
chmod +x $gemlibrary/bin/gem* $bigwig/wigToBigWig
export PATH=$PATH:$gemlibrary/bin:$bigwig
gem-indexer -T 10 -c dna -i ./testData/refs/example.fa -o ./testData/refs/example_index
gem-mappability -T 10 -I ./testData/refs/example_index.gem -l 36 -o ./testData/refs/example_36
gem-2-wig -I ./testData/refs/example_index.gem -i ./testData/refs/example_36.mappability -o ./testData/refs/example_36
wigToBigWig ./testData/refs/example_36.wig ./testData/refs/example.sizes ./testData/refs/example_36.bw
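The wigToBigWig call above reads a chromosome sizes file (`./testData/refs/example.sizes`), which is a two-column, tab-separated list of sequence names and lengths. If no such file is at hand, one way to derive it from the reference FASTA is sketched below. The function names are hypothetical, the sketch assumes the sequence name is the first whitespace-delimited token of each header line, and this README does not state how the test data's sizes file was produced.

```python
def fasta_chrom_sizes(fasta_path):
    """Return (name, length) pairs for every sequence in a FASTA file, in file order."""
    sizes = []
    name, length = None, 0
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    sizes.append((name, length))
                # Assumption: the sequence name is the first token of the header.
                parts = line[1:].split()
                name, length = (parts[0] if parts else ""), 0
            elif name is not None:
                length += len(line)
    if name is not None:
        sizes.append((name, length))
    return sizes


def write_chrom_sizes(fasta_path, sizes_path):
    """Write a two-column, tab-separated sizes file (name, length)."""
    with open(sizes_path, "w") as out:
        for name, length in fasta_chrom_sizes(fasta_path):
            out.write(f"{name}\t{length}\n")
```

For the example paths used above, this would be called as `write_chrom_sizes("./testData/refs/example.fa", "./testData/refs/example.sizes")`.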
All arguments of the “./prep/bin/prepInput” command are as follows:
Parameter | Description | Possible values |
---|---|---|
-b, --bam | a merged BAM file (10X Genomics) or a directory containing BAM files of the cells to be analyzed | Ex: /path/to/sample.bam |
-r, --ref | genome reference file (.fasta) | Ex: /path/to/hg19.fa |
-m, --map | mappability file (.bw) | Ex: /path/to/hg19.bw |
-B, --barcode | a file listing the barcodes of all cells or names of all BAM files to be analyzed | Ex: /path/to/barcodes.txt |
-c, --chrlist | the chromosomes to be analyzed (separated by commas) | Ex: chrlist=1,2,3,4,5 default:1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22 |
-s, --binsize | set the size of bin to count reads | Ex: binsize=200000 default:500000 |
-q, --mapQ | threshold value for mapping quality | Ex: mapQ=10 default:20 |
-o, --output | output file to save results | Ex: /path/to/example.txt |
Example:
./prep/bin/prepInput -b /path/to/sample.bam -r /path/to/hg19.fasta -m /path/to/hg19.bw -B /path/to/barcodes.txt -o example.txt
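When prepInput is driven from a Python workflow, the command line can be assembled from the options in the table above. The sketch below is illustrative only: `build_prep_command` and `run_prep` are hypothetical helpers, and it assumes the optional short flags (`-c`, `-s`, `-q`) take their value as the next argument, in the same way `-b` and `-r` do in the example.

```python
import subprocess


def build_prep_command(bam, ref, bigwig, barcodes, output,
                       chrlist=None, binsize=None, mapq=None,
                       executable="./prep/bin/prepInput"):
    """Assemble the prepInput argument list; optional flags are added only when set."""
    cmd = [executable, "-b", bam, "-r", ref, "-m", bigwig, "-B", barcodes, "-o", output]
    if chrlist is not None:
        # Chromosomes are passed as a single comma-separated value.
        cmd += ["-c", ",".join(str(c) for c in chrlist)]
    if binsize is not None:
        cmd += ["-s", str(binsize)]
    if mapq is not None:
        cmd += ["-q", str(mapq)]
    return cmd


def run_prep(**kwargs):
    """Run prepInput and raise CalledProcessError if it exits with a non-zero status."""
    subprocess.run(build_prep_command(**kwargs), check=True)
```

Omitting `chrlist`, `binsize` and `mapq` leaves the defaults listed in the table in effect.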
The “./cae/train.py” Python script is used to learn latent representations of cells and cluster cells into distinct subpopulations.
The arguments to run “./cae/train.py” are as follows:
Parameter | Description | Possible values |
---|---|---|
--input | input file generated by “./prep/bin/prepInput” command | Ex: /path/to/example.txt |
--output | a directory to save results | Ex: /path/to/results |
--epochs | number of epochs to train the CAE | Ex: epochs=200 default:100 |
--batch_size | batch size | Ex: batch_size=32 default:64 |
--lr | learning rate | Ex: lr=0.0005 default:0.0001 |
--max_k | maximum number of clusters to consider | Ex: max_k=30 default:N/5 (N is the number of cells) |
--latent_dim | latent dimension | Ex: latent_dim=4 default:3 |
--kernel_size | convolutional kernel size | Ex: kernel_size=5 default:7 |
--seed | random seed | Ex: seed=1 default:0 |
Example:
python ./cae/train.py --input ./data/example.txt --epochs 200 --batch_size 64 --lr 0.0001 --latent_dim 3 --seed 0 --output data
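Because neural network training depends on the random seed, it may be worth repeating the run with several `--seed` values and comparing the resulting clusters. The sketch below builds one `train.py` command per seed using only flags from the table above. `train_commands` and the `seed_<n>` output layout are hypothetical choices made for this illustration, not rcCAE conventions.

```python
import sys
from pathlib import Path


def train_commands(input_file, output_root, seeds, epochs=200, latent_dim=3):
    """Build one train.py command per seed, each writing to its own subdirectory."""
    commands = []
    for seed in seeds:
        out_dir = Path(output_root) / f"seed_{seed}"  # hypothetical layout
        commands.append([
            sys.executable, "./cae/train.py",
            "--input", str(input_file),
            "--output", str(out_dir),
            "--epochs", str(epochs),
            "--latent_dim", str(latent_dim),
            "--seed", str(seed),
        ])
    return commands
```

Each command can then be passed to `subprocess.run(cmd, check=True)`. Whether `train.py` creates a missing output directory is not stated here, so creating each `seed_<n>` directory beforehand is the safer choice.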
The “./hmm/SCHMM.m” MATLAB script is used to call single-cell CNAs.
The arguments to run “./hmm/SCHMM.m” are as follows:
Parameter | Description | Possible values |
---|---|---|
inputFile | “lrc.txt” file generated by “./cae/train.py” script | Ex: /path/to/lrc.txt |
outputDir | a directory to save results | Ex: /path/to/results |
maxCN | maximum copy number to consider | Ex: 6 default:10 |
Example:
SCHMM('../data/lrc_example.txt','../data',10)
We also provide a script, “run_rccae.sh”, that integrates all three steps of rcCAE. This script requires MATLAB Compiler Runtime (MCR) v91 (R2016b) to be installed on the user's machine. The MCR can be downloaded from the MathWorks website.
Example:
./run_rccae.sh /path/to/bam /path/to/ref.fa /path/to/ref.bw /path/to/barcodes.txt /path/to/results
Type ./run_rccae.sh to see details on how to use this script.
If you have any questions, please contact [email protected].