rcCAE

rcCAE is a convolutional autoencoder (CAE) based method for detecting tumor clones and copy number alterations from single-cell DNA sequencing data.

Requirements

  • Python 3.8+.

Installation

Clone repository

First, clone rcCAE from GitHub and change into the repository directory:

git clone https://github.com/zhyu-lab/rccae
cd rccae

Create conda environment (optional)

Create a new environment named "rccae":

conda create --name rccae python=3.8.15

Then activate it:

conda activate rccae

Install requirements

To compile the source code, make sure g++-5 is installed on your machine. You can install it with the following commands:

sudo apt install gcc-5 g++-5
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-5 20  
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-5 20

If g++-5 is not your current compiler, switch the active compiler to gcc-5 and g++-5:

sudo update-alternatives --config gcc
sudo update-alternatives --config g++

Then install related packages and compile the code:

python -m pip install -r ./cae/requirements.txt
cd prep
sudo cmake .
make
cd ..
chmod +x run_rccae.sh ./hmm/run_SCHMM.sh

If the installed PyTorch version does not match your CUDA version, consult the version compatibility list at https://pytorch.org/get-started/previous-versions/ and install an appropriate PyTorch version on your machine.
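To check which CUDA version the installed PyTorch build targets, a short script along the following lines may help. It is a minimal sketch: the function name torch_cuda_report is made up for this example, and it only reports what PyTorch itself exposes (torch.__version__, torch.version.cuda and torch.cuda.is_available()).

```python
def torch_cuda_report():
    """Return a small dict describing the installed PyTorch build, if any."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in the active environment.
        return {"torch_installed": False}

    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version PyTorch was built against; None for CPU-only builds.
        "built_for_cuda": torch.version.cuda,
        # True only if PyTorch can actually use a GPU on this machine.
        "cuda_available": torch.cuda.is_available(),
    }
```

Calling print(torch_cuda_report()) inside the "rccae" environment shows whether the build was compiled for CUDA and whether a GPU is usable.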

After the installation completes, you can restore your original compiler using the same commands:

sudo update-alternatives --config gcc
sudo update-alternatives --config g++

Usage

Step 1: prepare input data

The “./prep/bin/prepInput” command is used to obtain read counts, GC-content and mappability data.
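To illustrate what a per-bin GC-content value is, the sketch below computes the GC fraction of consecutive fixed-size bins of a sequence. It is not the prepInput implementation: the function name gc_content_per_bin and the handling of ambiguous bases (ignored here) are assumptions made for this example only.

```python
def gc_content_per_bin(sequence, bin_size):
    """Return the GC fraction of each consecutive bin of `sequence`.

    Bases other than A, C, G, T (for example N) are ignored. A bin that
    contains none of the four bases gets a GC fraction of None.
    """
    fractions = []
    for start in range(0, len(sequence), bin_size):
        window = sequence[start:start + bin_size].upper()
        gc = window.count("G") + window.count("C")
        at = window.count("A") + window.count("T")
        fractions.append(gc / (gc + at) if gc + at else None)
    return fractions
```

For example, gc_content_per_bin("GGCCAATTNNNN", 4) yields one value per 4-base bin; with the default bin size of 500000, each value would summarize half a megabase of the reference.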

To successfully run the command you will need to obtain/create these items:

  • A merged BAM file (10X Genomics) containing sequencing data of all cells, or separate BAM files of single cells
  • A FASTA file defining reference sequence
  • A BIGWIG file for calculating mappability scores
  • A barcode file listing the barcodes of all cells or names of all BAM files to be analyzed
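When the cells are stored as separate BAM files, the list of file names can be generated rather than typed by hand. The sketch below assumes one name per line; whether prepInput expects the names with or without the ".bam" extension is not stated here, so the keep_extension switch and the function name write_bam_name_list are assumptions to adapt as needed.

```python
from pathlib import Path


def write_bam_name_list(bam_dir, out_file, keep_extension=True):
    """Write the sorted names of all *.bam files in `bam_dir`, one per line.

    Assumption: the list file holds one entry per line. Set
    keep_extension=False to write names without the ".bam" suffix.
    """
    names = sorted(
        p.name if keep_extension else p.stem
        for p in Path(bam_dir).glob("*.bam")
    )
    Path(out_file).write_text("".join(name + "\n" for name in names))
    return names
```

The returned list makes it easy to confirm that the expected number of cells was found before running the remaining steps.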

A reference sequence file in .fasta format can be downloaded from the UCSC genome browser.

Mappability files in .bw format for human genomes are available from the UCSC genome browser. Users can also generate their own mappability files using gem-library and the wigToBigWig utility.

Here is an example of creating a mappability file from a reference sequence (suppose “$gemlibrary” and “$bigwig” are the paths where gem-library and wigToBigWig are installed, respectively).

chmod +x $gemlibrary/bin/gem* $bigwig/wigToBigWig
export PATH=$PATH:$gemlibrary/bin:$bigwig
gem-indexer -T 10 -c dna -i ./testData/refs/example.fa -o ./testData/refs/example_index
gem-mappability -T 10 -I ./testData/refs/example_index.gem -l 36 -o ./testData/refs/example_36
gem-2-wig -I ./testData/refs/example_index.gem -i ./testData/refs/example_36.mappability -o ./testData/refs/example_36
wigToBigWig ./testData/refs/example_36.wig ./testData/refs/example.sizes ./testData/refs/example_36.bw
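The wigToBigWig call reads a chromosome sizes file (example.sizes above), a two-column text file giving each sequence name and its length in bases. If such a file is not at hand, it can be derived from the reference FASTA. The following is a minimal sketch; the function names are made up for this example, and it assumes the sequence name is the first word of each header line.

```python
def fasta_chrom_sizes(fasta_path):
    """Return [(sequence_name, length), ...] in file order."""
    sizes = []
    name, length = None, 0
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    sizes.append((name, length))
                # Assumption: the name is the first word after '>'.
                fields = line[1:].split()
                name, length = (fields[0] if fields else ""), 0
            elif name is not None:
                length += len(line)
    if name is not None:
        sizes.append((name, length))
    return sizes


def write_chrom_sizes(fasta_path, sizes_path):
    """Write one '<name><TAB><length>' line per sequence."""
    with open(sizes_path, "w") as out:
        for name, length in fasta_chrom_sizes(fasta_path):
            out.write(f"{name}\t{length}\n")
```

For the example above, write_chrom_sizes("./testData/refs/example.fa", "./testData/refs/example.sizes") would produce the sizes file passed to wigToBigWig.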

All arguments of the “./prep/bin/prepInput” command are as follows:

| Parameter | Description | Possible values |
| --- | --- | --- |
| -b, --bam | a merged BAM file (10X Genomics) or a directory containing BAM files of the cells to be analyzed | Ex: /path/to/sample.bam |
| -r, --ref | genome reference file (.fasta) | Ex: /path/to/hg19.fa |
| -m, --map | mappability file (.bw) | Ex: /path/to/hg19.bw |
| -B, --barcode | a file listing the barcodes of all cells or names of all BAM files to be analyzed | Ex: /path/to/barcodes.txt |
| -c, --chrlist | the chromosomes to be analyzed (separated by commas) | Ex: chrlist=1,2,3,4,5 default: 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22 |
| -s, --binsize | size of the bins used to count reads | Ex: binsize=200000 default: 500000 |
| -q, --mapQ | threshold value for mapping quality | Ex: mapQ=10 default: 20 |
| -o, --output | output file to save results | Ex: /path/to/example.txt |

Example:

./prep/bin/prepInput -b /path/to/sample.bam -r /path/to/hg19.fasta -m /path/to/hg19.bw -B /path/to/barcodes.txt -o example.txt
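When prepInput is driven from a Python workflow, the argument list can be assembled programmatically. The sketch below uses only the short options listed above; the function name prep_input_command is hypothetical, and passing each optional value as a separate argument after its flag (as the example does for -b, -r, -m, -B and -o) is an assumption.

```python
def prep_input_command(bam, ref, bigwig, barcodes, output,
                       chrlist=None, binsize=None, mapq=None,
                       exe="./prep/bin/prepInput"):
    """Build the prepInput argument list.

    Optional settings left as None are omitted so that the tool's own
    defaults apply.
    """
    cmd = [exe, "-b", bam, "-r", ref, "-m", bigwig,
           "-B", barcodes, "-o", output]
    if chrlist is not None:
        # Chromosomes are passed as one comma-separated value.
        cmd += ["-c", ",".join(str(c) for c in chrlist)]
    if binsize is not None:
        cmd += ["-s", str(binsize)]
    if mapq is not None:
        cmd += ["-q", str(mapq)]
    return cmd
```

The resulting list can be handed to subprocess.run(cmd, check=True), which raises an exception if prepInput exits with a non-zero status.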

Step 2: train the CAE model

The “./cae/train.py” Python script is used to learn latent representations of cells and cluster cells into distinct subpopulations.

The arguments to run “./cae/train.py” are as follows:

| Parameter | Description | Possible values |
| --- | --- | --- |
| --input | input file generated by the “./prep/bin/prepInput” command | Ex: /path/to/example.txt |
| --output | a directory to save results | Ex: /path/to/results |
| --epochs | number of epochs to train the CAE | Ex: epochs=200 default: 100 |
| --batch_size | batch size | Ex: batch_size=32 default: 64 |
| --lr | learning rate | Ex: lr=0.0005 default: 0.0001 |
| --max_k | maximum number of clusters to consider | Ex: max_k=30 default: N/5 (N is the number of cells) |
| --latent_dim | latent dimension | Ex: latent_dim=4 default: 3 |
| --kernel_size | convolutional kernel size | Ex: kernel_size=5 default: 7 |
| --seed | random seed | Ex: seed=1 default: 0 |

Example:

python ./cae/train.py --input ./data/example.txt --epochs 200 --batch_size 64 --lr 0.0001 --latent_dim 3 --seed 0 --output data
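To launch training from another Python script, for instance to repeat a run with several seeds, the command line can be built from keyword arguments. This is a sketch only: train_command is a hypothetical helper, and it accepts just the parameter names documented above.

```python
import sys


def train_command(input_file, output_dir, script="./cae/train.py", **options):
    """Build the argument list for train.py.

    `options` maps documented parameter names to values; any other name
    is rejected so that typos do not reach the script.
    """
    allowed = {"epochs", "batch_size", "lr", "max_k",
               "latent_dim", "kernel_size", "seed"}
    unknown = set(options) - allowed
    if unknown:
        raise ValueError(f"unknown train.py options: {sorted(unknown)}")
    # sys.executable keeps the run inside the active Python environment.
    cmd = [sys.executable, script,
           "--input", input_file, "--output", output_dir]
    for name, value in options.items():
        cmd += [f"--{name}", str(value)]
    return cmd
```

For example, train_command("./data/example.txt", "data", epochs=200, seed=0) reproduces part of the command above and can be executed with subprocess.run(cmd, check=True).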

Step 3: detect single-cell CNAs

The “./hmm/SCHMM.m” MATLAB script is used to call single-cell CNAs.

The arguments to run “./hmm/SCHMM.m” are as follows:

| Parameter | Description | Possible values |
| --- | --- | --- |
| inputFile | “lrc.txt” file generated by the “./cae/train.py” script | Ex: /path/to/lrc.txt |
| outputDir | a directory to save results | Ex: /path/to/results |
| maxCN | maximum copy number to consider | Ex: 6 default: 10 |

Example:

SCHMM('../data/lrc_example.txt','../data',10)
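If a full MATLAB installation is available, the call can also be issued non-interactively, for example from a Python driver script. The sketch below only formats the MATLAB statement and wraps it for "matlab -r"; the helper names are hypothetical, and it assumes MATLAB is started from the ./hmm directory so that SCHMM.m is found.

```python
def matlab_quote(text):
    """Quote `text` as a MATLAB character vector (single quotes doubled)."""
    return "'" + text.replace("'", "''") + "'"


def schmm_statement(input_file, output_dir, max_cn=10):
    """Return the MATLAB statement that calls SCHMM with these arguments."""
    return (f"SCHMM({matlab_quote(input_file)},"
            f"{matlab_quote(output_dir)},{int(max_cn)})")


def matlab_command(statement, matlab="matlab"):
    """Wrap a statement for non-interactive execution via `matlab -r`."""
    # "exit" closes MATLAB once the statement has finished.
    return [matlab, "-nodisplay", "-nosplash", "-r", f"{statement}; exit"]
```

The list returned by matlab_command(schmm_statement("../data/lrc_example.txt", "../data", 10)) can be passed to subprocess.run with cwd set to the hmm directory. Note that if the statement raises an error, MATLAB may stay open at its prompt instead of exiting.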

We also provide a script, “run_rccae.sh”, that integrates all three steps to run rcCAE. This script requires MATLAB Compiler Runtime (MCR) v91 (R2016b) to be installed on the user's machine. The MCR can be downloaded from the MathWorks website.

Example:

./run_rccae.sh /path/to/bam /path/to/ref.fa /path/to/ref.bw /path/to/barcodes.txt /path/to/results

Type ./run_rccae.sh to see details on how to use this script.

Contact

If you have any questions, please contact [email protected].
