scAGDE is a Python implementation of a novel single-cell chromatin accessibility model-based deep graph representation learning method that simultaneously learns feature representations and clusters cells by explicitly modeling how single-cell ATAC-seq data are generated.
Single-cell ATAC-seq technology has significantly advanced our understanding of cellular heterogeneity by enabling the exploration of epigenetic landscapes and regulatory elements at the single-cell level. A major challenge in analyzing high-throughput single-cell ATAC-seq data is its inherently low copy number, leading to data sparsity and high dimensionality, significantly limiting the elucidation and characterization of gene regulatory elements. To address these limitations, we developed scAGDE, a novel single-cell chromatin accessibility model-based deep graph representation learning method that simultaneously learns feature representation and clustering through explicit modeling of single-cell ATAC-seq data generation. scAGDE first leverages a chromatin accessibility-based autoencoder, which is designed to identify key patterns in single-cell ATAC-seq data, eliminate less relevant peaks, and construct a cell graph to elucidate the topological connections among individual cells. After that, scAGDE integrates a Graph Convolutional Network (GCN) as an encoder to extract essential structural information from both the ATAC-seq count matrix and the cell graph, coupled with a Bernoulli-based decoder to characterize the global probabilistic structure of the data. Additionally, the graph embedding process independently generates soft labels that guide self-supervised deep clustering, which is characterized by its iterative refinement of results.
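The graph-embedding stage described above can be illustrated with a small NumPy sketch. It is not scAGDE's implementation: the k-nearest-neighbour graph construction, the symmetric adjacency normalisation, the layer sizes, the random weights and the linear decoder are all assumptions chosen to keep the example short. It only shows how a cell graph and a binary cell-by-peak matrix can pass through a two-layer GCN encoder and be scored with a Bernoulli likelihood.

```python
import numpy as np

def knn_adjacency(z, k):
    """Symmetric k-nearest-neighbour adjacency (no self loops) from embeddings z."""
    d = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)              # a cell is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]        # indices of the k closest cells
    a = np.zeros_like(d)
    a[np.repeat(np.arange(len(z)), k), nn.ravel()] = 1.0
    return np.maximum(a, a.T)                # make the graph undirected

def normalize_adjacency(a):
    """D^{-1/2} (A + I) D^{-1/2}, the normalisation commonly used with GCNs."""
    a_hat = a + np.eye(a.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(a_norm, h, w, activate=True):
    """One graph convolution: aggregate neighbours, then apply a linear map."""
    out = a_norm @ h @ w
    return np.maximum(out, 0.0) if activate else out

def bernoulli_nll(x, logits):
    """Mean negative log-likelihood of binary x under Bernoulli(sigmoid(logits))."""
    # Numerically stable form of -[x log p + (1 - x) log(1 - p)].
    return np.mean(np.maximum(logits, 0) - logits * x
                   + np.log1p(np.exp(-np.abs(logits))))

rng = np.random.default_rng(0)
x = (rng.random((6, 20)) < 0.2).astype(float)   # toy binary cell-by-peak matrix
z0 = rng.normal(size=(6, 4))                    # stand-in for the autoencoder's latent space
a_norm = normalize_adjacency(knn_adjacency(z0, k=2))

# Hypothetical, randomly initialised weights; a real model would learn them.
w1 = rng.normal(scale=0.1, size=(20, 8))
w2 = rng.normal(scale=0.1, size=(8, 4))
w_dec = rng.normal(scale=0.1, size=(4, 20))

z = gcn_layer(a_norm, gcn_layer(a_norm, x, w1), w2, activate=False)
loss = bernoulli_nll(x, z @ w_dec)              # reconstruction term to minimise
print(z.shape, float(loss))
```

In a trained model the weights would be optimised so that `loss` decreases, and the sigmoid of the decoder output would be read as the estimated probability that each peak is accessible in each cell.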
Overview of the scAGDE framework. (a) Graphical summary of the scAGDE workflow. scAGDE first feeds the binary cell-by-peak matrix into a chromatin accessibility-based autoencoder and then performs graph embedding learning. (b) The chromatin accessibility-based autoencoder maps the data into a latent space, where each cell is connected to its nearest cells as neighbours to construct a cell graph. The variation of the encoder’s weights is translated into importance scores of peaks for the peak selection procedure. (c) The cell graph and the filtered data are processed together by a two-layer GCN encoder (i) and mapped into the latent space (ii). On the one hand, the latent embedding serves as input to dual decoders (iii): a graph decoder module that reconstructs the graph from the embedding, and a Bernoulli-based decoder module that estimates the probability of each peak being accessible, which serves as an estimate of the true chromatin landscape in each cell. On the other hand, dual clustering optimizations are introduced (iv), in which a cluster layer, initialized with K-means results on the embedding, infers soft clustering labels. The target distribution and one-hot pseudo labels are then calculated in sequence and used for the label prediction loss and the distribution alignment loss. (d) scAGDE facilitates critical downstream applications: clustering, visualization, imputation, enrichment analysis and discovery of regulators.
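The dual clustering step in panel (c, iv) can be sketched as follows. The caption does not give the formulas, so this example assumes the widely used deep-embedded-clustering form: a Student's t kernel for the soft labels and a squared, frequency-normalised target distribution. Pairing the pseudo labels with a cross-entropy term and the target distribution with a KL term is one plausible reading of the caption, not a confirmed detail. The cluster centres are hand-picked here to avoid extra dependencies, whereas scAGDE initialises them from K-means.

```python
import numpy as np

def soft_assign(z, centers, alpha=1.0):
    """Soft labels from a Student's t kernel (assumed form); rows sum to 1."""
    d2 = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Sharpen q: square it, divide by soft cluster size, renormalise per cell."""
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) summed over cells, used here as the distribution alignment term."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def cross_entropy(q, labels, eps=1e-12):
    """Mean cross-entropy of soft labels against hard pseudo labels."""
    return float(-np.mean(np.log(q[np.arange(len(labels)), labels] + eps)))

rng = np.random.default_rng(0)
# Toy latent embedding: two well-separated groups of five cells each.
z = np.vstack([rng.normal(0.0, 0.3, size=(5, 2)),
               rng.normal(3.0, 0.3, size=(5, 2))])
centers = np.array([[0.0, 0.0], [3.0, 3.0]])    # hypothetical stand-in for K-means centres

q = soft_assign(z, centers)          # soft clustering labels
p = target_distribution(q)           # sharpened target distribution
pseudo = p.argmax(axis=1)            # hard pseudo labels (index of the one-hot entry)

align_loss = kl_divergence(p, q)     # distribution alignment term
label_loss = cross_entropy(q, pseudo)  # label prediction term
print(pseudo, align_loss, label_loss)
```

During training, both terms would be recomputed as the embedding and cluster centres change, which is what makes the refinement iterative.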
The scAGDE package requires only a standard computer with enough RAM to support the in-memory operations.
This package is supported on Linux. It has been tested on the following systems:
- Linux: Ubuntu 18.04
scAGDE mainly depends on the Python scientific stack:
- numpy
- scipy
- torch
- scikit-learn
- pandas
- scanpy
- anndata
- rpy2
For the specific settings, please see requirements.text.
scAGDE also requires R and the mclust package to be installed in your environment.
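Before running scAGDE, it can be useful to confirm that R and mclust are actually reachable. The helper below is a hypothetical convenience script, not part of scAGDE. It assumes the `Rscript` executable is on `PATH` and uses only the Python standard library.

```python
import shutil
import subprocess

def r_and_mclust_available(rscript="Rscript"):
    """Return True only if Rscript is found and R can load the mclust namespace."""
    exe = shutil.which(rscript)
    if exe is None:
        return False
    # requireNamespace() returns FALSE rather than raising when a package is missing,
    # so the R process exits with status 0 on success and 1 otherwise.
    expr = 'quit(status = if (requireNamespace("mclust", quietly = TRUE)) 0 else 1)'
    try:
        result = subprocess.run([exe, "-e", expr], capture_output=True, timeout=60)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0

if __name__ == "__main__":
    print("R with mclust available:", r_and_mclust_available())
```

If the check prints `False`, install R and mclust first, for example with the conda commands listed in the installation steps.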
The following commands create a conda environment in which scAGDE can run:
conda create -n scagde python=3.9.13 -y
conda activate scagde
pip install torch==2.0.1
pip install numpy==1.26.4
pip install rpy2==3.5.16
pip install scanpy==1.9.3
pip install matplotlib==3.5.0
pip install leidenalg==0.10.2
conda install r-base==4.3.1 -y
conda install r-mclust -y
pip install scAGDE
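After installation, a short check can confirm that the pinned versions above are the ones actually installed. This is an illustrative sketch using only the standard library; the `PINNED` table simply repeats the versions from the pip commands, and `check_pins` is a hypothetical helper rather than anything shipped with scAGDE.

```python
from importlib import metadata

# Versions copied from the pip commands above.
PINNED = {
    "torch": "2.0.1",
    "numpy": "1.26.4",
    "rpy2": "3.5.16",
    "scanpy": "1.9.3",
    "matplotlib": "3.5.0",
    "leidenalg": "0.10.2",
}

def check_pins(pins):
    """Map each distribution name to (expected, installed or None, matches)."""
    report = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Ignore local build suffixes such as "+cu117" when comparing.
        matches = installed is not None and installed.split("+")[0] == expected
        report[name] = (expected, installed, matches)
    return report

if __name__ == "__main__":
    for name, (expected, installed, ok) in check_pins(PINNED).items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name:12s} expected {expected:8s} installed {installed or 'missing':12s} {status}")
```

Run it inside the activated `scagde` environment; any line marked `MISMATCH` points to a package worth reinstalling at the pinned version.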
Detailed usage guidelines are provided in the folder tutorials. Specifically, Tutorial 1 shows how to run scAGDE end to end, and Tutorial 2 shows how to run scAGDE step by step, with detailed instructions for each step. Tutorials 3 and 4 provide R scripts for completing the experimental analysis of imputation or peak selection preferences. Tutorial 5 illustrates how scAGDE uses batch training on large-scale data to speed up training.
You can also visit the online documentation at https://scagde-tutorial.readthedocs.io/en/latest/index.html for instructions.
All the simulated and real datasets used in our study, including the human brain dataset, can be downloaded here.
This project is covered under the MIT License.