This project is my effort to learn how to implement a Bayesian probabilistic latent variable modelling and inference framework in PyTorch. I tried to implement the Contrastive Poisson Latent Variable Model (CPLVM) as described in the paper *Contrastive latent variable modeling with application to case-control sequencing experiments* by Jones et al. The authors' code for model inference is written in TensorFlow.
In this paper, the authors propose a novel probabilistic model for case-control discrete data. A direct application is when we have single-cell RNA-seq data from two samples, case and control, such that each data matrix is of size cells × genes.

- Observed data:
  - $X \in \mathbb{N}^{n_x \times m}$, where $X_{ij}$ is the count of gene $j$ in cell $i$.
  - $Y \in \mathbb{N}^{n_y \times m}$, where $Y_{ij}$ is the count of gene $j$ in cell $i$.
  - $n_x, n_y$ are the numbers of cells in the case and control samples, respectively.
  - $m$ is the number of genes.
  - $X$ represents data from the case sample and $Y$ represents data from the control sample.
- Generative model:

$$X_{i_1, j} \sim \text{Poisson}\left(\sum_{l=1}^{L} z_{i_1, l}\, S_{l, j} + \sum_{d=1}^{D} t_{i_1, d}\, W_{d, j}\right)$$

and

$$Y_{i_2, j} \sim \text{Poisson}\left(\sum_{l=1}^{L} z_{i_2, l}\, S_{l, j}\right)$$

Note:
  - $i_1$: index of cells (rows) in the case sample $X$.
  - $i_2$: index of cells (rows) in the control sample $Y$.
  - $\mathbf{z}_{i_1, l} \sim \text{Gamma}(\gamma_1, \beta_1)$: shared latent variable (one entry per $i_1, l$) for the case sample.
  - $\mathbf{z}_{i_2, l} \sim \text{Gamma}(\gamma_2, \beta_2)$: shared latent variable (one entry per $i_2, l$) for the control sample.
  - $\mathbf{t}_{i_1, d} \sim \text{Gamma}(\gamma_3, \beta_3)$: foreground-specific latent variable (one entry per $i_1, d$) for the case sample $X$.
  - $\mathbf{S} \in \mathbb{R}^{L \times m}$: loading matrix (weights) shared across the background ($Y$, $z_{i_2}$) and foreground ($X$, $z_{i_1}$) samples.
  - $\mathbf{W} \in \mathbb{R}^{D \times m}$: loading matrix (weights) specific to the foreground sample $X$.

The model learns the values of $\mathbf{S}$, $\mathbf{W}$, $z_{i_1}$, $z_{i_2}$, and $t_{i_1}$ that best fit the data $X$ and $Y$ under the prior distributions above. Inference is based on variational inference, with the objective of maximizing the ELBO, using log-normal variational distributions for the latent variables. Please refer to the paper for details about the model, since there is no way to explain it briefly in this README.
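To make the generative process concrete, here is a minimal forward-sampling sketch in PyTorch using `torch.distributions`. The dimensions and Gamma hyperparameters below are illustrative assumptions for this sketch, not values from the paper or from `cplvm.py`:

```python
import torch
from torch.distributions import Gamma, Poisson

torch.manual_seed(0)

# Illustrative dimensions (assumed for this sketch)
n_x, n_y, m = 100, 120, 30  # cells in case/control samples, number of genes
L, D = 1, 2                 # shared and foreground-specific latent dimensions

# Latent variables with Gamma priors (all shape/rate hyperparameters set to 1 here)
z_x = Gamma(1.0, 1.0).sample((n_x, L))  # shared latents for case cells
z_y = Gamma(1.0, 1.0).sample((n_y, L))  # shared latents for control cells
t_x = Gamma(1.0, 1.0).sample((n_x, D))  # foreground-specific latents for case cells

# Nonnegative loading matrices, also drawn from Gamma priors here
S = Gamma(1.0, 1.0).sample((L, m))      # shared loadings
W = Gamma(1.0, 1.0).sample((D, m))      # foreground-specific loadings

# Poisson observations: the case sample uses both S and W, the control only S
X = Poisson(z_x @ S + t_x @ W).sample()  # case counts, shape (n_x, m)
Y = Poisson(z_y @ S).sample()            # control counts, shape (n_y, m)
```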
I copied the example data provided in the original CPLVM repository; it is stored in the folder `data`. To run CPLVM on this particular data:
```python
# Now define the model
import torch
import pandas as pd

import cplvm

# Read in the toy data (rows are cells, columns are genes)
Y = pd.read_csv("./data/toy_background.csv", header=None).values
X = pd.read_csv("./data/toy_foreground.csv", header=None).values

# Now apply the model to the sample data
p = X.shape[1]  # number of genes (m in the notation above)
Kf = 2          # number of foreground-specific latent variables
Ks = 1          # number of shared latent variables

model = cplvm.CPLVM(p, Ks, Kf)

# Fit with variational inference: 10000 iterations at learning rate 0.01
model._fit_VI(torch.tensor(X), torch.tensor(Y), 10000, 0.01)
```
The resulting inferred latent factor weights are plotted below:
You can see the whole example code in `cplvm_learning.ipynb`.
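For intuition about what the fitting step does: as described above, the model is fit by maximizing the ELBO with log-normal variational distributions. The snippet below is a simplified single-sample Monte Carlo sketch of that idea for a background-only model (shared factors only); it is not the actual code in `cplvm.py`, and all names and hyperparameters are illustrative:

```python
import torch
from torch.distributions import Gamma, LogNormal, Poisson

torch.manual_seed(0)

# Fake count data for this sketch: n cells, m genes, L shared factors
n, m, L = 50, 30, 2
Y = torch.poisson(torch.rand(n, m) * 5.0)

# Variational parameters of the log-normal distributions q(z) and q(S)
mu_z = torch.zeros(n, L, requires_grad=True)
log_sigma_z = torch.zeros(n, L, requires_grad=True)
mu_S = torch.zeros(L, m, requires_grad=True)
log_sigma_S = torch.zeros(L, m, requires_grad=True)

opt = torch.optim.Adam([mu_z, log_sigma_z, mu_S, log_sigma_S], lr=0.01)

for step in range(1000):
    opt.zero_grad()
    q_z = LogNormal(mu_z, log_sigma_z.exp())
    q_S = LogNormal(mu_S, log_sigma_S.exp())
    z = q_z.rsample()  # reparameterized samples keep gradients flowing
    S = q_S.rsample()
    rate = z @ S       # Poisson rate for each (cell, gene) pair
    # Single-sample Monte Carlo estimate of the ELBO:
    # E_q[log p(Y, z, S) - log q(z, S)]
    log_lik = Poisson(rate).log_prob(Y).sum()
    log_prior = Gamma(1.0, 1.0).log_prob(z).sum() + Gamma(1.0, 1.0).log_prob(S).sum()
    log_q = q_z.log_prob(z).sum() + q_S.log_prob(S).sum()
    elbo = log_lik + log_prior - log_q
    (-elbo).backward()  # maximize the ELBO by minimizing its negative
    opt.step()
    if step % 200 == 0:
        print(f"step {step}: ELBO = {elbo.item():.1f}")
```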
As of right now, the model is specified in a single file, `cplvm.py`, so you can simply copy the file to your project directory and import it. You will need `torch` to run the model, and `numpy` and `pandas` to read in the data as in `cplvm_learning.ipynb`. I will try to make it a package in the future if there is demand. I have also implemented only the model with a Poisson generative distribution for discrete data (such as counts from single-cell RNA-seq data); the paper presents more background on previously developed contrastive probabilistic methods, such as contrastive PCA and its probabilistic version. The original repository also contains TensorFlow implementations of those models. If you would like them implemented in PyTorch, please reach out to me at [email protected]. I have not implemented them so far, since I stopped working on this project once I had learned the fundamental ideas of the paper.