This repository shows my pytorch
implementation of fixed-breaks piecewise linear regression. I was doing research in estimating RNA elongation rates, and one essential step in the algorithm the estimation task is to precisely regress the read coverage for a gene. I looked around for different implementations of piecewise linear regression, but I could not find any that specifically fits my use case (where the need for precise regression is very crucial, and fast estimation is also important because each data point is a position on the gene, there are many data points). So, I tried to implement it in pytorch
, which is the fastest and neatest way that I know for this problem. I share this implementation here mostly for educational purposes for my own learning. If you ever get interested in this implementation and want me to modify (or me to help you modify the code), please reach out.
As mentioned above, the use case is to regress the read coverage given genomic positions. Details:
- Input data: genomic positions (x) and read coverage (y)
- Expected output: a list of breakpoints (px, py) of the piece-wise regression line. The data is such that we would most likely expect a decreasing regression line. We also want the most precise regression line (low-noise data).
Example data is provided in data/coverage_df.csv.gz
, and visualized here:
Each color represents the read coverage at different time points during the experiment. The data is such that we would most likely expect a decreasing regression line.
The cost function is
There are two options to approach this regression problem:
1, Use a fixed number of breakpoints (M), fix regression/fixedX_pwlr.py
. Here is how to use it:
fxlr_model = fxpwlr.PiecewiseLinearRegression(x_min=x_min, x_max=x_max, gap = regress_bin) # initiate the model with fixed px
# regress_bin is the bin size on x-axis of the regession segments
fxlr_model.fit(X, Y) # fit the model with data, X and Y each are tensors of shape (N,)
There are many functionality of the object fxlr_model, such as plotting the regression line, getting the breakpoints, x_to_y, y_to_x, etc. The inferences of x_to_y and y_to_x are such that any lines on an approximate horizontal lines are not allowed and hence declared as nan. This is very useful for my application because we will eventually try to estimate the horizontal distance between two regression lines corresponding to two time points.
2, Use a learned number of breakpoints. This implementation is done using numpy and scipy, very slow, and I do not have plans to transfer it to pytorch
unless requested. It is implemented in regression/piecewise_linear_regression.py
. Here is how to use it:
pwlr_model = pwlr.PiecewiseLinearRegression()
pwlr_model.fit(X, Y)
Results:
There are, again, many functionalities of the object pwlr_model, such as plotting the regression line, getting the breakpoints, x_to_y, y_to_x, etc. I do not have time yet to write down all documentation, but I will do it attracts interest.
Here are the packages that I used in this implementation. I do not have time to package this into a pip package, but I will do it if there is interest.
- numpy
- scipy
- torch
- pandas
- matplotlib
- seaborn
This is my attempt at learning about how to do optimization using pytorch
for a particular application very specific to my current research in RNA elongation rate estimation. I have used many different packages for this particular problem, but I found pytorch
to be fastest and most accurate. If you ever find yourself in stuations where you want joint-breakpoint piecewise linear regression, maybe this repository can help you, and if there are features you like implemented, please let me know.