Skip to content

Commit 0ce8e8c

Browse files
committed
fast forward to version 0.3.3
1 parent 31616e3 commit 0ce8e8c

File tree

167 files changed

+46579
-3630
lines changed

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

167 files changed

+46579
-3630
lines changed

CHANGELOG.md

Lines changed: 20 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,20 @@
1+
# CHANGELOG
2+
3+
## 0.3.4
4+
- Cleaned up the makefile
5+
6+
## 0.3.3
7+
- New behaviour, instead of `chromflock`, do `chromflock init`.
8+
- `cc2cpm` is baked into the binary `chromflock` and renamed to
9+
`hic2cpm`
10+
- `sprite2cpm`, `any2string` and `string2any` also moved to `chromflock`.
11+
- documentation is based on markdown instead of troff.
12+
13+
## 0.3.2
14+
- Cleaned up the makefile.
15+
- Added **--version** argument to **aflock**, **mflock**, and **cc2cpm**
16+
- Fixed typos in the **cc2cpm** documentation.
17+
18+
## 0.3.1
19+
- Added script to make a deb package for Ubuntu. That is the
20+
preferred way to install from now.

HAP1_100k.md

Lines changed: 59 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,59 @@
1+
# chromflock at 100kb on HAP1 using GPSeq
2+
3+
This can be done on a system with 32 GB but will kill a system with 16 GB of ram (or use a lot of swapping).
4+
5+
## Get Hi-C data
6+
```
7+
# 5,032,452,195 bytes
8+
wget https://data.4dnucleome.org/files-processed/4DNFI1E6NJQJ/@@download/4DNFI1E6NJQJ.hic
9+
```
10+
11+
## Convert Hi-C matrix to raw matrices
12+
This step requires that straw is installed and can be found in the PATH.
13+
```
14+
# A 30877x30877 matrix
15+
~/code/chromflock_dev/util/python/chromflock_hic2H.py 4DNFI1E6NJQJ.hic 100000
16+
```
17+
18+
## Translocate Hi-C data
19+
```
20+
~/code/chromflock_dev/util/python/chromflock_translocate.py 4DNFI1E6NJQJ.hic.H.double 4DNFI1E6NJQJ.hic.L0.uint8 100000
21+
```
22+
23+
## Convert Hi-C (cc) data to contact probability matrix (cpm)
24+
ChrY is also removed at this stage.
25+
```
26+
mkdir 1000
27+
cd 1000
28+
cc2cpm --hfile ../4DNFI1E6NJQJ.hic.H_trans.double --lfile ../4DNFI1E6NJQJ.hic.H_trans.L.uint8 --nCont 8 --nStruct 1000
29+
```
30+
31+
## Get GPSeq data (matlab)
32+
```
33+
L = read_raw('8.000000_1000_MAX.L.uint8', 'uint8');
34+
L = L(L<24); % Excluded Y if not already excluded
35+
36+
G = loadGPSeq(L, ...
37+
'/mnt/bicroserver2/projects/GPSeq/centrality_by_seq/SeqNoGroup/B170_transCorrected/all/B170_transCorrected.asB165.rescaled.bins.size100000.step100000.csm3.rmOutliers_chi2.rmAllOutliers.tsv', ...
38+
100000);
39+
40+
GN = log2(G);
41+
GN(GN>1) = 1;
42+
GN(GN<0) = 0;
43+
% GG: That's GPSeq score, ergo "centrality". I.e., greater at the center.
44+
GN = 1-GN;
45+
46+
figure, plot(GN)
47+
write_raw(GN, 'G.double', 'double');
48+
```
49+
50+
## Run chromflock
51+
```
52+
chromflock
53+
vim chromflock_gen
54+
# Add gpseq force
55+
vim chromflock.lua
56+
./chromflock_gen
57+
screen
58+
./chromflock_run
59+
```

INSTALL.md

Lines changed: 64 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,64 @@
1+
# INSTALLATION
2+
3+
On Ubuntu 22.04
4+
5+
``` shell
6+
cd src/lua-5.3.5/
7+
make linux # or pick another suitable target
8+
cd ../../
9+
make
10+
./makedeb-ubuntu_2204
11+
sudo apt install ./chromflock_*.deb
12+
```
13+
14+
Check the dependencies section if there are any missing libraries. If
15+
you have another OS, please check the section **Custom install**.
16+
17+
## Dependencies
18+
19+
For compilation the following is needed:
20+
21+
* gcc
22+
* libreadline -- to compile lua
23+
* zlib -- to write `.gz` files
24+
* pkg-config -- for the makefile
25+
* libcairo -- optional, to write out the colour map
26+
* libsdl2 -- optional, to compile mflock with live view (`mflock_sdl`)
27+
* parallel
28+
* liblua5.4.5 (included in src/lua-5.3.5/ )
29+
30+
On Ubuntu 22.04 these packages can be installed by
31+
32+
``` shell
33+
# Ubuntu
34+
sudo apt-get install libcairo-dev
35+
sudo apt-get install libreadline-dev
36+
sudo apt install pkg-config
37+
sudo apt-get install zlib1g-dev
38+
sudo apt-get install libsdl2-dev
39+
sudo apt-get install parallel
40+
```
41+
42+
* lua, included but has to be built, on a Linux machine that corresponds to:
43+
```
44+
cd src/lua-5.3.5/
45+
# All possible platforms: aix bsd c89 freebsd generic linux macosx mingw posix solaris
46+
# e.g., on linux use:
47+
make linux
48+
```
49+
Please check the Lua documentation if you are on another platform.
50+
51+
52+
## Custom install
53+
54+
If your system does not support deb files you can still install
55+
chromflock. Typically by
56+
57+
``` shell
58+
make
59+
sudo make install
60+
```
61+
62+
please check what the **makefile** does before doing this. If you
63+
change the **DESTDIR** OR **PREFIX** variables please update the
64+
script called **chromflock** to reflect these changes.

NOTES.md

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,8 @@
1+
## Files:
2+
```
3+
# Initial MATLAB prototype for the M-step
4+
src/mstep.m
5+
6+
# Implementation of the 'error' function and it's gradient:
7+
src/functional.*
8+
```

README.md

100644100755
Lines changed: 30 additions & 32 deletions
Original file line numberDiff line numberDiff line change
@@ -1,44 +1,42 @@
11
# CHROMFLOCK
22

3-
![example structure](cf_000038.png)
3+
![example structure](doc/cf_000038.png)
44

5+
## Introduction
56

6-
## Usage:
7-
Compile, if gcc, pkg-config and zlib are installed, compilation is as simple as:
7+
**chromflock** is a software package to deconvolve bulk Hi-C data into
8+
putative single-cell structures. The overall scheme is borrowed from
9+
[PGS](https://www.github.com/alberlab/PGS) although it is not a direct
10+
fork.
811

9-
```
10-
make all
11-
```
12+
Chromflock will:
13+
- Work with haploid as well as diploid structures.
14+
- Can integrate GPSeq data for radial preferences.
15+
- Supports spherical as well as an ellipsoidal domain for the beads.
1216

13-
To make the binaries and then man pages is visible, use:
17+
The documentation is not complete at the moment and it is suggested
18+
that anyone interested in chromflock start by reading [Kalhor et al,
19+
2012](https://doi.org/10.1038/nbt.2057), especially the supplementary
20+
materials since much of the terminology used here can be traced back
21+
to that paper.
1422

15-
```
16-
here=$(pwd)
17-
export PATH=$PATH:$here/bin/
18-
export MANPATH=$MANPATH:$here/man/
19-
```
20-
possibly add that to your `.bashrc` or whatever is used on your system.
23+
Installation instructions can be found in [INSTALL.md](INSTALL.md) and
24+
some example usage in [USAGE.md](USAGE.md).
2125

22-
To get started, read the man page for `chromflock`, i.e.:
26+
In addition to csv files, chromflock will use chimera marker files
27+
(.cmm) to write out structures for viewing. These can be opened by
28+
[UCFS chimera](https://www.cgl.ucsf.edu/chimera/) and also by
29+
[nua](https://www.github.com/elgw/nua/).
2330

24-
```
25-
man chromflock
26-
```
31+
## References
2732

28-
in general the sequence of command needed to start an experiment are:
29-
```
30-
chromflock
31-
vim chromflock_gen
32-
./chromflock_run
33-
```
34-
35-
If `chromflock_run` was aborted for some reason, inspect `status.txt` to see the last thing that was finished and continue from any line, L, by
36-
```
37-
bash < (sed -n 'L,$p' chromflock_run)
38-
```
39-
40-
# References
41-
For the RNG and Normal distribution:
42-
- [A modified ziggurat algorithm for generating exponentially and normally distributed pseudorandom numbers](http://www.tandfonline.com/doi/abs/10.1080/00949655.2015.1060234).
33+
chromflock was used in
34+
- [Girelli et al. 2020](https://www.nature.com/articles/s41587-020-0519-y).
4335

36+
The ideas behind it can be found in the following papers:
37+
- [Kalhor et al, 2012](https://doi.org/10.1038/nbt.2057)
38+
- [Tjong et al, 2016](http://dx.doi.org/10.1073/pnas.1512577113)
39+
- [Hua et al, 2018](http://dx.doi.org/10.1038/nprot.2018.008) [github](https://www.github.com/alberlab/PGS)
4440

41+
For random numbers chromflock uses
42+
- [McFarland, 2014](http://www.tandfonline.com/doi/abs/10.1080/00949655.2015.1060234) [code](https://github.com/cd-mcfarland/fast_prng).

TODO.md

Lines changed: 140 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,140 @@
1+
# Roadmap / to do
2+
3+
## General todo:
4+
- [ ] Fix so that `chromflock_gen` passes the gpseq file properly.
5+
6+
## Molecular dynamics, M-step
7+
8+
The M-step is where the model is updated. Initially the model is built from random positions, then at each following step:
9+
1. The contact indication matrix, W, for each structure is updated (A-step)
10+
2. New positions, X, are found in the structure that minimized the error functional (which includes W).
11+
12+
Our implementation of the M-step uses Simulated Annealing with a Brownian force, much like what Peter Cook and Davide Marenduzzo does. Alber does not provide any details on what they are doing. However they refer to Russel2012, but not much details can be found their either. Since they published the Nat. Prot. paper the details are at least available in their source code.
13+
14+
### TODO
15+
- [ ] a lua script should be the only way to set the dynamics parameters.
16+
- [ ] consider different stopping criteria, for example based on the radius of gyration.
17+
- [ ] Fix so that mflock crashes gracefully when fRad is specified but no `-r` file is supplied.
18+
- [ ] Append to log file if it already exists?
19+
- [ ] Warn if largest `max(R) > 1-r0`. Or convert `R' = R*(1-r0)`? Same goes of the output profiles, should `r` be output or `r*(1-r0)`? (`r0` is the bead radius).
20+
- [ ] Alber uses k=10 for consecutive beads (and k=1 for non consecutive).
21+
- [ ] Fix so that the supplied random seed is used.
22+
- [x] Put live view in main thread and optimization in child in order for SDL to be able to capture events on OSX.
23+
- [x] Implement ellipsoidal geometry.
24+
- [x] See if dInteraction = 4 is better than dInteraction = 3. -- seems slightly better. Changing default. See `fconf.dInteraction` in `md.c`.
25+
- [x] Chimera script to add extra annotations and save automatically -- see `util/chromflock_images.py`
26+
27+
- [x] Removed any dependencies of GSL -- not necessary any more since only MD is used.
28+
- [x] Switched RNG and Normal Distribution generator to [A modified ziggurat algorithm for generating exponentially and normally distributed pseudorandom numbers](http://www.tandfonline.com/doi/abs/10.1080/00949655.2015.1060234).
29+
( Previously used Zignor, copied from Jurgen A. Doornik (2005). "An Improved Ziggurat Method to Generate Normal Random Samples" (PDF). Nuffield College, Oxford, together with [PCG](http://www.pcg-random.org) which should be better than a Marsenne Twister in terms of statistical properties and has about the same speed.)
30+
- [x] Use 3D Gaussian for the Brownian force. (Used a force vector drawn uniformly from a 3D ball before).
31+
- [x] Long option names, i.e. not just single letter names for the parameters.
32+
- [x] Consider playing with the radius of the simulation (equivalent with the bead radius), one of the *tricks* used by Alber. -- Faster convergence when the volume quotient of the beads is smaller. However when doing that the beads does not use the full domain very well and the structures look very unrealistic.
33+
- [ ] Consider to change from double to single precision. The instructions might not be faster for SP, but the memory locality will be better. -- Probably not worthwhile.
34+
- [x] Any reason to use the varying `k_{con}` described on p. 64? -- Has possibly something todo with the *entropic forces* and convergence speed. -- Se below.
35+
- [x] Use a decreasing volume exclusion force during the simulations. By hypothesis this reduces the bias towards *old* contacts, i.e., contacts introduced at an early stage. To be specific, the old contacts are the ones closest to the diagonal. This means that the diagonal is well represented among the contacts in the structures but non the off-diagnoal elements. This can be seen as a lack of of sharpness at the edges of the blocks.
36+
- [x] Write error log to separate file, or use other mechanism to track crashes. -- Added `--joblog parallel.log` to `parallel`.
37+
- [x] Is there a bias in the brownian force? (that brings the structure to the side) -- there was. Issue resolved and unit test in place.
38+
- [x] Implement Verlet integration. No significant changes in performance.
39+
- [x] How many "brownian" steps and what falloff?
40+
- [x] Delta t, switched from 1 to 5 2019-04-17. What would be a good value?
41+
- [x] Implement live view of current state. -- available from SDL with the -a swithch (when compiled with -DSDL).
42+
- [x] Add a COM force (per chr).
43+
- [x] Switch to radial data in the "structure summary"
44+
- [x] Timing does not work in the -D mode.
45+
- [x] Use a label vector to generate cmms.
46+
- [x] Include L - label, both for cmm creation and for various per chromosome things.
47+
- [x] Integrate with GPSeq data.
48+
- [x] Add comparison of all `err` and all `grad` in the unit tests.
49+
- [x] Make bead size a parameter or set so that the volume occupancy is 20% (or some other number).
50+
- [x] Output coordinates for chimera.
51+
- [x] HASH table to find points in contact.
52+
- [x] Use a list of pairwise interactions rather than the full matrix.
53+
54+
And for the MD implemenation
55+
- [ ] Keep track of Ek, and potentially set it to a specific value as in IMP.
56+
- [ ] For the above point, use 1-diagonal only.
57+
- [x] Implement a force that brings chrs together to their centre of mass as Alber does.
58+
59+
### Parallelization
60+
61+
- [x] Structure optimizations can be carried out in parallel, one thread per structure. The jobs are distributed with GNU parallel.
62+
63+
### Memory
64+
65+
## A-step
66+
- [ ] Consider re-running the m-step until the number of contacts falls below some threshold.
67+
- [ ] Consider writing coords as raw double data.
68+
- [ ] Add counting of failed contacts to the `-F` mode.
69+
- [ ] Add a flag to ignore inter contacts, then there won't be any need to generate a separate A-matrix when trying without inter contacts.
70+
- [x] Allow diploid experiments.
71+
- [x] Use zlib to compress the W matrices.
72+
- [x] Consider *even* assignment in the sense that each structure get the same number of contact initially.
73+
- [x] Implement re-assignment, and improve the assignment.
74+
- [x] Possible error when setting the bead radius from volume quotient?!
75+
- [x] Use zlib to write the cmms.
76+
- [x] Enable chromflock to use X from previous simulation with updated A.
77+
- [x] Implement the HiC-normalization from p. 53. The transform was actually already performed on the data!
78+
- [x] Implement the sphere contact probability matrix from p. 65.
79+
- [x] Determine W, see p. 66. Rather assign the contacts as in Tjong2016, i.e., the structures that already have the shortest distances will have to do the job.
80+
And for flock:
81+
- [x] Read sphere contact probability matrix A.
82+
- [x] Initialize new experiment here or separate? -- in bash script
83+
- [x] Read and write the contact indicator tensor W.
84+
- [x] Write radial profile at -F
85+
86+
### Memory
87+
- Coordinates: Surprisingly not that much memory is required. Let's get an estimate:
88+
10,000 structure x 5,000 points x 3 dimensions = 1,200 Mb (64 bit floats). Currently aflock is using single precision to represent the coordinates. Using 10,000 structures x 3,300 points requires about .5 Gb.
89+
90+
- Constraints: Each constraint requires 8 bytes `(uint_32, uint_32)`. By keeping separate lists for separate types of constraints there is no need to story the type.
91+
10,000 structures x 5,000 constraints = 400 MB.
92+
93+
### Parallelization
94+
- [x] The assignment process could be parallelized over each contact to be handed out, i.e., the qsorts. -- The generation of the contact distances is now parallelized using `pthread`. Initially all available cps are used.
95+
96+
## General and Random notes
97+
98+
- [ ] Can we estimate the entropy based on the gradient norm?
99+
100+
- [ ] Does the simulation tool yield end-to-end distances of L^{3/5} for unperturbated chains?
101+
102+
- [ ] Update makefile and folder layout, have a look at [stackoverflow](https://stackoverflow.com/questions/7004702/how-can-i-create-a-makefile-for-c-projects-with-src-obj-and-bin-subdirectories)
103+
104+
- [x] Make Hi-C look like TCC!
105+
106+
- [ ] Estimate number of failed contacts after each `mflock`-run in order to decide if another round is needed of `aflock` before going to smaller thetas.
107+
108+
- [ ] ~~libexpat for settings?~~
109+
110+
- [x] lua for interactions/setting md/sa schedules?
111+
112+
- [x] The chr4 trick?! -- No
113+
114+
- [x] Rewrite `chromflock_run` so that it is easier to continue/restart failed jobs. New pipeline: `chromflock` generates `chromflock_gen` which should be edited and run. That script does in turn create `chromflock_run` which is a linear batch of jobs to be run. If anything in `chromflock_run` fails it should be easier to restart from that position.
115+
116+
- [ ] Are beads of different type handled differently by Alber?
117+
118+
- [ ] What are the results if only the structures with average number of contact restraints are studied?
119+
120+
- [x] How many inter vs intra contacts are there when theta = 1, .2, ... , 0.01. (Is the algorithm agnostic to anything but the intra contacts? That would explain random orientation of the chromosomes).
121+
122+
- [x] What if delta t is increased? -- I think that I got that parameter quite well.
123+
124+
- PGP is only for diploid cells.
125+
126+
- Maybe it makes sense to 'kill your darlings', i.e., open up for the possibility to not use all of the initial structures (discard or replace). We don't expect the nucleus to be completely random so random initializations might go wrong.
127+
128+
- chromflock is snappier than chromoflock. chrof, crof.
129+
130+
- Convergence critera shouldn't have to be that strict until final theta.
131+
132+
- A very limited number of theta values, i.e., only few assignment steps (A) are used. This could potentially lead to the inclusion of conflicting constraints. However, it is extremely costly to increase this number...
133+
134+
- The Hi-C maps does probably implicitly suggest radial positions. However what radial positions that are assigned to each loci depends on the reconstruction method. No reconstruction method is perfect, rather the opposite. This means that GPSeq adds extra constraints, i.e., reduces the space of probable states/configurations.
135+
136+
- There is an upper limit on the number of neighbours each bead can have. Geometry tells us 12, see the wikipedia page on sphere packings.
137+
138+
## Further devolopment
139+
- [ ] Use local temperature like [Agrawal](https://www.biorxiv.org/node/96892.full)
140+
- [ ] Use proteins like [Cook and Marenduzzo](http://dx.doi.org/10.1093/nar/gkw135)

0 commit comments

Comments
 (0)