|
| 1 | +# Roadmap / to do |
| 2 | + |
| 3 | +## General todo: |
| 4 | + - [ ] Fix so that `chromflock_gen` passes the gpseq file properly. |
| 5 | + |
| 6 | +## Molecular dynamics, M-step |
| 7 | + |
| 8 | +The M-step is where the model is updated. Initially the model is built from random positions, then at each following step: |
| 9 | + 1. The contact indication matrix, W, for each structure is updated (A-step) |
| 10 | + 2. New positions, X, are found in the structure that minimized the error functional (which includes W). |
| 11 | + |
| 12 | +Our implementation of the M-step uses Simulated Annealing with a Brownian force, much like what Peter Cook and Davide Marenduzzo does. Alber does not provide any details on what they are doing. However they refer to Russel2012, but not much details can be found their either. Since they published the Nat. Prot. paper the details are at least available in their source code. |
| 13 | + |
| 14 | +### TODO |
| 15 | + - [ ] a lua script should be the only way to set the dynamics parameters. |
| 16 | + - [ ] consider different stopping criteria, for example based on the radius of gyration. |
| 17 | + - [ ] Fix so that mflock crashes gracefully when fRad is specified but no `-r` file is supplied. |
| 18 | + - [ ] Append to log file if it already exists? |
| 19 | + - [ ] Warn if largest `max(R) > 1-r0`. Or convert `R' = R*(1-r0)`? Same goes of the output profiles, should `r` be output or `r*(1-r0)`? (`r0` is the bead radius). |
| 20 | + - [ ] Alber uses k=10 for consecutive beads (and k=1 for non consecutive). |
| 21 | + - [ ] Fix so that the supplied random seed is used. |
| 22 | + - [x] Put live view in main thread and optimization in child in order for SDL to be able to capture events on OSX. |
| 23 | + - [x] Implement ellipsoidal geometry. |
| 24 | + - [x] See if dInteraction = 4 is better than dInteraction = 3. -- seems slightly better. Changing default. See `fconf.dInteraction` in `md.c`. |
| 25 | + - [x] Chimera script to add extra annotations and save automatically -- see `util/chromflock_images.py` |
| 26 | + |
| 27 | + - [x] Removed any dependencies of GSL -- not necessary any more since only MD is used. |
| 28 | + - [x] Switched RNG and Normal Distribution generator to [A modified ziggurat algorithm for generating exponentially and normally distributed pseudorandom numbers](http://www.tandfonline.com/doi/abs/10.1080/00949655.2015.1060234). |
| 29 | +( Previously used Zignor, copied from Jurgen A. Doornik (2005). "An Improved Ziggurat Method to Generate Normal Random Samples" (PDF). Nuffield College, Oxford, together with [PCG](http://www.pcg-random.org) which should be better than a Marsenne Twister in terms of statistical properties and has about the same speed.) |
| 30 | + - [x] Use 3D Gaussian for the Brownian force. (Used a force vector drawn uniformly from a 3D ball before). |
| 31 | + - [x] Long option names, i.e. not just single letter names for the parameters. |
| 32 | + - [x] Consider playing with the radius of the simulation (equivalent with the bead radius), one of the *tricks* used by Alber. -- Faster convergence when the volume quotient of the beads is smaller. However when doing that the beads does not use the full domain very well and the structures look very unrealistic. |
| 33 | + - [ ] Consider to change from double to single precision. The instructions might not be faster for SP, but the memory locality will be better. -- Probably not worthwhile. |
| 34 | + - [x] Any reason to use the varying `k_{con}` described on p. 64? -- Has possibly something todo with the *entropic forces* and convergence speed. -- Se below. |
| 35 | + - [x] Use a decreasing volume exclusion force during the simulations. By hypothesis this reduces the bias towards *old* contacts, i.e., contacts introduced at an early stage. To be specific, the old contacts are the ones closest to the diagonal. This means that the diagonal is well represented among the contacts in the structures but non the off-diagnoal elements. This can be seen as a lack of of sharpness at the edges of the blocks. |
| 36 | + - [x] Write error log to separate file, or use other mechanism to track crashes. -- Added `--joblog parallel.log` to `parallel`. |
| 37 | + - [x] Is there a bias in the brownian force? (that brings the structure to the side) -- there was. Issue resolved and unit test in place. |
| 38 | +- [x] Implement Verlet integration. No significant changes in performance. |
| 39 | +- [x] How many "brownian" steps and what falloff? |
| 40 | + - [x] Delta t, switched from 1 to 5 2019-04-17. What would be a good value? |
| 41 | + - [x] Implement live view of current state. -- available from SDL with the -a swithch (when compiled with -DSDL). |
| 42 | + - [x] Add a COM force (per chr). |
| 43 | + - [x] Switch to radial data in the "structure summary" |
| 44 | + - [x] Timing does not work in the -D mode. |
| 45 | + - [x] Use a label vector to generate cmms. |
| 46 | + - [x] Include L - label, both for cmm creation and for various per chromosome things. |
| 47 | + - [x] Integrate with GPSeq data. |
| 48 | + - [x] Add comparison of all `err` and all `grad` in the unit tests. |
| 49 | + - [x] Make bead size a parameter or set so that the volume occupancy is 20% (or some other number). |
| 50 | + - [x] Output coordinates for chimera. |
| 51 | + - [x] HASH table to find points in contact. |
| 52 | + - [x] Use a list of pairwise interactions rather than the full matrix. |
| 53 | + |
| 54 | +And for the MD implemenation |
| 55 | + - [ ] Keep track of Ek, and potentially set it to a specific value as in IMP. |
| 56 | + - [ ] For the above point, use 1-diagonal only. |
| 57 | + - [x] Implement a force that brings chrs together to their centre of mass as Alber does. |
| 58 | + |
| 59 | +### Parallelization |
| 60 | + |
| 61 | + - [x] Structure optimizations can be carried out in parallel, one thread per structure. The jobs are distributed with GNU parallel. |
| 62 | + |
| 63 | +### Memory |
| 64 | + |
| 65 | +## A-step |
| 66 | + - [ ] Consider re-running the m-step until the number of contacts falls below some threshold. |
| 67 | + - [ ] Consider writing coords as raw double data. |
| 68 | + - [ ] Add counting of failed contacts to the `-F` mode. |
| 69 | + - [ ] Add a flag to ignore inter contacts, then there won't be any need to generate a separate A-matrix when trying without inter contacts. |
| 70 | + - [x] Allow diploid experiments. |
| 71 | + - [x] Use zlib to compress the W matrices. |
| 72 | + - [x] Consider *even* assignment in the sense that each structure get the same number of contact initially. |
| 73 | + - [x] Implement re-assignment, and improve the assignment. |
| 74 | + - [x] Possible error when setting the bead radius from volume quotient?! |
| 75 | + - [x] Use zlib to write the cmms. |
| 76 | + - [x] Enable chromflock to use X from previous simulation with updated A. |
| 77 | + - [x] Implement the HiC-normalization from p. 53. The transform was actually already performed on the data! |
| 78 | + - [x] Implement the sphere contact probability matrix from p. 65. |
| 79 | + - [x] Determine W, see p. 66. Rather assign the contacts as in Tjong2016, i.e., the structures that already have the shortest distances will have to do the job. |
| 80 | +And for flock: |
| 81 | + - [x] Read sphere contact probability matrix A. |
| 82 | + - [x] Initialize new experiment here or separate? -- in bash script |
| 83 | + - [x] Read and write the contact indicator tensor W. |
| 84 | + - [x] Write radial profile at -F |
| 85 | + |
| 86 | +### Memory |
| 87 | + - Coordinates: Surprisingly not that much memory is required. Let's get an estimate: |
| 88 | + 10,000 structure x 5,000 points x 3 dimensions = 1,200 Mb (64 bit floats). Currently aflock is using single precision to represent the coordinates. Using 10,000 structures x 3,300 points requires about .5 Gb. |
| 89 | + |
| 90 | + - Constraints: Each constraint requires 8 bytes `(uint_32, uint_32)`. By keeping separate lists for separate types of constraints there is no need to story the type. |
| 91 | + 10,000 structures x 5,000 constraints = 400 MB. |
| 92 | + |
| 93 | +### Parallelization |
| 94 | + - [x] The assignment process could be parallelized over each contact to be handed out, i.e., the qsorts. -- The generation of the contact distances is now parallelized using `pthread`. Initially all available cps are used. |
| 95 | + |
| 96 | +## General and Random notes |
| 97 | + |
| 98 | + - [ ] Can we estimate the entropy based on the gradient norm? |
| 99 | + |
| 100 | + - [ ] Does the simulation tool yield end-to-end distances of L^{3/5} for unperturbated chains? |
| 101 | + |
| 102 | + - [ ] Update makefile and folder layout, have a look at [stackoverflow](https://stackoverflow.com/questions/7004702/how-can-i-create-a-makefile-for-c-projects-with-src-obj-and-bin-subdirectories) |
| 103 | + |
| 104 | + - [x] Make Hi-C look like TCC! |
| 105 | + |
| 106 | + - [ ] Estimate number of failed contacts after each `mflock`-run in order to decide if another round is needed of `aflock` before going to smaller thetas. |
| 107 | + |
| 108 | + - [ ] ~~libexpat for settings?~~ |
| 109 | + |
| 110 | + - [x] lua for interactions/setting md/sa schedules? |
| 111 | + |
| 112 | + - [x] The chr4 trick?! -- No |
| 113 | + |
| 114 | + - [x] Rewrite `chromflock_run` so that it is easier to continue/restart failed jobs. New pipeline: `chromflock` generates `chromflock_gen` which should be edited and run. That script does in turn create `chromflock_run` which is a linear batch of jobs to be run. If anything in `chromflock_run` fails it should be easier to restart from that position. |
| 115 | + |
| 116 | + - [ ] Are beads of different type handled differently by Alber? |
| 117 | + |
| 118 | + - [ ] What are the results if only the structures with average number of contact restraints are studied? |
| 119 | + |
| 120 | + - [x] How many inter vs intra contacts are there when theta = 1, .2, ... , 0.01. (Is the algorithm agnostic to anything but the intra contacts? That would explain random orientation of the chromosomes). |
| 121 | + |
| 122 | + - [x] What if delta t is increased? -- I think that I got that parameter quite well. |
| 123 | + |
| 124 | + - PGP is only for diploid cells. |
| 125 | + |
| 126 | + - Maybe it makes sense to 'kill your darlings', i.e., open up for the possibility to not use all of the initial structures (discard or replace). We don't expect the nucleus to be completely random so random initializations might go wrong. |
| 127 | + |
| 128 | + - chromflock is snappier than chromoflock. chrof, crof. |
| 129 | + |
| 130 | + - Convergence critera shouldn't have to be that strict until final theta. |
| 131 | + |
| 132 | + - A very limited number of theta values, i.e., only few assignment steps (A) are used. This could potentially lead to the inclusion of conflicting constraints. However, it is extremely costly to increase this number... |
| 133 | + |
| 134 | + - The Hi-C maps does probably implicitly suggest radial positions. However what radial positions that are assigned to each loci depends on the reconstruction method. No reconstruction method is perfect, rather the opposite. This means that GPSeq adds extra constraints, i.e., reduces the space of probable states/configurations. |
| 135 | + |
| 136 | + - There is an upper limit on the number of neighbours each bead can have. Geometry tells us 12, see the wikipedia page on sphere packings. |
| 137 | + |
| 138 | +## Further devolopment |
| 139 | + - [ ] Use local temperature like [Agrawal](https://www.biorxiv.org/node/96892.full) |
| 140 | + - [ ] Use proteins like [Cook and Marenduzzo](http://dx.doi.org/10.1093/nar/gkw135) |
0 commit comments