Commit 6baf3b5

[README] Improve quickstart, reinstate old bootstrap video link
1 parent 876d017 commit 6baf3b5

3 files changed: +83 -74 lines changed

README.md

Lines changed: 25 additions & 20 deletions
@@ -16,7 +16,7 @@ containing interacting biological sequences, find the optimal one-to-one
 pairing between the sequences in A and B.
 
 <figure>
-<img src="https://raw.githubusercontent.com/Bitbol-Lab/DiffPaSS/main/media/MSA_pairing_problem.svg" alt="MSA pairing problem" />
+<img src="https://raw.githubusercontent.com/Bitbol-Lab/DiffPaSS/main/media/MSA_pairing_problem.svg" width="640" height="201.6" alt="MSA pairing problem" />
 <figcaption>
 Pairing problem for two multiple sequence alignments, where pairings are
 restricted to be within the same species
@@ -84,7 +84,7 @@ ingredients are as follows:
 the DiffPaSS-Iterative Pairing Algorithm (DiffPaSS-IPA).
 
 <figure>
-<video src="https://raw.githubusercontent.com/Bitbol-Lab/DiffPaSS/main/media/DiffPaSS_bootstrap.mp4" width="432" height="243" controls>
+<video src="https://github.com/Bitbol-Lab/DiffPaSS/assets/46537483/e411fe8c-2fed-4723-a25c-ff69a1abccec" width="640" height="360" controls>
 </video>
 <figcaption>
 The DiffPaSS bootstrap technique and robust pairs
@@ -129,9 +129,9 @@ into a list of tuples `(header, sequence)` using
 ``` python
 from diffpass.msa_parsing import read_msa
 
-# Parse and one-hot encode the MSAs
-msa_data_A = read_msa("path/to/msa_A.fasta")
-msa_data_B = read_msa("path/to/msa_B.fasta")
+# Parse the MSAs into lists of tuples (header, sequence)
+msa_A = read_msa("path/to/msa_A.fasta")
+msa_B = read_msa("path/to/msa_B.fasta")
 ```
 
 We assume that the MSAs contain species information in the headers,
@@ -150,8 +150,8 @@ This function will be used to group the sequences by species:
 ``` python
 from diffpass.data_utils import create_groupwise_seq_records
 
-msa_data_A_species_by_species = create_groupwise_seq_records(msa_data_A, species_name_func)
-msa_data_B_species_by_species = create_groupwise_seq_records(msa_data_B, species_name_func)
+msa_A_by_sp = create_groupwise_seq_records(msa_A, species_name_func)
+msa_B_by_sp = create_groupwise_seq_records(msa_B, species_name_func)
 ```
 
 If one of the MSAs contains sequences from species not present in the
@@ -160,8 +160,8 @@ other MSA, we can remove these species from both MSAs:
 ``` python
 from diffpass.data_utils import remove_groups_not_in_both
 
-msa_data_A_species_by_species, msa_data_B_species_by_species = remove_groups_not_in_both(
-    msa_data_A_species_by_species, msa_data_B_species_by_species
+msa_A_by_sp, msa_B_by_sp = remove_groups_not_in_both(
+    msa_A_by_sp, msa_B_by_sp
 )
 ```
 
@@ -173,12 +173,12 @@ consisting entirely of gap symbols:
 ``` python
 from diffpass.data_utils import pad_msas_with_dummy_sequences
 
-msa_data_A_species_by_species_padded, msa_data_B_species_by_species_padded = pad_msas_with_dummy_sequences(
-    msa_data_A_species_by_species, msa_data_B_species_by_species
+msa_A_by_sp_pad, msa_B_by_sp_pad = pad_msas_with_dummy_sequences(
+    msa_A_by_sp, msa_B_by_sp
 )
 
-species = list(msa_data_A_species_by_species_padded.keys())
-species_sizes = list(map(len, msa_data_A_species_by_species_padded.values()))
+species = list(msa_A_by_sp_pad.keys())
+species_sizes = list(map(len, msa_A_by_sp_pad.values()))
 ```
 
 Next, one-hot encode the MSAs using the
@@ -191,23 +191,28 @@ from diffpass.data_utils import one_hot_encode_msa
 device = "cuda" if torch.cuda.is_available() else "cpu"
 
 # Unpack the padded MSAs into a list of records
-msa_data_A_for_pairing = [record for records_this_species in msa_data_A_species_by_species_padded.values() for record in records_this_species]
-msa_data_B_for_pairing = [record for records_this_species in msa_data_B_species_by_species_padded.values() for record in records_this_species]
+msa_A_for_pairing = [
+    rec for recs_this_sp in msa_A_by_sp_pad.values() for rec in recs_this_sp
+]
+msa_B_for_pairing = [
+    rec for recs_this_sp in msa_B_by_sp_pad.values() for rec in recs_this_sp
+]
 
 # One-hot encode the MSAs and load them to a device
-msa_A_oh = one_hot_encode_msa(msa_data_A_for_pairing, device=device)
-msa_B_oh = one_hot_encode_msa(msa_data_B_for_pairing, device=device)
+msa_A_oh = one_hot_encode_msa(msa_A_for_pairing, device=device)
+msa_B_oh = one_hot_encode_msa(msa_B_for_pairing, device=device)
 ```
 
 ### Pairing optimization
 
 Finally, we can instantiate an
 [`InformationPairing`](https://Bitbol-Lab.github.io/DiffPaSS/train.html#informationpairing)
 object and optimize the mutual information between the paired MSAs using
-the DiffPaSS bootstrap algorithm. The results are stored in a
+the DiffPaSS bootstrapped optimization algorithm. The results are stored
+in a
 [`DiffPaSSResults`](https://Bitbol-Lab.github.io/DiffPaSS/base.html#diffpassresults)
-container. The lists of (hard) losses and permutations found can be
-accessed as attributes of the container.
+container. The lists of (hard) losses and permutations found during the
+optimization can be accessed as attributes of the container.
 
 ``` python
 from diffpass.train import InformationPairing

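To make the README preprocessing steps touched by this diff concrete (group by species, keep shared species, pad, flatten), here is a minimal pure-Python sketch. It is an illustration under assumptions, not the `diffpass` implementation: the helpers `group_by_species`, `keep_shared_groups` and `pad_with_gaps` are hypothetical stand-ins for `create_groupwise_seq_records`, `remove_groups_not_in_both` and `pad_msas_with_dummy_sequences`, and the `|`-separated header layout, the `"dummy"` header and the `-` gap symbol are assumed for the example.

```python
from collections import defaultdict

# Hypothetical stand-ins for the diffpass helpers named in the README;
# the real functions may differ in signature and behavior.

def species_name_func(header):
    # Assumes headers look like "sequence_id|species_name|..."
    return header.split("|")[1]

def group_by_species(records, name_func):
    # Map species name -> list of (header, sequence) records
    groups = defaultdict(list)
    for header, seq in records:
        groups[name_func(header)].append((header, seq))
    return dict(groups)

def keep_shared_groups(groups_a, groups_b):
    # Keep only species present in both MSAs, in the order of groups_a
    shared = [sp for sp in groups_a if sp in groups_b]
    return ({sp: groups_a[sp] for sp in shared},
            {sp: groups_b[sp] for sp in shared})

def pad_with_gaps(groups_a, groups_b, gap="-"):
    # Pad the smaller group of each species with all-gap dummy sequences.
    # Assumes at least one shared species and aligned (equal-length) sequences.
    len_a = len(next(iter(groups_a.values()))[0][1])
    len_b = len(next(iter(groups_b.values()))[0][1])
    padded_a, padded_b = {}, {}
    for sp in groups_a:
        size = max(len(groups_a[sp]), len(groups_b[sp]))
        padded_a[sp] = groups_a[sp] + [("dummy", gap * len_a)] * (size - len(groups_a[sp]))
        padded_b[sp] = groups_b[sp] + [("dummy", gap * len_b)] * (size - len(groups_b[sp]))
    return padded_a, padded_b

# Toy MSAs (made-up data)
msa_A = [("a1|sp1", "ACD"), ("a2|sp1", "ACE"), ("a3|sp2", "GCD"), ("a4|sp3", "GGG")]
msa_B = [("b1|sp1", "KL"), ("b2|sp2", "KM"), ("b3|sp2", "RM")]

msa_A_by_sp = group_by_species(msa_A, species_name_func)
msa_B_by_sp = group_by_species(msa_B, species_name_func)
msa_A_by_sp, msa_B_by_sp = keep_shared_groups(msa_A_by_sp, msa_B_by_sp)
msa_A_by_sp_pad, msa_B_by_sp_pad = pad_with_gaps(msa_A_by_sp, msa_B_by_sp)

species = list(msa_A_by_sp_pad.keys())
species_sizes = list(map(len, msa_A_by_sp_pad.values()))

# Flatten the per-species groups back into one list of records per MSA
msa_A_for_pairing = [
    rec for recs_this_sp in msa_A_by_sp_pad.values() for rec in recs_this_sp
]
msa_B_for_pairing = [
    rec for recs_this_sp in msa_B_by_sp_pad.values() for rec in recs_this_sp
]
```

After padding, both flattened lists have the same length and the same per-species block sizes, which is what lets a within-species permutation act on them position by position.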
nbs/index.ipynb

Lines changed: 22 additions & 18 deletions
@@ -22,7 +22,7 @@
 "A typical example of the problem DiffPaSS is designed to solve is the following: given two multiple sequence alignments (MSAs) A and B, containing interacting biological sequences, find the optimal one-to-one pairing between the sequences in A and B.\n",
 "\n",
 "<figure>\n",
-" <img src=\"https://raw.githubusercontent.com/Bitbol-Lab/DiffPaSS/main/media/MSA_pairing_problem.svg\" alt=\"MSA pairing problem\" />\n",
+" <img src=\"https://raw.githubusercontent.com/Bitbol-Lab/DiffPaSS/main/media/MSA_pairing_problem.svg\" width=\"640\" height=\"201.6\" alt=\"MSA pairing problem\" />\n",
 " <figcaption>Pairing problem for two multiple sequence alignments, where pairings are restricted to be within the same species</figcaption>\n",
 "</figure>\n",
 "\n",
@@ -51,7 +51,7 @@
 " 4. A notion of \"robust pairs\" that can be used to identify pairs that are consistently found throughout a DiffPaSS bootstrap. These pairs can be used as ground truths in another DiffPaSS run, giving rise to the DiffPaSS-Iterative Pairing Algorithm (DiffPaSS-IPA).\n",
 " \n",
 "<figure>\n",
-" <video src=\"https://raw.githubusercontent.com/Bitbol-Lab/DiffPaSS/main/media/DiffPaSS_bootstrap.mp4\" width=\"432\" height=\"243\" controls></video>\n",
+" <video src=\"https://github.com/Bitbol-Lab/DiffPaSS/assets/46537483/e411fe8c-2fed-4723-a25c-ff69a1abccec\" width=\"640\" height=\"360\" controls></video>\n",
 " <figcaption>The DiffPaSS bootstrap technique and robust pairs</figcaption>\n",
 "</figure>"
 ]
@@ -98,9 +98,9 @@
 "```python\n",
 "from diffpass.msa_parsing import read_msa\n",
 "\n",
-"# Parse and one-hot encode the MSAs\n",
-"msa_data_A = read_msa(\"path/to/msa_A.fasta\")\n",
-"msa_data_B = read_msa(\"path/to/msa_B.fasta\")\n",
+"# Parse the MSAs into lists of tuples (header, sequence)\n",
+"msa_A = read_msa(\"path/to/msa_A.fasta\")\n",
+"msa_B = read_msa(\"path/to/msa_B.fasta\")\n",
 "```\n",
 "\n",
 "We assume that the MSAs contain species information in the headers, which will be used to restrict the pairings to be within the same species (more generally, \"groups\"). We need a simple function to extract the species information from the headers. For instance, if the headers are in the format `>sequence_id|species_name|...`, we can use:\n",
@@ -115,17 +115,17 @@
 "```python\n",
 "from diffpass.data_utils import create_groupwise_seq_records\n",
 "\n",
-"msa_data_A_species_by_species = create_groupwise_seq_records(msa_data_A, species_name_func)\n",
-"msa_data_B_species_by_species = create_groupwise_seq_records(msa_data_B, species_name_func)\n",
+"msa_A_by_sp = create_groupwise_seq_records(msa_A, species_name_func)\n",
+"msa_B_by_sp = create_groupwise_seq_records(msa_B, species_name_func)\n",
 "```\n",
 "\n",
 "If one of the MSAs contains sequences from species not present in the other MSA, we can remove these species from both MSAs:\n",
 "\n",
 "```python\n",
 "from diffpass.data_utils import remove_groups_not_in_both\n",
 "\n",
-"msa_data_A_species_by_species, msa_data_B_species_by_species = remove_groups_not_in_both(\n",
-"    msa_data_A_species_by_species, msa_data_B_species_by_species\n",
+"msa_A_by_sp, msa_B_by_sp = remove_groups_not_in_both(\n",
+"    msa_A_by_sp, msa_B_by_sp\n",
 ")\n",
 "```\n",
 "\n",
@@ -134,12 +134,12 @@
 "```python\n",
 "from diffpass.data_utils import pad_msas_with_dummy_sequences\n",
 "\n",
-"msa_data_A_species_by_species_padded, msa_data_B_species_by_species_padded = pad_msas_with_dummy_sequences(\n",
-"    msa_data_A_species_by_species, msa_data_B_species_by_species\n",
+"msa_A_by_sp_pad, msa_B_by_sp_pad = pad_msas_with_dummy_sequences(\n",
+"    msa_A_by_sp, msa_B_by_sp\n",
 ")\n",
 "\n",
-"species = list(msa_data_A_species_by_species_padded.keys())\n",
-"species_sizes = list(map(len, msa_data_A_species_by_species_padded.values()))\n",
+"species = list(msa_A_by_sp_pad.keys())\n",
+"species_sizes = list(map(len, msa_A_by_sp_pad.values()))\n",
 "```\n",
 "\n",
 "Next, one-hot encode the MSAs using the `one_hot_encode_msa` function.\n",
@@ -150,17 +150,21 @@
 "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
 "\n",
 "# Unpack the padded MSAs into a list of records\n",
-"msa_data_A_for_pairing = [record for records_this_species in msa_data_A_species_by_species_padded.values() for record in records_this_species]\n",
-"msa_data_B_for_pairing = [record for records_this_species in msa_data_B_species_by_species_padded.values() for record in records_this_species]\n",
+"msa_A_for_pairing = [\n",
+"    rec for recs_this_sp in msa_A_by_sp_pad.values() for rec in recs_this_sp\n",
+"]\n",
+"msa_B_for_pairing = [\n",
+"    rec for recs_this_sp in msa_B_by_sp_pad.values() for rec in recs_this_sp\n",
+"]\n",
 "\n",
 "# One-hot encode the MSAs and load them to a device\n",
-"msa_A_oh = one_hot_encode_msa(msa_data_A_for_pairing, device=device)\n",
-"msa_B_oh = one_hot_encode_msa(msa_data_B_for_pairing, device=device)\n",
+"msa_A_oh = one_hot_encode_msa(msa_A_for_pairing, device=device)\n",
+"msa_B_oh = one_hot_encode_msa(msa_B_for_pairing, device=device)\n",
 "```\n",
 "\n",
 "### Pairing optimization\n",
 "\n",
-"Finally, we can instantiate an `InformationPairing` object and optimize the mutual information between the paired MSAs using the DiffPaSS bootstrap algorithm. The results are stored in a `DiffPaSSResults` container. The lists of (hard) losses and permutations found can be accessed as attributes of the container.\n",
+"Finally, we can instantiate an `InformationPairing` object and optimize the mutual information between the paired MSAs using the DiffPaSS bootstrapped optimization algorithm. The results are stored in a `DiffPaSSResults` container. The lists of (hard) losses and permutations found during the optimization can be accessed as attributes of the container.\n",
 "\n",
 "```python\n",
 "from diffpass.train import InformationPairing\n",

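The notebook text in this diff says species information is used to restrict pairings to be within the same species, and the quickstart computes `species_sizes` for that purpose. As a rough sketch of what that restriction means, the snippet below counts the admissible pairings and assembles one global permutation from per-species permutations. The function names `count_within_species_pairings` and `global_permutation` are hypothetical and are not part of `diffpass`; only `species_sizes` comes from the quickstart.

```python
from math import factorial, prod

def count_within_species_pairings(species_sizes):
    # Each species of size s contributes s! possible one-to-one pairings,
    # and species are paired independently of each other.
    return prod(factorial(s) for s in species_sizes)

def global_permutation(species_sizes, perms_by_species):
    # Shift each per-species permutation by the offset of its block so that
    # the result indexes into the flattened (species-by-species) MSA.
    result, offset = [], 0
    for size, perm in zip(species_sizes, perms_by_species):
        assert sorted(perm) == list(range(size)), "not a permutation of the block"
        result.extend(offset + p for p in perm)
        offset += size
    return result

species_sizes = [2, 3]
n_pairings = count_within_species_pairings(species_sizes)
perm = global_permutation(species_sizes, [[1, 0], [2, 0, 1]])
```

With `species_sizes = [2, 3]` there are 2! × 3! = 12 admissible pairings, compared with 5! = 120 if pairings were unrestricted.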
nbs/model.ipynb

Lines changed: 36 additions & 36 deletions
@@ -6,7 +6,7 @@
 "source": [
 "# model\n",
 "\n",
-"> DiffPaSS models for optimizing MSA pairing"
+"> DiffPaSS modules for optimizing permutations and computing soft scores"
 ]
 },
 {
{
@@ -392,6 +392,41 @@
392392
" return torch.gather(x_permuted_rows, -1, index)"
393393
]
394394
},
395+
{
396+
"cell_type": "code",
397+
"execution_count": null,
398+
"metadata": {},
399+
"outputs": [
400+
{
401+
"data": {
402+
"text/markdown": "---\n\n[source](https://github.com/Bitbol-Lab/DiffPaSS/blob/main/diffpass/model.py#L49){target=\"_blank\" style=\"float:right; font-size:smaller\"}\n\n### GeneralizedPermutation\n\n> GeneralizedPermutation (group_sizes:collections.abc.Iterable[int], fixed_\n> pairings:Optional[collections.abc.Sequence[collec\n> tions.abc.Sequence[collections.abc.Sequence[int]]\n> ]]=None, tau:float=1.0, n_iter:int=1,\n> noise:bool=False, noise_factor:float=1.0,\n> noise_std:bool=False,\n> mode:Literal['soft','hard']='soft')\n\nGeneralized permutation layer implementing both soft and hard permutations.",
403+
"text/plain": [
404+
"---\n",
405+
"\n",
406+
"[source](https://github.com/Bitbol-Lab/DiffPaSS/blob/main/diffpass/model.py#L49){target=\"_blank\" style=\"float:right; font-size:smaller\"}\n",
407+
"\n",
408+
"### GeneralizedPermutation\n",
409+
"\n",
410+
"> GeneralizedPermutation (group_sizes:collections.abc.Iterable[int], fixed_\n",
411+
"> pairings:Optional[collections.abc.Sequence[collec\n",
412+
"> tions.abc.Sequence[collections.abc.Sequence[int]]\n",
413+
"> ]]=None, tau:float=1.0, n_iter:int=1,\n",
414+
"> noise:bool=False, noise_factor:float=1.0,\n",
415+
"> noise_std:bool=False,\n",
416+
"> mode:Literal['soft','hard']='soft')\n",
417+
"\n",
418+
"Generalized permutation layer implementing both soft and hard permutations."
419+
]
420+
},
421+
"execution_count": null,
422+
"metadata": {},
423+
"output_type": "execute_result"
424+
}
425+
],
426+
"source": [
427+
"show_doc(GeneralizedPermutation)"
428+
]
429+
},
395430
{
396431
"cell_type": "code",
397432
"execution_count": null,
@@ -449,41 +484,6 @@
 "test_batch_perm((2, 5, 4, 4))"
 ]
 },
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {},
-"outputs": [
-{
-"data": {
-"text/markdown": "---\n\n[source](https://github.com/Bitbol-Lab/DiffPaSS/blob/main/diffpass/model.py#L49){target=\"_blank\" style=\"float:right; font-size:smaller\"}\n\n### GeneralizedPermutation\n\n> GeneralizedPermutation (group_sizes:collections.abc.Iterable[int], fixed_\n> pairings:Optional[collections.abc.Sequence[collec\n> tions.abc.Sequence[collections.abc.Sequence[int]]\n> ]]=None, tau:float=1.0, n_iter:int=1,\n> noise:bool=False, noise_factor:float=1.0,\n> noise_std:bool=False,\n> mode:Literal['soft','hard']='soft')\n\nGeneralized permutation layer implementing both soft and hard permutations.",
-"text/plain": [
-"---\n",
-"\n",
-"[source](https://github.com/Bitbol-Lab/DiffPaSS/blob/main/diffpass/model.py#L49){target=\"_blank\" style=\"float:right; font-size:smaller\"}\n",
-"\n",
-"### GeneralizedPermutation\n",
-"\n",
-"> GeneralizedPermutation (group_sizes:collections.abc.Iterable[int], fixed_\n",
-"> pairings:Optional[collections.abc.Sequence[collec\n",
-"> tions.abc.Sequence[collections.abc.Sequence[int]]\n",
-"> ]]=None, tau:float=1.0, n_iter:int=1,\n",
-"> noise:bool=False, noise_factor:float=1.0,\n",
-"> noise_std:bool=False,\n",
-"> mode:Literal['soft','hard']='soft')\n",
-"\n",
-"Generalized permutation layer implementing both soft and hard permutations."
-]
-},
-"execution_count": null,
-"metadata": {},
-"output_type": "execute_result"
-}
-],
-"source": [
-"show_doc(GeneralizedPermutation)"
-]
-},
 {
 "cell_type": "markdown",
 "metadata": {},

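The `GeneralizedPermutation` signature moved by this diff exposes `tau`, `n_iter` and a `mode` of `'soft'` or `'hard'`. The diff does not show how those modes are computed, so the following is only a sketch of one common way to realize the idea: a Sinkhorn-style normalization for the soft case and an exhaustive search for the hard case. Both functions (`sinkhorn`, `hard_permutation`) are hypothetical pure-Python illustrations and are assumptions, not the code in `diffpass/model.py`.

```python
import itertools
import math

def sinkhorn(log_alpha, tau=1.0, n_iter=10):
    # Soft permutation (assumed approach): exponentiate the temperature-scaled
    # scores, then alternately normalize rows and columns so the matrix
    # approaches a doubly stochastic one.
    n = len(log_alpha)
    m = [[math.exp(x / tau) for x in row] for row in log_alpha]
    for _ in range(n_iter):
        m = [[x / sum(row) for x in row] for row in m]  # rows sum to 1
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]  # columns sum to 1
    return m

def hard_permutation(log_alpha):
    # Hard permutation by brute force: pick the assignment with the largest
    # total score. Feasible only for tiny matrices; shown for illustration.
    n = len(log_alpha)
    best = max(
        itertools.permutations(range(n)),
        key=lambda p: sum(log_alpha[i][p[i]] for i in range(n)),
    )
    return list(best)

# Made-up score matrix
log_alpha = [
    [2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0],
    [0.0, 1.0, 0.0],
]
soft = sinkhorn(log_alpha, tau=1.0, n_iter=50)
hard = hard_permutation(log_alpha)
row_argmax = [max(range(3), key=lambda j: row[j]) for row in soft]
```

In this sketch, lowering `tau` sharpens the soft matrix toward a single assignment per row, while raising `n_iter` brings the row and column sums closer to 1.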