|
22 | 22 | "A typical example of the problem DiffPaSS is designed to solve is the following: given two multiple sequence alignments (MSAs) A and B, containing interacting biological sequences, find the optimal one-to-one pairing between the sequences in A and B.\n",
|
23 | 23 | "\n",
|
24 | 24 | "<figure>\n",
|
25 |
| - " <img src=\"https://raw.githubusercontent.com/Bitbol-Lab/DiffPaSS/main/media/MSA_pairing_problem.svg\" alt=\"MSA pairing problem\" />\n", |
| 25 | + " <img src=\"https://raw.githubusercontent.com/Bitbol-Lab/DiffPaSS/main/media/MSA_pairing_problem.svg\" width=\"640\" height=\"201.6\" alt=\"MSA pairing problem\" />\n", |
26 | 26 | " <figcaption>Pairing problem for two multiple sequence alignments, where pairings are restricted to be within the same species</figcaption>\n",
|
27 | 27 | "</figure>\n",
|
28 | 28 | "\n",
|
|
51 | 51 | " 4. A notion of \"robust pairs\" that can be used to identify pairs that are consistently found throughout a DiffPaSS bootstrap. These pairs can be used as ground truths in another DiffPaSS run, giving rise to the DiffPaSS-Iterative Pairing Algorithm (DiffPaSS-IPA).\n",
|
52 | 52 | " \n",
|
53 | 53 | "<figure>\n",
|
54 |
| - " <video src=\"https://raw.githubusercontent.com/Bitbol-Lab/DiffPaSS/main/media/DiffPaSS_bootstrap.mp4\" width=\"432\" height=\"243\" controls></video>\n", |
| 54 | + " <video src=\"https://github.com/Bitbol-Lab/DiffPaSS/assets/46537483/e411fe8c-2fed-4723-a25c-ff69a1abccec\" width=\"640\" height=\"360\" controls></video>\n", |
55 | 55 | " <figcaption>The DiffPaSS bootstrap technique and robust pairs</figcaption>\n",
|
56 | 56 | "</figure>"
|
57 | 57 | ]
|
|
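The notion of "robust pairs" in item 4 can be pictured with a small stand-alone sketch. It is not the DiffPaSS implementation: the function name `robust_pairs`, the representation of a pairing as a list of `(index_in_A, index_in_B)` tuples, and the `min_fraction` threshold are all assumptions made for illustration. The criterion DiffPaSS actually uses is not shown in this diff.

```python
from collections import Counter


def robust_pairs(runs, min_fraction=1.0):
    """Return the pairs present in at least `min_fraction` of the runs.

    `runs` is a list of pairings, each a list of (index_in_A, index_in_B)
    tuples (a hypothetical representation chosen for this sketch).
    """
    # Count in how many runs each pair occurs (set() guards against repeats)
    counts = Counter(pair for run in runs for pair in set(run))
    threshold = min_fraction * len(runs)
    return sorted(pair for pair, n in counts.items() if n >= threshold)


runs = [
    [(0, 0), (1, 2), (2, 1)],
    [(0, 0), (1, 1), (2, 2)],
    [(0, 0), (1, 2), (2, 1)],
]
print(robust_pairs(runs))       # [(0, 0)]
print(robust_pairs(runs, 0.6))  # [(0, 0), (1, 2), (2, 1)]
```

With the default threshold only pairs found in every run survive. Lowering `min_fraction` trades strictness for a larger set of candidate pairs.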
98 | 98 | "```python\n",
|
99 | 99 | "from diffpass.msa_parsing import read_msa\n",
|
100 | 100 | "\n",
|
101 |
| - "# Parse and one-hot encode the MSAs\n", |
102 |
| - "msa_data_A = read_msa(\"path/to/msa_A.fasta\")\n", |
103 |
| - "msa_data_B = read_msa(\"path/to/msa_B.fasta\")\n", |
| 101 | + "# Parse the MSAs into lists of tuples (header, sequence)\n", |
| 102 | + "msa_A = read_msa(\"path/to/msa_A.fasta\")\n", |
| 103 | + "msa_B = read_msa(\"path/to/msa_B.fasta\")\n", |
104 | 104 | "```\n",
|
105 | 105 | "\n",
|
106 | 106 | "We assume that the MSAs contain species information in the headers, which will be used to restrict the pairings to be within the same species (more generally, \"groups\"). We need a simple function to extract the species information from the headers. For instance, if the headers are in the format `>sequence_id|species_name|...`, we can use:\n",
|
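The hunk above ends before the helper itself is shown. A minimal sketch of such a function, assuming headers follow the `>sequence_id|species_name|...` format exactly, could look as follows. The name `species_name_func` is the one used in the later hunks, but the body is an assumption and the notebook's own definition may differ.

```python
def species_name_func(header: str) -> str:
    """Return the second '|'-separated field of a FASTA header.

    Assumes the header looks like '>sequence_id|species_name|...'.
    """
    return header.split("|")[1]


print(species_name_func(">seq_001|Escherichia_coli|extra"))  # Escherichia_coli
```

Because the species name is the second field, this works whether or not the parser keeps the leading `>` in the header.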
|
115 | 115 | "```python\n",
|
116 | 116 | "from diffpass.data_utils import create_groupwise_seq_records\n",
|
117 | 117 | "\n",
|
118 |
| - "msa_data_A_species_by_species = create_groupwise_seq_records(msa_data_A, species_name_func)\n", |
119 |
| - "msa_data_B_species_by_species = create_groupwise_seq_records(msa_data_B, species_name_func)\n", |
| 118 | + "msa_A_by_sp = create_groupwise_seq_records(msa_A, species_name_func)\n", |
| 119 | + "msa_B_by_sp = create_groupwise_seq_records(msa_B, species_name_func)\n", |
120 | 120 | "```\n",
|
121 | 121 | "\n",
|
122 | 122 | "If one of the MSAs contains sequences from species not present in the other MSA, we can remove these species from both MSAs:\n",
|
123 | 123 | "\n",
|
124 | 124 | "```python\n",
|
125 | 125 | "from diffpass.data_utils import remove_groups_not_in_both\n",
|
126 | 126 | "\n",
|
127 |
| - "msa_data_A_species_by_species, msa_data_B_species_by_species = remove_groups_not_in_both(\n", |
128 |
| - " msa_data_A_species_by_species, msa_data_B_species_by_species\n", |
| 127 | + "msa_A_by_sp, msa_B_by_sp = remove_groups_not_in_both(\n", |
| 128 | + " msa_A_by_sp, msa_B_by_sp\n", |
129 | 129 | ")\n",
|
130 | 130 | "```\n",
|
131 | 131 | "\n",
|
|
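To make the two steps above concrete, here is a plain-Python sketch of what grouping by species and dropping unshared species amount to. `group_records` and `keep_shared_groups` are hypothetical stand-ins written for this illustration; they are not the `diffpass.data_utils` functions, whose return types and key ordering may differ.

```python
from collections import defaultdict


def group_records(records, key_func):
    """Group (header, sequence) tuples by key_func(header), keeping input order."""
    groups = defaultdict(list)
    for header, seq in records:
        groups[key_func(header)].append((header, seq))
    return dict(groups)


def keep_shared_groups(groups_a, groups_b):
    """Keep only the groups present in both dicts, in the key order of groups_a."""
    shared = [key for key in groups_a if key in groups_b]
    return (
        {key: groups_a[key] for key in shared},
        {key: groups_b[key] for key in shared},
    )


def by_species(header):
    # Toy header format: '>sequence_id|species_name'
    return header.split("|")[1]


toy_A = [(">a1|sp1", "ACD"), (">a2|sp1", "ACE"), (">a3|sp2", "AFD")]
toy_B = [(">b1|sp1", "KLM"), (">b2|sp3", "KLN")]

groups_A, groups_B = keep_shared_groups(
    group_records(toy_A, by_species), group_records(toy_B, by_species)
)
print(list(groups_A))        # ['sp1']
print(len(groups_A["sp1"]))  # 2
print(len(groups_B["sp1"]))  # 1
```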
134 | 134 | "```python\n",
|
135 | 135 | "from diffpass.data_utils import pad_msas_with_dummy_sequences\n",
|
136 | 136 | "\n",
|
137 |
| - "msa_data_A_species_by_species_padded, msa_data_B_species_by_species_padded = pad_msas_with_dummy_sequences(\n", |
138 |
| - " msa_data_A_species_by_species, msa_data_B_species_by_species\n", |
| 137 | + "msa_A_by_sp_pad, msa_B_by_sp_pad = pad_msas_with_dummy_sequences(\n", |
| 138 | + " msa_A_by_sp, msa_B_by_sp\n", |
139 | 139 | ")\n",
|
140 | 140 | "\n",
|
141 |
| - "species = list(msa_data_A_species_by_species_padded.keys())\n", |
142 |
| - "species_sizes = list(map(len, msa_data_A_species_by_species_padded.values()))\n", |
| 141 | + "species = list(msa_A_by_sp_pad.keys())\n", |
| 142 | + "species_sizes = list(map(len, msa_A_by_sp_pad.values()))\n", |
143 | 143 | "```\n",
|
144 | 144 | "\n",
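The padding step can be sketched in plain Python as well. This is a guess at the general idea rather than the library's behaviour: `pad_groups` is a hypothetical helper, and the all-gap dummy sequences and the `"dummy"` header are assumptions. It also assumes both inputs share the same species keys, every group is non-empty, and all sequences within one MSA have the same length.

```python
def pad_groups(groups_a, groups_b, dummy_header="dummy", gap="-"):
    """Pad each species group so both MSAs have equally many records per species."""
    # Alignment lengths, read from the first record of each MSA
    len_a = len(next(iter(groups_a.values()))[0][1])
    len_b = len(next(iter(groups_b.values()))[0][1])
    padded_a, padded_b = {}, {}
    for sp in groups_a:
        size = max(len(groups_a[sp]), len(groups_b[sp]))
        # Append all-gap dummy records to the smaller group (assumed content)
        padded_a[sp] = groups_a[sp] + [(dummy_header, gap * len_a)] * (
            size - len(groups_a[sp])
        )
        padded_b[sp] = groups_b[sp] + [(dummy_header, gap * len_b)] * (
            size - len(groups_b[sp])
        )
    return padded_a, padded_b


toy_groups_A = {"sp1": [("a1", "ACD"), ("a2", "ACE")], "sp2": [("a3", "AFD")]}
toy_groups_B = {"sp1": [("b1", "KLMN")], "sp2": [("b2", "KLNN"), ("b3", "KQMN")]}

pad_A, pad_B = pad_groups(toy_groups_A, toy_groups_B)
print(list(map(len, pad_A.values())))  # [2, 2]
print(pad_B["sp1"][-1])                # ('dummy', '----')
```

After padding, the per-species sizes agree between the two MSAs, which is why `species_sizes` above can be read from one of them only.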
|
145 | 145 | "Next, one-hot encode the MSAs using the `one_hot_encode_msa` function.\n",
|
|
150 | 150 | "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
|
151 | 151 | "\n",
|
152 | 152 | "# Unpack the padded MSAs into a list of records\n",
|
153 |
| - "msa_data_A_for_pairing = [record for records_this_species in msa_data_A_species_by_species_padded.values() for record in records_this_species]\n", |
154 |
| - "msa_data_B_for_pairing = [record for records_this_species in msa_data_B_species_by_species_padded.values() for record in records_this_species]\n", |
| 153 | + "msa_A_for_pairing = [\n", |
| 154 | + " rec for recs_this_sp in msa_A_by_sp_pad.values() for rec in recs_this_sp\n", |
| 155 | + "]\n", |
| 156 | + "msa_B_for_pairing = [\n", |
| 157 | + " rec for recs_this_sp in msa_B_by_sp_pad.values() for rec in recs_this_sp\n", |
| 158 | + "]\n", |
155 | 159 | "\n",
|
156 | 160 | "# One-hot encode the MSAs and load them to a device\n",
|
157 |
| - "msa_A_oh = one_hot_encode_msa(msa_data_A_for_pairing, device=device)\n", |
158 |
| - "msa_B_oh = one_hot_encode_msa(msa_data_B_for_pairing, device=device)\n", |
| 161 | + "msa_A_oh = one_hot_encode_msa(msa_A_for_pairing, device=device)\n", |
| 162 | + "msa_B_oh = one_hot_encode_msa(msa_B_for_pairing, device=device)\n", |
159 | 163 | "```\n",
|
160 | 164 | "\n",
|
161 | 165 | "### Pairing optimization\n",
|
162 | 166 | "\n",
|
163 |
| - "Finally, we can instantiate an `InformationPairing` object and optimize the mutual information between the paired MSAs using the DiffPaSS bootstrap algorithm. The results are stored in a `DiffPaSSResults` container. The lists of (hard) losses and permutations found can be accessed as attributes of the container.\n", |
| 167 | + "Finally, we can instantiate an `InformationPairing` object and optimize the mutual information between the paired MSAs using the DiffPaSS bootstrapped optimization algorithm. The results are stored in a `DiffPaSSResults` container. The lists of (hard) losses and permutations found during the optimization can be accessed as attributes of the container.\n", |
164 | 168 | "\n",
|
165 | 169 | "```python\n",
|
166 | 170 | "from diffpass.train import InformationPairing\n",
|
|