|
22 | 22 | "A typical example of the problem DiffPaSS is designed to solve is the following: given two multiple sequence alignments (MSAs) A and B, containing interacting biological sequences, find the optimal one-to-one pairing between the sequences in A and B.\n", |
23 | 23 | "\n", |
24 | 24 | "<figure>\n", |
25 | | - " <img src=\"https://raw.githubusercontent.com/Bitbol-Lab/DiffPaSS/main/media/MSA_pairing_problem.svg\" alt=\"MSA pairing problem\" />\n", |
| 25 | + " <img src=\"https://raw.githubusercontent.com/Bitbol-Lab/DiffPaSS/main/media/MSA_pairing_problem.svg\" width=\"640\" height=\"201.6\" alt=\"MSA pairing problem\" />\n", |
26 | 26 | " <figcaption>Pairing problem for two multiple sequence alignments, where pairings are restricted to be within the same species</figcaption>\n", |
27 | 27 | "</figure>\n", |
28 | 28 | "\n", |
|
51 | 51 | " 4. A notion of \"robust pairs\" that can be used to identify pairs that are consistently found throughout a DiffPaSS bootstrap. These pairs can be used as ground truths in another DiffPaSS run, giving rise to the DiffPaSS-Iterative Pairing Algorithm (DiffPaSS-IPA).\n", |
52 | 52 | " \n", |
53 | 53 | "<figure>\n", |
54 | | - " <video src=\"https://raw.githubusercontent.com/Bitbol-Lab/DiffPaSS/main/media/DiffPaSS_bootstrap.mp4\" width=\"432\" height=\"243\" controls></video>\n", |
| 54 | + " <video src=\"https://github.com/Bitbol-Lab/DiffPaSS/assets/46537483/e411fe8c-2fed-4723-a25c-ff69a1abccec\" width=\"640\" height=\"360\" controls></video>\n", |
55 | 55 | " <figcaption>The DiffPaSS bootstrap technique and robust pairs</figcaption>\n", |
56 | 56 | "</figure>" |
57 | 57 | ] |
|
98 | 98 | "```python\n", |
99 | 99 | "from diffpass.msa_parsing import read_msa\n", |
100 | 100 | "\n", |
101 | | - "# Parse and one-hot encode the MSAs\n", |
102 | | - "msa_data_A = read_msa(\"path/to/msa_A.fasta\")\n", |
103 | | - "msa_data_B = read_msa(\"path/to/msa_B.fasta\")\n", |
| 101 | + "# Parse the MSAs into lists of tuples (header, sequence)\n", |
| 102 | + "msa_A = read_msa(\"path/to/msa_A.fasta\")\n", |
| 103 | + "msa_B = read_msa(\"path/to/msa_B.fasta\")\n", |
104 | 104 | "```\n", |
105 | 105 | "\n", |
106 | 106 | "We assume that the MSAs contain species information in the headers, which will be used to restrict the pairings to be within the same species (more generally, \"groups\"). We need a simple function to extract the species information from the headers. For instance, if the headers are in the format `>sequence_id|species_name|...`, we can use:\n", |
|
115 | 115 | "```python\n", |
116 | 116 | "from diffpass.data_utils import create_groupwise_seq_records\n", |
117 | 117 | "\n", |
118 | | - "msa_data_A_species_by_species = create_groupwise_seq_records(msa_data_A, species_name_func)\n", |
119 | | - "msa_data_B_species_by_species = create_groupwise_seq_records(msa_data_B, species_name_func)\n", |
| 118 | + "msa_A_by_sp = create_groupwise_seq_records(msa_A, species_name_func)\n", |
| 119 | + "msa_B_by_sp = create_groupwise_seq_records(msa_B, species_name_func)\n", |
120 | 120 | "```\n", |
121 | 121 | "\n", |
122 | 122 | "If one of the MSAs contains sequences from species not present in the other MSA, we can remove these species from both MSAs:\n", |
123 | 123 | "\n", |
124 | 124 | "```python\n", |
125 | 125 | "from diffpass.data_utils import remove_groups_not_in_both\n", |
126 | 126 | "\n", |
127 | | - "msa_data_A_species_by_species, msa_data_B_species_by_species = remove_groups_not_in_both(\n", |
128 | | - " msa_data_A_species_by_species, msa_data_B_species_by_species\n", |
| 127 | + "msa_A_by_sp, msa_B_by_sp = remove_groups_not_in_both(\n", |
| 128 | + " msa_A_by_sp, msa_B_by_sp\n", |
129 | 129 | ")\n", |
130 | 130 | "```\n", |
131 | 131 | "\n", |
|
134 | 134 | "```python\n", |
135 | 135 | "from diffpass.data_utils import pad_msas_with_dummy_sequences\n", |
136 | 136 | "\n", |
137 | | - "msa_data_A_species_by_species_padded, msa_data_B_species_by_species_padded = pad_msas_with_dummy_sequences(\n", |
138 | | - " msa_data_A_species_by_species, msa_data_B_species_by_species\n", |
| 137 | + "msa_A_by_sp_pad, msa_B_by_sp_pad = pad_msas_with_dummy_sequences(\n", |
| 138 | + " msa_A_by_sp, msa_B_by_sp\n", |
139 | 139 | ")\n", |
140 | 140 | "\n", |
141 | | - "species = list(msa_data_A_species_by_species_padded.keys())\n", |
142 | | - "species_sizes = list(map(len, msa_data_A_species_by_species_padded.values()))\n", |
| 141 | + "species = list(msa_A_by_sp_pad.keys())\n", |
| 142 | + "species_sizes = list(map(len, msa_A_by_sp_pad.values()))\n", |
143 | 143 | "```\n", |
144 | 144 | "\n", |
145 | 145 | "Next, one-hot encode the MSAs using the `one_hot_encode_msa` function.\n", |
|
150 | 150 | "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n", |
151 | 151 | "\n", |
152 | 152 | "# Unpack the padded MSAs into a list of records\n", |
153 | | - "msa_data_A_for_pairing = [record for records_this_species in msa_data_A_species_by_species_padded.values() for record in records_this_species]\n", |
154 | | - "msa_data_B_for_pairing = [record for records_this_species in msa_data_B_species_by_species_padded.values() for record in records_this_species]\n", |
| 153 | + "msa_A_for_pairing = [\n", |
| 154 | + " rec for recs_this_sp in msa_A_by_sp_pad.values() for rec in recs_this_sp\n", |
| 155 | + "]\n", |
| 156 | + "msa_B_for_pairing = [\n", |
| 157 | + " rec for recs_this_sp in msa_B_by_sp_pad.values() for rec in recs_this_sp\n", |
| 158 | + "]\n", |
155 | 159 | "\n", |
156 | 160 | "# One-hot encode the MSAs and load them to a device\n", |
157 | | - "msa_A_oh = one_hot_encode_msa(msa_data_A_for_pairing, device=device)\n", |
158 | | - "msa_B_oh = one_hot_encode_msa(msa_data_B_for_pairing, device=device)\n", |
| 161 | + "msa_A_oh = one_hot_encode_msa(msa_A_for_pairing, device=device)\n", |
| 162 | + "msa_B_oh = one_hot_encode_msa(msa_B_for_pairing, device=device)\n", |
159 | 163 | "```\n", |
160 | 164 | "\n", |
161 | 165 | "### Pairing optimization\n", |
162 | 166 | "\n", |
163 | | - "Finally, we can instantiate an `InformationPairing` object and optimize the mutual information between the paired MSAs using the DiffPaSS bootstrap algorithm. The results are stored in a `DiffPaSSResults` container. The lists of (hard) losses and permutations found can be accessed as attributes of the container.\n", |
| 167 | + "Finally, we can instantiate an `InformationPairing` object and optimize the mutual information between the paired MSAs using the DiffPaSS bootstrapped optimization algorithm. The results are stored in a `DiffPaSSResults` container. The lists of (hard) losses and permutations found during the optimization can be accessed as attributes of the container.\n", |
164 | 168 | "\n", |
165 | 169 | "```python\n", |
166 | 170 | "from diffpass.train import InformationPairing\n", |
|
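A note on the renamed variables in the flattening step: the nested comprehension that builds `msa_A_for_pairing` relies on the species-keyed dict keeping its insertion order, so the flat list is ordered consistently with `species` and `species_sizes`. Below is a minimal plain-Python sketch of that bookkeeping. It does not call DiffPaSS: `species_from_header` and `group_by_species` are hypothetical stand-ins (not the notebook's `species_name_func` or the library's `create_groupwise_seq_records`), the headers are made up, and the header layout `sequence_id|species_name|...` is assumed from the text. Padding is skipped because the toy groups already have equal sizes.

```python
from collections import defaultdict


def species_from_header(header):
    # Hypothetical helper: assumes headers look like "sequence_id|species_name|..."
    # (a leading ">" is tolerated in case the parser keeps it).
    return header.lstrip(">").split("|")[1]


def group_by_species(records):
    # Hypothetical stand-in for a groupwise record builder:
    # maps species name -> list of (header, sequence) tuples, in input order.
    groups = defaultdict(list)
    for header, seq in records:
        groups[species_from_header(header)].append((header, seq))
    return dict(groups)


# Toy MSAs as lists of (header, sequence) tuples; all values are invented.
msa_A = [
    ("a1|E_coli|x", "ACD"),
    ("a2|B_subtilis|x", "ACE"),
    ("a3|E_coli|x", "AGD"),
]
msa_B = [
    ("b1|E_coli|y", "KLM"),
    ("b2|E_coli|y", "KLN"),
    ("b3|B_subtilis|y", "RLM"),
    ("b4|V_cholerae|y", "KIM"),
]

groups_A = group_by_species(msa_A)
groups_B = group_by_species(msa_B)

# Keep only species present in both MSAs, in the same key order for A and B.
common = [sp for sp in groups_A if sp in groups_B]
groups_A = {sp: groups_A[sp] for sp in common}
groups_B = {sp: groups_B[sp] for sp in common}

species = list(groups_A.keys())
species_sizes = [len(groups_A[sp]) for sp in species]

# Same nested comprehension as in the diff: flatten species by species.
flat_A = [rec for recs_this_sp in groups_A.values() for rec in recs_this_sp]
flat_B = [rec for recs_this_sp in groups_B.values() for rec in recs_this_sp]
```

With these toy inputs, consecutive blocks of `flat_A` and `flat_B` of lengths given by `species_sizes` belong to the same species, which is the property the pairing step depends on.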