3. Download files
reMap requires a set of object files to run the core commands, along with test samples to train and predict pathways. The test samples can be used either to train or to test the reMap model. Please download these files from Zenodo. Once you have downloaded the reMap_materials.zip file, unzip it and make sure you obtain the two folders, model/ and dataset/, as depicted below:
Note: This directory tree was generated using the tree command in the terminal (on Linux) or in the command prompt (on Windows).
reMap_materials/
├── model/
│   ├── reMap.pkl
│   ├── leADS_D.pkl
│   ├── leADS_Dy.pkl
│   ├── leADS_F.pkl
│   ├── hin.pkl
│   ├── pathway2vec_embeddings.npz
│   ├── phi.npz
│   └── sigma.npz
└── dataset/
    ├── biocyc_X.pkl, biocyc_Xe.pkl, biocyc_B.pkl, biocyc_y.pkl, biocyc_y_abun.pkl, biocyc_M.pkl, biocyc_M_y_abun.pkl
    ├── golden_X.pkl, golden_Xe.pkl, golden_B.pkl, golden_y.pkl, golden_y_abun.pkl, golden_M.pkl, golden_M_y_abun.pkl
    ├── cami_X.pkl, cami_Xe.pkl, cami_B.pkl, cami_y.pkl, cami_y_abun.pkl, cami_M.pkl, cami_M_y_abun.pkl
    ├── centroid.npz
    ├── features.npz
    ├── rho.npz
    ├── pathway_group.pkl
    ├── idxvocab.pkl
    ├── vocab.pkl
    └── ...
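
The unzip-and-verify step above can also be scripted. The following is a minimal sketch using only the Python standard library; the function name extract_materials is hypothetical, and it assumes the archive either wraps its contents in a reMap_materials/ folder (as in the tree above) or places model/ and dataset/ at the top level.

```python
import zipfile
from pathlib import Path

EXPECTED_FOLDERS = ("model", "dataset")


def extract_materials(zip_path, dest_dir="."):
    """Unzip the archive and return the folder that holds model/ and dataset/."""
    dest = Path(dest_dir)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    # Assumption: the archive may or may not wrap its contents in reMap_materials/.
    for root in (dest / "reMap_materials", dest):
        if all((root / name).is_dir() for name in EXPECTED_FOLDERS):
            return root
    raise FileNotFoundError(f"model/ and dataset/ not found under {dest}")
```

Calling extract_materials("reMap_materials.zip") returns the directory containing both folders, or raises FileNotFoundError if either is missing after extraction.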
A short description of the contents of the above folders is given below.
The model/ folder provides pre-trained models for predicting metabolic pathways using the datasets described in the dataset/ section.
File | Description | Size |
---|---|---|
reMap.pkl | A pretrained model generated using biocyc_Xe.pkl and biocyc_y.pkl data. This model was trained using SOAP with supplementary pathway information. | 70.5MB |
leADS_D.pkl | A pretrained model generated using biocyc_Xe.pkl, biocyc_y.pkl, and biocyc_B.pkl data with nPSP (k=50), ensemble size 10, and per% = 70%. This model was trained using the groups approach. | 730MB |
leADS_Dy.pkl | A pretrained model generated using biocyc_Xe.pkl, biocyc_y.pkl, and biocyc_B.pkl data with ensemble size 10. This model was trained using the class-labels (pathways) approach. | 728MB |
leADS_F.pkl | A pretrained model generated using biocyc_Xe.pkl, biocyc_y.pkl, and biocyc_B.pkl data with nPSP (k=50), ensemble size 10, and per% = 70%. This model was trained using the class-labels (pathways) approach. | 785MB |
hin.pkl | A sample heterogeneous information network. | 10.0MB |
pathway2vec_embeddings.npz | A matrix file containing a sample of embeddings using RUST-norm. The rows (22593) correspond to the pathway, enzyme, and compound embeddings and the columns (128) represent the features. These features can be generated using pathway2vec. | 11.0MB |
sigma.npz | A matrix file representing the group-group covariance of size 200. This data was obtained using SOAP with supplementary pathway information. | 312KB |
phi.npz | A matrix file representing the distribution of pathways over groups. The rows (200) correspond to the group indices and columns (2526) represent the pathway indices. This data was obtained using SOAP with supplementary pathway information. | 3.85MB |
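
To check that a downloaded .npz file matches the dimensions listed in the table, you can list the arrays it stores. This is a minimal sketch that assumes NumPy is installed; describe_npz is a hypothetical helper, not part of reMap. How the files were written is not stated here: if they were saved with scipy.sparse.save_npz, the listing will show the sparse components (such as data, indices, and indptr) rather than a single dense array, and scipy.sparse.load_npz would be the more suitable reader.

```python
import numpy as np


def describe_npz(path):
    """Return {array_name: shape} for every array stored in an .npz archive."""
    with np.load(path) as archive:
        return {name: archive[name].shape for name in archive.files}
```

For example, describe_npz("reMap_materials/model/phi.npz") reports the stored array names and their shapes, which can be compared against the 200 x 2526 size given above.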
Here, we show you a visual depiction of some of the object files to help deepen your understanding.
The pathway2vec_embeddings.npz file is a matrix file corresponding to the embeddings of pathways, EC numbers, and compounds. These features are generated using pathway2vec. For example, after adding the pathway names and EC numbers from "biocyc.pkl" as the first column and excluding compounds, the table looks like this:
Pathway and EC | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|
L-valine biosynthesis | 0.089106 | 0.092924 | 0.089035 | 0.101823 | 0.072792 | 0.083173 | 0.096259 | 0.064823 | 0.071481 | 0.094392 |
methylquercetin biosynthesis | 0.112329 | 0.075717 | 0.087717 | 0.094391 | 0.081035 | 0.074514 | 0.095572 | 0.072581 | 0.068458 | 0.096449 |
cyanide degradation | 0.073566 | 0.094817 | 0.087664 | 0.099661 | 0.089182 | 0.103727 | 0.093147 | 0.093047 | 0.083330 | 0.095017 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
EC-1.1.1.10 | 0.095318 | 0.094138 | 0.097567 | 0.087115 | 0.084483 | 0.098668 | 0.078173 | 0.091465 | 0.086675 | 0.086497 |
EC-1.1.1.100 | 0.047987 | 0.096748 | 0.092529 | 0.092395 | 0.116745 | 0.092556 | 0.106274 | 0.107414 | 0.079025 | 0.098948 |
EC-1.1.1.101 | 0.090137 | 0.085566 | 0.087589 | 0.089496 | 0.082936 | 0.088855 | 0.083835 | 0.091411 | 0.085721 | 0.090588 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
The phi.npz file is a matrix file corresponding to the distribution of pathways over groups. Rows correspond to group indices and columns represent pathway indices. For example, the table looks like this:
Pathway Group Indices | 5-aminoimidazole ribonucleotide biosynthesis II | vitamin E biosynthesis (tocopherols) | spermine and spermidine degradation III | biotin biosynthesis from 8-amino-7-oxononanoate I | mixed acid fermentation | L-glutamate degradation II | chlorosalicylate degradation | L-malate degradation II | pyruvate fermentation to acetate II | acetoin degradation |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1.429221e-07 | 3.524164e-02 | 1.607106e-07 | 1.533687e-07 | 1.528739e-07 | 1.512877e-07 | 1.170707e-07 | 1.470176e-01 | 1.524868e-07 | 1.453987e-07 |
1 | 6.455780e-08 | 6.944885e-08 | 6.996916e-08 | 5.668292e-01 | 5.653213e-08 | 5.686886e-08 | 6.443087e-08 | 6.507242e-08 | 3.832408e-01 | 6.670976e-08 |
2 | 1.367094e-07 | 1.453373e-07 | 1.535185e-07 | 1.362638e-07 | 1.316325e-07 | 1.209321e-07 | 1.535094e-07 | 1.433702e-07 | 1.200712e-07 | 4.520255e-01 |
3 | 6.353614e-01 | 3.844192e-08 | 3.191540e-08 | 3.330884e-08 | 3.522579e-01 | 3.217529e-08 | 3.045689e-08 | 3.111319e-08 | 2.996089e-08 | 3.107714e-08 |
4 | 7.128483e-08 | 7.209904e-08 | 6.580494e-08 | 7.720455e-08 | 6.120852e-08 | 6.733139e-01 | 7.705214e-08 | 7.127394e-08 | 7.585665e-08 | 7.150897e-08 |
For example, the 5-aminoimidazole ribonucleotide biosynthesis II pathway has a high weight (0.64) in the group with index 3.
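
The same reading can be done programmatically by taking the largest entry in each column. The sketch below assumes NumPy and uses four columns copied from the excerpt above; because the excerpt covers only 5 of the 200 groups, the result describes the excerpt, not the full phi.npz matrix. The function name dominant_groups is hypothetical.

```python
import numpy as np

pathways = [
    "5-aminoimidazole ribonucleotide biosynthesis II",
    "biotin biosynthesis from 8-amino-7-oxononanoate I",
    "L-glutamate degradation II",
    "acetoin degradation",
]
# Rows are groups 0-4; columns follow `pathways` (values copied from the table).
phi_excerpt = np.array([
    [1.429221e-07, 1.533687e-07, 1.512877e-07, 1.453987e-07],
    [6.455780e-08, 5.668292e-01, 5.686886e-08, 6.670976e-08],
    [1.367094e-07, 1.362638e-07, 1.209321e-07, 4.520255e-01],
    [6.353614e-01, 3.330884e-08, 3.217529e-08, 3.107714e-08],
    [7.128483e-08, 7.720455e-08, 6.733139e-01, 7.150897e-08],
])


def dominant_groups(phi, names):
    """For each pathway (column), return the row index with the largest weight."""
    top = phi.argmax(axis=0)
    return {
        name: (int(group), float(phi[group, col]))
        for col, (name, group) in enumerate(zip(names, top))
    }


for name, (group, weight) in dominant_groups(phi_excerpt, pathways).items():
    print(f"{name}: group {group} (weight {weight:.2f})")
```

The first pathway is assigned to group 3 with a weight of about 0.64, which matches the value read from the table.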
In the dataset/ folder, 20 data files are provided for training, prediction, and evaluation of metabolic pathways, using either the pre-trained reMap model (e.g., "reMap.pkl") or a newly trained model. The data are categorized into the following three types: 1) pathway training data, 2) pathway test data, and 3) other necessary data items.
The following files can be used to train reMap. BioCyc tier 2 and 3 PGDBs were processed using prepBioCyc.
File | Description | Size |
---|---|---|
biocyc_X.pkl | A matrix file of 9257 organisms whose information is extracted from BioCyc (v20.5) tier 2 and 3 PGDBs. The columns (3650) represent EC number indices, filled with integer values indicating the abundance of each EC number in that organism. | 25.4MB |
biocyc_Xe.pkl | A matrix file of 9257 organisms whose information is extracted from BioCyc (v20.5) tier 2 and 3 PGDBs. The columns (3778) represent EC number indices and embeddings, holding the EC number abundances and embedding values for that organism. | 74.8MB |
biocyc_B.pkl | A +1/-1 matrix indicating the presence/absence of group indices (200 entries) for each of the 9257 organisms. | 7.2MB |
biocyc_y.pkl | A binary matrix indicating the presence/absence of pathway indices (2526 entries) for each of the 9257 organisms. | 68.5MB |
biocyc_y_abun.pkl | An abundance matrix indicating the frequency of pathway indices (2526 entries) for each of the 9257 organisms. | 20.1MB |
biocyc_M.pkl | A binary matrix indicating the presence/absence of pathway indices (2526 entries) for each of the 9257 organisms. This file is generated using mlLGPR. | 29.5MB |
biocyc_M_y_abun.pkl | An abundance matrix indicating the frequency of pathway indices (2526 entries) for each of the 9257 organisms. This file is generated using mlLGPR. | 28.5MB |
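
Before training, it can help to confirm that the loaded matrices agree with the dimensions in the table. The sketch below uses only the standard library and is not part of reMap: load_pickle and check_training_shapes are hypothetical helpers, and the code assumes each unpickled object exposes a .shape attribute (as NumPy arrays and SciPy sparse matrices do). Unpickling may require the library that created the object to be installed.

```python
import pickle


def load_pickle(path):
    """Unpickle one object from disk (only use this on files you trust)."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def check_training_shapes(X, B, y, n_groups=200, n_pathways=2526):
    """Return a list of shape problems; an empty list means everything lines up."""
    problems = []
    n_samples = X.shape[0]  # one row per organism
    if tuple(B.shape) != (n_samples, n_groups):
        problems.append(
            f"B has shape {tuple(B.shape)}, expected {(n_samples, n_groups)}"
        )
    if tuple(y.shape) != (n_samples, n_pathways):
        problems.append(
            f"y has shape {tuple(y.shape)}, expected {(n_samples, n_pathways)}"
        )
    return problems
```

A typical call would load biocyc_X.pkl, biocyc_B.pkl, and biocyc_y.pkl from reMap_materials/dataset/ with load_pickle and pass the three objects to check_training_shapes; an empty list indicates that the row counts match and that B and y have 200 and 2526 columns.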
The following data can be used for pathway prediction and for evaluating the pre-trained reMap model. Please see the mlLGPR repository and Advanced usage for how to obtain and preprocess the data below.
Files | Description | Size |
---|---|---|
golden_X.pkl, golden_Xe.pkl, golden_B.pkl, golden_y.pkl, golden_y_abun.pkl, golden_M.pkl, golden_M_abun.pkl | This is the Golden dataset in a matrix format where rows correspond to AraCyc, EcoCyc, HumanCyc, LeishCyc, TrypanoCyc, and YeastCyc, respectively. Columns for "*X.pkl", "*Xe.pkl", "*B.pkl", and "*y.pkl" (or "*abun.pkl") correspond to 3650 EC number indices, 3778 EC number indices and embeddings, 200 group indices, and 2526 pathway indices, respectively. | 761KB |
cami_X.pkl, cami_Xe.pkl, cami_B.pkl, cami_y.pkl, cami_y_abun.pkl, cami_M.pkl, cami_M_abun.pkl | These files correspond to the CAMI low complexity data, with the rows representing 40 species. Columns for "*X.pkl", "*Xe.pkl", "*B.pkl", and "*y.pkl" (or "*abun.pkl") correspond to 3650 EC number indices, 3778 EC number indices and embeddings, 200 group indices, and 2526 pathway indices, respectively. | 228KB |
reMap requires additional data items for training and transformation.
Files | Description | Size |
---|---|---|
centroid.npz | A matrix file representing the centroids of the 200 groups, each of size 128. | 79.1KB |
features.npz | A matrix file representing pathway features. It contains 2526 pathway indices shown in the first column and their 128 features in the remaining columns. The file representation is similar to pathway2vec_embeddings.npz. | 1.23MB |
rho.npz | A matrix file representing the group-group correlations of size 200. | 312KB |
pathway_group.pkl | A binary matrix indicating the association of group indices (200 entries) in rows to pathway indices (2526 entries) in columns. | 3.85MB |
idxvocab.pkl | A file representing the pathway indices. | 19.8KB |
vocab.pkl | A dictionary file representing pathway indices as keys and MetaCyc pathway ids as values. | 52.5KB |
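
As an illustration of how vocab.pkl can be used, the sketch below translates one row of a presence/absence matrix into pathway ids. It assumes the dictionary keys are integer column positions that line up with the columns of the pathway matrices; the function name present_pathway_ids, the toy dictionary, and the ids "PWY-A", "PWY-B", and "PWY-C" are placeholders, not real MetaCyc entries.

```python
def present_pathway_ids(y_row, vocab):
    """Map the non-zero positions of one presence/absence row to pathway ids."""
    return [vocab[idx] for idx, flag in enumerate(y_row) if flag]


# Toy stand-in for the dictionary stored in vocab.pkl (placeholder ids).
toy_vocab = {0: "PWY-A", 1: "PWY-B", 2: "PWY-C"}
toy_row = [1, 0, 1]  # pathways at positions 0 and 2 are present
print(present_pathway_ids(toy_row, toy_vocab))
```

With the real files, the dictionary would come from unpickling vocab.pkl and the row from one organism in a matrix such as biocyc_y.pkl, converted to a dense sequence if it is stored in sparse form.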