MassIVE-KB Download and Preprocess Workflow #335
-
Hi, I said this in my recent issue, but thank you again for developing this repository. It's exciting to study! I'm looking into using the MassIVE-KB dataset to train Casanovo from scratch, effectively replicating the original training of the best model. Following your directions in the FAQ section, I was able to download the reference table that contains the file names, scan numbers, and other metadata for the peptides in that dataset, though it took longer than expected. That's quite the file. :) I've been looking over the expected workflow for compiling a training/validation dataset, and it looks like the following needs to occur:
Regarding the download process, I haven't found an easy way to automate downloading those files. The information under the I've sketched out a workflow for doing so by hand, and it's not too bad, but I thought I'd ask if I was missing an obvious automation strategy before diving in headfirst. As for number 2, did you effectively loop over the files, extract the desired scans, add the peptide sequence information, and save them as one or several MGF files in their final form, prepped and ready for Casanovo input? Or is there a way to input the downloaded files directly without making an intermediary training file? As with the download step, I suspect this needs to be done largely by hand, but I thought I'd ask in case I was missing a ready-made workflow or script that automates any part of the process.
Replies: 5 comments
-
Hi Caleb, please refer to issue #324 to directly download the MassIVE-KB data used for Casanovo training from a temporary URL. We will update the FAQ when we find a permanent home for this dataset.
-
Perfect, thank you!
-
I am curious, though: was my description of the process generally on point? I'd be interested to know in case I need to gather a similar dataset in the future.
-
Yes, your description of the process is largely correct. The MassIVE-KB dataset was initially compiled for the GLEAMS paper. Downloading the files was automated:
Note that some of these URLs might no longer work out of the box due to internal changes at MassIVE, but the code should still provide a good starting point. For the second part, all of the peak files were indeed read and the relevant spectra were extracted into a single large MGF file (available through the link Melih shared). Because there are so many files and spectra, this takes some time as well. We are working on a full description of this process to extend our documentation and to provide this dataset in an easier-to-use format, which should be available within the next few weeks.
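For anyone gathering a similar dataset, the extraction step described above can be sketched roughly as follows. This is a minimal illustration and not the code that was used for GLEAMS or Casanovo. The column names (`filename`, `scan`, `annotation`, `charge`), the tab-separated layout, and the `read_spectra` callable are assumptions made for the example; in practice `read_spectra` would wrap whatever mzML/mzXML reader you use. The peptide is written to a `SEQ=` field on the assumption that this is where your Casanovo version expects annotations.

```python
import csv
import io
from collections import defaultdict


def load_reference_table(handle, file_col="filename", scan_col="scan",
                         peptide_col="annotation", charge_col="charge"):
    """Group the wanted scans by peak file: {filename: {scan: (peptide, charge)}}.

    The column names are assumptions; adjust them to the actual table header.
    """
    wanted = defaultdict(dict)
    for row in csv.DictReader(handle, delimiter="\t"):
        wanted[row[file_col]][int(row[scan_col])] = (
            row[peptide_col], int(row[charge_col]))
    return wanted


def format_mgf_entry(title, peptide, charge, precursor_mz, peaks):
    """Format one annotated spectrum as an MGF block."""
    lines = [
        "BEGIN IONS",
        f"TITLE={title}",
        f"PEPMASS={precursor_mz:.4f}",
        f"CHARGE={charge}+",
        f"SEQ={peptide}",
    ]
    lines += [f"{mz:.4f} {intensity:.1f}" for mz, intensity in peaks]
    lines.append("END IONS")
    return "\n".join(lines) + "\n"


def extract_to_mgf(wanted, read_spectra, out):
    """Write every wanted scan to `out` and return the number written.

    `read_spectra(filename)` is a hypothetical callable that yields
    (scan, precursor_mz, peaks) tuples for one peak file.
    """
    written = 0
    for filename, scans in wanted.items():
        for scan, precursor_mz, peaks in read_spectra(filename):
            if scan not in scans:
                continue  # spectrum is not listed in the reference table
            peptide, charge = scans[scan]
            title = f"{filename}:scan:{scan}"
            out.write(format_mgf_entry(title, peptide, charge,
                                       precursor_mz, peaks))
            out.write("\n")  # blank line between entries
            written += 1
    return written


if __name__ == "__main__":
    # Tiny in-memory demo with made-up values.
    table = io.StringIO(
        "filename\tscan\tannotation\tcharge\n"
        "run1.mzML\t2\tPEPTIDE\t2\n")

    def demo_reader(filename):
        yield 1, 400.0, [(100.0, 1.0)]
        yield 2, 400.2, [(101.5, 10.0), (202.25, 5.0)]

    mgf = io.StringIO()
    extract_to_mgf(load_reference_table(table), demo_reader, mgf)
    print(mgf.getvalue())
```

Grouping the table by file first means each peak file only has to be opened and read once, which matters when there are many files and spectra.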
-
Thank you!