MassIVE-KB Download and Preprocess Workflow #335

CCranney · 2024-05-15T20:31:55Z

CCranney
May 15, 2024

Hi,

I said this in my recent issue, but thank you again for developing this repository. It's exciting to study!

I'm looking into using the MassIVE-KB dataset to train Casanovo from scratch, effectively replicating the original training of the best model. Following your directions in the FAQ section, I was able to download the reference table that contained file names, scan numbers, and other metadata regarding the peptides of that dataset. Though it took longer than expected - that's quite the file. :)

I've been looking over the expected workflow for compiling a training/validation dataset, and it looks like the following needs to occur:

1. Download the given files from MassIVE
2. Extract the necessary scans and format them so that Casanovo can use them as a training reference (including inserting the peptide sequence as metadata)

Regarding the download process, I haven't found an easy way to automate downloading those files. The filename column of the reference table does not appear to contain enough information to identify the FTP link automatically. Am I missing something, or was this dataset downloaded effectively by hand?

I've sketched out a workflow for doing so by hand, and it's not too bad, but thought I'd ask if I was missing an obvious automation strategy before diving in head first.

As for number 2, did you effectively loop over the files, extract the desired scans, add the peptide sequence information, and save them as one or several MGF files in their final form, ready for Casanovo input? Or is there a way to input the downloaded files directly without making an intermediary training file?

Like the above, I suspect this needs to be done largely by hand, but thought I'd ask in case I was missing a ready-made workflow or script that automates any part of the process.

Answered by melihyilmaz

melihyilmaz · 2024-05-15T21:23:24Z

melihyilmaz
May 15, 2024
Maintainer

Hi Caleb,

Please refer to issue #324 to directly download the MassIVE-KB data used for Casanovo training from a temporary URL. We will update the FAQ when we find a permanent home for this dataset.

CCranney · 2024-05-15T21:29:30Z

CCranney
May 15, 2024
Author

Perfect, thank you!

CCranney · 2024-05-15T23:34:37Z

CCranney
May 15, 2024
Author

I am curious, though - was my description of the process generally on point? I'd be interested to know should I need to gather a similar dataset in the future.

bittremieux · 2024-05-16T05:55:21Z

bittremieux
May 16, 2024
Maintainer

Yes, your description of the process is largely correct.

The MassIVE-KB dataset was initially compiled for the GLEAMS paper. Downloading the files was automated:

- Here is some code to download the peak (mzML and mzXML) files. Alternatively, it should be possible to script this downloading process using ppx.
- Here is a notebook that downloads all mzTab files with the spectrum assignments.

Note that some of these URLs might no longer be functional directly out of the box due to some internal changes at MassIVE, but the code should still provide a good starting point.
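As a rough illustration of the ppx route mentioned above, here is a minimal sketch and not the code used for GLEAMS. It assumes the MassIVE accession (of the form MSV followed by digits) of each dataset is already known; the helper names `select_peak_files` and `download_peak_files` are made up for this example, and ppx is a third-party package that has to be installed separately.

```python
from pathlib import PurePosixPath

# Extensions treated as peak files, compared case-insensitively.
PEAK_EXTENSIONS = (".mzml", ".mzxml")


def select_peak_files(remote_files, extensions=PEAK_EXTENSIONS):
    """Keep only the remote paths whose extension marks a peak file."""
    return [
        path
        for path in remote_files
        if PurePosixPath(path).suffix.lower() in extensions
    ]


def download_peak_files(accession, local_dir=None):
    """Download every mzML/mzXML file of one MassIVE dataset via ppx.

    `accession` is a MassIVE identifier supplied by the caller.
    Returns the local paths reported by ppx.
    """
    import ppx  # imported lazily so select_peak_files works without ppx

    project = ppx.find_project(accession, local=local_dir)
    wanted = select_peak_files(project.remote_files())
    return project.download(wanted)
```

Calling `download_peak_files` once per accession in the reference table would then fetch the peak files dataset by dataset; given the caveat about changed URLs, some datasets may still need manual handling.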

For the second part, indeed all of the peak files were read and the relevant spectra were extracted into a single large MGF (available through the link Melih shared). Because it's so many files and spectra, this will take a bit of time as well.
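The extraction step could look roughly like the sketch below. It is an illustration under assumptions, not the script that produced the shared MGF: it assumes the reference table has been reduced to (filename, scan, peptide) rows, that the spectra themselves are read with a separate mzML/mzXML parser (not shown), and that the annotated MGF carries the peptide in a `SEQ=` field. The function names are hypothetical.

```python
def scans_by_file(rows):
    """Group (filename, scan, peptide) rows into {filename: {scan: peptide}}.

    Grouping first means each peak file only has to be opened once.
    """
    wanted = {}
    for filename, scan, peptide in rows:
        wanted.setdefault(filename, {})[int(scan)] = peptide
    return wanted


def format_mgf_entry(title, precursor_mz, charge, peaks, peptide):
    """Render one annotated spectrum as an MGF text block.

    `peaks` is an iterable of (m/z, intensity) pairs.
    """
    lines = [
        "BEGIN IONS",
        f"TITLE={title}",
        f"PEPMASS={precursor_mz:.6f}",
        f"CHARGE={charge}+",
        f"SEQ={peptide}",  # peptide annotation used as the training label
    ]
    lines += [f"{mz:.4f} {intensity:.1f}" for mz, intensity in peaks]
    lines.append("END IONS")
    return "\n".join(lines) + "\n"
```

For each file in `scans_by_file(...)`, the matching scans would be read from the peak file and appended to the output with `format_mgf_entry`, producing either one large MGF or one MGF per input file.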

We are working on a full description of this process to extend our documentation and to provide this dataset in an easier-to-use format, which should be available within the next few weeks.

CCranney · 2024-05-16T13:48:18Z

CCranney
May 16, 2024
Author

Thank you!
