ValueError upon unexpected scan title format #354

Open
MNTsnowman opened this issue Jul 7, 2024 · 7 comments · May be fixed by #369
Labels
bug Something isn't working

Comments

@MNTsnowman

MNTsnowman commented Jul 7, 2024

Hi Casanovo

This is the first time I'm attempting to use Casanovo. I have tried to follow your guide at: https://casanovo.readthedocs.io/en/latest/getting_started.html

I'm getting the error shown below. I'm wondering whether it could have something to do with the headers of the scans in the mzML files. If this sounds like a possibility, could you please provide the command line settings you are using to generate the mzML files, and how you name and structure the header?

 D:\...\De Novo>casanovo sequence -m WorkDir\casanovo_massivekb.ckpt -c WorkDir\casanovo_config.yaml Data\mzML\14-2-NM_S4-A1_1_9156.mzML
WARNING: Dataloader multiprocessing is currently not supported on Windows or MacOS; using only a single thread.
Seed set to 454
INFO: Casanovo version 4.2.1
INFO: Sequencing peptides from:
INFO:   Data\mzML\14-2-NM_S4-A1_1_9156.mzML
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
INFO: Reading 1 files...
Data\mzML\14-2-NM_S4-A1_1_9156.mzML: 100%|█████████████████████████████████| 27193/27193 [00:32<00:00, 835.91spectra/s]
WARNING: Skipped 25714 spectra with invalid precursor info
Traceback (most recent call last):
  File "C:\Users\...\casanovo_env\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\...\casanovo_env\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\...\casanovo_env\Scripts\casanovo.exe\__main__.py", line 7, in <module>
  File "C:\Users\...\casanovo_env\lib\site-packages\rich_click\rich_command.py", line 367, in __call__
    return super().__call__(*args, **kwargs)
  File "C:\Users\...\casanovo_env\lib\site-packages\click\core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "C:\Users\...\casanovo_env\lib\site-packages\rich_click\rich_command.py", line 152, in main
    rv = self.invoke(ctx)
  File "C:\Users\...\casanovo_env\lib\site-packages\click\core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "C:\Users\...\casanovo_env\lib\site-packages\click\core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Users\...\casanovo_env\lib\site-packages\click\core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "C:\Users\...\casanovo_env\lib\site-packages\casanovo\casanovo.py", line 143, in sequence
    runner.predict(peak_path, output)
  File "C:\Users\...\casanovo_env\lib\site-packages\casanovo\denovo\model_runner.py", line 160, in predict
    test_index = self._get_index(peak_path, False, "")
  File "C:\Users\...\casanovo_env\lib\site-packages\casanovo\denovo\model_runner.py", line 394, in _get_index
    return Index(index_fname, filenames, valid_charge=valid_charge)
  File "C:\Users\...\casanovo_env\lib\site-packages\depthcharge\data\hdf5.py", line 104, in __init__
    self.add_file(ms_file)
  File "C:\Users\...\casanovo_env\lib\site-packages\depthcharge\data\hdf5.py", line 195, in add_file
    metadata = self._assemble_metadata(parser)
  File "C:\Users\...\casanovo_env\lib\site-packages\depthcharge\data\hdf5.py", line 173, in _assemble_metadata
    metadata["scan_id"] = parser.scan_id
ValueError: could not broadcast input array from shape (0,) into shape (25714,)
bittremieux added the bug label Jul 7, 2024
@bittremieux
Collaborator

bittremieux commented Jul 7, 2024

I suspect that all of the spectra were skipped:

WARNING: Skipped 25714 spectra with invalid precursor info

You already indicated that you suspected something wrong with the scan headers. Did you modify them in some way?

Normally standard mzML files produced by MSConvert, ThermoRawFileParser, etc. should all work. We do not edit the mzML files or the headers in there at all.

@MNTsnowman
Author

Hi @bittremieux

Yes, I suspect the headers because my data originates from a timsTOF with ion mobility (IM) engaged. I don't think the IM is to blame, as it is handled in the conversion (see the command below). Given that the data is from a timsTOF, I do not think ThermoRawFileParser is used at all.

For info, the command I use to generate the mzML files is something along the lines of this:
"C:\Users...\ProteoWizard 3.0.23167.44089af 64-bit\msconvert.exe" --combineIonMobilitySpectra --filter "peakPicking vendor msLevel=1-" --filter "scanSumming precursorTol=0.05 scanTimeTol=5 ionMobilityTol=0.1 sumMs1=0" --filter "titleMaker ... File:"""^<SourcePath^>""", NativeID:"""^<Id^>""""

Given that it skips all the scans and states that the precursor info is invalid, I was wondering what settings you use to generate the scan title, in other words, what the "titleMaker" part of your conversion command is. I hope this makes sense. Also, please let me know if you have other suggestions for what could be wrong. :)

@bittremieux
Collaborator

I have limited hands-on experience with timsTOF conversion to mzML, so I don't know how the titleMaker filter should be used. But I'd be surprised if that's the problem. I suspect something about the IM actually.

Can you share the mzML file here so that I can have a look at it?

@MNTsnowman
Author

Unfortunately, I'm unable to share a file here. If you have an email address where we could continue the conversation, we could maybe figure something out.

Alternatively I could try to compare the headers of your demo data with my data.
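
One way to make such a comparison is to summarize which key names appear in the `id` attribute of the `<spectrum>` elements of each file. The following is only a minimal sketch using the Python standard library; the function name `summarize_spectrum_ids` and the `limit` parameter are hypothetical, and the in-memory demo document uses made-up ID values rather than real data.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def summarize_spectrum_ids(source, limit=1000):
    """Count the key layouts found in the id attribute of <spectrum> elements.

    `source` is a path or a binary file-like object containing mzML.
    Only the first `limit` spectra are inspected.
    """
    layouts = Counter()
    seen = 0
    for _, elem in ET.iterparse(source, events=("end",)):
        # Drop the XML namespace prefix, e.g. "{http://psi.hupo.org/ms/mzml}spectrum".
        if elem.tag.rsplit("}", 1)[-1] != "spectrum":
            continue
        # "merged=1 frame=2" -> ("merged", "frame")
        keys = tuple(part.split("=", 1)[0] for part in elem.get("id", "").split())
        layouts[keys] += 1
        elem.clear()  # free memory held by the peak arrays
        seen += 1
        if seen >= limit:
            break
    return layouts


# Made-up, minimal mzML-like document for illustration only.
demo = io.BytesIO(
    b'<mzML xmlns="http://psi.hupo.org/ms/mzml"><run><spectrumList>'
    b'<spectrum index="0" id="merged=1 frame=2 scanStart=3 scanEnd=4"/>'
    b'<spectrum index="1" id="merged=2 frame=2 scanStart=5 scanEnd=9"/>'
    b"</spectrumList></run></mzML>"
)
print(summarize_spectrum_ids(demo))
```

Running the same function on the demo data and on the timsTOF file (passing each file path as `source`) would show at a glance whether the two use different ID layouts.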

@bittremieux
Collaborator

You can email me at [email protected].

@bittremieux
Collaborator

bittremieux commented Aug 22, 2024

OK, the issue is that the scan titles in your mzML file are in the format merged=XX frame=XX scanStart=XXX scanEnd=XXX, whereas the DepthCharge parser expects a single scan number indicated by scan=XXX. The latter works for Thermo data, which is what we've mostly been working with so far. But of course not all scan titles are formatted that way, and PASEF is a slightly more special case still.
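
To illustrate the mismatch, here is a minimal sketch of a parser that looks for a `scan=<number>` field. It is not the actual DepthCharge code: the pattern, the function name `extract_scan_number`, and the numeric values in the example IDs are all assumptions made for illustration.

```python
import re

# Hypothetical pattern standing in for a parser that expects "scan=<number>".
SCAN_PATTERN = re.compile(r"(?:^|\s)scan=(\d+)")


def extract_scan_number(spectrum_id):
    """Return the scan number as an int, or None if there is no scan= field."""
    match = SCAN_PATTERN.search(spectrum_id)
    return int(match.group(1)) if match else None


# Thermo-style ID with a single scan number (values are placeholders).
thermo_id = "controllerType=0 controllerNumber=1 scan=1234"
# timsTOF/PASEF-style ID in the format described above (values are placeholders).
pasef_id = "merged=12 frame=345 scanStart=678 scanEnd=702"

print(extract_scan_number(thermo_id))  # 1234
print(extract_scan_number(pasef_id))   # None
```

If every spectrum in a file yields no scan number, a downstream step that expects one scan ID per spectrum ends up with an empty list, which is consistent with the shape mismatch in the traceback above.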

The good news is that this should be resolved by the pending DepthCharge upgrade (#350). Until that is fully integrated, I'll keep this issue open so that we can double-check that it gets fixed.

As a workaround for now, is it possible to modify the titleMaker filter? Alternatively, converting to MGF should also work, because for MGF we don't try to extract scan information from the spectrum title.

@bittremieux bittremieux linked a pull request Aug 22, 2024 that will close this issue
@bittremieux bittremieux changed the title from "ValueError: could not broadcast input array from shape (0,) into shape (25714,)" to "ValueError upon unexpected scan title format" Aug 22, 2024
@MNTsnowman
Author

Hi @bittremieux

Thanks a lot for getting back to me and keeping me up to date. :)

I think the solution is in the header, so for now I will leave it and wait for your update or the solution in #369. If I may add one suggestion to the process: please read the documentation for the "titleMaker" filter and its commands/syntax in msconvert (https://proteowizard.sourceforge.io/tools/msconvert.html). It could be advantageous for you to define a format that supports timsTOF and other equipment with those commands in mind. The result should be a command resembling what I showed above. As a bonus, you could then add that example command to your README, where others can find the information as well.

Regarding MGF files: yes, it's an option, and it worked when I tested it. However, from MGF files I am unable to estimate intensities and thus abundances.

Thanks a lot for the tool and keep up the good work, i'll be keeping an eye on it. :)
