Structure and implementation details of ProteoBench that could be improved #581

hollenstein · 2025-02-12T13:26:10Z

Hi, while looking at the repository, and specifically how the settings .toml files are parsed and used, I've noticed several smaller issues. Those are some of the points that made it difficult for me to understand the proteobench project in the beginning and start working a new module (since I looked at the existing modules as a template for the new one). So I would like to discuss if it makes sense to introduce changes to the project.

(1) The ParseSettingsBuilder has default values for 'module_id' and a hidden default value for 'parse_settings_dir', both of which are specific for the "quant_lfq_ion_DDA" module. I think it would be better to remove these default values and enforce to provide these values when creating a ParseSettingsBuilder.

(2) Currently the ParseSettings gets to arguments, 'parse_settings' and 'parse_settings_module'. The module level settings contain information that is the same for all software tools, like '[species_expected_ratio]'. In all (or almost all?) of the software tools settings the following parameters are identical.

[species_mapper]
"_YEAST" = "YEAST"
"_ECOLI" = "ECOLI"
"_HUMAN" = "HUMAN"

[general]
"contaminant_flag" = "Cont_"
"decoy_flag" = true

So I guess it would make sense to move these settings to the module level file. However, there might be the case that for one software tool the settings need to be different. Therefore I would suggest to merge the information from 'parse_settings' and
'parse_settings_module', with software tool specific settings overwriting the module level settings. So in ParseSettingsBuilder.build_parser() the code would look something like:

ParseSettingsBuilder:

  self.build_parser(self, input_format: str) -> ParseSettings:
    software_tool_specific_settings = toml.load(self.PARSE_SETTINGS_FILES[input_format])
    settings = toml.load(self.PARSE_SETTINGS_FILES_MODULE)
    settings.update(software_tool_specific_settings)
    
    parser = ParseSettings(settings)
    if "modifications_parser" in settings:
        parser = ParseModificationSettings(parser, settings)

(3) There is even more duplication of the settings at the module level, with all entries except for "general.level" being identical between all "quant" modules (actually also the "[species_mapper]" is the same across modules). Would it make sense to create another "super module" level settings.toml file at the "quant" level, that is shared between all modules that use a similar three species benchmark dataset with the same ratios? Or do you expect that there might be additional "quant" modules with different species or different ratios, that reuse the "quant" module logic?

[species_expected_ratio]
[species_expected_ratio.YEAST]
"A_vs_B" = 2.0
"color" = "red"

[species_expected_ratio.ECOLI]
"A_vs_B" = 0.25
"color" = "blue"

[species_expected_ratio.HUMAN]
"A_vs_B" = 1.0
"color" = "green"

[general]
"min_count_multispec" = 1
"level" = "peptidoform"

(4) Currently the structure of the folders is for example "quant/lfq/ion/dda". This is somewhat counter intuitiv, because all "quant/lfq/dda" modules share the same raw dataset, fasta file, ... So you would use the same software results to upload your data into the 'ion' or 'peptidoform' (and potentially 'protein group') modules. Wouldn't it make more sense that the structure is "quant/lfq/dda" and then have subfolders for "ion" and "peptidoform"?

(5) Looking at other components of ProteoBench where we have a similar folder structure, like the "main module" classes. Those files and classes don't seem to be reusable across multiple modules, since file paths, module name and other parameters are hardcoded, e.g. in
DDAQuantIonModule. Since the filenames already contain all information "levels" (e.g. quant_lfq_ion_DDA.py) and there is only one specific "module class" per module, the python files could all be placed together into the 'proteobench/modules/quant' folder, instead of 'proteobench/modules/quant/lfq/ion/DDA/quant_lfq_ion_DDA.py'. I think in general, when there is the possibility to avoid nested folder structures it makes the repository easier to understand and work with.

The text was updated successfully, but these errors were encountered:

RobbinBouwmeester · 2025-02-25T12:56:46Z

Hi, while looking at the repository, and specifically how the settings .toml files are parsed and used, I've noticed several smaller issues. Those are some of the points that made it difficult for me to understand the proteobench project in the beginning and start working a new module (since I looked at the existing modules as a template for the new one). So I would like to discuss if it makes sense to introduce changes to the project.

(1) The ParseSettingsBuilder has default values for 'module_id' and a hidden default value for 'parse_settings_dir', both of which are specific for the "quant_lfq_ion_DDA" module. I think it would be better to remove these default values and enforce to provide these values when creating a ParseSettingsBuilder.

Fully agree, we might need to check if any parts of the code rely on the default or hidden parameter.

(2) Currently the ParseSettings gets to arguments, 'parse_settings' and 'parse_settings_module'. The module level settings contain information that is the same for all software tools, like '[species_expected_ratio]'. In all (or almost all?) of the software tools settings the following parameters are identical.

[species_mapper]
"_YEAST" = "YEAST"
"_ECOLI" = "ECOLI"
"_HUMAN" = "HUMAN"

[general]
"contaminant_flag" = "Cont_"
"decoy_flag" = true
So I guess it would make sense to move these settings to the module level file. However, there might be the case that for one software tool the settings need to be different. Therefore I would suggest to merge the information from 'parse_settings' and 'parse_settings_module', with software tool specific settings overwriting the module level settings. So in ParseSettingsBuilder.build_parser() the code would look something like:

ParseSettingsBuilder:

self.build_parser(self, input_format: str) -> ParseSettings:
software_tool_specific_settings = toml.load(self.PARSE_SETTINGS_FILES[input_format])
settings = toml.load(self.PARSE_SETTINGS_FILES_MODULE)
settings.update(software_tool_specific_settings)
parser = ParseSettings(settings)
if "modifications_parser" in settings:
    parser = ParseModificationSettings(parser, settings)

I believe this was done for a reason. Not sure exactly what reason. What I can definitely remember is that we build this with the idea of future-proofing. So it might not be useful now, but it might be in the future. I do agree that if there are things on the module level it should be moved there. We should already have a .toml for the module level.

(3) There is even more duplication of the settings at the module level, with all entries except for "general.level" being identical between all "quant" modules (actually also the "[species_mapper]" is the same across modules). Would it make sense to create another "super module" level settings.toml file at the "quant" level, that is shared between all modules that use a similar three species benchmark dataset with the same ratios? Or do you expect that there might be additional "quant" modules with different species or different ratios, that reuse the "quant" module logic?
[species_expected_ratio]
[species_expected_ratio.YEAST]
"A_vs_B" = 2.0
"color" = "red"

[species_expected_ratio.ECOLI]
"A_vs_B" = 0.25
"color" = "blue"

[species_expected_ratio.HUMAN]
"A_vs_B" = 1.0
"color" = "green"

[general]
"min_count_multispec" = 1
"level" = "peptidoform"

There should already be one, right? Or do you want to make another super level? Not sure about that, I would rather have duplication and clarity than complexity.

(4) Currently the structure of the folders is for example "quant/lfq/ion/dda". This is somewhat counter intuitiv, because all "quant/lfq/dda" modules share the same raw dataset, fasta file, ... So you would use the same software results to upload your data into the 'ion' or 'peptidoform' (and potentially 'protein group') modules. Wouldn't it make more sense that the structure is "quant/lfq/dda" and then have subfolders for "ion" and "peptidoform"?

@rodvrees does this make more sense or not?

(5) Looking at other components of ProteoBench where we have a similar folder structure, like the "main module" classes. Those files and classes don't seem to be reusable across multiple modules, since file paths, module name and other parameters are hardcoded, e.g. in DDAQuantIonModule. Since the filenames already contain all information "levels" (e.g. quant_lfq_ion_DDA.py) and there is only one specific "module class" per module, the python files could all be placed together into the 'proteobench/modules/quant' folder, instead of 'proteobench/modules/quant/lfq/ion/DDA/quant_lfq_ion_DDA.py'. I think in general, when there is the possibility to avoid nested folder structures it makes the repository easier to understand and work with.

Personally, I think nested folder structures are fine. What makes different file names easier to understand/work with than folders?

I do agree with the naming of the files that should be the same.

hollenstein · 2025-02-26T08:16:37Z

Regarding (5)

Personally, I think nested folder structures are fine. What makes different file names easier to understand/work with than folders?

The current folder structure looks like this, with deeply nested folders, in the end containing one (or only once two) files.

proteobench
└── modules
    └── quant
        ├── lfq
        │   ├── ion
        │   │   ├── DDA
        │   │   │   └── quant_lfq_ion_DDA.py
        │   │   └── DIA
        │   │       ├── quant_lfq_ion_DIA_AIF.py
        │   │       └── quant_lfq_ion_DIA_diaPASEF.py
        │   └── peptidoform            
        │       ├── DDA
        │       │   └── quant_lfq_peptidoform_DDA.py
        │       └── DIA
        │           └── quant_lfq_peptidoform_DIA.py
        └── quant_base
            └── quant_base_module.py

When we look at a single file, the definition of the file is repeated twice. Once in the folder structure and once in the file name.

Path: quant/lfq/ion/DDA
File: quant_lfq_ion_DDA.py

So it could be easily be simplified to a much flatter organization like this ...

proteobench
└── modules
    └── quant
        ├── quant_lfq_ion_DDA.py
        ├── quant_lfq_ion_DIA_AIF.py
        ├── quant_lfq_ion_DIA_diaPASEF.py
        ├── quant_lfq_peptidoform_DDA.py
        ├── quant_lfq_peptidoform_DIA.py
        └── quant_base_module.py

Or keeping some additional level of organization like this ...

proteobench
└── modules
    └── quant
        ├── submodules (or lfq, ...)
        │   ├── quant_lfq_ion_DDA.py
        │   ├── quant_lfq_ion_DIA_AIF.py
        │   ├── quant_lfq_ion_DIA_diaPASEF.py
        │   ├── quant_lfq_peptidoform_DDA.py
        │   └── quant_lfq_peptidoform_DIA.py
        └── base
            └── quant_base_module.py

RobbinBouwmeester · 2025-03-10T11:36:15Z

Ok, going through the list right now, unfortunately we cannot change (2) as suggested. Naming schemes for species and contaminants are tool dependent. While most of them will be the same. Currently I do not see harm in keeping it as is, also makes it more clear that this is tool specific.

RobbinBouwmeester · 2025-03-10T11:38:28Z

For (3), yes, we need to control it this way as these settings are module specific, and indeed different quant modules can have different ratios. As this also makes it clear what is exactly used and read in per module I want to keep it this way.

RobbinBouwmeester · 2025-03-10T11:41:37Z

For (4), sure, lets turn them around. I have no preference either way.

RobbinBouwmeester · 2025-03-10T11:42:42Z

For (5), I will have a look now to remove the folder structure.

hollenstein · 2025-03-10T15:58:57Z

Thanks for taking the time to go through the list.

That means (1), (2), (3) are solved (although I might make a separate issue that also concerns 2).

Once you are done with (5), I will make PR for (4).

RobbinBouwmeester · 2025-03-11T08:00:30Z

Wait, sorry, I was not clear. I am also trying to work on (4).

RobbinBouwmeester · 2025-03-11T09:01:38Z

Fixed most of them. These are still remaining:

proteobench/io/parsing/io_parse_settings/Quant/lfq/DDA/ion ...
webinterface/pages/markdown_files/Quant/lfq/DDA/ion

RobbinBouwmeester · 2025-03-11T09:28:30Z

Mostly resolved in #597

RobbinBouwmeester · 2025-03-11T09:32:58Z

@hollenstein do you think this resolves the issue completely or would you like to also do the io_parse_settings?

For the markdown files, I think the solution is fine as we have quite a few unique markdown files, and putting them into a single folder would be a bit messy...

hollenstein mentioned this issue Mar 1, 2025

Refactor/fewer default args #589

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Structure and implementation details of ProteoBench that could be improved #581

Structure and implementation details of ProteoBench that could be improved #581

hollenstein commented Feb 12, 2025

RobbinBouwmeester commented Feb 25, 2025

hollenstein commented Feb 26, 2025 •

edited

Loading

RobbinBouwmeester commented Mar 10, 2025

RobbinBouwmeester commented Mar 10, 2025

RobbinBouwmeester commented Mar 10, 2025

RobbinBouwmeester commented Mar 10, 2025

hollenstein commented Mar 10, 2025 •

edited

Loading

RobbinBouwmeester commented Mar 11, 2025

RobbinBouwmeester commented Mar 11, 2025

RobbinBouwmeester commented Mar 11, 2025

RobbinBouwmeester commented Mar 11, 2025

Structure and implementation details of ProteoBench that could be improved #581

Structure and implementation details of ProteoBench that could be improved #581

Comments

hollenstein commented Feb 12, 2025

RobbinBouwmeester commented Feb 25, 2025

hollenstein commented Feb 26, 2025 • edited Loading

RobbinBouwmeester commented Mar 10, 2025

RobbinBouwmeester commented Mar 10, 2025

RobbinBouwmeester commented Mar 10, 2025

RobbinBouwmeester commented Mar 10, 2025

hollenstein commented Mar 10, 2025 • edited Loading

RobbinBouwmeester commented Mar 11, 2025

RobbinBouwmeester commented Mar 11, 2025

RobbinBouwmeester commented Mar 11, 2025

RobbinBouwmeester commented Mar 11, 2025

hollenstein commented Feb 26, 2025 •

edited

Loading

hollenstein commented Mar 10, 2025 •

edited

Loading