Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Structure and implementation details of ProteoBench that could be improved #581

Open
hollenstein opened this issue Feb 12, 2025 · 11 comments

Comments

@hollenstein
Copy link
Contributor

Hi, while looking at the repository, and specifically how the settings .toml files are parsed and used, I've noticed several smaller issues. Those are some of the points that made it difficult for me to understand the proteobench project in the beginning and start working a new module (since I looked at the existing modules as a template for the new one). So I would like to discuss if it makes sense to introduce changes to the project.

(1) The ParseSettingsBuilder has default values for 'module_id' and a hidden default value for 'parse_settings_dir', both of which are specific for the "quant_lfq_ion_DDA" module. I think it would be better to remove these default values and enforce to provide these values when creating a ParseSettingsBuilder.

(2) Currently the ParseSettings gets to arguments, 'parse_settings' and 'parse_settings_module'. The module level settings contain information that is the same for all software tools, like '[species_expected_ratio]'. In all (or almost all?) of the software tools settings the following parameters are identical.

[species_mapper]
"_YEAST" = "YEAST"
"_ECOLI" = "ECOLI"
"_HUMAN" = "HUMAN"

[general]
"contaminant_flag" = "Cont_"
"decoy_flag" = true

So I guess it would make sense to move these settings to the module level file. However, there might be the case that for one software tool the settings need to be different. Therefore I would suggest to merge the information from 'parse_settings' and
'parse_settings_module', with software tool specific settings overwriting the module level settings. So in ParseSettingsBuilder.build_parser() the code would look something like:

ParseSettingsBuilder:

  self.build_parser(self, input_format: str) -> ParseSettings:
    software_tool_specific_settings = toml.load(self.PARSE_SETTINGS_FILES[input_format])
    settings = toml.load(self.PARSE_SETTINGS_FILES_MODULE)
    settings.update(software_tool_specific_settings)
    
    parser = ParseSettings(settings)
    if "modifications_parser" in settings:
        parser = ParseModificationSettings(parser, settings)

(3) There is even more duplication of the settings at the module level, with all entries except for "general.level" being identical between all "quant" modules (actually also the "[species_mapper]" is the same across modules). Would it make sense to create another "super module" level settings.toml file at the "quant" level, that is shared between all modules that use a similar three species benchmark dataset with the same ratios? Or do you expect that there might be additional "quant" modules with different species or different ratios, that reuse the "quant" module logic?

[species_expected_ratio]
[species_expected_ratio.YEAST]
"A_vs_B" = 2.0
"color" = "red"

[species_expected_ratio.ECOLI]
"A_vs_B" = 0.25
"color" = "blue"

[species_expected_ratio.HUMAN]
"A_vs_B" = 1.0
"color" = "green"

[general]
"min_count_multispec" = 1
"level" = "peptidoform"

(4) Currently the structure of the folders is for example "quant/lfq/ion/dda". This is somewhat counter intuitiv, because all "quant/lfq/dda" modules share the same raw dataset, fasta file, ... So you would use the same software results to upload your data into the 'ion' or 'peptidoform' (and potentially 'protein group') modules. Wouldn't it make more sense that the structure is "quant/lfq/dda" and then have subfolders for "ion" and "peptidoform"?

(5) Looking at other components of ProteoBench where we have a similar folder structure, like the "main module" classes. Those files and classes don't seem to be reusable across multiple modules, since file paths, module name and other parameters are hardcoded, e.g. in
DDAQuantIonModule. Since the filenames already contain all information "levels" (e.g. quant_lfq_ion_DDA.py) and there is only one specific "module class" per module, the python files could all be placed together into the 'proteobench/modules/quant' folder, instead of 'proteobench/modules/quant/lfq/ion/DDA/quant_lfq_ion_DDA.py'. I think in general, when there is the possibility to avoid nested folder structures it makes the repository easier to understand and work with.

@RobbinBouwmeester
Copy link
Contributor

Hi, while looking at the repository, and specifically how the settings .toml files are parsed and used, I've noticed several smaller issues. Those are some of the points that made it difficult for me to understand the proteobench project in the beginning and start working a new module (since I looked at the existing modules as a template for the new one). So I would like to discuss if it makes sense to introduce changes to the project.

(1) The ParseSettingsBuilder has default values for 'module_id' and a hidden default value for 'parse_settings_dir', both of which are specific for the "quant_lfq_ion_DDA" module. I think it would be better to remove these default values and enforce to provide these values when creating a ParseSettingsBuilder.

Fully agree, we might need to check if any parts of the code rely on the default or hidden parameter.

(2) Currently the ParseSettings gets to arguments, 'parse_settings' and 'parse_settings_module'. The module level settings contain information that is the same for all software tools, like '[species_expected_ratio]'. In all (or almost all?) of the software tools settings the following parameters are identical.

[species_mapper]
"_YEAST" = "YEAST"
"_ECOLI" = "ECOLI"
"_HUMAN" = "HUMAN"

[general]
"contaminant_flag" = "Cont_"
"decoy_flag" = true
So I guess it would make sense to move these settings to the module level file. However, there might be the case that for one software tool the settings need to be different. Therefore I would suggest to merge the information from 'parse_settings' and 'parse_settings_module', with software tool specific settings overwriting the module level settings. So in ParseSettingsBuilder.build_parser() the code would look something like:

ParseSettingsBuilder:

self.build_parser(self, input_format: str) -> ParseSettings:
software_tool_specific_settings = toml.load(self.PARSE_SETTINGS_FILES[input_format])
settings = toml.load(self.PARSE_SETTINGS_FILES_MODULE)
settings.update(software_tool_specific_settings)

parser = ParseSettings(settings)
if "modifications_parser" in settings:
    parser = ParseModificationSettings(parser, settings)

I believe this was done for a reason. Not sure exactly what reason. What I can definitely remember is that we build this with the idea of future-proofing. So it might not be useful now, but it might be in the future. I do agree that if there are things on the module level it should be moved there. We should already have a .toml for the module level.

(3) There is even more duplication of the settings at the module level, with all entries except for "general.level" being identical between all "quant" modules (actually also the "[species_mapper]" is the same across modules). Would it make sense to create another "super module" level settings.toml file at the "quant" level, that is shared between all modules that use a similar three species benchmark dataset with the same ratios? Or do you expect that there might be additional "quant" modules with different species or different ratios, that reuse the "quant" module logic?

[species_expected_ratio]
[species_expected_ratio.YEAST]
"A_vs_B" = 2.0
"color" = "red"

[species_expected_ratio.ECOLI]
"A_vs_B" = 0.25
"color" = "blue"

[species_expected_ratio.HUMAN]
"A_vs_B" = 1.0
"color" = "green"

[general]
"min_count_multispec" = 1
"level" = "peptidoform"

There should already be one, right? Or do you want to make another super level? Not sure about that, I would rather have duplication and clarity than complexity.

(4) Currently the structure of the folders is for example "quant/lfq/ion/dda". This is somewhat counter intuitiv, because all "quant/lfq/dda" modules share the same raw dataset, fasta file, ... So you would use the same software results to upload your data into the 'ion' or 'peptidoform' (and potentially 'protein group') modules. Wouldn't it make more sense that the structure is "quant/lfq/dda" and then have subfolders for "ion" and "peptidoform"?

@rodvrees does this make more sense or not?

(5) Looking at other components of ProteoBench where we have a similar folder structure, like the "main module" classes. Those files and classes don't seem to be reusable across multiple modules, since file paths, module name and other parameters are hardcoded, e.g. in DDAQuantIonModule. Since the filenames already contain all information "levels" (e.g. quant_lfq_ion_DDA.py) and there is only one specific "module class" per module, the python files could all be placed together into the 'proteobench/modules/quant' folder, instead of 'proteobench/modules/quant/lfq/ion/DDA/quant_lfq_ion_DDA.py'. I think in general, when there is the possibility to avoid nested folder structures it makes the repository easier to understand and work with.

Personally, I think nested folder structures are fine. What makes different file names easier to understand/work with than folders?

I do agree with the naming of the files that should be the same.

@hollenstein
Copy link
Contributor Author

hollenstein commented Feb 26, 2025

Regarding (5)

Personally, I think nested folder structures are fine. What makes different file names easier to understand/work with than folders?

The current folder structure looks like this, with deeply nested folders, in the end containing one (or only once two) files.

proteobench
└── modules
    └── quant
        ├── lfq
        │   ├── ion
        │   │   ├── DDA
        │   │   │   └── quant_lfq_ion_DDA.py
        │   │   └── DIA
        │   │       ├── quant_lfq_ion_DIA_AIF.py
        │   │       └── quant_lfq_ion_DIA_diaPASEF.py
        │   └── peptidoform            
        │       ├── DDA
        │       │   └── quant_lfq_peptidoform_DDA.py
        │       └── DIA
        │           └── quant_lfq_peptidoform_DIA.py
        └── quant_base
            └── quant_base_module.py

When we look at a single file, the definition of the file is repeated twice. Once in the folder structure and once in the file name.

Path: quant/lfq/ion/DDA
File: quant_lfq_ion_DDA.py

So it could be easily be simplified to a much flatter organization like this ...

proteobench
└── modules
    └── quant
        ├── quant_lfq_ion_DDA.py
        ├── quant_lfq_ion_DIA_AIF.py
        ├── quant_lfq_ion_DIA_diaPASEF.py
        ├── quant_lfq_peptidoform_DDA.py
        ├── quant_lfq_peptidoform_DIA.py
        └── quant_base_module.py

Or keeping some additional level of organization like this ...

proteobench
└── modules
    └── quant
        ├── submodules (or lfq, ...)
        │   ├── quant_lfq_ion_DDA.py
        │   ├── quant_lfq_ion_DIA_AIF.py
        │   ├── quant_lfq_ion_DIA_diaPASEF.py
        │   ├── quant_lfq_peptidoform_DDA.py
        │   └── quant_lfq_peptidoform_DIA.py
        └── base
            └── quant_base_module.py

@RobbinBouwmeester
Copy link
Contributor

Ok, going through the list right now, unfortunately we cannot change (2) as suggested. Naming schemes for species and contaminants are tool dependent. While most of them will be the same. Currently I do not see harm in keeping it as is, also makes it more clear that this is tool specific.

@RobbinBouwmeester
Copy link
Contributor

For (3), yes, we need to control it this way as these settings are module specific, and indeed different quant modules can have different ratios. As this also makes it clear what is exactly used and read in per module I want to keep it this way.

@RobbinBouwmeester
Copy link
Contributor

For (4), sure, lets turn them around. I have no preference either way.

@RobbinBouwmeester
Copy link
Contributor

For (5), I will have a look now to remove the folder structure.

@hollenstein
Copy link
Contributor Author

hollenstein commented Mar 10, 2025

Thanks for taking the time to go through the list.

That means (1), (2), (3) are solved (although I might make a separate issue that also concerns 2).

Once you are done with (5), I will make PR for (4).

@RobbinBouwmeester
Copy link
Contributor

Wait, sorry, I was not clear. I am also trying to work on (4).

@RobbinBouwmeester
Copy link
Contributor

Fixed most of them. These are still remaining:

  • proteobench/io/parsing/io_parse_settings/Quant/lfq/DDA/ion ...
  • webinterface/pages/markdown_files/Quant/lfq/DDA/ion

@RobbinBouwmeester
Copy link
Contributor

Mostly resolved in #597

@RobbinBouwmeester
Copy link
Contributor

@hollenstein do you think this resolves the issue completely or would you like to also do the io_parse_settings?

For the markdown files, I think the solution is fine as we have quite a few unique markdown files, and putting them into a single folder would be a bit messy...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants