[BUG] MetaPhlAn 4 output with duplicate clade tax id is not supported #140

Open
1 task done
MajoroMask opened this issue Oct 8, 2023 · 18 comments · May be fixed by #153
Labels: bug (Something isn't working)

@MajoroMask

MajoroMask commented Oct 8, 2023

Is there an existing issue for this?

  • I have searched the existing issues

Problem description

As in the title. This report is forwarded from nf-core/taxprofiler#396.

The MetaPhlAn 4 output files I'm using are in these gists, in case they help:
2612_se_metaphlan4-db.metaphlan_profile.txt and 2613_se_metaphlan4-db.metaphlan_profile.txt

I think the duplicated tax ID (shown below) caused the error.

cat 2612_se_metaphlan4-db.metaphlan_profile.txt | grep '165179'
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_copri_clade_A        2|976|200643|171549|171552|838|165179       15.15712
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_copri_clade_C        2|976|200643|171549|171552|838|165179       3.48391
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_copri_clade_B        2|976|200643|171549|171552|838|165179       1.31197
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_copri_clade_F        2|976|200643|171549|171552|838|165179       0.34791
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_copri_clade_A|t__SGB1626     2|976|200643|171549|171552|838|165179|      15.15712
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_copri_clade_C|t__SGB1644     2|976|200643|171549|171552|838|165179|      3.48391 k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_sp_TF12_30,k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_sp_AM23_5
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_copri_clade_B|t__SGB1613     2|976|200643|171549|171552|838|165179|      1.31197
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_copri_clade_F|t__SGB1614     2|976|200643|171549|171552|838|165179|      0.34791
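The same check can be done programmatically. The following is a minimal Python sketch, assuming the profile is tab-separated with the clade name in the first column and the tax ID lineage in the second, and that comment lines start with `#`. The function name `find_duplicate_leaf_ids` is made up for illustration and is not part of taxpasta or MetaPhlAn.

```python
import csv
import io
from collections import defaultdict


def find_duplicate_leaf_ids(handle):
    """Map each leaf tax ID to the clade names sharing it (only IDs seen more than once)."""
    names_by_id = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines, comment/header lines, and rows without an ID column.
        if len(row) < 2 or row[0].startswith("#"):
            continue
        clade_name, id_lineage = row[0], row[1]
        leaf_id = id_lineage.split("|")[-1].strip()
        if leaf_id:  # rows such as t__SGB... end in an empty ID and are ignored here
            names_by_id[leaf_id].append(clade_name.split("|")[-1])
    return {tid: names for tid, names in names_by_id.items() if len(names) > 1}


# Three rows from the grep output above, rebuilt with explicit tabs.
prefix = "k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|"
ids = "2|976|200643|171549|171552|838|165179"
sample = io.StringIO(
    f"{prefix}s__Prevotella_copri_clade_A\t{ids}\t15.15712\n"
    f"{prefix}s__Prevotella_copri_clade_C\t{ids}\t3.48391\n"
    f"{prefix}s__Prevotella_copri_clade_A|t__SGB1626\t{ids}|\t15.15712\n"
)
duplicates = find_duplicate_leaf_ids(sample)
print(duplicates)
```

For these three rows, tax ID 165179 is reported as shared by `s__Prevotella_copri_clade_A` and `s__Prevotella_copri_clade_C`; the strain row is skipped because its own ID field is empty.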

I also ran taxpasta standardise on both MetaPhlAn 4 output files. taxpasta runs without error, but the result may be problematic.

taxpasta standardise -p metaphlan -o standard_2612.tsv 2612_se_metaphlan4-db.metaphlan_profile.txt
[02:43:32] WARNING  Combining 122 entries with unclassified taxa in the profile.             metaphlan_profile_standardisation_service.py:94
           INFO     Write result to 'standard_2612.tsv'.

In the result 'standard_2612.tsv' I got 4 entries with the same tax ID and different counts:

cat standard_2612.tsv | grep '^165179\b'
165179  15157120
165179  3483910
165179  1311970
165179  347910

Code sample

Code run:

taxpasta merge \
    -p metaphlan -o metaphlan_metaphlan4-db.tsv --add-name --add-rank --add-lineage --add-id-lineage --add-rank-lineage \
    --taxonomy taxdump \
     \
    2612_se_metaphlan4-db.metaphlan_profile.txt 2613_se_metaphlan4-db.metaphlan_profile.txt

Traceback:

Traceback is too long, see this gist

At the end it says:

ValueError: Index has duplicate keys: CategoricalIndex([165179], categories=[0, 
2, 468, 469, ..., 2003188, 2082587, 2292893, 2887326], ordered=False, 
dtype='category', name='taxonomy_id')

Environment

I'm running taxpasta in a local Docker container, using the image quay.io/biocontainers/taxpasta:0.6.1--pyhdfd78af_0

Anything else?

No response

@MajoroMask MajoroMask added the bug Something isn't working label Oct 8, 2023
@Midnighter
Contributor

Thank you for the detailed report. It is somewhat curious that the names of the species are distinguished by a letter suffix, but the numeric identifier is the same... And no identifiers for the strains at all. I will look into it but I'm actually not sure what the correct solution should be.

@Midnighter
Contributor

@MajoroMask, the only solution that I can see immediately, is to sum up the relative abundances for the same taxon identifier. This would mean all relative abundances for the species are added together, while the strains would be added to unclassified as there is no identifier. Not really ideal.

Can you think of a better solution?
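To make the proposal concrete, here is a minimal Python sketch of that aggregation. It is not taxpasta's actual code. It assumes rows are already parsed into (ID lineage, relative abundance) pairs, and it uses `UNCLASSIFIED = 0` as a placeholder identifier for rows without one; that constant and the function name `sum_by_leaf_id` are illustrative assumptions.

```python
from collections import defaultdict

UNCLASSIFIED = 0  # assumed placeholder ID for rows without an identifier


def sum_by_leaf_id(rows):
    """Sum relative abundances per leaf tax ID; rows with an empty leaf ID go to UNCLASSIFIED."""
    totals = defaultdict(float)
    for id_lineage, abundance in rows:
        leaf = id_lineage.split("|")[-1]
        key = int(leaf) if leaf else UNCLASSIFIED
        totals[key] += abundance
    return dict(totals)


ids = "2|976|200643|171549|171552|838|165179"
rows = [
    (ids, 15.15712),        # s__Prevotella_copri_clade_A
    (ids, 3.48391),         # s__Prevotella_copri_clade_C
    (ids, 1.31197),         # s__Prevotella_copri_clade_B
    (ids, 0.34791),         # s__Prevotella_copri_clade_F
    (ids + "|", 15.15712),  # t__SGB1626 (no strain ID)
    (ids + "|", 3.48391),   # t__SGB1644
    (ids + "|", 1.31197),   # t__SGB1613
    (ids + "|", 0.34791),   # t__SGB1614
]
totals = sum_by_leaf_id(rows)
print(totals)
```

With the rows from the report, the four species entries collapse into a single value for 165179, and the four strain rows end up in the unclassified bucket, which illustrates why the approach is described as not ideal.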

@luozhy88

I get the same error with MetaPhlAn output. How can I solve it?
I downloaded the database from http://cmprod1.cibio.unitn.it/biobakery4/metaphlan_databases/bowtie2_indexes/mpa_vJan21_CHOCOPhlAnSGB_202103_bt2.tar


@MajoroMask
Author

> @MajoroMask, the only solution that I can see immediately, is to sum up the relative abundances for the same taxon identifier. This would mean all relative abundances for the species are added together, while the strains would be added to unclassified as there is no identifier. Not really ideal.
>
> Can you think of a better solution?

@Midnighter I have no idea... Can the authors of MetaPhlAn 4 be reached? Maybe they have a solution for generating an ID in the output.

@d-callan

Another example, if it helps:

k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Eubacteriales_unclassified|g__Eubacteriales_unclassified|s__Clostridiales_bacterium|t__SGB15143     2|1239|186801|186802|||1898207| 0.02366
k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Eubacteriales_unclassified|g__Eubacteriales_unclassified|s__Clostridiales_bacterium|t__SGB15159     2|1239|186801|186802|||1898207| 0.02308 

@Midnighter
Contributor

Midnighter commented Jul 18, 2024

@d-callan thank you for the additional data. Do you have any thoughts on the following? I'm not clear on how to solve this at the moment.

> only solution that I can see immediately, is to sum up the relative abundances for the same taxon identifier. This would mean all relative abundances for the species are added together, while the strains would be added to unclassified as there is no identifier. Not really ideal.

@d-callan

@Midnighter I think that's as reasonable as anything. If I want strains from bioBakery tools, I'd look to StrainPhlAn rather than MetaPhlAn. A warning might be good, or maybe making the behavior configurable.
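One way the warning and the configurable behavior could look is sketched below in Python. The `on_duplicate` parameter, its three values, and the function name `combine_duplicates` are hypothetical and do not exist in taxpasta; the counts are the ones from standard_2612.tsv in the original report.

```python
import warnings
from collections import Counter, defaultdict


def combine_duplicates(entries, on_duplicate="warn"):
    """Combine (taxonomy_id, count) pairs that share an ID.

    on_duplicate: "sum" adds silently, "warn" adds and emits a warning,
    "error" raises ValueError. These option names are hypothetical.
    """
    if on_duplicate not in {"sum", "warn", "error"}:
        raise ValueError(f"unknown on_duplicate policy: {on_duplicate!r}")
    entries = list(entries)
    counts = Counter(tax_id for tax_id, _ in entries)
    duplicated = sorted(tax_id for tax_id, n in counts.items() if n > 1)
    if duplicated and on_duplicate == "error":
        raise ValueError(f"duplicate taxonomy IDs: {duplicated}")
    if duplicated and on_duplicate == "warn":
        warnings.warn(f"summing entries with duplicate taxonomy IDs: {duplicated}")
    totals = defaultdict(int)
    for tax_id, value in entries:
        totals[tax_id] += value
    return dict(totals)


entries = [(165179, 15157120), (165179, 3483910), (165179, 1311970), (165179, 347910)]
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    combined = combine_duplicates(entries, on_duplicate="warn")
print(combined, len(caught))
```

Defaulting to `"warn"` keeps the merge working while still making the loss of resolution visible; `"error"` would preserve the current strict behavior for users who prefer it.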

@d-callan

@Midnighter I'm also wondering if you have a sense of what this would take in terms of effort? I am very interested in getting this working, and would be willing to put effort into it if you wanted.

@Midnighter
Contributor

I think the code change is minimal: 2-3 lines. It will need an extra test case or so.

@d-callan

This one is fun, I think:

k__Bacteria|p__Firmicutes|c__Negativicutes      2|1239|909932   5.20485
k__Bacteria|p__Actinobacteria|c__Actinomycetia  2|201174|1760   2.05981
k__Bacteria|p__Firmicutes|c__CFGB2834   2|1239| 0.94398
k__Bacteria|p__Proteobacteria|c__Betaproteobacteria     2|1224|28216    0.81827
k__Bacteria|p__Actinobacteria|c__Coriobacteriia 2|201174|84998  0.46979
k__Bacteria|p__Firmicutes|c__CFGB1227   2|1239| 0.404
k__Bacteria|p__Firmicutes|c__CFGB3038   2|1239| 0.18149
k__Bacteria|p__Firmicutes|c__CFGB3054   2|1239| 0.16661
k__Bacteria|p__Firmicutes|c__Firmicutes_unclassified    2|1239| 0.12308
k__Bacteria|p__Firmicutes|c__CFGB2906   2|1239| 0.03655
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria    2|1224|1236     0.02883
k__Bacteria|p__Firmicutes|c__CFGB1765   2|1239| 0.02468
k__Bacteria|p__Candidatus_Melainabacteria|c__Candidatus_Melainabacteria_unclassified    2|1798710|      0.00509
k__Bacteria|p__Bacteroidota|c__Bacteroidia|o__Bacteroidales     2|976|200643|171549     57.0883
k__Bacteria|p__Verrucomicrobia|c__Verrucomicrobiae|o__Verrucomicrobiales        2|74201|203494|48461    7.59373
k__Bacteria|p__Firmicutes|c__Negativicutes|o__Veillonellales    2|1239|909932|1843489   5.20485
k__Bacteria|p__Firmicutes|c__CFGB2834|o__OFGB2834       2|1239||        0.94398
k__Bacteria|p__Proteobacteria|c__Betaproteobacteria|o__Burkholderiales  2|1224|28216|80840      0.81827
k__Bacteria|p__Firmicutes|c__CFGB1227|o__OFGB1227       2|1239||        0.404
k__Bacteria|p__Actinobacteria|c__Coriobacteriia|o__Coriobacteriales     2|201174|84998|84999    0.38515
k__Bacteria|p__Firmicutes|c__CFGB3038|o__OFGB3038       2|1239||        0.18149
k__Bacteria|p__Firmicutes|c__CFGB3054|o__OFGB3054       2|1239||        0.16661
k__Bacteria|p__Firmicutes|c__Firmicutes_unclassified|o__Firmicutes_unclassified 2|1239||        0.12308
k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Eubacteriaceae      2|1239|186801|186802|186806     1.31747
k__Bacteria|p__Firmicutes|c__CFGB2834|o__OFGB2834|f__FGB2834    2|1239|||       0.94398
k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Clostridiaceae      2|1239|186801|186802|31979      0.85
k__Bacteria|p__Proteobacteria|c__Betaproteobacteria|o__Burkholderiales|f__Sutterellaceae        2|1224|28216|80840|995019       0.81827
k__Bacteria|p__Firmicutes|c__CFGB1227|o__OFGB1227|f__FGB1227    2|1239|||       0.404
k__Bacteria|p__Actinobacteria|c__Coriobacteriia|o__Coriobacteriales|f__Coriobacteriaceae        2|201174|84998|84999|84107      0.38515
k__Bacteria|p__Firmicutes|c__CFGB3038|o__OFGB3038|f__FGB3038    2|1239|||       0.18149
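Rows like the `CFGB`/`OFGB`/`FGB` ones above have no tax ID at their own rank, at any of class, order, or family level. A short Python sketch for listing such rows follows; the helper name `missing_leaf_id` is illustrative, and the input is assumed to be already split into (clade name, ID lineage, abundance) tuples.

```python
def missing_leaf_id(rows):
    """Return (rank_prefix, name, abundance) for rows whose own-rank tax ID is empty."""
    missing = []
    for clade_name, id_lineage, abundance in rows:
        if id_lineage.split("|")[-1] == "":
            leaf = clade_name.split("|")[-1]
            rank, _, name = leaf.partition("__")
            missing.append((rank, name, abundance))
    return missing


# Four rows copied from the example above.
rows = [
    ("k__Bacteria|p__Firmicutes|c__Negativicutes", "2|1239|909932", 5.20485),
    ("k__Bacteria|p__Firmicutes|c__CFGB2834", "2|1239|", 0.94398),
    ("k__Bacteria|p__Firmicutes|c__CFGB2834|o__OFGB2834", "2|1239||", 0.94398),
    ("k__Bacteria|p__Firmicutes|c__CFGB2834|o__OFGB2834|f__FGB2834", "2|1239|||", 0.94398),
]
flagged = missing_leaf_id(rows)
print(flagged)
```

Only the `c__Negativicutes` row carries an ID of its own; the other three are flagged at class, order, and family rank respectively.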

@harper357

Sorry for randomly jumping in here, but I have used MetaPhlAn a fair bit. The clade tax ID values come from NCBI, but the taxon/clade names come from their own clustering/GTDB.

I believe that the authors even kind of discourage using the tax ids.

I don't know if this would cause problems when merging across profilers, but you could add/use the last section of the clade name.

Ex:

k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Eubacteriales_unclassified|g__Eubacteriales_unclassified|s__Clostridiales_bacterium|t__SGB15143     2|1239|186801|186802|||1898207| 0.02366
k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Eubacteriales_unclassified|g__Eubacteriales_unclassified|s__Clostridiales_bacterium|t__SGB15159     2|1239|186801|186802|||1898207| 0.02308 

becomes

1898207_SGB15143 0.02366
1898207_SGB15159 0.02308 
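A possible way to derive such a key, sketched in Python. Using the last non-empty tax ID together with the final clade-name segment is one reading of the suggestion, and the function name `composite_key` is made up for illustration.

```python
def composite_key(clade_name, id_lineage):
    """Join the last non-empty tax ID with the final clade-name segment (rank prefix removed)."""
    ids = [part for part in id_lineage.split("|") if part]
    leaf = clade_name.split("|")[-1]
    _, sep, name = leaf.partition("__")
    if not sep:  # no rank prefix present, keep the segment as-is
        name = leaf
    return f"{ids[-1]}_{name}" if ids else name


lineage = (
    "k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|"
    "f__Eubacteriales_unclassified|g__Eubacteriales_unclassified|"
    "s__Clostridiales_bacterium|"
)
id_lineage = "2|1239|186801|186802|||1898207|"
key_a = composite_key(lineage + "t__SGB15143", id_lineage)
key_b = composite_key(lineage + "t__SGB15159", id_lineage)
print(key_a, key_b)
```

The two rows now get distinct keys, but the keys are strings rather than integers.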

@Midnighter
Contributor

Hi @harper357,

No need to apologize, more information is always welcome. Thank you for the explanation also, I was not aware how MetaPhlAn handles this.

Unfortunately, even though such a change looks simple from the outside, it would change taxpasta's internal logic a lot. There's not only the validation part which assumes integers, but also the whole integration with an existing taxonomy. Basically, we only maintain the identifiers and if users desire, we add back names and lineages using the identifiers to get information from a taxonomy. @harper357 do you know if they publish their taxonomy in a format that can be read by taxopy?

@harper357

@Midnighter I am not completely sure what the format for taxopy is. MetaPhlAn 4's second column contains the NCBI TaxIDs. Are you talking about the first column needing to be in a different format?

@Midnighter
Contributor

We use taxopy to load taxonomies in taxdump format. That means, we normally drop all information from individual profiles except taxon identifiers and their relative abundances. If a user wishes to output names, ranks, or lineages, we retrieve that from the taxonomy.

There are two things that concern me with MetaPhlAn then. 1) You say that they use NCBI identifiers, but actually use a custom clustering. I don't know if that will practically make a big difference, but it's nonetheless misleading if true. 2) If they have their own clustering, it is straightforward to create the taxdump output, which will also assign unique identifiers that can be used.

I realize that that will not happen soon, so we still need a solution right now. While I like your suggestion @harper357, it does have big consequences for how taxpasta is built. Need to think about that. It would also mean that the way we use taxonomies would not work for MetaPhlAn.
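For readers unfamiliar with the taxdump layout mentioned above: it is a set of files including `nodes.dmp` and `names.dmp`, whose fields are separated by `\t|\t`. The Python sketch below shows how unique integer identifiers could be assigned to MetaPhlAn clade names and rendered as such lines. The sequential numbering, the rank mapping (including calling `t__` entries "strain"), and the function name are assumptions for illustration. Real NCBI dumps carry more columns than shown, and whether taxopy accepts such a reduced file has not been verified here.

```python
# Assumed mapping from MetaPhlAn rank prefixes to rank names.
RANKS = {"k": "kingdom", "p": "phylum", "c": "class", "o": "order",
         "f": "family", "g": "genus", "s": "species", "t": "strain"}


def build_taxdump_lines(clade_names, first_id=1):
    """Assign one integer ID per distinct lineage prefix and emit nodes/names lines."""
    ids = {}  # lineage prefix -> synthetic ID
    nodes, names = [], []
    for clade_name in clade_names:
        parts = clade_name.split("|")
        for depth in range(1, len(parts) + 1):
            lineage = "|".join(parts[:depth])
            if lineage in ids:
                continue
            tax_id = ids[lineage] = first_id + len(ids)
            # A top-level entry is its own parent, like the root in NCBI's nodes.dmp.
            parent = ids["|".join(parts[:depth - 1])] if depth > 1 else tax_id
            prefix, _, name = parts[depth - 1].partition("__")
            rank = RANKS.get(prefix, "no rank")
            nodes.append(f"{tax_id}\t|\t{parent}\t|\t{rank}\t|")
            names.append(f"{tax_id}\t|\t{name}\t|\t\t|\tscientific name\t|")
    return ids, nodes, names


genus = "k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella"
ids, nodes, names = build_taxdump_lines([
    genus + "|s__Prevotella_copri_clade_A|t__SGB1626",
    genus + "|s__Prevotella_copri_clade_C|t__SGB1644",
])
print("\n".join(nodes))
```

In this scheme the two Prevotella copri clades that share NCBI ID 165179 receive different identifiers, and the SGB entries get identifiers of their own. A profile with several kingdoms would produce several self-parented roots, which a real taxonomy would need to join under a single root.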

@d-callan

I'll put this here in case it proves a helpful reference http://segatalab.cibio.unitn.it/data/Pasolli_et_al.html

Also, I'll comment that mapping metaphlan outputs to ncbi taxonomy seems a reasonable use case nonetheless and makes sense to support even if imperfectly.

@d-callan

Having thought about it a bit since yesterday, maybe we need this to actually be two issues? One for supporting MetaPhlAn on the NCBI taxonomy using the solution previously suggested by @Midnighter, with any warnings/flags necessary. That sounded like it'd be easy enough to do to serve as an interim solution and is probably worth supporting anyhow. And then a second issue for supporting MetaPhlAn using a taxonomy built on SGBs. That'd be the more complete solution.

@Midnighter Midnighter self-assigned this Oct 28, 2024
@cpauvert

Thanks for all the work on this, and the follow-up on issues! 💪 I'm having a similar issue and wanted to be able to follow along as well.
I was curious whether there has been any contact with people from the MetaPhlAn team, especially regarding the second/complete solution mentioned by @d-callan?

@Midnighter
Contributor

I have not initiated any closer communication/collaboration. If you can establish a relationship, that will be very beneficial for everyone. Feel free to point here and at taxprofiler for use cases.
