[BUG] Mapping ChEMBL IDs to ChEBI IDs using the 16/04/2024 metabolite bridge file produces inconsistent duplicate ChEBI IDs #45

Open
pklemmer opened this issue Apr 22, 2024 · 1 comment

Describe the bug

Using maps() to map ChEMBL IDs from an input df like:

source / identifier
Cl / CHEMBL1091
Cl / CHEMBL11
Cl / CHEMBL99

to ChEBI IDs, using the metabolites_20240416.bridge file as the loadDatabase() argument, produces inconsistently mapped duplicate ChEBI IDs:

source / identifier / target / mapping / isPrimary
Cl / CHEMBL1091 / Ce / CHEBI:17609 / T
Cl / CHEMBL1091 / Ce / 17609 / F

but also cases where both duplicate IDs are flagged as primary:

source / identifier / target / mapping / isPrimary
Cl / CHEMBL11 / Ce / CHEBI:47499 / T
Cl / CHEMBL11 / Ce / 47499 / T

or even cases where the duplicate IDs are flagged as both primary and non-primary:

source / identifier / target / mapping / isPrimary
Cl / CHEMBL1152 / Ce / CHEBI:8380 / T
Cl / CHEMBL1152 / Ce / 8380 / F
Cl / CHEMBL1152 / Ce / 8380 / T
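
The three patterns above can be checked mechanically. The following is a minimal base-R sketch, not part of BridgeDbR: it rebuilds the reported rows by hand, strips the optional "CHEBI:" prefix, and counts rows and distinct isPrimary flags per (ChEMBL ID, ChEBI number) pair. The column name chebi_number and the report data frame are hypothetical helpers introduced only for illustration.

```r
# Rows copied from the outputs shown above; isPrimary is kept as "T"/"F" text.
mapped <- data.frame(
  source     = "Cl",
  identifier = c("CHEMBL1091", "CHEMBL1091", "CHEMBL11", "CHEMBL11",
                 "CHEMBL1152", "CHEMBL1152", "CHEMBL1152"),
  target     = "Ce",
  mapping    = c("CHEBI:17609", "17609", "CHEBI:47499", "47499",
                 "CHEBI:8380", "8380", "8380"),
  isPrimary  = c("T", "F", "T", "T", "T", "F", "T"),
  stringsAsFactors = FALSE
)

# Strip an optional "CHEBI:" prefix so both spellings compare equal.
mapped$chebi_number <- sub("^CHEBI:", "", mapped$mapping)

# One key per (ChEMBL ID, ChEBI number) pair.
key <- paste(mapped$identifier, mapped$chebi_number, sep = "|")

# Per pair: number of rows, and number of distinct isPrimary flags.
n_rows  <- tapply(mapped$mapping, key, length)
n_flags <- tapply(mapped$isPrimary, key, function(x) length(unique(x)))

report <- data.frame(
  pair        = names(n_rows),
  rows        = as.integer(n_rows),
  conflicting = as.vector(n_flags[names(n_rows)] > 1),
  stringsAsFactors = FALSE
)
print(report)
```

For the rows above, every pair occurs more than once, and the CHEMBL1091 and CHEMBL1152 pairs carry conflicting isPrimary flags.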

Provide a minimally reproducible example (reprex)

The 'identifiers' argument for the maps() function is an input dataframe such as:

source / identifier
Cl / CHEMBL1091
Cl / CHEMBL11
Cl / CHEMBL99

which was generated like this:

metabolite_input <- data.frame(
  source = rep("Cl", length(mapped_chembls[, 1])),
  identifier = mapped_chembls[, 1]
)

where mapped_chembls is a data frame with a single column containing one ChEMBL ID in the format 'CHEMBL123' per row.

The 'mapper' argument is an absolute file path like:

"C:/Users/user/Documents/GitHub/repo/BridgeDb/metabolites_20240416.bridge"

and the 'target' argument is 'Ce' to map to ChEBI.
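
Put together, the steps described above might look like the sketch below. It assumes the file path is first passed to loadDatabase() and that the resulting object is what maps() receives as 'mapper'. The column name chembl_id and the three example IDs are stand-ins for the real mapped_chembls data. The mapping itself only runs when BridgeDbR and the bridge file are actually available.

```r
# Stand-in for the single-column data frame described above
# (the column name chembl_id is hypothetical).
mapped_chembls <- data.frame(
  chembl_id = c("CHEMBL1091", "CHEMBL11", "CHEMBL99"),
  stringsAsFactors = FALSE
)

# Input for maps(): system code "Cl" plus one ChEMBL ID per row.
metabolite_input <- data.frame(
  source     = rep("Cl", nrow(mapped_chembls)),
  identifier = mapped_chembls[, 1],
  stringsAsFactors = FALSE
)

bridge_path <- "C:/Users/user/Documents/GitHub/repo/BridgeDb/metabolites_20240416.bridge"

# Only attempt the mapping when BridgeDbR and the bridge file are present.
if (requireNamespace("BridgeDbR", quietly = TRUE) && file.exists(bridge_path)) {
  mapper <- BridgeDbR::loadDatabase(bridge_path)
  mapped <- BridgeDbR::maps(mapper, identifiers = metabolite_input, target = "Ce")
  print(mapped)
}
```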

Expected behavior

I believe that ChEBI IDs are typically associated with single unique ChEMBL IDs, so an ideal output should look like:

source / identifier / target / mapping / isPrimary
Cl / CHEMBL1152 / Ce / CHEBI:8380 / T

That is, one row per mapping, with the "CHEBI:" prefix in front of the actual ID.
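
Until the bridge file is corrected, the reported output could be collapsed into this expected shape as a post-processing step. The base-R sketch below is a possible workaround, not BridgeDbR behaviour. It rests on one assumption that may not hold for every use case: a mapping is treated as primary if any of its duplicate rows is flagged "T".

```r
# Example rows taken from the CHEMBL1152 output reported above.
mapped <- data.frame(
  source     = "Cl",
  identifier = "CHEMBL1152",
  target     = "Ce",
  mapping    = c("CHEBI:8380", "8380", "8380"),
  isPrimary  = c("T", "F", "T"),
  stringsAsFactors = FALSE
)

# Force every mapping into the prefixed "CHEBI:<number>" form.
mapped$mapping <- paste0("CHEBI:", sub("^CHEBI:", "", mapped$mapping))

# Assumption: a pair counts as primary if any of its duplicates is flagged "T".
key <- paste(mapped$identifier, mapped$mapping, sep = "|")
any_primary <- tapply(mapped$isPrimary == "T", key, any)
mapped$isPrimary <- ifelse(unname(any_primary[key]), "T", "F")

# Drop the now-identical duplicate rows.
collapsed <- unique(mapped)
rownames(collapsed) <- NULL
print(collapsed)
```

For these three input rows, the result is a single row with mapping CHEBI:8380 and isPrimary T.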

R Session Information

Please report the output of either sessionInfo() or
sessioninfo::session_info() here.

options(width = 120)
R version 4.3.3 (2024-02-29 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default


locale:
[1] LC_COLLATE=English_Europe.utf8  LC_CTYPE=English_Europe.utf8    LC_MONETARY=English_Europe.utf8 LC_NUMERIC=C                    LC_TIME=English_Europe.utf8    

time zone: Europe/Berlin
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] reprex_2.1.0         curl_5.2.1           BridgeDbR_2.10.2     rJava_1.0-11         RCy3_2.20.2          rWikiPathways_1.20.0 tidyr_1.3.1          rvest_1.0.4         
 [9] gprofiler2_0.2.3     stringr_1.5.1        httr_1.4.7           dplyr_1.1.4         

loaded via a namespace (and not attached):
 [1] gtable_0.3.4        rjson_0.2.21        ggplot2_3.5.0       htmlwidgets_1.6.4   caTools_1.18.2      vctrs_0.6.5         tools_4.3.3         bitops_1.0-7       
 [9] generics_0.1.3      stats4_4.3.3        base64url_1.4       tibble_3.2.1        fansi_1.0.6         pkgconfig_2.0.3     KernSmooth_2.23-22  data.table_1.15.4  
[17] RColorBrewer_1.1-3  uuid_1.2-0          graph_1.78.0        lifecycle_1.0.4     compiler_4.3.3      gplots_3.1.3.1      munsell_0.5.1       repr_1.1.7         
[25] uchardet_1.1.1      htmltools_0.5.8.1   RCurl_1.98-1.14     lazyeval_0.2.2      plotly_4.10.4       pillar_1.9.0        crayon_1.5.2        gtools_3.9.5       
[33] tidyselect_1.2.1    digest_0.6.35       stringi_1.8.3       purrr_1.0.2         RJSONIO_1.3-1.9     fastmap_1.1.1       grid_4.3.3          colorspace_2.1-0   
[41] cli_3.6.2           magrittr_2.0.3      base64enc_0.1-3     XML_3.99-0.16.1     utf8_1.2.4          IRdisplay_1.1       withr_3.0.0         scales_1.3.0       
[49] backports_1.4.1     IRkernel_1.3.2      pbdZMQ_0.3-10       evaluate_0.23       viridisLite_0.4.2   rlang_1.1.3         glue_1.7.0          selectr_0.4-2      
[57] BiocManager_1.30.22 xml2_1.3.6          BiocGenerics_0.46.0 pkgload_1.3.4       rstudioapi_0.16.0   jsonlite_1.8.8      R6_2.5.1            fs_1.6.3           

Indicate whether BiocManager::valid() returns TRUE.

BiocManager::valid() returns
"4 packages out-of-date; 0 packages too new"

Is the package installed via bioconda?

BridgeDbR is installed via BiocManager.

@egonw egonw self-assigned this Apr 22, 2024
@egonw egonw transferred this issue from bridgedb/BridgeDbR Apr 24, 2024

egonw commented Apr 24, 2024

Thanks for filing the issue! I need to get some details together. The problem is probably in the ID mapping file, and therefore caused by how we create it, hence the transfer.

@egonw egonw added the bug label Apr 24, 2024