
discrepancies in mordred values for the same improve_drug_id across different datasets #321


Open
ymahlich opened this issue Feb 4, 2025 · 1 comment
Labels
invalid This doesn't seem right

Comments

ymahlich commented Feb 4, 2025

After combining the drug_descriptor DataFrames from every currently available data set (example code below), 155 distinct improve_drug_id entries resolve to conflicting mordred values, i.e. at least one descriptor value differs for the same ID.

from pathlib import Path
import coderdata as cd
import pandas as pd

# importing all datasets into a dict
local_path = Path('/tmp/coderdata/data_in_tmp')
data_sets_info = cd.list_datasets(raw=True)
# data_sets_info = {'beataml':0, 'ccle':1 , 'fimm': 2}
data_sets = {}
for data_set in data_sets_info.keys():
    data_sets[data_set] = cd.load(name=data_set, local_path=local_path)

# getting all formatted drug_descriptor tables and adding them into a dict
dfs_to_merge = {}
for data_set in data_sets:
    if (data_sets[data_set].experiments is not None 
        and data_sets[data_set].drug_descriptors is not None
    ):
        df_tmp = data_sets[data_set].format(data_type='drug_descriptor', shape='wide')
        df_tmp = df_tmp.drop(columns=['morgan fingerprint']).add_prefix('mordred.')
        df_tmp['data_set_origin'] = data_set
        dfs_to_merge[data_set] = df_tmp

# concatenating the individual drug_descriptor tables and dropping duplicates based on all columns but `data_set_origin`
concat_drugs = pd.concat(dfs_to_merge.values())
concat_drugs = concat_drugs.drop_duplicates(subset=concat_drugs.columns.difference(['data_set_origin']))

# getting the improve_drug_id[s] that have conflicting mordred values 
counts = concat_drugs.index.value_counts()
print(counts[counts > 1])
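
The detection steps above can be tried on a toy frame without downloading any data. This is only a minimal sketch: the IDs, the single `mordred.x` column, the `toy_table` helper and all values are made up for illustration, and it assumes (as the snippet above implies) that improve_drug_id is the index of the wide table. The `groupby` cross-check is an alternative way to count distinct values per ID, not something taken from the original analysis.

```python
import pandas as pd

def toy_table(values, origin):
    # Hypothetical stand-in for one data set's wide drug_descriptor table
    idx = pd.Index(["SMI_1", "SMI_2"], name="improve_drug_id")
    df = pd.DataFrame({"mordred.x": values}, index=idx)
    df["data_set_origin"] = origin
    return df

concat = pd.concat([toy_table([1.0, 2.0], "set_a"),
                    toy_table([1.0, 2.5], "set_b")])
value_cols = concat.columns.difference(["data_set_origin"])

# Same steps as above: drop rows that agree on every descriptor column,
# then look for ids that still occur more than once
dedup = concat.drop_duplicates(subset=value_cols)
counts = dedup.index.value_counts()
conflicting = sorted(counts[counts > 1].index)

# Cross-check: count distinct values per id and column directly
n_distinct = concat[value_cols].groupby(level=0).nunique(dropna=False)
mask = (n_distinct > 1).any(axis=1)
conflicting_groupby = sorted(mask[mask].index)

print(conflicting, conflicting_groupby)
```

In this toy case only SMI_2 has differing values, so both approaches should report the same single ID.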

An example of a discrepancy, for the improve_drug_id SMI_39390, looks like this (compare, for example, row 2 with rows 1, 3 and 4 in column 1; every column has at least one value that differs):

| mordred.AATS0Z | mordred.AATS0are | mordred.AATS0d | mordred.AATS0dv | mordred.AATS0i | mordred.AATS0m | mordred.AATS0p | mordred.AATS0pe | mordred.AATS0s | mordred.AATS0se | ... | mordred.piPC10 | mordred.piPC2 | mordred.piPC3 | mordred.piPC4 | mordred.piPC5 | mordred.piPC6 | mordred.piPC7 | mordred.piPC8 | mordred.piPC9 | data_set_origin |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 24.654545454545456 | 6.410538181818179 | 3.618181818181818 | 8.945454545454545 | 163.65524747645867 | 98.16974362090495 | 1.5316283238459452 | 6.46704727272727 | 5.9595959595959584 | 7.861068145454542 | ... | 7.897340761810271 | 4.499809670330265 | 5.0891383555842 | 5.641907070938114 | 6.191595324113119 | 6.442838688784959 | 6.84587987526405 | 7.22501511587432 | 7.667313615851853 | beataml |
| 24.654545454545456 | 6.410538181818179 | 3.618181818181818 | 8.945454545454545 | 163.65524747645867 | 98.16974362090495 | 1.5316283238459452 | 6.46704727272727 | 5.9595959595959584 | 7.861068145454542 | ... | 7.897340761810271 | 4.499809670330265 | 5.0891383555842 | 5.641907070938114 | 6.191595324113119 | 6.442838688784959 | 6.84587987526405 | 7.22501511587432 | 7.667313615851853 | ctrpv2 |
| 24.654545454545456 | 6.410538181818179 | 3.618181818181818 | 8.945454545454545 | 163.65524747645867 | 98.16974362090495 | 1.5316283238459452 | 6.46704727272727 | 5.9595959595959584 | 7.861068145454542 | ... | 7.897340761810271 | 4.499809670330265 | 5.0891383555842 | 5.641907070938114 | 6.191595324113119 | 6.442838688784959 | 6.84587987526405 | 7.22501511587432 | 7.667313615851853 | mpnst |
| 24.654545454545456 | 6.410538181818179 | 3.618181818181818 | 8.945454545454545 | 163.65524747645867 | 98.16974362090495 | 1.5316283238459452 | 6.46704727272727 | 5.9595959595959584 | 7.861068145454542 | ... | 7.897340761810271 | 4.499809670330265 | 5.0891383555842 | 5.641907070938114 | 6.191595324113119 | 6.442838688784959 | 6.84587987526405 | 7.22501511587432 | 7.667313615851853 | mpnstpdx |

Attached are the full list of the 155 drug IDs with discrepancies and the full subset DF containing only the improve_drug_id entries that cause this behaviour. The example table above can be generated by subsetting the attached DF to a single drug ID and running the result through the code snippet below:

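def cols_having_unique(df):
    """Return a copy of df restricted to columns holding more than one distinct value (NaN counts as a value)."""
    my_cols = []
    for col in df.columns:
        if df[col].nunique(dropna=False) > 1:
            my_cols.append(col)
    return df[my_cols].copy()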
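
As a quick illustration of what the helper above returns, here is a small sketch on a made-up single-drug frame. The ID `SMI_X`, the column names and the values are hypothetical; the function is repeated in condensed form only so the block runs on its own.

```python
import pandas as pd

def cols_having_unique(df):
    # Same logic as the helper above, condensed into a comprehension
    keep = [col for col in df.columns if df[col].nunique(dropna=False) > 1]
    return df[keep].copy()

# Made-up rows for one drug id, as they might appear in three data sets
one_drug = pd.DataFrame(
    {
        "mordred.a": [1.0, 1.0, 1.0],   # identical everywhere, so dropped
        "mordred.b": [0.5, 0.7, 0.5],   # differs in one data set, so kept
        "data_set_origin": ["set_a", "set_b", "set_c"],
    },
    index=["SMI_X"] * 3,
)

diff = cols_having_unique(one_drug)
print(list(diff.columns))  # ['mordred.b', 'data_set_origin']
```

Note that `data_set_origin` is always retained because it differs by construction, which is convenient for seeing which data set carries the deviating value.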

morderd-discrepancy-full-list-incl-data-set-origin.csv

morderd-discrepancy-improve_drug_id-list.csv


ymahlich commented Feb 5, 2025

The drug_descriptor table currently stores boolean values as "True" / "False" strings (imho a good choice). It seems, though, that some models only accept numeric inputs. It may be worth considering a switch to 1/0 along with the changes discussed in #323.
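
Until such a change lands, a user could do the conversion on their side. The following is a rough sketch only: the column names and values are invented, and it assumes the boolean descriptors arrive as exact "True" / "False" strings in otherwise ordinary pandas columns.

```python
import pandas as pd

# Made-up frame: one string-encoded boolean column, one numeric column
df = pd.DataFrame({
    "mordred.flag": ["True", "False", "True"],
    "mordred.val": [0.1, 0.2, 0.3],
})

bool_map = {"True": 1, "False": 0}
for col in df.columns:
    # Only touch columns whose values are all "True"/"False" strings
    if df[col].isin(list(bool_map)).all():
        df[col] = df[col].map(bool_map).astype("int8")

print(df.dtypes)
```

Columns containing missing values alongside the strings would be skipped by this check and would need separate handling.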

Furthermore, rdkit seems to return None for some mordred descriptors, such as MAXaaNH when NaaNH == 0 (the count of aaNH). This also seems to cause problems because of the input some models expect.
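
One possible user-side workaround, sketched here under assumptions, is to coerce the descriptor columns to numeric so that None becomes NaN, and then decide per model how to treat the gaps. The frame below is made up (only the descriptor names come from the comment above); whether to impute, drop, or keep NaN is a modelling choice this sketch does not settle.

```python
import pandas as pd

# Made-up values: MAXaaNH is None where the aaNH count (NaaNH) is 0
df = pd.DataFrame(
    {"mordred.NaaNH": [0, 2], "mordred.MAXaaNH": [None, 3.1]},
    dtype=object,
)

# Coerce every column to a numeric dtype; None (and unparsable entries) become NaN
numeric = df.apply(pd.to_numeric, errors="coerce")

# Report how many gaps each descriptor has before choosing a strategy
print(numeric.isna().sum())
```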
