discrepancies in mordred values for the same improve_drug_id across different datasets #321

ymahlich · 2025-02-04T00:57:25Z

After combining the drug_descriptor DataFrames from every data set currently available (example code below), there are 155 improve_drug_id entries that resolve to conflicting mordred values, i.e. at least one descriptor value differs between data sets for the same ID.

from pathlib import Path

import coderdata as cd
import pandas as pd

# importing all datasets into a dict
local_path = Path('/tmp/coderdata/data_in_tmp')
data_sets_info = cd.list_datasets(raw=True)
# data_sets_info = {'beataml':0, 'ccle':1 , 'fimm': 2}
data_sets = {}
for data_set in data_sets_info.keys():
    data_sets[data_set] = cd.load(name=data_set, local_path=local_path)

# getting all formatted drug_descriptor tables and adding them into a dict
dfs_to_merge = {}
for data_set in data_sets:
    if (data_sets[data_set].experiments is not None 
        and data_sets[data_set].drug_descriptors is not None
    ):
        df_tmp = data_sets[data_set].format(data_type='drug_descriptor', shape='wide')
        df_tmp = df_tmp.drop(columns=['morgan fingerprint']).add_prefix('mordred.')
        df_tmp['data_set_origin'] = data_set
        dfs_to_merge[data_set] = df_tmp

# concatenating the individual drug_descriptor tables and dropping duplicates based on all columns but `data_set_origin`
concat_drugs = pd.concat(dfs_to_merge.values())
concat_drugs = concat_drugs.drop_duplicates(subset=concat_drugs.columns.difference(['data_set_origin']))

# getting the improve_drug_id[s] that have conflicting mordred values 
counts = concat_drugs.index.value_counts()
print(counts[counts > 1])
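
To make the detection logic easier to check, here is a minimal sketch on toy data. The table contents, the `set_a`/`set_b` names, and the assumption that the index is named `improve_drug_id` are all hypothetical. It differs from the snippet above in one respect: `drop_duplicates` ignores the index, so this variant moves the ID into a column and includes it in the comparison.

```python
import pandas as pd

# Hypothetical toy descriptor tables, indexed by improve_drug_id.
ids = pd.Index(["SMI_1", "SMI_2"], name="improve_drug_id")
df_a = pd.DataFrame({"mordred.x": [1.0, 2.0], "mordred.y": [0.5, 0.7]}, index=ids)
df_a["data_set_origin"] = "set_a"
df_b = pd.DataFrame({"mordred.x": [1.0, 2.5], "mordred.y": [0.5, 0.7]}, index=ids)
df_b["data_set_origin"] = "set_b"

# Move the ID into a column so it takes part in the duplicate check.
long = pd.concat([df_a, df_b]).reset_index()

# Compare on every column except the origin label (ID + descriptor values).
value_cols = list(long.columns.difference(["data_set_origin"]))
deduped = long.drop_duplicates(subset=value_cols)

# An ID that survives more than once has at least two distinct value rows.
counts = deduped["improve_drug_id"].value_counts()
conflicting = counts[counts > 1]
print(conflicting)
```

Without the ID in the subset, two different drugs that happen to share an identical descriptor row would be collapsed into one, which could hide a conflict for the dropped ID.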

An example of a discrepancy for the improve_drug_id SMI_39390 looks like this (compare row 2 with rows 1, 3 and 4 in column 1, for example; every column has at least one value that differs):

mordred.AATS0Z	mordred.AATS0are	mordred.AATS0d	mordred.AATS0dv	mordred.AATS0i	mordred.AATS0m	mordred.AATS0p	mordred.AATS0pe	mordred.AATS0s	mordred.AATS0se	...	mordred.piPC10	mordred.piPC2	mordred.piPC3	mordred.piPC4	mordred.piPC5	mordred.piPC6	mordred.piPC7	mordred.piPC8	mordred.piPC9	data_set_origin
24.654545454545456	6.410538181818179	3.618181818181818	8.945454545454545	163.65524747645867	98.16974362090495	1.5316283238459452	6.46704727272727	5.9595959595959584	7.861068145454542	...	7.897340761810271	4.499809670330265	5.0891383555842	5.641907070938114	6.191595324113119	6.442838688784959	6.84587987526405	7.22501511587432	7.667313615851853	beataml
24.654545454545456	6.410538181818179	3.618181818181818	8.945454545454545	163.65524747645867	98.16974362090495	1.5316283238459452	6.46704727272727	5.9595959595959584	7.861068145454542	...	7.897340761810271	4.499809670330265	5.0891383555842	5.641907070938114	6.191595324113119	6.442838688784959	6.84587987526405	7.22501511587432	7.667313615851853	ctrpv2
24.654545454545456	6.410538181818179	3.618181818181818	8.945454545454545	163.65524747645867	98.16974362090495	1.5316283238459452	6.46704727272727	5.9595959595959584	7.861068145454542	...	7.897340761810271	4.499809670330265	5.0891383555842	5.641907070938114	6.191595324113119	6.442838688784959	6.84587987526405	7.22501511587432	7.667313615851853	mpnst
24.654545454545456	6.410538181818179	3.618181818181818	8.945454545454545	163.65524747645867	98.16974362090495	1.5316283238459452	6.46704727272727	5.9595959595959584	7.861068145454542	...	7.897340761810271	4.499809670330265	5.0891383555842	5.641907070938114	6.191595324113119	6.442838688784959	6.84587987526405	7.22501511587432	7.667313615851853	mpnstpdx

Attached are the full list of 155 drug IDs where discrepancies exist and the full subsetted DF containing only the improve_drug_id entries that cause this behaviour. The example table above can be generated by subsetting the attached DF to a single drug ID and running the resulting DF through the code snippet below:

def cols_having_unique(df):
    # keep only the columns that hold more than one distinct value (NaN counts as a value)
    my_cols = []
    for col in df.columns:
        if df[col].nunique(dropna=False) > 1:
            my_cols.append(col)
    return df[my_cols].copy()
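
As a usage illustration, the sketch below applies the same function to a small made-up table for a single drug ID. The ID `SMI_X`, the column names, and the values are hypothetical and not taken from the attached files.

```python
import pandas as pd

def cols_having_unique(df):
    """Keep only columns holding more than one distinct value (NaN counts as a value)."""
    varying = [col for col in df.columns if df[col].nunique(dropna=False) > 1]
    return df[varying].copy()

# Hypothetical rows for one drug ID coming from three data sets.
toy = pd.DataFrame(
    {
        "mordred.a": [1.0, 1.0, 1.0],        # identical everywhere, gets dropped
        "mordred.b": [0.1, 0.2, 0.1],        # differs in the second row, is kept
        "data_set_origin": ["s1", "s2", "s3"],
    },
    index=["SMI_X"] * 3,
)

result = cols_having_unique(toy)
print(result.columns.tolist())  # ['mordred.b', 'data_set_origin']
```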

morderd-discrepancy-full-list-incl-data-set-origin.csv

morderd-discrepancy-improve_drug_id-list.csv

ymahlich · 2025-02-05T22:51:22Z

The drug_descriptor table currently contains "True" / "False" strings for boolean values (imho a good choice). Some models, however, seem to accept only numeric inputs. It may be worth switching to 1/0 as part of the changes discussed in #323.
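
Until such a change lands, a user-side conversion could look roughly like the sketch below. It assumes descriptor values arrive as strings and that boolean columns contain only "True" / "False"; the column names and values are illustrative and not taken from the actual table.

```python
import pandas as pd

# Hypothetical wide descriptor table with string-typed values.
desc = pd.DataFrame(
    {
        "mordred.Lipinski": ["True", "False", "True"],
        "mordred.nAtom": ["12", "30", "7"],
    }
)

bool_map = {"True": 1, "False": 0}

def is_bool_string_col(series):
    """True if the column has values and all of them are 'True'/'False' strings."""
    non_null = series.dropna()
    return (not non_null.empty) and bool(non_null.isin(list(bool_map)).all())

# Detect the boolean-like columns and map them to 1/0 in place.
bool_cols = [col for col in desc.columns if is_bool_string_col(desc[col])]
for col in bool_cols:
    desc[col] = desc[col].map(bool_map)

print(bool_cols)
print(desc.dtypes)
```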

Furthermore, rdkit seems to return None for some mordred descriptors, such as MAXaaNH when NaaNH == 0 (NaaNH being the count of aaNH). This also seems to cause problems because of the input some models expect.
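
One possible way to make such values explicit on the consumer side is to coerce every column to numeric, so that None becomes NaN. This is only a sketch under the assumption that the affected columns are otherwise numeric; the values below are made up, and whether to impute or drop the resulting NaNs depends on the model.

```python
import pandas as pd

# Hypothetical excerpt: MAXaaNH is undefined where the count NaaNH is 0.
raw = pd.DataFrame(
    {
        "mordred.NaaNH": [0, 2],
        "mordred.MAXaaNH": [None, "3.1"],
    }
)

# errors="coerce" turns None and any unparsable entries into NaN.
numeric = raw.apply(pd.to_numeric, errors="coerce")

# Count missing values per column to see which descriptors are affected.
missing_per_col = numeric.isna().sum()
print(missing_per_col)
```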

ymahlich added the "invalid" label (This doesn't seem right) Feb 4, 2025

ymahlich added this to CoderData Feb 4, 2025

ymahlich mentioned this issue Feb 5, 2025

pass drug_descriptors (contains mordred descriptors) to consecutive build runs for datasets #323

Open
