After combining the drug_descriptor DataFrames from every currently available data set (example code below), 155 distinct improve_drug_id entries resolve to conflicting mordred values, i.e. the same ID has differing values in one or more descriptor columns.
    from pathlib import Path

    import coderdata as cd
    import pandas as pd

    # importing all datasets into a dict
    local_path = Path('/tmp/coderdata/data_in_tmp')
    data_sets_info = cd.list_datasets(raw=True)
    # data_sets_info = {'beataml':0, 'ccle':1 , 'fimm': 2}
    data_sets = {}
    for data_set in data_sets_info.keys():
        data_sets[data_set] = cd.load(name=data_set, local_path=local_path)

    # getting all formatted drug_descriptor tables and adding them into a dict
    dfs_to_merge = {}
    for data_set in data_sets:
        if (
            data_sets[data_set].experiments is not None
            and data_sets[data_set].drug_descriptors is not None
        ):
            df_tmp = data_sets[data_set].format(data_type='drug_descriptor', shape='wide')
            df_tmp = df_tmp.drop(columns=['morgan fingerprint']).add_prefix('mordred.')
            df_tmp['data_set_origin'] = data_set
            dfs_to_merge[data_set] = df_tmp

    # concatenating the individual drug_descriptor tables and dropping duplicates
    # based on all columns but `data_set_origin`
    concat_drugs = pd.concat(dfs_to_merge.values())
    concat_drugs = concat_drugs.drop_duplicates(subset=concat_drugs.columns.difference(['data_set_origin']))

    # getting the improve_drug_id[s] that have conflicting mordred values
    counts = concat_drugs.index.value_counts()
    print(counts[counts > 1])
An example of a discrepancy for the improve_drug_id SMI_39390 looks like this (compare column 1, row 2 against rows 1, 3 and 4, for example; every column has at least one value that differs):
| mordred.AATS0Z | mordred.AATS0are | mordred.AATS0d | mordred.AATS0dv | mordred.AATS0i | mordred.AATS0m | mordred.AATS0p | mordred.AATS0pe | mordred.AATS0s | mordred.AATS0se | ... | mordred.piPC10 | mordred.piPC2 | mordred.piPC3 | mordred.piPC4 | mordred.piPC5 | mordred.piPC6 | mordred.piPC7 | mordred.piPC8 | mordred.piPC9 | data_set_origin |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 24.654545454545456 | 6.410538181818179 | 3.618181818181818 | 8.945454545454545 | 163.65524747645867 | 98.16974362090495 | 1.5316283238459452 | 6.46704727272727 | 5.9595959595959584 | 7.861068145454542 | ... | 7.897340761810271 | 4.499809670330265 | 5.0891383555842 | 5.641907070938114 | 6.191595324113119 | 6.442838688784959 | 6.84587987526405 | 7.22501511587432 | 7.667313615851853 | beataml |
| 24.654545454545456 | 6.410538181818179 | 3.618181818181818 | 8.945454545454545 | 163.65524747645867 | 98.16974362090495 | 1.5316283238459452 | 6.46704727272727 | 5.9595959595959584 | 7.861068145454542 | ... | 7.897340761810271 | 4.499809670330265 | 5.0891383555842 | 5.641907070938114 | 6.191595324113119 | 6.442838688784959 | 6.84587987526405 | 7.22501511587432 | 7.667313615851853 | ctrpv2 |
| 24.654545454545456 | 6.410538181818179 | 3.618181818181818 | 8.945454545454545 | 163.65524747645867 | 98.16974362090495 | 1.5316283238459452 | 6.46704727272727 | 5.9595959595959584 | 7.861068145454542 | ... | 7.897340761810271 | 4.499809670330265 | 5.0891383555842 | 5.641907070938114 | 6.191595324113119 | 6.442838688784959 | 6.84587987526405 | 7.22501511587432 | 7.667313615851853 | mpnst |
| 24.654545454545456 | 6.410538181818179 | 3.618181818181818 | 8.945454545454545 | 163.65524747645867 | 98.16974362090495 | 1.5316283238459452 | 6.46704727272727 | 5.9595959595959584 | 7.861068145454542 | ... | 7.897340761810271 | 4.499809670330265 | 5.0891383555842 | 5.641907070938114 | 6.191595324113119 | 6.442838688784959 | 6.84587987526405 | 7.22501511587432 | 7.667313615851853 | mpnstpdx |
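To narrow down which descriptor columns actually disagree for a given improve_drug_id, and by how much, something like the following might help. This is only a sketch on a made-up toy table: the IDs `SMI_1` / `SMI_2`, the columns `mordred.A` / `mordred.B`, and the helper `conflicting_columns` are hypothetical and not part of coderdata. It assumes, as in the example code, that improve_drug_id is the index of the concatenated DataFrame.

```python
import pandas as pd

# Toy stand-in for the concatenated descriptor table (made-up IDs and values).
concat_drugs = pd.DataFrame(
    {
        "mordred.A": [1.0, 1.0 + 1e-12, 2.0, 2.0],
        "mordred.B": [0.5, 0.5, 3.0, 4.0],
        "data_set_origin": ["beataml", "ctrpv2", "beataml", "ctrpv2"],
    },
    index=pd.Index(["SMI_1", "SMI_1", "SMI_2", "SMI_2"], name="improve_drug_id"),
)


def conflicting_columns(df, drug_id, ignore=("data_set_origin",)):
    """Return max - min per column, keeping only columns whose values differ for one drug."""
    # Selecting with a list keeps every row that shares this index label.
    rows = df.loc[[drug_id]].drop(columns=list(ignore))
    # Non-numeric entries (e.g. "True"/"False" strings) become NaN and are skipped.
    numeric = rows.apply(pd.to_numeric, errors="coerce")
    spread = numeric.max() - numeric.min()
    return spread[spread > 0]


spread_1 = conflicting_columns(concat_drugs, "SMI_1")
spread_2 = conflicting_columns(concat_drugs, "SMI_2")
print(spread_1)
print(spread_2)
```

A very small spread (as for `SMI_1` in the toy data) would hint at floating-point noise between data set builds, whereas a large one (as for `SMI_2`) would point to genuinely different descriptor values. Which of the two applies to the 155 IDs is not something this sketch can tell without the real data.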
Attached are the full list of the 155 drug IDs where discrepancies exist and the full subsetted DF containing only the improve_drug_id entries that cause this behaviour. The example table above can be generated by subsetting the attached DF to a single drug ID and running the resulting DF through the code snippet below:
The drug_descriptor table currently contains "True" / "False" strings for boolean values (imho a good choice). However, some models seem to accept only numeric inputs. It may be worth switching to 1/0 as part of the changes discussed in #323.
Furthermore, rdkit seems to return None for some mordred descriptors such as MAXaaNH when NaaNH == 0 (the count of aaNH). This also seems to cause problems, given the input some models expect.
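Until this is settled upstream, a user-side workaround could look roughly like the sketch below. It maps the "True" / "False" strings to 1/0 and turns None into NaN, so that every column ends up numeric. The table contents and the helper `to_numeric_descriptors` are hypothetical (only MAXaaNH and NaaNH are taken from the text above), and whether NaN is acceptable depends on the model consuming the data.

```python
import pandas as pd

# Hypothetical wide descriptor table; values are made up for illustration.
descriptors = pd.DataFrame(
    {
        "mordred.Lipinski": ["True", "False", "True"],
        "mordred.MAXaaNH": pd.Series([None, 2.5, None], dtype=object),
        "mordred.NaaNH": [0, 1, 0],
    }
)


def to_numeric_descriptors(df):
    """Map "True"/"False" strings to 1/0 and coerce everything else to numbers (None -> NaN)."""
    bool_map = {"True": 1, "False": 0}

    def convert(col):
        # Only string entries are looked up; numbers and None pass through unchanged.
        mapped = col.map(lambda v: bool_map.get(v, v) if isinstance(v, str) else v)
        # Anything that still is not a number (including None) becomes NaN.
        return pd.to_numeric(mapped, errors="coerce")

    return df.apply(convert)


numeric_descriptors = to_numeric_descriptors(descriptors)
print(numeric_descriptors.dtypes)
```

Imputing or dropping the resulting NaN values is left to the caller, since the right choice depends on the model.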
morderd-discrepancy-full-list-incl-data-set-origin.csv
morderd-discrepancy-improve_drug_id-list.csv