unexpected data in *_drugs.tsv #319

Open
ymahlich opened this issue Feb 3, 2025 · 2 comments
Labels
invalid This doesn't seem right

Comments

@ymahlich
Collaborator

ymahlich commented Feb 3, 2025

Three improve_drug_ids each resolve to two different canSMILES values:

  • SMI_20644
  • SMI_9830
  • SMI_55606

SMI_20644

improve_drug_id canSMILES
SMI_20644 C1CN(P(=O)(OC1)NCCCl)CCCl
SMI_20644 NaN

SMI_9830

improve_drug_id canSMILES
SMI_9830 CC1C(C(CC(O1)OC2C(OC(CC2O)OC3C(OC(CC3O)OC4CCC5...
SMI_9830 NaN

SMI_55606

improve_drug_id canSMILES
SMI_55606 COCCOC1=C(C=C2C(=C1)C(=NC=N2)NC3=CC=CC(=C3)C#C...
SMI_55606 COCCOC1=C(C=C2C(=C1)C(=NC=N2)NC3=CC=CC(=C3)C

Three improve_drug_ids resolve to NaN canSMILES:

improve_drug_id canSMILES
SMI_56588 NaN
SMI_9830 NaN
SMI_20644 NaN
  • SMI_9830 and SMI_20644 also appear in the "two different canSMILES" list above
  • SMI_56588 resolves only to NaN

Once I have figured out which dataset this information comes from, I will update the issue.
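The two checks described in this comment can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the project's actual validation code. It assumes a tab-separated file with `improve_drug_id` and `canSMILES` columns and that missing values appear as an empty field or the literal `NaN`. The function name `audit_smiles` and the toy rows are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Assumption: how a missing canSMILES is spelled in the file.
MISSING = {"", "nan"}


def audit_smiles(rows):
    """Return (ids with more than one distinct canSMILES value, ids with a missing canSMILES)."""
    values = defaultdict(set)
    for row in rows:
        smiles = (row.get("canSMILES") or "").strip()
        # Collapse every missing spelling to None so it counts as a single value.
        values[row["improve_drug_id"]].add(None if smiles.lower() in MISSING else smiles)
    conflicting = sorted(i for i, v in values.items() if len(v) > 1)
    missing = sorted(i for i, v in values.items() if None in v)
    return conflicting, missing


# Toy data with placeholder IDs and SMILES, not the real *_drugs.tsv content.
sample = (
    "improve_drug_id\tcanSMILES\n"
    "SMI_A\tCCO\n"
    "SMI_A\tNaN\n"
    "SMI_B\tCCN\n"
    "SMI_B\tCC\n"
    "SMI_C\t\n"
    "SMI_D\tCCC\n"
    "SMI_D\tCCC\n"
)
rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
conflicting, missing = audit_smiles(rows)
print(conflicting)  # ['SMI_A', 'SMI_B']
print(missing)      # ['SMI_A', 'SMI_C']
```

Counting NaN as its own value is deliberate here: it makes an ID with one real string plus one NaN show up in both lists, which matches how SMI_9830 and SMI_20644 are reported above.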

@ymahlich ymahlich added the invalid This doesn't seem right label Feb 3, 2025
@ymahlich
Collaborator Author

ymahlich commented Feb 3, 2025

Dataset origin for the NaN values (see the last column of the table):

improve_drug_id  chem_name                                          pubchem_id  formula  weight  InChIKey  canSMILES  data_set_origin
SMI_56588        mek1/2 inhibitor                                   NaN         NaN      NaN     NaN       NaN        beataml
SMI_56588        vargetef                                           NaN         NaN      NaN     NaN       NaN        beataml
SMI_56588        baiclein                                           NaN         NaN      NaN     NaN       NaN        beataml
SMI_9830         3-((2,6-dideoxy-4-o-[2,6-dideoxy-4-o-(2,6-dide...  NaN         NaN      NaN     NaN       NaN        mpnst
SMI_20644        n,3-bis(2-chloroethyl)tetrahydro-2h-1,3,2-oxaz...  NaN         NaN      NaN     NaN       NaN        mpnst

canSMILES for SMI_55606:

The outlier that produces the second canSMILES string comes from the mpnst dataset (see row 8, the second-to-last row):

improve_drug_id canSMILES data_set_origin
SMI_55606 COCCOC1=C(C=C2C(=C1)C(=NC=N2)NC3=CC=CC(=C3)C#C... beataml
SMI_55606 COCCOC1=C(C=C2C(=C1)C(=NC=N2)NC3=CC=CC(=C3)C#C... ccle
SMI_55606 COCCOC1=C(C=C2C(=C1)C(=NC=N2)NC3=CC=CC(=C3)C#C... ctrpv2
SMI_55606 COCCOC1=C(C=C2C(=C1)C(=NC=N2)NC3=CC=CC(=C3)C#C... fimm
SMI_55606 COCCOC1=C(C=C2C(=C1)C(=NC=N2)NC3=CC=CC(=C3)C#C... gcsi
SMI_55606 COCCOC1=C(C=C2C(=C1)C(=NC=N2)NC3=CC=CC(=C3)C#C... gdscv1
SMI_55606 COCCOC1=C(C=C2C(=C1)C(=NC=N2)NC3=CC=CC(=C3)C#C... gdscv2
SMI_55606 COCCOC1=C(C=C2C(=C1)C(=NC=N2)NC3=CC=CC(=C3)C mpnst
SMI_55606 COCCOC1=C(C=C2C(=C1)C(=NC=N2)NC3=CC=CC(=C3)C#C... prism
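The per-dataset comparison in the table above could be automated along these lines. This is only a sketch under stated assumptions: rows are dictionaries with `improve_drug_id`, `canSMILES`, and `data_set_origin` keys, and the most frequent string for an ID is treated as the reference value. The function name `find_outliers`, the dataset labels, and the SMILES strings in the example are placeholders. The prefix flag is included because the mpnst value shown above looks like a shortened form of the string reported by the other datasets.

```python
from collections import Counter


def find_outliers(rows, drug_id):
    """List (data_set_origin, canSMILES, is_prefix_of_majority) for rows that
    disagree with the most common canSMILES recorded for drug_id."""
    subset = [r for r in rows if r["improve_drug_id"] == drug_id]
    if not subset:
        return []
    # Assumption: the most frequent string is the intended one.
    majority, _ = Counter(r["canSMILES"] for r in subset).most_common(1)[0]
    return [
        (r["data_set_origin"], r["canSMILES"], majority.startswith(r["canSMILES"]))
        for r in subset
        if r["canSMILES"] != majority
    ]


# Toy rows with placeholder values, not the real data.
rows = [
    {"improve_drug_id": "SMI_X", "canSMILES": "CCOC#N", "data_set_origin": "set_a"},
    {"improve_drug_id": "SMI_X", "canSMILES": "CCOC#N", "data_set_origin": "set_b"},
    {"improve_drug_id": "SMI_X", "canSMILES": "CCOC", "data_set_origin": "set_c"},
]
print(find_outliers(rows, "SMI_X"))  # [('set_c', 'CCOC', True)]
```

A `True` in the last position only says the outlier is a leading substring of the majority string. It does not by itself explain where the shortening happened.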

@ymahlich
Collaborator Author

ymahlich commented Feb 6, 2025

As per Ryan, the SMILES string for SMI_56588 should be C1=CC=C(C=C1)C2=CC(=O)C3=C(O2)C=C(C(=C3O)O)O. Once the issue has been fixed on the build backend, we can check that SMI_56588 is populated accordingly.
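A follow-up check after the backend fix might look like the sketch below. It assumes the rebuilt file has been loaded into dictionaries with `improve_drug_id` and `canSMILES` keys. The helper name `check_expected` and the status labels are hypothetical, and the comparison is a plain string match, so a chemically equivalent SMILES written differently would be flagged for review rather than accepted.

```python
# Expected value taken from the comment above.
EXPECTED = {"SMI_56588": "C1=CC=C(C=C1)C2=CC(=O)C3=C(O2)C=C(C(=C3O)O)O"}


def check_expected(rows, expected):
    """Return {drug_id: 'ok' | 'needs review'} for every ID in expected.

    'ok' means every row for that ID carries exactly the expected string.
    """
    seen = {}
    for row in rows:
        seen.setdefault(row["improve_drug_id"], set()).add(row["canSMILES"])
    return {
        drug_id: "ok" if seen.get(drug_id) == {smiles} else "needs review"
        for drug_id, smiles in expected.items()
    }


# Illustrative before/after rows, not real build output.
before_fix = [{"improve_drug_id": "SMI_56588", "canSMILES": "NaN"}]
after_fix = [{"improve_drug_id": "SMI_56588", "canSMILES": EXPECTED["SMI_56588"]}]
print(check_expected(before_fix, EXPECTED))  # {'SMI_56588': 'needs review'}
print(check_expected(after_fix, EXPECTED))   # {'SMI_56588': 'ok'}
```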
