Commit 00d2dd1

feat: add ld50 data (#529)
1 parent 61ca373 commit 00d2dd1

2 files changed: +181 -0 lines changed

data/tabular/ld50_catmos/meta.yaml

Lines changed: 145 additions & 0 deletions
@@ -0,0 +1,145 @@
+---
+name: ld50_catmos
+description: |-
+    Acute oral toxicity LD50 is the dose that is lethal
+    to 50% of the test animals (here: rats).
+    The lower the LD50, the more acutely toxic the compound.
+    Values reported for the same SMILES were aggregated by computing the mean.
+targets:
+    - id: CATMoS_LD50_mgkg
+      description: Acute Toxicity LD50.
+      units: mg/kg
+      type: continuous
+      names:
+          - noun: acute oral toxicity rat LD50
+          - noun: acute oral toxicity (LD50 in rats)
+      uris:
+          - http://www.bioassayontology.org/bao#BAO_0002117
+      significant_digits: 1
+    - id: log10_LD50
+      description: Acute Toxicity LD50.
+      units: log10(mg/kg)
+      type: continuous
+      names:
+          - noun: log10 acute oral toxicity rat LD50
+          - noun: log10 acute oral toxicity (LD50 in rats)
+          - noun: log10 LD50 in rats (oral exposure)
+          - noun: log10 rat LD50 (oral exposure)
+      significant_digits: 2
+    - id: num_ghose_violations
+      description: Ghose filter violations
+      type: ordinal
+      significant_digits: 0
+      names:
+          - noun: Ghose filter violations
+          - noun: violations of the Ghose filter
+    - id: num_lead_likeness_violations
+      description: Lead likeness filter violations
+      type: ordinal
+      significant_digits: 0
+      names:
+          - noun: lead likeness filter violations
+          - noun: violations of the lead likeness filter
+    - id: num_lipinski_violations
+      description: Lipinski filter violations
+      type: ordinal
+      significant_digits: 0
+      names:
+          - noun: Lipinski rule violations
+          - noun: violations of the Lipinski rules
+    - id: molecular_mass
+      description: Molecular mass
+      type: continuous
+      units: g/mol
+      names:
+          - noun: molecular mass
+          - noun: molecular weight
+    - id: num_carbon_atoms
+      description: Number of carbon atoms
+      type: ordinal
+      significant_digits: 0
+      names:
+          - noun: carbon atoms
+    - id: num_oxygen_atoms
+      description: Number of oxygen atoms
+      type: ordinal
+      significant_digits: 0
+      names:
+          - noun: oxygen atoms
+identifiers:
+    - id: SMILES
+      type: SMILES
+      description: SMILES
+license: CC BY 4.0
+links:
+    - url: https://ehp.niehs.nih.gov/doi/full/10.1289/EHP8495#supplementary-materials
+      description: corresponding publication
+num_points: 9032
+bibtex:
+    - |-
+      @article{Mansouri_2021, title={CATMoS: Collaborative Acute Toxicity Modeling Suite},
+      volume={129},
+      ISSN={1552-9924},
+      url={http://dx.doi.org/10.1289/EHP8495},
+      DOI={10.1289/ehp8495},
+      number={4},
+      journal={Environmental Health Perspectives},
+      publisher={Environmental Health Perspectives},
+      author={Mansouri, Kamel and Karmaus, Agnes L. and Fitzpatrick, Jeremy
+      and Patlewicz, Grace and Pradeep, Prachi and Alberga, Domenico and
+      Alepee, Nathalie and Allen, Timothy E.H. and Allen, Dave and Alves, Vinicius M.
+      and Andrade, Carolina H. and Auernhammer, Tyler R. and Ballabio, Davide and
+      Bell, Shannon and Benfenati, Emilio and Bhattacharya, Sudin and
+      Bastos, Joyce V. and Boyd, Stephen and Brown, J.B. and Capuzzi, Stephen J. and
+      Chushak, Yaroslav and Ciallella, Heather and Clark, Alex M. and
+      Consonni, Viviana and Daga, Pankaj R. and Ekins, Sean and Farag, Sherif and
+      Fedorov, Maxim and Fourches, Denis and Gadaleta, Domenico and Gao, Feng and
+      Gearhart, Jeffery M. and Goh, Garett and Goodman, Jonathan M. and
+      Grisoni, Francesca and Grulke, Christopher M. and Hartung, Thomas and
+      Hirn, Matthew and Karpov, Pavel and Korotcov, Alexandru and
+      Lavado, Giovanna J. and Lawless, Michael and Li, Xinhao and
+      Luechtefeld, Thomas and Lunghini, Filippo and Mangiatordi, Giuseppe F. and
+      Marcou, Gilles and Marsh, Dan and Martin, Todd and Mauri, Andrea and
+      Muratov, Eugene N. and Myatt, Glenn J. and Nguyen, Dac-Trung and
+      Nicolotti, Orazio and Note, Reine and Pande, Paritosh and
+      Parks, Amanda K. and Peryea, Tyler and Polash, Ahsan H. and
+      Rallo, Robert and Roncaglioni, Alessandra and Rowlands, Craig and
+      Ruiz, Patricia and Russo, Daniel P. and Sayed, Ahmed and Sayre, Risa and
+      Sheils, Timothy and Siegel, Charles and Silva, Arthur C. and Simeonov, Anton and
+      Sosnin, Sergey and Southall, Noel and Strickland, Judy and Tang, Yun and
+      Teppen, Brian and Tetko, Igor V. and Thomas, Dennis and Tkachenko, Valery and
+      Todeschini, Roberto and Toma, Cosimo and Tripodi, Ignacio and
+      Trisciuzzi, Daniela and Tropsha, Alexander and Varnek, Alexandre and
+      Vukovic, Kristijan and Wang, Zhongyu and Wang, Liguo and
+      Waters, Katrina M. and Wedlake, Andrew J. and Wijeyesakere, Sanjeeva J. and
+      Wilson, Dan and Xiao, Zijun and Yang, Hongbin and Zahoranszky-Kohalmi, Gergely and
+      Zakharov, Alexey V. and Zhang, Fagen F. and Zhang, Zhen and Zhao, Tongan and
+      Zhu, Hao and Zorn, Kimberley M. and Casey, Warren and Kleinstreuer, Nicole C.},
+      year={2021}, month=apr }
+templates:
+    - The {#molecule|chemical|compound!} with the {SMILES__description} {#representation of |!}{SMILES#} {#shows|exhibits|displays!} an {CATMoS_LD50_mgkg__names__noun} of {CATMoS_LD50_mgkg#} {CATMoS_LD50_mgkg__units}.
+    - The {#molecule|chemical|compound!} with the {SMILES__description} {#representation of |!}{SMILES#} {#shows|exhibits|displays!} a {log10_LD50__names__noun} of {log10_LD50#} {log10_LD50__units}.
+    - |
+      Task: Determine the acute oral toxicity and molecular properties of a {#molecule|chemical|compound!} given the {SMILES__description}.
+      Input: {SMILES#}
+      Desired Output: {CATMoS_LD50_mgkg__names__noun}, {log10_LD50__names__noun}, {num_ghose_violations__names__noun}, {num_lead_likeness_violations__names__noun}, {num_lipinski_violations__names__noun}, {molecular_mass__names__noun}, {num_carbon_atoms__names__noun}, {num_oxygen_atoms__names__noun}
+      Output: {CATMoS_LD50_mgkg#} {CATMoS_LD50_mgkg__units}, {log10_LD50#} {log10_LD50__units}, {num_ghose_violations#}, {num_lead_likeness_violations#}, {num_lipinski_violations#}, {molecular_mass#} {molecular_mass__units}, {num_carbon_atoms#}, {num_oxygen_atoms#}
+    - |
+      Context: You are {#an assistant|a researcher|a scientist!} in a pharmaceutical company. Your {#boss|superior|department head!} has asked you to {#design|create|synthesize!} a new drug.
+      User: The {#drug|compound|chemical!} should have a {CATMoS_LD50_mgkg__names__noun} of {CATMoS_LD50_mgkg#} {CATMoS_LD50_mgkg__units}, {num_ghose_violations#} {num_ghose_violations__names__noun}, {num_lead_likeness_violations#} {num_lead_likeness_violations__names__noun}, {num_lipinski_violations#} {num_lipinski_violations__names__noun}, a {molecular_mass__names__noun} of {molecular_mass#} {molecular_mass__units}, {num_carbon_atoms#} {num_carbon_atoms__names__noun}, and {num_oxygen_atoms#} {num_oxygen_atoms__names__noun}.
+      Assistant: {#Happy to help!|Sure!|Of course!} The {#molecule|chemical|compound!} with the {SMILES__description} {#representation of |!}{SMILES#} {#shows|exhibits|displays!} the desired properties.
+    - |
+      User: I need a {#drug|compound|chemical!} with a {log10_LD50__names__noun} of {log10_LD50#} {log10_LD50__units}.
+      Assistant: {#Happy to help!|Sure!|Of course!} Can you provide me with more {#constraints|details|information!}?
+      User: The {#drug|compound|chemical!} should have {num_ghose_violations#} {num_ghose_violations__names__noun}, {num_lead_likeness_violations#} {num_lead_likeness_violations__names__noun}, {num_lipinski_violations#} {num_lipinski_violations__names__noun}, {num_carbon_atoms#} {num_carbon_atoms__names__noun}, and {num_oxygen_atoms#} {num_oxygen_atoms__names__noun}.
+      Assistant: The {#molecule|chemical|compound!} with the {SMILES__description} {#representation of |!}{SMILES#} {#shows|exhibits|displays!} the desired properties.
+    - |
+      User: I need a {#drug|compound|chemical!} with a {CATMoS_LD50_mgkg__names__noun} of {CATMoS_LD50_mgkg#} {CATMoS_LD50_mgkg__units}.
+      Assistant: {#Happy to help!|Sure!|Of course!} Can you provide me with more {#constraints|details|information!}?
+      User: The {#drug|compound|chemical!} should have {num_carbon_atoms#} {num_carbon_atoms__names__noun}, {num_oxygen_atoms#} {num_oxygen_atoms__names__noun}, and a {molecular_mass__names__noun} of {molecular_mass#} {molecular_mass__units}. Could you please only provide me with the {SMILES__description} and return no other information?
+      Assistant: {SMILES#}
+    - |
+      User: I am looking for a {#drug|compound|chemical!} with a {log10_LD50__names__noun} of {log10_LD50#} {log10_LD50__units}.
+      Assistant: {#That's interesting!|Interesting!|I see!} Can you provide me with more {#constraints|details|information!}?
+      User: The {#drug|compound|chemical!} should have {num_ghose_violations#} {num_ghose_violations__names__noun}, {num_lead_likeness_violations#} {num_lead_likeness_violations__names__noun}, {num_lipinski_violations#} {num_lipinski_violations__names__noun}, {num_carbon_atoms#} {num_carbon_atoms__names__noun}, and {num_oxygen_atoms#} {num_oxygen_atoms__names__noun}. Please return only the {SMILES__description} wrapped as follows [ANSWER]<SMILES>[/ANSWER].
+      Assistant: [ANSWER]{SMILES#}[/ANSWER]
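
Reviewer note on the templates in meta.yaml: the placeholder syntax is easier to check with a concrete rendering. The sketch below is not the project's sampler. It assumes, from the template text alone, that `{#a|b|c!}` means "pick one option, possibly an empty one" and that `{column#}` is replaced by the row's value for that column. The names `render`, `CHOICE` and `FIELD` are hypothetical, and the row values are illustrative rather than taken from the dataset.

```python
import random
import re

# Hypothetical helper for illustration; NOT the project's template sampler.
# Assumed syntax: {#a|b|c!} is a random choice, {column#} is a value lookup.
CHOICE = re.compile(r"\{#([^{}]*?)!\}")
FIELD = re.compile(r"\{([A-Za-z0-9_]+)#\}")


def render(template: str, row: dict, rng: random.Random) -> str:
    """Fill one template string for one data row."""
    # Resolve every choice group by picking one of its "|"-separated options.
    text = CHOICE.sub(lambda m: rng.choice(m.group(1).split("|")), template)
    # Replace every value placeholder with the matching entry of the row.
    return FIELD.sub(lambda m: str(row[m.group(1)]), text)


template = (
    "The {#molecule|chemical|compound!} with the SMILES "
    "{#representation of |!}{SMILES#} {#shows|exhibits|displays!} "
    "a log10 LD50 of {log10_LD50#} log10(mg/kg)."
)
row = {"SMILES": "CCO", "log10_LD50": 3.85}  # illustrative values only
print(render(template, row, random.Random(0)))
```

Rendering a few rows this way makes article and word-order problems in a template visible before the dataset is generated.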

data/tabular/ld50_catmos/transform.py

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
+import pandas as pd
+from huggingface_hub import hf_hub_download
+
+
+def process():
+    file = hf_hub_download(  # fetch the cleaned CSV from the Hugging Face Hub
+        repo_id="kjappelbaum/chemnlp-ld50catmos",
+        filename="cleaned_ld50.csv",
+        repo_type="dataset",
+    )
+    df = pd.read_csv(file)
+    print(len(df))  # row count as a quick sanity check
+    df[
+        [
+            "num_ghose_violations",
+            "num_lead_likeness_violations",
+            "num_lipinski_violations",
+            "num_carbon_atoms",
+            "num_oxygen_atoms",
+        ]
+    ] = df[
+        [
+            "num_ghose_violations",
+            "num_lead_likeness_violations",
+            "num_lipinski_violations",
+            "num_carbon_atoms",
+            "num_oxygen_atoms",
+        ]
+    ].astype(
+        int  # violation and atom counts are whole numbers
+    )
+    df.to_csv("data_clean.csv", index=False)
+
+
+if __name__ == "__main__":
+    process()
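
Reviewer note on transform.py: the description in meta.yaml says that values for repeated SMILES were averaged, but that step is not part of this script, so it presumably happened before `cleaned_ld50.csv` was uploaded. The following is a minimal standard-library sketch of such an aggregation, under two assumptions that the commit does not confirm: the mean is taken on the mg/kg scale, and `log10_LD50` is the base-10 logarithm of that mean. The records are made up for illustration.

```python
import math
from collections import defaultdict
from statistics import mean

# Made-up (SMILES, LD50 in mg/kg) records; not taken from the CATMoS data.
records = [
    ("CCO", 7000.0),
    ("CCO", 9000.0),
    ("c1ccccc1", 1000.0),
]

# Collect all LD50 values reported for the same SMILES string.
grouped = defaultdict(list)
for smiles, ld50 in records:
    grouped[smiles].append(ld50)

# One entry per SMILES: the mean LD50 and its base-10 logarithm.
# Assumption: averaging is done in mg/kg, then the log is taken.
aggregated = {}
for smiles, values in grouped.items():
    ld50_mean = mean(values)
    aggregated[smiles] = {
        "CATMoS_LD50_mgkg": ld50_mean,
        "log10_LD50": round(math.log10(ld50_mean), 2),
    }

print(aggregated)
```

Averaging the log10 values instead would give a geometric rather than an arithmetic mean, so the choice is worth stating in the dataset description.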
