Add AcTOR query and img function #247
base: master
#' @import httr xml2
#'
#' @param query character; search term.
#' @param from character; type of input. Only "cas".
As far as I know, CAS numbers are the only identifiers that can be used in the query.
chemical_name <- trimws(xml_text(xml_nodes(site, ".chemicalNameFont")))
cas_dsstox <- xml_nodes(site, "#dsstoxSubstanceIdContainerId")
cas <- trimws(xml_text(xml_node(cas_dsstox, "#casrnId")))
dsstox <- trimws(xml_text(xml_child(cas_dsstox[[1]], 3))) # error prone
inchi <- trimws(xml_text(xml_node(site, "#inchiContainerId")))
inchi <- trimws(sub("InChi: InChI=", "", inchi, fixed = TRUE))
inchikey <- trimws(xml_text(xml_node(site, "#inchiKeyContainerId")))
inchikey <- trimws(sub("InChi Key:", "", inchikey))
formula <- trimws(xml_text(xml_node(site, "#molFormulaContainerId")))
formula <- trimws(sub("Molecular Formula:", "", formula))
molecularweight <- trimws(xml_text(xml_node(site, "#molWeightContainerId")))
molecularweight <- trimws(sub("Molecular Weight:", "", molecularweight))
This function only retrieves the basic parameters named here. AcTOR, however, contains a lot more data, though in very unstructured formats that are, in my opinion, not easy to parse (links to documents, various data formats, etc.).
Still, I think it's quite useful to retrieve clean common names and the DSSTOX-ID, which is used as an identifier in the Comptox application.
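The scraping code above repeats one pattern for most fields: take the text of a node, remove a label such as "Molecular Weight:", and trim whitespace. A minimal sketch of how that step could be factored out is shown below. The helper name `strip_label()` and the example strings are hypothetical and not part of the PR; the actual text on the AcTOR page may differ.

```r
# Hypothetical helper (not part of the PR): remove the first occurrence of a
# label from scraped text and trim surrounding whitespace.
strip_label <- function(x, label) {
  # fixed = TRUE treats the label as a literal string, not a regular expression
  trimws(sub(label, "", x, fixed = TRUE))
}

# Made-up input strings for illustration only.
formula <- strip_label("  Molecular Formula: C14H9Cl5 ", "Molecular Formula:")
molecularweight <- as.numeric(
  strip_label("Molecular Weight: 354.49", "Molecular Weight:")
)
```

Converting the molecular weight to numeric is an assumption about what callers would want; the code in the PR keeps it as a character string.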
#'
#' }
#'
actor <- function(query,
It's a bit like a get_*() function, but it retrieves no actual AcTOR-ID (there is none). Hence I stuck to the get_*() style, apart from the function name.
Isn't this more like a function we would call after get_*()? There are many common endings, e.g. *_query(), *_compinfo(), *_prop(), *_convert(). Apart from molecular weight the rest are IDs, so I assume this function would mostly be used for ID conversions?
Yes, I agree. It's probably better to name it get_actor(). OK? I wouldn't name it get_actorid(), since there is no AcTOR ID.
I don't think the get_*() style should be used, because there is no ACToR-specific ID, as you said. I was thinking more of something like actor_query() or actor_compinfo().
OK, I misunderstood you. I'm fine with actor_query() too.
#' actor_img(comp)
#' }
#'
actor_img = function(query,
I just added this functionality because it was possible. Let me know whether you think we need such image functions.
I have opened Issue #249 for this discussion.
Uh-oh, I searched a little more and found a robots.txt here:
Thanks @andschar! While looking for more information on ACToR I found this link: https://actor.epa.gov/actor/download.xhtml and the "Details" document points to a web service! The schema seems quite complex, I haven't looked into it in detail, but maybe it would help us access more structured data?
The robots.txt here also disallows all user agents from scraping: https://actor.epa.gov/robots.txt Since they've gone to the trouble of making the database available under their "download" tab, they probably don't mind web scraping, but I think it would be better to ask.
I have sent them an email. Let's wait for the reply.
Do you think this affects the ACToR web service as well? https://actorws.epa.gov/actorws/
Wow, I completely missed the AcTOR web service! Where did you find it? It doesn't seem to be well documented. I don't think the robots.txt applies to a web service.
It was difficult to find, I admit. I found it through this link: https://actor.epa.gov/actor/download.xhtml
Now that's really confusing. I have been on this site several times and always thought that one could only download the SQL dump there: actor_2015q3.sql.gz. I never looked into "Details". More eyes definitely see more :) I wrote them a second email and asked about the current state of the web service.
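For reference, the robots.txt question raised in this thread can be checked programmatically before scraping. The sketch below is a simplified illustration in base R, not a full implementation of the robots exclusion rules: it only looks at `Disallow` rules in groups that apply to all user agents (`User-agent: *`) and uses plain prefix matching. The function name `is_disallowed()` and the example input are hypothetical; the example lines are made up and are not a copy of the actual AcTOR robots.txt.

```r
# Hypothetical helper: given the lines of a robots.txt file (e.g. as returned
# by readLines()), check whether `path` is disallowed for all user agents.
is_disallowed <- function(robots_txt, path) {
  lines <- trimws(sub("#.*$", "", robots_txt))  # drop comments
  lines <- lines[nzchar(lines)]                 # drop empty lines
  applies <- FALSE         # does the current group apply to "*"?
  prev_was_agent <- FALSE  # consecutive User-agent lines share one group
  rules <- character(0)
  for (ln in lines) {
    field <- tolower(trimws(sub(":.*$", "", ln)))
    value <- trimws(sub("^[^:]*:", "", ln))
    if (field == "user-agent") {
      applies <- (prev_was_agent && applies) || identical(value, "*")
      prev_was_agent <- TRUE
    } else {
      # An empty Disallow value means "nothing is disallowed" and is skipped.
      if (field == "disallow" && applies && nzchar(value)) {
        rules <- c(rules, value)
      }
      prev_was_agent <- FALSE
    }
  }
  any(startsWith(path, rules))
}

# Made-up example: a file that blocks everything for every user agent.
example_robots <- c("User-agent: *", "Disallow: /")
blocked <- is_disallowed(example_robots, "/some/page")
```

A check like this only reflects what the file says; whether scraping is acceptable to the data provider is still a question for the maintainers of the site, as discussed above.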
Pull Request
This is the first part of the PR to include the AcTOR data source in webchem (Issue #209). It's not yet finished (documentation etc. is missing) and is here for discussion.
I haven't found any restrictions, and the EPA generally has rather open policies about its data, but this is still web scraping and not an official API. It is probably best to ask them.
Once we have decided on this source, I will update the PR.
PR task list:
devtools::document()